
cross_validation

Features

feature_selection()

feature_selection(x_train, y_train, features, y_name, inner_splits, select_k=False, repeats=5)

Selects features by best subset selection, scoring each candidate subset by its average cross-validation accuracy.

Parameters:
  • x_train (array) – training set from an outer loop of the nested cross-validation function

  • y_train (array) – outcome values for the training set from an outer loop of the nested cross-validation function

  • features (list) – total set of features from which features are being selected

  • y_name (str) – column name for the outcome variable

  • inner_splits (int) – cross validation splits for feature selection

  • select_k (bool) – whether to also tune k on the training set; default is False, in which case k = 5

  • repeats (int) – number of times cross validation is repeated during feature selection; default is 5

Returns:

  • best_features (list) – list of the selected features

  • best_score (double) – average accuracy for the best feature subset

  • best_roc_auc (double) – average ROC AUC for the best feature subset

References

  • Scikit-Learn Nested Cross-Validation Example

  • Cawley, Gavin C., and Nicola LC Talbot. “On over-fitting in model selection and subsequent selection bias in performance evaluation.” The Journal of Machine Learning Research 11 (2010): 2079-2107.
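
The selection procedure described for feature_selection() can be illustrated with a minimal, standard-library sketch. It is not the library's implementation. The k-nearest-neighbours classifier is an assumption, made only because the signature mentions k, and the helper names knn_predict, cv_accuracy and best_subset are hypothetical. The sketch also omits the ROC AUC that feature_selection() returns.

```python
import itertools
import random
from collections import Counter


def knn_predict(train_x, train_y, row, k=5):
    """Majority vote among the k nearest training rows (squared Euclidean distance)."""
    nearest = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], row)),
    )[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def cv_accuracy(x, y, cols, splits, repeats, k=5, seed=0):
    """Mean accuracy over repeated k-fold cross validation, using only columns `cols`."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        idx = list(range(len(x)))
        rng.shuffle(idx)
        for f in range(splits):
            fold = idx[f::splits]          # rows held out in this fold
            held_out = set(fold)
            train = [i for i in idx if i not in held_out]
            tx = [[x[i][c] for c in cols] for i in train]
            ty = [y[i] for i in train]
            hits = sum(
                knn_predict(tx, ty, [x[i][c] for c in cols], k) == y[i]
                for i in fold
            )
            scores.append(hits / len(fold))
    return sum(scores) / len(scores)


def best_subset(x, y, n_features, splits=2, repeats=5):
    """Score every non-empty feature subset and keep the first best one."""
    best_cols, best_score = None, -1.0
    for size in range(1, n_features + 1):
        for cols in itertools.combinations(range(n_features), size):
            score = cv_accuracy(x, y, cols, splits, repeats)
            if score > best_score:
                best_cols, best_score = cols, score
    return list(best_cols), best_score


# Toy data (made up for illustration): column 0 separates the two classes,
# columns 1 and 2 are pure noise.
data_rng = random.Random(42)
y = [i % 2 for i in range(40)]
x = [[label * 10 + data_rng.random(), data_rng.random(), data_rng.random()]
     for label in y]

best_cols, best_score = best_subset(x, y, n_features=3)
print(best_cols, best_score)
```

Because every non-empty subset is scored, the cost grows as 2^n in the number of candidate features, so an exhaustive search like this one is only practical for a small feature list.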

nested_cv()

nested_cv(data_file, features, y_name, outname, repeats=1000, inner_repeats=10, outer_splits=3, inner_splits=2)

Runs nested cross validation. The outer loop evaluates model performance, and the inner loop performs feature selection.

Parameters:
  • data_file (dataframe) – pandas dataframe containing input data

  • features (list) – total set of features from which features are being selected

  • y_name (str) – column name for the outcome variable

  • outname (str) – name to be included in the output csv file’s name

  • outer_splits (int) – number of cross validation splits for the outer loop

  • inner_splits (int) – number of cross validation splits for the inner loop

  • repeats (int) – number of repeats for the outer loop

  • inner_repeats (int) – number of repeats for the inner loop (feature selection)

Returns:

file (CSV) – CSV file containing the cross-validation accuracy and ROC AUC for each outer loop

References

  • Scikit-Learn Nested Cross-Validation Example

  • Cawley, Gavin C., and Nicola LC Talbot. “On over-fitting in model selection and subsequent selection bias in performance evaluation.” The Journal of Machine Learning Research 11 (2010): 2079-2107.
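
The division of labour between the two loops can be illustrated with a minimal, standard-library sketch. It is not the library's implementation. The nearest-class-mean classifier is a stand-in chosen for brevity, and all function names are hypothetical. The sketch returns the outer-loop accuracies as a list, whereas nested_cv() writes them to a CSV file and also reports ROC AUC.

```python
import itertools
import random


def fit_predict(train_x, train_y, test_x):
    """Nearest-class-mean classifier: an assumed stand-in model."""
    means = {}
    for label in set(train_y):
        rows = [r for r, t in zip(train_x, train_y) if t == label]
        means[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return [
        min(means, key=lambda m: sum((a - b) ** 2 for a, b in zip(means[m], row)))
        for row in test_x
    ]


def accuracy(x, y, train, test, cols):
    """Fit on `train` rows and score on `test` rows, using feature columns `cols`."""
    def take(rows):
        return [[x[i][c] for c in cols] for i in rows]

    preds = fit_predict(take(train), [y[i] for i in train], take(test))
    return sum(p == y[i] for p, i in zip(preds, test)) / len(test)


def k_folds(indices, splits, rng):
    """Shuffle `indices` and yield one (train, test) pair per fold."""
    shuffled = indices[:]
    rng.shuffle(shuffled)
    for f in range(splits):
        test = shuffled[f::splits]
        held_out = set(test)
        yield [i for i in shuffled if i not in held_out], test


def select_features(x, y, rows, n_features, inner_splits, inner_repeats, rng):
    """Inner loop: best subset by mean cross-validated accuracy on `rows` only."""
    def mean_score(cols):
        scores = [
            accuracy(x, y, tr, te, cols)
            for _ in range(inner_repeats)
            for tr, te in k_folds(rows, inner_splits, rng)
        ]
        return sum(scores) / len(scores)

    subsets = [
        cols
        for size in range(1, n_features + 1)
        for cols in itertools.combinations(range(n_features), size)
    ]
    return max(subsets, key=mean_score)


def nested_cv_sketch(x, y, n_features, repeats=3, inner_repeats=2,
                     outer_splits=3, inner_splits=2, seed=0):
    """Outer loop: score the whole select-then-fit procedure on held-out rows."""
    rng = random.Random(seed)
    outer_scores = []
    for _ in range(repeats):
        for train, test in k_folds(list(range(len(x))), outer_splits, rng):
            # Features are chosen from the outer training rows only, so the
            # outer test rows never influence the selection.
            cols = select_features(x, y, train, n_features,
                                   inner_splits, inner_repeats, rng)
            outer_scores.append(accuracy(x, y, train, test, cols))
    return outer_scores


# Toy data (made up for illustration): column 0 is informative, column 1 is noise.
data_rng = random.Random(1)
y = [i % 2 for i in range(30)]
x = [[label * 10 + data_rng.random(), data_rng.random()] for label in y]

scores = nested_cv_sketch(x, y, n_features=2)
print(len(scores), sum(scores) / len(scores))
```

Keeping feature selection inside the outer training rows is the point of the nesting: selecting features on the full data set and then cross-validating would give the optimistically biased estimate that Cawley and Talbot describe.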