cross_validation

Features

feature_selection(x_train, y_train, features, y_name, inner_splits, select_k=False, repeats=5)

Selects features by best subset selection, scoring each candidate subset by its average cross-validation accuracy.

Parameters:

x_train (array) – training set from an outer loop of the nested cross-validation function
y_train (array) – outcome values for the training set from an outer loop of the nested cross-validation function
features (list) – total set of features from which features are being selected
y_name (str) – column name for the outcome variable
inner_splits (int) – cross validation splits for feature selection
select_k (bool) – whether to also tune k on the training set; default = False, in which case k=5
repeats (int) – number of repeats for cross validation for feature selection

Returns:

References

Scikit-Learn Nested Cross-Validation Example
Cawley, Gavin C., and Nicola LC Talbot. “On over-fitting in model selection and subsequent selection bias in performance evaluation.” The Journal of Machine Learning Research 11 (2010): 2079-2107.
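
Example

The selection step described above can be sketched as follows. This is a minimal illustration, not the package's implementation: the helper name `best_subset`, the logistic regression classifier, the synthetic data, and the use of a pandas DataFrame for `x_train` are all assumptions made for the example. It scores every non-empty subset of `features` by mean repeated cross-validation accuracy and keeps the best one.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score


def best_subset(x_train, y_train, features, inner_splits=2, repeats=5, seed=0):
    """Hypothetical helper: exhaustive best subset selection by mean CV accuracy."""
    # Fixed random_state so every subset is scored on the same splits.
    cv = RepeatedStratifiedKFold(
        n_splits=inner_splits, n_repeats=repeats, random_state=seed
    )
    best_score, best_features = -np.inf, None
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            # Classifier choice is an assumption; the source does not name one.
            scores = cross_val_score(
                LogisticRegression(max_iter=1000),
                x_train[list(subset)],
                y_train,
                cv=cv,
                scoring="accuracy",
            )
            if scores.mean() > best_score:
                best_score, best_features = scores.mean(), list(subset)
    return best_features, best_score


# Synthetic stand-in for the training portion of one outer split.
X, y = make_classification(
    n_samples=120, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
features = ["f0", "f1", "f2", "f3"]
x_train = pd.DataFrame(X, columns=features)

selected, score = best_subset(x_train, y, features)
print(selected, round(score, 3))
```

Exhaustive search evaluates 2^n - 1 subsets for n features, so this approach is only practical for small feature sets.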

nested_cv(data_file, features, y_name, outname, repeats=1000, inner_repeats=10, outer_splits=3, inner_splits=2)

Runs nested cross-validation. The outer loop evaluates model performance, and the inner loop performs feature selection.

Parameters:

data_file (dataframe) – pandas dataframe containing input data
features (list) – total set of features from which features are being selected
y_name (str) – column name for the outcome variable
outname (str) – string to include in the name of the output CSV file
outer_splits (int) – number of cross validation splits for the outer loop
inner_splits (int) – number of cross validation splits for the inner loop
repeats (int) – number of repeats for the outer loop
inner_repeats (int) – number of repeats for the inner loop (feature selection)

Returns:

file (CSV) – CSV file containing the cross-validation accuracy and ROC AUC for each outer loop

References

Scikit-Learn Nested Cross-Validation Example
Cawley, Gavin C., and Nicola LC Talbot. “On over-fitting in model selection and subsequent selection bias in performance evaluation.” The Journal of Machine Learning Research 11 (2010): 2079-2107.
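
Example

The nested structure described above can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: the names `select_features` and `nested_cv_sketch`, the logistic regression classifier, the result column names, and the synthetic data are hypothetical. Feature selection sees only the training part of each outer split, and the held-out part is used once to compute accuracy and ROC AUC.

```python
from itertools import combinations

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score


def select_features(x, y, features, splits, repeats, seed):
    """Hypothetical inner loop: best subset by mean CV accuracy."""
    cv = RepeatedStratifiedKFold(n_splits=splits, n_repeats=repeats, random_state=seed)
    candidates = [
        list(c) for n in range(1, len(features) + 1) for c in combinations(features, n)
    ]

    def mean_accuracy(subset):
        model = LogisticRegression(max_iter=1000)  # classifier is an assumption
        return cross_val_score(model, x[subset], y, cv=cv, scoring="accuracy").mean()

    return max(candidates, key=mean_accuracy)


def nested_cv_sketch(data, features, y_name, repeats=2, inner_repeats=2,
                     outer_splits=3, inner_splits=2, seed=0):
    """Hypothetical outer loop: one result row per outer split."""
    outer = RepeatedStratifiedKFold(
        n_splits=outer_splits, n_repeats=repeats, random_state=seed
    )
    rows = []
    for train_idx, test_idx in outer.split(data[features], data[y_name]):
        train, test = data.iloc[train_idx], data.iloc[test_idx]
        # Inner loop: select features using the outer training data only.
        chosen = select_features(
            train[features], train[y_name], features, inner_splits, inner_repeats, seed
        )
        model = LogisticRegression(max_iter=1000).fit(train[chosen], train[y_name])
        # Outer loop: evaluate once on data never seen during selection.
        rows.append({
            "features": ",".join(chosen),
            "accuracy": accuracy_score(test[y_name], model.predict(test[chosen])),
            "roc_auc": roc_auc_score(
                test[y_name], model.predict_proba(test[chosen])[:, 1]
            ),
        })
    return pd.DataFrame(rows)


# Synthetic stand-in for the input dataframe.
X, y = make_classification(
    n_samples=90, n_features=3, n_informative=2, n_redundant=0, random_state=0
)
features = ["f0", "f1", "f2"]
data = pd.DataFrame(X, columns=features)
data["outcome"] = y

results = nested_cv_sketch(data, features, "outcome")
print(results)
```

The sketch returns a DataFrame rather than writing a file; calling `results.to_csv(...)` with a path of your choice would produce a CSV comparable in spirit to the one described under Returns.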