mlshell.model_selection

The mlshell.model_selection module contains hyper-parameter tuning utilities.

Functions

cross_val_predict(*args, **kwargs)

Extended sklearn.model_selection.cross_val_predict().

Classes

MockClassifier()

Estimator always predicts train feature.

MockOptimizer(pipeline, hp_grid, scoring[, …])

Threshold optimizer.

MockRegressor()

Estimator always predicts features.

Optimizer()

Unified optimizer interface.

PredictionTransformer(classifier)

Transformer applies predict_proba on features.

RandomizedSearchOptimizer(pipeline, hp_grid, …)

Wrapper around sklearn.model_selection.RandomizedSearchCV.

Resolver()

Resolve dataset-related pipeline hyper-parameter.

ThresholdClassifier(params[, threshold])

Estimator applies classification threshold.

Validator()

Validate fitted pipeline.

class mlshell.model_selection.PredictionTransformer(classifier)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin, sklearn.base.MetaEstimatorMixin

Transformer applies predict_proba on features.

Parameters

classifier (classifier object) – Classifier that supports predict_proba.
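As a rough illustration of the idea, the sketch below wraps a classifier so that transform returns its predict_proba output. It is not the mlshell implementation: ProbaTransformerSketch and ConstantProbaClassifier are hypothetical stand-ins written in plain Python, and the data is made up.

```python
class ConstantProbaClassifier:
    """Hypothetical stand-in classifier exposing fit / predict_proba."""

    def fit(self, X, y):
        # Estimate the positive-class rate from binary labels (0/1).
        self.p_pos_ = sum(y) / len(y)
        return self

    def predict_proba(self, X):
        # Same [P(neg), P(pos)] row for every sample.
        return [[1.0 - self.p_pos_, self.p_pos_] for _ in X]


class ProbaTransformerSketch:
    """Hypothetical transformer: transform(X) returns classifier.predict_proba(X)."""

    def __init__(self, classifier):
        self.classifier = classifier

    def fit(self, X, y=None, **fit_params):
        self.classifier.fit(X, y, **fit_params)
        return self

    def transform(self, X):
        return self.classifier.predict_proba(X)


X = [[0.1], [0.4], [0.7], [0.9]]
y = [0, 0, 1, 0]
proba = ProbaTransformerSketch(ConstantProbaClassifier()).fit(X, y).transform(X)
```

The resulting probabilities can then serve as the features of a later step, such as a thresholding classifier.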

Methods

fit_transform(X[, y])

Fit to data, then transform it.

get_params([deep])

Get parameters for this estimator.

set_params(**params)

Set the parameters of this estimator.

fit

transform

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters
  • X ({array-like, sparse matrix, dataframe} of shape (n_samples, n_features)) –

  • y (ndarray of shape (n_samples,), default=None) – Target values.

  • **fit_params (dict) – Additional fit parameters.

Returns

X_new – Transformed array.

Return type

ndarray array of shape (n_samples, n_features_new)

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

mapping of string to any

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
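A minimal sketch of how a nested name of the form <component>__<parameter> can be decomposed, assuming each level is split at the first double underscore. The helpers split_nested and walk are hypothetical and only illustrate the naming scheme; they do not set anything on an estimator.

```python
def split_nested(name):
    """Split '<component>__<parameter>' at the first double underscore."""
    component, delim, sub_name = name.partition("__")
    return (component, sub_name) if delim else (None, component)


def walk(name):
    """Return the chain of components and the final parameter name."""
    path = []
    component, rest = split_nested(name)
    while component is not None:
        path.append(component)
        component, rest = split_nested(rest)
    return path, rest


path, parameter = walk("estimate__apply_threshold__threshold")
```

Here path lists the nested components from the outside in, and parameter is the attribute set on the innermost one.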

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

object

class mlshell.model_selection.ThresholdClassifier(params, threshold=None)

Bases: sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin

Estimator applies classification threshold.

Classifies samples based on whether they are above or below the threshold. Expects prediction probabilities as input features.

Parameters
  • params (dict) –

    Parameters combined in a dictionary so they can be set together: {

    ‘classes’: list of numpy.ndarray

    List of sorted unique labels for each target, of shape (n_outputs, n_classes).

    ‘pos_labels’: list

    List of “positive” labels, one per target, of shape (n_outputs,).

    ‘pos_labels_ind’: list

    List of indices of the “positive” labels in np.unique(target), one per target, of shape (n_outputs,).

    }

  • threshold (float in [0, 1] or list of float in [0, 1], optional (default=None)) – Classification threshold. For a multi-output target, a list of length n_outputs. If None, numpy.argmax() is used (equivalent to a threshold of 0.5 in the binary case). If the positive class probability P(pos_label) = 1 - P(neg_labels) > th_ for a sample, the classifier predicts pos_label; otherwise it predicts the label in neg_labels with the maximum probability.
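The decision rule described for threshold can be sketched for a single sample and a single output as follows. The function apply_threshold, its argument names and the example probabilities are hypothetical; the actual class operates on arrays and supports multi-output targets.

```python
def apply_threshold(proba_row, classes, pos_label_ind, threshold=None):
    """Hypothetical single-output decision rule for one sample."""
    if threshold is None:
        # No threshold: pick the class with the highest probability.
        best = max(range(len(proba_row)), key=proba_row.__getitem__)
        return classes[best]
    if proba_row[pos_label_ind] > threshold:
        return classes[pos_label_ind]
    # Otherwise choose the most probable label among the negative ones.
    negatives = [i for i in range(len(proba_row)) if i != pos_label_ind]
    return classes[max(negatives, key=proba_row.__getitem__)]


classes = ["a", "b", "c"]   # "b" (index 1) plays the positive label
row = [0.5, 0.3, 0.2]       # P(a), P(b), P(c)

by_argmax = apply_threshold(row, classes, pos_label_ind=1)
low_th = apply_threshold(row, classes, pos_label_ind=1, threshold=0.25)
high_th = apply_threshold([0.1, 0.3, 0.6], classes, pos_label_ind=1, threshold=0.4)
```

Lowering the threshold makes the positive label win even when it is not the most probable class, which is the usual reason to tune it.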

Notes

Will be replaced with:

https://github.com/scikit-learn/scikit-learn/pull/16525.

Attributes
classes_

Methods

get_params([deep])

Get parameters for this estimator.

score(X, y[, sample_weight])

Return the mean accuracy on the given test data and labels.

set_params(**params)

Set the parameters of this estimator.

fit

predict

predict_proba

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

mapping of string to any

score(X, y, sample_weight=None)

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Test samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns

score – Mean accuracy of self.predict(X) wrt. y.

Return type

float

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

object

class mlshell.model_selection.MockClassifier

Bases: sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin

Estimator always predicts train feature.

Methods

get_params([deep])

Get parameters for this estimator.

score(X, y[, sample_weight])

Return the mean accuracy on the given test data and labels.

set_params(**params)

Set the parameters of this estimator.

fit

predict

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

mapping of string to any

score(X, y, sample_weight=None)

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Test samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns

score – Mean accuracy of self.predict(X) wrt. y.

Return type

float

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

object

class mlshell.model_selection.MockRegressor

Bases: sklearn.base.BaseEstimator, sklearn.base.RegressorMixin

Estimator always predicts features.

Methods

get_params([deep])

Get parameters for this estimator.

score(X, y[, sample_weight])

Return the coefficient of determination R^2 of the prediction.

set_params(**params)

Set the parameters of this estimator.

fit

predict

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

mapping of string to any

score(X, y, sample_weight=None)

Return the coefficient of determination R^2 of the prediction.

The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
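The definition above can be checked with a small worked example. The helper r2_score_sketch is a plain-Python illustration of the formula 1 - u/v, not the scikit-learn implementation, and it does not handle the case v = 0.

```python
def r2_score_sketch(y_true, y_pred):
    """Compute 1 - u/v for single-output targets (assumes v != 0)."""
    mean = sum(y_true) / len(y_true)
    u = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    v = sum((t - mean) ** 2 for t in y_true)               # total sum of squares
    return 1 - u / v


y_true = [1.0, 2.0, 3.0, 4.0]                               # mean 2.5, v = 5.0
perfect = r2_score_sketch(y_true, y_true)                   # u = 0
constant = r2_score_sketch(y_true, [2.5] * 4)               # u = v
reversed_ = r2_score_sketch(y_true, [4.0, 3.0, 2.0, 1.0])   # u = 20
```

The three cases give 1.0, 0.0 and a negative score, matching the statements about the best score, the constant model and arbitrarily worse models.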

Parameters
  • X (array-like of shape (n_samples, n_features)) – Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead, shape = (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True values for X.

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns

score – R^2 of self.predict(X) wrt. y.

Return type

float

Notes

The R2 score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with default value of r2_score(). This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

object

class mlshell.model_selection.Optimizer

Bases: object

Unified optimizer interface.

Implements an interface to access an arbitrary optimizer. Interface: dump_runs, update_best and all methods of the underlying optimizer.

optimizer

Underlying optimizer.

Type

sklearn.model_selection.BaseSearchCV

Notes

Calls to unspecified methods are redirected to the underlying optimizer object.
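One common way to get this kind of redirection in Python is __getattr__, which is only consulted when normal attribute lookup fails. The sketch below assumes that mechanism; OptimizerSketch and FakeSearch are hypothetical and say nothing definitive about how mlshell implements it.

```python
class OptimizerSketch:
    """Hypothetical wrapper that forwards unknown attributes."""

    def __init__(self, optimizer):
        self.optimizer = optimizer

    def update_best(self, prev):
        # Placeholder for a method defined on the wrapper itself.
        return dict(prev)

    def __getattr__(self, name):
        # Guard against infinite recursion if 'optimizer' is not set yet.
        if name == "optimizer":
            raise AttributeError(name)
        return getattr(self.optimizer, name)


class FakeSearch:
    """Hypothetical stand-in for an underlying search object."""

    best_score_ = 0.9

    def fit(self, X, y):
        return "fitted"


wrapped = OptimizerSketch(FakeSearch())
fit_result = wrapped.fit([], [])      # forwarded to FakeSearch.fit
score = wrapped.best_score_           # forwarded attribute access
own = wrapped.update_best({"a": 1})   # resolved on the wrapper itself
```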

Methods

dump_runs(logger, dirpath, pipeline, …)

Dump results.

update_best(prev)

Combine results from multi-stage optimization.

update_best(prev)

Combine results from multi-stage optimization.

The logic for choosing the best run is set here. Currently, the best hp combination and the corresponding estimator are taken from the last stage. If any hp is brute-forced in more than one stage, a more complicated rule is required to merge runs.

Parameters

prev (dict) – Previous stage update_best output for some pipeline-data pair. Initially set to {}. See update_best output format.

Returns

nxt – Result of merging runs on all optimization stages for some pipeline-data pair: {

‘params’: list of dict

List of cv_results_['params'] for all runs in stages.

‘best_params_’: dict

Best estimator tuned params from all optimization stages.

‘best_estimator_’: sklearn estimator

Best estimator: optimizer.best_estimator_ if it exists, else optimizer.estimator.set_params(**best_params_) (if ‘refit’ is not True).

‘best_score_’: tuple

Best score ('scorer_id', optimizer.best_score_), where scorer_id=str(optimizer.refit). If best_score_ is absent, ('', float('-inf')) is used.

}

Return type

dict

Notes

mlshell.Workflow utilizes:

  • ‘best_estimator_’ key to update pipeline in objects.

  • ‘params’ in built-in plotter.

  • ‘best_score_’ in dump/dump_pred for file names.
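The documented output format can be illustrated with a small sketch of a two-stage merge. update_best_sketch is a hypothetical re-implementation based only on the description above (in particular, merging best_params_ across stages with a dict update is an assumption), and the stage objects are SimpleNamespace stand-ins rather than real optimizers.

```python
from types import SimpleNamespace


def update_best_sketch(optimizer, prev):
    """Hypothetical merge of one finished stage into the previous result."""
    best_params = getattr(optimizer, "best_params_", {})
    if hasattr(optimizer, "best_estimator_"):
        best_estimator = optimizer.best_estimator_
    else:
        # No refit: configure the estimator with the best params instead.
        best_estimator = optimizer.estimator.set_params(**best_params)
    if hasattr(optimizer, "best_score_"):
        best_score = (str(optimizer.refit), optimizer.best_score_)
    else:
        best_score = ("", float("-inf"))
    return {
        "params": [*prev.get("params", []), *optimizer.cv_results_["params"]],
        "best_params_": {**prev.get("best_params_", {}), **best_params},
        "best_estimator_": best_estimator,
        "best_score_": best_score,
    }


# Made-up results of two stages: a regular search, then a threshold search.
stage_1 = SimpleNamespace(
    cv_results_={"params": [{"alpha": 0.1}, {"alpha": 1.0}]},
    best_params_={"alpha": 0.1}, best_estimator_="est_stage_1",
    best_score_=0.80, refit="r2")
stage_2 = SimpleNamespace(
    cv_results_={"params": [{"threshold": 0.3}]},
    best_params_={"threshold": 0.3}, best_estimator_="est_stage_2",
    best_score_=0.85, refit="r2")

nxt = update_best_sketch(stage_1, {})
nxt = update_best_sketch(stage_2, nxt)
```

After the second call, the estimator and score come from the last stage, while the run list and tuned params accumulate over both stages.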

dump_runs(logger, dirpath, pipeline, dataset, **kwargs)

Dump results.

Parameters
  • logger (logging.Logger) – Logger.

  • dirpath (str) – Absolute path to dump dir.

  • pipeline (mlshell.Pipeline) – Pipeline used for optimizer.fit.

  • dataset (mlshell.Dataset) – Dataset used for optimizer.fit.

  • **kwargs (dict) – Additional kwargs to pass in low-level dump function.

Notes

The resulting file is named <timestamp>_runs.csv. Each row corresponds to a run. Column names:

  • ‘id’: random UUID for the run (hp combination).

  • All pipeline parameters.

  • Grid search output runs keys.

  • Pipeline info: ‘pipeline__id’, ‘pipeline__hash’, ‘pipeline__type’.

  • Dataset info: ‘dataset__id’, ‘dataset__hash’.

The hash could change when the interpreter is restarted, because the address of some underlying function changes.

class mlshell.model_selection.RandomizedSearchOptimizer(pipeline, hp_grid, scoring, **kwargs)

Bases: mlshell.blocks.model_selection.search.Optimizer

Wrapper around sklearn.model_selection.RandomizedSearchCV.

Methods

dump_runs(logger, dirpath, pipeline, …)

Dump results.

update_best(prev)

Combine results from multi-stage optimization.

dump_runs(logger, dirpath, pipeline, dataset, **kwargs)

Dump results.

Parameters
  • logger (logging.Logger) – Logger.

  • dirpath (str) – Absolute path to dump dir.

  • pipeline (mlshell.Pipeline) – Pipeline used for optimizer.fit.

  • dataset (mlshell.Dataset) – Dataset used for optimizer.fit.

  • **kwargs (dict) – Additional kwargs to pass in low-level dump function.

Notes

The resulting file is named <timestamp>_runs.csv. Each row corresponds to a run. Column names:

  • ‘id’: random UUID for the run (hp combination).

  • All pipeline parameters.

  • Grid search output runs keys.

  • Pipeline info: ‘pipeline__id’, ‘pipeline__hash’, ‘pipeline__type’.

  • Dataset info: ‘dataset__id’, ‘dataset__hash’.

The hash could change when the interpreter is restarted, because the address of some underlying function changes.

update_best(prev)

Combine results from multi-stage optimization.

The logic for choosing the best run is set here. Currently, the best hp combination and the corresponding estimator are taken from the last stage. If any hp is brute-forced in more than one stage, a more complicated rule is required to merge runs.

Parameters

prev (dict) – Previous stage update_best output for some pipeline-data pair. Initially set to {}. See update_best output format.

Returns

nxt – Result of merging runs on all optimization stages for some pipeline-data pair: {

‘params’: list of dict

List of cv_results_['params'] for all runs in stages.

‘best_params_’: dict

Best estimator tuned params from all optimization stages.

‘best_estimator_’: sklearn estimator

Best estimator: optimizer.best_estimator_ if it exists, else optimizer.estimator.set_params(**best_params_) (if ‘refit’ is not True).

‘best_score_’: tuple

Best score ('scorer_id', optimizer.best_score_), where scorer_id=str(optimizer.refit). If best_score_ is absent, ('', float('-inf')) is used.

}

Return type

dict

Notes

mlshell.Workflow utilizes:

  • ‘best_estimator_’ key to update pipeline in objects.

  • ‘params’ in built-in plotter.

  • ‘best_score_’ in dump/dump_pred for file names.

class mlshell.model_selection.MockOptimizer(pipeline, hp_grid, scoring, method='predict', **kwargs)

Bases: mlshell.blocks.model_selection.search.RandomizedSearchOptimizer

Threshold optimizer.

Provides an interface to efficiently brute-force prediction-related parameters in a separate optimization step, for example the classification threshold or score function kwargs. MockOptimizer avoids refitting the pipeline in such cases. Internally, mlshell.model_selection.cross_val_predict is called with the specified method, and the hps are optimized on the output prediction.

Parameters
  • pipeline (sklearn estimator) – See corresponding argument for sklearn.model_selection.RandomizedSearchCV.

  • hp_grid (dict) – Specify only hps that support mock optimization: they should not depend on prediction. If {}, mlshell.custom.MockClassifier or mlshell.custom.MockRegressor is used for compliance.

  • scoring (string, callable, list/tuple, dict, optional (default=None)) – See the corresponding argument in sklearn.model_selection.RandomizedSearchCV.

  • method (str {'predict_proba', 'predict'}, optional (default='predict')) – Set to predict_proba if the classifier supports it and any metric needs probabilities (needs_proba). See the corresponding argument for mlshell.model_selection.cross_val_predict.

  • **kwargs (dict) – Kwargs for sklearn.model_selection.RandomizedSearchCV. If kwargs[‘n_iter’]=None, it is replaced with the number of hp combinations in hp_grid.

Notes

To brute-force the threshold, set method to ‘predict_proba’. To brute-force scorer kwargs alone, method can be ‘predict’ or ‘predict_proba’, depending on whether the scoring needs probabilities.
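The reason no refit is needed can be seen in a small sketch: once out-of-fold probabilities are cached, every candidate threshold is scored against the same predictions. The function brute_force_threshold, the accuracy helper and the data are hypothetical and only illustrate the principle for a binary target.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def brute_force_threshold(y_true, proba_pos, thresholds, score=accuracy):
    """Score each threshold on cached probabilities; the model is never refit."""
    runs = []
    for th in thresholds:
        y_pred = [1 if p > th else 0 for p in proba_pos]
        runs.append((th, score(y_true, y_pred)))
    best_th, best_score = max(runs, key=lambda run: run[1])
    return best_th, best_score, runs


y_true = [0, 0, 1, 1, 1]
proba_pos = [0.10, 0.35, 0.40, 0.65, 0.80]   # cached P(positive) per sample
best_th, best_score, runs = brute_force_threshold(
    y_true, proba_pos, thresholds=[0.2, 0.375, 0.5, 0.7])
```

Each extra threshold costs one pass over the cached predictions, which is much cheaper than a cross-validated fit per candidate.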

Methods

dump_runs(logger, dirpath, pipeline, …)

Dump results.

update_best(prev)

Combine results from multi-stage optimization.

fit

dump_runs(logger, dirpath, pipeline, dataset, **kwargs)

Dump results.

Parameters
  • logger (logging.Logger) – Logger.

  • dirpath (str) – Absolute path to dump dir.

  • pipeline (mlshell.Pipeline) – Pipeline used for optimizer.fit.

  • dataset (mlshell.Dataset) – Dataset used for optimizer.fit.

  • **kwargs (dict) – Additional kwargs to pass in low-level dump function.

Notes

The resulting file is named <timestamp>_runs.csv. Each row corresponds to a run. Column names:

  • ‘id’: random UUID for the run (hp combination).

  • All pipeline parameters.

  • Grid search output runs keys.

  • Pipeline info: ‘pipeline__id’, ‘pipeline__hash’, ‘pipeline__type’.

  • Dataset info: ‘dataset__id’, ‘dataset__hash’.

The hash could change when the interpreter is restarted, because the address of some underlying function changes.

update_best(prev)

Combine results from multi-stage optimization.

The logic for choosing the best run is set here. Currently, the best hp combination and the corresponding estimator are taken from the last stage. If any hp is brute-forced in more than one stage, a more complicated rule is required to merge runs.

Parameters

prev (dict) – Previous stage update_best output for some pipeline-data pair. Initially set to {}. See update_best output format.

Returns

nxt – Result of merging runs on all optimization stages for some pipeline-data pair: {

‘params’: list of dict

List of cv_results_['params'] for all runs in stages.

‘best_params_’: dict

Best estimator tuned params from all optimization stages.

‘best_estimator_’: sklearn estimator

Best estimator: optimizer.best_estimator_ if it exists, else optimizer.estimator.set_params(**best_params_) (if ‘refit’ is not True).

‘best_score_’: tuple

Best score ('scorer_id', optimizer.best_score_), where scorer_id=str(optimizer.refit). If best_score_ is absent, ('', float('-inf')) is used.

}

Return type

dict

Notes

mlshell.Workflow utilizes:

  • ‘best_estimator_’ key to update pipeline in objects.

  • ‘params’ in built-in plotter.

  • ‘best_score_’ in dump/dump_pred for file names.

class mlshell.model_selection.Validator

Bases: object

Validate fitted pipeline.

Methods

validate(pipeline, metrics, datasets, logger)

Evaluate metrics on pipeline.

validate(pipeline, metrics, datasets, logger, method='metric', vector=False)

Evaluate metrics on pipeline.

Parameters
  • pipeline (mlshell.Pipeline) – Fitted model.

  • metrics (list of mlshell.Metric) – Metrics to evaluate.

  • datasets (list of mlshell.Dataset) – Datasets to evaluate on. For classification, dataset.meta should contain the pos_labels_ind key.

  • method ('metric', 'scorer' or 'vector') – If ‘metric’, efficient evaluation (reuse y_pred) via score_func(y, y_pred, **kwargs). If ‘scorer’, evaluate via scorer(pipeline, x, y). If ‘vector’, evaluate vectorized score via score_func_vector(y, y_pred, **kwargs).

  • vector (bool) – If True and method='metric', score_func_vector used instead of score_func to evaluate vectorized score (if available). Ignored for method='scorer'.

  • logger (logging.Logger) – Logger.

Returns

scores – Resulting scores {‘dataset_id’:{‘metric_id’: score}}.

Return type

dict
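The shape of the returned dictionary, and the reuse of y_pred across metrics in the ‘metric’ mode, can be sketched as below. validate_sketch is hypothetical and simplified: it takes plain dicts and callables instead of mlshell.Pipeline, mlshell.Metric and mlshell.Dataset objects, and all data is made up.

```python
def validate_sketch(predict, metrics, datasets):
    """Hypothetical evaluation loop returning {dataset_id: {metric_id: score}}."""
    scores = {}
    for dataset_id, (x, y) in datasets.items():
        y_pred = predict(x)  # computed once per dataset, reused by every metric
        scores[dataset_id] = {
            metric_id: score_func(y, y_pred)
            for metric_id, score_func in metrics.items()
        }
    return scores


def accuracy(y, y_pred):
    return sum(t == p for t, p in zip(y, y_pred)) / len(y)


def errors(y, y_pred):
    return sum(t != p for t, p in zip(y, y_pred))


predict = lambda xs: [int(v > 0.5) for v in xs]   # stand-in for a fitted model
datasets = {"train": ([0.2, 0.9, 0.7], [0, 1, 1]), "test": ([0.6, 0.1], [0, 0])}
scores = validate_sketch(predict, {"accuracy": accuracy, "errors": errors}, datasets)
```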

mlshell.model_selection.cross_val_predict(*args, **kwargs)

Extended sklearn.model_selection.cross_val_predict().

TimeSplitter support added (first fold prediction absent).

Returns

  • y_pred_oof (numpy.ndarray, list of numpy.ndarray) – If method=predict_proba: OOF probability predictions of shape [n_test_samples, n_classes] or [n_outputs, n_test_samples, n_classes] for multi-output. If method=predict: OOF predictions of shape [n_test_samples] or [n_test_samples, n_outputs].

  • index_oof (numpy.ndarray, list of numpy.ndarray) – Reset indices of the samples for which predictions are available, of shape [n_test_samples,].
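Why the first fold has no prediction with a time-based splitter can be seen in a plain-Python sketch. The expanding-window split below is an assumption modeled on the usual time-series scheme (train on everything before each test fold); time_series_splits and oof_predict are hypothetical helpers, and the "model" is just the mean of the training targets.

```python
def time_series_splits(n_samples, n_splits):
    """Expanding-window splits: train on all samples before each test fold."""
    fold = n_samples // (n_splits + 1)
    first_test = n_samples - n_splits * fold
    for start in range(first_test, n_samples, fold):
        yield list(range(start)), list(range(start, start + fold))


def oof_predict(y, n_splits):
    """Predict each test fold with the mean of its training targets."""
    y_pred_oof, index_oof = [], []
    for train, test in time_series_splits(len(y), n_splits):
        mean = sum(y[i] for i in train) / len(train)
        y_pred_oof.extend([mean] * len(test))
        index_oof.extend(test)
    return y_pred_oof, index_oof


y = [1.0, 3.0, 2.0, 4.0, 6.0, 8.0]
y_pred_oof, index_oof = oof_predict(y, n_splits=2)
```

Samples 0 and 1 are only ever used for training, so index_oof starts at 2 and y_pred_oof is shorter than y. This is why the indices are returned alongside the predictions.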

class mlshell.model_selection.Resolver

Bases: object

Resolve dataset-related pipeline hyper-parameter.

Interface: resolve, th_resolver.

For example, numeric/categorical feature indices are dataset-dependent. Resolver allows setting them before the fit/optimize step.

Methods

calc_th_range(y_true, y_pred_proba, …[, …])

Calculate threshold range from OOF ROC curve.

resolve(hp_name, value, pipeline, dataset, …)

Resolve hyper-parameter value.

th_resolver(pipeline, dataset, **kwargs)

Calculate threshold range.

resolve(hp_name, value, pipeline, dataset, **kwargs)

Resolve hyper-parameter value.

Parameters
  • hp_name (str) – Hyper-parameter identifier.

  • value (any object) – Value to resolve.

  • pipeline (mlshell.Pipeline) – Pipeline that contains hp_name in pipeline.get_params().

  • dataset (mlshell.Dataset) – Dataset.

  • **kwargs (dict) – Additional kwargs to pass in corresponding resolver endpoint.

Returns

value – Resolved value. If no resolver endpoint, return value unchanged.

Return type

some object

Notes

Currently supported hp_name for mlshell.pipeline.Steps:

process_parallel__pipeline_categoric__select_columns__kw_args

dataset.meta['categoric_ind_name'].

process_parallel__pipeline_numeric__select_columns__kw_args

dataset.meta['numeric_ind_name'].

estimate__apply_threshold__threshold

Resolver.th_resolver().

estimate__apply_threshold__params

{i: dataset.meta[i] for i in ['pos_labels_ind', 'pos_labels', 'classes']}
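The mapping above amounts to a dispatch on hp_name with a pass-through for unknown names. A minimal sketch under that assumption follows; resolve_sketch is hypothetical, omits the pipeline argument and the threshold endpoint, and always resolves known names, whereas the real method may apply further conditions. The meta values and the name estimate__alpha are made up.

```python
from types import SimpleNamespace


def resolve_sketch(hp_name, value, dataset):
    """Hypothetical dispatch: known hp_name -> value from dataset.meta."""
    endpoints = {
        "process_parallel__pipeline_categoric__select_columns__kw_args":
            lambda: dataset.meta["categoric_ind_name"],
        "process_parallel__pipeline_numeric__select_columns__kw_args":
            lambda: dataset.meta["numeric_ind_name"],
        "estimate__apply_threshold__params":
            lambda: {k: dataset.meta[k]
                     for k in ("pos_labels_ind", "pos_labels", "classes")},
    }
    endpoint = endpoints.get(hp_name)
    # No resolver endpoint: return the value unchanged.
    return endpoint() if endpoint is not None else value


dataset = SimpleNamespace(meta={
    "categoric_ind_name": {0: "color"},
    "numeric_ind_name": {1: "price"},
    "pos_labels_ind": [1], "pos_labels": ["yes"], "classes": [["no", "yes"]],
})
numeric = resolve_sketch(
    "process_parallel__pipeline_numeric__select_columns__kw_args", "auto", dataset)
th_params = resolve_sketch("estimate__apply_threshold__params", "auto", dataset)
unchanged = resolve_sketch("estimate__alpha", 0.5, dataset)
```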

th_resolver(pipeline, dataset, **kwargs)

Calculate threshold range.

If the threshold must be optimized simultaneously with other hps, extracting optimal threshold values from the data in advance could provide more directed tuning than using random values.

Raises

ValueError – If kwargs key ‘method’ is absent or kwargs[‘method’] != ‘predict_proba’.

Returns

th_range – Thresholds array sorted ascending of shape [samples] or [n_outputs * samples]/[samples, n_outputs] for multi-output. In the multi-output case each target has a separate th_range of length samples; the output contains the concatenated, merged or combined ranges, depending on the multi_output argument of mlshell.model_selection.Resolver.calc_th_range().

Return type

numpy.ndarray, list of numpy.ndarray

calc_th_range(y_true, y_pred_proba, pos_labels, pos_labels_ind, metric=None, samples=10, sampler=None, multi_output='concat', plot_flag=False, roc_auc_kwargs=None)

Calculate threshold range from OOF ROC curve.

Parameters
  • y_true (numpy.ndarray) – Target(s) of shape [n_samples,] or [n_samples, n_outputs] for multi-output.

  • y_pred_proba (numpy.ndarray, list of numpy.ndarray) – Probability prediction of shape [n_samples, n_classes] or [n_outputs, n_samples, n_classes] for multi-output.

  • pos_labels (list) – List of positive labels for each target.

  • pos_labels_ind (list) – List of positive labels index in numpy.unique() for each target.

  • metric (callable, optional (default=None)) – metric(fpr, tpr, th_) should return the optimal threshold value, the corresponding th_ index and a vector for metric visualization of shape [n_samples,]. If None, tpr/(fpr+tpr) is used.

  • samples (int, optional (default=10)) – Number of unique threshold values to sample (requires enough data).

  • sampler (callable, optional (default=None)) – sampler(optimum, th_, samples) should return (sub-range of th_, original index of sub-range). If None, linear sample from [optimum/100; 2*optimum] with limits [np.min(th_), 1].

  • multi_output (str {'merge','product','concat'}, optional (default='concat')) – For the multi-output case, either merge th_range for each target, find all combinations, or concatenate ranges. See notes below.

  • plot_flag (bool, optional (default=False)) – If True, plot the ROC curve and the resulting th range.

  • roc_auc_kwargs (dict, optional (default=None)) – Additional kwargs to pass to sklearn.metrics.roc_auc_score(). If None, {}.
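A custom metric has to follow the documented contract: take fpr, tpr and th_, and return the optimal threshold, its index and a vector for plotting. The sketch below implements the stated default tpr/(fpr+tpr) under that contract. default_metric_sketch is hypothetical, the zero-denominator guard is an assumption, and the ROC points are made up.

```python
def default_metric_sketch(fpr, tpr, th_):
    """Hypothetical metric: maximize tpr / (fpr + tpr) along the ROC curve."""
    # Guard the (fpr, tpr) = (0, 0) point, where the ratio is undefined.
    vector = [t / (f + t) if (f + t) > 0 else 0.0 for f, t in zip(fpr, tpr)]
    index = max(range(len(vector)), key=vector.__getitem__)
    return th_[index], index, vector


# Made-up ROC points, thresholds in decreasing order.
fpr = [0.0, 0.0, 0.2, 0.6, 1.0]
tpr = [0.0, 0.5, 0.8, 1.0, 1.0]
th_ = [1.0, 0.9, 0.6, 0.3, 0.1]

optimum, index, vector = default_metric_sketch(fpr, tpr, th_)
```

The returned optimum is the center around which the sampler then draws the candidate thresholds.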

Returns

th_range – Thresholds array sorted ascending of shape [samples] or [n_outputs * samples] for multi-output.

Return type

numpy.ndarray, list of numpy.ndarray

Notes

For multi-output, if th_range_1 = [0.1, 0.2] and th_range_2 = [0.3, 0.4]:

  • concat => [(0.1, 0.3), (0.2, 0.4)]

  • product => [(0.1, 0.3), (0.1, 0.4), (0.2, 0.3), (0.2, 0.4)]

  • merge => [(0.1, 0.1), (0.2, 0.2), (0.3, 0.3), (0.4, 0.4)]
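The three modes in the note can be reproduced with the standard library. combine_ranges is a hypothetical helper that mirrors the documented outputs for per-target ranges; it is not the function used inside calc_th_range.

```python
from itertools import product


def combine_ranges(ranges, multi_output="concat"):
    """Combine per-target threshold ranges into tuples of thresholds."""
    if multi_output == "concat":
        # Pair the i-th threshold of every target.
        return list(zip(*ranges))
    if multi_output == "product":
        # Every combination of per-target thresholds.
        return list(product(*ranges))
    if multi_output == "merge":
        # Pool all thresholds and apply the same value to every target.
        merged = sorted(set().union(*ranges))
        return [(th,) * len(ranges) for th in merged]
    raise ValueError(f"Unknown multi_output: {multi_output!r}")


ranges = [[0.1, 0.2], [0.3, 0.4]]
concat = combine_ranges(ranges, "concat")
prod = combine_ranges(ranges, "product")
merge = combine_ranges(ranges, "merge")
```

With samples thresholds per target and n_outputs targets, concat yields samples tuples, product yields samples ** n_outputs, and merge yields at most samples * n_outputs, which determines how many candidates the optimizer has to evaluate.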