mlshell.model_selection¶

The mlshell.model_selection module contains hyper-parameter tuning utilities.

Functions

cross_val_predict(*args, **kwargs)

Extended sklearn.model_selection.cross_val_predict().

Classes

`MockClassifier`()	Estimator always predicts train feature.
`MockOptimizer`(pipeline, hp_grid, scoring[, …])	Threshold optimizer.
`MockRegressor`()	Estimator always predicts features.
`Optimizer`()	Unified optimizer interface.
`PredictionTransformer`(classifier)	Transformer applies predict_proba on features.
`RandomizedSearchOptimizer`(pipeline, hp_grid, …)	Wrapper around `sklearn.model_selection.RandomizedSearchCV`.
`Resolver`()	Resolve dataset-related pipeline hyper-parameter.
`ThresholdClassifier`(params[, threshold])	Estimator applies classification threshold.
`Validator`()	Validate fitted pipeline.

class mlshell.model_selection.PredictionTransformer(classifier)¶

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin, sklearn.base.MetaEstimatorMixin

Transformer applies predict_proba on features.

Parameters: classifier (classifier object) – Classifier that supports predict_proba.
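The idea of exposing predict_proba output as features can be illustrated with a minimal sketch. ProbaTransformer below is a hypothetical stand-in written for this example, not the mlshell implementation; it assumes the wrapped classifier follows the scikit-learn fit/predict_proba convention.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.linear_model import LogisticRegression


class ProbaTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: fit a classifier, emit its probabilities."""

    def __init__(self, classifier):
        self.classifier = classifier

    def fit(self, X, y=None, **fit_params):
        # Fit a clone so the estimator passed to __init__ stays untouched.
        self.classifier_ = clone(self.classifier).fit(X, y, **fit_params)
        return self

    def transform(self, X):
        # Probabilities become the "features" seen by the next pipeline step.
        return self.classifier_.predict_proba(X)


X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
proba = ProbaTransformer(LogisticRegression()).fit_transform(X, y)
print(proba.shape)  # (4, 2)
```

A step like this lets a downstream estimator, such as a threshold classifier, operate on probabilities instead of raw features.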

Methods

`fit_transform`(X[, y])	Fit to data, then transform it.
`get_params`([deep])	Get parameters for this estimator.
`set_params`(**params)	Set the parameters of this estimator.

fit
transform

fit_transform(X, y=None, **fit_params)¶

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters

X ({array-like, sparse matrix, dataframe} of shape (n_samples, n_features)) –
y (ndarray of shape (n_samples,), default=None) – Target values.
**fit_params (dict) – Additional fit parameters.

Returns

X_new – Transformed array.

Return type

ndarray of shape (n_samples, n_features_new)

get_params(deep=True)¶

Get parameters for this estimator.

Parameters: deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: mapping of string to any

set_params(**params)¶

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
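As an illustration of the <component>__<parameter> form, the following sketch uses a plain scikit-learn Pipeline. The step names "scale" and "clf" are chosen for this example only.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

# "<step name>__<parameter>" reaches into the nested estimator.
pipe.set_params(clf__C=0.1, scale__with_mean=False)

print(pipe.get_params()["clf__C"])          # 0.1
print(pipe.named_steps["scale"].with_mean)  # False
```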

Parameters: **params (dict) – Estimator parameters.
Returns: self – Estimator instance.
Return type: object

class mlshell.model_selection.ThresholdClassifier(params, threshold=None)¶

Bases: sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin

Estimator applies classification threshold.

Classify samples based on whether they are above or below the threshold. Expects prediction probabilities as input features.

Parameters

params (dict) –
Parameters combined in a dictionary to be set together: {

‘classes’: list of numpy.ndarray
List of sorted unique labels for each target, of shape (n_outputs, n_classes).

‘pos_labels’: list
List of “positive” labels, one per target, of shape (n_outputs,).

‘pos_labels_ind’: list
List of indices of the “positive” labels in np.unique(target), one per target, of shape (n_outputs,).

}
threshold (float in [0, 1], list of float in [0, 1], optional (default=None)) – Classification threshold. For a multi-output target, a list of length n_outputs. If None, numpy.argmax() is used (equivalent to a threshold of 0.5 in the binary case). If the positive class probability P(pos_label) = 1 - P(neg_labels) exceeds the threshold th_ for a sample, the classifier predicts pos_label; otherwise it predicts the label in neg_labels with the highest probability.
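The decision rule described for threshold can be sketched for a single target with NumPy. The function apply_threshold and its argument names are hypothetical and only mirror the description above; this is not the library's code.

```python
import numpy as np


def apply_threshold(proba, classes, pos_label_ind, threshold=None):
    """Sketch of the thresholding rule for one target (hypothetical helper)."""
    proba = np.asarray(proba, dtype=float)
    classes = np.asarray(classes)
    if threshold is None:
        # No threshold: fall back to the most probable class.
        return classes[np.argmax(proba, axis=1)]
    # Best negative label: hide the positive column, then take argmax.
    neg_proba = proba.copy()
    neg_proba[:, pos_label_ind] = -np.inf
    best_neg = classes[np.argmax(neg_proba, axis=1)]
    is_pos = proba[:, pos_label_ind] > threshold
    return np.where(is_pos, classes[pos_label_ind], best_neg)


proba = np.array([[0.7, 0.3], [0.4, 0.6], [0.85, 0.15]])
with_th = apply_threshold(proba, classes=[0, 1], pos_label_ind=1, threshold=0.2)
no_th = apply_threshold(proba, classes=[0, 1], pos_label_ind=1)
print(with_th.tolist())  # [1, 1, 0]
print(no_th.tolist())    # [0, 1, 0]
```

Lowering the threshold from the implicit 0.5 to 0.2 flips the first sample to the positive label, which is the effect a threshold search exploits.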

Notes

Will be replaced with: https://github.com/scikit-learn/scikit-learn/pull/16525.

Attributes

classes_

Methods

`get_params`([deep])	Get parameters for this estimator.
`score`(X, y[, sample_weight])	Return the mean accuracy on the given test data and labels.
`set_params`(**params)	Set the parameters of this estimator.

fit
predict
predict_proba

get_params(deep=True)¶

Get parameters for this estimator.

Parameters: deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: mapping of string to any

score(X, y, sample_weight=None)¶

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters

X (array-like of shape (n_samples, n_features)) – Test samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns

score – Mean accuracy of self.predict(X) wrt. y.

Return type

float

set_params(**params)¶

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters: **params (dict) – Estimator parameters.
Returns: self – Estimator instance.
Return type: object

class mlshell.model_selection.MockClassifier¶

Bases: sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin

Estimator always predicts train feature.

Methods

`get_params`([deep])	Get parameters for this estimator.
`score`(X, y[, sample_weight])	Return the mean accuracy on the given test data and labels.
`set_params`(**params)	Set the parameters of this estimator.

fit
predict

get_params(deep=True)¶

Get parameters for this estimator.

Parameters: deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: mapping of string to any

score(X, y, sample_weight=None)¶

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters

X (array-like of shape (n_samples, n_features)) – Test samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns

score – Mean accuracy of self.predict(X) wrt. y.

Return type

float

set_params(**params)¶

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters: **params (dict) – Estimator parameters.
Returns: self – Estimator instance.
Return type: object

class mlshell.model_selection.MockRegressor¶

Bases: sklearn.base.BaseEstimator, sklearn.base.RegressorMixin

Estimator always predicts features.

Methods

`get_params`([deep])	Get parameters for this estimator.
`score`(X, y[, sample_weight])	Return the coefficient of determination R^2 of the prediction.
`set_params`(**params)	Set the parameters of this estimator.

fit
predict

get_params(deep=True)¶

Get parameters for this estimator.

Parameters: deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: mapping of string to any

score(X, y, sample_weight=None)¶

Return the coefficient of determination R^2 of the prediction.

The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
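The definition can be checked numerically. The sketch below computes u and v by hand on a small made-up sample and compares the result with sklearn.metrics.r2_score.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

u = ((y_true - y_pred) ** 2).sum()         # residual sum of squares
v = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares
r2 = 1 - u / v

# A constant model predicting the mean of y has u == v, hence R^2 == 0.
y_const = np.full_like(y_true, y_true.mean())
r2_const = 1 - ((y_true - y_const) ** 2).sum() / v

print(np.isclose(r2, r2_score(y_true, y_pred)))  # True
print(r2_const)                                  # 0.0
```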

Parameters

X (array-like of shape (n_samples, n_features)) – Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead, shape = (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True values for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns

score – R^2 of self.predict(X) wrt. y.

Return type

float

Notes

The R2 score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with default value of r2_score(). This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).

set_params(**params)¶

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters: **params (dict) – Estimator parameters.
Returns: self – Estimator instance.
Return type: object

class mlshell.model_selection.Optimizer¶

Bases: object

Unified optimizer interface.

Implements an interface for accessing an arbitrary optimizer. Interface: dump_runs, update_best and all methods of the underlying optimizer.

optimizer¶

Underlying optimizer.

Type: sklearn.model_selection.BaseSearchCV

Notes

Calls to unspecified methods are redirected to the underlying optimizer object.
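One common way to get this kind of redirection in Python is __getattr__, which is invoked only when normal attribute lookup fails. The class DelegatingOptimizer below is a hypothetical sketch of that pattern and makes no claim about how mlshell implements it.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV


class DelegatingOptimizer:
    """Hypothetical wrapper: unknown attributes go to `self.optimizer`."""

    def __init__(self, optimizer):
        self.optimizer = optimizer

    def update_best(self, prev):
        # Placeholder for wrapper-specific logic.
        return dict(prev)

    def __getattr__(self, name):
        # Guard against infinite recursion if `optimizer` is not set yet.
        if name == "optimizer":
            raise AttributeError(name)
        return getattr(self.optimizer, name)


X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 0, 0, 1, 1, 1]
wrapped = DelegatingOptimizer(
    GridSearchCV(LogisticRegression(), {"C": [0.1, 1.0]}, cv=2)
)
wrapped.fit(X, y)            # forwarded to GridSearchCV.fit
print(wrapped.best_params_)  # forwarded attribute access
```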

Methods

`dump_runs`(logger, dirpath, pipeline, …)	Dump results.
`update_best`(prev)	Combine results from multi-stage optimization.

update_best(prev)¶

Combine results from multi-stage optimization.

The logic for choosing the best run is set here. Currently the best hp combination and the corresponding estimator are taken from the last stage. If any hp is brute-forced in more than one stage, a more complicated rule is required to merge runs.

Parameters

prev (dict) – Previous stage update_best output for some pipeline-data pair. Initially set to {}. See update_best output format.

Returns

nxt – Result of merging runs on all optimization stages for some pipeline-data pair: {

‘params’: list of dict
List of cv_results_['params'] for all runs in stages.

‘best_params_’: dict
Best estimator tuned params from all optimization stages.

‘best_estimator_’: sklearn estimator
Best estimator: optimizer.best_estimator_ if it exists, else optimizer.estimator.set_params(**best_params_) (if ‘refit’ is not True).

‘best_score_’: tuple
Best score ('scorer_id', optimizer.best_score_), where scorer_id=str(optimizer.refit). If best_score_ is absent, ('', float('-inf')) is used.

}

Return type

dict

Notes

mlshell.Workflow utilizes:

‘best_estimator_’ key to update pipeline in objects.
‘params’ in built-in plotter.
‘best_score_’ in dump/dump_pred for file names.
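The output format described above can be sketched as a standalone function. merge_stage is a hypothetical helper written for this example; in particular, merging best_params_ across stages with a plain dict update is an assumption, and a fitted scikit-learn GridSearchCV stands in for the underlying optimizer.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV


def merge_stage(optimizer, prev):
    """Hypothetical merge rule: the last stage wins, run params accumulate."""
    best_params = {**prev.get("best_params_", {}),
                   **getattr(optimizer, "best_params_", {})}
    if hasattr(optimizer, "best_estimator_"):
        best_estimator = optimizer.best_estimator_
    else:
        # No refit was done: configure the unfitted estimator instead.
        best_estimator = optimizer.estimator.set_params(**best_params)
    if hasattr(optimizer, "best_score_"):
        best_score = (str(optimizer.refit), optimizer.best_score_)
    else:
        best_score = ("", float("-inf"))
    return {
        "params": prev.get("params", []) + list(optimizer.cv_results_["params"]),
        "best_params_": best_params,
        "best_estimator_": best_estimator,
        "best_score_": best_score,
    }


X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 0, 0, 1, 1, 1]
stage = GridSearchCV(LogisticRegression(), {"C": [0.1, 1.0]}, cv=2).fit(X, y)
result = merge_stage(stage, prev={})
print(result["best_score_"][0])  # True  (refit=True by default)
print(len(result["params"]))     # 2
```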

dump_runs(logger, dirpath, pipeline, dataset, **kwargs)¶

Dump results.

Parameters

logger (logging.Logger) – Logger.
dirpath (str) – Absolute path to dump dir.
pipeline (mlshell.Pipeline) – Pipeline used for optimizer.fit.
dataset (mlshell.Dataset) – Dataset used for optimizer.fit.
**kwargs (dict) – Additional kwargs to pass in low-level dump function.

Notes

Resulting file name: <timestamp>_runs.csv. Each row corresponds to a run; column names:

‘id’ random UUID for run (hp combination).
All pipeline parameters.
Grid search output runs keys.
Pipeline info: ‘pipeline__id’, ‘pipeline__hash’, ‘pipeline__type’.
Dataset info: ‘dataset__id’, ‘dataset__hash’.

The hash may change when the interpreter is restarted, because the address of some underlying function changes.

class mlshell.model_selection.RandomizedSearchOptimizer(pipeline, hp_grid, scoring, **kwargs)¶

Bases: mlshell.blocks.model_selection.search.Optimizer

Wrapper around sklearn.model_selection.RandomizedSearchCV.

Parameters

pipeline (sklearn estimator) – See corresponding argument for sklearn.model_selection.RandomizedSearchCV.
hp_grid (dict) – See corresponding argument for sklearn.model_selection.RandomizedSearchCV. Only the dict type is currently supported for hp_grid.
scoring (string, callable, list/tuple, dict, optional (default=None)) – See corresponding argument for sklearn.model_selection.RandomizedSearchCV.
**kwargs (dict) – Kwargs for sklearn.model_selection.RandomizedSearchCV. If kwargs[‘n_iter’] is None, it is replaced with the number of hp combinations in hp_grid (“1” if hp_grid is empty or contains only distributions).
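The n_iter replacement rule can be approximated as follows. count_combinations is a hypothetical helper; it assumes that distributions are recognized by an rvs method (as scipy.stats frozen distributions have) and that only list-like entries contribute to the count, which may differ from the library's exact behaviour.

```python
import math

from scipy.stats import uniform


def count_combinations(hp_grid):
    """Hypothetical helper: grid size over list-like entries, at least 1."""
    listed = [v for v in hp_grid.values() if not hasattr(v, "rvs")]
    # math.prod of an empty sequence is 1, covering the "empty" and
    # "only distributions" cases.
    return math.prod(len(v) for v in listed)


print(count_combinations({"C": [0.1, 1, 10], "penalty": ["l1", "l2"]}))  # 6
print(count_combinations({"C": uniform(0, 1)}))                          # 1
print(count_combinations({}))                                            # 1
```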

Methods

`dump_runs`(logger, dirpath, pipeline, …)	Dump results.
`update_best`(prev)	Combine results from multi-stage optimization.

dump_runs(logger, dirpath, pipeline, dataset, **kwargs)¶

Dump results.

Parameters

logger (logging.Logger) – Logger.
dirpath (str) – Absolute path to dump dir.
pipeline (mlshell.Pipeline) – Pipeline used for optimizer.fit.
dataset (mlshell.Dataset) – Dataset used for optimizer.fit.
**kwargs (dict) – Additional kwargs to pass in low-level dump function.

Notes

Resulting file name: <timestamp>_runs.csv. Each row corresponds to a run; column names:

‘id’ random UUID for run (hp combination).
All pipeline parameters.
Grid search output runs keys.
Pipeline info: ‘pipeline__id’, ‘pipeline__hash’, ‘pipeline__type’.
Dataset info: ‘dataset__id’, ‘dataset__hash’.

The hash may change when the interpreter is restarted, because the address of some underlying function changes.

update_best(prev)¶

Combine results from multi-stage optimization.

The logic for choosing the best run is set here. Currently the best hp combination and the corresponding estimator are taken from the last stage. If any hp is brute-forced in more than one stage, a more complicated rule is required to merge runs.

Parameters

prev (dict) – Previous stage update_best output for some pipeline-data pair. Initially set to {}. See update_best output format.

Returns

nxt – Result of merging runs on all optimization stages for some pipeline-data pair: {

‘params’: list of dict
List of cv_results_['params'] for all runs in stages.

‘best_params_’: dict
Best estimator tuned params from all optimization stages.

‘best_estimator_’: sklearn estimator
Best estimator: optimizer.best_estimator_ if it exists, else optimizer.estimator.set_params(**best_params_) (if ‘refit’ is not True).

‘best_score_’: tuple
Best score ('scorer_id', optimizer.best_score_), where scorer_id=str(optimizer.refit). If best_score_ is absent, ('', float('-inf')) is used.

}

Return type

dict

Notes

mlshell.Workflow utilizes:

‘best_estimator_’ key to update pipeline in objects.
‘params’ in built-in plotter.
‘best_score_’ in dump/dump_pred for file names.

class mlshell.model_selection.MockOptimizer(pipeline, hp_grid, scoring, method='predict', **kwargs)¶

Bases: mlshell.blocks.model_selection.search.RandomizedSearchOptimizer

Threshold optimizer.

Provides an interface for efficiently brute-forcing prediction-related parameters in a separate optimization step, for example the classification threshold or score function kwargs. MockOptimizer avoids refitting the pipeline in such cases. Internally, mlshell.model_selection.cross_val_predict is called with the specified method, and the hps are optimized on the output predictions.

Parameters

pipeline (sklearn estimator) – See corresponding argument for sklearn.model_selection.RandomizedSearchCV.
hp_grid (dict) – Specify only hps that support mock optimization: should not depend on prediction. If {}, mlshell.custom.MockClassifier or mlshell.custom.MockRegressor is used for compliance.
scoring (string, callable, list/tuple, dict, optional (default=None)) – See corresponding argument in sklearn.model_selection.RandomizedSearchCV.
method (str {'predict_proba', 'predict'}, optional (default='predict')) – Set to ‘predict_proba’ if the classifier supports it and any metric needs probabilities (needs_proba). See corresponding argument for mlshell.model_selection.cross_val_predict.
**kwargs (dict) – Kwargs for sklearn.model_selection.RandomizedSearchCV. If kwargs[‘n_iter’] is None, it is replaced with the number of hp combinations in hp_grid.

Notes

To brute-force the threshold, set method to ‘predict_proba’. To brute-force scorer kwargs alone, method can be ‘predict’ or ‘predict_proba’, depending on whether the scoring needs probabilities.
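The motivation for tuning on cached predictions can be shown with plain scikit-learn. The sketch below is not MockOptimizer itself: it assumes a binary problem with positive label 1, uses sklearn.model_selection.cross_val_predict directly, and picks f1_score as an arbitrary example metric.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=200, random_state=0)

# One round of cross-validated fitting yields out-of-fold probabilities.
proba = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=5, method="predict_proba"
)

# Every threshold candidate is scored on the cached predictions: no refit.
thresholds = np.linspace(0.1, 0.9, 9)
scores = [f1_score(y, (proba[:, 1] > th).astype(int)) for th in thresholds]
best_th = thresholds[int(np.argmax(scores))]
print(best_th)
```

The model is fitted five times in total, regardless of how many threshold candidates are evaluated.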

Methods

`dump_runs`(logger, dirpath, pipeline, …)	Dump results.
`update_best`(prev)	Combine results from multi-stage optimization.

fit

dump_runs(logger, dirpath, pipeline, dataset, **kwargs)¶

Dump results.

Parameters

logger (logging.Logger) – Logger.
dirpath (str) – Absolute path to dump dir.
pipeline (mlshell.Pipeline) – Pipeline used for optimizer.fit.
dataset (mlshell.Dataset) – Dataset used for optimizer.fit.
**kwargs (dict) – Additional kwargs to pass in low-level dump function.

Notes

Resulting file name: <timestamp>_runs.csv. Each row corresponds to a run; column names:

‘id’ random UUID for run (hp combination).
All pipeline parameters.
Grid search output runs keys.
Pipeline info: ‘pipeline__id’, ‘pipeline__hash’, ‘pipeline__type’.
Dataset info: ‘dataset__id’, ‘dataset__hash’.

The hash may change when the interpreter is restarted, because the address of some underlying function changes.

update_best(prev)¶

Combine results from multi-stage optimization.

The logic for choosing the best run is set here. Currently the best hp combination and the corresponding estimator are taken from the last stage. If any hp is brute-forced in more than one stage, a more complicated rule is required to merge runs.

Parameters

prev (dict) – Previous stage update_best output for some pipeline-data pair. Initially set to {}. See update_best output format.

Returns

nxt – Result of merging runs on all optimization stages for some pipeline-data pair: {

‘params’: list of dict
List of cv_results_['params'] for all runs in stages.

‘best_params_’: dict
Best estimator tuned params from all optimization stages.

‘best_estimator_’: sklearn estimator
Best estimator: optimizer.best_estimator_ if it exists, else optimizer.estimator.set_params(**best_params_) (if ‘refit’ is not True).

‘best_score_’: tuple
Best score ('scorer_id', optimizer.best_score_), where scorer_id=str(optimizer.refit). If best_score_ is absent, ('', float('-inf')) is used.

}

Return type

dict

Notes

mlshell.Workflow utilizes:

‘best_estimator_’ key to update pipeline in objects.
‘params’ in built-in plotter.
‘best_score_’ in dump/dump_pred for file names.

class mlshell.model_selection.Validator¶

Bases: object

Validate fitted pipeline.

Methods

validate(pipeline, metrics, datasets, logger)

Evaluate metrics on pipeline.

validate(pipeline, metrics, datasets, logger, method='metric', vector=False)¶

Evaluate metrics on pipeline.

Parameters

pipeline (mlshell.Pipeline) – Fitted model.
metrics (list of mlshell.Metric) – Metrics to evaluate.
datasets (list of mlshell.Dataset) – Datasets to evaluate on. For classification, dataset.meta should contain the pos_labels_ind key.
method ('metric', 'scorer' or 'vector') – If ‘metric’, efficient evaluation (reuse y_pred) via score_func(y, y_pred, **kwargs). If ‘scorer’, evaluate via scorer(pipeline, x, y). If ‘vector’, evaluate vectorized score via score_func_vector(y, y_pred, **kwargs).
vector (bool) – If True and method='metric', score_func_vector used instead of score_func to evaluate vectorized score (if available). Ignored for method='scorer'.
logger (logging.Logger) – Logger.

Returns

scores – Resulting scores {‘dataset_id’:{‘metric_id’: score}}.

Return type

dict
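The nested result format and the reuse of y_pred in the ‘metric’ method can be sketched without mlshell objects. In the example below, plain dicts stand in for mlshell.Metric and mlshell.Dataset, and evaluate is a hypothetical helper.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score


def evaluate(model, metrics, datasets):
    """Hypothetical helper returning {dataset_id: {metric_id: score}}."""
    scores = {}
    for dataset_id, (X, y) in datasets.items():
        y_pred = model.predict(X)  # computed once, reused for every metric
        scores[dataset_id] = {
            metric_id: score_func(y, y_pred)
            for metric_id, score_func in metrics.items()
        }
    return scores


X_train, y_train = [[0], [1], [2], [3]], [0, 1, 1, 1]
X_test, y_test = [[4], [5]], [1, 0]
model = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

scores = evaluate(
    model,
    metrics={"accuracy": accuracy_score, "f1": f1_score},
    datasets={"train": (X_train, y_train), "test": (X_test, y_test)},
)
print(scores["train"]["accuracy"])  # 0.75
print(scores["test"]["accuracy"])   # 0.5
```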

mlshell.model_selection.cross_val_predict(*args, **kwargs)¶

Extended sklearn.model_selection.cross_val_predict().

TimeSplitter support added (predictions for the first fold are absent).

Parameters

*args (list) – Passed to sklearn.model_selection.cross_val_predict().
**kwargs (dict) – Passed to sklearn.model_selection.cross_val_predict().

Returns

y_pred_oof (numpy.ndarray, list of numpy.ndarray) – If method='predict_proba': OOF probability predictions of shape [n_test_samples, n_classes], or [n_outputs, n_test_samples, n_classes] for multi-output. If method='predict': OOF predictions of shape [n_test_samples], or [n_test_samples, n_outputs] for multi-output.
index_oof (numpy.ndarray, list of numpy.ndarray) – Reset indices of the samples for which predictions are available, of shape [n_test_samples,].
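Why an index is returned alongside the predictions becomes clear with a time-ordered splitter, where the earliest samples never land in a test fold. The sketch below is a hand-written loop over sklearn.model_selection.TimeSeriesSplit; oof_predict is a hypothetical helper and not the mlshell function.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit


def oof_predict(estimator, X, y, cv):
    """Hypothetical helper: out-of-fold predictions plus their indices."""
    preds, index = [], []
    for train_idx, test_idx in cv.split(X):
        model = clone(estimator).fit(X[train_idx], y[train_idx])
        preds.append(model.predict(X[test_idx]))
        index.append(test_idx)
    return np.concatenate(preds), np.concatenate(index)


X = np.arange(12, dtype=float).reshape(-1, 1)
y = 2 * X.ravel() + 1
y_pred_oof, index_oof = oof_predict(
    LinearRegression(), X, y, TimeSeriesSplit(n_splits=3)
)
# Samples 0..2 only ever serve as training data, so they have no prediction.
print(index_oof.tolist())  # [3, 4, 5, 6, 7, 8, 9, 10, 11]
```

Scoring then uses y[index_oof] rather than the full target.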

class mlshell.model_selection.Resolver¶

Bases: object

Resolve dataset-related pipeline hyper-parameter.

Interface: resolve, th_resolver.

For example, the indices of numeric/categorical features are dataset-dependent. Resolver allows setting them before the fit/optimize step.

Methods

`calc_th_range`(y_true, y_pred_proba, …[, …])	Calculate threshold range from OOF ROC curve.
`resolve`(hp_name, value, pipeline, dataset, …)	Resolve hyper-parameter value.
`th_resolver`(pipeline, dataset, **kwargs)	Calculate threshold range.

resolve(hp_name, value, pipeline, dataset, **kwargs)¶

Resolve hyper-parameter value.

Parameters

hp_name (str) – Hyper-parameter identifier.
value (any object) – Value to resolve.
pipeline (mlshell.Pipeline) – Pipeline containing hp_name in pipeline.get_params().
dataset (mlshell.Dataset) – Dataset.
**kwargs (dict) – Additional kwargs to pass in corresponding resolver endpoint.

Returns

value – Resolved value. If no resolver endpoint, return value unchanged.

Return type

some object

Notes

Currently supported hp_name for mlshell.pipeline.Steps:

process_parallel__pipeline_categoric__select_columns__kw_args: dataset.meta[‘categoric_ind_name’].
process_parallel__pipeline_numeric__select_columns__kw_args: dataset.meta[‘numeric_ind_name’].
estimate__apply_threshold__threshold: Resolver.th_resolver().
estimate__apply_threshold__params: {i: dataset.meta[i] for i in [‘pos_labels_ind’, ‘pos_labels’, ‘classes’]}

th_resolver(pipeline, dataset, **kwargs)¶

Calculate threshold range.

If the threshold has to be optimized simultaneously with other hps, extracting optimal threshold values from the data in advance can provide more directed tuning than using random values.

Get predict_proba: mlshell.model_selection.cross_val_predict.
Get tpr, fpr, th_range relative to the positive label: sklearn.metrics.roc_curve().
Sample thresholds close to the optimum of a predefined metric: mlshell.model_selection.Resolver.calc_th_range().
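The three steps can be sketched end to end with scikit-learn alone. Several choices below are assumptions made for the example: a binary target with positive label 1, Youden's J (tpr - fpr) as a stand-in for the predefined metric, and linear sampling between optimum/100 and 2*optimum clipped to [min(th_), 1].

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=300, random_state=0)

# 1. Out-of-fold probabilities.
proba = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=5, method="predict_proba"
)

# 2. ROC curve relative to the positive label (column index 1).
fpr, tpr, th_ = roc_curve(y, proba[:, 1], pos_label=1)
# roc_curve prepends an artificial threshold above every score; drop it.
valid = th_ <= 1.0
fpr, tpr, th_ = fpr[valid], tpr[valid], th_[valid]

# 3. Locate an optimum (Youden's J as a stand-in) and sample around it.
optimum = th_[np.argmax(tpr - fpr)]
low = max(optimum / 100, th_.min())
high = min(2 * optimum, 1.0)
th_range = np.linspace(low, high, 10)
print(th_range)
```

The resulting th_range can then be used as the grid for the threshold hp, instead of values drawn at random from [0, 1].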

Parameters

pipeline (mlshell.Pipeline) – Pipeline.
dataset (mlshell.Dataset) – Dataset.
**kwargs (dict) –

kwargs[‘cross_val_predict’]: kwargs to pass to sklearn.model_selection.cross_val_predict(). method should always be set to ‘predict_proba’; the y argument is ignored.

kwargs[‘calc_th_range’]: kwargs to pass to mlshell.model_selection.Resolver.calc_th_range().

Raises

ValueError – If kwargs key ‘method’ is absent or kwargs[‘method’] != ‘predict_proba’.

Returns

th_range – Array of thresholds sorted in ascending order, of shape [samples], or [n_outputs * samples] / [samples, n_outputs] for multi-output. In the multi-output case each target has a separate th_range of length samples; the output contains the concatenated, merged or combined ranges, depending on the multi_output argument of mlshell.model_selection.Resolver.calc_th_range().

Return type

numpy.ndarray, list of numpy.ndarray

calc_th_range(y_true, y_pred_proba, pos_labels, pos_labels_ind, metric=None, samples=10, sampler=None, multi_output='concat', plot_flag=False, roc_auc_kwargs=None)¶

Calculate threshold range from OOF ROC curve.

Parameters

y_true (numpy.ndarray) – Target(s) of shape [n_samples,] or [n_samples, n_outputs] for multi-output.
y_pred_proba (numpy.ndarray, list of numpy.ndarray) – Probability prediction of shape [n_samples, n_classes] or [n_outputs, n_samples, n_classes] for multi-output.
pos_labels (list) – List of positive labels for each target.
pos_labels_ind (list) – List of positive labels index in numpy.unique() for each target.
metric (callable, optional (default=None)) – metric(fpr, tpr, th_) should return the optimal threshold value, the corresponding th_ index and a vector for metric visualization of shape [n_samples,]. If None, tpr/(fpr+tpr) is used.
samples (int, optional (default=10)) – Number of unique threshold values to sample (requires enough data).
sampler (callable, optional (default=None)) – sampler(optimum, th_, samples) should return (sub-range of th_, original index of sub-range). If None, linear sampling from [optimum/100; 2*optimum] with limits [np.min(th_), 1] is used.
multi_output (str {'merge','product','concat'}, optional (default='concat')) – For the multi-output case: merge the th_range of each target, find all combinations, or concatenate the ranges. See notes below.
plot_flag (bool, optional (default=False)) – If True, plot the ROC curve and the resulting th range.
roc_auc_kwargs (dict, optional (default=None)) – Additional kwargs to pass to sklearn.metrics.roc_auc_score(). If None, {}.

Returns

th_range – Thresholds array sorted ascending of shape [samples] or [n_outputs * samples] for multi-output.

Return type

numpy.ndarray, list of numpy.ndarray

Notes

For multi-output, if th_range_1 = [0.1, 0.2] and th_range_2 = [0.3, 0.4]:

concat => [(0.1, 0.3), (0.2, 0.4)]
product => [(0.1, 0.3), (0.1, 0.4), (0.2, 0.3), (0.2, 0.4)]
merge => [(0.1, 0.1), (0.2, 0.2), (0.3, 0.3), (0.4, 0.4)]
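The three modes in the note can be reproduced with standard-library tools. This sketch only mirrors the listed outputs for two targets; the variable names are chosen for the example and it is not the library's implementation.

```python
from itertools import product

th_range_1 = [0.1, 0.2]
th_range_2 = [0.3, 0.4]

# concat: pair the i-th threshold of each target.
concat = list(zip(th_range_1, th_range_2))

# product: every combination across targets.
combos = list(product(th_range_1, th_range_2))

# merge: pool all values, then apply the same threshold to every target.
pooled = sorted(set(th_range_1) | set(th_range_2))
merge = [(th, th) for th in pooled]

print(concat)  # [(0.1, 0.3), (0.2, 0.4)]
print(combos)  # [(0.1, 0.3), (0.1, 0.4), (0.2, 0.3), (0.2, 0.4)]
print(merge)   # [(0.1, 0.1), (0.2, 0.2), (0.3, 0.3), (0.4, 0.4)]
```

Note that product grows multiplicatively with the number of targets, while concat and merge stay linear in the number of sampled thresholds.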