mlshell.model_selection
The mlshell.model_selection module contains hyper-parameter tuning utilities.
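Functions
cross_val_predict | Extended sklearn.model_selection.cross_val_predict().
Classes
MockClassifier | Estimator always predicts train feature.
MockOptimizer | Threshold optimizer.
MockRegressor | Estimator always predicts features.
Optimizer | Unified optimizer interface.
PredictionTransformer | Transformer applies predict_proba on features.
RandomizedSearchOptimizer | Wrapper around sklearn.model_selection.RandomizedSearchCV.
Resolver | Resolve dataset-related pipeline hyper-parameter.
ThresholdClassifier | Estimator applies classification threshold.
Validator | Validate fitted pipeline.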
-
class mlshell.model_selection.PredictionTransformer(classifier)
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin, sklearn.base.MetaEstimatorMixin
Transformer applies predict_proba on features.
- Parameters
classifier (classifier object) – Classifier that supports predict_proba.
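The idea of exposing predict_proba as a transform step can be illustrated with a minimal, dependency-free sketch. ProbaTransformerSketch and ToyClassifier are hypothetical stand-ins written for this example; the real class derives from the scikit-learn base classes listed above and may differ in detail.

```python
class ProbaTransformerSketch:
    """Hypothetical stand-in: exposes a classifier's predict_proba as transform."""

    def __init__(self, classifier):
        self.classifier = classifier

    def fit(self, X, y=None, **fit_params):
        # Fitting the transformer means fitting the wrapped classifier.
        self.classifier.fit(X, y, **fit_params)
        return self

    def transform(self, X):
        # The "transformed features" are the class probabilities.
        return self.classifier.predict_proba(X)

    def fit_transform(self, X, y=None, **fit_params):
        return self.fit(X, y, **fit_params).transform(X)


class ToyClassifier:
    """Toy classifier (assumption for the demo): P(class 1) is the first feature clipped to [0, 1]."""

    def fit(self, X, y=None):
        return self

    def predict_proba(self, X):
        probas = []
        for row in X:
            p = min(max(row[0], 0.0), 1.0)
            probas.append([1.0 - p, p])
        return probas


probas = ProbaTransformerSketch(ToyClassifier()).fit_transform([[0.2], [0.9]], [0, 1])
```

A downstream step (for example, a threshold classifier) can then consume probas as its input features.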
Methods
fit_transform(X[, y]) | Fit to data, then transform it.
get_params([deep]) | Get parameters for this estimator.
set_params(**params) | Set the parameters of this estimator.
fit
transform
-
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X ({array-like, sparse matrix, dataframe} of shape (n_samples, n_features)) –
y (ndarray of shape (n_samples,), default=None) – Target values.
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
ndarray array of shape (n_samples, n_features_new)
-
get_params(deep=True)
Get parameters for this estimator.
- Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
params – Parameter names mapped to their values.
- Return type
mapping of string to any
-
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
-
class mlshell.model_selection.ThresholdClassifier(params, threshold=None)
Bases: sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin
Estimator applies classification threshold.
Classifies samples based on whether they are above or below the threshold. Expects prediction probabilities as features.
- Parameters
params (dict) – Parameters combined in a dictionary to be set together: {
- 'classes': list of numpy.ndarray
List of sorted unique labels for each target, of shape (n_outputs, n_classes).
- 'pos_labels': list
List of "positive" labels, one per target, of shape (n_outputs,).
- 'pos_labels_ind': list
List of indices of the "positive" labels in np.unique(target), one per target, of shape (n_outputs,).
}
threshold (float in [0, 1] or list of float in [0, 1], optional (default=None)) – Classification threshold. For a multi-output target, a list of length n_outputs. If None, numpy.argmax() is used (in the binary case this is equivalent to a threshold of 0.5). If the positive class probability P(pos_label) = 1 - P(neg_labels) > th_ for a sample, the classifier predicts pos_label; otherwise it predicts the label in neg_labels with the maximum probability.
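The decision rule described for threshold can be sketched for a single sample and a single target as follows. The function name apply_threshold and its argument layout are assumptions made for illustration, not the library's internals.

```python
def apply_threshold(proba, classes, pos_label_ind, threshold=None):
    """Pick a label for one sample from its class probabilities (single-output case)."""
    if threshold is None:
        # No threshold: plain argmax over class probabilities.
        best = max(range(len(proba)), key=lambda i: proba[i])
        return classes[best]
    if proba[pos_label_ind] > threshold:
        # Positive class probability exceeds the threshold.
        return classes[pos_label_ind]
    # Otherwise choose the most probable label among the negative classes.
    negatives = [i for i in range(len(proba)) if i != pos_label_ind]
    best_negative = max(negatives, key=lambda i: proba[i])
    return classes[best_negative]


label_default = apply_threshold([0.7, 0.3], [0, 1], pos_label_ind=1)
label_low_th = apply_threshold([0.7, 0.3], [0, 1], pos_label_ind=1, threshold=0.25)
label_multiclass = apply_threshold([0.5, 0.2, 0.3], ['a', 'b', 'c'], pos_label_ind=1, threshold=0.4)
```

Lowering the threshold below 0.5 makes the positive label win even when it is not the most probable class, which is the behaviour a threshold search exploits.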
Notes
- Will be replaced with:
- Attributes
- classes_
Methods
get_params([deep]) | Get parameters for this estimator.
score(X, y[, sample_weight]) | Return the mean accuracy on the given test data and labels.
set_params(**params) | Set the parameters of this estimator.
fit
predict
predict_proba
-
get_params(deep=True)
Get parameters for this estimator.
- Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
params – Parameter names mapped to their values.
- Return type
mapping of string to any
-
score(X, y, sample_weight=None)
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.
- Parameters
X (array-like of shape (n_samples, n_features)) – Test samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
- Returns
score – Mean accuracy of self.predict(X) wrt. y.
- Return type
float
-
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
-
class mlshell.model_selection.MockClassifier
Bases: sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin
Estimator always predicts train feature.
Methods
get_params([deep]) | Get parameters for this estimator.
score(X, y[, sample_weight]) | Return the mean accuracy on the given test data and labels.
set_params(**params) | Set the parameters of this estimator.
fit
predict
-
get_params(deep=True)
Get parameters for this estimator.
- Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
params – Parameter names mapped to their values.
- Return type
mapping of string to any
-
score(X, y, sample_weight=None)
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.
- Parameters
X (array-like of shape (n_samples, n_features)) – Test samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
- Returns
score – Mean accuracy of self.predict(X) wrt. y.
- Return type
float
-
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
-
-
class mlshell.model_selection.MockRegressor
Bases: sklearn.base.BaseEstimator, sklearn.base.RegressorMixin
Estimator always predicts features.
Methods
get_params([deep]) | Get parameters for this estimator.
score(X, y[, sample_weight]) | Return the coefficient of determination R^2 of the prediction.
set_params(**params) | Set the parameters of this estimator.
fit
predict
-
get_params(deep=True)
Get parameters for this estimator.
- Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
params – Parameter names mapped to their values.
- Return type
mapping of string to any
-
score(X, y, sample_weight=None)
Return the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
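As a worked illustration of the definition above, the following pure-Python sketch computes 1 - u/v directly. The function name r2_sketch is an assumption for this example; it ignores sample weights and multi-output targets and does not handle the degenerate case v = 0.

```python
def r2_sketch(y_true, y_pred):
    """R^2 = 1 - u/v for a single target, without sample weights."""
    mean_true = sum(y_true) / len(y_true)
    u = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    v = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1 - u / v


y_true = [1, 2, 3]
r2_perfect = r2_sketch(y_true, [1, 2, 3])   # exact predictions
r2_constant = r2_sketch(y_true, [2, 2, 2])  # always predicts the mean of y_true
r2_reversed = r2_sketch(y_true, [3, 2, 1])  # worse than the constant model
```

The three cases reproduce the statements in the paragraph: the best score is 1.0, the constant mean model scores 0.0, and a worse model scores below zero.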
- Parameters
X (array-like of shape (n_samples, n_features)) – Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead, shape = (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True values for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
- Returns
score – R^2 of self.predict(X) wrt. y.
- Return type
float
Notes
The R2 score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with the default value of r2_score(). This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).
-
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
-
-
class mlshell.model_selection.Optimizer
Bases: object
Unified optimizer interface.
Implements an interface to access an arbitrary optimizer. Interface: dump_runs, update_best and all methods of the underlying optimizer.
-
optimizer
Underlying optimizer.
- Type
sklearn.model_selection.BaseSearchCV
Notes
Calls to unspecified methods are redirected to the underlying optimizer object.
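One common way to obtain this kind of redirection in Python is attribute delegation through __getattr__. The sketch below assumes that mechanism; the source does not state how the redirection is implemented, and OptimizerSketch and FakeSearch are hypothetical names.

```python
class OptimizerSketch:
    """Hypothetical wrapper that forwards unknown attributes to self.optimizer."""

    def __init__(self, optimizer):
        self.optimizer = optimizer

    def __getattr__(self, name):
        # __getattr__ runs only when normal attribute lookup fails.
        if name == 'optimizer':
            # Avoid infinite recursion if the wrapped object is not set yet.
            raise AttributeError(name)
        return getattr(self.optimizer, name)

    def update_best(self, prev):
        # Methods defined on the wrapper itself are found first.
        return {**prev, 'stage': 'updated'}


class FakeSearch:
    """Stand-in for a fitted search object."""
    best_score_ = 0.9

    def fit(self, X, y):
        return 'fitted'


wrapped = OptimizerSketch(FakeSearch())
fit_result = wrapped.fit(None, None)  # forwarded to FakeSearch.fit
best_score = wrapped.best_score_      # forwarded attribute access
merged = wrapped.update_best({})      # handled by the wrapper
```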
Methods
dump_runs(logger, dirpath, pipeline, …) | Dump results.
update_best(prev) | Combine results from multi-stage optimization.
-
update_best(prev)
Combine results from multi-stage optimization.
The logic for choosing the best run is set here. Currently, the best hp combination and the corresponding estimator are taken from the last stage. If any hp is brute-forced in more than one stage, a more complicated rule is required to merge runs.
- Parameters
prev (dict) – Previous stage update_best output for some pipeline-data pair. Initially set to {}. See update_best output format.
- Returns
nxt – Result of merging runs on all optimization stages for some pipeline-data pair: {
- 'params': list of dict
List of cv_results_['params'] for all runs in stages.
- 'best_params_': dict
Best estimator tuned params from all optimization stages.
- 'best_estimator_': sklearn estimator
Best estimator: optimizer.best_estimator_ if it exists, else optimizer.estimator.set_params(**best_params_) (if 'refit' is not True).
- 'best_score_': tuple
Best score ('scorer_id', optimizer.best_score_), where scorer_id=str(optimizer.refit). If best_score_ is absent, ('', float('-inf')) is used.
}
- Return type
dict
Notes
mlshell.Workflow utilizes:
'best_estimator_' key to update pipeline in objects.
'params' in built-in plotter.
'best_score_' in dump/dump_pred for file names.
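The merge described above can be sketched with plain dictionaries. This is an illustration under stated assumptions, not the library code: update_best_sketch is a hypothetical free function, and SimpleNamespace objects stand in for fitted search objects.

```python
from types import SimpleNamespace


def update_best_sketch(optimizer, prev):
    """Merge the current stage (optimizer) into the previous stages' result (prev)."""
    # Tuned params accumulate over stages; the latest stage wins on conflicts.
    best_params = {**prev.get('best_params_', {}),
                   **getattr(optimizer, 'best_params_', {})}
    if hasattr(optimizer, 'best_estimator_'):
        best_estimator = optimizer.best_estimator_
    else:
        # No refit: configure the unfitted estimator with the best params.
        best_estimator = optimizer.estimator.set_params(**best_params)
    if hasattr(optimizer, 'best_score_'):
        best_score = (str(optimizer.refit), optimizer.best_score_)
    else:
        best_score = ('', float('-inf'))
    return {
        'params': [*prev.get('params', []), *optimizer.cv_results_['params']],
        'best_params_': best_params,
        'best_estimator_': best_estimator,
        'best_score_': best_score,
    }


stage_1 = SimpleNamespace(cv_results_={'params': [{'alpha': 1}, {'alpha': 2}]},
                          best_params_={'alpha': 2}, best_estimator_='estimator_1',
                          best_score_=0.8, refit='accuracy')
stage_2 = SimpleNamespace(cv_results_={'params': [{'threshold': 0.3}]},
                          best_params_={'threshold': 0.3}, best_estimator_='estimator_2',
                          refit=False)
after_1 = update_best_sketch(stage_1, {})
after_2 = update_best_sketch(stage_2, after_1)
```

The second stage has no best_score_, so the fallback tuple is used, while the tuned params of both stages are kept.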
-
dump_runs(logger, dirpath, pipeline, dataset, **kwargs)
Dump results.
- Parameters
logger (logging.Logger) – Logger.
dirpath (str) – Absolute path to dump dir.
pipeline (mlshell.Pipeline) – Pipeline used for optimizer.fit.
dataset (mlshell.Dataset) – Dataset used for optimizer.fit.
**kwargs (dict) – Additional kwargs to pass to the low-level dump function.
Notes
The resulting file name is <timestamp>_runs.csv. Each row corresponds to a run. Column names:
'id': random UUID for the run (hp combination).
All pipeline parameters.
Grid search output runs keys.
Pipeline info: 'pipeline__id', 'pipeline__hash', 'pipeline__type'.
Dataset info: 'dataset__id', 'dataset__hash'.
The hash may change when the interpreter is restarted, because the address of some underlying function changes.
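The row layout listed in the notes can be sketched with the standard library. The function dump_runs_sketch, the integer timestamp format and the shape of the runs dictionary are assumptions made for this example; the real method also takes a logger and mlshell objects.

```python
import csv
import tempfile
import time
import uuid
from pathlib import Path


def dump_runs_sketch(dirpath, runs, pipeline_info, dataset_info):
    """Write one CSV row per run; returns the path of the created file."""
    rows = []
    for i, params in enumerate(runs['params']):
        row = {'id': str(uuid.uuid4())}                                   # random UUID per run
        row.update(params)                                                # tuned hp values
        row.update({k: v[i] for k, v in runs.items() if k != 'params'})   # other run outputs
        row.update({f'pipeline__{k}': v for k, v in pipeline_info.items()})
        row.update({f'dataset__{k}': v for k, v in dataset_info.items()})
        rows.append(row)
    # Timestamp format is an assumption; the source only says <timestamp>_runs.csv.
    filepath = Path(dirpath) / f'{int(time.time())}_runs.csv'
    with open(filepath, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]) if rows else ['id'])
        writer.writeheader()
        writer.writerows(rows)
    return filepath


runs = {'params': [{'alpha': 0.1}, {'alpha': 1.0}], 'mean_test_score': [0.8, 0.7]}
with tempfile.TemporaryDirectory() as tmp:
    path = dump_runs_sketch(tmp, runs,
                            {'id': 'pipe_1', 'hash': 'abc', 'type': 'classifier'},
                            {'id': 'train', 'hash': 'def'})
    with open(path, newline='') as f:
        rows = list(csv.DictReader(f))
```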
-
-
class mlshell.model_selection.RandomizedSearchOptimizer(pipeline, hp_grid, scoring, **kwargs)
Bases: mlshell.blocks.model_selection.search.Optimizer
Wrapper around sklearn.model_selection.RandomizedSearchCV.
- Parameters
pipeline (sklearn estimator) – See corresponding argument for sklearn.model_selection.RandomizedSearchCV.
hp_grid (dict) – See corresponding argument for sklearn.model_selection.RandomizedSearchCV. Only dict type for hp_grid is currently supported.
scoring (string, callable, list/tuple, dict, optional (default=None)) – See corresponding argument for sklearn.model_selection.RandomizedSearchCV.
**kwargs (dict) – Kwargs for sklearn.model_selection.RandomizedSearchCV. If kwargs['n_iter']=None, it is replaced with the number of hp combinations in hp_grid ("1" if only distributions are found or hp_grid is empty).
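A plausible reading of the n_iter rule is sketched below: list-like entries contribute their length to the product, and entries that look like distributions are skipped. Detecting a distribution by the presence of an rvs method is an assumption of this sketch, as are the names count_combinations and FakeDistribution.

```python
def count_combinations(hp_grid):
    """Number of hp combinations over the list-like entries of hp_grid."""
    n_combinations = 1
    for values in hp_grid.values():
        if hasattr(values, 'rvs'):
            # Looks like a distribution: it has no finite number of values.
            continue
        n_combinations *= len(values)
    return n_combinations


class FakeDistribution:
    """Stand-in for a scipy.stats-like distribution object."""

    def rvs(self, *args, **kwargs):
        return 0.5


n_grid = count_combinations({'alpha': [0.1, 1.0], 'depth': [2, 4, 8]})
n_mixed = count_combinations({'alpha': [0.1, 1.0], 'rate': FakeDistribution()})
n_only_dist = count_combinations({'rate': FakeDistribution()})
n_empty = count_combinations({})
```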
Methods
dump_runs(logger, dirpath, pipeline, …) | Dump results.
update_best(prev) | Combine results from multi-stage optimization.
-
dump_runs(logger, dirpath, pipeline, dataset, **kwargs)
Dump results.
- Parameters
logger (logging.Logger) – Logger.
dirpath (str) – Absolute path to dump dir.
pipeline (mlshell.Pipeline) – Pipeline used for optimizer.fit.
dataset (mlshell.Dataset) – Dataset used for optimizer.fit.
**kwargs (dict) – Additional kwargs to pass to the low-level dump function.
Notes
The resulting file name is <timestamp>_runs.csv. Each row corresponds to a run. Column names:
'id': random UUID for the run (hp combination).
All pipeline parameters.
Grid search output runs keys.
Pipeline info: 'pipeline__id', 'pipeline__hash', 'pipeline__type'.
Dataset info: 'dataset__id', 'dataset__hash'.
The hash may change when the interpreter is restarted, because the address of some underlying function changes.
-
update_best(prev)
Combine results from multi-stage optimization.
The logic for choosing the best run is set here. Currently, the best hp combination and the corresponding estimator are taken from the last stage. If any hp is brute-forced in more than one stage, a more complicated rule is required to merge runs.
- Parameters
prev (dict) – Previous stage update_best output for some pipeline-data pair. Initially set to {}. See update_best output format.
- Returns
nxt – Result of merging runs on all optimization stages for some pipeline-data pair: {
- 'params': list of dict
List of cv_results_['params'] for all runs in stages.
- 'best_params_': dict
Best estimator tuned params from all optimization stages.
- 'best_estimator_': sklearn estimator
Best estimator: optimizer.best_estimator_ if it exists, else optimizer.estimator.set_params(**best_params_) (if 'refit' is not True).
- 'best_score_': tuple
Best score ('scorer_id', optimizer.best_score_), where scorer_id=str(optimizer.refit). If best_score_ is absent, ('', float('-inf')) is used.
}
- Return type
dict
Notes
mlshell.Workflow utilizes:
'best_estimator_' key to update pipeline in objects.
'params' in built-in plotter.
'best_score_' in dump/dump_pred for file names.
-
class mlshell.model_selection.MockOptimizer(pipeline, hp_grid, scoring, method='predict', **kwargs)
Bases: mlshell.blocks.model_selection.search.RandomizedSearchOptimizer
Threshold optimizer.
Provides an interface to efficiently brute-force prediction-related parameters in a separate optimization step, for example the classification threshold or score function kwargs. MockOptimizer avoids refitting the pipeline in such cases. Internally, mlshell.model_selection.cross_val_predict is called with the specified method, and hps are optimized on the output prediction.
- Parameters
pipeline (sklearn estimator) – See corresponding argument for sklearn.model_selection.RandomizedSearchCV.
hp_grid (dict) – Specify only hps that support mock optimization: they should not depend on prediction. If {}, mlshell.custom.MockClassifier or mlshell.custom.MockRegressor is used for compliance.
scoring (string, callable, list/tuple, dict, optional (default=None)) – See corresponding argument in sklearn.model_selection.RandomizedSearchCV.
method (str {'predict_proba', 'predict'}, optional (default='predict')) – Set predict_proba if the classifier supports it and any metric needs_proba. See corresponding argument for mlshell.model_selection.cross_val_predict.
**kwargs (dict) – Kwargs for sklearn.model_selection.RandomizedSearchCV. If kwargs['n_iter']=None, it is replaced with the number of hp combinations in hp_grid.
Notes
To brute-force the threshold, set method to 'predict_proba'. To brute-force scorer kwargs alone, method can be 'predict' or 'predict_proba', depending on whether scoring needs probabilities.
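The benefit of this approach is that probabilities are computed once and every candidate threshold is then scored against the same predictions. A minimal sketch for the binary case follows; sweep_thresholds and accuracy are hypothetical helpers, and the probabilities are hard-coded stand-ins for out-of-fold predict_proba output.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def sweep_thresholds(y_true, proba_pos, thresholds, score_func):
    """Score every threshold on fixed probabilities; no model refit is involved."""
    results = []
    for th in thresholds:
        y_pred = [1 if p > th else 0 for p in proba_pos]
        results.append((th, score_func(y_true, y_pred)))
    best = max(results, key=lambda item: item[1])
    return best, results


y_true = [0, 0, 1, 1]
proba_pos = [0.1, 0.3, 0.45, 0.8]  # stand-in for OOF P(positive class)
best, results = sweep_thresholds(y_true, proba_pos, [0.2, 0.4, 0.6], accuracy)
```

Here the default 0.5 cut-off would misclassify the third sample, while the sweep finds that 0.4 separates the classes.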
Methods
dump_runs(logger, dirpath, pipeline, …) | Dump results.
update_best(prev) | Combine results from multi-stage optimization.
fit
-
dump_runs(logger, dirpath, pipeline, dataset, **kwargs)
Dump results.
- Parameters
logger (logging.Logger) – Logger.
dirpath (str) – Absolute path to dump dir.
pipeline (mlshell.Pipeline) – Pipeline used for optimizer.fit.
dataset (mlshell.Dataset) – Dataset used for optimizer.fit.
**kwargs (dict) – Additional kwargs to pass to the low-level dump function.
Notes
The resulting file name is <timestamp>_runs.csv. Each row corresponds to a run. Column names:
'id': random UUID for the run (hp combination).
All pipeline parameters.
Grid search output runs keys.
Pipeline info: 'pipeline__id', 'pipeline__hash', 'pipeline__type'.
Dataset info: 'dataset__id', 'dataset__hash'.
The hash may change when the interpreter is restarted, because the address of some underlying function changes.
-
update_best(prev)
Combine results from multi-stage optimization.
The logic for choosing the best run is set here. Currently, the best hp combination and the corresponding estimator are taken from the last stage. If any hp is brute-forced in more than one stage, a more complicated rule is required to merge runs.
- Parameters
prev (dict) – Previous stage update_best output for some pipeline-data pair. Initially set to {}. See update_best output format.
- Returns
nxt – Result of merging runs on all optimization stages for some pipeline-data pair: {
- 'params': list of dict
List of cv_results_['params'] for all runs in stages.
- 'best_params_': dict
Best estimator tuned params from all optimization stages.
- 'best_estimator_': sklearn estimator
Best estimator: optimizer.best_estimator_ if it exists, else optimizer.estimator.set_params(**best_params_) (if 'refit' is not True).
- 'best_score_': tuple
Best score ('scorer_id', optimizer.best_score_), where scorer_id=str(optimizer.refit). If best_score_ is absent, ('', float('-inf')) is used.
}
- Return type
dict
Notes
mlshell.Workflow utilizes:
'best_estimator_' key to update pipeline in objects.
'params' in built-in plotter.
'best_score_' in dump/dump_pred for file names.
-
class mlshell.model_selection.Validator
Bases: object
Validate fitted pipeline.
Methods
validate(pipeline, metrics, datasets, logger) | Evaluate metrics on pipeline.
-
validate(pipeline, metrics, datasets, logger, method='metric', vector=False)
Evaluate metrics on pipeline.
- Parameters
pipeline (mlshell.Pipeline) – Fitted model.
metrics (list of mlshell.Metric) – Metrics to evaluate.
datasets (list of mlshell.Dataset) – Datasets to evaluate on. For classification, dataset.meta should contain the pos_labels_ind key.
method ('metric', 'scorer' or 'vector') – If 'metric', efficient evaluation (reuses y_pred) via score_func(y, y_pred, **kwargs). If 'scorer', evaluate via scorer(pipeline, x, y). If 'vector', evaluate vectorized score via score_func_vector(y, y_pred, **kwargs).
vector (bool) – If True and method='metric', score_func_vector is used instead of score_func to evaluate vectorized score (if available). Ignored for method='scorer'.
logger (logging.Logger) – Logger.
- Returns
scores – Resulting scores {'dataset_id': {'metric_id': score}}.
- Return type
dict
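The 'metric' mode and the shape of the returned scores can be sketched as follows. Plain dictionaries and callables stand in for mlshell.Dataset and mlshell.Metric objects, and validate_sketch and CountingModel are hypothetical names used only for illustration.

```python
def validate_sketch(pipeline, metrics, datasets):
    """Return {'dataset_id': {'metric_id': score}}, predicting once per dataset."""
    scores = {}
    for dataset_id, (x, y) in datasets.items():
        y_pred = pipeline.predict(x)  # computed once, reused by every metric
        scores[dataset_id] = {metric_id: score_func(y, y_pred)
                              for metric_id, score_func in metrics.items()}
    return scores


class CountingModel:
    """Stand-in fitted model that counts how often predict is called."""

    def __init__(self):
        self.calls = 0

    def predict(self, x):
        self.calls += 1
        return [1 if value > 0.5 else 0 for value in x]


def accuracy(y, y_pred):
    return sum(t == p for t, p in zip(y, y_pred)) / len(y)


def n_errors(y, y_pred):
    return sum(t != p for t, p in zip(y, y_pred))


model = CountingModel()
datasets = {'train': ([0.1, 0.9, 0.7], [0, 1, 0]), 'test': ([0.6, 0.2], [1, 0])}
scores = validate_sketch(model, {'accuracy': accuracy, 'errors': n_errors}, datasets)
```

With two datasets and two metrics, predict runs twice rather than four times, which is the saving the 'metric' mode refers to.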
-
-
mlshell.model_selection.cross_val_predict(*args, **kwargs)
Extended sklearn.model_selection.cross_val_predict().
TimeSplitter support added (first fold prediction absent).
- Parameters
*args (list) – Passed to sklearn.model_selection.cross_val_predict().
**kwargs (dict) – Passed to sklearn.model_selection.cross_val_predict().
- Returns
y_pred_oof (numpy.ndarray, list of numpy.ndarray) – If method=predict_proba: OOF probability predictions of shape [n_test_samples, n_classes] or [n_outputs, n_test_samples, n_classes] for multi-output. If method=predict: OOF predictions of shape [n_test_samples] or [n_test_samples, n_outputs].
index_oof (numpy.ndarray, list of numpy.ndarray) – Reset indices of the samples for which predictions are available, of shape [n_test_samples,].
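Why the first fold has no prediction can be seen in a small sketch of time-ordered splitting, where each fold is predicted by a model trained only on earlier samples. The expanding-window scheme, the function time_ordered_oof and the toy predictor are assumptions of this example, not a description of TimeSplitter itself.

```python
def time_ordered_oof(y, n_folds, fit_predict):
    """Out-of-fold predictions where each fold is predicted from earlier data only."""
    fold_size = len(y) // n_folds
    y_pred_oof, index_oof = [], []
    # Fold 0 has no earlier data to train on, so it never receives a prediction.
    for k in range(1, n_folds):
        start = k * fold_size
        stop = len(y) if k == n_folds - 1 else start + fold_size
        train_idx = list(range(start))
        test_idx = list(range(start, stop))
        y_pred_oof.extend(fit_predict(train_idx, test_idx))
        index_oof.extend(test_idx)
    return y_pred_oof, index_oof


y = [1, 2, 3, 4, 5, 6]


def mean_of_train(train_idx, test_idx):
    # Toy "model": predict the mean of the training targets.
    mean = sum(y[i] for i in train_idx) / len(train_idx)
    return [mean] * len(test_idx)


y_pred_oof, index_oof = time_ordered_oof(y, 3, mean_of_train)
```

The returned index list is what allows predictions to be aligned with the subset of targets that actually have them.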
-
class mlshell.model_selection.Resolver
Bases: object
Resolve dataset-related pipeline hyper-parameter.
Interface: resolve, th_resolver.
For example, numeric/categorical feature indices are dataset-dependent. Resolver allows setting them before the fit/optimize step.
Methods
calc_th_range(y_true, y_pred_proba, …[, …]) | Calculate threshold range from OOF ROC curve.
resolve(hp_name, value, pipeline, dataset, …) | Resolve hyper-parameter value.
th_resolver(pipeline, dataset, **kwargs) | Calculate threshold range.
-
resolve(hp_name, value, pipeline, dataset, **kwargs)
Resolve hyper-parameter value.
- Parameters
hp_name (str) – Hyper-parameter identifier.
value (any object) – Value to resolve.
pipeline (mlshell.Pipeline) – Pipeline containing hp_name in pipeline.get_params().
dataset (mlshell.Dataset) – Dataset.
**kwargs (dict) – Additional kwargs to pass to the corresponding resolver endpoint.
- Returns
value – Resolved value. If no resolver endpoint, return value unchanged.
- Return type
some object
Notes
Currently supported hp_name for mlshell.pipeline.Steps:
process_parallel__pipeline_categoric__select_columns__kw_args: dataset.meta['categoric_ind_name'].
process_parallel__pipeline_numeric__select_columns__kw_args: dataset.meta['numeric_ind_name'].
estimate__apply_threshold__threshold: Resolver.th_resolver().
estimate__apply_threshold__params: {i: dataset.meta[i] for i in ['pos_labels_ind', 'pos_labels', 'classes']}
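The mapping above amounts to a dispatch on hp_name with a pass-through fallback. The sketch below illustrates that idea only: resolve_sketch is hypothetical, the meta values are made up, the threshold endpoint is omitted, and the condition under which the real method decides to resolve a value is not covered here.

```python
def resolve_sketch(hp_name, value, meta):
    """Look up a dataset-dependent value for hp_name, else return value unchanged."""
    endpoints = {
        'process_parallel__pipeline_categoric__select_columns__kw_args':
            lambda: meta['categoric_ind_name'],
        'process_parallel__pipeline_numeric__select_columns__kw_args':
            lambda: meta['numeric_ind_name'],
        'estimate__apply_threshold__params':
            lambda: {key: meta[key] for key in ['pos_labels_ind', 'pos_labels', 'classes']},
    }
    endpoint = endpoints.get(hp_name)
    return endpoint() if endpoint is not None else value


# Made-up dataset metadata for the demo.
meta = {
    'categoric_ind_name': {0: 'color'},
    'numeric_ind_name': {1: 'price'},
    'pos_labels_ind': [1],
    'pos_labels': ['yes'],
    'classes': [['no', 'yes']],
}
resolved_params = resolve_sketch('estimate__apply_threshold__params', 'auto', meta)
resolved_numeric = resolve_sketch(
    'process_parallel__pipeline_numeric__select_columns__kw_args', 'auto', meta)
untouched = resolve_sketch('estimate__classifier__C', 0.5, meta)
```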
-
th_resolver(pipeline, dataset, **kwargs)
Calculate threshold range.
If the threshold must be optimized simultaneously with other hps, extracting optimal threshold values from the data in advance can provide more directed tuning than using random values.
- Get predict_proba:
- Get tpr, fpr, th_range relative to positive label:
- Sampling thresholds close to optimum of predefined metric:
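The three steps above can be sketched in pure Python for the binary case. This is an illustration under assumptions: the probabilities are hard-coded stand-ins for out-of-fold predict_proba output, the optimum uses the tpr/(fpr+tpr) default documented for calc_th_range, and clipping the linear range to [min(th_), 1] is one reading of the documented limits. All function names are hypothetical.

```python
def roc_points(y_true, proba_pos, pos_label):
    """(fpr, tpr, threshold) for every distinct probability, highest threshold first."""
    n_pos = sum(1 for y in y_true if y == pos_label)
    n_neg = len(y_true) - n_pos
    points = []
    for th in sorted(set(proba_pos), reverse=True):
        tp = sum(1 for y, p in zip(y_true, proba_pos) if p >= th and y == pos_label)
        fp = sum(1 for y, p in zip(y_true, proba_pos) if p >= th and y != pos_label)
        points.append((fp / n_neg, tp / n_pos, th))
    return points


def default_metric(fpr, tpr):
    return tpr / (fpr + tpr) if fpr + tpr > 0 else 0.0


def sample_thresholds(optimum, th_min, samples):
    """Linear grid over [optimum/100, 2*optimum], clipped to [th_min, 1]; needs samples >= 2."""
    low = max(optimum / 100, th_min)
    high = min(2 * optimum, 1.0)
    step = (high - low) / (samples - 1)
    return [low + i * step for i in range(samples)]


y_true = [0, 0, 1, 1]
proba_pos = [0.1, 0.4, 0.35, 0.8]                       # step 1: stand-in OOF probabilities
points = roc_points(y_true, proba_pos, pos_label=1)     # step 2: tpr, fpr, thresholds
optimum = max(points, key=lambda p: default_metric(p[0], p[1]))[2]
th_range = sample_thresholds(optimum, min(proba_pos), samples=4)  # step 3: sample near optimum
```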
- Parameters
pipeline (mlshell.Pipeline) – Pipeline.
dataset (mlshell.Dataset) – Dataset.
**kwargs (dict) –
- kwargs['cross_val_predict'] to pass in: sklearn.model_selection.cross_val_predict(). method should always be set to 'predict_proba'; the y argument is ignored.
- kwargs['calc_th_range'] to pass in:
- Raises
ValueError – If kwargs key 'method' is absent or kwargs['method'] != 'predict_proba'.
- Returns
th_range – Thresholds array sorted ascending of shape [samples] or [n_outputs * samples]/[samples, n_outputs] for multi-output. In the multi-output case each target has a separate th_range of length samples; the output contains concatenated, merged or combined ranges, depending on the multi_output argument of mlshell.model_selection.Resolver.calc_th_range().
- Return type
numpy.ndarray, list of numpy.ndarray
-
calc_th_range(y_true, y_pred_proba, pos_labels, pos_labels_ind, metric=None, samples=10, sampler=None, multi_output='concat', plot_flag=False, roc_auc_kwargs=None)
Calculate threshold range from OOF ROC curve.
- Parameters
y_true (numpy.ndarray) – Target(s) of shape [n_samples,] or [n_samples, n_outputs] for multi-output.
y_pred_proba (numpy.ndarray, list of numpy.ndarray) – Probability prediction of shape [n_samples, n_classes] or [n_outputs, n_samples, n_classes] for multi-output.
pos_labels (list) – List of positive labels for each target.
pos_labels_ind (list) – List of positive label indices in numpy.unique() for each target.
metric (callable, optional (default=None)) – metric(fpr, tpr, th_) should return the optimal threshold value, the corresponding th_ index and a vector for metric visualization of shape [n_samples,]. If None, tpr/(fpr+tpr) is used.
samples (int, optional (default=10)) – Number of unique threshold values to sample (there should be enough data).
sampler (callable, optional (default=None)) – sampler(optimum, th_, samples) should return (sub-range of th_, original index of sub-range). If None, linear sampling from [optimum/100; 2*optimum] with limits [np.min(th_), 1].
multi_output (str {'merge','product','concat'}, optional (default='concat')) – For the multi-output case, either merge th_range for each target, find all combinations, or concatenate ranges. See notes below.
plot_flag (bool, optional (default=False)) – If True, plot the ROC curve and the resulting th range.
roc_auc_kwargs (dict, optional (default=None)) – Additional kwargs to pass to sklearn.metrics.roc_auc_score(). If None, {}.
- Returns
th_range – Thresholds array sorted ascending of shape [samples] or [n_outputs * samples] for multi-output.
- Return type
numpy.ndarray, list of numpy.ndarray
Notes
For multi-output, if th_range_1 = [0.1, 0.2] and th_range_2 = [0.3, 0.4]:
concat => [(0.1, 0.3), (0.2, 0.4)]
product => [(0.1, 0.3), (0.1, 0.4), (0.2, 0.3), (0.2, 0.4)]
merge => [(0.1, 0.1), (0.2, 0.2), (0.3, 0.3), (0.4, 0.4)]
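The three multi_output modes from the note can be reproduced with a few lines of standard-library Python. combine_ranges is a hypothetical helper written to match the example values above; it is not the library's implementation.

```python
from itertools import product


def combine_ranges(th_ranges, mode='concat'):
    """Combine per-target threshold ranges into tuples with one value per target."""
    if mode == 'concat':
        # Pair thresholds position by position.
        return list(zip(*th_ranges))
    if mode == 'product':
        # Every combination of per-target thresholds.
        return list(product(*th_ranges))
    if mode == 'merge':
        # Pool all thresholds and apply the same value to every target.
        pooled = sorted({th for th_range in th_ranges for th in th_range})
        return [tuple([th] * len(th_ranges)) for th in pooled]
    raise ValueError(f'Unknown mode: {mode}')


th_ranges = [[0.1, 0.2], [0.3, 0.4]]
concat = combine_ranges(th_ranges, 'concat')
prod = combine_ranges(th_ranges, 'product')
merge = combine_ranges(th_ranges, 'merge')
```

The product mode grows multiplicatively with the number of targets, so concat or merge are the cheaper choices when many outputs are involved.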
-