mlshell.Workflow
-
class mlshell.Workflow(objects, oid, path_id='path__default', logger_id='logger__default')
Bases: pycnfg.producer.Producer
Interface to ML task.
Interface: fit, predict, optimize, validate, dump, plot.
- Parameters
objects (dict) – Dictionary of objects produced by previously executed producers: {‘section_id__config__id’: object}.
oid (str) – Unique identifier of produced object.
path_id (str, optional (default='path__default')) – Project path identifier in objects.
logger_id (str, optional (default='logger__default')) – Logger identifier in objects.
-
objects
Dictionary of objects produced by previously executed producers: {‘section_id__config__id’: object}.
- Type
-
logger
Logger.
- Type
See also
mlshell.Dataset
Dataset interface.
mlshell.Metric
Metric interface.
mlshell.Pipeline
Pipeline interface.
Methods
dict_api(obj[, method]) – Forwarding API for dictionary object.
dump(res, pipeline_id[, dirpath]) – Dump pipeline.
dump_cache(obj[, prefix, cachedir, pkg]) – Dump intermediate object state to IO.
fit(res, pipeline_id, dataset_id[, …]) – Fit pipeline.
load_cache(obj[, prefix, cachedir, pkg]) – Load intermediate object state from IO.
optimize(res, pipeline_id, dataset_id[, …]) – Optimize pipeline.
plot(res, pipeline_id, dataset_id, metric_id) – Plot metrics.
predict(res, pipeline_id, dataset_id[, …]) – Make and dump prediction.
run(init, steps) – Execute configuration steps.
update(obj, items) – Update key(s) for dictionary object.
validate(res, pipeline_id, dataset_id, metric_id) – Make and score prediction.
-
__init__(objects, oid, path_id='path__default', logger_id='logger__default')
Initialize self. See help(type(self)) for accurate signature.
-
fit(res, pipeline_id, dataset_id, subset_id='train', hp=None, resolver=None, resolve_params=None, fit_params=None)
Fit pipeline.
- Parameters
res (dict) – For compliance with producer logic.
pipeline_id (str) – Pipeline identifier in objects. Will be fitted on dataset_id__subset_id: pipeline.fit(subset.x, subset.y, **fit_params).
dataset_id (str) – Dataset identifier in objects.
subset_id (str, optional (default='train')) – Data subset identifier to fit on. If ‘’, use full dataset.
hp (dict, optional (default=None)) – Hyper-parameters to use in pipeline: {hp_name: val/container}. If a range is provided for any hp, the value at position zero is used. If None, {}.
resolver (mlshell.model_selection.Resolver, optional (default=None)) – If hp value = ‘auto’, hp will be resolved via resolver.resolve(). Auto initialized if necessary. If None, mlshell.model_selection.Resolver is used.
resolve_params (dict, optional (default=None)) – Additional kwargs to pass in resolver.resolve(*args, **resolve_params[hp_name]). If None, {}.
fit_params (dict, optional (default=None)) – Additional kwargs to pass in pipeline.fit(*args, **fit_params). If None, {}.
- Returns
res – Unchanged input, for compliance with producer logic.
- Return type
Notes
Pipeline is updated in the objects attribute.
See also
mlshell.model_selection.Resolver
Hp resolver.
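The rule for hp ("if a range is provided, the value at position zero is used") can be illustrated with a minimal sketch. The helper name first_position is hypothetical and not part of mlshell, and treating only lists, tuples and ranges as containers is an assumption; the library may accept other container types.

```python
def first_position(hp):
    """Collapse container-valued hyper-parameters to their first element.

    Hypothetical helper, not mlshell's implementation.
    """
    resolved = {}
    for name, value in (hp or {}).items():  # None is treated as {}
        # Assumption: lists, tuples and ranges count as a "range" of values.
        if isinstance(value, (list, tuple, range)) and len(value) > 0:
            resolved[name] = value[0]
        else:
            resolved[name] = value
    return resolved


hp = {"estimator__alpha": [0.1, 1.0, 10.0], "estimator__fit_intercept": True}
params = first_position(hp)
```

With a scikit-learn style pipeline, a dictionary like params would typically be applied through set_params(**params) before calling fit.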
-
optimize(res, pipeline_id, dataset_id, subset_id='train', metric_id=None, hp_grid=None, resolver=None, optimizer=None, dirpath=None, resolve_params=None, fit_params=None, gs_params=None, dump_params=None)
Optimize pipeline.
- Parameters
res (dict) – For compliance with producer logic.
pipeline_id (str) – Pipeline identifier in objects. Will be cross-validated on dataset_id__subset_id: optimizer.fit(subset.x, subset.y, **fit_params).
dataset_id (str) – Dataset identifier in objects.
subset_id (str, optional (default='train')) – Data subset identifier to CV on. If ‘’, use full dataset.
metric_id (str, list/tuple of str, optional (default=None)) – List of ‘metric_id’ to use in optimizer scoring. Known ‘metric_id’ will be resolved via objects or sklearn built-ins, otherwise KeyError is raised. If None, ‘accuracy’ or ‘r2’ is used, depending on the pipeline estimator type.
hp_grid (dict, optional (default=None)) – Hyper-parameters to grid search: {hp_name: optimizer format}. If None, {}.
resolver (mlshell.model_selection.Resolver, optional (default=None)) – If hp value = [‘auto’] in hp_grid, hp will be resolved via resolver.resolve(). Auto initialized if class provided. If None, mlshell.model_selection.Resolver is used.
optimizer (mlshell.model_selection.Optimizer, optional (default=None)) – Class to optimize hp_grid. Will be called as optimizer(pipeline, hp_grid, scoring, **gs_params).fit(x, y, **fit_params). If None, mlshell.model_selection.RandomizedSearchOptimizer is used.
dirpath (str, optional (default=None)) – Absolute path to the dump result ‘runs’ dir or relative to ‘project__path’ started with ‘./’. If None, “project__path/results/runs” is used. See Notes for runs description.
resolve_params (dict, optional (default=None)) – Additional kwargs to pass in resolver.resolve(*args, **resolve_params[hp_name]). If None, {}.
fit_params (dict, optional (default=None)) – Additional kwargs to pass in optimizer.fit(*args, **fit_params). If None, {}.
gs_params (dict, optional (default=None)) – Additional kwargs to optimizer(pipeline, hp_grid, scoring, **gs_params) initialization. If None, {}.
dump_params (dict, optional (default=None)) – Additional kwargs to pass in optimizer.dump_runs(**dump_params). If None, {}.
- Returns
res – Input’s key added/updated: {‘runs’: dict}. ‘runs’ stores optimization results for each pipeline-data pair: {‘pipeline_id|dataset_id__subset_id’: optimizer.update_best output}.
- Return type
Notes
Optimization flow:
- Call grid search: optimizer(pipeline.pipeline, hp_grid, scoring, **gs_params).fit(x, y, **fit_params).
- Call dump runs: optimizer.dump_runs(logger, dirpath, **dump_params), where each run = probing one hp combination.
- Combine optimization results with the previous ones for the pipeline-data pair: optimizer.update_best(prev_runs).
- Update pipeline object in objects. Only if ‘best_estimator_’ in ‘runs’.
See also
mlshell.model_selection.Resolver
Hp resolver.
mlshell.model_selection.Optimizer
Hp optimizer.
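The optimization flow can be sketched with a toy stand-in. ToyOptimizer, its constructor arguments, and the shape of its update_best output are hypothetical and only mimic the documented call order (search, then merge with previous results under the ‘pipeline_id|dataset_id__subset_id’ key); the dump_runs step and the real mlshell.model_selection.Optimizer interface are not reproduced here.

```python
from itertools import product


class ToyOptimizer:
    """Hypothetical stand-in for an optimizer class; not mlshell's implementation."""

    def __init__(self, score_fn, hp_grid):
        self.score_fn = score_fn
        self.hp_grid = hp_grid
        self.runs = []

    def fit(self):
        # Each run probes one hp combination from the grid.
        names = sorted(self.hp_grid)
        for values in product(*(self.hp_grid[name] for name in names)):
            params = dict(zip(names, values))
            self.runs.append({"params": params, "score": self.score_fn(**params)})
        return self

    def update_best(self, prev):
        # Keep the previous best unless this search found a strictly better run.
        best = max(self.runs, key=lambda run: run["score"])
        if prev and prev["score"] >= best["score"]:
            return prev
        return best


def optimize(res, pipeline_id, dataset_id, subset_id, score_fn, hp_grid):
    # Results are stored per pipeline-data pair under the 'runs' key of res.
    key = f"{pipeline_id}|{dataset_id}__{subset_id}"
    runs = res.setdefault("runs", {})
    optimizer = ToyOptimizer(score_fn, hp_grid).fit()
    runs[key] = optimizer.update_best(runs.get(key))
    return res


def score(alpha):
    # Toy objective with its maximum at alpha = 1.0.
    return -(alpha - 1.0) ** 2


res = optimize({}, "pipeline__sgd", "dataset__train", "train",
               score_fn=score, hp_grid={"alpha": [0.1, 1.0, 10.0]})
# A second, worse search leaves the stored best result untouched.
res = optimize(res, "pipeline__sgd", "dataset__train", "train",
               score_fn=score, hp_grid={"alpha": [5.0]})
```

Keeping results keyed by the pipeline-data pair lets several optimize steps on the same pair accumulate into a single best result, which is the behaviour the Notes describe.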
-
validate(res, pipeline_id, dataset_id, metric_id, subset_id=('train', 'test'), validator=None)
Make and score prediction.
- Parameters
res (dict) – For compliance with producer logic.
pipeline_id (str) – Pipeline identifier in objects. Will be validated on dataset_id__subset_id.
dataset_id (str) – Dataset identifier in objects.
subset_id (str, list/tuple of str, optional (default=('train', 'test'))) – Data subset(s) identifier(s) to validate on. ‘’ for full dataset.
metric_id (str, list/tuple of str) – Metric(s) identifier in objects.
validator (mlshell.model_selection.Validator, optional (default=None)) – Auto initialized if class provided. If None, mlshell.model_selection.Validator is used.
- Returns
res – Unchanged input, for compliance with producer logic.
- Return type
-
dump(res, pipeline_id, dirpath=None, **kwargs)
Dump pipeline.
- Parameters
res (dict) – For compliance with producer logic.
pipeline_id (str) – Pipeline identifier in objects. Will be dumped via pipeline.dump(**kwargs).
dirpath (str, optional (default=None)) – Absolute path to dump dir or relative to ‘project__path’ started with ‘./’. If None, “project__path/results/models” is used.
**kwargs (dict) – Additional kwargs to pass in pipeline.dump(**kwargs).
- Returns
res – Unchanged input, for compliance with producer logic.
- Return type
Notes
The resulting filename includes the prefix:
workflow_id|pipeline_id|fit_dataset_id|best_score|pipeline_hash|fit_dataset_hash|os_type|timestamp
fit_dataset_id = None if the pipeline is not fitted or has no such attribute. The ‘best_score’ is available only after optimize step(s), and only if the optimizer supports it.
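A minimal sketch of how such a prefix could be assembled is given here. The function dump_prefix is hypothetical; the use of platform.system() for os_type and an integer Unix time for timestamp are assumptions, since the page does not specify those formats.

```python
import platform
import time


def dump_prefix(workflow_id, pipeline_id, fit_dataset_id, best_score,
                pipeline_hash, fit_dataset_hash):
    """Join the documented fields with '|' (hypothetical helper)."""
    fields = [
        workflow_id, pipeline_id, fit_dataset_id, best_score,
        pipeline_hash, fit_dataset_hash,
        platform.system(),   # assumption: os_type format
        int(time.time()),    # assumption: timestamp format
    ]
    # str(None) gives 'None', in line with the note on unfitted pipelines.
    return "|".join(str(field) for field in fields)


prefix = dump_prefix("workflow__1", "pipeline__sgd", None, None, "abc123", "def456")
```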
-
predict(res, pipeline_id, dataset_id, subset_id='', dirpath=None, **kwargs)
Make and dump prediction.
- Parameters
res (dict) – For compliance with producer logic.
pipeline_id (str) – Pipeline identifier in objects to make prediction on dataset_id__subset_id: pipeline.predict(subset.x).
dataset_id (str) – Dataset identifier in objects.
subset_id (str, optional (default='')) – Data subset identifier to predict on. If ‘’, use full dataset.
dirpath (str, optional (default=None)) – Absolute path to dump dir or relative to ‘project__path’ started with ‘./’. If None, “project__path/results/models” is used.
**kwargs (dict) – Additional kwargs to pass in dataset.dump_pred(**kwargs).
- Returns
res – Unchanged input, for compliance with producer logic.
- Return type
Notes
The resulting filename includes the prefix:
workflow_id|pipeline_id|fit_dataset_id|best_score|pipeline_hash|fit_dataset_hash|predict_dataset_id|predict_dataset_hash|os_type|timestamp
fit_dataset_id = None if the pipeline is not fitted or has no such attribute. The ‘best_score’ is available only after optimize step(s), and only if the optimizer supports it.
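The dirpath convention shared by dump, predict and optimize (absolute path, or a path relative to the project path that starts with ‘./’, or a default when None) can be sketched as follows. The helper resolve_dirpath and its arguments are hypothetical; how mlshell obtains the project path from objects is not shown.

```python
import os


def resolve_dirpath(dirpath, project_path, default_subdir):
    """Resolve a dump directory (hypothetical helper, not mlshell's code)."""
    if dirpath is None:
        # Fall back to the documented default, e.g. 'results/models'.
        return os.path.join(project_path, default_subdir)
    if dirpath.startswith("./"):
        # Relative paths are anchored at the project path.
        return os.path.join(project_path, dirpath[2:])
    # Anything else is taken as an absolute path and returned unchanged.
    return dirpath


default_dir = resolve_dirpath(None, "/home/user/project", "results/models")
relative_dir = resolve_dirpath("./out/preds", "/home/user/project", "results/models")
absolute_dir = resolve_dirpath("/tmp/preds", "/home/user/project", "results/models")
```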
-
plot(res, pipeline_id, dataset_id, metric_id, validator=None, subset_id=('train', 'test'), plotter=None, **kwargs)
Plot metrics.
- Parameters
res (dict) – For compliance with producer logic.
pipeline_id (str) – Pipeline identifier in objects.
dataset_id (str) – Dataset identifier in objects.
subset_id (str, list/tuple of str, optional (default=('train', 'test'))) – Data subset(s) identifier(s) to plot on. Set ‘’ for full dataset.
metric_id (str, list/tuple of str) – Metric(s) identifier in objects.
validator (mlshell.model_selection.Validator, optional (default=None)) – Auto initialized if class provided. If None, mlshell.model_selection.Validator is used.
plotter (mlshell.plot.Plotter, optional (default=None)) – Auto initialized if class provided. If None, mlshell.plot.Plotter is used.
**kwargs (dict) – Additional kwargs to pass in plotter.plot(**kwargs).
- Returns
res – Unchanged input, for compliance with producer logic.
- Return type
See also
mlshell.plot.Plotter
Metric plotter.
-
dict_api(obj, method='update', **kwargs)
Forwarding API for dictionary object.
Useful for adding or popping keys via configuration steps. For example, to perform an update: (‘dict_api’, {‘b’: 7}).
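A rough sketch of what such forwarding could look like for the default method, assuming the step kwargs are passed straight to the named dict method and the object is returned for producer logic. This is an illustration only, not the library's implementation; in particular, how arguments reach positional-only methods such as pop is not specified on this page and is not covered here.

```python
def dict_api(obj, method="update", **kwargs):
    """Forward a call to a dict method (illustrative sketch only)."""
    # Assumption: kwargs from the configuration step go directly to the method.
    getattr(obj, method)(**kwargs)
    # Assumption: the dictionary itself is returned, as other steps return res/obj.
    return obj


# Equivalent of the configuration step ('dict_api', {'b': 7}).
result = dict_api({"a": 1}, **{"b": 7})
```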
-
dump_cache(obj, prefix=None, cachedir=None, pkg='pickle', **kwargs)
Dump intermediate object state to IO.
- Parameters
obj (picklable) – Object to dump.
prefix (str, optional (default=None)) – File identifier, added to filename. If None, ‘self.oid’ is used.
cachedir (str, optional (default=None)) – Absolute path to dump dir or relative to ‘project_path’ started with ‘./’. Created if it does not exist. If None, “project_path/.temp/objects” is used.
pkg (str, optional (default='pickle')) – Import package and try pkg.dump(obj, file, **kwargs).
**kwargs (kwargs) – Additional parameters to pass in .dump().
- Returns
obj – Unchanged input for compliance with producer logic.
- Return type
picklable
-
load_cache(obj, prefix=None, cachedir=None, pkg='pickle', **kwargs)
Load intermediate object state from IO.
- Parameters
obj (picklable) – Object template, for producer logic only (ignored).
prefix (str, optional (default=None)) – File identifier. If None, ‘self.oid’ is used.
pkg (str, optional (default='pickle')) – Import package and try obj = pkg.load(file, **kwargs).
cachedir (str, optional (default=None)) – Absolute path to load dir or relative to ‘project_path’ started with ‘./’. If None, ‘project_path/.temp/objects’ is used.
**kwargs (kwargs) – Additional parameters to pass in .load().
- Returns
obj – Loaded cache.
- Return type
picklable object
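The dump_cache / load_cache pair can be pictured as a round trip through a package exposing dump(obj, file) and load(file), with pickle as the default. The sketch below uses standalone functions with the same names for readability; the filename scheme (prefix plus package name as extension) is an assumption, and the real methods resolve the prefix and cachedir defaults from the producer instance.

```python
import importlib
import os
import tempfile


def cache_path(prefix, cachedir, pkg):
    # Assumption: hypothetical filename scheme, not the library's.
    return os.path.join(cachedir, f"{prefix}.{pkg}")


def dump_cache(obj, prefix, cachedir, pkg="pickle", **kwargs):
    os.makedirs(cachedir, exist_ok=True)  # created if it does not exist
    with open(cache_path(prefix, cachedir, pkg), "wb") as file:
        importlib.import_module(pkg).dump(obj, file, **kwargs)
    return obj  # unchanged input


def load_cache(obj, prefix, cachedir, pkg="pickle", **kwargs):
    # obj is only a template here and is ignored.
    with open(cache_path(prefix, cachedir, pkg), "rb") as file:
        return importlib.import_module(pkg).load(file, **kwargs)


with tempfile.TemporaryDirectory() as tmp:
    cachedir = os.path.join(tmp, ".temp", "objects")
    original = {"weights": [1, 2, 3]}
    dump_cache(original, "pipeline__sgd", cachedir)
    restored = load_cache(None, "pipeline__sgd", cachedir)
```

Placing a dump_cache step after an expensive step and a load_cache step in its place on later runs is one way to skip recomputation of intermediate objects.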
-
run(init, steps)
Execute configuration steps.
Consecutive call (with decorators):
init = getattr(self, 'method_id')(init, objects=objects, **kwargs)
- Parameters
init (object) – Will be passed as arg in each step and get back as result.
steps (list of tuples) – List of self methods to run consecutively with kwargs: (‘method_id’, kwargs, decorators).
- Returns
configs – List of configurations, prepared for execution: [(‘section_id__config__id’, config), …].
- Return type
list of tuple
Notes
Object identifier oid is auto added if the produced object has an oid attribute.
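The consecutive call pattern of run can be sketched with a toy producer. ToyProducer and its make and scale methods are hypothetical; decorators and the automatic oid handling are left out, and returning the final init is an assumption based on the description of the init parameter.

```python
class ToyProducer:
    """Hypothetical stand-in producer; decorators and oid handling are omitted."""

    def __init__(self, objects, oid):
        self.objects = objects
        self.oid = oid

    def make(self, init, objects=None, **kwargs):
        # First step: fill the initial object from kwargs.
        init.update(kwargs)
        return init

    def scale(self, init, objects=None, factor=1):
        # Second step: transform the result of the previous step.
        return {key: value * factor for key, value in init.items()}

    def run(self, init, steps):
        for step in steps:
            method_id = step[0]
            kwargs = step[1] if len(step) > 1 else {}
            # Each step receives the previous result and returns the next one.
            init = getattr(self, method_id)(init, objects=self.objects, **kwargs)
        return init


producer = ToyProducer(objects={}, oid="workflow__toy")
result = producer.run({}, [("make", {"a": 1, "b": 2}), ("scale", {"factor": 10})])
```

The same shape applies to the Workflow methods above: each accepts res as its first argument and returns it, so steps such as fit, validate and dump can be chained in one steps list.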