mlshell.Workflow

class mlshell.Workflow(objects, oid, path_id='path__default', logger_id='logger__default')[source]

Bases: pycnfg.producer.Producer

Interface to an ML task.

Interface: fit, predict, optimize, validate, dump, plot.

Parameters
  • objects (dict) – Dictionary with resulting objects from previously executed producers: {‘section_id__config__id’: object}.

  • oid (str) – Unique identifier of produced object.

  • path_id (str, optional (default='path__default')) – Project path identifier in objects.

  • logger_id (str, optional (default='logger__default')) – Logger identifier in objects.

objects

Dictionary with resulting objects from previously executed producers: {‘section_id__config__id’: object, …}.

Type

dict

oid

Unique identifier of produced object.

Type

str

logger

Logger.

Type

logging.Logger

project_path

Absolute path to project dir.

Type

str

See also

mlshell.Dataset

Dataset interface.

mlshell.Metric

Metric interface.

mlshell.Pipeline

Pipeline interface.
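The objects argument follows a flat ‘section_id__config__id’ naming convention. A minimal sketch of how such identifiers resolve to produced objects (the keys and placeholder values here are illustrative assumptions, not real mlshell objects):

```python
# Illustrative sketch of the objects-dictionary convention; keys and
# placeholder values are assumptions, not real mlshell objects.
objects = {
    'path__default': '/tmp/project',    # produced by a path producer
    'logger__default': '<logger>',      # produced by a logger producer
    'pipeline__default': '<pipeline>',
    'dataset__default': '<dataset>',
}

def lookup(objects, oid):
    """Resolve a 'section_id__config__id' identifier to its object."""
    if oid not in objects:
        raise KeyError(f'unknown object identifier: {oid!r}')
    return objects[oid]

print(lookup(objects, 'path__default'))  # /tmp/project
```

Workflow methods such as fit and predict receive these identifiers (pipeline_id, dataset_id) and perform exactly this kind of lookup internally.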

Methods

dict_api(obj[, method])

Forwarding api for dictionary object.

dump(res, pipeline_id[, dirpath])

Dump pipeline.

dump_cache(obj[, prefix, cachedir, pkg])

Dump intermediate object state to IO.

fit(res, pipeline_id, dataset_id[, …])

Fit pipeline.

load_cache(obj[, prefix, cachedir, pkg])

Load intermediate object state from IO.

optimize(res, pipeline_id, dataset_id[, …])

Optimize pipeline.

plot(res, pipeline_id, dataset_id, metric_id)

Plot metrics.

predict(res, pipeline_id, dataset_id[, …])

Make and dump prediction.

run(init, steps)

Execute configuration steps.

update(obj, items)

Update key(s) for dictionary object.

validate(res, pipeline_id, dataset_id, metric_id)

Make and score prediction.

__init__(objects, oid, path_id='path__default', logger_id='logger__default')[source]

Initialize self. See help(type(self)) for accurate signature.


fit(res, pipeline_id, dataset_id, subset_id='train', hp=None, resolver=None, resolve_params=None, fit_params=None)[source]

Fit pipeline.

Parameters
  • res (dict) – For compliance with producer logic.

  • pipeline_id (str) – Pipeline identifier in objects. Will be fitted on dataset_id__subset_id: pipeline.fit(subset.x, subset.y, **fit_params).

  • dataset_id (str) – Dataset identifier in objects.

  • subset_id (str, optional (default='train')) – Data subset identifier to fit on. If ‘’, use full dataset.

  • hp (dict, optional (default=None)) – Hyper-parameters to use in pipeline: {hp_name: val/container}. If a range is provided for any hp, the value at position zero is used. If None, {}.

  • resolver (mlshell.model_selection.Resolver, optional (default=None)) – If an hp value is ‘auto’, the hp is resolved via resolver.resolve(). Auto-initialized if necessary. If None, mlshell.model_selection.Resolver is used.

  • resolve_params (dict, optional (default=None)) – Additional kwargs to pass in: resolver.resolve(*args, **resolve_params[hp_name]). If None, {}.

  • fit_params (dict, optional (default=None)) – Additional kwargs to pass in pipeline.fit(*args, **fit_params). If None, {}.

Returns

res – Unchanged input, for compliance with producer logic.

Return type

dict

Notes

Pipeline updated in objects attribute.

See also

mlshell.model_selection.Resolver

Hp resolver.
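The documented hp handling (zero position for ranges, resolver for ‘auto’ values) can be sketched in plain Python; prepare_hp and the resolve callback are hypothetical stand-ins for the internal logic and mlshell.model_selection.Resolver.resolve():

```python
def prepare_hp(hp, resolve):
    """Sketch of the documented hp handling in Workflow.fit:
    - if a value is a range/list, the zero position is used;
    - if a value is 'auto', it is resolved via the resolver."""
    prepared = {}
    for name, val in (hp or {}).items():
        if isinstance(val, (list, tuple)):
            val = val[0]            # zero position of a provided range
        if val == 'auto':
            val = resolve(name)     # stand-in for resolver.resolve()
        prepared[name] = val
    return prepared

hp = {'clf__alpha': [0.1, 1.0], 'clf__class_weight': 'auto'}
print(prepare_hp(hp, lambda name: 'balanced'))
# {'clf__alpha': 0.1, 'clf__class_weight': 'balanced'}
```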

optimize(res, pipeline_id, dataset_id, subset_id='train', metric_id=None, hp_grid=None, resolver=None, optimizer=None, dirpath=None, resolve_params=None, fit_params=None, gs_params=None, dump_params=None)[source]

Optimize pipeline.

Parameters
  • res (dict) – For compliance with producer logic.

  • pipeline_id (str) – Pipeline identifier in objects. Will be cross-validated on dataset_id__subset_id: optimizer.fit(subset.x, subset.y, **fit_params).

  • dataset_id (str) – Dataset identifier in objects.

  • subset_id (str, optional (default='train')) – Data subset identifier to CV on. If ‘’, use full dataset.

  • metric_id (str, list/tuple of str, optional (default=None)) – ‘metric_id’(s) to use in optimizer scoring. A known ‘metric_id’ will be resolved via objects or a sklearn built-in; otherwise KeyError is raised. If None, ‘accuracy’ or ‘r2’ is used, depending on the pipeline estimator type.

  • hp_grid (dict, optional (default=None)) – Hyper-parameters to grid search: {hp_name: optimizer format}. If None, {}.

  • resolver (mlshell.model_selection.Resolver, optional (default=None)) – If an hp value is [‘auto’] in hp_grid, the hp is resolved via resolver.resolve(). Auto-initialized if a class is provided. If None, mlshell.model_selection.Resolver is used.

  • optimizer (mlshell.model_selection.Optimizer, optional (default=None)) – Class to optimize hp_grid. Will be called as optimizer(pipeline, hp_grid, scoring, **gs_params).fit(x, y, **fit_params). If None, mlshell.model_selection.RandomizedSearchOptimizer is used.

  • dirpath (str, optional (default=None)) – Absolute path to the dir to dump ‘runs’ results, or a path relative to ‘project__path’ started with ‘./’. If None, “project__path/results/runs” is used. See Notes for a description of runs.

  • resolve_params (dict, optional (default=None)) – Additional kwargs to pass in resolver.resolve(*args, **resolve_params[hp_name]) . If None, {}.

  • fit_params (dict, optional (default=None)) – Additional kwargs to pass in optimizer.fit(*args, **fit_params). If None, {}.

  • gs_params (dict, optional (default=None)) – Additional kwargs to optimizer(pipeline, hp_grid, scoring, **gs_params) initialization. If None, {}.

  • dump_params (dict, optional (default=None)) – Additional kwargs to pass in optimizer.dump_runs(**dump_params). If None, {}.

Returns

res – Input with key added/updated: {‘runs’: dict}. ‘runs’ stores optimization results for each pipeline-data pair: {‘pipeline_id|dataset_id__subset_id’: optimizer.update_best output}.

Return type

dict

Notes

Optimization flow:

  • Call grid search: optimizer(pipeline.pipeline, hp_grid, scoring, **gs_params).fit(x, y, **fit_params).

  • Dump runs: optimizer.dump_runs(logger, dirpath, **dump_params), where each run = probing one hp combination.

  • Combine optimization results with previous results for the pipeline-data pair: optimizer.update_best(prev_runs).

  • Update the pipeline object in objects, only if ‘best_estimator_’ is present in ‘runs’.
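The ‘runs’ bookkeeping described above can be sketched as follows; update_runs and the plain dict merge are hypothetical stand-ins for the documented storage key format and optimizer.update_best():

```python
def update_runs(runs, pipeline_id, dataset_id, subset_id, new_result):
    """Sketch: store optimization results per pipeline-data pair under
    the documented 'pipeline_id|dataset_id__subset_id' key, merging with
    previous results (stand-in for optimizer.update_best)."""
    key = f'{pipeline_id}|{dataset_id}__{subset_id}'
    merged = {**runs.get(key, {}), **new_result}
    runs[key] = merged
    return runs

runs = {}
update_runs(runs, 'pipeline__default', 'dataset__default', 'train',
            {'best_score_': 0.9})
print(list(runs))  # ['pipeline__default|dataset__default__train']
```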

validate(res, pipeline_id, dataset_id, metric_id, subset_id=('train', 'test'), validator=None)[source]

Make and score prediction.

Parameters
  • res (dict) – For compliance with producer logic.

  • pipeline_id (str) – Pipeline identifier in objects. Will be validated on dataset_id__subset_id.

  • dataset_id (str) – Dataset identifier in objects.

  • subset_id (str, list/tuple of str, optional (default=('train', 'test'))) – Data subset(s) identifier(s) to validate on. ‘’ for the full dataset.

  • metric_id (str, list/tuple of str) – Metric(s) identifier in objects.

  • validator (mlshell.model_selection.Validator, optional (default=None)) – Auto-initialized if a class is provided. If None, mlshell.model_selection.Validator is used.

Returns

res – Unchanged input, for compliance with producer logic.

Return type

dict
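The validate flow (predict on each subset, score with each metric) can be sketched with toy stand-ins; the pipeline, dataset, and metric below are illustrative assumptions, not mlshell objects:

```python
def validate(pipeline, dataset, subset_ids, metrics):
    """Sketch of Workflow.validate: predict on each subset and score it
    with each metric (identifiers stand in for objects lookups)."""
    scores = {}
    for sid in subset_ids:
        subset = dataset[sid]
        pred = [pipeline(x) for x in subset['x']]
        for mid, metric in metrics.items():
            scores[sid, mid] = metric(subset['y'], pred)
    return scores

# Toy stand-ins: a 'fitted' pipeline and a labeled dataset.
dataset = {'train': {'x': [1, 2], 'y': [1, 1]},
           'test': {'x': [3], 'y': [1]}}
accuracy = lambda y, pred: sum(a == b for a, b in zip(y, pred)) / len(y)
pipeline = lambda x: x % 2

print(validate(pipeline, dataset, ('train', 'test'), {'acc': accuracy}))
# {('train', 'acc'): 0.5, ('test', 'acc'): 1.0}
```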

dump(res, pipeline_id, dirpath=None, **kwargs)[source]

Dump pipeline.

Parameters
  • res (dict) – For compliance with producer logic.

  • pipeline_id (str) – Pipeline identifier in objects. Will be dumped via pipeline.dump(**kwargs) .

  • dirpath (str, optional (default=None)) – Absolute path to the dump dir, or a path relative to ‘project__path’ started with ‘./’. If None, “project__path/results/models” is used.

  • **kwargs (dict) – Additional kwargs to pass in pipeline.dump(**kwargs) .

Returns

res – Unchanged input, for compliance with producer logic.

Return type

dict

Notes

The resulting filename includes the prefix: workflow_id|pipeline_id|fit_dataset_id|best_score|pipeline_hash|fit_dataset_hash|os_type|timestamp.

fit_dataset_id = None if the pipeline is not fitted or has no such attribute. The ‘best_score’ is available only after optimize step(s) and only if the optimizer supports it.
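The prefix layout above can be sketched as a plain string join; dump_filename is a hypothetical helper (not mlshell code), and fields with no information (unfitted pipeline, no optimize step) come in as None:

```python
def dump_filename(workflow_id, pipeline_id, fit_dataset_id, best_score,
                  pipeline_hash, fit_dataset_hash, os_type, timestamp):
    """Sketch of the documented dump filename prefix; missing fields
    (unfitted pipeline, no optimize step) are rendered as 'None'."""
    parts = [workflow_id, pipeline_id, fit_dataset_id, best_score,
             pipeline_hash, fit_dataset_hash, os_type, timestamp]
    return '|'.join(str(p) for p in parts)

print(dump_filename('wf', 'pipeline__default', None, None,
                    'a1b2', 'c3d4', 'posix', '20240101'))
# wf|pipeline__default|None|None|a1b2|c3d4|posix|20240101
```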

predict(res, pipeline_id, dataset_id, subset_id='', dirpath=None, **kwargs)[source]

Make and dump prediction.

Parameters
  • res (dict) – For compliance with producer logic.

  • pipeline_id (str) – Pipeline identifier in objects to make prediction on dataset_id__subset_id: pipeline.predict(subset.x) .

  • dataset_id (str) – Dataset identifier in objects.

  • subset_id (str, optional (default='')) – Data subset identifier to predict on. If ‘’, use full dataset.

  • dirpath (str, optional (default=None)) – Absolute path to dump dir or relative to ‘project__path’ started with ‘./’. If None, “project__path/results/models” is used.

  • **kwargs (dict) – Additional kwargs to pass in dataset.dump_pred(**kwargs) .

Returns

res – Unchanged input, for compliance with producer logic.

Return type

dict

Notes

The resulting filename includes the prefix: workflow_id|pipeline_id|fit_dataset_id|best_score|pipeline_hash|fit_dataset_hash|predict_dataset_id|predict_dataset_hash|os_type|timestamp.

fit_dataset_id = None if the pipeline is not fitted or has no such attribute. The ‘best_score’ is available only after optimize step(s) and only if the optimizer supports it.

plot(res, pipeline_id, dataset_id, metric_id, validator=None, subset_id=('train', 'test'), plotter=None, **kwargs)[source]

Plot metrics.

Parameters
  • res (dict) – For compliance with producer logic.

  • pipeline_id (str) – Pipeline identifier in objects.

  • dataset_id (str) – Dataset identifier in objects.

  • subset_id (str, list/tuple of str, optional (default=('train', 'test'))) – Data subset(s) identifier(s) to plot on. Set ‘’ for the full dataset.

  • metric_id (str, list/tuple of str) – Metric(s) identifier in objects.

  • validator (mlshell.model_selection.Validator, optional (default=None)) – Auto-initialized if a class is provided. If None, mlshell.model_selection.Validator is used.

  • plotter (mlshell.plot.Plotter, optional (default=None)) – Auto initialized if class provided. If None, mlshell.plot.Plotter .

  • **kwargs (dict) – Additional kwargs to pass in plotter.plot(**kwargs) .

Returns

res – Unchanged input, for compliance with producer logic.

Return type

dict

See also

mlshell.plot.Plotter

Metric plotter.

dict_api(obj, method='update', **kwargs)

Forwarding api for dictionary object.

Could be useful to add/pop keys via configuration steps. For example, to apply an update: (‘dict_api’, {‘b’: 7}).
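A sketch of the forwarding idea, assuming the named dict method takes the keyword payload as a single positional argument (as dict.update does); dict_api here is a simplified stand-in, not the pycnfg implementation:

```python
def dict_api(obj, method='update', **kwargs):
    """Sketch of the forwarding api: call the named dict method on obj
    with the keyword payload, then return obj for the next step."""
    getattr(obj, method)(kwargs)
    return obj

d = {'a': 1}
print(dict_api(d, b=7))  # {'a': 1, 'b': 7}
```

Methods with a different calling convention (e.g. pop, which takes a key) would need a different payload shape; the sketch covers only update-style methods.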

dump_cache(obj, prefix=None, cachedir=None, pkg='pickle', **kwargs)

Dump intermediate object state to IO.

Parameters
  • obj (picklable) – Object to dump.

  • prefix (str, optional (default=None)) – File identifier, added to filename. If None, ‘self.oid’ is used.

  • cachedir (str, optional (default=None)) – Absolute path to the dump dir, or a path relative to ‘project_path’ started with ‘./’. Created if it does not exist. If None, “project_path/.temp/objects” is used.

  • pkg (str, optional (default='pickle')) – Import package and try pkg.dump(obj, file, **kwargs).

  • **kwargs (kwargs) – Additional parameters to pass in .dump().

Returns

obj – Unchanged input for compliance with producer logic.

Return type

picklable

load_cache(obj, prefix=None, cachedir=None, pkg='pickle', **kwargs)

Load intermediate object state from IO.

Parameters
  • obj (picklable) – Object template, for producer logic only (ignored).

  • prefix (str, optional (default=None)) – File identifier. If None, ‘self.oid’ is used.

  • pkg (str, optional (default='pickle')) – Import package and try obj = pkg.load(file, **kwargs).

  • cachedir (str, optional(default=None)) – Absolute path to load dir or relative to ‘project_path’ started with ‘./’. If None, ‘project_path/.temp/objects’ is used.

  • **kwargs (kwargs) – Additional parameters to pass in .load().

Returns

obj – Loaded cache.

Return type

picklable object
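A self-contained sketch of the documented dump/load round trip using the standard pickle module; the helpers below are simplified stand-ins for Producer.dump_cache/load_cache (pkg is passed as a module rather than an import-by-name string, no prefix default from self.oid, and a ‘.pkl’ suffix is assumed):

```python
import os
import pickle
import tempfile

def dump_cache(obj, prefix, cachedir, pkg=pickle):
    """Sketch of the documented cache dump: pkg.dump(obj, file)."""
    os.makedirs(cachedir, exist_ok=True)   # created if it does not exist
    path = os.path.join(cachedir, f'{prefix}.pkl')
    with open(path, 'wb') as f:
        pkg.dump(obj, f)
    return path

def load_cache(prefix, cachedir, pkg=pickle):
    """Sketch of the documented cache load: obj = pkg.load(file)."""
    with open(os.path.join(cachedir, f'{prefix}.pkl'), 'rb') as f:
        return pkg.load(f)

with tempfile.TemporaryDirectory() as tmp:
    cachedir = os.path.join(tmp, '.temp', 'objects')
    dump_cache({'state': 1}, 'workflow__default', cachedir)
    print(load_cache('workflow__default', cachedir))  # {'state': 1}
```

Any package exposing dump/load with this file-object signature (e.g. joblib via its own API, or dill) could be swapped in the same way.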

run(init, steps)

Execute configuration steps.

Consecutive call (with decorators):

init = getattr(self, 'method_id')(init, objects=objects, **kwargs)

Parameters
  • init (object) – Passed as the argument to each step and returned as the result.

  • steps (list of tuples) – List of self methods to run consecutively with kwargs: (‘method_id’, kwargs, decorators).

Returns

init – The result object after all configuration steps are executed.

Return type

object

Notes

The object identifier oid is added automatically if the produced object has an oid attribute.
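The consecutive-call contract can be sketched with a toy producer; decorators are omitted and the add method is a hypothetical step, not part of the real API:

```python
class ToyProducer:
    """Sketch of the documented run() flow: each step is
    ('method_id', kwargs) and init is threaded through."""

    def add(self, init, amount=0):
        # Hypothetical step: each step receives init and returns it.
        return init + amount

    def run(self, init, steps):
        for method_id, kwargs in steps:
            init = getattr(self, method_id)(init, **kwargs)
        return init

p = ToyProducer()
print(p.run(0, [('add', {'amount': 2}), ('add', {'amount': 3})]))  # 5
```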

update(obj, items)

Update key(s) for dictionary object.

Parameters
  • obj (dict) – Object to update.

  • items (dict or list) – Either a dictionary or a list of items [(key, val), …] to update obj with.

Returns

obj – Updated input.

Return type

dict
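A sketch of the documented behavior, accepting either a dict or a [(key, val), …] list; this is a stand-in, not the pycnfg implementation:

```python
def update(obj, items):
    """Sketch: update obj from either a dict or a list of (key, val)
    pairs, returning the updated object."""
    obj.update(dict(items) if isinstance(items, list) else items)
    return obj

print(update({'a': 1}, [('b', 2)]))  # {'a': 1, 'b': 2}
print(update({'a': 1}, {'a': 3}))    # {'a': 3}
```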