mlshell.DatasetProducer

class mlshell.DatasetProducer(objects, oid, path_id='path__default', logger_id='logger__default')[source]

Bases: pycnfg.producer.Producer, mlshell.producers.dataset.DataIO, mlshell.producers.dataset.DataPreprocessor

Factory to produce dataset.

Parameters
  • objects (dict) – Dictionary with objects from previously executed producers: {'section_id__config__id': object, ...}.

  • oid (str) – Unique identifier of produced object.

  • path_id (str, optional (default='path__default')) – Project path identifier in objects.

  • logger_id (str, optional (default='logger__default')) – Logger identifier in objects.

objects

Dictionary with objects from previously executed producers: {'section_id__config__id': object, ...}.

Type

dict

oid

Unique identifier of produced object.

Type

str

logger

Logger.

Type

logging.Logger

project_path

Absolute path to project dir.

Type

str
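
A minimal construction sketch; the oid and the objects content below are illustrative, and in practice pycnfg fills objects with the results of previously executed producers:

import logging

import mlshell

# 'objects' would normally be filled by previously executed producers.
objects = {
    'path__default': '/abs/path/to/project',
    'logger__default': logging.getLogger('default'),
}
producer = mlshell.DatasetProducer(objects, oid='dataset__train')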

Methods

dict_api(obj[, method])

Forwarding API for a dictionary object.

dump_cache(obj[, prefix, cachedir, pkg])

Dump intermediate object state to IO.

info(dataset, **kwargs)

Log dataset info.

load(dataset, filepath[, key, random_skip, …])

Load data from a csv file.

load_cache(obj[, prefix, cachedir, pkg])

Load intermediate object state from IO.

preprocess(dataset, targets_names[, …])

Preprocess raw data.

run(init, steps)

Execute configuration steps.

split(dataset, **kwargs)

Split dataset into train and test subsets.

update(obj, items)

Update key(s) for dictionary object.

__init__(objects, oid, path_id='path__default', logger_id='logger__default')[source]

Initialize self. See help(type(self)) for accurate signature.


dict_api(obj, method='update', **kwargs)

Forwarding API for a dictionary object.

Can be useful to add/pop keys via configuration steps. For example, to perform an update: ('dict_api', {'b': 7}).
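
A sketch of both call styles, reusing the producer from the construction example above (the dictionary content is illustrative):

# As a configuration step (default method='update'):
#     ('dict_api', {'b': 7})
# Equivalent direct call:
obj = producer.dict_api({'a': 1}, b=7)
# obj => {'a': 1, 'b': 7} (input returned for producer compliance)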

dump_cache(obj, prefix=None, cachedir=None, pkg='pickle', **kwargs)

Dump intermediate object state to IO.

Parameters
  • obj (picklable) – Object to dump.

  • prefix (str, optional (default=None)) – File identifier, added to filename. If None, ‘self.oid’ is used.

  • cachedir (str, optional (default=None)) – Absolute path to dump dir or relative to 'project_path' starting with './'. Created if it does not exist. If None, 'project_path/.temp/objects' is used.

  • pkg (str, optional (default='pickle')) – Import package and try pkg.dump(obj, file, **kwargs).

  • **kwargs (dict) – Additional parameters to pass in .dump().

Returns

obj – Unchanged input for compliance with producer logic.

Return type

picklable
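
A dump sketch (the prefix is illustrative; with the default cachedir=None the state goes to 'project_path/.temp/objects'):

# Pickle the current dataset state; the input is returned unchanged.
dataset = producer.dump_cache(dataset, prefix='raw', pkg='pickle')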

info(dataset, **kwargs)

Log dataset info.

Check:

  • duplicates.

  • gaps.

Parameters
  • dataset (mlshell.Dataset) – Dataset to explore.

  • **kwargs (dict) – Additional parameters to pass in low-level functions.

Returns

dataset – For compliance with producer logic.

Return type

mlshell.Dataset
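
For example, as a direct call (a sketch):

# Logs a duplicates/gaps report and returns the dataset for the next step.
dataset = producer.info(dataset)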

load(dataset, filepath, key='data', random_skip=False, random_state=None, **kwargs)

Load data from a csv file.

Parameters
  • dataset (mlshell.Dataset) – Template for dataset.

  • filepath (str) – Absolute path to csv file or relative to 'project_path' starting with './'.

  • key (str, optional (default='data')) – Identifier under which loaded data is added to the dataset dictionary. Useful when loading multiple files and combining them under 'data' in a separate step.

  • random_skip (bool, optional (default=False)) – If True, randomly skip rows while reading the file, retaining 'nrows' lines. Overwrites the skiprows kwarg.

  • random_state (int, optional (default=None)) – Fix random state for random_skip.

  • **kwargs (dict) – Additional parameters passed to pandas.read_csv().

Returns

dataset – Dataset with loaded data added under 'key'.

Return type

mlshell.Dataset

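A load sketch (the file path is illustrative; extra kwargs such as sep or nrows are forwarded to pandas.read_csv()):

dataset = producer.load(dataset, './data/train.csv', key='data',
                        random_skip=True, random_state=42, nrows=1000)
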
load_cache(obj, prefix=None, cachedir=None, pkg='pickle', **kwargs)

Load intermediate object state from IO.

Parameters
  • obj (picklable) – Object template, for producer logic only (ignored).

  • prefix (str, optional (default=None)) – File identifier. If None, ‘self.oid’ is used.

  • pkg (str, optional (default='pickle')) – Import package and try obj = pkg.load(file, **kwargs).

  • cachedir (str, optional (default=None)) – Absolute path to load dir or relative to 'project_path' starting with './'. If None, 'project_path/.temp/objects' is used.

  • **kwargs (dict) – Additional parameters to pass in .load().

Returns

obj – Loaded cache.

Return type

picklable object
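
A restore sketch mirroring the dump_cache example above (the prefix is illustrative):

# Load the state previously dumped under prefix 'raw'.
dataset = producer.load_cache(dataset, prefix='raw', pkg='pickle')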

preprocess(dataset, targets_names, features_names=None, categor_names=None, pos_labels=None, **kwargs)

Preprocess raw data.

Parameters
  • dataset (mlshell.Dataset) – Raw dataset: {‘data’: pandas.DataFrame }.

  • targets_names (list) – List of target column names in the raw dataset. Even if they do not exist, they will be used to name predictions in dataset.dump_pred.

  • features_names (list, optional (default=None)) – List of feature column names in the raw dataset. If None, all columns except targets are used.

  • categor_names (list, optional (default=None)) – List of categorical feature (including binary) identifiers in the raw dataset. If None, an empty list is used.

  • pos_labels (list, optional (default=None)) – Classification only: list of "positive" label(s) in target(s). Could be used in sklearn.metrics.roc_curve() for threshold analysis and metrics evaluation if the classifier supports predict_proba. If None, the last label in numpy.unique() is used for each target. For regression, set [] to prevent evaluation.

  • **kwargs (dict) – Additional parameters to add in dataset.

Returns

dataset – Resulted dataset. Key updated: 'data'. Keys added:

'subsets': dict

Storage for data subset(s) indices (filled in split method): {'subset_id': indices}.

'meta': dict

Extracted auxiliary information from data: {

'index': list

List of index column label(s).

'features': list

List of feature column label(s).

'categoric_features': list

List of categorical feature column label(s).

'targets': list

List of target column label(s).

'indices': list

List of row indices.

'classes': list of numpy.ndarray

List of sorted unique labels for each target (n_outputs, n_classes).

'pos_labels': list

List of "positive" label(s) for target(s) (n_outputs,).

'pos_labels_ind': list

List of "positive" label index in numpy.unique() for each target (n_outputs,).

'categoric_ind_name': dict

Dictionary with categorical feature indices as keys and ('feature_name', categories) tuples as values: {'column_index': ('feature_name', ['cat1', 'cat2'])}.

'numeric_ind_name': dict

Dictionary with numeric feature indices as keys and ('feature_name',) tuples as values: {'column_index': ('feature_name',)}.

}

Return type

mlshell.Dataset

Notes

Don't change the dataframe shape or index/column names after meta is generated.

Feature columns are unified:

  • Fill gaps.

    • If a gap is in a categorical feature => set 'unknown'.

    • If a gap is in a non-categorical feature => set np.nan.

  • Cast categorical features to str dtype, and apply an ordinal encoder.

  • Cast values to np.float64.
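
A typical call sketch (column names and the positive label are illustrative):

dataset = producer.preprocess(
    dataset,
    targets_names=['target'],
    categor_names=['gender', 'city'],
    pos_labels=[1],
)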

run(init, steps)

Execute configuration steps.

Consecutive call (with decorators):

init = getattr(self, 'method_id')(init, objects=objects, **kwargs)

Parameters
  • init (object) – Passed as an argument to each step and returned back as the result.

  • steps (list of tuples) – List of self methods to run consecutively with kwargs: ('method_id', kwargs, decorators).

Returns

obj – Result of executing the configuration steps consecutively on init.

Return type

object

Notes

The object identifier oid is auto-added if the produced object has an oid attribute.
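
A sketch of producing a dataset through configuration steps (the file path and kwargs are illustrative; an empty mlshell.Dataset is assumed to be a valid init template):

steps = [
    ('load', {'filepath': './data/train.csv'}),
    ('info', {}),
    ('preprocess', {'targets_names': ['target']}),
    ('split', {'test_size': 0.25, 'random_state': 42}),
]
dataset = producer.run(mlshell.Dataset(), steps)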

split(dataset, **kwargs)

Split dataset into train and test subsets.

Parameters
  • dataset (mlshell.Dataset) – Dataset to split.

  • **kwargs (dict) – Additional split parameters: train_size, test_size, etc.

Returns

dataset – Resulted dataset. 'subsets' value updated: {'train': array-like of train row indices, 'test': array-like of test row indices}.

Return type

mlshell.Dataset

Notes

If train_size == 1.0 or test_size == 0, test is set equal to train and other kwargs are ignored.

No copy takes place.
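
A split sketch (kwargs follow the train_size/test_size convention from the Notes above):

dataset = producer.split(dataset, test_size=0.25, random_state=42)
# dataset 'subsets' now maps 'train' and 'test' to row indices; data is not copied.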

update(obj, items)

Update key(s) for dictionary object.

Parameters
  • obj (dict) – Object to update.

  • items (dict or list) – Either a dictionary or a list of (key, value) pairs, [(key, val), ...], to update obj with.

Returns

obj – Updated input.

Return type

dict
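
Both input forms (a sketch):

obj = {'a': 1}
obj = producer.update(obj, {'b': 2})    # dict form
obj = producer.update(obj, [('c', 3)])  # items form
# obj => {'a': 1, 'b': 2, 'c': 3}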