mlshell.DatasetProducer

class mlshell.DatasetProducer(objects, oid, path_id='path__default', logger_id='logger__default')[source]

Bases: pycnfg.producer.Producer, mlshell.producers.dataset.DataIO, mlshell.producers.dataset.DataPreprocessor

Factory to produce dataset.

Parameters
  • objects (dict) – Dictionary with objects from previously executed producers: {'section_id__config__id': object, ...}.

  • oid (str) – Unique identifier of produced object.

  • path_id (str, optional (default='path__default')) – Project path identifier in objects.

  • logger_id (str, optional (default='logger__default')) – Logger identifier in objects.

objects

Dictionary with objects from previously executed producers: {'section_id__config__id': object, ...}.

Type

dict

oid

Unique identifier of produced object.

Type

str

logger

Logger.

Type

logging.Logger

project_path

Absolute path to project dir.

Type

str
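
A minimal construction sketch; the oid and the objects content below are illustrative, and in practice pycnfg fills objects with the results of previously executed producers:

import logging

import mlshell

# 'objects' would normally be filled by previously executed producers.
objects = {
    'path__default': '/abs/path/to/project',
    'logger__default': logging.getLogger('default'),
}
producer = mlshell.DatasetProducer(objects, oid='dataset__train')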

Methods

dict_api(obj[, method])

Forwarding API for a dictionary object.

dump_cache(obj[, prefix, cachedir, pkg])

Dump intermediate object state to IO.

info(dataset, **kwargs)

Log dataset info.

load(dataset, filepath[, key, random_skip, …])

Load data from a csv file.

load_cache(obj[, prefix, cachedir, pkg])

Load intermediate object state from IO.

preprocess(dataset, targets_names[, …])

Preprocess raw data.

run(init, steps)

Execute configuration steps.

split(dataset, **kwargs)

Split dataset into train and test subsets.

update(obj, items)

Update key(s) for dictionary object.

__init__(objects, oid, path_id='path__default', logger_id='logger__default')[source]

Initialize self. See help(type(self)) for accurate signature.


dict_api(obj, method='update', **kwargs)

Forwarding API for a dictionary object.

Can be useful to add/pop keys via configuration steps. For example, to perform an update: ('dict_api', {'b': 7}).
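
A sketch of both call styles, reusing the producer from the construction example above (the dictionary content is illustrative):

# As a configuration step (default method='update'):
#     ('dict_api', {'b': 7})
# Equivalent direct call:
obj = producer.dict_api({'a': 1}, b=7)
# obj => {'a': 1, 'b': 7} (input returned for producer compliance)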

dump_cache(obj, prefix=None, cachedir=None, pkg='pickle', **kwargs)

Dump intermediate object state to IO.

Parameters
  • obj (picklable) – Object to dump.

  • prefix (str, optional (default=None)) – File identifier, added to filename. If None, ‘self.oid’ is used.

  • cachedir (str, optional (default=None)) – Absolute path to dump dir or relative to 'project_path' starting with './'. Created if it does not exist. If None, 'project_path/.temp/objects' is used.

  • pkg (str, optional (default='pickle')) – Import package and try pkg.dump(obj, file, **kwargs).

  • **kwargs (dict) – Additional parameters to pass in .dump().

Returns

obj – Unchanged input for compliance with producer logic.

Return type

picklable
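
A dump sketch (the prefix is illustrative; with the default cachedir=None the state goes to 'project_path/.temp/objects'):

# Pickle the current dataset state; the input is returned unchanged.
dataset = producer.dump_cache(dataset, prefix='raw', pkg='pickle')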

info(dataset, **kwargs)

Log dataset info.

Check:

  • duplicates.

  • gaps.

Parameters
  • dataset (mlshell.Dataset) – Dataset to explore.

  • **kwargs (dict) – Additional parameters to pass in low-level functions.

Returns

dataset – For compliance with producer logic.

Return type

mlshell.Dataset
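
For example, as a direct call (a sketch):

# Logs a duplicates/gaps report and returns the dataset for the next step.
dataset = producer.info(dataset)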

load(dataset, filepath, key='data', random_skip=False, random_state=None, **kwargs)

Load data from a csv file.

Parameters
  • dataset (mlshell.Dataset) – Template for dataset.

  • filepath (str) – Absolute path to csv file or relative to 'project_path' starting with './'.

  • key (str, optional (default='data')) – Identifier under which loaded data is added to the dataset dictionary. Useful when loading multiple files and combining them under 'data' in a separate step.

  • random_skip (bool, optional (default=False)) – If True, randomly skip rows while reading the file, retaining 'nrows' lines. Overwrites the skiprows kwarg.

  • random_state (int, optional (default=None)) – Fix random state for random_skip.

  • **kwargs (dict) – Additional parameters passed to pandas.read_csv().

Returns

dataset – Dataset with loaded data added under 'key'.

Return type

mlshell.Dataset

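A load sketch (the file path is illustrative; extra kwargs such as sep or nrows are forwarded to pandas.read_csv()):

dataset = producer.load(dataset, './data/train.csv', key='data',
                        random_skip=True, random_state=42, nrows=1000)
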
load_cache(obj, prefix=None, cachedir=None, pkg='pickle', **kwargs)

Load intermediate object state from IO.

Parameters
  • obj (picklable) – Object template, for producer logic only (ignored).

  • prefix (str, optional (default=None)) – File identifier. If None, ‘self.oid’ is used.

  • pkg (str, optional (default='pickle')) – Import package and try obj = pkg.load(file, **kwargs).

  • cachedir (str, optional (default=None)) – Absolute path to load dir or relative to 'project_path' starting with './'. If None, 'project_path/.temp/objects' is used.

  • **kwargs (dict) – Additional parameters to pass in .load().

Returns

obj – Loaded cache.

Return type

picklable object
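
A restore sketch mirroring the dump_cache example above (the prefix is illustrative):

# Load the state previously dumped under prefix 'raw'.
dataset = producer.load_cache(dataset, prefix='raw', pkg='pickle')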

preprocess(dataset, targets_names, features_names=None, categor_names=None, pos_labels=None, **kwargs)

Preprocess raw data.

Parameters
  • dataset (mlshell.Dataset) – Raw dataset: {‘data’: pandas.DataFrame }.

  • targets_names (list) – List of target column names in the raw dataset. Even if they do not exist, they will be used to name predictions in dataset.dump_pred.

  • features_names (list, optional (default=None)) – List of feature column names in the raw dataset. If None, all columns except targets are used.

  • categor_names (list, optional (default=None)) – List of categorical feature (including binary) identifiers in the raw dataset. If None, an empty list is used.

  • pos_labels (list, optional (default=None)) – Classification only: list of "positive" label(s) in target(s). Could be used in sklearn.metrics.roc_curve() for threshold analysis and metrics evaluation if the classifier supports predict_proba. If None, the last label in numpy.unique() is used for each target. For regression, set [] to prevent evaluation.

  • **kwargs (dict) – Additional parameters to add in dataset.

Returns

dataset – Resulted dataset. Key updated: 'data'. Keys added:

'subsets': dict

Storage for data subset(s) indices (filled in split method): {'subset_id': indices}.

'meta': dict

Extracted auxiliary information from data: {

'index': list

List of index column label(s).

'features': list

List of feature column label(s).

'categoric_features': list

List of categorical feature column label(s).

'targets': list

List of target column label(s).

'indices': list

List of row indices.

'classes': list of numpy.ndarray

List of sorted unique labels for each target (n_outputs, n_classes).

'pos_labels': list

List of "positive" label(s) for target(s) (n_outputs,).

'pos_labels_ind': list

List of "positive" label index in numpy.unique() for each target (n_outputs,).

'categoric_ind_name': dict

Dictionary with categorical feature indices as keys and ('feature_name', categories) tuples as values: {'column_index': ('feature_name', ['cat1', 'cat2'])}.

'numeric_ind_name': dict

Dictionary with numeric feature indices as keys and ('feature_name',) tuples as values: {'column_index': ('feature_name',)}.

}

Return type

mlshell.Dataset

Notes

Don't change the dataframe shape or index/column names after meta is generated.

Feature columns are unified:

  • Fill gaps.

    • If a gap is in a categorical feature => set 'unknown'.

    • If a gap is in a non-categorical feature => set np.nan.

  • Cast categorical features to str dtype, and apply an ordinal encoder.

  • Cast values to np.float64.
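
A typical call sketch (column names and the positive label are illustrative):

dataset = producer.preprocess(
    dataset,
    targets_names=['target'],
    categor_names=['gender', 'city'],
    pos_labels=[1],
)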

run(init, steps)

Execute configuration steps.

Consecutive call (with decorators):

init = getattr(self, 'method_id')(init, objects=objects, **kwargs)

Parameters
  • init (object) – Passed as an argument to each step and returned back as the result.

  • steps (list of tuples) – List of self methods to run consecutively with kwargs: ('method_id', kwargs, decorators).

Returns

obj – Result of executing the configuration steps consecutively on init.

Return type

object

Notes

The object identifier oid is auto-added if the produced object has an oid attribute.
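
A sketch of producing a dataset through configuration steps (the file path and kwargs are illustrative; an empty mlshell.Dataset is assumed to be a valid init template):

steps = [
    ('load', {'filepath': './data/train.csv'}),
    ('info', {}),
    ('preprocess', {'targets_names': ['target']}),
    ('split', {'test_size': 0.25, 'random_state': 42}),
]
dataset = producer.run(mlshell.Dataset(), steps)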

split(dataset, **kwargs)

Split dataset into train and test subsets.

Parameters
  • dataset (mlshell.Dataset) – Dataset to split.

  • **kwargs (dict) – Additional split parameters: train_size, test_size, etc.

Returns

dataset – Resulted dataset. 'subsets' value updated: {'train': array-like of train row indices, 'test': array-like of test row indices}.

Return type

mlshell.Dataset

Notes

If train_size == 1.0 or test_size == 0, test is set equal to train and other kwargs are ignored.

No copy takes place.
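
A split sketch (kwargs follow the train_size/test_size convention from the Notes above):

dataset = producer.split(dataset, test_size=0.25, random_state=42)
# dataset 'subsets' now maps 'train' and 'test' to row indices; data is not copied.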

update(obj, items)

Update key(s) for dictionary object.

Parameters
  • obj (dict) – Object to update.

  • items (dict or list) – Either a dictionary or a list of (key, value) pairs, [(key, val), ...], to update obj with.

Returns

obj – Updated input.

Return type

dict
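
Both input forms (a sketch):

obj = {'a': 1}
obj = producer.update(obj, {'b': 2})    # dict form
obj = producer.update(obj, [('c', 3)])  # items form
# obj => {'a': 1, 'b': 2, 'c': 3}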