mlshell.DatasetProducer

class mlshell.DatasetProducer(objects, oid, path_id='path__default', logger_id='logger__default')[source]

Bases: pycnfg.producer.Producer, mlshell.producers.dataset.DataIO, mlshell.producers.dataset.DataPreprocessor

Factory to produce dataset.
Parameters
    objects (dict) – Dictionary with objects from previously executed producers: {'section_id__config__id': object, ...}.
    oid (str) – Unique identifier of the produced object.
    path_id (str, optional (default='path__default')) – Project path identifier in objects.
    logger_id (str, optional (default='logger__default')) – Logger identifier in objects.

objects
    Dictionary with objects from previously executed producers: {'section_id__config__id': object, ...}.

    Type
        dict

logger
    Logger.

    Type
        logging.Logger
Methods

    dict_api(obj[, method])                     Forwarding API for a dictionary object.
    dump_cache(obj[, prefix, cachedir, pkg])    Dump intermediate object state to IO.
    info(dataset, **kwargs)                     Log dataset info.
    load(dataset, filepath[, key, random_skip, …])   Load data from a csv file.
    load_cache(obj[, prefix, cachedir, pkg])    Load intermediate object state from IO.
    preprocess(dataset, targets_names[, …])     Preprocess raw data.
    run(init, steps)                            Execute configuration steps.
    split(dataset, **kwargs)                    Split dataset into train and test.
    update(obj, items)                          Update key(s) of a dictionary object.
__init__(objects, oid, path_id='path__default', logger_id='logger__default')[source]

Initialize self. See help(type(self)) for accurate signature.
dict_api(obj, method='update', **kwargs)

Forwarding API for a dictionary object.

Could be useful to add/pop keys via configuration steps. For example, to perform an update: ('dict_api', {'b': 7}).
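As an illustration, the forwarding pattern can be sketched in plain Python (this dict_api stand-in is a hypothetical reimplementation, not the mlshell source):

```python
def dict_api(obj, method='update', **kwargs):
    # Forward the named dict method to obj; kwargs become the call
    # arguments, so method='update' with b=7 performs obj.update(b=7).
    getattr(obj, method)(**kwargs)
    return obj  # return the object for producer chaining


d = {'a': 1}
dict_api(d, 'update', b=7)   # same effect as the ('dict_api', {'b': 7}) step
```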
dump_cache(obj, prefix=None, cachedir=None, pkg='pickle', **kwargs)

Dump intermediate object state to IO.

Parameters
    obj (picklable) – Object to dump.
    prefix (str, optional (default=None)) – File identifier, added to the filename. If None, self.oid is used.
    cachedir (str, optional (default=None)) – Absolute path to the dump dir, or a path relative to 'project_path' starting with './'. Created if it does not exist. If None, 'project_path/.temp/objects' is used.
    pkg (str, optional (default='pickle')) – Import the package and try pkg.dump(obj, file, **kwargs).
    **kwargs (dict) – Additional parameters to pass to .dump().

Returns
    obj – Unchanged input, for compliance with producer logic.

Return type
    picklable
info(dataset, **kwargs)

Log dataset info.

Checks:
    duplicates.
    gaps.

Parameters
    dataset (mlshell.Dataset) – Dataset to explore.
    **kwargs (dict) – Additional parameters to pass to low-level functions.

Returns
    dataset – For compliance with producer logic.

Return type
    mlshell.Dataset
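A minimal sketch of the two checks mentioned above (duplicates and gaps), using a plain list of rows instead of an mlshell.Dataset; dataset_info is a hypothetical helper, not the library's implementation:

```python
def dataset_info(rows):
    # Count duplicate rows and empty cells ("gaps"), the two checks
    # that `info` logs for a dataset.
    seen = set()
    duplicates = 0
    gaps = 0
    for row in rows:
        key = tuple(row)
        if key in seen:
            duplicates += 1
        seen.add(key)
        gaps += sum(1 for value in row if value is None or value == '')
    return {'duplicates': duplicates, 'gaps': gaps}


rows = [['a', 1], ['b', None], ['a', 1]]
stats = dataset_info(rows)   # {'duplicates': 1, 'gaps': 1}
```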
load(dataset, filepath, key='data', random_skip=False, random_state=None, **kwargs)

Load data from a csv file.

Parameters
    dataset (mlshell.Dataset) – Template for dataset.
    filepath (str) – Absolute path to the csv file, or a path relative to 'project_path' starting with './'.
    key (str, optional (default='data')) – Identifier under which the loaded data is added to the dataset dictionary. Useful when loading multiple files and combining them under 'data' in a separate step.
    random_skip (bool, optional (default=False)) – If True, randomly skip rows while reading the file, keeping 'nrows' lines. Rewrites the skiprows kwarg.
    random_state (int, optional (default=None)) – Fix random state for random_skip.
    **kwargs (dict) – Additional parameters passed to pandas.read_csv().

Returns
    dataset – Key added: {'data': pandas.DataFrame}.

Return type
    mlshell.Dataset

Notes

If nrows > lines in file, it is auto-set to None.
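The random_skip/nrows interaction can be sketched without pandas; read_rows below is a hypothetical stand-in for the row-sampling that load performs via the rewritten skiprows kwarg:

```python
import csv
import io
import random

def read_rows(filehandle, nrows=None, random_skip=False, random_state=None):
    # Read the csv body; optionally keep a reproducible random sample
    # of `nrows` lines, mirroring the rewritten skiprows behaviour.
    reader = csv.reader(filehandle)
    header = next(reader)
    rows = list(reader)
    if nrows is not None and nrows > len(rows):
        nrows = None                      # auto-set to None, as in the Notes
    if nrows is not None:
        if random_skip:
            rows = random.Random(random_state).sample(rows, nrows)
        else:
            rows = rows[:nrows]
    return header, rows


text = io.StringIO('x,y\n1,2\n3,4\n5,6\n')
header, rows = read_rows(text, nrows=2, random_skip=True, random_state=42)
```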
load_cache(obj, prefix=None, cachedir=None, pkg='pickle', **kwargs)

Load intermediate object state from IO.

Parameters
    obj (picklable) – Object template, for producer logic only (ignored).
    prefix (str, optional (default=None)) – File identifier. If None, self.oid is used.
    cachedir (str, optional (default=None)) – Absolute path to the load dir, or a path relative to 'project_path' starting with './'. If None, 'project_path/.temp/objects' is used.
    pkg (str, optional (default='pickle')) – Import the package and try obj = pkg.load(file, **kwargs).
    **kwargs (dict) – Additional parameters to pass to .load().

Returns
    obj – Loaded cache.

Return type
    picklable object
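dump_cache and load_cache are symmetric; a round trip through the pickle-based path can be sketched like this (the directory and filename scheme here are assumptions, not mlshell's exact layout):

```python
import importlib
import os
import tempfile

def dump_cache(obj, prefix, cachedir, pkg='pickle', **kwargs):
    # Create the cache dir if needed, then pkg.dump(obj, file, **kwargs).
    os.makedirs(cachedir, exist_ok=True)
    filepath = os.path.join(cachedir, f'{prefix}.dump')
    with open(filepath, 'wb') as f:
        importlib.import_module(pkg).dump(obj, f, **kwargs)
    return obj                            # unchanged input, producer logic

def load_cache(prefix, cachedir, pkg='pickle', **kwargs):
    # obj = pkg.load(file, **kwargs)
    filepath = os.path.join(cachedir, f'{prefix}.dump')
    with open(filepath, 'rb') as f:
        return importlib.import_module(pkg).load(f, **kwargs)


cachedir = tempfile.mkdtemp()
state = {'data': [1, 2, 3]}
dump_cache(state, 'dataset__train', cachedir)
restored = load_cache('dataset__train', cachedir)   # == {'data': [1, 2, 3]}
```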
preprocess(dataset, targets_names, features_names=None, categor_names=None, pos_labels=None, **kwargs)

Preprocess raw data.

Parameters
    dataset (mlshell.Dataset) – Raw dataset: {'data': pandas.DataFrame}.
    targets_names (list) – List of target column names in the raw dataset. Even if they do not exist, they will be used to name predictions in dataset.dump_pred.
    features_names (list, optional (default=None)) – List of feature column names in the raw dataset. If None, all columns except targets.
    categor_names (list, optional (default=None)) – List of categorical feature (including binary) identifiers in the raw dataset. If None, an empty list.
    pos_labels (list, optional (default=None)) – Classification only: list of "positive" label(s) in target(s). Could be used in sklearn.metrics.roc_curve() for threshold analysis and for metrics evaluation if the classifier supports predict_proba. If None, for each target the last label in numpy.unique() is used. For regression, set [] to prevent evaluation.
    **kwargs (dict) – Additional parameters to add to the dataset.

Returns
    dataset – Resulting dataset. Key updated: 'data'. Keys added:

        'subsets': dict
            Storage for data subset(s) indices (filled in the split method): {'subset_id': indices}.
        'meta': dict
            Auxiliary information extracted from data: {
                'index': list
                    List of index column label(s).
                'features': list
                    List of feature column label(s).
                'categoric_features': list
                    List of categorical feature column label(s).
                'targets': list
                    List of target column label(s).
                'indices': list
                    List of row indices.
                'classes': list of numpy.ndarray
                    List of sorted unique labels for each target (n_outputs, n_classes).
                'pos_labels': list
                    List of "positive" label(s) for target(s) (n_outputs,).
                'pos_labels_ind': list
                    List of "positive" label index in numpy.unique() for each target (n_outputs,).
                'categoric_ind_name': dict
                    Dictionary with categorical feature indices as keys and a tuple ('feature_name', categories) as values: {'column_index': ('feature_name', ['cat1', 'cat2'])}.
                'numeric_ind_name': dict
                    Dictionary with numeric feature indices as keys and a tuple ('feature_name',) as values: {'column_index': ('feature_name',)}.
            }

Return type
    mlshell.Dataset

Notes

Don't change the dataframe shape or index/column names after meta generation.

Feature columns are unified:

    Fill gaps:
        If a gap is in a categorical feature, set 'unknown'.
        If a gap is in a non-categorical feature, set numpy.nan.
    Cast categorical features to str dtype and apply an ordinal encoder.
    Cast values to numpy.float64.
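The unification rules in the Notes (gap filling, str cast, ordinal encoding, float cast) can be sketched on a plain column dict; unify_columns and its data layout are illustrative assumptions, not the mlshell implementation:

```python
import math

def unify_columns(data, categor_names):
    # data: {'column_name': [values]} as a stand-in for a DataFrame.
    categories_map = {}
    for col, values in data.items():
        if col in categor_names:
            # Gap in categorical => 'unknown'; cast to str, ordinal-encode.
            values = ['unknown' if v is None else str(v) for v in values]
            categories = sorted(set(values))
            categories_map[col] = categories
            data[col] = [float(categories.index(v)) for v in values]
        else:
            # Gap in non-categorical => NaN; cast to float.
            data[col] = [math.nan if v is None else float(v) for v in values]
    return data, categories_map


data = {'color': ['red', None, 'blue'], 'size': [1, None, 3]}
data, cats = unify_columns(data, categor_names=['color'])
```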
run(init, steps)

Execute configuration steps.

Consecutive call (with decorators):

init = getattr(self, 'method_id')(init, objects=objects, **kwargs)

Parameters
    init (object) – Passed as an argument to each step and received back as the result.
    steps (list of tuples) – List of self methods to run consecutively, with kwargs: ('method_id', kwargs, decorators).

Returns
    configs – List of configurations prepared for execution: [('section_id__config__id', config), …].

Return type
    list of tuple

Notes

The object identifier oid is auto-added if the produced object has an oid attribute.
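The consecutive-call loop above (minus decorator and objects handling) amounts to something like this sketch; the Producer subclass and its steps are hypothetical:

```python
class Producer:
    # Minimal sketch of the consecutive-step loop; each step receives
    # the previous result as `init` and returns the updated object.
    def run(self, init, steps):
        for method_id, kwargs in steps:
            init = getattr(self, method_id)(init, **kwargs)
        return init

class DictProducer(Producer):
    def create(self, init, **kwargs):
        return dict(init)
    def update(self, init, items=None, **kwargs):
        init.update(items or {})
        return init


result = DictProducer().run({}, [('create', {}), ('update', {'items': {'b': 7}})])
# result == {'b': 7}
```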
split(dataset, **kwargs)

Split dataset into train and test.

Parameters
    dataset (mlshell.Dataset) – Dataset to split.
    **kwargs (dict) – Additional parameters to pass to sklearn.model_selection.train_test_split().

Returns
    dataset – Resulting dataset. 'subsets' value updated: {'train': array-like train row indices, 'test': array-like test row indices}.

Return type
    mlshell.Dataset

Notes

If train_size == 1.0 or test_size == 0, then test = train and other kwargs are ignored. No copy takes place.
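The train/test bookkeeping, including the test = train edge case from the Notes, can be sketched on row indices (split_indices is a hypothetical helper; the real method delegates to sklearn.model_selection.train_test_split):

```python
import random

def split_indices(indices, train_size=None, test_size=None, random_state=None):
    # Edge case from the Notes: full train => test aliases train, no copy.
    if train_size == 1.0 or test_size == 0:
        return {'train': indices, 'test': indices}
    rng = random.Random(random_state)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * (test_size if test_size is not None
                                    else 1.0 - train_size))
    return {'train': shuffled[n_test:], 'test': shuffled[:n_test]}


subsets = split_indices(list(range(8)), test_size=0.25, random_state=0)
full = split_indices(list(range(8)), train_size=1.0)
# full['test'] is the same object as full['train'] (no copy takes place)
```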