# Code Documentation¶

## Evolutionary Optimization Algorithm¶

class eoa.EOA(population, fitness, **kwargs)[source]

This is a base class acting as an umbrella for running an evolutionary optimization algorithm.

Parameters:

• population – the whole possible population, as a list
• fitness – the fitness evaluation; accepts an OrderedDict of individuals with their corresponding fitness and updates their fitness
• init_pop – default=UniformRand; the Python class that initiates the initial population
• recomb – default=UniformCrossover; the Python class that defines how to combine parents to produce children
• mutation – default=Mutation; the Python class that performs mutation on the offspring population
• termination – default=MaxGenTermination; the Python class that determines the termination criterion
• elitism – default=Elites; the Python class that decides how to handle elitism
• num_parents – the size of the initial parents population
• parents_porp – default=0.1; the size of the initial parents population given as a portion of the whole population (only used if num_parents is not given)
• elits_porp – default=0.2; the proportion of offspring to be replaced by elite parents
• mutation_prob – default=0.05; the probability that a component will be mutated
• kwargs – additional keyword arguments
class eoa.MaxGenTermination(**kwargs)[source]

Termination condition: Whether the maximum number of generations has been reached or not

class eoa.UniformCrossover(**kwargs)[source]

Recombination procedure.

class eoa.UniformRand(**kwargs)[source]

Initial population initiation.
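Taken together, these components form a standard generational loop. The following self-contained sketch mirrors that flow in plain Python; the `evolve` helper, its defaults, and the binary-genome assumption are illustrative only and are not part of the eoa API:

```python
import random

def evolve(population, fitness, max_gen=50, num_parents=20,
           elite_prop=0.2, mutation_prob=0.05, seed=0):
    """Minimal generational loop mirroring the documented components:
    uniform-random init, uniform crossover, per-gene mutation, elitism,
    and max-generation termination (binary genomes assumed)."""
    rng = random.Random(seed)
    parents = rng.sample(population, num_parents)            # cf. UniformRand
    n_elites = max(1, int(elite_prop * num_parents))
    for _ in range(max_gen):                                 # cf. MaxGenTermination
        parents.sort(key=fitness, reverse=True)
        elites = parents[:n_elites]                          # cf. Elites
        children = []
        while len(children) < num_parents - n_elites:
            a, b = rng.sample(parents, 2)
            child = tuple(x if rng.random() < 0.5 else y     # cf. UniformCrossover
                          for x, y in zip(a, b))
            child = tuple(1 - g if rng.random() < mutation_prob else g
                          for g in child)                    # cf. Mutation (bit flip)
            children.append(child)
        parents = elites + children
    return max(parents, key=fitness)

# Toy problem: maximize the number of ones over all 8-bit strings.
population = [tuple((i >> k) & 1 for k in range(8)) for i in range(256)]
best = evolve(population, fitness=sum)
```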

## Hilbert Space based regression¶

exception NpyProximation.Error(*args)[source]

Generic errors that may occur in the course of a run.

class NpyProximation.FunctionBasis[source]

This class generates two typical bases of functions: polynomial and trigonometric.

static Fourier(n, deg, l=1.0)[source]

Returns the Fourier basis of degree deg in n variables with period l

Parameters:

• n – number of variables
• deg – the maximum degree of trigonometric combinations in the basis
• l – the period

The raw basis consists of trigonometric functions of degrees up to deg.
static Poly(n, deg)[source]

Returns a basis consisting of polynomials in n variables of degree at most deg.

Parameters:

• n – number of variables
• deg – highest degree of polynomials in the basis

The raw basis consists of polynomials of degrees up to deg.
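For intuition, this basis consists of all monomials $$x_1^{e_1}\cdots x_n^{e_n}$$ with $$e_1+\dots+e_n\le deg$$, so it has $$\binom{n+deg}{deg}$$ elements. A hypothetical helper (not part of NpyProximation) enumerating the exponent tuples:

```python
from itertools import product
from math import comb

def poly_basis_exponents(n, deg):
    """Exponent tuples of all monomials in n variables of total degree <= deg,
    i.e. the monomials that span the basis returned by FunctionBasis.Poly."""
    return [e for e in product(range(deg + 1), repeat=n) if sum(e) <= deg]

exps = poly_basis_exponents(2, 2)
# the basis {1, x, y, x^2, xy, y^2} has C(2+2, 2) = 6 elements
assert len(exps) == comb(2 + 2, 2)
```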
class NpyProximation.FunctionSpace(dim=1, measure=None, basis=None)[source]

A class that facilitates a few types of computations over function spaces of type $$L_2(X, \mu)$$

Parameters:

• dim – the dimension of X (default: 1)
• measure – an object of type Measure representing $$\mu$$
• basis – a finite basis of functions used to construct a subspace of $$L_2(X, \mu)$$
FormBasis()[source]

Call this method to generate the orthogonal basis corresponding to the given basis. The result is stored in a property called OrthBase, a list of functions that are mutually orthogonal with respect to the measure measure over the given domain.
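FormBasis can be thought of as Gram–Schmidt orthogonalization with respect to the space's inner product. A numerical sketch over a discrete stand-in for the uniform measure on [-1, 1] (the grid, weights, and helper names are assumptions, not the NpyProximation internals):

```python
import numpy as np

# Discrete stand-in for the uniform measure on [-1, 1]
x = np.linspace(-1, 1, 2001)
w = np.full_like(x, 2.0 / len(x))            # quadrature weights, sum ~= 2

def gram_schmidt(basis):
    """Orthogonalize a list of callables w.r.t. the weighted inner product:
    subtract from each function its projections onto the previous ones."""
    orth = []
    for f in basis:
        fv = f(x).astype(float)
        for b in orth:
            fv = fv - np.sum(w * fv * b) * b / np.sum(w * b * b)
        orth.append(fv)
    return orth                               # orthogonal values on the grid

orth = gram_schmidt([lambda t: np.ones_like(t), lambda t: t, lambda t: t**2])
# up to scale these match the Legendre polynomials 1, x, x^2 - 1/3
```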

Series(f)[source]

Given a function f, this method finds and returns the coefficients of the series that approximates f as a linear combination of the elements of the orthogonal basis $$B$$. In symbols $$\sum_{b\in B}\langle f, b\rangle b$$.

Returns: the list of coefficients $$\langle f, b\rangle$$ for $$b\in B$$
inner(f, g)[source]

Computes the inner product of the two parameters with respect to the measure measure, i.e., $$\int_Xf\cdot g d\mu$$.

Parameters:

• f – callable
• g – callable

Returns: the quantity $$\int_X f\cdot g\, d\mu$$
project(f, g)[source]

Finds the projection of f on g with respect to the inner product induced by the measure measure.

Parameters:

• f – callable
• g – callable

Returns: the quantity $$\frac{\langle f, g\rangle}{\|g\|_2^2}g$$
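With a discrete measure $$\mu=\sum w_x\delta_x$$ (see Measure below), inner and project reduce to weighted sums. A self-contained sketch in which the points, weights, and helper names are assumptions chosen for illustration:

```python
import numpy as np

# A discrete measure: five equally weighted support points on [-1, 1]
pts = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
wts = np.full(5, 0.2)

def inner(f, g):
    """<f, g> = integral of f*g d(mu) for the discrete measure above."""
    return float(np.sum(wts * f(pts) * g(pts)))

def project(f, g):
    """Projection of f on g: (<f, g> / <g, g>) * g, returned as a callable."""
    c = inner(f, g) / inner(g, g)
    return lambda t: c * g(t)

f = lambda t: t**2
g = lambda t: np.ones_like(t)
p = project(f, g)   # projecting on the constant 1 yields the mean of f under mu
```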
class NpyProximation.HilbertRegressor(deg=3, base=None, meas=None, fspace=None)[source]

Regression using Hilbert Space techniques Scikit-Learn style.

Parameters:

• deg – int, default=3; the degree of polynomial regression; only used if base is None
• base – list, default=None; a list of functions forming an orthogonal function basis
• meas – NpyProximation.Measure, default=None; the measure used to form the $$L_2(\mu)$$ space; if None, a discrete measure is constructed based on the fit inputs
• fspace – NpyProximation.FunctionBasis, default=None; the function subspace of $$L_2(\mu)$$; if None, it is initiated according to self.meas
fit(X, y)[source]
Parameters:

• X – Training data
• y – Target values

Returns: self
predict(X)[source]

Predict using the Hilbert regression method

Parameters:

• X – Samples

Returns: predicted values
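When meas is the empirical (discrete) measure on the training samples and the basis is polynomial, the $$L_2(\mu)$$ best approximation coincides with ordinary least squares on monomial features. A self-contained numpy sketch of that equivalence (not the HilbertRegressor internals):

```python
import numpy as np

X = np.linspace(-1, 1, 50)
y = 2.0 * X**2 - X + 0.5          # noiseless quadratic target

# Degree-3 fit in L2 of the empirical measure on the samples is equivalent
# to ordinary least squares on the monomial features 1, x, x^2, x^3.
V = np.vander(X, 4, increasing=True)
coef, *_ = np.linalg.lstsq(V, y, rcond=None)
pred = V @ coef                    # recovers the target exactly
```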
class NpyProximation.Measure(density=None, domain=None)[source]

Constructs a measure $$\mu$$ based on density and domain.

Parameters:

• density – the density over the domain:
  • if None is given, a uniform distribution is assumed;
  • if a callable h is given, then $$d\mu=h(x)dx$$;
  • if a dictionary is given, then $$\mu=\sum w_x\delta_x$$, a discrete measure, where the points $$x$$ are the keys of the dictionary (tuples) and the weights $$w_x$$ are the values.
• domain – if density is a dictionary, the domain is set by its keys; if density is callable, then domain must be a list of tuples defining the domain's box; if None is given, it is set to $$[-1, 1]^n$$
integral(f)[source]

Calculates $$\int_{domain} fd\mu$$.

Parameters:

• f – the integrand

Returns: the value of the integral
norm(p, f)[source]

Computes the p-norm of f with respect to the current measure, i.e., $$(\int_{domain}|f|^p d\mu)^{1/p}$$.

Parameters:

• p – a positive real number
• f – the function whose norm is desired

Returns: $$\|f\|_{p, \mu}$$
class NpyProximation.Regression(points, dim=None)[source]

Given a set of points, i.e., a list P of tuples of equal length, this class computes the best approximation of a function that fits the data, in the following sense:

• if no extra parameters are provided, i.e., an object is initiated like R = Regression(P), then calling R.fit() returns the linear regression that fits the data.
• if at initiation the parameter deg=n is set, then R.fit() returns the polynomial regression of degree n.
• if a basis of functions is provided by means of an OrthSystem object (R.SetOrthSys(orth)), then calling R.fit() returns the best approximation that can be found using the basis functions of the orth object.
Parameters:

• points – a list of points to be fitted, or a callable to be approximated
• dim – dimension of the domain
SetFuncSpc(sys)[source]

Sets the basis of the orthogonal function system.

Parameters:

• sys – an orthsys.OrthSystem object

Returns: None

Note

For technical reasons, the measure needs to be given via the SetMeasure method. Otherwise, the Lebesgue measure on $$[-1, 1]^n$$ is assumed.

SetMeasure(meas)[source]

Sets the default measure for approximation.

Parameters:

• meas – a measure.Measure object

Returns: None
fit()[source]

Fits the best curve based on the optionally provided orthogonal basis. If no basis is provided, it fits a polynomial of the degree given at initiation.

Returns: the fit

## Sensitivity Analysis¶

Sensitivity analysis of a dataset based on a fit, sklearn style. The core functionality is provided by SALib.

class sensapprx.CorrelationThreshold(threshold=0.7)[source]

Selects a minimal set of features based on a given (Pearson) correlation threshold. The transformer drops as many highly correlated features as possible and makes sure that no pair of remaining features is correlated beyond the given threshold.

Parameters: threshold – the threshold for selecting correlated pairs.
fit(X, y=None)[source]

Finds the Pearson correlations among all features, flags the pairs whose absolute correlation lies above the given threshold, and selects a minimal set of features with low mutual correlation.

Parameters:

• X – Training data
• y – Target values (default: None)

Returns: self
fit_transform(X, y=None, **fit_params)[source]

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

• X – numpy array of shape [n_samples, n_features]; training set
• y – numpy array of shape [n_samples]; target values

Returns: the transformed array
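The selection logic can be sketched greedily with numpy: compute the correlation matrix, then drop one feature from every pair whose absolute correlation exceeds the threshold. This is an illustrative approximation, not the sensapprx implementation:

```python
import numpy as np

def correlated_selection(X, threshold=0.7):
    """Keep a set of column indices with pairwise |Pearson r| <= threshold,
    greedily dropping the later column of each offending pair."""
    corr = np.corrcoef(X, rowvar=False)
    n = corr.shape[0]
    dropped = set()
    for i in range(n):
        for j in range(i + 1, n):
            if i not in dropped and j not in dropped and abs(corr[i, j]) > threshold:
                dropped.add(j)
    return [k for k in range(n) if k not in dropped]

rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = np.column_stack([a, a + 0.01 * rng.normal(size=200), rng.normal(size=200)])
# columns 0 and 1 are nearly identical, so only one of them survives
selected = correlated_selection(X)
```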
class sensapprx.SensAprx(n_features_to_select=10, regressor=None, method='sobol', margin=0.2, num_smpl=500, num_levels=5, grid_jump=1, num_resmpl=8, reduce=False, domain=None, probs=None)[source]

Transform data to select the most influential factors according to a regressor that fits the data.

Parameters:

• n_features_to_select – int; number of top features to be selected
• regressor – a scikit-learn style regressor to fit the data for sensitivity analysis
• method – str; the sensitivity analysis method; default 'sobol', other options are 'morris' and 'delta-mmnt'
• margin – domain margin, default: 0.2
• num_smpl – number of samples used to perform the analysis, default: 500
• num_levels – number of levels for Morris analysis, default: 5
• grid_jump – grid jump for Morris analysis, default: 1
• num_resmpl – number of resamples for moment-independent analysis, default: 8
• reduce – whether to reduce the data points to unique ones and average the corresponding target values, default: False
• domain – pre-calculated unique points; if None and reduce is True, unique points will be found
• probs – pre-calculated values associated with the domain points
fit(X, y)[source]

Fits the regressor to the data (X, y) and performs a sensitivity analysis on the result of the regression.

Parameters:

• X – Training data
• y – Target values

Returns: self
fit_transform(X, y=None, **fit_params)[source]

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

• X – numpy array of shape [n_samples, n_features]; training set
• y – numpy array of shape [n_samples]; target values

Returns: the transformed array

## Optimized Pipeline Detector¶

class aml.AML(config=None, length=5, scoring='accuracy', cat_cols=None, surrogates=None, min_random_evals=15, cv=None, check_point='./', stack_res=True, stack_probs=True, stack_decision=True, verbose=1, n_jobs=-1)[source]

A class that accepts a nested dictionary, with machine learning estimators as its keys and dictionaries of their parameters and ranges as values, and finds an optimum combination based on training data.

Parameters:

• config – a dictionary whose keys are scikit-learn-style objects (as strings) and whose values are dictionaries of the parameters and their acceptable ranges/values
• length – default=5; maximum number of objects in generated pipelines
• scoring – default='accuracy'; the scoring method to be optimized; must follow the sklearn scoring signature
• cat_cols – default=None; the list of indices of categorical columns
• surrogates – default=None; a list of 4-tuples determining surrogates: the first entry is a scikit-learn regressor, the second is the number of iterations for which this surrogate is estimated and optimized, the third is the sampling strategy, and the fourth is the scipy.optimize solver
• min_random_evals – default=15; number of randomly sampled initial values for hyperparameters
• cv – default=ShuffleSplit(n_splits=3, test_size=.25); the cross-validation method
• check_point – default='./'; the path where the optimization results will be stored
• stack_res – default=True; passed as res to StackingEstimator
• stack_probs – default=True; passed as probs to StackingEstimator
• stack_decision – default=True; passed as decision to StackingEstimator
• verbose – default=1; level of output details
• n_jobs – int, default=-1; number of processes to run in parallel
add_surrogate(estimator, itrs, sampling=None, optim='L-BFGS-B')[source]

Adding a regressor for surrogate optimization procedure.

Parameters:

• estimator – a scikit-learn style regressor
• itrs – number of iterations for which the estimator is fitted and optimized
• sampling – default=BoxSample; the sampling strategy (CompactSample, BoxSample, or SphereSample)
• optim – default='L-BFGS-B'; the scipy.optimize solver

Returns: None
eoa_fit(X, y, **kwargs)[source]

Applies evolutionary optimization methods to find an optimum pipeline

Parameters:

• X – Training data
• y – Corresponding observations
• kwargs – EOA parameters

Returns: self
fit(X, y)[source]

Generates and optimizes all legitimate pipelines. The best pipeline can be retrieved from self.best_estimator_

Parameters:

• X – Training data
• y – Corresponding observations

Returns: self
get_top(num=5)[source]

Finds the top num pipelines.

Parameters:

• num – number of pipelines to be returned

Returns: an OrderedDict of the top models
optimize_pipeline(seq, X, y)[source]

Constructs and optimizes a pipeline according to the steps passed through seq which is a tuple of estimators and transformers.

Parameters:

• seq – the tuple of steps of the pipeline to be optimized
• X – numpy array of training features
• y – numpy array of training values

Returns: the optimized pipeline and its score
types()[source]

Recognizes the type of each estimator to determine its legitimate placement within a pipeline.

Returns: None
class aml.StackingEstimator(estimator, res=True, probs=True, decision=True)[source]

Meta-transformer for adding predictions and/or class probabilities as synthetic feature(s).

Parameters:

• estimator – an object with fit, predict, and predict_proba methods; the estimator to generate synthetic features from
• res – True (default); stacks the final result of the estimator
• probs – True (default); stacks probabilities calculated by the estimator
• decision – True (default); stacks the result of the estimator's decision function
fit(X, y=None, **fit_params)[source]

Fit the StackingEstimator meta-transformer.

Parameters:

• X – array-like of shape (n_samples, n_features); the training input samples
• y – array-like of shape (n_samples,); the target values (integers that correspond to classes in classification, real numbers in regression)
• fit_params – other estimator-specific parameters

Returns: self, object; returns a copy of the estimator
set_params(**params)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

• params – dict; estimator parameters

Returns: self, the estimator instance
transform(X)[source]

Transform data by adding the synthetic feature(s).

Parameters:

• X – numpy ndarray of shape (n_samples, n_components); new data, where n_samples is the number of samples and n_components is the number of components

Returns: X_transformed – array-like of shape (n_samples, n_features + 1), or (n_samples, n_features + 1 + n_classes) for a classifier with a predict_proba attribute; the transformed feature set
class aml.Words(letters, last=None, first=None, repeat=False)[source]

This class takes a set as an alphabet and generates words of a given length accordingly. A Words instance accepts the following parameters:

Parameters:

• letters – a set of letters (symbols) to make up the words
• last – a subset of letters that are allowed to appear at the end of a word
• first – a set of letters that can only appear at the beginning of a word
• repeat – whether consecutive occurrences of a letter are allowed
Generate(l)[source]

Generates the set of legitimate words of length l

Parameters:

• l – int; the length of the words

Returns: the set of all legitimate words of length l
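A self-contained sketch of the generation logic described above, treating first and last as sets of letters; the helper is an assumption for illustration, not the aml implementation:

```python
from itertools import product

def generate_words(letters, l, last=None, first=None, repeat=False):
    """All length-l words over `letters`, optionally restricting the final
    letter, the initial letter, and consecutive repeats."""
    words = []
    for w in product(sorted(letters), repeat=l):
        if not repeat and any(a == b for a, b in zip(w, w[1:])):
            continue                       # consecutive repeat not allowed
        if last is not None and w[-1] not in last:
            continue                       # final letter not permitted
        if first is not None and w[0] not in first:
            continue                       # initial letter not permitted
        words.append(w)
    return words

# length-2 words over {a, b, c} that end in 'c' and never repeat a letter
ws = generate_words({'a', 'b', 'c'}, 2, last={'c'})
```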

## MLTrace: A machine learning progress tracker¶

This module provides some basic functionality to track the process of machine learning model development. It sets up a SQLite db-file and stores selected models, graphs, and data (for convenience) and recovers them as requested.

mltrace uses peewee and pandas for data manipulation.

It also has built-in capabilities to generate some typical machine learning plots and graphs.

class mltrace.Data(*args, **kwargs)[source]

The class to generate the 'data' table in the SQLite db-file. This table stores the whole given data for convenience.

DoesNotExist

alias of DataDoesNotExist

class mltrace.MLModel(*args, **kwargs)[source]

The class to generate the 'mlmodel' table in the SQLite db-file. It stores the scikit-learn scheme of the model/pipeline, its parameters, etc.

DoesNotExist

alias of MLModelDoesNotExist

class mltrace.Metrics(*args, **kwargs)[source]

The class to generate the 'metrics' table in the SQLite db-file. This table stores the calculated metrics of each stored model.

DoesNotExist

alias of MetricsDoesNotExist

class mltrace.Plots(*args, **kwargs)[source]

The class to generate the 'plots' table in the SQLite db-file. This table stores matplotlib plots associated with each model.

DoesNotExist

alias of PlotsDoesNotExist

class mltrace.Saved(*args, **kwargs)[source]

The class to generate the 'saved' table in the SQLite db-file. It keeps the pickled version of a stored model that can later be recovered.

DoesNotExist

alias of SavedDoesNotExist

class mltrace.Task(*args, **kwargs)[source]

The class to generate the 'task' table in the SQLite db-file. This table keeps basic information about the task at hand, e.g., the task name, a brief description, the target column, and columns to be ignored.

DoesNotExist

alias of TaskDoesNotExist

class mltrace.Weights(*args, **kwargs)[source]

The class to generate the 'weights' table in the SQLite db-file. Stores some sensitivity measures, correlations, etc.

DoesNotExist

alias of WeightsDoesNotExist

class mltrace.mltrack(task, task_id=None, db_name='mltrack.db', cv=None)[source]

This class instantiates an object that tracks ML activities and stores them upon request.

Parameters:

• task – str; the task name
• task_id – the id of an existing task, if the name is not provided
• db_name – a file name for the SQLite database
• cv – the default cross-validation method; must be a valid cv based on sklearn.model_selection; default: ShuffleSplit(n_splits=3, test_size=.25)
FeatureWeights(weights=('pearson', 'variance'), **kwargs)[source]

Calculates the requested weights and logs them.

Parameters:

• weights – a list of weights, a subset of {'pearson', 'variance', 'relieff', 'surf', 'sobol', 'morris', 'delta_mmnt', 'info-gain'}
• kwargs – any input acceptable by skrebate.ReliefF, skrebate.surf, or sensapprx.SensAprx

Returns: None
LoadModel(mid)[source]

Loads a model corresponding to an id

Parameters:

• mid – the model id

Returns: an unfitted model
static LoadPlot(pid)[source]

Loads a matplotlib plot

Parameters:

• pid – the id of the plot

Returns: a matplotlib figure
LogMetrics(mdl, cv=None)[source]

Logs metrics of an already logged model using a cross-validation method.

Parameters:

• mdl – the model to be measured
• cv – cross-validation method

Returns: a dictionary of all measures with their corresponding values for the model
LogModel(mdl, name=None)[source]

Logs a machine learning model.

Parameters:

• mdl – a scikit-learn compatible estimator/pipeline
• name – an arbitrary string to name the model

Returns: a modified instance of mdl which carries a new attribute mltrack_id as its id
PreserveModel(mdl)[source]

Pickles and preserves an already logged model

Parameters:

• mdl – a logged model

Returns: None
RecoverModel(mdl_id)[source]

Recovers a pickled model

Parameters:

• mdl_id – a valid mltrack_id

Returns: a fitted model
RegisterData(source_df, target)[source]

Registers a pandas DataFrame into the SQLite database. Upon a call, it also sets self.X and self.y which are numpy arrays.

Parameters:

• source_df – the pandas DataFrame to be stored
• target – the name of the target column to be predicted

Returns: None
TopFeatures(num=10)[source]

Returns the num top features in the data, based on the calculated weights.

Parameters:

• num – number of top features to return

Returns: an OrderedDict of the top features
UpdateModel(mdl, name)[source]

Updates an already logged model which has mltrack_id set.

Parameters:

• mdl – a scikit-learn compatible estimator/pipeline
• name – an arbitrary string to name the model

Returns: None
UpdateTask(data)[source]

Parameters:

• data – a dictionary that may include some of the following keys:
  • 'name': the corresponding value will replace the current task name
  • 'description': the corresponding value will replace the current description
  • 'ignore': the corresponding value will replace the current ignored columns

Returns: None
allModels()[source]

Lists all logged models as a pandas DataFrame

Returns: a pandas DataFrame
allPlots(mdl_id)[source]

Lists all stored plots for a model with mdl_id as a pandas DataFrame

Parameters:

• mdl_id – a valid mltrack_id

Returns: a pandas DataFrame
allPreserved()[source]

Lists all pickled models as a pandas DataFrame

Returns: a pandas DataFrame
allTasks()[source]

Lists all tasks as a pandas DataFrame

Returns: a pandas DataFrame
static cumulative_gain_curve(y_true, y_score, pos_label=None)[source]

This function generates the points necessary to plot the Cumulative Gains chart.

Note: this implementation is restricted to the binary classification task.

Parameters:

• y_true – array-like, shape (n_samples); true labels of the data
• y_score – array-like, shape (n_samples); target scores, which can be probability estimates of the positive class, confidence values, or non-thresholded measures of decisions (as returned by decision_function on some classifiers)
• pos_label – int or str, default=None; the label considered as positive, with all others considered negative

Returns:

• percentages (numpy.ndarray) – an array containing the x-axis values for plotting the Cumulative Gains chart
• gains (numpy.ndarray) – an array containing the y-axis values for one curve of the Cumulative Gains chart

Raises: ValueError if y_true is not composed of 2 classes; the Cumulative Gains chart is only relevant in binary classification
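The computation can be sketched as: sort the samples by decreasing score, then accumulate the fraction of all positives captured so far. A self-contained approximation of the described behaviour (not the mltrace implementation; pos_label handling is omitted):

```python
import numpy as np

def cumulative_gain(y_true, y_score):
    """Points of a binary cumulative-gains curve: after targeting the top k
    samples by score, what fraction of all positives has been captured?"""
    order = np.argsort(y_score)[::-1]                 # highest scores first
    hits = np.cumsum(np.asarray(y_true)[order])       # positives captured so far
    percentages = np.arange(1, len(y_true) + 1) / len(y_true)
    gains = hits / hits[-1]
    return percentages, gains

y_true = [1, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.1]      # a perfect ranker
pct, gains = cumulative_gain(y_true, y_score)
```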
static getBest(metric)[source]

Finds the model with the best metric.

Parameters:

• metric – the metric to find the best stored model for

Returns: the model with the best metric
get_data()[source]

Retrieves data in numpy format

Returns: numpy arrays X, y
get_dataframe()[source]

Retrieves data in pandas DataFrame format

Returns: pandas DataFrame containing all data
heatmap(corr_df=None, sort_by=None, ascending=False, font_size=3, cmap='gnuplot2', idx_col='feature', ignore=())[source]

Plots a heatmap from the values of the dataframe corr_df

Parameters:

• corr_df – the value container
• idx_col – the column whose values will be used as the index
• sort_by – the dataframe will be sorted in descending order by the values of this column; if None, the first column is used
• font_size – font size, default 3
• cmap – color mapping; must be one of the following: 'viridis', 'plasma', 'inferno', 'magma', 'cividis', 'Greys', 'Purples', 'Blues', 'Greens', 'Oranges', 'Reds', 'YlOrBr', 'YlOrRd', 'OrRd', 'PuRd', 'RdPu', 'BuPu', 'GnBu', 'PuBu', 'YlGnBu', 'PuBuGn', 'BuGn', 'YlGn', 'binary', 'gist_yarg', 'gist_gray', 'gray', 'bone', 'pink', 'spring', 'summer', 'autumn', 'winter', 'cool', 'Wistia', 'hot', 'afmhot', 'gist_heat', 'copper', 'PiYG', 'PRGn', 'BrBG', 'PuOr', 'RdGy', 'RdBu', 'RdYlBu', 'RdYlGn', 'Spectral', 'coolwarm', 'bwr', 'seismic', 'twilight', 'twilight_shifted', 'hsv', 'Pastel1', 'Pastel2', 'Paired', 'Accent', 'Dark2', 'Set1', 'Set2', 'Set3', 'tab10', 'tab20', 'tab20b', 'tab20c', 'flag', 'prism', 'ocean', 'gist_earth', 'terrain', 'gist_stern', 'gnuplot', 'gnuplot2', 'CMRmap', 'cubehelix', 'brg', 'gist_rainbow', 'rainbow', 'jet', 'nipy_spectral', 'gist_ncar'

Returns: a matplotlib pyplot instance
plot_calibration_curve(mdl, name, fig_index=1, bins=10)[source]

Plots calibration curves.

Parameters:

• mdl – an object implementing the "fit" and "predict" methods; an object of that type is cloned for each validation
• name – string; title for the chart
• bins – number of bins used to partition the samples

Returns: a matplotlib plot
plot_cumulative_gain(mdl, title='Cumulative Gains Curve', figsize=None, title_fontsize='large', text_fontsize='medium')[source]

Generates the Cumulative Gains plot from labels and scores/probabilities. The cumulative gains chart is used to determine the effectiveness of a binary classifier. A detailed explanation can be found at http://mlwiki.org/index.php/Cumulative_Gain_Chart. The implementation here works only for binary classification.

Parameters:

• mdl – an object implementing the "fit" and "predict" methods; an object of that type is cloned for each validation
• title – string, optional; title of the generated plot; defaults to "Cumulative Gains Curve"
• figsize – 2-tuple, optional; tuple denoting the figure size of the plot, e.g. (6, 6); defaults to None
• title_fontsize – string or int, optional; matplotlib-style font size, e.g. "small", "medium", "large", or an integer value; defaults to "large"
• text_fontsize – string or int, optional; matplotlib-style font size, e.g. "small", "medium", "large", or an integer value; defaults to "medium"

Returns: ax (matplotlib.axes.Axes) – the axes on which the plot was drawn
plot_learning_curve(mdl, title, ylim=None, cv=None, n_jobs=1, train_sizes=None, **kwargs)[source]

Generate a simple plot of the test and training learning curve.

Parameters:

• mdl – an object implementing the "fit" and "predict" methods; an object of that type is cloned for each validation
• title – string; title for the chart
• measure – string; a performance measure, one of the following: accuracy, f1, precision, recall, roc_auc
• ylim – tuple, shape (ymin, ymax), optional; defines the minimum and maximum y-values plotted
• cv – int, cross-validation generator, or an iterable, optional; determines the cross-validation splitting strategy. Possible inputs for cv are: None, to use the default 3-fold cross-validation; an integer, to specify the number of folds; an object to be used as a cross-validation generator; or an iterable yielding train/test splits. For integer/None inputs, if y is binary or multiclass, StratifiedKFold is used; if the mdl is not a classifier or if y is neither binary nor multiclass, KFold is used
• n_jobs – integer, optional; number of jobs to run in parallel (default 1)

Returns: a matplotlib plot
plot_lift_curve(mdl, title='Lift Curve', figsize=None, title_fontsize='large', text_fontsize='medium')[source]

Generates the Lift Curve from labels and scores/probabilities. The lift curve is used to determine the effectiveness of a binary classifier. A detailed explanation can be found at http://www2.cs.uregina.ca/~dbd/cs831/notes/lift_chart/lift_chart.html. The implementation here works only for binary classification.

Parameters:

• mdl – an object implementing the "fit" and "predict" methods; an object of that type is cloned for each validation
• title – string, optional; title of the generated plot; defaults to "Lift Curve"
• figsize – 2-tuple, optional; tuple denoting the figure size of the plot, e.g. (6, 6); defaults to None
• title_fontsize – string or int, optional; matplotlib-style font size, e.g. "small", "medium", "large", or an integer value; defaults to "large"
• text_fontsize – string or int, optional; matplotlib-style font size, e.g. "small", "medium", "large", or an integer value; defaults to "medium"

Returns: ax (matplotlib.axes.Axes) – the axes on which the plot was drawn
plot_roc_curve(mdl, label=None)[source]

Plots the ROC curve; modified from Hands-On Machine Learning with Scikit-Learn.

Parameters:

• mdl – an object implementing the "fit" and "predict" methods; an object of that type is cloned for each validation
• label – string; label for the chart

Returns: a matplotlib plot
class mltrace.np2df(data, clmns=None)[source]

A class to convert numpy ndarray to a pandas DataFrame. It produces a callable object which returns a pandas.DataFrame

Parameters:

• data – the numpy.ndarray data
• clmns – a list of titles for the pandas DataFrame column names; if None, columns are named C{num}, where num ranges over the column indices of the ndarray
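A minimal sketch of the conversion this class performs, assuming a 2-d array and the C{num} naming described above; this is an illustration, not the mltrace implementation:

```python
import numpy as np
import pandas as pd

def to_dataframe(data, clmns=None):
    """Wrap an ndarray as a DataFrame, auto-naming columns C0, C1, ...
    when no titles are given."""
    if clmns is None:
        clmns = [f"C{i}" for i in range(data.shape[1])]
    return pd.DataFrame(data, columns=clmns)

df = to_dataframe(np.arange(6).reshape(3, 2))
```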

## synthdata Module¶

This module provides a basic framework for generating synthetic data resembling an existing dataset. One can determine the type of each field and the possible values for each field; the SynthData class then produces data based on the given types. Moreover, one can associate marginal distributions to each field, or a joint distribution, for data generation. If no distribution is associated, the data will be generated uniformly over the required ranges.

### Supported data types¶

The following data types are supported:

• SynthBin: Support for binary data, i.e., the data fields consisting of 0, 1 values;
• SynthInt: Support for integer valued data;
• SynthReal: Support for real valued data;
• SynthCat: Support for categorical type of data, i.e., the discrete variables whose values are predetermined;
• SynthDate: Support for datetime data;

Each of these data types accepts data, a 1-d numpy.array, and rv, a scipy.stats distribution implementing rv.rvs to generate samples. Among the above, SynthInt and SynthReal accept two parameters, a and b, which are the lower and upper bounds of the sampling interval, respectively. SynthDate accepts frmt, which determines the date formatting for the input data.

### Generating Synthetic Data¶

The SynthData class is responsible for generating synthetic data based on types, distributions and relations defined on the data. One initiates an instance as:

sd = SynthData(df, default_rv='uniform', distribution_type='marginal', rv=None)


where df is the pandas dataframe to be synthesized. The rest of the arguments are optional:

• default_rv: determines the default distribution for fields with no associated distribution. If distribution_type is set to 'joint', this is ignored.
• distribution_type: determines whether the distribution(s) calculated based on df are marginals or a single joint distribution.
• rv: a predefined distribution to be used as the joint distribution.

To set the type of a column of df, one should use the set_type method. This method accepts a list of column names, their type, and a tuple of initiating parameters. Every column in the list will be given the same type. The type can be either an instance of SynthBin, SynthInt, SynthReal, SynthCat, SynthDate, or a string determining the type, e.g., 'bin', 'int', 'real', 'cat', 'date'. If no type is associated to a column, it is assumed to be categorical.

The final command that generates the synthetic data is sample(num), where num is the number of synthetic samples to be generated. This method returns a pandas.DataFrame containing synthetic data of size num.
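As a concrete illustration of the default behaviour (no associated distribution, so values are drawn uniformly over the required range), here is a self-contained sketch for a single real-valued column; synth_uniform is a hypothetical helper, not part of the module:

```python
import numpy as np

def synth_uniform(column, num, seed=0):
    """Sample `num` values uniformly over the observed range of a
    real-valued column, mimicking the module's default behaviour."""
    rng = np.random.default_rng(seed)
    lo, hi = float(np.min(column)), float(np.max(column))
    return rng.uniform(lo, hi, size=num)

real = np.array([1.0, 3.0, 2.5, 1.5])
fake = synth_uniform(real, 1000)   # synthetic values stay within [1.0, 3.0]
```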

### Constraints¶

It is quite common that the values of some fields in a record depend on other fields. Simple constraints on the values of a field and relations between pairs can be handled using field objects.

If it is required to impose a constraint on a column, one can use the where() method. A statement adding a constraint generally looks like sd.where(field('clmn') > val) or sd.where(field('clmn1') <= field('clmn2')). The acceptable operators are ==, !=, >, <, >=, <=, in, and nin. The operators in and nin check membership of the elements of 'clmn' in 'val', which has to be an iterable, or membership in the column 'clmn2'. in stands for belonging and nin stands for "not in".
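Internally, honoring such constraints can be sketched as rejection sampling: generate candidate records, keep those satisfying the predicate, and repeat until enough records survive. A self-contained sketch in which the helper and its signature are assumptions, not the synthdat API:

```python
import numpy as np

def sample_with_constraint(gen, constraint, num, max_tries=100):
    """Draw batches from `gen` and keep only the records satisfying
    `constraint` until `num` records have been collected."""
    kept = []
    for _ in range(max_tries):
        batch = gen(num)
        kept.extend(r for r in batch if constraint(r))
        if len(kept) >= num:
            return kept[:num]
    raise RuntimeError("constraint too restrictive")

rng = np.random.default_rng(0)
gen = lambda n: rng.uniform(0, 1, size=(n, 2)).tolist()
# analogous to sd.where(field('clmn1') > field('clmn2'))
out = sample_with_constraint(gen, lambda r: r[0] > r[1], 50)
```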

### Code documentation¶

class synthdat.SynthBase(data=None, rv=None)[source]

The base class for various synthetic data types.

static get_val(x)[source]

Converts the value of x into a numeric value that can be handled by random distributions.

Parameters:

• x – the value to be converted

Returns: the corresponding numeric value
ret_val(X)[source]

Similar to set_val but works on the data stored in X

Parameters:

• X – the data to be converted to numbers

Returns: the translation of X
set_val()[source]

Generates the numeric translation of the given data.

Returns: list of values
class synthdat.SynthBin(data=None, rv=None)[source]

Support for binary data, i.e., the data fields consisting of 0, 1 values

get_val(x)[source]

Converts the value of x into a numeric value that can be handled by random distributions.

Parameters:

• x – the value to be converted

Returns: the corresponding numeric value
ret_val(X)[source]

Similar to set_val but works on the data stored in X

Parameters:

• X – the data to be converted to numbers

Returns: the translation of X
set_val()[source]

Generates the numeric translation of the given data.

Returns: list of values
class synthdat.SynthCat(data, rv=None)[source]

Support for categorical data, i.e., discrete variables whose values are predetermined.

get_val(x)[source]

Converts the value of x into a numeric form that can be handled by random distributions.

Parameters: x – the value to be converted into a numeric form
Returns: the corresponding numeric value
ret_val(X)[source]

Similar to set_val, but works on the data stored in X.

Parameters: X – the data to be converted to numbers
Returns: the translation of X
set_val()[source]

Generates the numeric translation of the given data.

Returns: list of values
class synthdat.SynthData(df, default_rv='uniform', distribution_type='marginal', rv=None)[source]

A class that takes a real pandas.DataFrame and produces synthetic data similar to it, based on types and distributions provided by the user and/or extracted from the original data.

Parameters: df – a pandas.DataFrame containing the original data. default_rv – the default distribution of columns; default ‘uniform’, may also be ‘normal’. Only effective if distribution_type is ‘marginal’; otherwise ignored. distribution_type – default ‘marginal’. Determines the type of distribution. If ‘joint’, either a normal distribution is estimated from the provided data, or rv is used when it is not None. rv – default None. The joint distribution of the variables. Only effective if distribution_type is ‘joint’.
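When distribution_type is ‘joint’ and no rv is supplied, the documentation says a normal distribution is calculated from the data. A plausible sketch of that estimation with numpy (the variable names are hypothetical; this is not the actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Numeric translation of the original data, one row per record.
X = np.array([[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2]])

# Fit a joint normal: sample mean vector and covariance matrix.
mean = X.mean(axis=0)
cov = np.cov(X, rowvar=False)

# Draw synthetic records from the fitted joint distribution,
# which preserves the correlations between the columns.
synthetic = rng.multivariate_normal(mean, cov, size=50)
```

Unlike independent marginals, a joint distribution like this keeps cross-column correlations of the original data, at the cost of assuming normality when no rv is given.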
filter(df)[source]

Filters ‘df’ to remove illegal records according to the constraints.

Parameters: df – the dataframe to be filtered
Returns: the filtered dataframe
generate(num)[source]

Internal. Generates ‘num’ synthetic data records without considering constraints.

Parameters: num – the number of samples
Returns: a pandas.DataFrame
sample(num)[source]

Produces ‘num’ records of synthetic data following the given types, distributions, and constraints.

Parameters: num – the number of synthetic data records
Returns: a dataframe consisting of ‘num’ synthetic records
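Since generate() ignores constraints and filter() discards illegal records, sample() plausibly needs to keep generating until enough legal records remain. A hedged sketch of such a rejection loop, with stand-ins for generate and filter (not the real methods):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def generate(num):
    """Stand-in for generate(): unconstrained synthetic records."""
    return pd.DataFrame({"a": rng.integers(0, 10, size=num),
                         "b": rng.integers(0, 10, size=num)})

def filter_legal(df):
    """Stand-in for filter(): keep rows satisfying a < b."""
    return df[df["a"] < df["b"]]

def sample(num):
    """Generate and filter repeatedly until 'num' legal records exist."""
    out = filter_legal(generate(num))
    while len(out) < num:
        out = pd.concat([out, filter_legal(generate(num))],
                        ignore_index=True)
    return out.head(num)

synth = sample(20)
```

The tighter the constraints, the more rounds of generation are needed, which is why very restrictive constraints can make sampling slow.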
set_type(clmns, typ, param=None)[source]

Defines the type of columns.

Parameters: clmns – a list of df columns typ – the associated type, either a string (‘bin’, ‘int’, ‘real’, ‘cat’, ‘date’) or an instance of SynthBin, SynthInt, SynthReal, SynthCat, SynthDate. param – parameters to be passed to the synthetic data type if a string is given for ‘typ’. It can be a pair (a, b) for the ‘int’ and ‘real’ types, and just the format for ‘date’.
transform()[source]

Internal. Analyses and initializes data types and converts them to numerical values.

Returns: None
where(cns)[source]

Adds a constraint on the values of a column using field objects.

Parameters: cns – the constraint, e.g., field('clmn1') > val1 or field('clmn1') <= field('clmn2')
Returns: None
class synthdat.SynthDate(data, frmt='%Y-%m-%d', rv=None)[source]

Support for datetime data.

get_val(x)[source]

Converts the value of x into a numeric form that can be handled by random distributions.

Parameters: x – the value to be converted into a numeric form
Returns: the corresponding numeric value
ret_val(X)[source]

Similar to set_val, but works on the data stored in X.

Parameters: X – the data to be converted to numbers
Returns: the translation of X
set_val()[source]

Generates the numeric translation of the given data.

Returns: list of values
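For datetime columns, a natural numeric translation is the POSIX timestamp. A sketch of the kind of round trip that get_val/ret_val imply for SynthDate, using pandas (illustrative only, not the class internals):

```python
import pandas as pd

frmt = "%Y-%m-%d"
dates = ["2020-01-01", "2020-06-15", "2021-03-09"]

# get_val-style step: parse with the format, then convert to seconds
# since the epoch so a random distribution can work with the values.
ts = pd.to_datetime(dates, format=frmt)
seconds = ts.astype("int64") // 10**9

# ret_val-style step: map sampled numbers back to formatted dates.
back = pd.to_datetime(seconds, unit="s").strftime(frmt).tolist()
```

A distribution fitted on `seconds` can then be sampled like any real-valued column, with the results converted back to dates in the original format.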
class synthdat.SynthInt(a=None, b=None, data=None, rv=None)[source]

Support for integer-valued data.

get_val(x)[source]

Converts the value of x into a numeric form that can be handled by random distributions.

Parameters: x – the value to be converted into a numeric form
Returns: the corresponding numeric value
ret_val(X)[source]

Similar to set_val, but works on the data stored in X.

Parameters: X – the data to be converted to numbers
Returns: the translation of X
set_val()[source]

Generates the numeric translation of the given data.

Returns: list of values
class synthdat.SynthReal(a=None, b=None, data=None, rv=None)[source]

Support for real-valued data.

get_val(x)[source]

Converts the value of x into a numeric form that can be handled by random distributions.

Parameters: x – the value to be converted into a numeric form
Returns: the corresponding numeric value
ret_val(X)[source]

Similar to set_val, but works on the data stored in X.

Parameters: X – the data to be converted to numbers
Returns: the translation of X
set_val()[source]

Generates the numeric translation of the given data.

Returns: list of values
class synthdat.field(fld)[source]

A generic class to handle simple constraints on columns. It accepts a single parameter, which refers to a column in the DataFrame.
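A minimal sketch of how such a field object can capture a comparison for later evaluation: overload the comparison operators so they return a predicate over a DataFrame. The `Field` class below is a hypothetical illustration, not the actual synthdat implementation.

```python
import pandas as pd

class Field:
    """Records a column name; comparisons return callables that
    evaluate the constraint as a boolean mask on a DataFrame."""
    def __init__(self, name):
        self.name = name

    def _rhs(self, other, df):
        # The right-hand side may be another column or a constant.
        return df[other.name] if isinstance(other, Field) else other

    def __gt__(self, other):
        return lambda df: df[self.name] > self._rhs(other, df)

    def __le__(self, other):
        return lambda df: df[self.name] <= self._rhs(other, df)

df = pd.DataFrame({"a": [1, 5, 3], "b": [2, 4, 9]})
cns = Field("a") <= Field("b")   # deferred constraint, not yet evaluated
mask = cns(df)                   # boolean mask over the records
```

Deferring evaluation this way is what lets a statement like sd.where(field('clmn1') <= field('clmn2')) be stored first and applied later, during filtering.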