Code Documentation

Evolutionary Optimization Algorithm

class eoa.EOA(population, fitness, **kwargs)[source]

A base class acting as an umbrella that orchestrates the steps of an evolutionary optimization algorithm.

Parameters:
  • population – The whole possible population as a list
  • fitness – The fitness evaluation. Accepts an OrderedDict of individuals with their corresponding fitness and updates their fitness
  • init_pop – default=`UniformRand`; The python class that initiates the initial population
  • recomb – default=`UniformCrossover`; The python class that defines how to combine parents to produce children
  • mutation – default=`Mutation`; The python class that performs mutation on offspring population
  • termination – default=`MaxGenTermination`; The python class that determines the termination criterion
  • elitism – default=`Elites`; The python class that decides how to handle elitism
  • num_parents – The size of initial parents population
  • parents_porp – default=0.1; The size of the initial parents population given as a proportion of the whole population (only used if num_parents is not given)
  • elits_porp – default=0.2; The proportion of offspring to be replaced by elite parents
  • mutation_prob – The probability that a component will be mutated (default: 0.05)
  • kwargs
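The interplay of these components can be sketched with a self-contained toy loop. This is plain illustrative Python, not the eoa API; every name below is hypothetical, and a simple OneMax objective stands in for a real fitness function.

```python
import random

def uniform_crossover(p1, p2):
    # Each gene is taken from either parent with probability 1/2.
    return [a if random.random() < 0.5 else b for a, b in zip(p1, p2)]

def mutate(ind, prob=0.05, genes=(0, 1)):
    # Each component is replaced by a random gene with probability `prob`.
    return [random.choice(genes) if random.random() < prob else g for g in ind]

def evolve(fitness, length=8, pop_size=20, generations=30, elite_frac=0.2, seed=0):
    random.seed(seed)
    pop = [[random.choice((0, 1)) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):                  # fixed-count termination
        pop.sort(key=fitness, reverse=True)
        elites = pop[: max(1, int(elite_frac * pop_size))]   # elitism
        children = []
        while len(children) < pop_size - len(elites):
            p1, p2 = random.sample(pop[: pop_size // 2], 2)  # parent selection
            children.append(mutate(uniform_crossover(p1, p2)))
        pop = elites + children
    return max(pop, key=fitness)

best = evolve(sum)   # maximize the number of 1s (OneMax)
```

Here the crossover, mutation, and elite-carryover steps mirror the roles of the recomb, mutation, and elitism classes, and the fixed generation count plays the part of MaxGenTermination.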
class eoa.MaxGenTermination(**kwargs)[source]

Termination condition: Whether the maximum number of generations has been reached or not

class eoa.UniformCrossover(**kwargs)[source]

Recombination procedure.

class eoa.UniformRand(**kwargs)[source]

Initial population initiation.

Hilbert Space based regression

exception NpyProximation.Error(*args)[source]

Generic errors that may occur in the course of a run.

class NpyProximation.FunctionBasis[source]

This class generates two typical bases of functions: polynomial and trigonometric.

static Fourier(n, deg, l=1.0)[source]

Returns the Fourier basis of degree deg in n variables with period l

Parameters:
  • n – number of variables
  • deg – the maximum degree of trigonometric combinations in the basis
  • l – the period
Returns:

the raw basis consisting of trigonometric functions of degrees up to deg

static Poly(n, deg)[source]

Returns a basis consisting of polynomials in n variables of degree at most deg.

Parameters:
  • n – number of variables
  • deg – highest degree of polynomials in the basis
Returns:

the raw basis consisting of polynomials of degrees up to deg
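Such a monomial basis can be enumerated by listing exponent tuples; a minimal sketch follows (the helper is hypothetical, not part of NpyProximation). A monomial basis in n variables of degree at most deg has \(\binom{n+deg}{deg}\) elements.

```python
from itertools import product
from math import comb

def poly_basis_exponents(n, deg):
    """All exponent tuples (e1, ..., en) with e1 + ... + en <= deg;
    each tuple represents the monomial x1**e1 * ... * xn**en."""
    return [e for e in product(range(deg + 1), repeat=n) if sum(e) <= deg]

exps = poly_basis_exponents(2, 3)   # n=2 variables, degree at most 3
assert len(exps) == comb(2 + 3, 3)  # C(5, 3) = 10 basis monomials
```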

class NpyProximation.FunctionSpace(dim=1, measure=None, basis=None)[source]

A class that facilitates a few types of computations over function spaces of type \(L_2(X, \mu)\)

Parameters:
  • dim – the dimension of ‘X’ (default: 1)
  • measure – an object of type Measure representing \(\mu\)
  • basis – a finite basis of functions spanning a subspace of \(L_2(X, \mu)\)
FormBasis()[source]

Call this method to generate the orthogonal basis corresponding to the given basis. The result is stored in a property called OrthBase, a list of functions that are mutually orthogonal with respect to the measure measure over the given domain.
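The orthogonalization can be sketched as Gram–Schmidt over sampled functions, using a discrete quadrature inner product as a stand-in for integration against the measure (all names below are hypothetical):

```python
def orthogonalize(vectors, weights):
    """Gram-Schmidt on sampled functions with respect to the discrete
    inner product <f, g> = sum_i w_i * f_i * g_i (a quadrature stand-in
    for integration against the measure)."""
    inner = lambda u, v: sum(w * a * b for w, a, b in zip(weights, u, v))
    orth = []
    for v in vectors:
        for u in orth:
            c = inner(v, u) / inner(u, u)
            v = [a - c * b for a, b in zip(v, u)]   # remove the u-component
        orth.append(v)
    return orth

# Sample 1, x, x^2 on a symmetric grid over [-1, 1]; uniform weights
# approximate the Lebesgue measure, and the result resembles Legendre polynomials.
xs = [i / 1000.0 for i in range(-1000, 1001)]
w = [2.0 / len(xs)] * len(xs)
basis = orthogonalize([[1.0] * len(xs), xs, [x * x for x in xs]], w)
```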

Series(f)[source]

Given a function f, this method finds and returns the coefficients of the series that approximates f as a linear combination of the elements of the orthogonal basis \(B\). In symbols \(\sum_{b\in B}\langle f, b\rangle b\).

Returns:the list of coefficients \(\langle f, b\rangle\) for \(b\in B\)
inner(f, g)[source]

Computes the inner product of the two parameters with respect to the measure measure, i.e., \(\int_Xf\cdot g d\mu\).

Parameters:
  • f – callable
  • g – callable
Returns:

the quantity of \(\int_Xf\cdot g d\mu\)

project(f, g)[source]

Finds the projection of f on g with respect to the inner product induced by the measure measure.

Parameters:
  • f – callable
  • g – callable
Returns:

the quantity \(\frac{\langle f, g\rangle}{\|g\|_2^2}g\)
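A numerical sketch of this projection with a uniform quadrature stand-in for the measure on \([-1, 1]\) (hypothetical helpers, not the NpyProximation API). For f(x) = x² + x and g(x) = x, \(\langle f, g\rangle = \int_{-1}^{1}(x^3 + x^2)dx = 2/3\) and \(\|g\|_2^2 = 2/3\), so the projection is exactly x.

```python
def inner(f, g, xs, w):
    # Discrete stand-in for the integral of f*g against the measure.
    return sum(wi * f(x) * g(x) for x, wi in zip(xs, w))

def project(f, g, xs, w):
    # proj_g f = (<f, g> / ||g||^2) * g
    c = inner(f, g, xs, w) / inner(g, g, xs, w)
    return lambda x: c * g(x)

xs = [i / 1000.0 for i in range(-1000, 1001)]   # symmetric grid on [-1, 1]
w = [2.0 / len(xs)] * len(xs)                   # uniform quadrature weights
p = project(lambda x: x * x + x, lambda x: x, xs, w)
```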

class NpyProximation.HilbertRegressor(deg=3, base=None, meas=None, fspace=None)[source]

Regression using Hilbert Space techniques Scikit-Learn style.

Parameters:
  • deg – int, default=3 The degree of polynomial regression. Only used if base is None
  • base – list, default=None; a list of functions to form an orthogonal function basis
  • meas – NpyProximation.Measure, default = None the measure to form the \(L_2(\mu)\) space. If None a discrete measure will be constructed based on fit inputs
  • fspace – NpyProximation.FunctionBasis, default = None the function subspace of \(L_2(\mu)\), if None it will be initiated according to self.meas
fit(X, y)[source]
Parameters:
  • X – Training data
  • y – Target values
Returns:

self

predict(X)[source]

Predict using the Hilbert regression method

Parameters:X – Samples
Returns:Returns predicted values
class NpyProximation.Measure(density=None, domain=None)[source]

Constructs a measure \(\mu\) based on density and domain.

Parameters:
  • density

    the density over the domain:

    • if None is given, a uniform distribution is assumed
    • if a callable h is given, then \(d\mu=h(x)dx\)
    • if a dictionary is given, then \(\mu=\sum w_x\delta_x\) is a discrete measure. The points \(x\) are the keys of the dictionary (tuples) and the weights \(w_x\) are the values.
  • domain – if density is a dictionary, it will be set by its keys. If callable, then domain must be a list of tuples defining the domain’s box. If None is given, it will be set to \([-1, 1]^n\)
integral(f)[source]

Calculates \(\int_{domain} fd\mu\).

Parameters:f – the integrand
Returns:the value of the integral
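The behavior of such an integral in the two density cases can be sketched as follows (a toy stand-in, not the Measure class itself):

```python
def integral(f, density):
    """Integrate f against a measure given either as a discrete dict
    {x: w_x} (mu = sum of w_x * delta_x) or as a callable density h over
    a fixed 1-d grid (a crude Riemann-sum stand-in for h(x) dx on [-1, 1])."""
    if isinstance(density, dict):
        return sum(w * f(x) for x, w in density.items())
    dx = 1 / 1000.0
    xs = [i * dx for i in range(-1000, 1000)]
    return sum(density(x) * f(x) * dx for x in xs)

# Discrete measure mu = 0.5*delta_1 + 0.5*delta_3, so the integral of x is 2.0
v = integral(lambda x: x, {1.0: 0.5, 3.0: 0.5})
```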
norm(p, f)[source]

Computes the p-norm of f with respect to the current measure, i.e., \((\int_{domain}|f|^p d\mu)^{1/p}\).

Parameters:
  • p – a positive real number
  • f – the function whose norm is desired.
Returns:

\(\|f\|_{p, \mu}\)

class NpyProximation.Regression(points, dim=None)[source]

Given a set of points, i.e., a list P of tuples of equal length, this class computes the best approximation of a function that fits the data, in the following sense:

  • if no extra parameter is provided, meaning that an object is initiated like R = Regression(P), then calling R.fit() returns the linear regression that fits the data.
  • if at initiation the parameter deg=n is set, then R.fit() returns the polynomial regression of degree n.
  • if a basis of functions is provided by means of an OrthSystem object (R.SetOrthSys(orth)), then calling R.fit() returns the best approximation that can be found using the basis functions of the orth object.
Parameters:
  • points – a list of points to be fitted or a callable to be approximated
  • dim – dimension of the domain
SetFuncSpc(sys)[source]

Sets the basis of the orthogonal system.

Parameters:sys – an orthsys.OrthSystem object.
Returns:None

Note

For technical reasons, the measure needs to be given via SetMeasure method. Otherwise, the Lebesgue measure on \([-1, 1]^n\) is assumed.

SetMeasure(meas)[source]

Sets the default measure for approximation.

Parameters:meas – a measure.Measure object
Returns:None
fit()[source]

Fits the best curve based on the optionally provided orthogonal basis. If no basis is provided, it fits a polynomial of the degree given at initiation.

Returns:The fit.

Sensitivity Analysis

Sensitivity analysis of a dataset based on a fit, sklearn style. The core functionality is provided by SALib.

class sensapprx.CorrelationThreshold(threshold=0.7)[source]

Selects a minimal set of features based on a given (Pearson) correlation threshold. The transformer omits the maximum number of highly correlated features and makes sure that the correlation among the remaining features does not exceed the given threshold.

Parameters:threshold – the threshold for selecting correlated pairs.
fit(X, y=None)[source]

Finds the Pearson correlations among all features, picks the pairs whose absolute correlation exceeds the given threshold, and then selects a minimal set of features with low mutual correlation

Parameters:
  • X – Training data
  • y – Target values (default: None)
Returns:

self
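The greedy idea can be sketched in plain Python (an illustrative stand-in, not the transformer's actual implementation): compute pairwise Pearson correlations and drop the later feature of every pair above the threshold.

```python
def pearson(u, v):
    # Pearson correlation of two equal-length sequences.
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)

def select_features(columns, threshold=0.7):
    """Greedily drop one feature from every pair whose absolute
    Pearson correlation exceeds the threshold."""
    keep = list(range(len(columns)))
    i = 0
    while i < len(keep):
        j = i + 1
        while j < len(keep):
            if abs(pearson(columns[keep[i]], columns[keep[j]])) > threshold:
                del keep[j]          # drop the later of the correlated pair
            else:
                j += 1
        i += 1
    return keep

x0 = [1.0, 2.0, 3.0, 4.0]
x1 = [2.1, 4.0, 6.2, 8.1]        # nearly proportional to x0 -> dropped
x2 = [1.0, -1.0, 1.0, -1.0]      # weakly correlated with x0 -> kept
kept = select_features([x0, x1, x2], threshold=0.7)
```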

fit_transform(X, y=None, **fit_params)[source]

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X – numpy array of shape [n_samples, n_features]; Training set.
  • y – numpy array of shape [n_samples]; Target values.
Returns:

Transformed array

class sensapprx.SensAprx(n_features_to_select=10, regressor=None, method='sobol', margin=0.2, num_smpl=500, num_levels=5, grid_jump=1, num_resmpl=8, reduce=False, domain=None, probs=None)[source]

Transform data to select the most influential factors according to a regressor that fits the data.

Parameters:
  • n_features_to_select – int; number of top features to be selected
  • regressor – a sklearn style regressor to fit the data for sensitivity analysis
  • method – str; the sensitivity analysis method; default ‘sobol’, other options are ‘morris’ and ‘delta-mmnt’
  • margin – domain margin, default: .2
  • num_smpl – number of samples to perform the analysis, default: 500
  • num_levels – number of levels for morris analysis, default: 5
  • grid_jump – grid jump for morris analysis, default: 1
  • num_resmpl – number of resamples for moment independent analysis, default: 8
  • reduce – whether to reduce the data points to uniques and calculate the averages of the target or not, default: False
  • domain – pre-calculated unique points; if None and reduce is True, unique points will be found
  • probs – pre-calculated values associated to domain points
fit(X, y)[source]

Fits the regressor to the data (X, y) and performs a sensitivity analysis on the result of the regression.

Parameters:
  • X – Training data
  • y – Target values
Returns:

self
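The idea behind first-order sensitivity indices can be illustrated without SALib. The toy estimator below computes \(S_i = \mathrm{Var}(E[y\mid x_i])/\mathrm{Var}(y)\) by grouping samples on each input's value over a deterministic grid; for y = x1 it attributes all variance to the first input. All names are hypothetical.

```python
from itertools import product
from statistics import mean, pvariance

def first_order_indices(samples, ys):
    """Estimate S_i = Var(E[y | x_i]) / Var(y) by grouping samples on
    the value of each input (a toy stand-in for the Sobol analysis
    that SALib performs)."""
    total = pvariance(ys)
    out = []
    for i in range(len(samples[0])):
        groups = {}
        for s, y in zip(samples, ys):
            groups.setdefault(s[i], []).append(y)
        cond_means = [mean(g) for g in groups.values()]
        out.append(pvariance(cond_means) / total)
    return out

grid = [g / 4 for g in range(5)]      # {0, .25, .5, .75, 1}
X = list(product(grid, grid))         # a full 5x5 grid of (x1, x2)
y = [x1 for x1, x2 in X]              # y depends only on x1
s = first_order_indices(X, y)         # s[0] is 1, s[1] is 0
```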

fit_transform(X, y=None, **fit_params)[source]

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X – numpy array of shape [n_samples, n_features]; Training set.
  • y – numpy array of shape [n_samples]; Target values.
Returns:

Transformed array

Optimized Pipeline Detector

class aml.AML(config=None, length=5, scoring='accuracy', cat_cols=None, surrogates=None, min_random_evals=15, cv=None, check_point='./', stack_res=True, stack_probs=True, stack_decision=True, verbose=1, n_jobs=-1)[source]

A class that accepts a nested dictionary with machine learning libraries as its keys and a dictionary of their parameters and their ranges as value of each key and finds an optimum combination based on training data.

Parameters:
  • config – A dictionary whose keys are scikit-learn-style objects (as strings) and its corresponding values are dictionaries of the parameters and their acceptable ranges/values
  • length – default=5; Maximum number of objects in generated pipelines
  • scoring – default=’accuracy’; The scoring method to be optimized. Must follow the sklearn scoring signature
  • cat_cols – default=None; The list of indices of categorical columns
  • surrogates – default=None; A list of 4-tuples determining surrogates. The first element of each tuple is a scikit-learn regressor, the second is the number of iterations for which this surrogate is estimated and optimized, the third is the sampling strategy, and the fourth is the scipy.optimize solver
  • min_random_evals – default=15; Number of randomly sampled initial values for hyperparameters
  • cv – default=`ShuffleSplit(n_splits=3, test_size=.25)`; The cross validation method
  • check_point – default=’./’; The path where the optimization results will be stored
  • stack_res – default=True; the res flag passed to StackingEstimator
  • stack_probs – default=True; the probs flag passed to StackingEstimator
  • stack_decision – default=True; the decision flag passed to StackingEstimator
  • verbose – default=1; Level of output details
  • n_jobs – int, default=-1; number of processes to run in parallel
add_surrogate(estimator, itrs, sampling=None, optim='L-BFGS-B')[source]

Adding a regressor for surrogate optimization procedure.

Parameters:
  • estimator – A scikit-learn style regressor
  • itrs – Number of iterations the estimator needs to be fitted and optimized
  • sampling – default= BoxSample; The sampling strategy (CompactSample, BoxSample or SphereSample)
  • optim – default=’L-BFGS-B’; the scipy.optimize solver
Returns:

None

eoa_fit(X, y, **kwargs)[source]

Applies evolutionary optimization methods to find an optimum pipeline

Parameters:
  • X – Training data
  • y – Corresponding observations
  • kwargs – EOA parameters
Returns:

self

fit(X, y)[source]

Generates and optimizes all legitimate pipelines. The best pipeline can be retrieved from self.best_estimator_

Parameters:
  • X – Training data
  • y – Corresponding observations
Returns:

self

get_top(num=5)[source]

Finds the top num pipelines

Parameters:num – Number of pipelines to be returned
Returns:An OrderedDict of top models
optimize_pipeline(seq, X, y)[source]

Constructs and optimizes a pipeline according to the steps passed through seq which is a tuple of estimators and transformers.

Parameters:
  • seq – the tuple of steps of the pipeline to be optimized
  • X – numpy array of training features
  • y – numpy array of training values
Returns:

the optimized pipeline and its score

types()[source]

Recognizes the type of each estimator to determine legitimate placement of each

Returns:None
class aml.StackingEstimator(estimator, res=True, probs=True, decision=True)[source]

Meta-transformer for adding predictions and/or class probabilities as synthetic feature(s).

Parameters:
  • estimator – object with fit, predict, and predict_proba methods. The estimator to generate synthetic features from.
  • res – True (default), stacks the final result of estimator
  • probs – True (default), stacks probabilities calculated by estimator
  • decision – True (default), stacks the result of decision function of the estimator
fit(X, y=None, **fit_params)[source]

Fit the StackingEstimator meta-transformer.

Parameters:
  • X – array-like of shape (n_samples, n_features). The training input samples.
  • y – array-like, shape (n_samples,). The target values (integers that correspond to classes in classification, real numbers in regression).
  • fit_params – Other estimator-specific parameters.
Returns:

self, object. Returns a copy of the estimator

set_params(**params)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:params – dict; estimator parameters.
Returns:self – estimator instance.
transform(X)[source]

Transform data by adding synthetic feature(s).

Parameters:X – numpy ndarray, (n_samples, n_components). New data, where n_samples is the number of samples and n_components is the number of components.
Returns:X_transformed: array-like, shape (n_samples, n_features + 1) or (n_samples, n_features + 1 + n_classes) for classifier with predict_proba attribute; The transformed feature set.
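The stacking step can be sketched with a toy estimator (both names below are hypothetical): a fitted estimator's prediction is appended to each row as a synthetic column, mirroring the res=True behavior.

```python
class MeanClassifier:
    """A toy stand-in estimator: predicts 1 when the first feature
    exceeds the training mean."""
    def fit(self, X, y=None):
        self.mean_ = sum(row[0] for row in X) / len(X)
        return self

    def predict(self, X):
        return [1 if row[0] > self.mean_ else 0 for row in X]

def stack_features(estimator, X):
    # Append the estimator's prediction to each row as a synthetic feature.
    preds = estimator.predict(X)
    return [row + [p] for row, p in zip(X, preds)]

X = [[1.0], [2.0], [3.0], [4.0]]
est = MeanClassifier().fit(X)      # training mean is 2.5
Xt = stack_features(est, X)        # each row gains one synthetic column
```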
class aml.Words(letters, last=None, first=None, repeat=False)[source]

This class takes a set as alphabet and generates words of a given length accordingly. A Words instance accepts the following parameters:

Parameters:
  • letters – is a set of letters (symbols) to make up the words
  • last – a subset of letters that are allowed to appear at the end of a word
  • first – a subset of letters that can only appear at the beginning of a word
  • repeat – whether consecutive occurrence of a letter is allowed
Generate(l)[source]

Generates the set of legitimate words of length l

Parameters:l – int, the length of words
Returns:set of all legitimate words of length l
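A plain-Python sketch of the same word generation using itertools (the helper is hypothetical, not the Words API): enumerate all candidate words, then filter by the first/last/repeat constraints.

```python
from itertools import product

def generate_words(letters, length, last=None, first=None, repeat=False):
    """All words of the given length over `letters`, optionally restricting
    the first and last symbols and forbidding consecutive repeats."""
    words = set()
    for w in product(letters, repeat=length):
        if first is not None and w[0] not in first:
            continue                     # wrong starting letter
        if last is not None and w[-1] not in last:
            continue                     # wrong ending letter
        if not repeat and any(a == b for a, b in zip(w, w[1:])):
            continue                     # consecutive repeat not allowed
        words.add(w)
    return words

# Words of length 2 over {a, b, c} ending in 'c', no consecutive repeats.
ws = generate_words({"a", "b", "c"}, 2, last={"c"})
```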

MLTrace: A machine learning progress tracker

This module provides some basic functionality to track the process of machine learning model development. It sets up a SQLite db-file and stores selected models, graphs, and data (for convenience) and recovers them as requested.

mltrace uses peewee and pandas for data manipulation.

It also has built-in capabilities to generate some typical machine learning plots and graphs.
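mltrace manages its tables through peewee; the preserve/recover round trip it implements can be sketched with the standard library alone (illustrative only, not mltrace's schema):

```python
import pickle
import sqlite3

# A minimal stdlib sketch of pickling a model into a SQLite table and
# recovering it later (an in-memory database stands in for the db-file).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE saved (id INTEGER PRIMARY KEY, name TEXT, blob BLOB)")

model = {"estimator": "LinearRegression", "params": {"fit_intercept": True}}
con.execute("INSERT INTO saved (name, blob) VALUES (?, ?)",
            ("demo", pickle.dumps(model)))
con.commit()

# Recover: fetch the blob by name and unpickle it.
blob, = con.execute("SELECT blob FROM saved WHERE name = ?", ("demo",)).fetchone()
recovered = pickle.loads(blob)
```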

class mltrace.Data(*args, **kwargs)[source]

The class to generate the `data` table in the SQLite db-file. This table stores the whole given data for convenience.

DoesNotExist

alias of DataDoesNotExist

class mltrace.MLModel(*args, **kwargs)[source]

The class to generate the `mlmodel` table in the SQLite db-file. It stores the scikit-learn scheme of the model/pipeline, its parameters, etc.

DoesNotExist

alias of MLModelDoesNotExist

class mltrace.Metrics(*args, **kwargs)[source]

The class to generate the `metrics` table in the SQLite db-file. This table stores the calculated metrics of each stored model.

DoesNotExist

alias of MetricsDoesNotExist

class mltrace.Plots(*args, **kwargs)[source]

The class to generate the `plots` table in the SQLite db-file. This table stores matplotlib plots associated to each model.

DoesNotExist

alias of PlotsDoesNotExist

class mltrace.Saved(*args, **kwargs)[source]

The class to generate the `saved` table in the SQLite db-file. It keeps the pickled version of a stored model that can be later recovered.

DoesNotExist

alias of SavedDoesNotExist

class mltrace.Task(*args, **kwargs)[source]

The class to generate the `task` table in the SQLite db-file. This table keeps basic information about the task on hand, e.g., the task name, a brief description, target column, and columns to be ignored.

DoesNotExist

alias of TaskDoesNotExist

class mltrace.Weights(*args, **kwargs)[source]

The class to generate the `weights` table in the SQLite db-file. Stores some sensitivity measures, correlations, etc.

DoesNotExist

alias of WeightsDoesNotExist

class mltrace.mltrack(task, task_id=None, db_name='mltrack.db', cv=None)[source]

This class instantiates an object that tracks the ML activities and stores them upon request.

Parameters:
  • task – str; the task name
  • task_id – the id of an existing task, used if the name is not provided.
  • db_name – a file name for the SQLite database
  • cv – the default cross validation method, must be a valid cv based on sklearn.model_selection; default: ShuffleSplit(n_splits=3, test_size=.25)
FeatureWeights(weights=('pearson', 'variance'), **kwargs)[source]

Calculates the requested weights and logs them

Parameters:
  • weights – a list of weights, a subset of {‘pearson’, ‘variance’, ‘relieff’, ‘surf’, ‘sobol’, ‘morris’, ‘delta_mmnt’, ‘info-gain’}
  • kwargs – all input acceptable by skrebate.ReliefF, skrebate.surf, sensapprx.SensAprx
Returns:

None

LoadModel(mid)[source]

Loads a model corresponding to an id

Parameters:mid – the model id
Returns:an unfitted model
static LoadPlot(pid)[source]

Loads a matplotlib plot

Parameters:pid – the id of the plot
Returns:a matplotlib figure
LogMetrics(mdl, cv=None)[source]

Logs metrics of an already logged model using a cross validation method

Parameters:
  • mdl – the model to be measured
  • cv – cross validation method
Returns:

a dictionary of all measures with their corresponding values for the model

LogModel(mdl, name=None)[source]

Log a machine learning model

Parameters:
  • mdl – a scikit-learn compatible estimator/pipeline
  • name – an arbitrary string to name the model
Returns:

modified instance of mdl which carries a new attribute mltrack_id as its id.

PreserveModel(mdl)[source]

Pickles and preserves an already logged model

Parameters:mdl – a logged model
Returns:None
RecoverModel(mdl_id)[source]

Recovers a pickled model

Parameters:mdl_id – a valid mltrack_id
Returns:a fitted model
RegisterData(source_df, target)[source]

Registers a pandas DataFrame into the SQLite database. Upon a call, it also sets self.X and self.y which are numpy arrays.

Parameters:
  • source_df – the pandas DataFrame to be stored
  • target – the name of the target column to be predicted
Returns:

None

TopFeatures(num=10)[source]

Returns the top num features in the data based on calculated weights

Parameters:num – number of top features to return
Returns:an OrderedDict of top features
UpdateModel(mdl, name)[source]

Updates an already logged model which has mltrack_id set.

Parameters:
  • mdl – a scikit-learn compatible estimator/pipeline
  • name – an arbitrary string to name the model
Returns:

None

UpdateTask(data)[source]

Updates the current task info.

Parameters:data

a dictionary that may include some of the following keys:

  • ’name’: the corresponding value will replace the current task name
  • ’description’: the corresponding value will replace the current description
  • ’ignore’: the corresponding value will replace the current ignored columns
Returns:None
allModels()[source]

Lists all logged models as a pandas DataFrame

Returns:a pandas DataFrame
allPlots(mdl_id)[source]

Lists all stored plots for a model with mdl_id as a pandas DataFrame

Parameters:mdl_id – a valid mltrack_id
Returns:a pandas DataFrame
allPreserved()[source]

Lists all pickled models as a pandas DataFrame

Returns:a pandas DataFrame
allTasks()[source]

Lists all tasks as a pandas DataFrame

Returns:a pandas DataFrame
static cumulative_gain_curve(y_true, y_score, pos_label=None)[source]

This function generates the points necessary to plot the Cumulative Gains chart. Note: this implementation is restricted to the binary classification task.

Parameters:
  • y_true – (array-like, shape (n_samples)): True labels of the data.
  • y_score – (array-like, shape (n_samples)): Target scores, can either be probability estimates of the positive class, confidence values, or non-thresholded measure of decisions (as returned by decision_function on some classifiers).
  • pos_label – (int or str, default=None): Label considered as positive and others are considered negative
Returns:

  • percentages (numpy.ndarray): An array containing the x-axis values for plotting the Cumulative Gains chart.
  • gains (numpy.ndarray): An array containing the y-axis values for one curve of the Cumulative Gains chart.

Raise:

ValueError: If y_true is not composed of 2 classes. The Cumulative Gain Chart is only relevant in binary classification.
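A self-contained sketch of the computation (illustrative, not the exact mltrace implementation): sort samples by descending score, then record the fraction of all positives captured within each top-ranked prefix.

```python
def cumulative_gain_points(y_true, y_score, pos_label=1):
    """Points of the cumulative-gains curve for binary labels."""
    if len(set(y_true)) != 2:
        raise ValueError("cumulative gains is defined for binary labels only")
    order = sorted(range(len(y_score)), key=lambda i: -y_score[i])
    total_pos = sum(1 for t in y_true if t == pos_label)
    percentages, gains, captured = [], [], 0
    for rank, i in enumerate(order, start=1):
        captured += (y_true[i] == pos_label)       # positives seen so far
        percentages.append(rank / len(order))      # fraction of samples ranked
        gains.append(captured / total_pos)         # fraction of positives captured
    return percentages, gains

# A perfect scorer captures all positives in the top half of the ranking.
p, g = cumulative_gain_points([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.2])
```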

static getBest(metric)[source]

Finds the model with the best metric.

Parameters:metric – the metric to find the best stored model for
Returns:the model with the best metric
get_data()[source]

Retrieves data in numpy format

Returns:numpy arrays X, y
get_dataframe()[source]

Retrieves data in pandas DataFrame format

Returns:pandas DataFrame containing all data
heatmap(corr_df=None, sort_by=None, ascending=False, font_size=3, cmap='gnuplot2', idx_col='feature', ignore=())[source]

Plots a heatmap from the values of the dataframe corr_df

Parameters:
  • corr_df – value container
  • idx_col – the column whose values will be used as index
  • sort_by – the dataframe will be sorted in descending order by the values of this column; if None, the first column is used

  • font_size – font size, default 3
  • cmap – color mapping; must be one of matplotlib’s named colormaps: ‘viridis’, ‘plasma’, ‘inferno’, ‘magma’, ‘cividis’, ‘Greys’, ‘Purples’, ‘Blues’, ‘Greens’, ‘Oranges’, ‘Reds’, ‘YlOrBr’, ‘YlOrRd’, ‘OrRd’, ‘PuRd’, ‘RdPu’, ‘BuPu’, ‘GnBu’, ‘PuBu’, ‘YlGnBu’, ‘PuBuGn’, ‘BuGn’, ‘YlGn’, ‘binary’, ‘gist_yarg’, ‘gist_gray’, ‘gray’, ‘bone’, ‘pink’, ‘spring’, ‘summer’, ‘autumn’, ‘winter’, ‘cool’, ‘Wistia’, ‘hot’, ‘afmhot’, ‘gist_heat’, ‘copper’, ‘PiYG’, ‘PRGn’, ‘BrBG’, ‘PuOr’, ‘RdGy’, ‘RdBu’, ‘RdYlBu’, ‘RdYlGn’, ‘Spectral’, ‘coolwarm’, ‘bwr’, ‘seismic’, ‘twilight’, ‘twilight_shifted’, ‘hsv’, ‘Pastel1’, ‘Pastel2’, ‘Paired’, ‘Accent’, ‘Dark2’, ‘Set1’, ‘Set2’, ‘Set3’, ‘tab10’, ‘tab20’, ‘tab20b’, ‘tab20c’, ‘flag’, ‘prism’, ‘ocean’, ‘gist_earth’, ‘terrain’, ‘gist_stern’, ‘gnuplot’, ‘gnuplot2’, ‘CMRmap’, ‘cubehelix’, ‘brg’, ‘gist_rainbow’, ‘rainbow’, ‘jet’, ‘nipy_spectral’, ‘gist_ncar’

Returns:

matplotlib pyplot instance

plot_calibration_curve(mdl, name, fig_index=1, bins=10)[source]

Plots calibration curves.

Parameters:
  • mdl – object type that implements the “fit” and “predict” methods; An object of that type which is cloned for each validation.
  • name – string; Title for the chart.
  • fig_index – default=1; the matplotlib figure index
  • bins – number of bins to partition samples
Returns:

a matplotlib plot

plot_cumulative_gain(mdl, title='Cumulative Gains Curve', figsize=None, title_fontsize='large', text_fontsize='medium')[source]

Generates the Cumulative Gains plot from labels and scores/probabilities. The cumulative gains chart is used to determine the effectiveness of a binary classifier. A detailed explanation can be found at http://mlwiki.org/index.php/Cumulative_Gain_Chart. The implementation here works only for binary classification.

Parameters:
  • mdl – object type that implements the “fit” and “predict” methods; An object of that type which is cloned for each validation.
  • title – (string, optional): Title of the generated plot. Defaults to “Cumulative Gains Curve”.
  • figsize – (2-tuple, optional): Tuple denoting figure size of the plot, e.g. (6, 6). Defaults to None.
  • title_fontsize – (string or int, optional): Matplotlib-style fontsizes. Use e.g. “small”, “medium”, “large” or integer values. Defaults to “large”.
  • text_fontsize – (string or int, optional): Matplotlib-style fontsizes. Use e.g. “small”, “medium”, “large” or integer values. Defaults to “medium”.

Returns:

ax (matplotlib.axes.Axes): The axes on which the plot was drawn.

plot_learning_curve(mdl, title, ylim=None, cv=None, n_jobs=1, train_sizes=None, **kwargs)[source]

Generate a simple plot of the test and training learning curve.

Parameters:
  • mdl – object type that implements the “fit” and “predict” methods; An object of that type which is cloned for each validation.
  • title – string; Title for the chart.
  • measure – string, a performance measure; must be one of the following: accuracy, f1, precision, recall, roc_auc
  • ylim – tuple, shape (ymin, ymax), optional; Defines minimum and maximum y values plotted.
  • cv

    int, cross-validation generator or an iterable, optional; Determines the cross-validation splitting strategy. Possible inputs for cv are:

    • None, to use the default 3-fold cross-validation,
    • integer, to specify the number of folds.
    • An object to be used as a cross-validation generator.
    • An iterable yielding train/test splits.

    For integer/None inputs, if y is binary or multiclass, StratifiedKFold is used. If the mdl is not a classifier or if y is neither binary nor multiclass, KFold is used.

  • n_jobs – integer, optional; Number of jobs to run in parallel (default 1).
Returns:

a matplotlib plot

plot_lift_curve(mdl, title='Lift Curve', figsize=None, title_fontsize='large', text_fontsize='medium')[source]

Generates the Lift Curve from labels and scores/probabilities. The lift curve is used to determine the effectiveness of a binary classifier. A detailed explanation can be found at http://www2.cs.uregina.ca/~dbd/cs831/notes/lift_chart/lift_chart.html. The implementation here works only for binary classification.

Parameters:
  • mdl – object type that implements the “fit” and “predict” methods; An object of that type which is cloned for each validation.
  • title – (string, optional): Title of the generated plot. Defaults to “Lift Curve”.
  • figsize – (2-tuple, optional): Tuple denoting figure size of the plot e.g. (6, 6). Defaults to None.
  • title_fontsize – (string or int, optional): Matplotlib-style fontsizes. Use e.g. “small”, “medium”, “large” or integer-values. Defaults to “large”.
  • text_fontsize – (string or int, optional): Matplotlib-style fontsizes. Use e.g. “small”, “medium”, “large” or integer-values. Defaults to “medium”.
Returns:

ax (matplotlib.axes.Axes): The axes on which the plot was drawn.

plot_roc_curve(mdl, label=None)[source]

The ROC curve, modified from Hands-On Machine learning with Scikit-Learn.

Parameters:
  • mdl – object type that implements the “fit” and “predict” methods; An object of that type which is cloned for each validation.
  • label – string; label for the chart.
Returns:

a matplotlib plot

class mltrace.np2df(data, clmns=None)[source]

A class to convert numpy ndarray to a pandas DataFrame. It produces a callable object which returns a pandas.DataFrame

Parameters:
  • data – numpy.ndarray of data
  • clmns – a list of titles for the pandas DataFrame column names. If None, column names of the form Cnum are produced, where num runs over the column indices of the ndarray.

synthdata Module

This module provides a basic framework for generating synthetic data resembling an existing dataset. One can determine the type of each field and the possible values for each field. Then the InspectData class will produce data based on the given types. Moreover, one can associate marginal distributions to each field or a joint distribution for data generation. If no distribution is associated, the data will be generated uniformly over the required ranges.

Supported data types

The following data types are supported:

  • SynthBin – support for binary data, i.e., data fields consisting of 0, 1 values
  • SynthInt – support for integer-valued data
  • SynthReal – support for real-valued data
  • SynthCat – support for categorical data, i.e., discrete variables whose values are predetermined
  • SynthDate – support for datetime data

Each of these data types accepts data, which is a 1-d numpy.array, and rv, which is a scipy.stats distribution implementing rv.rvs to generate samples. Among the above, SynthInt and SynthReal accept two parameters a and b, which are the lower and upper bounds of the sampling interval, respectively. SynthDate accepts frmt, which determines the date formatting for the input data.

Generating Synthetic Data

The SynthData class is responsible for generating synthetic data based on types, distributions and relations defined on the data. One initiates an instance as:

sd = SynthData(df, default_rv='uniform', distribution_type='marginal', rv=None)

where df is the pandas dataframe that will be synthesized. The rest of arguments are optional:

  • default_rv – determines the default distribution for the fields with no associated distribution. Ignored if distribution_type is set to ‘joint’.
  • distribution_type – determines whether the distribution(s) calculated based on df are marginals or a single joint distribution.
  • rv – a predefined distribution to be used as the joint distribution.

To set the type of a column of df one should use set_type method. This method accepts a list of column names, their types and a tuple of initiating parameters. Every column in the columns’ list will be given the same type. The type could be either an instance of SynthBin, SynthInt, SynthReal, SynthCat, SynthDate, or a string determining the type, e.g., ‘bin’, ‘int’, ‘real’, ‘cat’, ‘date’. If no type is associated to a column, it is assumed to be of categorical type.

The final command which generates the synthetic data is sample(num) where ‘num’ is the number of synthetic samples to be generated. This method will return a pandas.DataFrame containing synthetic data of size ‘num’.

Constraints

It is quite common that the values of some fields in a record depend on other fields. Simple constraints on the values of a field and relations between pairs can be handled using field objects.

If it is required to impose a constraint on a column, one can use the where() method. The statement to add a constraint generally looks like sd.where(field('clmn') > val) or sd.where(field('clmn1') <= field('clmn2')). The acceptable operators are ==, !=, >, <, >=, <=, in, nin. The operators in and nin check membership of the elements of ‘clmn’ either in ‘val’, which has to be an iterable, or in the column ‘clmn2’. The in stands for belonging and nin stands for not in.
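A minimal sketch of how such field comparators can work (hypothetical; the actual field object may differ): each comparison returns a predicate over a record, and a synthetic record is kept only if all constraints hold.

```python
class field:
    """Toy comparator object: each comparison returns a predicate over
    a record (a dict mapping column name -> value)."""
    def __init__(self, name):
        self.name = name

    def __gt__(self, other):
        if isinstance(other, field):
            return lambda rec: rec[self.name] > rec[other.name]
        return lambda rec: rec[self.name] > other

    def __le__(self, other):
        if isinstance(other, field):
            return lambda rec: rec[self.name] <= rec[other.name]
        return lambda rec: rec[self.name] <= other

def satisfies(record, constraints):
    # Rejection test: keep a synthetic record only if all constraints hold.
    return all(c(record) for c in constraints)

cons = [field("a") > 1, field("a") <= field("b")]
ok = satisfies({"a": 2, "b": 3}, cons)    # both constraints hold
bad = satisfies({"a": 0, "b": 3}, cons)   # violates a > 1
```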

Code documentation

class synthdat.SynthBase(data=None, rv=None)[source]

The base class for various synthetic data types.

static get_val(x)[source]

Converts the value of x into a numeric value that can be handled by random distributions

Parameters:x – the value to be converted into numeric
Returns:the corresponding numeric value
ret_val(X)[source]

Similar to set_val but works on the data stored in X

Parameters:X – the data to be converted to numbers
Returns:translation of X
set_val()[source]

Generates the numeric translation of the given data.

Returns:list of values
class synthdat.SynthBin(data=None, rv=None)[source]

Support for binary data, i.e., the data fields consisting of 0, 1 values

get_val(x)[source]

Converts the value of x into a numeric value that can be handled by random distributions

Parameters:x – the value to be converted into numeric
Returns:the corresponding numeric value
ret_val(X)[source]

Similar to set_val but works on the data stored in X

Parameters:X – the data to be converted to numbers
Returns:translation of X
set_val()[source]

Generates the numeric translation of the given data.

Returns:list of values
class synthdat.SynthCat(data, rv=None)[source]

Support for categorical type of data, i.e., the discrete variables whose values are predetermined;

get_val(x)[source]

Converts the value of x into a numeric value that can be handled by random distributions

Parameters:x – the value to be converted into numeric
Returns:the corresponding numeric value
ret_val(X)[source]

Similar to set_val but works on the data stored in X

Parameters:X – the data to be converted to numbers
Returns:translation of X
set_val()[source]

Generates the numeric translation of the given data.

Returns:list of values
class synthdat.SynthData(df, default_rv='uniform', distribution_type='marginal', rv=None)[source]

A class which takes a real pandas.DataFrame and produces synthetic data similar to the real data based on types and distributions provided by the user and/or extracted out of original data.

Parameters:
  • df – a pandas.DataFrame containing original data.
  • default_rv – the default distribution of columns; default ‘uniform’, may also be ‘normal’. Only effective if distribution_type is ‘marginal’; otherwise it is ignored.
  • distribution_type – default ‘marginal’. Determines the type of distribution. If ‘joint’, then either a normal distribution is fitted to the provided data, or rv is used if rv is not None.
  • rv – default None. The joint distribution of the variables. Only effective if distribution_type is ‘joint’.
filter(df)[source]

Filters ‘df’ to remove records that violate the constraints

Parameters:df – the dataframe to be filtered
Returns:the filtered dataframe
generate(num)[source]

(internal) Generates ‘num’ synthetic data records without considering constraints

Parameters:num – number of samples
Returns:pandas.DataFrame
sample(num)[source]

Produces ‘num’ records of synthetic data following given types, distributions and constraints

Parameters:num – number of synthetic data records
Returns:a dataframe consisting of ‘num’ synthetic records.
set_type(clmns, typ, param=None)[source]

Define the type of columns.

Parameters:
  • clmns – a list of df columns
  • typ – the associated type, either a string (‘bin’, ‘int’, ‘real’, ‘cat’, ‘date’) or an instance of SynthBin, SynthInt, SynthReal, SynthCat, SynthDate.
  • param – parameters passed to the synthetic data type when a string is given for ‘typ’. It can be a pair (a, b) for the ‘int’ and ‘real’ types, and just the format string for ‘date’.
transform()[source]

(internal) Analyses and initializes data types and converts them to numerical values.

Returns:None
where(cns)[source]

Adds a constraint on the values of a column using field objects.

Parameters:cns – the constraint like field(clmn1) > val1 or field(clmn1) <= field(clmn2).
Returns:None
class synthdat.SynthDate(data, frmt='%Y-%m-%d', rv=None)[source]

Support for datetime data;

get_val(x)[source]

Converts the value of x into a numeric value that can be handled by random distributions

Parameters:x – the value to be converted into numeric
Returns:the corresponding numeric value
ret_val(X)[source]

Similar to set_val but works on the data stored in X

Parameters:X – the data to be converted to numbers
Returns:translation of X
set_val()[source]

Generates the numeric translation of the given data.

Returns:list of values
class synthdat.SynthInt(a=None, b=None, data=None, rv=None)[source]

Support for integer valued data;

get_val(x)[source]

Converts the value of x into a numeric value that can be handled by random distributions

Parameters:x – the value to be converted into numeric
Returns:the corresponding numeric value
ret_val(X)[source]

Similar to set_val but works on the data stored in X

Parameters:X – the data to be converted to numbers
Returns:translation of X
set_val()[source]

Generates the numeric translation of the given data.

Returns:list of values
class synthdat.SynthReal(a=None, b=None, data=None, rv=None)[source]

Support for real valued data;

get_val(x)[source]

Converts the value of x into a numeric value that can be handled by random distributions

Parameters:x – the value to be converted into numeric
Returns:the corresponding numeric value
ret_val(X)[source]

Similar to set_val but works on the data stored in X

Parameters:X – the data to be converted to numbers
Returns:translation of X
set_val()[source]

Generates the numeric translation of the given data.

Returns:list of values
class synthdat.field(fld)[source]

A generic class to handle simple constraints on columns. Accepts only one parameter which refers to a column in the DataFrame.