Code Documentation¶
Surrogate Random Search¶
This module provides basic functionality to optimize an expensive black-box function based on Surrogate Random Search. The Surrogate Random Search (SRS) method attempts to approximate an optimal solution to the following problem:
minimize \(f(x)\)
 subject to
 \(g_i(x)\ge0~\) \(~i=1,\dots, m,\)
where arbitrary evaluations of \(f\) are not a viable option. Plain random search is guaranteed to converge to a local solution, but the convergence is usually very slow and most information about \(f\) is discarded except for the best candidate. SRS tries to use all information acquired about \(f\) so far during the iterations. At the \(i^{th}\) iteration, SRS replaces \(f\) by a surrogate \(\hat{f}_i\) that enjoys many nice analytical properties which make its optimization an easier task. Solving the above optimization problem with \(f\) replaced by \(\hat{f}_i\) then yields a more informed candidate \(x_i\) for the next iteration. If a certain number of iterations do not produce a better candidate, the method falls back to random sampling to collect more information about \(f\). The surrogate \(\hat{f}_i\) can be found in many different ways, such as (non)linear regression, Gaussian process regression, etc., and SurrogateSearch has no preference; by default it uses a polynomial regression of degree 3 if no regressor is provided. Any regressor following the scikit-learn API is acceptable. Note that regressors usually require a minimum number of data points to function properly.
There are various ways to sample a random point in the feasible space, and the choice affects the performance of SRS. SurrogateSearch implements two methods: BoxSample and SphereSample. One can also choose whether to shrink the volume of the box or sphere that the sample is selected from.
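The loop described above can be sketched in a few self-contained lines. This is an illustrative stand-in, not the package's implementation: the degree-2 polynomial surrogate, the SLSQP sub-solver, and the toy quadratic objective are all assumptions of the sketch.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def srs_minimize(f, bounds, n_random=10, n_iter=15, seed=0):
    """Minimal SRS loop: random warm-up samples, then repeatedly fit a
    polynomial surrogate on all evaluations and minimize the surrogate."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds, dtype=float).T
    X = rng.uniform(lo, hi, size=(n_random, len(bounds)))
    y = np.array([f(p) for p in X])
    surrogate = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
    for _ in range(n_iter):
        surrogate.fit(X, y)                       # cheap model of the expensive f
        s_hat = lambda p: surrogate.predict(p.reshape(1, -1))[0]
        res = minimize(s_hat, X[np.argmin(y)], bounds=bounds, method="SLSQP")
        x_new = np.clip(res.x, lo, hi)            # more informed candidate
        X = np.vstack([X, x_new])
        y = np.append(y, f(x_new))
    return X[np.argmin(y)], y.min()

# toy expensive objective with known minimizer (1, -2)
f = lambda p: (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2
x_best, f_best = srs_minimize(f, bounds=[(-5.0, 5.0), (-5.0, 5.0)])
```

Because the toy objective is itself quadratic, the degree-2 surrogate fits it exactly after the warm-up samples and the very first surrogate minimization already lands near the true minimizer.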

class structsearch.BaseSample(**kwargs)[source]¶
This is the base class for various sampling methods.

class structsearch.BoxSample(**kwargs)[source]¶
Generates samples out of a box around a given center.
Parameters:  init_radius – float; the initial half-length of the edges of the sampling box (default: 2.)
 contraction – float; the contraction factor for repeated sampling.

class structsearch.Categorical(items, **kwargs)[source]¶
A list of possible values for the search algorithm to choose from.
Parameters: items – a list of possible values for a parameter

class structsearch.HDReal(a, b, **kwargs)[source]¶
An n-dimensional box of real numbers corresponding to the classification groups (e.g. class_weight); a is the list of lower bounds and b is the list of upper bounds.
Parameters:  a – a tuple of lower bounds for each dimension
 b – a tuple of upper bounds for each dimension

class structsearch.Integer(a=None, b=None, **kwargs)[source]¶
The range of possible values for an integer variable; a is the minimum and b is the maximum. Defaults are -infinity and +infinity.
Parameters:  a – the lower bound of the integer interval defined by the instance (accepting ‘numpy.inf’)
 b – the upper bound of the integer interval defined by the instance (accepting ‘numpy.inf’)

class structsearch.Real(a=None, b=None, **kwargs)[source]¶
The range of possible values for a real variable; a is the minimum and b is the maximum. Defaults are -infinity and +infinity.
Parameters:  a – the lower bound of the (closed) interval defined by the instance (accepting ‘numpy.inf’)
 b – the upper bound of the (closed) interval defined by the instance (accepting ‘numpy.inf’)

class structsearch.SphereSample(**kwargs)[source]¶
Generates samples out of a sphere around a given center.
Parameters:  init_radius – float; the initial radius of the sampling sphere (default: 2.)
 contraction – float; the contraction factor for repeated sampling.
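The two sampling strategies can be illustrated with plain numpy. This is a sketch of the semantics described above, not the package's code; the uniform-in-volume radius draw for the sphere and the contraction schedule are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

def box_sample(center, radius):
    """Uniform draw from the axis-aligned box with half-edge `radius`."""
    center = np.asarray(center, dtype=float)
    return center + rng.uniform(-radius, radius, size=center.shape)

def sphere_sample(center, radius):
    """Uniform draw from the ball of radius `radius` around `center`."""
    center = np.asarray(center, dtype=float)
    d = rng.normal(size=center.shape)
    d /= np.linalg.norm(d)                               # uniform direction
    r = radius * rng.uniform() ** (1.0 / center.size)    # uniform in volume
    return center + r * d

radius, contraction = 2.0, 0.95
for _ in range(5):               # repeated sampling contracts the region
    _ = box_sample(np.zeros(3), radius)
    _ = sphere_sample(np.zeros(3), radius)
    radius *= contraction
```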

class structsearch.SurrogateRandomCV(estimator, params, scoring=None, fit_params=None, n_jobs=1, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score='raise', return_train_score=True, max_iter=50, min_evals=25, regressor=None, sampling=<class 'structsearch.CompactSample'>, radius=None, contraction=0.95, search_sphere=False, optimizer='scipy', scipy_solver='SLSQP', task_name='optim_task', warm_start=True, Continue=False, max_itr_no_prog=10000, ineqs=(), init=None, optimithon_t_method='Cauchy_x', optimithon_dd_method='BFGS', optimithon_ls_method='Backtrack', optimithon_ls_bt_method='Armijo', optimithon_br_func='Carrol', optimithon_penalty=1000000.0, optimithon_max_iter=100, optimithon_difftool=0.0)[source]¶
Surrogate Random Search optimization over hyperparameters.
The parameters of the estimator used to apply these methods are optimized by cross-validated search over parameter settings.
In contrast to GridSearchCV, not all parameter values are tried out, but rather a fixed number of parameter settings is sampled from the specified distributions. The number of parameter settings that are tried is given by max_iter.
Parameters:  estimator – estimator object. An object of that type is instantiated for each search point. This object is assumed to implement the scikit-learn estimator API. Either estimator needs to provide a score function, or scoring must be passed.
 params – dict; dictionary with parameter names (string) as keys and domains as lists of parameter ranges to try. Domains are either lists of categorical (string) values or 2-element lists specifying a min and max for integer or float parameters.
 scoring – string, callable or None, default=None; a string (see model evaluation documentation) or a scorer callable object/function with signature scorer(estimator, X, y). If None, the score method of the estimator is used.
 max_iter – int, default=50; number of parameter settings that are sampled. max_iter trades off runtime vs quality of the solution. Consider increasing n_points if you want to try more parameter settings in parallel.
 min_evals – int, default=25; number of random evaluations before employing an approximation for the response surface.
 n_jobs – int, default=1; number of processes to run in parallel
 fit_params – dict, optional; Parameters to pass to the fit method.
 pre_dispatch – int or string, optional; controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:
 None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs
 An int, giving the exact number of total jobs that are spawned
 A string, giving an expression as a function of n_jobs, as in ‘2*n_jobs’
 cv – int, cross-validation generator or an iterable, optional; determines the cross-validation splitting strategy. Possible inputs for cv are:
 None, to use the default 3-fold cross-validation,
 an integer, to specify the number of folds in a (Stratified)KFold,
 an object to be used as a cross-validation generator,
 an iterable yielding train/test splits.
For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.
 refit – boolean, default=True; refit the best estimator with the entire dataset. If False, it is impossible to make predictions using this SurrogateRandomCV instance after fitting.
 verbose – int, default=0; prints internal information about the progress of each iteration.
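Before the surrogate takes over, the first min_evals rounds behave like plain cross-validated random search over params. A minimal self-contained sketch of that warm-up loop, using scikit-learn only; the hypothetical one-parameter grid and the log-uniform draw are assumptions of the sketch, not the class's internals:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, random_state=0)
params = {"C": (1e-3, 1e2)}          # a 2-element (min, max) domain, as in `params`

best_score, best_params = -np.inf, None
for _ in range(10):                  # the random warm-up that min_evals controls
    lo, hi = params["C"]
    C = float(np.exp(rng.uniform(np.log(lo), np.log(hi))))   # log-uniform draw
    score = cross_val_score(LogisticRegression(C=C, max_iter=500), X, y, cv=3).mean()
    if score > best_score:           # keep the best candidate seen so far
        best_score, best_params = score, {"C": C}
```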

fit(X, y=None, groups=None, **fit_params)[source]¶
Run fit with all sets of parameters.
Parameters:  X – array-like, shape = [n_samples, n_features]; training vector, where n_samples is the number of samples and n_features is the number of features.
 y – array-like, shape = [n_samples] or [n_samples, n_output], optional; target relative to X for classification or regression; None for unsupervised learning.
 groups – array-like, with shape (n_samples,), optional; group labels for the samples used while splitting the dataset into train/test sets.
 fit_params – dict of string -> object; parameters passed to the fit method of the estimator.
Returns: self

class structsearch.SurrogateSearch(objective, **kwargs)[source]¶
An implementation of Surrogate Random Search (SRS).
Parameters:  objective – a callable, the function to be minimized
 ineq – a list of callables which represent the constraints (default: [])
 task_name – str; a name to refer to the optimization task, used to store and restore previously acquired data (default: ‘optim_task’)
 bounds – a list of tuples of real numbers representing the bounds on each variable (default: None)
 max_iter – int; the maximum number of iterations (default: 50)
 radius – float; the initial radius of the sampling region (default: 2.)
 contraction – float; the rate of radius contraction (default: .9)
 sampling – the sampling method, either BoxSample or SphereSample (default: SphereSample)
 search_sphere – boolean; whether to fit the surrogate function on a neighbourhood of the current candidate or over all sampled points (default: False)
 deg – int; degree of the polynomial regressor if one chooses to fit polynomial surrogates (default: 3)
 min_evals – int; minimum number of samples before fitting a surrogate (the default is calculated as if the surrogate is a polynomial of degree 3)
 regressor – a regressor (scikit-learn style) to find a surrogate
 scipy_solver – str; the scipy solver (‘COBYLA’ or ‘SLSQP’) used to solve the local optimization problem at each iteration (default: ‘COBYLA’)
 max_itr_no_prog – int; maximum number of iterations with no progress (default: infinity)
 Continue – boolean; continue the progress from where it was interrupted (default: False)
 warm_start – boolean; use data from previous attempts, but start from the first iteration (default: False)
 verbose – boolean; whether to report the progress on the command line (default: False)
Evolutionary Optimization Algorithm¶

class eoa.EOA(population, fitness, **kwargs)[source]¶
This is a base class acting as an umbrella to process an evolutionary optimization algorithm.
Parameters:  population – the whole possible population as a list
 fitness – the fitness evaluation; accepts an OrderedDict of individuals with their corresponding fitness and updates their fitness
 init_pop – default=`UniformRand`; the python class that initiates the initial population
 recomb – default=`UniformCrossover`; the python class that defines how to combine parents to produce children
 mutation – default=`Mutation`; the python class that performs mutation on the offspring population
 termination – default=`MaxGenTermination`; the python class that determines the termination criterion
 elitism – default=`Elites`; the python class that decides how to handle elitism
 num_parents – the size of the initial parents population
 parents_porp – default=0.1; the size of the initial parents population given as a portion of the whole population (only used if num_parents is not given)
 elits_porp – default=0.2; the proportion of offspring to be replaced by elite parents
 mutation_prob – the probability that a component will be mutated (default: 0.05)
 kwargs –
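The moving parts named above (initial population, crossover, mutation, elitism, termination) can be seen working together in a minimal self-contained loop. This is an illustrative sketch on the classic OneMax problem, not the EOA class itself; the ranking-based mate selection and the specific rates are assumptions of the sketch.

```python
import numpy as np

def eoa_maximize(fitness, n_bits=20, pop_size=30, generations=60,
                 mutation_prob=0.05, elite_frac=0.2, seed=0):
    """Minimal evolutionary loop: rank the population, carry elites over,
    then fill the offspring with uniform crossover and bit-flip mutation."""
    rng = np.random.default_rng(seed)
    pop = rng.integers(0, 2, size=(pop_size, n_bits))
    for _ in range(generations):
        scores = np.array([fitness(ind) for ind in pop])
        pop = pop[np.argsort(scores)[::-1]]                  # best first
        n_elite = max(1, int(elite_frac * pop_size))
        children = [pop[i].copy() for i in range(n_elite)]   # elitism
        while len(children) < pop_size:
            i, j = rng.integers(0, pop_size // 2, size=2)    # mate the top half
            mask = rng.integers(0, 2, size=n_bits).astype(bool)
            child = np.where(mask, pop[i], pop[j])           # uniform crossover
            flip = rng.random(n_bits) < mutation_prob        # bit-flip mutation
            child[flip] ^= 1
            children.append(child)
        pop = np.array(children)
    scores = np.array([fitness(ind) for ind in pop])
    return pop[np.argmax(scores)], int(scores.max())

# OneMax: the all-ones string is the unique optimum
best, score = eoa_maximize(lambda ind: int(ind.sum()))
```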
Hilbert Space based regression¶

exception NpyProximation.Error(*args)[source]¶
Generic errors that may occur in the course of a run.

class NpyProximation.FunctionBasis[source]¶
This class generates two typical bases of functions: polynomial and trigonometric.

static Fourier(n, deg, l=1.0)[source]¶
Returns the Fourier basis of degree deg in n variables with period l.
Parameters:  n – number of variables
 deg – the maximum degree of trigonometric combinations in the basis
 l – the period
Returns: the raw basis consisting of trigonometric functions of degrees up to deg
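In one variable, such a raw basis is simply the constant function plus a sin/cos pair per frequency. A hypothetical self-contained illustration (not the package's implementation) of that one-variable case:

```python
import numpy as np

def fourier_basis_1d(deg, l=1.0):
    """One-variable Fourier basis up to degree `deg` with period `l`:
    the constant function plus a sin/cos pair for each frequency 1..deg."""
    basis = [lambda t: np.ones_like(np.asarray(t, dtype=float))]
    for k in range(1, deg + 1):
        w = 2.0 * np.pi * k / l
        basis.append(lambda t, w=w: np.sin(w * t))   # bind w per frequency
        basis.append(lambda t, w=w: np.cos(w * t))
    return basis

B = fourier_basis_1d(deg=2, l=1.0)   # 1 + 2*2 = 5 basis functions
x = np.arange(400) / 400.0           # one full period, endpoint excluded
```

Distinct basis elements are orthogonal in \(L_2\) over a full period, which a discrete mean over the grid above reproduces almost exactly.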

class NpyProximation.FunctionSpace(dim=1, measure=None, basis=None)[source]¶
A class that facilitates a few types of computations over function spaces of type \(L_2(X, \mu)\).
Parameters:  dim – the dimension of X (default: 1)
 measure – an object of type Measure representing \(\mu\)
 basis – a finite basis of functions to construct a subspace of \(L_2(X, \mu)\)

FormBasis()[source]¶
Call this method to generate the orthogonal basis corresponding to the given basis. The result will be stored in a property called OrthBase, which is a list of functions that are orthogonal to each other with respect to the measure measure over the given domain.

Series(f)[source]¶
Given a function f, this method finds and returns the coefficients of the series that approximates f as a linear combination of the elements of the orthogonal basis \(B\); in symbols, \(\sum_{b\in B}\langle f, b\rangle b\).
Returns: the list of coefficients \(\langle f, b\rangle\) for \(b\in B\)
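FormBasis and Series amount to Gram-Schmidt orthonormalization followed by orthogonal projection in \(L_2\). A self-contained discrete sketch of those two steps (a simple Riemann-sum inner product on \([-1, 1]\) stands in for the measure; this is not the package's code):

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 2001)         # quadrature grid on [-1, 1]
h = x[1] - x[0]
inner = lambda f, g: np.sum(f * g) * h   # discrete L2 inner product

def orthonormalize(raw):
    """Gram-Schmidt on functions sampled over the grid (the FormBasis step)."""
    ortho = []
    for f in raw:
        f = f.astype(float)
        for q in ortho:
            f = f - inner(f, q) * q      # remove components along earlier basis
        ortho.append(f / np.sqrt(inner(f, f)))
    return ortho

raw = [x ** k for k in range(4)]         # raw monomial basis 1, x, x^2, x^3
B = orthonormalize(raw)                  # Legendre-like orthonormal family

def series(f_vals):
    """Coefficients <f, b> of the best approximation in span(B) (the Series step)."""
    return [inner(f_vals, b) for b in B]

coeffs = series(x ** 2)                  # x^2 already lies in the span
approx = sum(c * b for c, b in zip(coeffs, B))
```

Since \(x^2\) lies in the span of the basis, the truncated series reconstructs it up to floating-point error.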

class NpyProximation.HilbertRegressor(deg=3, base=None, meas=None, fspace=None)[source]¶
Regression using Hilbert space techniques, scikit-learn style.
Parameters:  deg – int, default=3; the degree of polynomial regression. Only used if base is None
 base – list, default=None; a list of functions to form an orthogonal function basis
 meas – NpyProximation.Measure, default=None; the measure to form the \(L_2(\mu)\) space. If None, a discrete measure will be constructed based on fit inputs
 fspace – NpyProximation.FunctionSpace, default=None; the function subspace of \(L_2(\mu)\). If None, it will be initiated according to self.meas

class NpyProximation.Measure(density=None, domain=None)[source]¶
Constructs a measure \(\mu\) based on density and domain.
Parameters:  density – the density over the domain:
 if None is given, it assumes the uniform distribution;
 if a callable h is given, then \(d\mu=h(x)dx\);
 if a dictionary is given, then \(\mu=\sum w_x\delta_x\), a discrete measure. The points \(x\) are the keys of the dictionary (tuples) and the weights \(w_x\) are the values.
 domain – if density is a dictionary, the domain will be set by its keys. If density is callable, then domain must be a list of tuples defining the domain’s box. If None is given, it will be set to \([-1, 1]^n\)

class NpyProximation.Regression(points, dim=None)[source]¶
Given a set of points P, i.e., a list of tuples of equal length, this class computes the best approximation of a function that fits the data, in the following sense:
 if no extra parameters are provided, meaning that an object is initiated like R = Regression(P), then calling R.fit() returns the linear regression that fits the data;
 if at initiation the parameter deg=n is set, then R.fit() returns the polynomial regression of degree n;
 if a basis of functions is provided by means of an OrthSystem object (R.SetOrthSys(orth)), then calling R.fit() returns the best approximation that can be found using the basis functions of the orth object.
Parameters:  points – a list of points to be fitted or a callable to be approximated
 dim – dimension of the domain

SetFuncSpc(sys)[source]¶
Sets the basis of the orthogonal function space.
Parameters: sys – an orthsys.OrthSystem object. Returns: None
Note: for technical reasons, the measure needs to be given via the SetMeasure method. Otherwise, the Lebesgue measure on \([-1, 1]^n\) is assumed.
Sensitivity Analysis¶
Sensitivity analysis of a dataset based on a fit, scikit-learn style. The core functionality is provided by SALib.

class sensapprx.CorrelationThreshold(threshold=0.7)[source]¶
Selects a minimal set of features based on a given (Pearson) correlation threshold. The transformer omits as many highly correlated features as possible and makes sure that the remaining features are not correlated beyond the given threshold.
Parameters: threshold – the threshold for selecting correlated pairs.
fit(X, y=None)[source]¶
Finds the Pearson correlations among all features, selects the pairs whose absolute correlation is above the given threshold, and selects a minimal set of features with low correlation.
Parameters:  X – Training data
 y – Target values (default: None)
Returns: self
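One concrete way to realize the selection rule is a greedy pass over the columns; a self-contained sketch of that logic (an illustration, not necessarily the transformer's exact algorithm):

```python
import numpy as np

def correlation_threshold_select(X, threshold=0.7):
    """Greedy pass over columns: keep a feature only if its absolute Pearson
    correlation with every already-kept feature stays below the threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < threshold for k in keep):
            keep.append(j)
    return keep

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
# column 1 is a near-duplicate of column 0, so it should be dropped
X = np.column_stack([a, a + 0.01 * rng.normal(size=200), b])
selected = correlation_threshold_select(X, threshold=0.7)
```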

fit_transform(X, y=None, **fit_params)[source]¶
Fit to data, then transform it.
Fits the transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters:  X – numpy array of shape [n_samples, n_features]; Training set.
 y – numpy array of shape [n_samples]; Target values.
Returns: Transformed array


class sensapprx.SensAprx(n_features_to_select=10, regressor=None, method='sobol', margin=0.2, num_smpl=500, num_levels=5, grid_jump=1, num_resmpl=8, reduce=False, domain=None, probs=None)[source]¶
Transforms data to select the most sensitive factors according to a regressor that fits the data.
Parameters:  n_features_to_select – int; number of top features to be selected
 regressor – a scikit-learn style regressor to fit the data for sensitivity analysis
 method – str; the sensitivity analysis method. Default ‘sobol’; other options are ‘morris’ and ‘delta-mmnt’
 margin – domain margin (default: .2)
 num_smpl – number of samples to perform the analysis (default: 500)
 num_levels – number of levels for Morris analysis (default: 5)
 grid_jump – grid jump for Morris analysis (default: 1)
 num_resmpl – number of resamples for moment-independent analysis (default: 8)
 reduce – whether to reduce the data points to unique ones and calculate the averages of the target (default: False)
 domain – precalculated unique points; if None and reduce is True, unique points will be found
 probs – precalculated values associated to the domain points
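The actual indices are computed by SALib; as a self-contained stand-in, here is a crude one-at-a-time sensitivity ranking over a fitted regressor. This illustrates the fit-then-analyze idea only; the function name, the midpoint baseline, and the variance-of-sweep index are assumptions of the sketch, not SALib's methods.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def oat_sensitivity(model, bounds, n=500, seed=0):
    """Crude one-at-a-time indices: the variance of the model's prediction
    when one feature sweeps its range and the rest sit at the midpoint."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds, dtype=float).T
    mid = (lo + hi) / 2.0
    out = []
    for j in range(len(bounds)):
        pts = np.tile(mid, (n, 1))
        pts[:, j] = rng.uniform(lo[j], hi[j], size=n)   # sweep feature j only
        out.append(np.var(model.predict(pts)))
    return np.array(out)

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(300, 3))
y = 5.0 * X[:, 0] + 0.5 * X[:, 2] + 0.01 * rng.normal(size=300)  # feature 0 dominates
model = LinearRegression().fit(X, y)
sens = oat_sensitivity(model, bounds=[(-1.0, 1.0)] * 3)
```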

fit(X, y)[source]¶
Fits the regressor to the data (X, y) and performs a sensitivity analysis on the result of the regression.
Parameters:  X – Training data
 y – Target values
Returns: self

fit_transform(X, y=None, **fit_params)[source]¶
Fit to data, then transform it.
Fits the transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters:  X – numpy array of shape [n_samples, n_features]; Training set.
 y – numpy array of shape [n_samples]; Target values.
Returns: Transformed array
Optimized Pipeline Detector¶

class aml.AML(config=None, length=5, scoring='accuracy', cat_cols=None, surrogates=None, min_random_evals=15, cv=None, check_point='./', stack_res=True, stack_probs=True, stack_decision=True, verbose=1, n_jobs=1)[source]¶
A class that accepts a nested dictionary with machine learning libraries as its keys and a dictionary of their parameters and ranges as the value of each key, and finds an optimum combination based on training data.
Parameters:  config – a dictionary whose keys are scikit-learn-style objects (as strings) and whose corresponding values are dictionaries of the parameters and their acceptable ranges/values
 length – default=5; maximum number of objects in generated pipelines
 scoring – default=’accuracy’; the scoring method to be optimized. Must follow the scikit-learn scoring signature
 cat_cols – default=None; the list of indices of categorical columns
 surrogates – default=None; a list of 4-tuples determining surrogates. The first entry of each tuple is a scikit-learn regressor and the 2nd entry is the number of iterations that this surrogate needs to be estimated and optimized. The 3rd is the sampling strategy and the 4th is the scipy.optimize solver
 min_random_evals – default=15; number of randomly sampled initial values for hyperparameters
 cv – default=ShuffleSplit(n_splits=3, test_size=.25); the cross-validation method
 check_point – default=’./’; the path where the optimization results will be stored
 stack_res – default=True; StackingEstimator’s res
 stack_probs – default=True; StackingEstimator’s probs
 stack_decision – default=True; StackingEstimator’s decision
 verbose – default=1; level of output details
 n_jobs – int, default=1; number of processes to run in parallel

add_surrogate(estimator, itrs, sampling=None, optim='L-BFGS-B')[source]¶
Adds a regressor for the surrogate optimization procedure.
Parameters:  estimator – a scikit-learn style regressor
 itrs – number of iterations the estimator needs to be fitted and optimized
 sampling – default=BoxSample; the sampling strategy (CompactSample, BoxSample or SphereSample)
 optim – default=’L-BFGS-B’; the scipy.optimize solver
Returns: None

eoa_fit(X, y, **kwargs)[source]¶
Applies evolutionary optimization methods to find an optimum pipeline.
Parameters:  X – Training data
 y – Corresponding observations
 kwargs – EOA parameters
Returns: self

fit(X, y)[source]¶
Generates and optimizes all legitimate pipelines. The best pipeline can be retrieved from self.best_estimator_.
Parameters:  X – Training data
 y – Corresponding observations
Returns: self

get_top(num=5)[source]¶
Finds the top num pipelines.
Parameters: num – number of pipelines to be returned. Returns: an OrderedDict of the top models

optimize_pipeline(seq, X, y)[source]¶
Constructs and optimizes a pipeline according to the steps passed through seq, which is a tuple of estimators and transformers.
Parameters:  seq – the tuple of steps of the pipeline to be optimized
 X – numpy array of training features
 y – numpy array of training values
Returns: the optimized pipeline and its score

class aml.StackingEstimator(estimator, res=True, probs=True, decision=True)[source]¶
Meta-transformer for adding predictions and/or class probabilities as synthetic feature(s).
Parameters:  estimator – object with fit, predict, and predict_proba methods. The estimator to generate synthetic features from.
 res – True (default), stacks the final result of estimator
 probs – True (default), stacks probabilities calculated by estimator
 decision – True (default), stacks the result of decision function of the estimator

fit(X, y=None, **fit_params)[source]¶
Fit the StackingEstimator meta-transformer.
Parameters:  X – array-like of shape (n_samples, n_features); the training input samples.
 y – array-like, shape (n_samples,); the target values (integers that correspond to classes in classification, real numbers in regression).
 fit_params – other estimator-specific parameters.
Returns: self, object. Returns a copy of the estimator.

set_params(**params)[source]¶
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter>, so that it’s possible to update each component of a nested object.
Parameters: params – dict; estimator parameters.
Returns: self – estimator instance.

transform(X)[source]¶
Transform data by adding synthetic feature(s).
Parameters: X – numpy ndarray, {n_samples, n_components}; new data, where n_samples is the number of samples and n_components is the number of components. Returns: X_transformed: array-like, shape (n_samples, n_features + 1) or (n_samples, n_features + 1 + n_classes) for a classifier with a predict_proba attribute; the transformed feature set.

class aml.Words(letters, last=None, first=None, repeat=False)[source]¶
This class takes a set as an alphabet and generates words of a given length accordingly. A Words instance accepts the following parameters:
Parameters:  letters – is a set of letters (symbols) to make up the words
 last – a subset of letters that are allowed to appear at the end of a word
 first – a subset of letters that can only appear at the beginning of a word
 repeat – whether consecutive occurrences of a letter are allowed
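A simplified reading of these constraints can be sketched with itertools (treating first and last as sets of admissible letters; an illustration only, not the Words implementation):

```python
from itertools import product

def words(letters, length, first=None, last=None, repeat=True):
    """All words of the given length over `letters`, optionally restricting
    the first/last letter and forbidding consecutive repeats."""
    out = []
    for w in product(letters, repeat=length):
        if first is not None and w[0] not in first:
            continue
        if last is not None and w[-1] not in last:
            continue
        if not repeat and any(a == b for a, b in zip(w, w[1:])):
            continue
        out.append(w)
    return out

# length-2 words over {a, b, c} ending in 'c', with no consecutive repeats
ws = words(["a", "b", "c"], length=2, last={"c"}, repeat=False)
```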
MLTrace: A machine learning progress tracker¶
This module provides some basic functionality to track the process of machine learning model development. It sets up a SQLite db-file and stores selected models, graphs, and data (for convenience) and recovers them as requested. mltrace uses peewee and pandas for data manipulation. It also has built-in capabilities to generate some typical machine learning plots and graphs.

class mltrace.Data(*args, **kwargs)[source]¶
The class to generate the ‘data’ table in the SQLite db-file. This table stores the whole given data for convenience.

DoesNotExist¶
alias of DataDoesNotExist


class mltrace.MLModel(*args, **kwargs)[source]¶
The class to generate the ‘mlmodel’ table in the SQLite db-file. It stores the scikit-learn scheme of the model/pipeline, its parameters, etc.

DoesNotExist¶
alias of MLModelDoesNotExist


class mltrace.Metrics(*args, **kwargs)[source]¶
The class to generate the ‘metrics’ table in the SQLite db-file. This table stores the calculated metrics of each stored model.

DoesNotExist¶
alias of MetricsDoesNotExist


class mltrace.Plots(*args, **kwargs)[source]¶
The class to generate the ‘plots’ table in the SQLite db-file. This table stores matplotlib plots associated to each model.

DoesNotExist¶
alias of PlotsDoesNotExist


class mltrace.Saved(*args, **kwargs)[source]¶
The class to generate the ‘saved’ table in the SQLite db-file. It keeps the pickled version of a stored model that can be later recovered.

DoesNotExist¶
alias of SavedDoesNotExist


class mltrace.Task(*args, **kwargs)[source]¶
The class to generate the ‘task’ table in the SQLite db-file. This table keeps basic information about the task on hand, e.g., the task name, a brief description, the target column, and the columns to be ignored.

DoesNotExist¶
alias of TaskDoesNotExist


class mltrace.Weights(*args, **kwargs)[source]¶
The class to generate the ‘weights’ table in the SQLite db-file. Stores some sensitivity measures, correlations, etc.

DoesNotExist¶
alias of WeightsDoesNotExist


class mltrace.mltrack(task, task_id=None, db_name='mltrack.db', cv=None)[source]¶
This class instantiates an object that tracks the ML activities and stores them upon request.
Parameters:  task – str; the task name
 task_id – the id of an existing task, if the name is not provided
 db_name – a file name for the SQLite database
 cv – the default cross-validation method; must be a valid cv based on sklearn.model_selection (default: ShuffleSplit(n_splits=3, test_size=.25))

FeatureWeights(weights=('pearson', 'variance'), **kwargs)[source]¶
Calculates the requested weights and logs them.
Parameters:  weights – a list of weights, a subset of {‘pearson’, ‘variance’, ‘relieff’, ‘surf’, ‘sobol’, ‘morris’, ‘delta_mmnt’, ‘infogain’}
 kwargs – all input acceptable by skrebate.ReliefF, skrebate.SURF, and sensapprx.SensAprx
Returns: None

LoadModel(mid)[source]¶
Loads a model corresponding to an id.
Parameters: mid – the model id. Returns: an unfitted model

static LoadPlot(pid)[source]¶
Loads a matplotlib plot.
Parameters: pid – the id of the plot. Returns: a matplotlib figure

LogMetrics(mdl, cv=None)[source]¶
Logs metrics of an already logged model using a cross-validation method.
Parameters:  mdl – the model to be measured
 cv – cross validation method
Returns: a dictionary of all measures with their corresponding values for the model

LogModel(mdl, name=None)[source]¶
Logs a machine learning model.
Parameters:  mdl – a scikit-learn compatible estimator/pipeline
 name – an arbitrary string to name the model
Returns: a modified instance of mdl which carries a new attribute mltrack_id as its id.

PreserveModel(mdl)[source]¶
Pickles and preserves an already logged model.
Parameters: mdl – a logged model. Returns: None

RecoverModel(mdl_id)[source]¶
Recovers a pickled model.
Parameters: mdl_id – a valid mltrack_id. Returns: a fitted model

RegisterData(source_df, target)[source]¶
Registers a pandas DataFrame into the SQLite database. Upon a call, it also sets self.X and self.y, which are numpy arrays.
Parameters:  source_df – the pandas DataFrame to be stored
 target – the name of the target column to be predicted
Returns: None

TopFeatures(num=10)[source]¶
Returns the num top features in the data based on the calculated weights.
Parameters: num – number of top features to return. Returns: an OrderedDict of the top features

UpdateModel(mdl, name)[source]¶
Updates an already logged model which has mltrack_id set.
Parameters:  mdl – a scikitlearn compatible estimator/pipeline
 name – an arbitrary string to name the model
Returns: None

UpdateTask(data)[source]¶
Updates the current task info.
Parameters: data – a dictionary that may include some of the following as its keys:
 ‘name’: the corresponding value will replace the current task name
 ‘description’: the corresponding value will replace the current description
 ‘ignore’: the corresponding value will replace the current ignored columns
Returns: None

allPlots(mdl_id)[source]¶
Lists all stored plots for a model with mdl_id as a pandas DataFrame.
Parameters: mdl_id – a valid mltrack_id. Returns: a pandas DataFrame

static cumulative_gain_curve(y_true, y_score, pos_label=None)[source]¶
This function generates the points necessary to plot the cumulative gain. Note: this implementation is restricted to binary classification.
Parameters:  y_true – array-like, shape (n_samples); true labels of the data.
 y_score – array-like, shape (n_samples); target scores, which can be probability estimates of the positive class, confidence values, or non-thresholded measures of decisions (as returned by decision_function on some classifiers).
 pos_label – int or str, default=None; the label considered positive; all others are considered negative.
Returns: percentages (numpy.ndarray): an array containing the x-axis values for plotting the Cumulative Gains chart; gains (numpy.ndarray): an array containing the y-axis values for one curve of the Cumulative Gains chart.
Raises: ValueError: if y_true is not composed of 2 classes. The cumulative gain chart is only relevant in binary classification.
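The computation behind these points is short: sort by descending score, then track the cumulative fraction of positives captured. A self-contained sketch (an illustration of the idea, not this method's exact code; it assumes the positive label is 1):

```python
import numpy as np

def cumulative_gain(y_true, y_score):
    """Cumulative gains points: after inspecting the top-scored fraction x
    of samples, y is the fraction of all positives captured so far."""
    y_true = np.asarray(y_true)
    order = np.argsort(y_score)[::-1]          # descending score
    hits = np.cumsum(y_true[order])            # positives captured so far
    percentages = np.arange(1, len(y_true) + 1) / len(y_true)
    gains = hits / y_true.sum()
    return percentages, gains

y_true = [1, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.1]
pct, gains = cumulative_gain(y_true, y_score)
```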

static getBest(metric)[source]¶
Finds the model with the best metric.
Parameters: metric – the metric to find the best stored model for. Returns: the model with the best metric

get_dataframe()[source]¶
Retrieves the data in pandas DataFrame format.
Returns: a pandas DataFrame containing all data

heatmap(corr_df=None, sort_by=None, ascending=False, font_size=3, cmap='gnuplot2', idx_col='feature', ignore=())[source]¶
Plots a heatmap from the values of the dataframe corr_df.
Parameters:  corr_df – value container
 idx_col – the column whose values will be used as index
 sort_by – the dataframe will be sorted descending by the values of this column. If None, the first column is used
 font_size – font size (default: 3)
 cmap – color mapping. Must be one of the following:
’viridis’, ‘plasma’, ‘inferno’, ‘magma’, ‘cividis’, ‘Greys’, ‘Purples’,
’Blues’, ‘Greens’, ‘Oranges’, ‘Reds’, ‘YlOrBr’, ‘YlOrRd’, ‘OrRd’, ‘PuRd’,
’RdPu’, ‘BuPu’, ‘GnBu’, ‘PuBu’, ‘YlGnBu’, ‘PuBuGn’, ‘BuGn’, ‘YlGn’,
’binary’, ‘gist_yarg’, ‘gist_gray’, ‘gray’, ‘bone’, ‘pink’, ‘spring’,
’summer’, ‘autumn’, ‘winter’, ‘cool’, ‘Wistia’, ‘hot’, ‘afmhot’,
’gist_heat’, ‘copper’, ‘PiYG’, ‘PRGn’, ‘BrBG’, ‘PuOr’, ‘RdGy’, ‘RdBu’,
’RdYlBu’, ‘RdYlGn’, ‘Spectral’, ‘coolwarm’, ‘bwr’, ‘seismic’, ‘twilight’,
’twilight_shifted’, ‘hsv’, ‘Pastel1’, ‘Pastel2’, ‘Paired’, ‘Accent’,
’Dark2’, ‘Set1’, ‘Set2’, ‘Set3’, ‘tab10’, ‘tab20’, ‘tab20b’, ‘tab20c’,
’flag’, ‘prism’, ‘ocean’, ‘gist_earth’, ‘terrain’, ‘gist_stern’, ‘gnuplot’,
’gnuplot2’, ‘CMRmap’, ‘cubehelix’, ‘brg’, ‘gist_rainbow’, ‘rainbow’,
’jet’, ‘nipy_spectral’, ‘gist_ncar’
Returns: matplotlib pyplot instance

plot_calibration_curve(mdl, name, fig_index=1, bins=10)[source]¶
Plots calibration curves.
Parameters:  mdl – object type that implements the “fit” and “predict” methods; an object of that type which is cloned for each validation.
 name – string; title for the chart.
 bins – number of bins to partition the samples
Returns: a matplotlib plot

plot_cumulative_gain
(mdl, title='Cumulative Gains Curve', figsize=None, title_fontsize='large', text_fontsize='medium')[source]¶ Generates the Cumulative Gains Plot from labels and scores/probabilities The cumulative gains chart is used to determine the effectiveness of a binary classifier. A detailed explanation can be found at http://mlwiki.org/index.php/Cumulative_Gain_Chart. The implementation here works only for binary classification.
Parameters:  mdl – object type that implements the “fit” and “predict” methods; An object of that type which is cloned for each validation.
 title – (string, optional): Title of the generated plot. Defaults to “Cumulative Gains Curve”.
 figsize – (2-tuple, optional): Tuple denoting figure size of the plot, e.g. (6, 6). Defaults to None.
 title_fontsize – (string or int, optional): Matplotlib-style fontsizes. Use e.g. “small”, “medium”, “large” or integer values. Defaults to “large”.
 text_fontsize – (string or int, optional): Matplotlib-style fontsizes. Use e.g. “small”, “medium”, “large” or integer values. Defaults to “medium”.
Returns: ax (matplotlib.axes.Axes): The axes on which the plot was drawn.
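For intuition, the cumulative gain at a given sample fraction is the share of all positives captured when scanning cases from highest to lowest score. A minimal numpy sketch of that computation (independent of mltrace's own implementation):

```python
import numpy as np

def cumulative_gains(y_true, y_score):
    """Fraction of all positives captured within the top-k scored samples."""
    order = np.argsort(y_score)[::-1]   # highest scores first
    hits = np.cumsum(y_true[order])     # positives seen so far
    return hits / y_true.sum()          # normalize to [0, 1]

y_true = np.array([1, 0, 1, 0, 1])
y_score = np.array([0.9, 0.8, 0.7, 0.3, 0.2])
gains = cumulative_gains(y_true, y_score)
print(gains)  # rises to 1.0 once all positives are covered
```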

plot_learning_curve
(mdl, title, ylim=None, cv=None, n_jobs=1, train_sizes=None, **kwargs)[source]¶ Generate a simple plot of the test and training learning curve.
Parameters:  mdl – object type that implements the “fit” and “predict” methods; An object of that type which is cloned for each validation.
 title – string; Title for the chart.
 measure – string, a performance measure; must be one of the following: accuracy, f1, precision, recall, roc_auc
 ylim – tuple, shape (ymin, ymax), optional; Defines minimum and maximum y-values plotted.
 cv – int, cross-validation generator or an iterable, optional; Determines the cross-validation splitting strategy. Possible inputs for cv are:
 None, to use the default 3-fold cross-validation,
 integer, to specify the number of folds,
 an object to be used as a cross-validation generator,
 an iterable yielding train/test splits.
For integer/None inputs, if y is binary or multiclass, StratifiedKFold is used. If mdl is not a classifier or if y is neither binary nor multiclass, KFold is used.
 n_jobs – integer, optional; Number of jobs to run in parallel (default 1).
Returns: a matplotlib plot
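What a learning curve measures can be sketched independently of mltrace: fit a model on training subsets of growing size and score it on held-out data. A toy least-squares slope serves as the stand-in model here (not the library's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100)
y = 2.0 * X + rng.normal(scale=0.1, size=100)

train_sizes = [10, 30, 60, 90]
train_scores, test_scores = [], []
for n in train_sizes:
    # Fit a least-squares slope on the first n points (the stand-in "mdl").
    slope = (X[:n] @ y[:n]) / (X[:n] @ X[:n])
    # Record mean squared error on the training subset and on a held-out tail.
    train_scores.append(np.mean((slope * X[:n] - y[:n]) ** 2))
    test_scores.append(np.mean((slope * X[90:] - y[90:]) ** 2))

print(train_scores)
print(test_scores)
```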

plot_lift_curve
(mdl, title='Lift Curve', figsize=None, title_fontsize='large', text_fontsize='medium')[source]¶ Generates the Lift Curve from labels and scores/probabilities. The lift curve is used to determine the effectiveness of a binary classifier. A detailed explanation can be found at http://www2.cs.uregina.ca/~dbd/cs831/notes/lift_chart/lift_chart.html. The implementation here works only for binary classification.
Parameters:  mdl – object type that implements the “fit” and “predict” methods; An object of that type which is cloned for each validation.
 title – (string, optional): Title of the generated plot. Defaults to “Lift Curve”.
 figsize – (2-tuple, optional): Tuple denoting figure size of the plot, e.g. (6, 6). Defaults to None.
 title_fontsize – (string or int, optional): Matplotlib-style fontsizes. Use e.g. “small”, “medium”, “large” or integer values. Defaults to “large”.
 text_fontsize – (string or int, optional): Matplotlib-style fontsizes. Use e.g. “small”, “medium”, “large” or integer values. Defaults to “medium”.
Returns: ax (matplotlib.axes.Axes): The axes on which the plot was drawn.
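For intuition, lift at depth k is the number of positives captured in the top-k scored samples divided by the number expected under random ordering. A minimal numpy sketch (independent of mltrace's own implementation):

```python
import numpy as np

def lift_curve(y_true, y_score):
    """Lift at each depth k: positives captured in the top-k samples,
    divided by the positives expected under random ordering (base rate * k)."""
    order = np.argsort(y_score)[::-1]   # highest scores first
    hits = np.cumsum(y_true[order])
    depth = np.arange(1, len(y_true) + 1)
    base_rate = y_true.mean()
    return hits / (base_rate * depth)

y_true = np.array([1, 0, 1, 0])
y_score = np.array([0.9, 0.2, 0.8, 0.1])
lifts = lift_curve(y_true, y_score)
print(lifts)  # starts above 1 for a useful ranker, ends at exactly 1
```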

plot_roc_curve
(mdl, label=None)[source]¶ The ROC curve, modified from Hands-On Machine Learning with Scikit-Learn.
Parameters:  mdl – object type that implements the “fit” and “predict” methods; An object of that type which is cloned for each validation.
 label – string; label for the chart.
Returns: a matplotlib plot
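The quantities behind an ROC curve can be sketched directly: sweep a threshold over the scores and record the true-positive and false-positive rates. A minimal numpy version (an illustration, not this module's code):

```python
import numpy as np

def roc_points(y_true, y_score):
    """TPR and FPR at each threshold taken from the sorted unique scores."""
    thresholds = np.sort(np.unique(y_score))[::-1]
    P = y_true.sum()          # number of positives
    N = (1 - y_true).sum()    # number of negatives
    tpr = np.array([((y_score >= t) & (y_true == 1)).sum() / P for t in thresholds])
    fpr = np.array([((y_score >= t) & (y_true == 0)).sum() / N for t in thresholds])
    return fpr, tpr

y_true = np.array([1, 1, 0, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.1])
fpr, tpr = roc_points(y_true, y_score)
print(fpr, tpr)  # both end at 1.0 once the threshold drops below every score
```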

class
mltrace.
np2df
(data, clmns=None)[source]¶ A class to convert a numpy ndarray to a pandas DataFrame. It produces a callable object which returns a pandas.DataFrame.
Parameters:  data – numpy.ndarray data
 clmns – a list of titles for the pandas DataFrame column names. If None, it produces C{num}, where num ranges over the column indices of the ndarray.
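A rough pandas sketch of the described behavior (the exact C{num} naming format is an assumption, and this is not the library's implementation):

```python
import numpy as np
import pandas as pd

def np2df_sketch(data, clmns=None):
    """Convert a 2d ndarray to a DataFrame; if clmns is None, generate
    default names C0, C1, ... per column (mimicking the documented C{num})."""
    if clmns is None:
        clmns = [f"C{i}" for i in range(data.shape[1])]
    return pd.DataFrame(data, columns=clmns)

df = np2df_sketch(np.arange(6).reshape(3, 2))
print(df.columns.tolist())  # ['C0', 'C1']
```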
synthdata
Module¶
This module provides a basic framework for generating synthetic data resembling an existing dataset.
One can determine the type of each field and the possible values for each field. Then the InspectData class will
produce data based on the given types. Moreover, one can associate marginal distributions with each field, or a joint
distribution, for data generation. If no distribution is associated, the data will be generated uniformly over
the required ranges.
Supported data types¶
The following data types are supported:
SynthBin: Support for binary data, i.e., the data fields consisting of 0, 1 values;
SynthInt: Support for integer valued data;
SynthReal: Support for real valued data;
SynthCat: Support for categorical type of data, i.e., the discrete variables whose values are predetermined;
SynthDate: Support for datetime data;
Each of these data types accepts data, a 1d numpy.array, and rv, a scipy.stats distribution implementing rv.rvs to generate samples.
Among the above, SynthInt and SynthReal accept two parameters, a and b, which are the lower and upper bounds of the sampling interval respectively. SynthDate accepts frmt, which determines the date format of the input data.
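The role of a and b can be illustrated with a minimal stand-in for an integer type (a hypothetical helper, not synthdat's code):

```python
import numpy as np

def sample_int(a, b, num, rng=None):
    """Uniform integer samples on [a, b], the documented sampling interval,
    used when no rv distribution is associated with the field."""
    if rng is None:
        rng = np.random.default_rng(0)
    return rng.integers(a, b + 1, size=num)

samples = sample_int(1, 6, 1000)
print(samples.min(), samples.max())  # stays within the [a, b] bounds
```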
Generating Synthetic Data¶
The SynthData
class is responsible for generating synthetic data based on types, distributions and relations
defined on the data. One initiates an instance as:
sd = SynthData(df, default_rv='uniform', distribution_type='marginal', rv=None)
where df is the pandas dataframe that will be synthesized. The rest of the arguments are optional:
default_rv: determines the default distribution for those fields with no associated distribution. If distribution_type is set to ‘joint’ this will be ignored.
distribution_type: determines whether the distribution(s) calculated based on df are marginals or a single joint distribution.
rv: a predefined distribution used as the joint distribution.
To set the type of a column of df, one should use the set_type method. This method accepts a list of column names, their type, and a tuple of initialization parameters. Every column in the list will be given the same type. The type can be either an instance of SynthBin, SynthInt, SynthReal, SynthCat, SynthDate, or a string naming the type, e.g., ‘bin’, ‘int’, ‘real’, ‘cat’, ‘date’. If no type is associated with a column, it is assumed to be categorical.
The final command, which generates the synthetic data, is sample(num), where ‘num’ is the number of synthetic samples to be generated. This method returns a pandas.DataFrame containing ‘num’ synthetic records.
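A toy version of the default marginal-uniform behavior described above can be sketched as follows (a standalone illustration, not the SynthData class itself):

```python
import numpy as np
import pandas as pd

def uniform_synth(df, num, seed=0):
    """Generate num rows, sampling each numeric column uniformly over its
    observed [min, max] range -- the documented default when no
    distribution is associated with a field."""
    rng = np.random.default_rng(seed)
    out = {}
    for col in df.columns:
        lo, hi = df[col].min(), df[col].max()
        out[col] = rng.uniform(lo, hi, size=num)
    return pd.DataFrame(out)

real = pd.DataFrame({"x": [0.0, 1.0, 2.0], "y": [10.0, 20.0, 30.0]})
synth = uniform_synth(real, num=5)
print(synth.shape)  # (5, 2)
```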
Constraints¶
It is quite common that the values of some fields in a record depend on other fields. Simple constraints on the values of a field, and relations between pairs of fields, can be handled using field objects.
To impose a constraint on a column, one can use the where() method. A constraint statement generally looks like sd.where(field('clmn') > val) or sd.where(field('clmn1') <= field('clmn2')).
The acceptable operators are ==, !=, >, <, >=, <=, in, nin.
The operators in and nin check membership of the elements of ‘clmn’ either in ‘val’, which has to be an iterable, or in the column ‘clmn2’. Here in stands for membership and nin stands for “not in”.
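The effect of such constraints can be sketched as a simple filter over generated candidates, in plain pandas (a standalone illustration, not the library's field implementation):

```python
import pandas as pd

candidates = pd.DataFrame({"a": [1, 5, 3, 8], "b": [2, 4, 9, 7]})

# Roughly what sd.where(field('a') > 2) combined with
# sd.where(field('a') <= field('b')) would keep: rows satisfying both.
filtered = candidates[(candidates["a"] > 2) & (candidates["a"] <= candidates["b"])]
print(filtered["a"].tolist())  # only the row (a=3, b=9) survives
```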
Code documentation¶

class
synthdat.
SynthBase
(data=None, rv=None)[source]¶ The base class for various synthetic data types.

static
get_val
(x)[source]¶ Converts the value of x into a numeric value that can be handled by random distributions
Parameters: x – the value to be converted into numeric Returns: the corresponding numeric value


class
synthdat.
SynthBin
(data=None, rv=None)[source]¶ Support for binary data, i.e., the data fields consisting of 0, 1 values

get_val
(x)[source]¶ Converts the value of x into a numeric value that can be handled by random distributions
Parameters: x – the value to be converted into numeric Returns: the corresponding numeric value


class
synthdat.
SynthCat
(data, rv=None)[source]¶ Support for categorical type of data, i.e., the discrete variables whose values are predetermined;

get_val
(x)[source]¶ Converts the value of x into a numeric value that can be handled by random distributions
Parameters: x – the value to be converted into numeric Returns: the corresponding numeric value


class
synthdat.
SynthData
(df, default_rv='uniform', distribution_type='marginal', rv=None)[source]¶ A class which takes a real pandas.DataFrame and produces synthetic data similar to the real data based on types and distributions provided by the user and/or extracted out of original data.
Parameters:  df – a pandas.DataFrame containing the original data.
 default_rv – default distribution of columns; default ‘uniform’. Can also be ‘normal’. Only effective if distribution_type is ‘marginal’; otherwise it is ignored.
 distribution_type – default ‘marginal’. Determines the type of distribution. If ‘joint’, then either a normal distribution is calculated based on the provided data, or rv is used if rv is not None.
 rv – default None. The joint distribution of the variables. Only effective if distribution_type is ‘joint’.

filter
(df)[source]¶ Filters ‘df’ to remove records that violate the constraints
Parameters: df – the dataframe to be filtered Returns: the filtered dataframe

generate
(num)[source]¶ (internal) Generates num synthetic data records without considering constraints.
Parameters: num – number of samples Returns: pandas.DataFrame

sample
(num)[source]¶ Produces ‘num’ records of synthetic data following given types, distributions and constraints
Parameters: num – number of synthetic data records Returns: a dataframe consisting of ‘num’ synthetic records.

set_type
(clmns, typ, param=None)[source]¶ Defines the type of columns.
Parameters:  clmns – a list of df columns
 typ – the associated type, either a string (‘bin’, ‘int’, ‘real’, ‘cat’, ‘date’) or an instance of SynthBin, SynthInt, SynthReal, SynthCat, SynthDate.
 param – parameters passed to the synthetic data type if a string is given for ‘typ’. It can be a pair (a, b) for the ‘int’ and ‘real’ types, and just the format for ‘date’.

class
synthdat.
SynthDate
(data, frmt='%Y%m%d', rv=None)[source]¶ Support for datetime data.

get_val
(x)[source]¶ Converts the value of x into a numeric value that can be handled by random distributions
Parameters: x – the value to be converted into numeric Returns: the corresponding numeric value


class
synthdat.
SynthInt
(a=None, b=None, data=None, rv=None)[source]¶ Support for integer-valued data.

get_val
(x)[source]¶ Converts the value of x into a numeric value that can be handled by random distributions
Parameters: x – the value to be converted into numeric Returns: the corresponding numeric value


class
synthdat.
SynthReal
(a=None, b=None, data=None, rv=None)[source]¶ Support for real-valued data.

get_val
(x)[source]¶ Converts the value of x into a numeric value that can be handled by random distributions
Parameters: x – the value to be converted into numeric Returns: the corresponding numeric value
