Code Documentation

Evolutionary Optimization Algorithm

class eoa.EOA(population, fitness, **kwargs)[source]

A base class acting as an umbrella that orchestrates the steps of an evolutionary optimization algorithm.

Parameters:
  • population – The whole possible population as a list
  • fitness – The fitness evaluation. Accepts an OrderedDict of individuals with their corresponding fitness and updates their fitness
  • init_pop – default=`UniformRand`; The python class that initiates the initial population
  • recomb – default=`UniformCrossover`; The python class that defines how to combine parents to produce children
  • mutation – default=`Mutation`; The python class that performs mutation on offspring population
  • termination – default=`MaxGenTermination`; The python class that determines the termination criterion
  • elitism – default=`Elites`; The python class that decides how to handle elitism
  • num_parents – The size of initial parents population
  • parents_porp – default=0.1; The size of the initial parents population given as a proportion of the whole population (only used if num_parents is not given)
  • elits_porp – default=0.2; The proportion of offspring to be replaced by elite parents
  • mutation_prob – The probability that a component will be mutated (default: 0.05)
  • kwargs
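The interplay of these components can be sketched with a self-contained toy loop. This is plain illustrative Python, not the eoa API; every name below is hypothetical, and a simple OneMax objective stands in for a real fitness function.

```python
import random

def uniform_crossover(p1, p2):
    # Each gene is taken from either parent with probability 1/2.
    return [a if random.random() < 0.5 else b for a, b in zip(p1, p2)]

def mutate(ind, prob=0.05, genes=(0, 1)):
    # Each component is replaced by a random gene with probability `prob`.
    return [random.choice(genes) if random.random() < prob else g for g in ind]

def evolve(fitness, length=8, pop_size=20, generations=30, elite_frac=0.2, seed=0):
    random.seed(seed)
    pop = [[random.choice((0, 1)) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):                  # fixed-count termination
        pop.sort(key=fitness, reverse=True)
        elites = pop[: max(1, int(elite_frac * pop_size))]   # elitism
        children = []
        while len(children) < pop_size - len(elites):
            p1, p2 = random.sample(pop[: pop_size // 2], 2)  # parent selection
            children.append(mutate(uniform_crossover(p1, p2)))
        pop = elites + children
    return max(pop, key=fitness)

best = evolve(sum)   # maximize the number of 1s (OneMax)
```

Here the crossover, mutation, and elite-carryover steps mirror the roles of the recomb, mutation, and elitism classes, and the fixed generation count plays the part of MaxGenTermination.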
class eoa.MaxGenTermination(**kwargs)[source]

Termination condition: Whether the maximum number of generations has been reached or not

class eoa.UniformCrossover(**kwargs)[source]

Recombination procedure.

class eoa.UniformRand(**kwargs)[source]

Initial population initiation.

Hilbert Space based regression

exception NpyProximation.Error(*args)[source]

Generic errors that may occur in the course of a run.

class NpyProximation.FunctionBasis[source]

This class generates two typical bases of functions: polynomial and trigonometric.

static Fourier(n, deg, l=1.0)[source]

Returns the Fourier basis of degree deg in n variables with period l

Parameters:
  • n – number of variables
  • deg – the maximum degree of trigonometric combinations in the basis
  • l – the period
Returns:

the raw basis consisting of trigonometric functions of degrees up to deg

static Poly(n, deg)[source]

Returns a basis consisting of polynomials in n variables of degree at most deg.

Parameters:
  • n – number of variables
  • deg – highest degree of polynomials in the basis
Returns:

the raw basis consisting of polynomials of degrees up to deg
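Such a monomial basis can be enumerated by listing exponent tuples; a minimal sketch follows (the helper is hypothetical, not part of NpyProximation). A monomial basis in n variables of degree at most deg has \(\binom{n+deg}{deg}\) elements.

```python
from itertools import product
from math import comb

def poly_basis_exponents(n, deg):
    """All exponent tuples (e1, ..., en) with e1 + ... + en <= deg;
    each tuple represents the monomial x1**e1 * ... * xn**en."""
    return [e for e in product(range(deg + 1), repeat=n) if sum(e) <= deg]

exps = poly_basis_exponents(2, 3)   # n=2 variables, degree at most 3
assert len(exps) == comb(2 + 3, 3)  # C(5, 3) = 10 basis monomials
```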

class NpyProximation.FunctionSpace(dim=1, measure=None, basis=None)[source]

A class that facilitates a few types of computations over function spaces of type \(L_2(X, \mu)\)

Parameters:
  • dim – the dimension of ‘X’ (default: 1)
  • measure – an object of type Measure representing \(\mu\)
  • basis – a finite basis of functions spanning a subspace of \(L_2(X, \mu)\)
FormBasis()[source]

Call this method to generate the orthogonal basis corresponding to the given basis. The result is stored in a property called OrthBase, a list of functions that are mutually orthogonal with respect to the measure measure over the given domain.
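The orthogonalization can be sketched as Gram–Schmidt over sampled functions, using a discrete quadrature inner product as a stand-in for integration against the measure (all names below are hypothetical):

```python
def orthogonalize(vectors, weights):
    """Gram-Schmidt on sampled functions with respect to the discrete
    inner product <f, g> = sum_i w_i * f_i * g_i (a quadrature stand-in
    for integration against the measure)."""
    inner = lambda u, v: sum(w * a * b for w, a, b in zip(weights, u, v))
    orth = []
    for v in vectors:
        for u in orth:
            c = inner(v, u) / inner(u, u)
            v = [a - c * b for a, b in zip(v, u)]   # remove the u-component
        orth.append(v)
    return orth

# Sample 1, x, x^2 on a symmetric grid over [-1, 1]; uniform weights
# approximate the Lebesgue measure, and the result resembles Legendre polynomials.
xs = [i / 1000.0 for i in range(-1000, 1001)]
w = [2.0 / len(xs)] * len(xs)
basis = orthogonalize([[1.0] * len(xs), xs, [x * x for x in xs]], w)
```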

Series(f)[source]

Given a function f, this method finds and returns the coefficients of the series that approximates f as a linear combination of the elements of the orthogonal basis \(B\). In symbols \(\sum_{b\in B}\langle f, b\rangle b\).

Returns:the list of coefficients \(\langle f, b\rangle\) for \(b\in B\)
inner(f, g)[source]

Computes the inner product of the two parameters with respect to the measure measure, i.e., \(\int_Xf\cdot g d\mu\).

Parameters:
  • f – callable
  • g – callable
Returns:

the quantity of \(\int_Xf\cdot g d\mu\)

project(f, g)[source]

Finds the projection of f on g with respect to the inner product induced by the measure measure.

Parameters:
  • f – callable
  • g – callable
Returns:

the quantity \(\frac{\langle f, g\rangle}{\|g\|_2^2}g\)
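A numerical sketch of this projection with a uniform quadrature stand-in for the measure on \([-1, 1]\) (hypothetical helpers, not the NpyProximation API). For f(x) = x² + x and g(x) = x, \(\langle f, g\rangle = \int_{-1}^{1}(x^3 + x^2)dx = 2/3\) and \(\|g\|_2^2 = 2/3\), so the projection is exactly x.

```python
def inner(f, g, xs, w):
    # Discrete stand-in for the integral of f*g against the measure.
    return sum(wi * f(x) * g(x) for x, wi in zip(xs, w))

def project(f, g, xs, w):
    # proj_g f = (<f, g> / ||g||^2) * g
    c = inner(f, g, xs, w) / inner(g, g, xs, w)
    return lambda x: c * g(x)

xs = [i / 1000.0 for i in range(-1000, 1001)]   # symmetric grid on [-1, 1]
w = [2.0 / len(xs)] * len(xs)                   # uniform quadrature weights
p = project(lambda x: x * x + x, lambda x: x, xs, w)
```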

class NpyProximation.HilbertRegressor(deg=3, base=None, meas=None, fspace=None)[source]

Regression using Hilbert Space techniques Scikit-Learn style.

Parameters:
  • deg – int, default=3 The degree of polynomial regression. Only used if base is None
  • base – list, default=None; a list of functions to form an orthogonal function basis
  • meas – NpyProximation.Measure, default = None the measure to form the \(L_2(\mu)\) space. If None a discrete measure will be constructed based on fit inputs
  • fspace – NpyProximation.FunctionBasis, default = None the function subspace of \(L_2(\mu)\), if None it will be initiated according to self.meas
fit(X, y)[source]
Parameters:
  • X – Training data
  • y – Target values
Returns:

self

predict(X)[source]

Predict using the Hilbert regression method

Parameters:X – Samples
Returns:Returns predicted values
class NpyProximation.Measure(density=None, domain=None)[source]

Constructs a measure \(\mu\) based on density and domain.

Parameters:
  • density

    the density over the domain:

    • if None is given, a uniform distribution is assumed
    • if a callable h is given, then \(d\mu=h(x)dx\)
    • if a dictionary is given, then \(\mu=\sum w_x\delta_x\) is a discrete measure. The points \(x\) are the keys of the dictionary (tuples) and the weights \(w_x\) are the values.
  • domain – if density is a dictionary, it will be set by its keys. If callable, then domain must be a list of tuples defining the domain’s box. If None is given, it will be set to \([-1, 1]^n\)
integral(f)[source]

Calculates \(\int_{domain} fd\mu\).

Parameters:f – the integrand
Returns:the value of the integral
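The behavior of such an integral in the two density cases can be sketched as follows (a toy stand-in, not the Measure class itself):

```python
def integral(f, density):
    """Integrate f against a measure given either as a discrete dict
    {x: w_x} (mu = sum of w_x * delta_x) or as a callable density h over
    a fixed 1-d grid (a crude Riemann-sum stand-in for h(x) dx on [-1, 1])."""
    if isinstance(density, dict):
        return sum(w * f(x) for x, w in density.items())
    dx = 1 / 1000.0
    xs = [i * dx for i in range(-1000, 1000)]
    return sum(density(x) * f(x) * dx for x in xs)

# Discrete measure mu = 0.5*delta_1 + 0.5*delta_3, so the integral of x is 2.0
v = integral(lambda x: x, {1.0: 0.5, 3.0: 0.5})
```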
norm(p, f)[source]

Computes the p-norm of f with respect to the current measure, i.e., \((\int_{domain}|f|^p d\mu)^{1/p}\).

Parameters:
  • p – a positive real number
  • f – the function whose norm is desired.
Returns:

\(\|f\|_{p, \mu}\)

class NpyProximation.Regression(points, dim=None)[source]

Given a set of points, i.e., a list P of tuples of equal length, this class computes the best approximation of a function that fits the data, in the following sense:

  • if no extra parameter is provided, meaning that an object is initiated like R = Regression(P), then calling R.fit() returns the linear regression that fits the data.
  • if at initiation the parameter deg=n is set, then R.fit() returns the polynomial regression of degree n.
  • if a basis of functions is provided by means of an OrthSystem object (R.SetOrthSys(orth)), then calling R.fit() returns the best approximation that can be found using the basis functions of the orth object.
Parameters:
  • points – a list of points to be fitted or a callable to be approximated
  • dim – dimension of the domain
SetFuncSpc(sys)[source]

Sets the basis of the orthogonal system.

Parameters:sys – an orthsys.OrthSystem object.
Returns:None

Note

For technical reasons, the measure needs to be given via SetMeasure method. Otherwise, the Lebesgue measure on \([-1, 1]^n\) is assumed.

SetMeasure(meas)[source]

Sets the default measure for approximation.

Parameters:meas – a measure.Measure object
Returns:None
fit()[source]

Fits the best curve based on the optionally provided orthogonal basis. If no basis is provided, it fits a polynomial of the degree given at initiation.

Returns:The fit.

Sensitivity Analysis

Sensitivity analysis of a dataset based on a fit, sklearn style. The core functionality is provided by SALib.

class sensapprx.CorrelationThreshold(threshold=0.7)[source]

Selects a minimal set of features based on a given (Pearson) correlation threshold. The transformer omits the maximum number of highly correlated features and makes sure that the correlation among the remaining features does not exceed the given threshold.

Parameters:threshold – the threshold for selecting correlated pairs.
fit(X, y=None)[source]

Finds the Pearson correlations among all features, picks the pairs whose absolute correlation exceeds the given threshold, and then selects a minimal set of features with low mutual correlation

Parameters:
  • X – Training data
  • y – Target values (default: None)
Returns:

self
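The greedy idea can be sketched in plain Python (an illustrative stand-in, not the transformer's actual implementation): compute pairwise Pearson correlations and drop the later feature of every pair above the threshold.

```python
def pearson(u, v):
    # Pearson correlation of two equal-length sequences.
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)

def select_features(columns, threshold=0.7):
    """Greedily drop one feature from every pair whose absolute
    Pearson correlation exceeds the threshold."""
    keep = list(range(len(columns)))
    i = 0
    while i < len(keep):
        j = i + 1
        while j < len(keep):
            if abs(pearson(columns[keep[i]], columns[keep[j]])) > threshold:
                del keep[j]          # drop the later of the correlated pair
            else:
                j += 1
        i += 1
    return keep

x0 = [1.0, 2.0, 3.0, 4.0]
x1 = [2.1, 4.0, 6.2, 8.1]        # nearly proportional to x0 -> dropped
x2 = [1.0, -1.0, 1.0, -1.0]      # weakly correlated with x0 -> kept
kept = select_features([x0, x1, x2], threshold=0.7)
```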

fit_transform(X, y=None, **fit_params)[source]

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X – numpy array of shape [n_samples, n_features]; Training set.
  • y – numpy array of shape [n_samples]; Target values.
Returns:

Transformed array

class sensapprx.SensAprx(n_features_to_select=10, regressor=None, method='sobol', margin=0.2, num_smpl=500, num_levels=5, grid_jump=1, num_resmpl=8, reduce=False, domain=None, probs=None)[source]

Transform data to select the most influential factors according to a regressor that fits the data.

Parameters:
  • n_features_to_select – int; number of top features to be selected
  • regressor – a sklearn style regressor to fit the data for sensitivity analysis
  • method – str; the sensitivity analysis method; default ‘sobol’, other options are ‘morris’ and ‘delta-mmnt’
  • margin – domain margin, default: .2
  • num_smpl – number of samples to perform the analysis, default: 500
  • num_levels – number of levels for morris analysis, default: 5
  • grid_jump – grid jump for morris analysis, default: 1
  • num_resmpl – number of resamples for moment independent analysis, default: 8
  • reduce – whether to reduce the data points to uniques and calculate the averages of the target or not, default: False
  • domain – pre-calculated unique points; if None and reduce is True, unique points will be found
  • probs – pre-calculated values associated to domain points
fit(X, y)[source]

Fits the regressor to the data (X, y) and performs a sensitivity analysis on the result of the regression.

Parameters:
  • X – Training data
  • y – Target values
Returns:

self
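The idea behind first-order sensitivity indices can be illustrated without SALib. The toy estimator below computes \(S_i = \mathrm{Var}(E[y\mid x_i])/\mathrm{Var}(y)\) by grouping samples on each input's value over a deterministic grid; for y = x1 it attributes all variance to the first input. All names are hypothetical.

```python
from itertools import product
from statistics import mean, pvariance

def first_order_indices(samples, ys):
    """Estimate S_i = Var(E[y | x_i]) / Var(y) by grouping samples on
    the value of each input (a toy stand-in for the Sobol analysis
    that SALib performs)."""
    total = pvariance(ys)
    out = []
    for i in range(len(samples[0])):
        groups = {}
        for s, y in zip(samples, ys):
            groups.setdefault(s[i], []).append(y)
        cond_means = [mean(g) for g in groups.values()]
        out.append(pvariance(cond_means) / total)
    return out

grid = [g / 4 for g in range(5)]      # {0, .25, .5, .75, 1}
X = list(product(grid, grid))         # a full 5x5 grid of (x1, x2)
y = [x1 for x1, x2 in X]              # y depends only on x1
s = first_order_indices(X, y)         # s[0] is 1, s[1] is 0
```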

fit_transform(X, y=None, **fit_params)[source]

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X – numpy array of shape [n_samples, n_features]; Training set.
  • y – numpy array of shape [n_samples]; Target values.
Returns:

Transformed array

Optimized Pipeline Detector

class aml.AML(config=None, length=5, scoring='accuracy', cat_cols=None, surrogates=None, min_random_evals=15, cv=None, check_point='./', stack_res=True, stack_probs=True, stack_decision=True, verbose=1, n_jobs=-1)[source]

A class that accepts a nested dictionary with machine learning libraries as its keys and a dictionary of their parameters and their ranges as value of each key and finds an optimum combination based on training data.

Parameters:
  • config – A dictionary whose keys are scikit-learn-style objects (as strings) and its corresponding values are dictionaries of the parameters and their acceptable ranges/values
  • length – default=5; Maximum number of objects in generated pipelines
  • scoring – default=’accuracy’; The scoring method to be optimized. Must follow the sklearn scoring signature
  • cat_cols – default=None; The list of indices of categorical columns
  • surrogates – default=None; A list of 4-tuples determining surrogates. The first element of each tuple is a scikit-learn regressor, the second is the number of iterations for which this surrogate is estimated and optimized, the third is the sampling strategy, and the fourth is the scipy.optimize solver
  • min_random_evals – default=15; Number of randomly sampled initial values for hyperparameters
  • cv – default=`ShuffleSplit(n_splits=3, test_size=.25)`; The cross validation method
  • check_point – default=’./’; The path where the optimization results will be stored
  • stack_res – default=True; the res flag passed to StackingEstimator
  • stack_probs – default=True; the probs flag passed to StackingEstimator
  • stack_decision – default=True; the decision flag passed to StackingEstimator
  • verbose – default=1; Level of output details
  • n_jobs – int, default=-1; number of processes to run in parallel
add_surrogate(estimator, itrs, sampling=None, optim='L-BFGS-B')[source]

Adding a regressor for surrogate optimization procedure.

Parameters:
  • estimator – A scikit-learn style regressor
  • itrs – Number of iterations the estimator needs to be fitted and optimized
  • sampling – default= BoxSample; The sampling strategy (CompactSample, BoxSample or SphereSample)
  • optim – default=’L-BFGS-B’; the scipy.optimize solver
Returns:

None

eoa_fit(X, y, **kwargs)[source]

Applies evolutionary optimization methods to find an optimum pipeline

Parameters:
  • X – Training data
  • y – Corresponding observations
  • kwargs – EOA parameters
Returns:

self

fit(X, y)[source]

Generates and optimizes all legitimate pipelines. The best pipeline can be retrieved from self.best_estimator_

Parameters:
  • X – Training data
  • y – Corresponding observations
Returns:

self

get_top(num=5)[source]

Finds the top num pipelines

Parameters:num – Number of pipelines to be returned
Returns:An OrderedDict of top models
optimize_pipeline(seq, X, y)[source]

Constructs and optimizes a pipeline according to the steps passed through seq which is a tuple of estimators and transformers.

Parameters:
  • seq – the tuple of steps of the pipeline to be optimized
  • X – numpy array of training features
  • y – numpy array of training values
Returns:

the optimized pipeline and its score

types()[source]

Recognizes the type of each estimator to determine legitimate placement of each

Returns:None
class aml.StackingEstimator(estimator, res=True, probs=True, decision=True)[source]

Meta-transformer for adding predictions and/or class probabilities as synthetic feature(s).

Parameters:
  • estimator – object with fit, predict, and predict_proba methods. The estimator to generate synthetic features from.
  • res – True (default), stacks the final result of estimator
  • probs – True (default), stacks probabilities calculated by estimator
  • decision – True (default), stacks the result of decision function of the estimator
fit(X, y=None, **fit_params)[source]

Fit the StackingEstimator meta-transformer.

Parameters:
  • X – array-like of shape (n_samples, n_features). The training input samples.
  • y – array-like, shape (n_samples,). The target values (integers that correspond to classes in classification, real numbers in regression).
  • fit_params – Other estimator-specific parameters.
Returns:

self, object. Returns a copy of the estimator

set_params(**params)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:params – dict; estimator parameters.
Returns:self – estimator instance.
transform(X)[source]

Transform data by adding synthetic feature(s).

Parameters:X – numpy ndarray, (n_samples, n_components). New data, where n_samples is the number of samples and n_components is the number of components.
Returns:X_transformed: array-like, shape (n_samples, n_features + 1) or (n_samples, n_features + 1 + n_classes) for classifier with predict_proba attribute; The transformed feature set.
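The stacking step can be sketched with a toy estimator (both names below are hypothetical): a fitted estimator's prediction is appended to each row as a synthetic column, mirroring the res=True behavior.

```python
class MeanClassifier:
    """A toy stand-in estimator: predicts 1 when the first feature
    exceeds the training mean."""
    def fit(self, X, y=None):
        self.mean_ = sum(row[0] for row in X) / len(X)
        return self

    def predict(self, X):
        return [1 if row[0] > self.mean_ else 0 for row in X]

def stack_features(estimator, X):
    # Append the estimator's prediction to each row as a synthetic feature.
    preds = estimator.predict(X)
    return [row + [p] for row, p in zip(X, preds)]

X = [[1.0], [2.0], [3.0], [4.0]]
est = MeanClassifier().fit(X)      # training mean is 2.5
Xt = stack_features(est, X)        # each row gains one synthetic column
```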
class aml.Words(letters, last=None, first=None, repeat=False)[source]

This class takes a set as alphabet and generates words of a given length accordingly. A Words instance accepts the following parameters:

Parameters:
  • letters – is a set of letters (symbols) to make up the words
  • last – a subset of letters that are allowed to appear at the end of a word
  • first – a subset of letters that can only appear at the beginning of a word
  • repeat – whether consecutive occurrence of a letter is allowed
Generate(l)[source]

Generates the set of legitimate words of length l

Parameters:l – int, the length of words
Returns:set of all legitimate words of length l
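A plain-Python sketch of the same word generation using itertools (the helper is hypothetical, not the Words API): enumerate all candidate words, then filter by the first/last/repeat constraints.

```python
from itertools import product

def generate_words(letters, length, last=None, first=None, repeat=False):
    """All words of the given length over `letters`, optionally restricting
    the first and last symbols and forbidding consecutive repeats."""
    words = set()
    for w in product(letters, repeat=length):
        if first is not None and w[0] not in first:
            continue                     # wrong starting letter
        if last is not None and w[-1] not in last:
            continue                     # wrong ending letter
        if not repeat and any(a == b for a, b in zip(w, w[1:])):
            continue                     # consecutive repeat not allowed
        words.add(w)
    return words

# Words of length 2 over {a, b, c} ending in 'c', no consecutive repeats.
ws = generate_words({"a", "b", "c"}, 2, last={"c"})
```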

MLTrace: A machine learning progress tracker

This module provides some basic functionality to track the process of machine learning model development. It sets up a SQLite db-file and stores selected models, graphs, and data (for convenience) and recovers them as requested.

mltrace uses peewee and pandas for data manipulation.

It also has built-in capabilities to generate some typical machine learning plots and graphs.
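mltrace manages its tables through peewee; the preserve/recover round trip it implements can be sketched with the standard library alone (illustrative only, not mltrace's schema):

```python
import pickle
import sqlite3

# A minimal stdlib sketch of pickling a model into a SQLite table and
# recovering it later (an in-memory database stands in for the db-file).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE saved (id INTEGER PRIMARY KEY, name TEXT, blob BLOB)")

model = {"estimator": "LinearRegression", "params": {"fit_intercept": True}}
con.execute("INSERT INTO saved (name, blob) VALUES (?, ?)",
            ("demo", pickle.dumps(model)))
con.commit()

# Recover: fetch the blob by name and unpickle it.
blob, = con.execute("SELECT blob FROM saved WHERE name = ?", ("demo",)).fetchone()
recovered = pickle.loads(blob)
```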

class mltrace.Data(*args, **kwargs)[source]

The class to generate the `data` table in the SQLite db-file. This table stores the whole given data for convenience.

DoesNotExist

alias of DataDoesNotExist

class mltrace.MLModel(*args, **kwargs)[source]

The class to generate the `mlmodel` table in the SQLite db-file. It stores the scikit-learn scheme of the model/pipeline, its parameters, etc.

DoesNotExist

alias of MLModelDoesNotExist

class mltrace.Metrics(*args, **kwargs)[source]

The class to generate the `metrics` table in the SQLite db-file. This table stores the calculated metrics of each stored model.

DoesNotExist

alias of MetricsDoesNotExist

class mltrace.Plots(*args, **kwargs)[source]

The class to generate the `plots` table in the SQLite db-file. This table stores matplotlib plots associated to each model.

DoesNotExist

alias of PlotsDoesNotExist

class mltrace.Saved(*args, **kwargs)[source]

The class to generate the `saved` table in the SQLite db-file. It keeps the pickled version of a stored model that can be later recovered.

DoesNotExist

alias of SavedDoesNotExist

class mltrace.Task(*args, **kwargs)[source]

The class to generate the `task` table in the SQLite db-file. This table keeps basic information about the task on hand, e.g., the task name, a brief description, target column, and columns to be ignored.

DoesNotExist

alias of TaskDoesNotExist

class mltrace.Weights(*args, **kwargs)[source]

The class to generate the `weights` table in the SQLite db-file. Stores some sensitivity measures, correlations, etc.

DoesNotExist

alias of WeightsDoesNotExist

class mltrace.mltrack(task, task_id=None, db_name='mltrack.db', cv=None)[source]

This class instantiates an object that tracks the ML activities and stores them upon request.

Parameters:
  • task – str; the task name
  • task_id – the id of an existing task, used if the name is not provided.
  • db_name – a file name for the SQLite database
  • cv – the default cross validation method, must be a valid cv based on sklearn.model_selection; default: ShuffleSplit(n_splits=3, test_size=.25)
FeatureWeights(weights=('pearson', 'variance'), **kwargs)[source]

Calculates the requested weights and logs them

Parameters:
  • weights – a list of weights, a subset of {‘pearson’, ‘variance’, ‘relieff’, ‘surf’, ‘sobol’, ‘morris’, ‘delta_mmnt’, ‘info-gain’}
  • kwargs – all input acceptable by skrebate.ReliefF, skrebate.surf, sensapprx.SensAprx
Returns:

None

LoadModel(mid)[source]

Loads a model corresponding to an id

Parameters:mid – the model id
Returns:an unfitted model
static LoadPlot(pid)[source]

Loads a matplotlib plot

Parameters:pid – the id of the plot
Returns:a matplotlib figure
LogMetrics(mdl, cv=None)[source]

Logs metrics of an already logged model using a cross validation method

Parameters:
  • mdl – the model to be measured
  • cv – cross validation method
Returns:

a dictionary of all measures with their corresponding values for the model

LogModel(mdl, name=None)[source]

Log a machine learning model

Parameters:
  • mdl – a scikit-learn compatible estimator/pipeline
  • name – an arbitrary string to name the model
Returns:

modified instance of mdl which carries a new attribute mltrack_id as its id.

PreserveModel(mdl)[source]

Pickles and preserves an already logged model

Parameters:mdl – a logged model
Returns:None
RecoverModel(mdl_id)[source]

Recovers a pickled model

Parameters:mdl_id – a valid mltrack_id
Returns:a fitted model
RegisterData(source_df, target)[source]

Registers a pandas DataFrame into the SQLite database. Upon a call, it also sets self.X and self.y which are numpy arrays.

Parameters:
  • source_df – the pandas DataFrame to be stored
  • target – the name of the target column to be predicted
Returns:

None

TopFeatures(num=10)[source]

Returns the top num features in the data based on calculated weights

Parameters:num – number of top features to return
Returns:an OrderedDict of top features
UpdateModel(mdl, name)[source]

Updates an already logged model which has mltrack_id set.

Parameters:
  • mdl – a scikit-learn compatible estimator/pipeline
  • name – an arbitrary string to name the model
Returns:

None

UpdateTask(data)[source]

Updates the current task info.

Parameters:data

a dictionary that may include some of the following keys:

  • ’name’: the corresponding value will replace the current task name
  • ’description’: the corresponding value will replace the current description
  • ’ignore’: the corresponding value will replace the current ignored columns
Returns:None
allModels()[source]

Lists all logged models as a pandas DataFrame

Returns:a pandas DataFrame
allPlots(mdl_id)[source]

Lists all stored plots for a model with mdl_id as a pandas DataFrame

Parameters:mdl_id – a valid mltrack_id
Returns:a pandas DataFrame
allPreserved()[source]

Lists all pickled models as a pandas DataFrame

Returns:a pandas DataFrame
allTasks()[source]

Lists all tasks as a pandas DataFrame

Returns:a pandas DataFrame
static cumulative_gain_curve(y_true, y_score, pos_label=None)[source]

This function generates the points necessary to plot the Cumulative Gains chart. Note: this implementation is restricted to the binary classification task.

Parameters:
  • y_true – (array-like, shape (n_samples)): True labels of the data.
  • y_score – (array-like, shape (n_samples)): Target scores, can either be probability estimates of the positive class, confidence values, or non-thresholded measure of decisions (as returned by decision_function on some classifiers).
  • pos_label – (int or str, default=None): Label considered as positive and others are considered negative
Returns:

  • percentages (numpy.ndarray): An array containing the x-axis values for plotting the Cumulative Gains chart.
  • gains (numpy.ndarray): An array containing the y-axis values for one curve of the Cumulative Gains chart.

Raise:

ValueError: If y_true is not composed of 2 classes. The Cumulative Gain Chart is only relevant in binary classification.
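A self-contained sketch of the computation (illustrative, not the exact mltrace implementation): sort samples by descending score, then record the fraction of all positives captured within each top-ranked prefix.

```python
def cumulative_gain_points(y_true, y_score, pos_label=1):
    """Points of the cumulative-gains curve for binary labels."""
    if len(set(y_true)) != 2:
        raise ValueError("cumulative gains is defined for binary labels only")
    order = sorted(range(len(y_score)), key=lambda i: -y_score[i])
    total_pos = sum(1 for t in y_true if t == pos_label)
    percentages, gains, captured = [], [], 0
    for rank, i in enumerate(order, start=1):
        captured += (y_true[i] == pos_label)       # positives seen so far
        percentages.append(rank / len(order))      # fraction of samples ranked
        gains.append(captured / total_pos)         # fraction of positives captured
    return percentages, gains

# A perfect scorer captures all positives in the top half of the ranking.
p, g = cumulative_gain_points([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.2])
```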

static getBest(metric)[source]

Finds the model with the best metric.

Parameters:metric – the metric to find the best stored model for
Returns:the model with the best metric
get_data()[source]

Retrieves data in numpy format

Returns:numpy arrays X, y
get_dataframe()[source]

Retrieves data in pandas DataFrame format

Returns:pandas DataFrame containing all data
heatmap(corr_df=None, sort_by=None, ascending=False, font_size=3, cmap='gnuplot2', idx_col='feature', ignore=())[source]

Plots a heatmap from the values of the dataframe corr_df

Parameters:
  • corr_df – value container
  • idx_col – the column whose values will be used as index
  • sort_by – the dataframe will be sorted in descending order by the values of this column; if None, the first column is used

  • font_size – font size, default 3
  • cmap – color mapping; must be one of matplotlib’s named colormaps: ‘viridis’, ‘plasma’, ‘inferno’, ‘magma’, ‘cividis’, ‘Greys’, ‘Purples’, ‘Blues’, ‘Greens’, ‘Oranges’, ‘Reds’, ‘YlOrBr’, ‘YlOrRd’, ‘OrRd’, ‘PuRd’, ‘RdPu’, ‘BuPu’, ‘GnBu’, ‘PuBu’, ‘YlGnBu’, ‘PuBuGn’, ‘BuGn’, ‘YlGn’, ‘binary’, ‘gist_yarg’, ‘gist_gray’, ‘gray’, ‘bone’, ‘pink’, ‘spring’, ‘summer’, ‘autumn’, ‘winter’, ‘cool’, ‘Wistia’, ‘hot’, ‘afmhot’, ‘gist_heat’, ‘copper’, ‘PiYG’, ‘PRGn’, ‘BrBG’, ‘PuOr’, ‘RdGy’, ‘RdBu’, ‘RdYlBu’, ‘RdYlGn’, ‘Spectral’, ‘coolwarm’, ‘bwr’, ‘seismic’, ‘twilight’, ‘twilight_shifted’, ‘hsv’, ‘Pastel1’, ‘Pastel2’, ‘Paired’, ‘Accent’, ‘Dark2’, ‘Set1’, ‘Set2’, ‘Set3’, ‘tab10’, ‘tab20’, ‘tab20b’, ‘tab20c’, ‘flag’, ‘prism’, ‘ocean’, ‘gist_earth’, ‘terrain’, ‘gist_stern’, ‘gnuplot’, ‘gnuplot2’, ‘CMRmap’, ‘cubehelix’, ‘brg’, ‘gist_rainbow’, ‘rainbow’, ‘jet’, ‘nipy_spectral’, ‘gist_ncar’

Returns:

matplotlib pyplot instance

plot_calibration_curve(mdl, name, fig_index=1, bins=10)[source]

Plots calibration curves.

Parameters:
  • mdl – object type that implements the “fit” and “predict” methods; An object of that type which is cloned for each validation.
  • name – string; Title for the chart.
  • fig_index – default=1; the matplotlib figure index
  • bins – number of bins to partition samples
Returns:

a matplotlib plot

plot_cumulative_gain(mdl, title='Cumulative Gains Curve', figsize=None, title_fontsize='large', text_fontsize='medium')[source]

Generates the Cumulative Gains plot from labels and scores/probabilities. The cumulative gains chart is used to determine the effectiveness of a binary classifier. A detailed explanation can be found at http://mlwiki.org/index.php/Cumulative_Gain_Chart. The implementation here works only for binary classification.

Parameters:
  • mdl – object type that implements the “fit” and “predict” methods; An object of that type which is cloned for each validation.
  • title – (string, optional): Title of the generated plot. Defaults to “Cumulative Gains Curve”.
  • figsize – (2-tuple, optional): Tuple denoting figure size of the plot, e.g. (6, 6). Defaults to None.
  • title_fontsize – (string or int, optional): Matplotlib-style fontsizes. Use e.g. “small”, “medium”, “large” or integer values. Defaults to “large”.
  • text_fontsize – (string or int, optional): Matplotlib-style fontsizes. Use e.g. “small”, “medium”, “large” or integer values. Defaults to “medium”.

Returns:

ax (matplotlib.axes.Axes): The axes on which the plot was drawn.

plot_learning_curve(mdl, title, ylim=None, cv=None, n_jobs=1, train_sizes=None, **kwargs)[source]

Generate a simple plot of the test and training learning curve.

Parameters:
  • mdl – object type that implements the “fit” and “predict” methods; An object of that type which is cloned for each validation.
  • title – string; Title for the chart.
  • measure – string, a performance measure; must be one of the following: accuracy, f1, precision, recall, roc_auc
  • ylim – tuple, shape (ymin, ymax), optional; Defines minimum and maximum y values plotted.
  • cv

    int, cross-validation generator or an iterable, optional; Determines the cross-validation splitting strategy. Possible inputs for cv are:

    • None, to use the default 3-fold cross-validation,
    • integer, to specify the number of folds.
    • An object to be used as a cross-validation generator.
    • An iterable yielding train/test splits.

    For integer/None inputs, if y is binary or multiclass, StratifiedKFold is used. If the mdl is not a classifier or if y is neither binary nor multiclass, KFold is used.

  • n_jobs – integer, optional; Number of jobs to run in parallel (default 1).
Returns:

a matplotlib plot

plot_lift_curve(mdl, title='Lift Curve', figsize=None, title_fontsize='large', text_fontsize='medium')[source]

Generates the Lift Curve from labels and scores/probabilities. The lift curve is used to determine the effectiveness of a binary classifier. A detailed explanation can be found at http://www2.cs.uregina.ca/~dbd/cs831/notes/lift_chart/lift_chart.html. The implementation here works only for binary classification.

Parameters:
  • mdl – object type that implements the “fit” and “predict” methods; An object of that type which is cloned for each validation.
  • title – (string, optional): Title of the generated plot. Defaults to “Lift Curve”.
  • figsize – (2-tuple, optional): Tuple denoting figure size of the plot e.g. (6, 6). Defaults to None.
  • title_fontsize – (string or int, optional): Matplotlib-style fontsizes. Use e.g. “small”, “medium”, “large” or integer-values. Defaults to “large”.
  • text_fontsize – (string or int, optional): Matplotlib-style fontsizes. Use e.g. “small”, “medium”, “large” or integer-values. Defaults to “medium”.
Returns:

ax (matplotlib.axes.Axes): The axes on which the plot was drawn.

plot_roc_curve(mdl, label=None)[source]

The ROC curve, modified from Hands-On Machine learning with Scikit-Learn.

Parameters:
  • mdl – object type that implements the “fit” and “predict” methods; An object of that type which is cloned for each validation.
  • label – string; label for the chart.
Returns:

a matplotlib plot

class mltrace.np2df(data, clmns=None)[source]

A class to convert numpy ndarray to a pandas DataFrame. It produces a callable object which returns a pandas.DataFrame

Parameters:
  • data – numpy.ndarray of data
  • clmns – a list of titles for the pandas DataFrame column names. If None, column names of the form Cnum are produced, where num runs over the column indices of the ndarray.

synthdata Module

This module provides a basic framework for generating synthetic data resembling an existing dataset. One can determine the type of each field and the possible values for each field. Then the InspectData class will produce data based on the given types. Moreover, one can associate marginal distributions to each field or a joint distribution for data generation. If no distribution is associated, the data will be generated uniformly over the required ranges.

Supported data types

The following data types are supported:

  • SynthBin – support for binary data, i.e., data fields consisting of 0, 1 values
  • SynthInt – support for integer-valued data
  • SynthReal – support for real-valued data
  • SynthCat – support for categorical data, i.e., discrete variables whose values are predetermined
  • SynthDate – support for datetime data

Each of these data types accepts data, which is a 1-d numpy.array, and rv, which is a scipy.stats distribution implementing rv.rvs to generate samples. Among the above, SynthInt and SynthReal accept two parameters a and b, which are the lower and upper bounds of the sampling interval, respectively. SynthDate accepts frmt, which determines the date formatting for the input data.

Generating Synthetic Data

The SynthData class is responsible for generating synthetic data based on types, distributions and relations defined on the data. One initiates an instance as:

sd = SynthData(df, default_rv='uniform', distribution_type='marginal', rv=None)

where df is the pandas dataframe that will be synthesized. The rest of arguments are optional:

  • default_rv – determines the default distribution for the fields with no associated distribution. Ignored if distribution_type is set to ‘joint’.
  • distribution_type – determines whether the distribution(s) calculated based on df are marginals or a single joint distribution.
  • rv – a predefined distribution to be used as the joint distribution.

To set the type of a column of df one should use set_type method. This method accepts a list of column names, their types and a tuple of initiating parameters. Every column in the columns’ list will be given the same type. The type could be either an instance of SynthBin, SynthInt, SynthReal, SynthCat, SynthDate, or a string determining the type, e.g., ‘bin’, ‘int’, ‘real’, ‘cat’, ‘date’. If no type is associated to a column, it is assumed to be of categorical type.

The final command which generates the synthetic data is sample(num) where ‘num’ is the number of synthetic samples to be generated. This method will return a pandas.DataFrame containing synthetic data of size ‘num’.

Constraints

It is quite common that the values of some fields in a record depend on other fields. Simple constraints on the values of a field and relations between pairs can be handled using field objects.

If it is required to impose a constraint on a column, one can use the where() method. The statement to add a constraint generally looks like sd.where(field('clmn') > val) or sd.where(field('clmn1') <= field('clmn2')). The acceptable operators are ==, !=, >, <, >=, <=, in, nin. The operators in and nin check membership of the elements of ‘clmn’ either in ‘val’, which has to be an iterable, or in the column ‘clmn2’. The in stands for belonging and nin stands for not in.
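A minimal sketch of how such field comparators can work (hypothetical; the actual field object may differ): each comparison returns a predicate over a record, and a synthetic record is kept only if all constraints hold.

```python
class field:
    """Toy comparator object: each comparison returns a predicate over
    a record (a dict mapping column name -> value)."""
    def __init__(self, name):
        self.name = name

    def __gt__(self, other):
        if isinstance(other, field):
            return lambda rec: rec[self.name] > rec[other.name]
        return lambda rec: rec[self.name] > other

    def __le__(self, other):
        if isinstance(other, field):
            return lambda rec: rec[self.name] <= rec[other.name]
        return lambda rec: rec[self.name] <= other

def satisfies(record, constraints):
    # Rejection test: keep a synthetic record only if all constraints hold.
    return all(c(record) for c in constraints)

cons = [field("a") > 1, field("a") <= field("b")]
ok = satisfies({"a": 2, "b": 3}, cons)    # both constraints hold
bad = satisfies({"a": 0, "b": 3}, cons)   # violates a > 1
```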

Code documentation

class synthdat.SynthBase(data=None, rv=None)[source]

The base class for various synthetic data types.

static get_val(x)[source]

Converts the value of x into a numeric value that can be handled by random distributions

Parameters:x – the value to be converted into numeric
Returns:the corresponding numeric value
ret_val(X)[source]

Similar to set_val but works on the data stored in X

Parameters:X – the data to be converted to numbers
Returns:translation of X
set_val()[source]

Generates the numeric translation of the given data.

Returns:list of values
class synthdat.SynthBin(data=None, rv=None)[source]

Support for binary data, i.e., the data fields consisting of 0, 1 values

get_val(x)[source]

Converts the value of x into a numeric value that can be handled by random distributions

Parameters:x – the value to be converted into numeric
Returns:the corresponding numeric value
ret_val(X)[source]

Similar to set_val but works on the data stored in X

Parameters:X – the data to be converted to numbers
Returns:translation of X
set_val()[source]

Generates the numeric translation of the given data.

Returns:list of values
class synthdat.SynthCat(data, rv=None)[source]

Support for categorical type of data, i.e., the discrete variables whose values are predetermined;

get_val(x)[source]

Converts the value of x into a numeric value that can be handled by random distributions

Parameters:x – the value to be converted into numeric
Returns:the corresponding numeric value
ret_val(X)[source]

Similar to set_val but works on the data stored in X

Parameters:X – the data to be converted to numbers
Returns:translation of X
set_val()[source]

Generates the numeric translation of the given data.

Returns:list of values
class synthdat.SynthData(df, default_rv='uniform', distribution_type='marginal', rv=None)[source]

A class which takes a real pandas.DataFrame and produces synthetic data similar to the real data based on types and distributions provided by the user and/or extracted out of original data.

Parameters:
  • df – a pandas.DataFrame containing original data.
  • default_rv – the default distribution of columns; default ‘uniform’, may also be ‘normal’. Only effective if distribution_type is ‘marginal’; otherwise it is ignored.
  • distribution_type – default ‘marginal’. Determines the type of distribution. If ‘joint’, then either a normal distribution is fitted to the provided data, or rv is used if rv is not None.
  • rv – default None. The joint distribution of the variables. Only effective if distribution_type is ‘joint’.
filter(df)[source]

Filters ‘df’ to remove records that violate the constraints

Parameters:df – the dataframe to be filtered
Returns:the filtered dataframe
generate(num)[source]

(internal) Generates ‘num’ synthetic data records without considering constraints

Parameters:num – number of samples
Returns:pandas.DataFrame
sample(num)[source]

Produces ‘num’ records of synthetic data following given types, distributions and constraints

Parameters:num – number of synthetic data records
Returns:a dataframe consisting of ‘num’ synthetic records.
set_type(clmns, typ, param=None)[source]

Define the type of columns.

Parameters:
  • clmns – a list of df columns
  • typ – the associated type, either a string (‘bin’, ‘int’, ‘real’, ‘cat’, ‘date’) or an instance of SynthBin, SynthInt, SynthReal, SynthCat, SynthDate.
  • param – parameters passed to the synthetic data type when a string is given for ‘typ’. It can be a pair (a, b) for the ‘int’ and ‘real’ types, and just the format string for ‘date’.
transform()[source]

(internal) Analyses and initializes data types and converts them to numerical values.

Returns:None
where(cns)[source]

Adds a constraint on the values of a column using field objects.

Parameters:cns – the constraint like field(clmn1) > val1 or field(clmn1) <= field(clmn2).
Returns:None
class synthdat.SynthDate(data, frmt='%Y-%m-%d', rv=None)[source]

Support for datetime data;

get_val(x)[source]

Converts the value of x into a numeric value that can be handled by random distributions

Parameters:x – the value to be converted into numeric
Returns:the corresponding numeric value
ret_val(X)[source]

Similar to set_val but works on the data stored in X

Parameters:X – the data to be converted to numbers
Returns:translation of X
set_val()[source]

Generates the numeric translation of the given data.

Returns:list of values
class synthdat.SynthInt(a=None, b=None, data=None, rv=None)[source]

Support for integer valued data;

get_val(x)[source]

Converts the value of x into a numeric value that can be handled by random distributions

Parameters:x – the value to be converted into numeric
Returns:the corresponding numeric value
ret_val(X)[source]

Similar to set_val but works on the data stored in X

Parameters:X – the data to be converted to numbers
Returns:translation of X
set_val()[source]

Generates the numeric translation of the given data.

Returns:list of values
class synthdat.SynthReal(a=None, b=None, data=None, rv=None)[source]

Support for real valued data;

get_val(x)[source]

Converts the value of x into a numeric value that can be handled by random distributions

Parameters:x – the value to be converted into numeric
Returns:the corresponding numeric value
ret_val(X)[source]

Similar to set_val but works on the data stored in X

Parameters:X – the data to be converted to numbers
Returns:translation of X
set_val()[source]

Generates the numeric translation of the given data.

Returns:list of values
class synthdat.field(fld)[source]

A generic class to handle simple constraints on columns. Accepts only one parameter which refers to a column in the DataFrame.