Code Documentation¶
Surrogate Random Search¶
This module provides basic functionality to optimize an expensive black-box function based on Surrogate Random Search. The Surrogate Random Search (SRS) method attempts to approximate an optimal solution to the following problem:
minimize \(f(x)\)
subject to \(g_i(x)\ge 0,\quad i=1,\dots, m,\)
where arbitrary evaluations of \(f\) are not a viable option. Plain random search is guaranteed to converge to a local solution, but convergence is usually very slow, and all information about \(f\) is discarded except for the best candidate. SRS instead uses all information acquired about \(f\) so far during the iterations. At the \(i\)-th iteration, SRS replaces \(f\) by a surrogate \(\hat{f}_i\) that enjoys many nice analytical properties, which makes its optimization an easier task. Solving the above optimization problem with \(f\) replaced by \(\hat{f}_i\) yields a more informed candidate \(x_i\) for the next iteration. If a certain number of iterations do not produce a better candidate, the method reverts to random sampling to collect more information about \(f\). The surrogate \(\hat{f}_i\) can be found in many ways, such as (non)linear regression, Gaussian process regression, etc., and SurrogateSearch has no preference; by default it uses a polynomial regression of degree 3 if no regressor is provided. Any regressor following the scikit-learn architecture is acceptable. Note that regressors usually require a minimum number of data points to function properly.
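The loop described above can be sketched in one dimension. The code below is an illustrative stand-in, not the structsearch API: the function names are made up, and the surrogate is a degree-2 polynomial fitted by least squares via the normal equations.

```python
import random

def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [row[:] + [v] for row, v in zip(A, b)]
    for c in range(3):
        p = max(range(c, 3), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, 3):
            f = M[r][c] / M[c][c]
            M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    x = [0.0] * 3
    for r in (2, 1, 0):
        x[r] = (M[r][3] - sum(M[r][c] * x[c] for c in range(r + 1, 3))) / M[r][r]
    return x

def srs_minimize(f, lo, hi, n_init=8, n_iter=20, seed=0):
    """1-D surrogate random search: sample randomly, then repeatedly fit a
    quadratic surrogate to all evaluations and jump to its (clipped) minimizer."""
    rng = random.Random(seed)
    pts = [(x, f(x)) for x in (rng.uniform(lo, hi) for _ in range(n_init))]
    for _ in range(n_iter):
        xs = [p[0] for p in pts]
        S = lambda k: sum(x ** k for x in xs)
        A = [[len(xs), S(1), S(2)], [S(1), S(2), S(3)], [S(2), S(3), S(4)]]
        b = [sum(y for _, y in pts),
             sum(x * y for x, y in pts),
             sum(x * x * y for x, y in pts)]
        a0, a1, a2 = solve3(A, b)          # surrogate: a0 + a1*x + a2*x^2
        if a2 > 1e-9:                       # convex surrogate: use its vertex
            cand = min(max(-a1 / (2 * a2), lo), hi)
        else:                               # otherwise fall back to random sampling
            cand = rng.uniform(lo, hi)
        pts.append((cand, f(cand)))
    return min(pts, key=lambda p: p[1])[0]

best = srs_minimize(lambda x: (x - 1.5) ** 2 + 2.0, -5.0, 5.0)
```

Because the objective here is itself quadratic, the surrogate becomes exact after the initial samples and the loop converges immediately to the minimizer.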
There are various ways to sample a random point in the feasible space, and the choice affects the performance of SRS. SurrogateSearch implements two methods: BoxSample and SphereSample. One can also choose whether to shrink the volume of the box or sphere from which samples are drawn.
class structsearch.BaseSample(**kwargs)[source]¶ This is the base class for various sampling methods.
Parameters: - init_radius – optional (default=2.); positive real number indicating the initial radius of the local search ball.
- contraction – optional (default=.9); the contraction factor, which must be a positive real less than 1.
- ineq – optional; a list of functions whose positivity region will be the acceptable condition.
- bounds – optional; the list of (ordered) tuples determining the bound of each component.
class structsearch.BoxSample(**kwargs)[source]¶ Generates samples out of a box around a given center.
Parameters: - init_radius – float the initial half-length of the edges of the sampling box; default: 2.
- contraction – float the contraction factor for repeated sampling.
class structsearch.Categorical(items, **kwargs)[source]¶ A list of possible values for the search algorithm to choose from.
Parameters: items – A list of possible values for a parameter
class structsearch.HDReal(a, b, **kwargs)[source]¶ An n-dimensional box of real numbers corresponding to the classification groups (e.g. class_weight); a is the list of lower bounds and b is the list of upper bounds.
Parameters: - a – a tuple of lower bounds for each dimension
- b – a tuple of upper bounds for each dimension
class structsearch.Integer(a=None, b=None, **kwargs)[source]¶ The range of possible values for an integer variable; a is the minimum and b is the maximum. Defaults are -infinity and +infinity, respectively.
Parameters: - a – the lower bound for the integer interval defined by instance (accepting ‘-numpy.inf’)
- b – the upper bound for the integer interval defined by instance (accepting ‘numpy.inf’)
class structsearch.Real(a=None, b=None, **kwargs)[source]¶ The range of possible values for a real variable; a is the minimum and b is the maximum. Defaults are -infinity and +infinity, respectively.
Parameters: - a – the lower bound for the (closed) interval defined by instance (accepting ‘-numpy.inf’)
- b – the upper bound for the (closed) interval defined by instance (accepting ‘numpy.inf’)
class structsearch.SphereSample(**kwargs)[source]¶ Generates samples out of a sphere around a given center.
Parameters: - init_radius – float the initial radius of the sampling sphere; default: 2.
- contraction – float the contraction factor for repeated sampling.
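The two sampling strategies, together with radius contraction between rounds, can be sketched as follows. These helper names are illustrative, not the structsearch implementation.

```python
import math
import random

def box_sample(center, radius, rng):
    """Uniform sample from the axis-aligned box [c_i - r, c_i + r]."""
    return [c + rng.uniform(-radius, radius) for c in center]

def sphere_sample(center, radius, rng):
    """Uniform sample from the ball of the given radius around center."""
    d = len(center)
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]      # random direction
    norm = math.sqrt(sum(v * v for v in g))
    r = radius * rng.random() ** (1.0 / d)           # r ~ U^(1/d) keeps density uniform
    return [c + r * v / norm for c, v in zip(center, g)]

# shrinking the sampling region between rounds, as with the contraction factor
rng = random.Random(0)
radius, contraction = 2.0, 0.9
samples = []
for _ in range(10):
    samples.append(sphere_sample([0.0, 0.0], radius, rng))
    radius *= contraction
```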
class structsearch.SurrogateRandomCV(estimator, params, scoring=None, fit_params=None, n_jobs=-1, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score='raise', return_train_score=True, max_iter=50, min_evals=25, regressor=None, sampling=<class 'structsearch.CompactSample'>, radius=None, contraction=0.95, search_sphere=False, optimizer='scipy', scipy_solver='SLSQP', task_name='optim_task', warm_start=True, Continue=False, max_itr_no_prog=10000, ineqs=(), init=None)[source]¶ Surrogate Random Search optimization over hyperparameters.
The parameters of the estimator used to apply these methods are optimized by cross-validated search over parameter settings.
In contrast to GridSearchCV, not all parameter values are tried out, but rather a fixed number of parameter settings is sampled from the specified distributions. The number of parameter settings that are tried is given by max_iter.
Parameters: - estimator – estimator object. An object of that type is instantiated for each search point. This object is assumed to implement the scikit-learn estimator API. Either the estimator needs to provide a score function, or scoring must be passed.
- params – dict; Dictionary with parameter names (string) as keys and domains as lists of parameter ranges to try. Domains are either lists of categorical (string) values or 2-element lists specifying a min and a max for integer or float parameters.
- scoring – string, callable or None, default=None; A string (see model evaluation documentation) or a scorer callable object/function with signature scorer(estimator, X, y). If None, the score method of the estimator is used.
- max_iter – int, default=50; Number of parameter settings that are sampled. max_iter trades off runtime vs. quality of the solution. Consider increasing n_points if you want to try more parameter settings in parallel.
- min_evals – int, default=25; Number of random evaluations before employing an approximation for the response surface.
- n_jobs – int, default=-1; number of processes to run in parallel
- fit_params – dict, optional; Parameters to pass to the fit method.
- pre_dispatch – int or string, optional; Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:
- None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs
- An int, giving the exact number of total jobs that are spawned
- A string, giving an expression as a function of n_jobs, as in ‘2*n_jobs’
- cv – int, cross-validation generator or an iterable, optional; Determines the cross-validation splitting strategy. Possible inputs for cv are:
- None, to use the default 3-fold cross validation,
- integer, to specify the number of folds in a (Stratified)KFold,
- An object to be used as a cross-validation generator.
- An iterable yielding train, test splits.
For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.
- refit – boolean, default=True; Refit the best estimator with the entire dataset. If False, it is impossible to make predictions using this SurrogateRandomCV instance after fitting.
- verbose – int, default=0 Prints internal information about the progress of each iteration.
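Cross-validated search over sampled parameter settings can be illustrated with a pure-Python stand-in: a one-parameter toy model scored by k-fold mean squared error. This is a conceptual sketch, not the SurrogateRandomCV API.

```python
import random

def cv_score(w, X, y, k=3):
    """Mean squared error of the toy model y_hat = w * x, averaged over k folds."""
    scores = []
    for f in range(k):
        fold = range(f, len(X), k)                 # every k-th point forms a test fold
        errs = [(w * X[i] - y[i]) ** 2 for i in fold]
        scores.append(sum(errs) / len(errs))
    return sum(scores) / k

def random_search_cv(X, y, n_iter=200, seed=0):
    """Sample n_iter candidate settings and keep the one with the best CV score."""
    rng = random.Random(seed)
    cands = [rng.uniform(-10.0, 10.0) for _ in range(n_iter)]
    return min(cands, key=lambda w: cv_score(w, X, y))

# data generated by y = 3 * x, so the search should land near w = 3
X = [i / 10.0 for i in range(30)]
y = [3.0 * x for x in X]
best_w = random_search_cv(X, y)
```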
fit(X, y=None, groups=None, **fit_params)[source]¶ Run fit with all sets of parameters.
Parameters: - X – array-like, shape = [n_samples, n_features] Training vector, where n_samples is the number of samples and n_features is the number of features.
- y – array-like, shape = [n_samples] or [n_samples, n_output], optional; Target relative to X for classification or regression; None for unsupervised learning.
- groups – array-like, with shape (n_samples,), optional; Group labels for the samples used while splitting the dataset into train/test set.
- fit_params – dict of string -> object; Parameters passed to the fit method of the estimator
Returns: self
class structsearch.SurrogateSearch(objective, **kwargs)[source]¶ An implementation of the Surrogate Random Search (SRS) method.
Parameters: - objective – a callable, the function to be minimized
- ineq – a list of callables which represent the constraints (default: [])
- task_name – str; a name to refer to the optimization task, used to store & restore previously acquired data (default: ‘optim_task’)
- bounds – a list of tuples of real numbers representing the bounds on each variable; default: None
- max_iter – int the maximum number of iterations (default: 50)
- radius – float the initial radius of sampling region (default: 2.)
- contraction – float the rate of radius contraction (default: .9)
- sampling – the sampling method either BoxSample or SphereSample (default SphereSample)
- search_sphere – boolean whether to fit the surrogate function on a neighbourhood of current candidate or over all sampled points (default: False)
- deg – int degree of polynomial regressor if one chooses to fit polynomial surrogates (default: 3)
- min_evals – int minimum number of samples before fitting a surrogate (default will be calculated as if the surrogate is a polynomial of degree 3)
- regressor – a regressor (scikit-learn style) to find a surrogate
- scipy_solver – str the scipy solver (‘COBYLA’ or ‘SLSQP’) to solve the local optimization problem at each iteration (default: ‘COBYLA’)
- max_itr_no_prog – int maximum number of iterations with no progress (default: infinity)
- Continue – boolean continues the progress from where it has been interrupted (default: False)
- warm_start – boolean use data from the previous attempts, but starts from the first iteration (default: False)
- verbose – boolean whether to report the progress on commandline or not (default: False)
Evolutionary Optimization Algorithm¶
class eoa.EOA(population, fitness, **kwargs)[source]¶ This is a base class acting as an umbrella to perform an evolutionary optimization algorithm.
Parameters: - population – The whole possible population as a list
- fitness – The fitness evaluation. Accepts an OrderedDict of individuals with their corresponding fitness and updates their fitness
- init_pop – default=`UniformRand`; The python class that initiates the initial population
- recomb – default=`UniformCrossover`; The python class that defines how to combine parents to produce children
- mutation – default=`Mutation`; The python class that performs mutation on offspring population
- termination – default=`MaxGenTermination`; The python class that determines the termination criterion
- elitism – default=`Elites`; The python class that decides how to handle elitism
- num_parents – The size of initial parents population
- parents_porp – default=0.1; The size of the initial parents population given as a portion of the whole population (only used if num_parents is not given)
- elits_porp – default=0.2; The proportion of offspring to be replaced by elite parents
- mutation_prob – The probability that a component will be mutated (default: 0.05)
- kwargs –
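The pieces listed above (selection, crossover, mutation, elitism, termination) fit together as in the minimal generational sketch below. The names are illustrative, not the eoa API; the fitness here simply counts 1-bits.

```python
import random

def eoa_sketch(length=20, pop_size=30, n_gen=60, p_mut=0.05, seed=1):
    """Maximize the number of 1-bits in a bitstring with a tiny generational EA."""
    rng = random.Random(seed)
    fitness = lambda ind: sum(ind)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(n_gen):
        pop.sort(key=fitness, reverse=True)
        elites = [list(ind) for ind in pop[:2]]              # elitism: keep best two
        children = []
        while len(children) < pop_size - len(elites):
            p1, p2 = rng.sample(pop[:pop_size // 2], 2)      # truncation selection
            child = [rng.choice(bits) for bits in zip(p1, p2)]   # uniform crossover
            child = [b ^ 1 if rng.random() < p_mut else b        # bit-flip mutation
                     for b in child]
            children.append(child)
        pop = elites + children                              # termination: fixed n_gen
    return max(pop, key=fitness)

best = eoa_sketch()
```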
Hilbert Space Based Regression¶
exception NpyProximation.Error(*args)[source]¶ Generic errors that may occur in the course of a run.
class NpyProximation.FunctionBasis[source]¶ This class generates two typical bases of functions: polynomial and trigonometric.
static Fourier(n, deg, l=1.0)[source]¶ Returns the Fourier basis of degree deg in n variables with period l.
Parameters: - n – number of variables
- deg – the maximum degree of trigonometric combinations in the basis
- l – the period
Returns: the raw basis consisting of trigonometric functions of degree up to deg
class NpyProximation.FunctionSpace(dim=1, measure=None, basis=None)[source]¶ A class that facilitates a few types of computations over function spaces of type \(L_2(X, \mu)\)
Parameters: - dim – the dimension of ‘X’ (default: 1)
- measure – an object of type Measure representing \(\mu\)
- basis – a finite basis of functions to construct a subset of \(L_2(X, \mu)\)
FormBasis()[source]¶ Call this method to generate the orthogonal basis corresponding to the given basis. The result is stored in a property called OrthBase, which is a list of functions that are orthogonal to each other with respect to the given measure over the given domain.
Series(f)[source]¶ Given a function f, this method finds and returns the coefficients of the series that approximates f as a linear combination of the elements of the orthogonal basis \(B\); in symbols, \(\sum_{b\in B}\langle f, b\rangle b\).
Returns: the list of coefficients \(\langle f, b\rangle\) for \(b\in B\)
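The orthogonalization behind FormBasis and the series coefficients \(\langle f, b\rangle\) can be sketched for a discrete measure \(\mu=\sum_i w_i\delta_{x_i}\). The helper names below are illustrative, not the NpyProximation API.

```python
def orthonormalize(basis, points, weights):
    """Gram-Schmidt on functions w.r.t. the discrete measure sum_i w_i * delta_{x_i}."""
    def inner(u, v):
        return sum(w * u(x) * v(x) for x, w in zip(points, weights))
    ortho = []
    for f in basis:
        proj = [(inner(f, o), o) for o in ortho]   # coefficients against built basis
        g = lambda x, f=f, proj=proj: f(x) - sum(c * o(x) for c, o in proj)
        nrm = inner(g, g) ** 0.5
        ortho.append(lambda x, g=g, nrm=nrm: g(x) / nrm)
    return ortho

def series(f, ortho, points, weights):
    """Return the coefficients <f, b> of f against the orthonormal basis."""
    return [sum(w * f(x) * b(x) for x, w in zip(points, weights)) for b in ortho]

# orthonormalize {1, x, x^2} over five equally weighted points
pts = [-1.0, -0.5, 0.0, 0.5, 1.0]
wts = [1.0] * len(pts)
B = orthonormalize([lambda x: 1.0, lambda x: x, lambda x: x * x], pts, wts)
coeffs = series(lambda x: 2.0 + 3.0 * x, B, pts, wts)
```

Since the target function lies in the span of the basis, the truncated series reproduces it exactly on the support points.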
class NpyProximation.HilbertRegressor(deg=3, base=None, meas=None, fspace=None)[source]¶ Regression using Hilbert space techniques, scikit-learn style.
Parameters: - deg – int, default=3 The degree of polynomial regression. Only used if base is None
- base – list, default=None; a list of functions forming an orthogonal function basis
- meas – NpyProximation.Measure, default = None the measure to form the \(L_2(\mu)\) space. If None a discrete measure will be constructed based on fit inputs
- fspace – NpyProximation.FunctionBasis, default = None the function subspace of \(L_2(\mu)\), if None it will be initiated according to self.meas
class NpyProximation.Measure(density=None, domain=None)[source]¶ Constructs a measure \(\mu\) based on density and domain.
Parameters: - density – the density over the domain:
- if None is given, the uniform distribution is assumed;
- if a callable h is given, then \(d\mu=h(x)dx\);
- if a dictionary is given, then \(\mu=\sum w_x\delta_x\) is a discrete measure; the points \(x\) are the keys of the dictionary (tuples) and the weights \(w_x\) are the values.
- domain – if density is a dictionary, the domain will be set by its keys. If density is callable, then domain must be a list of tuples defining the domain’s box. If None is given, it will be set to \([-1, 1]^n\)
class NpyProximation.Regression(points, dim=None)[source]¶ Given a set of points, i.e., a list P of tuples of equal length, this class computes the best approximation of a function that fits the data, in the following sense:
- if no extra parameters are provided, meaning that an object is initiated like R = Regression(P), then calling R.fit() returns the linear regression that fits the data;
- if at initiation the parameter deg=n is set, then R.fit() returns the polynomial regression of degree n;
- if a basis of functions is provided by means of an OrthSystem object (R.SetOrthSys(orth)), then calling R.fit() returns the best approximation that can be found using the basis functions of the orth object.
Parameters: - points – a list of points to be fitted or a callable to be approximated
- dim – dimension of the domain
SetFuncSpc(sys)[source]¶ Sets the basis functions from an orthogonal system.
Parameters: sys – an orthsys.OrthSystem object. Returns: None. Note: for technical reasons, the measure needs to be given via the SetMeasure method; otherwise, the Lebesgue measure on \([-1, 1]^n\) is assumed.
Sensitivity Analysis¶
Sensitivity analysis of a dataset based on a fit, sklearn style. The core functionality is provided by SALib.
class sensapprx.CorrelationThreshold(threshold=0.7)[source]¶ Selects a minimal set of features based on a given (Pearson) correlation threshold. The transformer drops the maximum number of highly correlated features and makes sure that the remaining features are not correlated beyond the given threshold.
Parameters: threshold – the threshold for selecting correlated pairs.
fit(X, y=None)[source]¶ Finds the Pearson correlations among all features, selects the pairs whose absolute correlation is above the given threshold, and selects a minimal set of features with low correlation.
Parameters: - X – Training data
- y – Target values (default: None)
Returns: self
fit_transform(X, y=None, **fit_params)[source]¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters: - X – numpy array of shape [n_samples, n_features]; Training set.
- y – numpy array of shape [n_samples]; Target values.
Returns: Transformed array
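A greedy, column-wise sketch of threshold-based selection in pure Python; the function names are illustrative, not the sensapprx implementation.

```python
def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

def select_uncorrelated(columns, threshold=0.7):
    """Greedily keep columns whose |correlation| with every kept column stays below threshold."""
    kept = []
    for j, col in enumerate(columns):
        if all(abs(pearson(col, columns[i])) < threshold for i in kept):
            kept.append(j)
    return kept

# the second column duplicates the first (r = 1), so it is dropped
cols = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 1, 3, 2]]
kept = select_uncorrelated(cols)
```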
class sensapprx.SensAprx(n_features_to_select=10, regressor=None, method='sobol', margin=0.2, num_smpl=512, num_levels=6, grid_jump=1, num_resmpl=8, reduce=False, domain=None, probs=None)[source]¶ Transforms data to select the most influential factors according to a regressor that fits the data.
Parameters: - n_features_to_select – int number of top features to be selected
- regressor – a sklearn style regressor to fit the data for sensitivity analysis
- method – str; the sensitivity analysis method; default ‘sobol’, other options are ‘morris’ and ‘delta-mmnt’
- margin – domain margin, default: .2
- num_smpl – number of samples to perform the analysis, default: 512
- num_levels – number of levels for morris analysis, default: 6
- grid_jump – grid jump for morris analysis, default: 1
- num_resmpl – number of resamples for moment-independent analysis, default: 8
- reduce – whether to reduce the data points to uniques and calculate the averages of the target or not, default: False
- domain – pre-calculated unique points, if none, and reduce is True then unique points will be found
- probs – pre-calculated values associated to domain points
fit(X, y)[source]¶ Fits the regressor to the data (X, y) and performs a sensitivity analysis on the result of the regression.
Parameters: - X – Training data
- y – Target values
Returns: self
fit_transform(X, y=None, **fit_params)[source]¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters: - X – numpy array of shape [n_samples, n_features]; Training set.
- y – numpy array of shape [n_samples]; Target values.
Returns: Transformed array
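A crude screening sketch in the spirit of the Morris method (mean absolute elementary effects). This is a toy for intuition only, not the SALib implementation used by SensAprx, and it ignores num_levels/grid_jump.

```python
import random

def mu_star(f, bounds, n_traj=30, delta=0.1, seed=0):
    """Mean absolute elementary effect of each input (crude Morris-style screening)."""
    rng = random.Random(seed)
    acc = [0.0] * len(bounds)
    for _ in range(n_traj):
        x = [rng.uniform(lo, hi) for lo, hi in bounds]
        fx = f(x)
        for i, (lo, hi) in enumerate(bounds):
            xp = list(x)
            xp[i] = min(x[i] + delta * (hi - lo), hi)    # perturb one input at a time
            h = xp[i] - x[i]
            if h > 0:
                acc[i] += abs(f(xp) - fx) / h
    return [a / n_traj for a in acc]

# the first input dominates the output, so its score should come out largest
mu = mu_star(lambda v: 10.0 * v[0] + 0.5 * v[1] ** 2, [(0.0, 1.0), (0.0, 1.0)])
```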
Optimized Pipeline Detector¶
class aml.AML(config=None, length=5, scoring='accuracy', cat_cols=None, surrogates=None, min_random_evals=15, cv=None, check_point='./', stack_res=True, stack_probs=True, stack_decision=True, verbose=1, n_jobs=-1)[source]¶ A class that accepts a nested dictionary with machine learning libraries as its keys and a dictionary of their parameters and ranges as the value of each key, and finds an optimum combination based on training data.
Parameters: - config – A dictionary whose keys are scikit-learn-style objects (as strings) and its corresponding values are dictionaries of the parameters and their acceptable ranges/values
- length – default=5; Maximum number of objects in generated pipelines
- scoring – default=’accuracy’; The scoring method to be optimized. Must follow the sklearn scoring signature
- cat_cols – default=None; The list of indices of categorical columns
- surrogates – default=None; A list of 4-tuples determining surrogates. The first entity of each tuple is a scikit-learn regressor and the 2nd entity is the number of iterations that this surrogate needs to be estimated and optimized. The 3rd is the sampling strategy and the 4th is the scipy.optimize solver
- min_random_evals – default=15; Number of randomly sampled initial values for hyper parameters
- cv – default=ShuffleSplit(n_splits=3, test_size=.25); The cross-validation method
- check_point – default=’./’; The path where the optimization results will be stored
- stack_res – default=True; StackingEstimator’s res parameter
- stack_probs – default=True; StackingEstimator’s probs parameter
- stack_decision – default=True; StackingEstimator’s decision parameter
- verbose – default=1; Level of output details
- n_jobs – int, default=-1; number of processes to run in parallel
add_surrogate(estimator, itrs, sampling=None, optim='L-BFGS-B')[source]¶ Adds a regressor for the surrogate optimization procedure.
Parameters: - estimator – A scikit-learn style regressor
- itrs – Number of iterations the estimator needs to be fitted and optimized
- sampling – default= BoxSample; The sampling strategy (CompactSample, BoxSample or SphereSample)
- optim – default=’L-BFGS-B’; the scipy.optimize solver
Returns: None
eoa_fit(X, y, **kwargs)[source]¶ Applies evolutionary optimization methods to find an optimum pipeline.
Parameters: - X – Training data
- y – Corresponding observations
- kwargs – EOA parameters
Returns: self
fit(X, y)[source]¶ Generates and optimizes all legitimate pipelines. The best pipeline can be retrieved from self.best_estimator_.
Parameters: - X – Training data
- y – Corresponding observations
Returns: self
get_top(num=5)[source]¶ Finds the top num pipelines.
Parameters: num – Number of pipelines to be returned. Returns: An OrderedDict of top models
optimize_pipeline(seq, X, y)[source]¶ Constructs and optimizes a pipeline according to the steps passed through seq, which is a tuple of estimators and transformers.
Parameters: - seq – the tuple of steps of the pipeline to be optimized
- X – numpy array of training features
- y – numpy array of training values
Returns: the optimized pipeline and its score
class aml.StackingEstimator(estimator, res=True, probs=True, decision=True)[source]¶ Meta-transformer for adding predictions and/or class probabilities as synthetic feature(s).
Parameters: - estimator – object with fit, predict, and predict_proba methods. The estimator to generate synthetic features from.
- res – True (default), stacks the final result of estimator
- probs – True (default), stacks probabilities calculated by estimator
- decision – True (default), stacks the result of decision function of the estimator
fit(X, y=None, **fit_params)[source]¶ Fit the StackingEstimator meta-transformer.
Parameters: - X – array-like of shape (n_samples, n_features). The training input samples.
- y – array-like, shape (n_samples,). The target values (integers that correspond to classes in classification, real numbers in regression).
- fit_params – Other estimator-specific parameters.
Returns: self, object. Returns a copy of the estimator
set_params(**params)[source]¶ Sets the sklearn-related parameters for the estimator.
Parameters: params – parameters to be passed to the estimator. Returns: self
transform(X)[source]¶ Transform data by adding synthetic feature(s).
Parameters: X – numpy ndarray, {n_samples, n_components}. New data, where n_samples is the number of samples and n_components is the number of components. Returns: X_transformed: array-like, shape (n_samples, n_features + 1) or (n_samples, n_features + 1 + n_classes) for classifier with predict_proba attribute; The transformed feature set.
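The stacking idea can be sketched with a trivial stand-in estimator; the names below are illustrative, not the aml implementation, and only the prediction column (the res case) is stacked.

```python
class MeanClassifier:
    """Trivial stand-in estimator: predicts 1 when the row mean exceeds a fitted threshold."""
    def fit(self, X, y):
        self.t = sum(sum(row) / len(row) for row in X) / len(X)
        return self

    def predict(self, X):
        return [1 if sum(row) / len(row) > self.t else 0 for row in X]

def stack_features(estimator, X):
    """Append the estimator's predictions to X as one synthetic column."""
    return [row + [p] for row, p in zip(X, estimator.predict(X))]

X = [[0.0, 0.2], [0.9, 1.0], [0.1, 0.3], [0.8, 0.7]]
y = [0, 1, 0, 1]
Xt = stack_features(MeanClassifier().fit(X, y), X)
```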
class aml.Words(letters, last=None, first=None, repeat=False)[source]¶ This class takes a set as an alphabet and generates words of a given length accordingly. A Words instance accepts the following parameters:
Parameters: - letters – is a set of letters (symbols) to make up the words
- last – a subset of letters that are allowed to appear at the end of a word
- first – a set of letters that can only appear at the beginning of a word
- repeat – whether consecutive occurrence of a letter is allowed
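A brute-force sketch of such word generation with the constraints above; this is illustrative, not the aml.Words implementation.

```python
from itertools import product

def gen_words(letters, length, last=None, first=None, repeat=False):
    """Enumerate words of the given length over letters, optionally constraining the
    first and last letters and (when repeat=False) forbidding consecutive repeats."""
    out = []
    for w in product(letters, repeat=length):
        if first is not None and w[0] not in first:
            continue
        if last is not None and w[-1] not in last:
            continue
        if not repeat and any(a == b for a, b in zip(w, w[1:])):
            continue
        out.append(w)
    return out
```

For example, over the alphabet {a, b} the two-letter words without consecutive repeats are exactly ab and ba.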