2015-09-21

This post will cover a few things needed to quickly implement a fast, principled method for machine learning model parameter tuning. There are two common methods of parameter tuning: grid search and random search. Each has its pros and cons. Grid search is slow but effective at searching the whole search space, while random search is fast but can miss important points in the search space. Luckily, a third option exists: Bayesian optimization. In this post, we will focus on one implementation of Bayesian optimization, a Python module called hyperopt.

Using Bayesian optimization for parameter tuning allows us to obtain the best parameters for a given model, e.g., logistic regression. This also allows us to perform optimal model selection. Typically, a machine learning engineer or data scientist will perform some form of manual parameter tuning (grid search or random search) for a few models - like decision tree, support vector machine, and k nearest neighbors - then compare the accuracy scores and select the best one for use. The risk with this approach is that it can end up comparing sub-optimal models. Maybe the data scientist found the optimal parameters for the decision tree, but missed the optimal parameters for SVM; in that case the model comparison was flawed. K nearest neighbors may beat SVM every time if the SVM parameters are poorly tuned. Bayesian optimization allows the data scientist to find the best parameters for every model, and therefore to compare the best models. This results in better model selection, because you are comparing the best k nearest neighbors to the best decision tree. Only in this way can you do model selection with high confidence, assured that the actual best model is selected and used.

Topics covered in this post are:

Objective functions

Search spaces

Storing evaluation trials

Visualization

Full example on a classic dataset: Iris

To use the code below, you must install hyperopt and pymongo.

Objective Functions - A Motivating Example

Suppose you have a function defined over some range, and you want to minimize it. That is, you want to find the input value that results in the lowest output value. The trivial example below finds the value of x that minimizes a linear function y(x) = x.
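Here is a minimal sketch of that call (max_evals=100 is an illustrative choice):

from hyperopt import fmin, tpe, hp

best = fmin(
    fn=lambda x: x,
    space=hp.uniform('x', 0, 1),
    algo=tpe.suggest,
    max_evals=100)
print(best)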

Let's break this down.

The function fmin first takes a function to minimize, denoted fn, which we here specify with an anonymous function lambda x: x. This function could be any valid value-returning function, such as mean absolute error in regression.

The next parameter specifies the search space, and in this example it is the continuous range of numbers between 0 and 1, specified by hp.uniform('x', 0, 1). hp.uniform is a built-in hyperopt function that takes three parameters: the name, x, and the lower and upper bound of the range, 0 and 1.

The parameter algo takes a search algorithm, in this case tpe, which stands for Tree of Parzen Estimators. This topic is beyond the scope of this blog post, but the mathochistic reader may peruse this for details. The algo parameter can also be set to hyperopt.rand.suggest for plain random search, but we do not cover that here since it is a widely known search strategy; we may cover it in a future post.

Finally, we specify the maximum number of evaluations, max_evals, that the fmin function will perform. fmin returns a Python dictionary containing the best values found.

An example of the output for the function above is {'x': 0.000269455723739237}.

Here is the plot of the function. The red dot is the point we are trying to find.



More Complicated Examples

Here is a more complicated objective function: lambda x: (x-1)**2. This time we are trying to minimize a quadratic function y(x) = (x-1)**2. So we alter the search space to include what we know to be the optimal value (x=1) plus some sub-optimal ranges on either side: hp.uniform('x', -2, 2).

Now we have:
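The following is a sketch of that call; only the objective and the search space change from the first example.

from hyperopt import fmin, tpe, hp

best = fmin(
    fn=lambda x: (x - 1) ** 2,
    space=hp.uniform('x', -2, 2),
    algo=tpe.suggest,
    max_evals=100)
print(best)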

The output should again be a dictionary, this time with x close to 1.

Here is the plot.



Instead of minimizing an objective function, maybe we want to maximize it. To do this we need only return the negative of the function. For example, we could have a function y(x) = -(x**2):



How could we go about solving this? We just take the objective function lambda x: -(x**2) and return its negative, giving lambda x: -1*-(x**2), or simply lambda x: (x**2).

Here is one similar to example 1, but instead of minimizing, we are trying to maximize.
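A sketch of that idea, using the y(x) = -(x**2) function discussed above (the symmetric range is an assumption):

from hyperopt import fmin, tpe, hp

# to maximize y(x) = -(x**2), hand hyperopt its negative, x**2
best = fmin(
    fn=lambda x: x ** 2,
    space=hp.uniform('x', -2, 2),
    algo=tpe.suggest,
    max_evals=100)
print(best)  # x should come out close to 0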

Here is a function with many (infinitely many given an infinite range) local minima, which we are also trying to maximize:
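The exact function is not shown here, so as an illustrative stand-in, this sketch maximizes y(x) = x * sin(x), which has many local optima (infinitely many over an infinite range):

import math

from hyperopt import fmin, tpe, hp

# maximize y(x) = x * sin(x) by minimizing its negative
best = fmin(
    fn=lambda x: -x * math.sin(x),
    space=hp.uniform('x', -10, 10),
    algo=tpe.suggest,
    max_evals=500)
print(best)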

Search Spaces

The hyperopt module includes a few handy functions to specify ranges for input parameters. We have already seen hp.uniform. Initially, these are stochastic search spaces, but as hyperopt learns more (as it gets more feedback from the objective function), it adapts and samples different parts of the initial search space that it thinks will give it the most meaningful feedback.

The following will be used in this post:

hp.choice(label, options) where options should be a python list or tuple.

hp.normal(label, mu, sigma) where mu and sigma are the mean and standard deviation, respectively.

hp.uniform(label, low, high) where low and high are the lower and upper bounds on the range.

Others are available, such as hp.lognormal, hp.loguniform, and hp.quniform, but we will not use them here.

To see some draws from the search space, we should import another function, and define the search space.
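One function that does this is hyperopt.pyll.stochastic.sample. Here is a sketch with an illustrative search space:

import hyperopt.pyll.stochastic
from hyperopt import hp

space = {
    'x': hp.uniform('x', 0, 1),
    'y': hp.normal('y', 0, 1),
    'name': hp.choice('name', ['alice', 'bob']),
}

print(hyperopt.pyll.stochastic.sample(space))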

An example output is a dictionary with one sampled value per parameter in the space.

Try running this a few times to see the different samples.

Capturing Info with Trials

It would be nice to see exactly what is happening inside the hyperopt black box. The Trials object allows us to do just that. We need only import a few more items.

The STATUS_OK and Trials imports are new. The Trials object allows us to store info at each time step. We can then print it out and see what the evaluation of the function was for a given parameter at a given time step.
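A sketch of the full loop, minimizing y(x) = x**2 while recording every evaluation (the range and max_evals are illustrative):

from hyperopt import fmin, tpe, hp, STATUS_OK, Trials

fspace = {
    'x': hp.uniform('x', -5, 5)
}

def f(params):
    x = params['x']
    val = x ** 2
    # returning a dict lets us attach a status (and anything else) to each trial
    return {'loss': val, 'status': STATUS_OK}

trials = Trials()
best = fmin(fn=f, space=fspace, algo=tpe.suggest, max_evals=50, trials=trials)

print('best:', best)
for trial in trials.trials[:2]:
    print(trial)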

Each element of trials.trials is a dictionary holding everything recorded for one evaluation of the objective function.

The trials object stores data as a BSON object, which works just like a JSON object. BSON is from the pymongo module. We will not discuss the details here, but there are advanced options for hyperopt that require distributed computing using MongoDB, hence the pymongo import.

Within each trial, 'tid' is the time id, that is, the time step, which runs from 0 to max_evals - 1 and increases by one with each iteration. 'x' is stored under the 'vals' key of the 'misc' dictionary, which is where the parameter values for each iteration live. 'loss' is stored under the 'result' key, which gives us the value of our objective function for that iteration.

Let's look at this in another way.

Visualization

We'll go over two types of visualizations here: val vs. time, and loss vs. val. First, val vs. time. Below is the code and sample output for plotting the trials.trials data described above.
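Here is a sketch of the plotting code, assuming the trials object from the previous section (styling details are illustrative):

import matplotlib.pyplot as plt

# each parameter is stored in 'vals' as a one-element list, hence the [0]
xs = [t['tid'] for t in trials.trials]
ys = [t['misc']['vals']['x'][0] for t in trials.trials]

plt.scatter(xs, ys, s=20, alpha=0.75)
plt.title('x vs t')
plt.xlabel('t')
plt.ylabel('x')
plt.show()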

The output should look like this, assuming we change max_evals to 1000.

We can see that initially the algorithm picks values from the whole range equally (uniformly), but as time goes on and more is learned about the parameter's effect on the objective function, the algorithm focuses more and more on areas in which it thinks it will gain the most - the range close to zero. It still explores the whole solution space, but less frequently.

Now let's look at the plot of loss vs. val.
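A sketch of that plot, using the same trials object:

import matplotlib.pyplot as plt

xs = [t['misc']['vals']['x'][0] for t in trials.trials]
ys = [t['result']['loss'] for t in trials.trials]

plt.scatter(xs, ys, s=20, alpha=0.75)
plt.title('loss vs x')
plt.xlabel('x')
plt.ylabel('loss')
plt.show()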

This gives us what we expect, since the function y(x) = x**2 is deterministic.

To wrap up, let's try a more complicated example, with more randomness and more parameters.

The Iris Dataset

In this section, we'll walk through 4 full examples of using hyperopt for parameter tuning on a classic dataset, Iris. We will cover K-Nearest Neighbors (KNN), Support Vector Machines (SVM), Decision Trees, and Random Forests. Note that since we are trying to maximize the cross-validation accuracy (acc in the code below), we must negate this value for hyperopt, since hyperopt only knows how to minimize a function. Minimizing a function f is the same as maximizing the negative of f.

For this task, we'll use the classic Iris data set, and do some supervised machine learning. There are 4 input features and three output classes. The data are labeled as belonging to class 0, 1, or 2, which map to different species of Iris flower. The input has 4 columns: sepal length, sepal width, petal length, and petal width. Units of the input are centimeters. We will use these 4 features to learn a model that predicts one of the three output classes. Since the data is provided by sklearn, it has a nice DESCR attribute that provides details on the data set. Try the following for more details.
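For example (these attributes come straight from sklearn's bundled copy of the dataset):

from sklearn import datasets

iris = datasets.load_iris()
print(iris.feature_names)  # input names
print(iris.target_names)   # output names
print(iris.DESCR)          # everything else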

Let's get to know the data a little better through visualization of the features and classes, using the code below. Don't forget to pip install seaborn if you have not already.
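One way to do this is with a seaborn pairplot (seaborn ships its own copy of Iris as a pandas DataFrame; the plot type chosen here is illustrative):

import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style='whitegrid')

iris_df = sns.load_dataset('iris')
sns.pairplot(iris_df, hue='species')
plt.show()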

Here is the plot:

K-Nearest Neighbors

We now apply hyperopt to finding the best parameters to a K-Nearest Neighbor (KNN) machine learning model. The KNN model classifies a data point from the test set based on the majority class of the k nearest data points in the training data set. More information on this algorithm can be found here.

The code below incorporates everything we have covered.
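What follows is a sketch of that code (the n_neighbors range and max_evals are illustrative; cross_val_score uses sklearn's default cross-validation settings):

from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials

iris = datasets.load_iris()
X = iris.data
y = iris.target

def hyperopt_train_test(params):
    clf = KNeighborsClassifier(**params)
    return cross_val_score(clf, X, y).mean()

space4knn = {
    'n_neighbors': hp.choice('n_neighbors', list(range(1, 100)))
}

def f(params):
    acc = hyperopt_train_test(params)
    # hyperopt minimizes, so negate the accuracy
    return {'loss': -acc, 'status': STATUS_OK}

trials = Trials()
best = fmin(f, space4knn, algo=tpe.suggest, max_evals=100, trials=trials)
print('best:', best)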

Now let's see the plot of the output. The y axis is the cross validation score, and the x axis is the k value in k-nearest-neighbors. Here is the code and its image:
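A sketch of the plotting code (note that hp.choice records the index of the chosen option, so we add 1 to recover the actual k, since our options started at 1):

import matplotlib.pyplot as plt

xs = [t['misc']['vals']['n_neighbors'][0] + 1 for t in trials.trials]
ys = [-t['result']['loss'] for t in trials.trials]

plt.scatter(xs, ys, s=20, alpha=0.5)
plt.title('Iris dataset - KNN')
plt.xlabel('n_neighbors')
plt.ylabel('cross validation accuracy')
plt.show()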

After k is greater than 63, the accuracy drops precipitously. This is due to the number of each class in the dataset. There are only 50 instances of each of the three classes. So let's drill down by limiting the values of 'n_neighbors' to smaller values.
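For example, narrowing the search space like this (the upper bound of 50 is a judgment call):

space4knn = {
    'n_neighbors': hp.choice('n_neighbors', list(range(1, 50)))
}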

Here is what we get when we run the same code for visualization:

Now we can see clearly that there is a best value for k, at k = 4.

The model above does not do any preprocessing. So let's normalize and scale our features and see if that helps. Use this code:
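A sketch of that idea, reusing the imports and data from the KNN block above and adding sklearn's normalize and scale as optional preprocessing steps controlled by the search space:

from sklearn.preprocessing import normalize, scale

def hyperopt_train_test(params):
    X_ = X[:]
    # 'normalize' and 'scale' are search-space flags, not KNN parameters,
    # so remove them from params after use
    if 'normalize' in params:
        if params['normalize'] == 1:
            X_ = normalize(X_)
        del params['normalize']
    if 'scale' in params:
        if params['scale'] == 1:
            X_ = scale(X_)
        del params['scale']
    clf = KNeighborsClassifier(**params)
    return cross_val_score(clf, X_, y).mean()

space4knn = {
    'n_neighbors': hp.choice('n_neighbors', list(range(1, 50))),
    'scale': hp.choice('scale', [0, 1]),
    'normalize': hp.choice('normalize', [0, 1]),
}

def f(params):
    acc = hyperopt_train_test(params)
    return {'loss': -acc, 'status': STATUS_OK}

trials = Trials()
best = fmin(f, space4knn, algo=tpe.suggest, max_evals=100, trials=trials)
print('best:', best)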

And plot the parameters like this:
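Here is a sketch of a per-parameter view of the same trials object (for hp.choice parameters the x axis shows option indices):

import numpy as np
import matplotlib.pyplot as plt

parameters = ['n_neighbors', 'scale', 'normalize']
fig, axes = plt.subplots(nrows=1, ncols=len(parameters), figsize=(15, 5))
for i, param in enumerate(parameters):
    xs = np.array([t['misc']['vals'][param] for t in trials.trials]).ravel()
    ys = np.array([-t['result']['loss'] for t in trials.trials])
    axes[i].scatter(xs, ys, s=20, alpha=0.5)
    axes[i].set_title(param)
    axes[i].set_ylabel('cross validation accuracy')
plt.show()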

We see that scaling and/or normalizing the data does not improve predictive accuracy. The best value of k remains 4, which results in 98.6 % accuracy.

So this is great for parameter tuning a simple model, KNN. Let's see what we can do with Support Vector Machines (SVM).

Support Vector Machines (SVM)

Since this is a classification task, we'll use sklearn's SVC class. Here is the code:
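A sketch of that code, following the same pattern as the KNN example (the C and gamma ranges and the kernel list are assumptions, chosen to be consistent with the best values reported below):

from sklearn.svm import SVC

def hyperopt_train_test(params):
    X_ = X[:]
    if 'normalize' in params:
        if params['normalize'] == 1:
            X_ = normalize(X_)
        del params['normalize']
    if 'scale' in params:
        if params['scale'] == 1:
            X_ = scale(X_)
        del params['scale']
    clf = SVC(**params)
    return cross_val_score(clf, X_, y).mean()

space4svm = {
    'C': hp.uniform('C', 0, 20),
    'kernel': hp.choice('kernel', ['linear', 'sigmoid', 'poly', 'rbf']),
    'gamma': hp.uniform('gamma', 0, 20),
    'scale': hp.choice('scale', [0, 1]),
    'normalize': hp.choice('normalize', [0, 1]),
}

def f(params):
    acc = hyperopt_train_test(params)
    return {'loss': -acc, 'status': STATUS_OK}

trials = Trials()
best = fmin(f, space4svm, algo=tpe.suggest, max_evals=100, trials=trials)
print('best:', best)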

Here is what we get:

Again, scaling and normalizing do not help. The first choice of kernel function (linear) is the best, the best C value is 1.4168540399911616, and the best gamma is 15.04230279483486. This set of parameters results in 99.3 % classification accuracy.

Decision Trees

We will only attempt to optimize on a few parameters of decision trees. Here is the code.

from sklearn import datasets
from sklearn.preprocessing import normalize, scale
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials

iris = datasets.load_iris()
X = iris.data
y = iris.target

def hyperopt_train_test(params):
    X_ = X[:]
    # 'normalize' and 'scale' are search-space flags, not
    # DecisionTreeClassifier parameters, so remove them after use
    if 'normalize' in params:
        if params['normalize'] == 1:
            X_ = normalize(X_)
        del params['normalize']
    if 'scale' in params:
        if params['scale'] == 1:
            X_ = scale(X_)
        del params['scale']
    clf = DecisionTreeClassifier(**params)
    return cross_val_score(clf, X_, y).mean()

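The original code is truncated at this point. A sketch of how the rest might look, with a search space over a few DecisionTreeClassifier parameters (the ranges and max_evals are assumptions):

space4dt = {
    'max_depth': hp.choice('max_depth', list(range(1, 20))),
    'max_features': hp.choice('max_features', list(range(1, 5))),
    'criterion': hp.choice('criterion', ['gini', 'entropy']),
    'scale': hp.choice('scale', [0, 1]),
    'normalize': hp.choice('normalize', [0, 1]),
}

def f(params):
    acc = hyperopt_train_test(params)
    return {'loss': -acc, 'status': STATUS_OK}

trials = Trials()
best = fmin(f, space4dt, algo=tpe.suggest, max_evals=300, trials=trials)
print('best:', best)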