Tuning hyper-parameters with CodeFlare Pipelines

GridSearchCV() is often used for hyper-parameter turning for a model constructed via sklearn pipelines. It does an exhaustive search over specified parameter values for a pipeline. It implements a fit() method and a score() method. The parameters of the pipeline used to apply these methods are optimized by cross-validated grid-search over a parameter grid. Here we show how to convert an example of using GridSearchCV() to tune the hyper-parameters of an sklearn pipeline into one that uses Codeflare (CF) pipelines grid_search_cv(). We use the Pipelining: chaining a PCA and a logistic regression from sklearn pipelines as an example.

In this sklearn example, a pipeline is chained together with a PCA and a LogisticRegression. The n_components parameter of the PCA and the C parameter of the LogisticRegression are defined in a param_grid: with n_components in [5, 15, 30, 45, 64] and C defined by np.logspace(-4, 4, 4). A total of 20 combinations of n_components and C parameter values will be explored by GridSearchCV() to find the best one with the highest mean_test_score.

pca = PCA()
logistic = LogisticRegression(max_iter=10000, tol=0.1)
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])

X_digits, y_digits = datasets.load_digits(return_X_y=True)

param_grid = {
    'pca__n_components': [5, 15, 30, 45, 64],
    'logistic__C': np.logspace(-4, 4, 4),
search = GridSearchCV(pipe, param_grid, n_jobs=-1)
search.fit(X_digits, y_digits)
print("Best parameter (CV score=%0.3f):" % search.best_score_)

After running GridSearchCV().fit(), the best parameters of PCA__n_components and LogisticRegression__C, together with the cross-validated mean_test scores are printed out as follows. In this example, the best n_components chosen is 45 for the PCA.

Best parameter (CV score=0.920):
{'logistic__C': 0.046415888336127774, 'pca__n_components': 45}

The PCA explained variance ratio and the best n_components chosen are plotted in the top chart. The classification accuracy and its std_test_score are plotted in the bottom chart. The best n_components can be obtained by calling best_estimator_.named_step['pca'].n_components from the returned object of GridSearchCV().