February 12, 2021
Trends in Hyperparameter Searching
If you’ve ever trained an ML model from the scikit-learn Python library, you’ve probably noticed that each model has certain options that can be specified prior to training. For example, the Decision Tree classifier lets you specify how many features to consider at each split with the "max_features" parameter.
The scikit-learn library provides default values for these options, but those values might not be optimal for your particular dataset. In scikit-learn, "max_features" of a Decision Tree defaults to the number of features in the dataset, but if you're working with genome data with 100,000 features, it may be too expensive to search all of them at each split of the decision tree. These options, or parameters, that are set before training the model are referred to as “hyperparameters”.
Changing the value of a hyperparameter can affect the performance of the model, but measuring that effect requires training a new model each time a value is changed.
After deciding which hyperparameter configurations to search through, finding the best one is conceptually simple: evaluate the performance of each combination and pick the best. Seldom, however, do we have enough time or computing resources to search all of the combinations. One alternative is random search, which evaluates only a randomly selected subset of the search space; it can also avoid over-searching even when an exhaustive search is affordable. The search doesn't have to be uninformed, though. For example, if we observe that increasing a hyperparameter consistently results in overfitting, then instead of increasing it further we might explore lower values.
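The difference between an exhaustive search and a random search can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the search space, the "evaluate" function and its made-up scoring formula are hypothetical stand-ins for training a real model, and the random search here draws from the same discrete grid rather than from continuous distributions.

```python
import itertools
import random

# Hypothetical search space; the names mirror scikit-learn decision tree
# parameters, but no model is trained in this sketch.
search_space = {
    "max_depth": [2, 4, 8, 16],
    "max_features": [0.25, 0.5, 0.75, 1.0],
    "min_samples_leaf": [1, 5, 10],
}


def evaluate(config):
    """Stand-in for 'train a model and return its validation score'.

    A made-up function that peaks at max_depth=8, max_features=0.5,
    min_samples_leaf=5. Replace it with real training code.
    """
    return (
        1.0
        - 0.01 * abs(config["max_depth"] - 8)
        - 0.2 * abs(config["max_features"] - 0.5)
        - 0.005 * abs(config["min_samples_leaf"] - 5)
    )


def all_configs(space):
    """Enumerate every combination of values in the search space."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in itertools.product(*space.values())]


def grid_search(space):
    """Evaluate every combination and return the best one."""
    return max(all_configs(space), key=evaluate)


def random_search(space, n_trials, seed=0):
    """Evaluate only n_trials randomly chosen combinations."""
    rng = random.Random(seed)
    candidates = rng.sample(all_configs(space), n_trials)
    return max(candidates, key=evaluate)


best_grid = grid_search(search_space)
best_random = random_search(search_space, n_trials=12)
print("grid search evaluated", len(all_configs(search_space)), "configs:", best_grid)
print("random search evaluated 12 configs:", best_random)
```

The grid search pays for all 48 evaluations, while the random search pays for 12 and accepts that it may miss the best combination.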
Hyperparameter Search Algorithms
There are hyperparameter search algorithms that can select which configurations to search next based on the performance of the previous selections and/or save time by cutting short the training of the models with unpromising hyperparameters.
For example, algorithms that use Bayesian optimization model how the hyperparameters affect performance and then evaluate the configurations that are most likely to perform well. That model is refined each time a new configuration is evaluated. A more recent approach, Hyperband, relies on the idea that a poorly performing configuration can often be detected before model training converges, so its training can be halted early to save resources.
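The early-stopping idea can be illustrated with successive halving, the subroutine that Hyperband builds on. The sketch below is not Hyperband itself and not any library's implementation: the one-dimensional search space, the "partial_score" learning curve and the default values (27 configurations, a reduction factor of 3) are all assumptions chosen for illustration. In this toy curve the ranking of configurations never changes with the budget; real learning curves can cross, which is why Hyperband runs several such brackets with different trade-offs.

```python
import math
import random


def sample_config(rng):
    # Hypothetical search space: a learning rate between 1e-4 and 1.
    return {"lr": 10 ** rng.uniform(-4, 0)}


def partial_score(config, budget):
    """Stand-in for 'train for `budget` epochs and return validation score'.

    Made-up learning curve: the score approaches a config-dependent ceiling
    as the budget grows, and the ceiling is highest near lr = 1e-2.
    """
    ceiling = 1.0 - 0.2 * abs(math.log10(config["lr"]) + 2)
    return ceiling * (1 - math.exp(-budget / 4))


def successive_halving(n_configs=27, min_budget=1, eta=3, seed=0):
    """Keep the top 1/eta of configs each round and give them eta times the budget."""
    rng = random.Random(seed)
    configs = [sample_config(rng) for _ in range(n_configs)]
    budget = min_budget
    total_cost = 0
    while True:
        # Evaluate every surviving config at the current budget.
        # Cost is counted as if each round trained from scratch.
        scored = sorted(configs, key=lambda c: partial_score(c, budget), reverse=True)
        total_cost += len(configs) * budget
        if len(scored) == 1:
            return scored[0], budget, total_cost
        configs = scored[: max(1, len(scored) // eta)]
        budget *= eta


best, final_budget, total_cost = successive_halving()
print("best config:", best)
print("epochs spent:", total_cost, "versus", 27 * final_budget, "for full training of all configs")
```

With these defaults the rounds evaluate 27, 9, 3 and 1 configurations at budgets of 1, 3, 9 and 27 epochs, so only the single survivor ever receives the full budget.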
Although these algorithms add overhead compared to random search, they can be faster overall when training each configuration is expensive enough (as is the case for deep learning models). In the Hyperband paper, the authors show that for random search to identify a configuration with performance comparable to Hyperband's, it requires twenty times more resources when training a deep convolutional neural network and seventy times more when training a kernel-based classifier.
For the practitioner, there are libraries that provide an interface between hyperparameter search methods and machine learning models. Ray Tune, which we're going to focus on, and Optuna are two such libraries.
Ray Tune is agnostic to the library that the ML model is written in. In the simplest setting, the only requirements are being able to set the hyperparameters of the model and obtain its performance in return. The project provides examples of using it with popular ML frameworks such as PyTorch and XGBoost, and for scikit-learn there are ready-to-use wrappers.
Ray Tune has native implementations of hyperparameter search algorithms such as HyperBand and population-based training, and it supports other open-source libraries, such as the Bayesian Optimization library, through wrappers.
Even if one is interested in simply doing a random search, Ray Tune can help with organizing the results of the search and distributing it across multiple machines.
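The contract described above (hyperparameters in, performance out, with results collected in one place) can be sketched without any library. The code below is not Ray Tune's API; "run_search" and "toy_trainable" are hypothetical names, and local threads stand in for the multi-machine distribution that a real tool would handle.

```python
from concurrent.futures import ThreadPoolExecutor


def run_search(trainable, configs, max_workers=4):
    """Evaluate every config with `trainable` and return the results, best first.

    `trainable` can wrap a model from any framework; the driver only needs
    it to accept a config dict and return a score (higher is better).
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(trainable, configs))
    results = [{"config": c, "score": s} for c, s in zip(configs, scores)]
    return sorted(results, key=lambda r: r["score"], reverse=True)


def toy_trainable(config):
    # Hypothetical objective standing in for 'train and validate a model'.
    return -((config["x"] - 3) ** 2)


configs = [{"x": x} for x in range(7)]
results = run_search(toy_trainable, configs)
for row in results[:3]:
    print(row)
```

Because the driver never looks inside the trainable, swapping in a PyTorch or XGBoost training function would only require returning its validation metric.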
More recently, the popular NLP library huggingface/transformers added hyperparameter search functionality to its trainers through Ray Tune and Optuna. In the example blog post, the authors show how, with an additional line of code, one can run a hyperparameter search for fine-tuning the BERT model on the Microsoft Research Paraphrase Corpus.
Kevin Murphy, on his bayesnet GitHub page, shares a quotation from Alfred North Whitehead: "Civilization advances by extending the number of important operations that we can do without thinking about them." Although one still needs to think about and define the hyperparameter search space by understanding what each hyperparameter does, any automation in the search is definitely welcome. The recent approaches in hyperparameter optimization show promising speedups compared to random search. Ray Tune, which is open-source, provides a simple interface between its hyperparameter search algorithms and the machine learning engineer’s models. Furthermore, its support for organizing the results of different training runs can itself be a reason to try the library.
- Bergstra, J. and Y. Bengio, Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012. 13(1): p. 281-305.
- Li, L., et al., Hyperband: A novel bandit-based approach to hyperparameter optimization. The Journal of Machine Learning Research, 2017. 18(1): p. 6765-6816.
- Liaw, R., et al., Tune: A research platform for distributed model selection and training. arXiv preprint arXiv:1807.05118, 2018.
- Akiba, T., et al., Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019.