2. Optimization

Optimization, or model training, refers to the process of adjusting the values that make up the model (including both parameters and hyperparameters) to enhance the model's performance in solving a given problem. In this section, we will focus on challenges that arise from selecting suboptimal optimization strategies.

2.1 Algorithm

Because an ML algorithm determines how input data are turned into output to solve a particular problem, it is essential to report which one a study implements. Knowing the algorithm helps readers interpret the patterns, relationships, or rules the model has learned and how they are applied to new, unseen data. ML algorithms fall into three major classes:

Supervised Learning (e.g., Linear Regression, Logistic Regression, Decision Trees, Support Vector Machines (SVM)),
Unsupervised Learning (e.g., K-Means Clustering, Principal Component Analysis (PCA), Hierarchical Clustering),
Reinforcement Learning (e.g., Q-Learning, Deep Q-Networks (DQN)).

Key Questions

What is the ML algorithm class used?
Is the ML algorithm new?
If yes, why was it chosen over better known alternatives?

From Example Publication

Majority-based consensus classification based on 8 primary ML methods and post-processing.

2.2 Meta-predictions

Meta-predictions are predictions made by a model that takes the outputs (predictions) of other models as its input. Combining several models in this way, as in ensemble learning, aims to leverage their individual strengths and produce a more robust or accurate final prediction.
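
As an illustration, the simplest form of meta-prediction is a majority vote over the binary outputs of several base models. The sketch below is a minimal example, not the method of any particular publication; the function name majority_vote, the three base models, and the rule that ties resolve to 0 are all assumptions made for the example.

```python
def majority_vote(predictions):
    """Combine binary predictions from several base models by simple majority.

    predictions: one list of 0/1 labels per base model, all of equal length.
    Ties resolve to 0 (an arbitrary choice for this sketch).
    """
    n_models = len(predictions)
    consensus = []
    for votes in zip(*predictions):  # iterate sample by sample
        positives = sum(votes)
        consensus.append(1 if positives * 2 > n_models else 0)
    return consensus


# Outputs of three hypothetical base models on four samples
base_outputs = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
]
print(majority_vote(base_outputs))  # [1, 1, 1, 0]
```

Note that the vote itself has nothing to train, but the base models do: if any of them was trained on samples that appear in the meta-predictor's test set, the reported performance will be optimistic.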

Key Questions

Does the model use data from other ML algorithms as input?
If yes, which ones?
Is it clear that training data of initial predictors and meta-predictor are independent of test data for the meta-predictor?

From Example Publication

Yes, predictor output is a binary prediction computed from the consensus of other methods. Independence of the training sets of the other methods from the test set of the meta-predictor was not tested, since datasets from the other methods were not available.

2.3 Data encoding

Data encoding is the process of transforming data from one format or structure into another, often to make it easier for ML models or computational systems to process. In ML, data often needs to be encoded to ensure that it can be effectively interpreted by algorithms, especially for algorithms that require numerical input (e.g., neural networks, SVMs).
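
One common encoding for categorical data is one-hot encoding, in which each category becomes its own 0/1 column. The following is a minimal sketch in plain Python; the function name one_hot_encode and the example values are hypothetical, and real pipelines usually rely on a library implementation.

```python
def one_hot_encode(values, categories=None):
    """Turn a list of categorical values into one-hot rows.

    If categories is not given, it is derived from the values themselves.
    In practice the category list should come from the training set only
    and then be reused unchanged for validation and test data.
    """
    if categories is None:
        categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # raises KeyError for an unseen category
        encoded.append(row)
    return categories, encoded


cats, X = one_hot_encode(["A", "C", "G", "A"])
print(cats)  # ['A', 'C', 'G']
print(X)     # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```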

Key Questions

How were the data encoded and preprocessed for the ML algorithm?

From Example Publication

Label-wise average of 8 binary predictions.

2.4 Parameters

Model parameters are the internal variables that a model learns from the training data. These parameters determine how the model makes predictions and how well it fits the training data. Their values are adjusted during training by an optimization procedure such as gradient descent.
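
To make the distinction concrete, the sketch below fits a straight line with gradient descent. The two learned values w and b are the model parameters (p = 2), while learning_rate and epochs are hyperparameters chosen before training. The function name fit_line, the toy data, and the hyperparameter values are assumptions made for the example.

```python
def fit_line(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0  # model parameters, learned from the data
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1
w, b = fit_line(xs, ys)
print(round(w, 3), round(b, 3))  # 2.0 1.0
```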

Key Questions

How many parameters (p) are used in the model?
How was p selected?

From Example Publication

p = 3 (Consensus score threshold, expansion-erosion window, length threshold).
No optimization.

2.5 Features

In the context of ML, features are the individual measurable properties or characteristics of the data used to train a model. They play a crucial role in determining model performance, as they provide the information the model needs to make predictions or classifications. Feature engineering is the process of creating, modifying, or selecting the most relevant features from the raw data to improve model performance, for example by reducing model complexity, shortening training time, and avoiding overfitting.

Key Questions

How many features (f) are used as input?
Was feature selection performed?
If yes, was it performed using the training set only?
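
The last question matters because selecting features on the full dataset leaks information from the test set into the model. A minimal sketch of leak-free selection follows: features are ranked by absolute Pearson correlation with the label on the training set only, and the chosen column indices are then applied unchanged to the test set. The scoring rule, the function names, and the toy data are all assumptions made for the example.

```python
def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def select_features(train_X, train_y, k):
    """Return indices of the k features most correlated with the label."""
    n_features = len(train_X[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in train_X]
        scores.append(abs(pearson(column, train_y)))
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def apply_selection(X, selected):
    """Keep only the selected columns of X."""
    return [[row[j] for j in selected] for row in X]


train_X = [[1.0, 5.0, 0.2], [2.0, 5.0, 0.9], [3.0, 5.0, 0.1], [4.0, 5.0, 0.8]]
train_y = [0, 0, 1, 1]
test_X = [[2.5, 5.0, 0.5]]

selected = select_features(train_X, train_y, k=1)  # uses training data only
print(selected)                           # [0]
print(apply_selection(test_X, selected))  # [[2.5]]
```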

From Example Publication

Not applicable.

2.6 Fitting

Fitting refers to the process of training an ML model on a dataset by adjusting its parameters to minimize prediction error. The goal is to find a balance between underfitting and overfitting, ensuring that the model captures the underlying patterns in the data while still generalizing well to unseen data. Proper evaluation, regularization, and tuning of the model during the fitting process are crucial to achieving a good fit.

Key Questions

Is p much larger than the number of training points and/or is f large (for example, in classification is p >> (N_pos + N_neg) and/or f > 100)?
If yes, how was overfitting ruled out?
Conversely, if the number of training points is much larger than p and/or f is small (for example, (N_pos + N_neg) >> p and/or f < 5), how was underfitting ruled out?
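
The rules of thumb in these questions can be written down as a simple screening check. In the sketch below, the thresholds f > 100 and f < 5 come from the questions above, while the interpretation of "much larger" as a factor of 10 (the ratio argument) and the function name fitting_risk are assumptions made for the example. A flag only indicates that the corresponding question needs an answer; it does not show that the model is actually over- or underfitted.

```python
def fitting_risk(p, n_pos, n_neg, f, ratio=10):
    """Flag potential over- or underfitting from simple size heuristics.

    p: number of parameters, f: number of features,
    n_pos / n_neg: number of positive / negative training points.
    ratio: how many times larger counts as "much larger" (assumed value).
    """
    n = n_pos + n_neg
    risks = []
    if p > ratio * n or f > 100:
        risks.append("overfitting")
    if n > ratio * p or f < 5:
        risks.append("underfitting")
    return risks


# Hypothetical study: 3 parameters, 1000 training points, 3 features
print(fitting_risk(p=3, n_pos=500, n_neg=500, f=3))  # ['underfitting']
```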

From Example Publication

Single input ML methods are used with default parameters.
Optimization is a simple majority.

2.7 Regularization

Regularization refers to techniques that prevent overfitting by discouraging the model from becoming too complex, most often by adding a penalty to the loss function and sometimes by constraining the training process itself. Common regularization techniques include:

L1 Regularization (Lasso): Adds a penalty proportional to the absolute value of the coefficients. It encourages sparsity by driving some coefficients to exactly zero.
L2 Regularization (Ridge): Adds a penalty proportional to the square of the coefficients, discouraging large coefficients and thus reducing model complexity.
Dropout (in neural networks): Randomly deactivates a fraction of neurons at each training step, which helps prevent overfitting by stopping the network from relying on any single unit.
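
The three techniques above can be sketched in a few lines of plain Python. The penalty strength lam, the function names, and the example weights are assumptions made for the example; the dropout function uses the common "inverted" variant, which rescales the surviving activations so their expected value is unchanged.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def l1_penalty(weights, lam):
    """Lasso penalty: lam times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """Ridge penalty: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)


def regularized_loss(y_true, y_pred, weights, lam, kind="l2"):
    """Data loss plus the chosen penalty on the weights."""
    if kind == "l1":
        penalty = l1_penalty(weights, lam)
    else:
        penalty = l2_penalty(weights, lam)
    return mse(y_true, y_pred) + penalty


def dropout(activations, rate, rng):
    """Inverted dropout for 0 <= rate < 1: zero each activation with
    probability rate and scale the survivors by 1 / (1 - rate)."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


weights = [3.0, -4.0]
print(l1_penalty(weights, lam=0.5))  # 3.5
print(l2_penalty(weights, lam=0.5))  # 12.5
print(dropout([1.0, 2.0, 3.0], rate=0.5, rng=random.Random(0)))
```

Dropout is applied during training only; at prediction time all units are kept.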

Key Questions

Were any overfitting prevention techniques used (for example, early stopping using a validation set)?
If yes, which ones?
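
Early stopping, mentioned in the first question, can be sketched as follows: training halts once the validation loss has not improved for a set number of epochs, and the best epoch seen so far is kept. The function name early_stopping_epoch, the patience value, and the loss values are assumptions made for the example.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (best_epoch, best_loss) under patience-based early stopping.

    val_losses: validation loss after each epoch, in training order.
    Training stops once `patience` epochs pass without improvement.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_epoch, best_loss


# Hypothetical validation losses; training stops before the final value is seen
losses = [0.9, 0.7, 0.6, 0.65, 0.7, 0.5]
print(early_stopping_epoch(losses, patience=2))  # (2, 0.6)
```

The validation set used for this decision must be separate from the test set, otherwise the stopping point is itself tuned on test data.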

From Example Publication

No.

2.8 Availability of configuration

Availability of configuration refers to the accessibility and transparency of the settings, parameters, and options that can be adjusted or customized in an ML model or system. These configurations control how the model is trained, how it makes predictions, and how it operates in different environments. Ensuring that the configuration is available, flexible, and easy to modify is important for reproducibility, fine-tuning, and deployment of models.
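
One lightweight way to make a configuration available is to store it in a machine-readable file alongside the code and model files. The sketch below serializes a configuration to JSON and reads it back; every key and value shown is hypothetical and only illustrates the kind of information worth recording.

```python
import json

# Hypothetical configuration; all names and values are illustrative only
config = {
    "algorithm": "majority_vote_consensus",
    "hyperparameters": {
        "consensus_score_threshold": 0.5,
        "expansion_erosion_window": 5,
        "length_threshold": 20,
    },
    "random_seed": 42,
    "license": "CC-BY-4.0",
}


def dump_config(cfg):
    """Serialize a configuration deterministically (sorted keys)."""
    return json.dumps(cfg, indent=2, sort_keys=True)


def load_config(text):
    """Parse a configuration previously written by dump_config."""
    return json.loads(text)


text = dump_config(config)
restored = load_config(text)
print(restored == config)  # True
```

Sorting the keys keeps the file stable across runs, which makes changes to the configuration easy to spot under version control.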

Key Questions

Are the hyperparameter configurations, optimization schedule, model files and optimization parameters reported?
If yes, where (for example, URL) and how (license)?

From Example Publication

Not applicable.

Ian Walsh, Dmytro Fishman, Dario Garcia-Gasulla, Tiina Titma, Gianluca Pollastri, Emidio Capriotti, Rita Casadio, Salvador Capella-Gutierrez, Davide Cirillo, Alessio Del Conte, Alexandros C. Dimopoulos, Victoria Dominguez Del Angel, Joaquin Dopazo, Piero Fariselli, José Maria Fernández, Florian Huber, Anna Kreshuk, Tom Lenaerts, Pier Luigi Martelli, Arcadi Navarro, Pilib Ó Broin, Janet Piñero, Damiano Piovesan, Martin Reczko, Francesco Ronzano, Venkata Satagopam, Castrense Savojardo, Vojtech Spiwok, Marco Antonio Tangaro, Giacomo Tartari, David Salgado, Alfonso Valencia, Federico Zambelli, Jennifer Harrow, Fotis E. Psomopoulos, Silvio C. E. Tosatto, and ELIXIR Machine Learning Focus Group. DOME: recommendations for supervised machine learning validation in biology. Nature Methods, 18(10):1122–1127, 2021. doi:10.1038/s41592-021-01205-4. URL: https://doi.org/10.1038/s41592-021-01205-4