
4. Evaluation

In implementing a robust and trustworthy ML method, providing a comprehensive data description, adhering to a correct optimization protocol, and ensuring that the model is clearly defined and openly accessible are critical first steps. Equally important is employing a valid assessment methodology to evaluate the final model.


In biological research, there are two main types of evaluation scenarios for ML models:

  1. Experimental Validation: This involves validating the predictions made by the ML model through laboratory experiments. Although highly desirable, this approach is often beyond the scope of many ML studies.

  2. Computational Assessment: This involves evaluating the model’s performance using established metrics. This section focuses on computational assessment and highlights a few potential risks.

When it comes to performance metrics, which are quantifiable indicators of a model’s ability to address a specific task, there are numerous metrics available for various ML classification and regression problems. The wide range of options, along with the domain-specific knowledge needed to choose the right metrics, can result in the selection of inappropriate performance measures. It is advisable to use metrics recommended by critical assessment communities relevant to biological ML models, such as the Critical Assessment of Protein Function Annotation (CAFA) and the Critical Assessment of Genome Interpretation (CAGI).

Once appropriate performance metrics are selected, methods published in the same biological domain should be compared using suitable statistical tests (e.g., Student’s t-test) and confidence intervals. Additionally, to avoid releasing ML methods that seem advanced but do not outperform simpler algorithms, it is important to compare these methods against baseline models and demonstrate their statistical superiority (e.g., comparing shallow versus deep neural networks) [1].
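As an illustration of the kind of statistical comparison recommended here, the sketch below pairs per-fold cross-validation scores of a simpler and a more complex model and applies a paired Student’s t-test. The dataset, models, and scoring metric are illustrative assumptions, not taken from any particular study.

```python
# Minimal sketch: paired Student's t-test on per-fold scores of two models.
# Synthetic data and generic models are placeholders for a real benchmark.
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
cv = KFold(n_splits=10, shuffle=True, random_state=0)  # same folds for both models

simple_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="balanced_accuracy"
)
deep_scores = cross_val_score(
    MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0),
    X, y, cv=cv, scoring="balanced_accuracy",
)

# Paired test: fold i of one model is compared with fold i of the other.
t_stat, p_value = ttest_rel(deep_scores, simple_scores)
print(f"simple={simple_scores.mean():.3f} deep={deep_scores.mean():.3f} p={p_value:.3f}")
```

A low p-value alone is not enough; the effect size (the difference in mean scores) should also be large enough to matter in practice.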

4.1 Evaluation method

Evaluation of an ML model is the process of assessing its performance and effectiveness in making predictions or classifications on new, unseen data. Proper evaluation is crucial to ensure that the model generalizes well and performs as expected in real-world applications.
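A minimal sketch of the two most common computational evaluation setups, k-fold cross-validation and a held-out test set, assuming a generic scikit-learn classifier on synthetic data. Note that a truly independent dataset, as reported in the example publication, comes from a separate source rather than a random split of the same data.

```python
# Minimal sketch of two evaluation setups; data and model are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, n_features=30, random_state=42)
model = RandomForestClassifier(random_state=42)

# Option 1: k-fold cross-validation on the development data.
cv_scores = cross_val_score(model, X, y, cv=5, scoring="balanced_accuracy")
print(f"5-fold CV balanced accuracy: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")

# Option 2: a held-out test set. A truly independent dataset would come from
# a separate source (different lab, release, or organism), not a random split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
model.fit(X_train, y_train)
test_score = balanced_accuracy_score(y_test, model.predict(X_test))
print(f"Held-out balanced accuracy: {test_score:.3f}")
```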

Key Questions

  • How was the method evaluated (for example cross-validation, independent dataset, novel experiments)?

From Example Publication

Independent dataset

4.2 Performance measures

The choice of evaluation metrics depends on the type of problem (regression or classification) and the specific goals of the analysis.

Regression Metrics

  • Mean Absolute Error (MAE)
  • Mean Squared Error (MSE)
  • Root Mean Squared Error (RMSE)
  • R-squared (R²)

Classification Metrics

  • Accuracy
  • Precision
  • Recall (Sensitivity)
  • F1 Score
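The sketch below shows how the metrics listed above can be computed with scikit-learn. The toy true/predicted values are placeholders, not results from any real model.

```python
# Minimal sketch: computing the listed regression and classification metrics.
from sklearn.metrics import (
    accuracy_score, f1_score, mean_absolute_error, mean_squared_error,
    precision_score, r2_score, recall_score,
)

# Regression metrics (toy values).
y_true_reg = [2.5, 0.0, 2.1, 7.8]
y_pred_reg = [3.0, -0.3, 2.0, 8.1]
mae = mean_absolute_error(y_true_reg, y_pred_reg)
mse = mean_squared_error(y_true_reg, y_pred_reg)
rmse = mse ** 0.5
r2 = r2_score(y_true_reg, y_pred_reg)
print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f} R2={r2:.3f}")

# Classification metrics (toy values).
y_true_clf = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred_clf = [1, 0, 1, 0, 0, 1, 1, 0]
print(f"Accuracy={accuracy_score(y_true_clf, y_pred_clf):.3f}")
print(f"Precision={precision_score(y_true_clf, y_pred_clf):.3f}")
print(f"Recall={recall_score(y_true_clf, y_pred_clf):.3f}")
print(f"F1={f1_score(y_true_clf, y_pred_clf):.3f}")
```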

Key Questions

  • Which performance metrics are reported (accuracy, sensitivity, specificity, etc.)?
  • Is this set representative (for example, compared to the literature)?

From Example Publication

Balanced Accuracy, Precision, Sensitivity, Specificity, F1, MCC.

4.3 Comparison

Comparison typically refers to the evaluation of different models, algorithms, or configurations to identify which one performs best for a specific task. This process is essential for selecting the most suitable approach for a given problem, optimizing performance, and understanding the strengths and weaknesses of various methods.
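As a sketch of the simplest kind of baseline comparison, the example below evaluates a trivial majority-class predictor and a more complex model on the same benchmark split. The dataset and models are illustrative assumptions; a real study would benchmark against published methods as well.

```python
# Minimal sketch: compare a method against a trivial baseline on the same split.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data, where a majority-class baseline can look deceptively good.
X, y = make_classification(n_samples=1000, n_features=25, weights=[0.8, 0.2], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=1)

# Trivial baseline: always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
# The "advanced" method being benchmarked.
method = GradientBoostingClassifier(random_state=1).fit(X_tr, y_tr)

for name, clf in [("baseline", baseline), ("method", method)]:
    score = balanced_accuracy_score(y_te, clf.predict(X_te))
    print(f"{name}: balanced accuracy = {score:.3f}")
```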

Key Questions

  • Was a comparison to publicly available methods performed on benchmark datasets?
  • Was a comparison to simpler baselines performed?

From Example Publication

DisEmbl-465, DisEmbl-HL, ESpritz Disprot, ESpritz NMR, ESpritz Xray, Globplot, IUPred long, IUPred short, VSL2b. The chosen methods are those from which the meta-prediction is obtained.

4.4 Confidence

Confidence in the context of ML refers to the measure of certainty or belief that a model's prediction is accurate. It quantifies the model's certainty regarding its output, which is particularly important in classification tasks, where decisions need to be made based on predicted class probabilities. This can be supported with methods such as confidence intervals and statistical significance tests.
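A minimal sketch of attaching a bootstrap confidence interval to a reported metric (here MCC, one of the metrics listed in the example publication). The labels and predictions are synthetic placeholders standing in for a real model's output on a test set.

```python
# Minimal sketch: 95% bootstrap confidence interval for a test-set metric.
import numpy as np
from sklearn.metrics import matthews_corrcoef

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)                          # placeholder test labels
y_pred = np.where(rng.random(200) < 0.8, y_true, 1 - y_true)   # ~80%-correct placeholder predictions

# Resample the test set with replacement and recompute the metric each time.
n_boot = 2000
scores = []
for _ in range(n_boot):
    idx = rng.integers(0, len(y_true), size=len(y_true))
    scores.append(matthews_corrcoef(y_true[idx], y_pred[idx]))

low, high = np.percentile(scores, [2.5, 97.5])
print(f"MCC = {matthews_corrcoef(y_true, y_pred):.3f} (95% CI: {low:.3f} to {high:.3f})")
```

The same resampling loop can wrap any of the metrics from Section 4.2, and non-overlapping intervals between two methods give an informal indication of a meaningful difference.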

Key Questions

  • Do the performance metrics have confidence intervals?
  • Are the results statistically significant enough to claim that the method is superior to other methods and baselines?

From Example Publication

Not calculated.

4.5 Availability of evaluation

Availability of evaluation in ML refers to the accessibility and readiness of tools, frameworks, datasets, and methodologies used to assess the performance of ML models. This encompasses various aspects, from the datasets used for evaluation to the metrics and software tools that facilitate the evaluation process.

Key Questions

  • Are the raw evaluation files (for example, assignments for comparison and baselines, statistical code, confusion matrices) available?
  • If yes, where (for example, URL) and how (license)?

From Example Publication

Not available.



  1. Ian Walsh, Dmytro Fishman, Dario Garcia-Gasulla, Tiina Titma, Gianluca Pollastri, Emidio Capriotti, Rita Casadio, Salvador Capella-Gutierrez, Davide Cirillo, Alessio Del Conte, Alexandros C. Dimopoulos, Victoria Dominguez Del Angel, Joaquin Dopazo, Piero Fariselli, José Maria Fernández, Florian Huber, Anna Kreshuk, Tom Lenaerts, Pier Luigi Martelli, Arcadi Navarro, Pilib Ó Broin, Janet Piñero, Damiano Piovesan, Martin Reczko, Francesco Ronzano, Venkata Satagopam, Castrense Savojardo, Vojtech Spiwok, Marco Antonio Tangaro, Giacomo Tartari, David Salgado, Alfonso Valencia, Federico Zambelli, Jennifer Harrow, Fotis E. Psomopoulos, Silvio C. E. Tosatto, and ELIXIR Machine Learning Focus Group. DOME: recommendations for supervised machine learning validation in biology. Nature Methods, 18(10):1122–1127, 2021. https://doi.org/10.1038/s41592-021-01205-4