Introduction
The need for standardization
With the significant drop in the cost of many high-throughput technologies, vast amounts of biological data are being generated and made available to researchers. Machine learning (ML) has emerged as a powerful tool for analyzing data related to cellular processes, genomics, proteomics, post-translational modifications, metabolism, and drug discovery, offering the potential for transformative medical advances. This trend is evident in the growing number of ML publications in biology, showcasing a wide array of modeling techniques. However, although ML methods should ideally be experimentally validated, this happens in only a small fraction of studies. We believe the time is right for the ML community to establish standards for reporting ML-based analyses to facilitate critical evaluation and enhance reproducibility [1].

Figure: Number of ML publications per year, based on Web of Science records from 1996 onwards, using the topic category “machine learning” in combination with each of the following terms: “biolog*”, “medicine”, “genom*”, “prote*”, “cell*”, “post translational”, “metabolic” and “clinical”.
Guidelines or recommendations on the proper construction of ML algorithms can help ensure accurate results and predictions. In biomedical research, various communities have established standard guidelines and best practices for managing scientific data and ensuring the reproducibility of computational tools. Similarly, within the ML community there is a growing need for a unified set of recommendations that comprehensively addresses data handling, optimization techniques, model development, and evaluation protocols.
A recent commentary emphasized the need for standards in ML, suggesting that introducing submission checklists could be a first step toward improving publication practices. In response, a community-driven consensus list of minimal requirements was proposed, framed as questions for ML implementers. Adhering to these guidelines makes it possible to assess the quality and reliability of reported methods more accurately. Our focus is on data, optimization, model, and evaluation (DOME), as these four components encompass the core aspects of most ML implementations. The recommendations are primarily aimed at supervised learning in biological applications without direct experimental validation, as this is the most commonly used ML approach. We do not address the use of ML in clinical settings, and it remains to be seen whether the DOME recommendations can be applied to other areas of ML, such as unsupervised, semi-supervised, or reinforcement learning.
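In practice, the answers to the DOME questions can be collected in a small, machine-readable record that accompanies a publication. The sketch below is purely illustrative: the field names and example values are assumptions for demonstration, not an official DOME schema.

```python
# Purely illustrative sketch of a machine-readable DOME-style report.
# Field names and values are assumptions, not an official DOME schema.
import json

dome_report = {
    "data": {
        "source": "example protein dataset",   # provenance of the data
        "n_training": 8000,                    # size of the training set
        "n_evaluation": 2000,                  # size of the held-out set
        "splits_released": True,               # exact splits made available
    },
    "optimization": {
        "algorithm": "random forest",
        "hyperparameters": {"n_estimators": 200, "max_depth": None},
        "validation": "5-fold cross-validation",
        "features_selected_on_evaluation_set": False,
    },
    "model": {
        "interpretable": False,                # black box vs. interpretable
        "source_code_url": "https://example.org/code",
        "mean_runtime_seconds": 1.3,
    },
    "evaluation": {
        "metrics": ["MCC", "AUC"],
        "baselines": ["majority-class classifier"],
        "confidence_intervals": True,
    },
}

print(json.dumps(dome_report, indent=2))
```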
Development of the recommendations
| Broad topic | Be on the lookout for | Consequences | Recommendation(s) |
|---|---|---|---|
| Data | • Inadequate data size & quality • Inappropriate partitioning, dependence between train and test data • Class imbalance • No access to data | • Data not representative of the domain application • Unreliable or biased performance evaluation • Data credibility cannot be checked | • Use independent optimization (training) and evaluation (testing) sets. This is especially important for meta-algorithms, where the multiple training sets must be shown to be independent of the evaluation (testing) sets. • Release data, preferably in appropriate long-term repositories, and include the exact splits. • Offer sufficient evidence that the data size & distribution are representative of the domain. |
| Optimization | • Overfitting, underfitting and illegal parameter tuning • Imprecise parameters and protocols given | • Reported performance is too optimistic or too pessimistic • The model fits noise or misses relevant relationships • Results are not reproducible | • Clarify that evaluation sets were not used for feature selection. • Report indicators on training and testing data that can aid in assessing the possibility of under- or overfitting, for example train vs. test error. • Release definitions of all algorithmic hyperparameters, regularization protocols, parameters and the optimization protocol. • For neural networks, release training and learning curves. • Include explicit model validation techniques such as N-fold cross-validation. |
| Model | • Unclear whether the model is a black box or interpretable • No access to the resulting source code, trained models & data • Impractical execution time | • A supposedly interpretable model shows no explainable behavior • Methods cannot be cross-compared or reproduced, and data credibility cannot be checked • The model takes too long to produce results | • Describe the choice of black box or interpretable model; if interpretable, show examples of interpretable output. • Release documented source code, trained models and software containers. • Report execution time averaged across repeats; if computationally demanding, compare to similar methods. |
| Evaluation | • Inadequate performance measures • No comparisons to baselines or other methods • Highly variable performance | • Biased performance measures reported • The method is falsely claimed as state of the art • Unpredictable performance in production | • Compare with public methods & simple models (baselines). • Adopt community-validated measures and benchmark datasets for evaluation. • Compare related methods and alternatives on the same dataset. • Evaluate performance on a final independent held-out set. • Use confidence/error intervals and statistical tests to gauge robustness (see the code sketch below for a minimal illustration). |
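Several of the recommendations in the table, namely independent train/test splits with a fixed seed, cross-validation restricted to the training data, comparison against a trivial baseline, and confidence intervals on the reported score, can be illustrated with a short sketch. The example below is a minimal illustration on synthetic data using scikit-learn; the dataset, model, and metric are placeholders rather than the setup of any particular study.

```python
# Minimal sketch of several DOME recommendations using scikit-learn on
# synthetic data; dataset, model and metric choices are placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import cross_val_score, train_test_split

# Data: fix the random seed and keep the exact split so it can be released.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Optimization: cross-validate on the training set only, with all
# hyperparameters stated explicitly.
model = RandomForestClassifier(n_estimators=200, max_depth=None, random_state=0)
cv_scores = cross_val_score(
    model, X_train, y_train, cv=5, scoring=make_scorer(matthews_corrcoef)
)
print(f"5-fold CV MCC: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

# Evaluation: compare against a trivial baseline on the held-out test set.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Baseline MCC:", matthews_corrcoef(y_test, baseline.predict(X_test)))
print("Model MCC:   ", matthews_corrcoef(y_test, y_pred))

# Robustness: a simple bootstrap confidence interval for the test-set MCC.
rng = np.random.default_rng(0)
boot_scores = []
for _ in range(1000):
    idx = rng.integers(0, len(y_test), len(y_test))
    boot_scores.append(matthews_corrcoef(y_test[idx], y_pred[idx]))
low, high = np.percentile(boot_scores, [2.5, 97.5])
print(f"95% bootstrap CI for MCC: [{low:.3f}, {high:.3f}]")
```

In a real study, the exact split indices would also be saved and released alongside the data, and the final held-out evaluation would be run only once, after all model and hyperparameter choices are frozen.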
The recommendations mentioned above were initially developed by the ELIXIR Machine Learning Focus Group in response to a published Comment advocating for the establishment of standards in ML for biology. This focus group, comprising over 50 experts in the field of ML, held meetings to collaboratively develop and refine the recommendations through broad consensus.
In the following chapters, the MobiDB-lite publication [2] will be used as a running example.
1. Ian Walsh, Dmytro Fishman, Dario Garcia-Gasulla, Tiina Titma, Gianluca Pollastri, Emidio Capriotti, Rita Casadio, Salvador Capella-Gutierrez, Davide Cirillo, Alessio Del Conte, Alexandros C. Dimopoulos, Victoria Dominguez Del Angel, Joaquin Dopazo, Piero Fariselli, José Maria Fernández, Florian Huber, Anna Kreshuk, Tom Lenaerts, Pier Luigi Martelli, Arcadi Navarro, Pilib Ó Broin, Janet Piñero, Damiano Piovesan, Martin Reczko, Francesco Ronzano, Venkata Satagopam, Castrense Savojardo, Vojtech Spiwok, Marco Antonio Tangaro, Giacomo Tartari, David Salgado, Alfonso Valencia, Federico Zambelli, Jennifer Harrow, Fotis E. Psomopoulos, Silvio C. E. Tosatto, and ELIXIR Machine Learning Focus Group. DOME: recommendations for supervised machine learning validation in biology. Nature Methods, 18(10):1122–1127, 2021. https://doi.org/10.1038/s41592-021-01205-4
2. Marco Necci, Damiano Piovesan, Zsuzsanna Dosztányi, and Silvio C. E. Tosatto. MobiDB-lite: fast and highly specific consensus prediction of intrinsic disorder in proteins. Bioinformatics, 33(9):1402–1404, 2017. https://doi.org/10.1093/bioinformatics/btx015