Publications
Book
2018
- S. Bezjak, A. Clyburne-Sherin, P. Conzett, P. Fernandes, E. Görögh, K. Helbig, B. Kramer, I. Labastida, K. Niemeyer, F. Psomopoulos, T. Ross-Hellauer, R. Schneider, J. Tennant, E. Verbakel, H. Brinken, and L. Heller, Open Science Training Handbook. Zenodo, 2018.
A group of fourteen authors came together in February 2018 at the TIB (German National Library of Science and Technology) in Hannover to create an open, living handbook on Open Science training. High-quality training is fundamental when aiming at a cultural change towards the implementation of Open Science principles, and teaching resources provide great support for Open Science instructors and trainers. The Open Science Training Handbook will be a key resource and a first step towards developing Open Access and Open Science curricula and andragogies. Supporting and connecting an emerging Open Science community that wishes to pass on its knowledge as multipliers, the handbook will enrich training activities and unlock the community’s full potential.
Book Chapter
2023
- F. Psomopoulos, C. Goble, L. J. Castro, J. Harrow, and S. C. E. Tosatto, “A Roadmap for Defining Machine Learning Standards in Life Sciences,” in Artificial Intelligence for Science, WORLD SCIENTIFIC, 2023, pp. 399–410.
Machine Learning (ML) is becoming increasingly popular in Life Sciences as an efficient mechanism to extract knowledge and new insights from the vast amounts of data that are constantly generated. In this context, ML is also becoming prevalent in disciplines that, traditionally, have not dealt consistently with Artificial Intelligence (AI) and ML. This increase in ML applications for Life Sciences highlights additional key challenges, beyond the traditional ones in the field of ML (such as efficiency, interpretability, federation, etc.), notably: reproducibility; benchmarking; fairness; and standardization. In this work we provide an overview of why these challenges are particularly relevant in Life Sciences, while connecting them to some emerging solutions. Ranging from practical standards, such as the DOME recommendations for reporting Machine Learning in Biology, to an alignment to the FAIR principles, all these solutions emerge as outputs of the scientific community rather than of a single individual or institution. This community dimension is the key component looking ahead: for ML models to become trusted by all, community-backed standards need to be established, and scientists must prioritize openness in all aspects of the process itself.
2022
- C. Galigalidou, L. Zaragoza-Infante, A. Chatzidimitriou, K. Stamatopoulos, F. Psomopoulos, and A. Agathangelidis, “Purpose-Built Immunoinformatics for BcR IG/TR Repertoire Data Analysis,” in Immunogenetics: Methods and Protocols, A. W. Langerak, Ed. New York, NY: Springer US, 2022, pp. 585–603.
The study of antigen receptor gene repertoires using next-generation sequencing (NGS) technologies has disclosed an unprecedented depth of complexity, requiring novel computational and analytical solutions. Several bioinformatics workflows have been developed to this end, including the T-cell receptor/immunoglobulin profiler (TRIP), a web application implemented in R Shiny, specifically designed for the purposes of comprehensive repertoire analysis, which is the focus of this chapter. TRIP has the potential to perform robust immunoprofiling analysis through the extraction and processing of the IMGT/HighV-QUEST output, via a series of functions, ensuring the analysis of high-quality, biologically relevant data through a multilevel process of data filtering. Subsequently, it provides in-depth analysis of antigen receptor gene rearrangements, including (a) clonality assessment; (b) extraction of variable (V), diversity (D), and joining (J) gene repertoires; (c) CDR3 characterization at both the nucleotide and amino acid level; and (d) somatic hypermutation analysis, in the case of immunoglobulin gene rearrangements. Notably, TRIP enables a high level of customization through the integration of various options in key aspects of the analysis, such as clonotype definition and computation, hence allowing for flexibility without compromising on accuracy.
2019
- A. Agathangelidis, F. Psomopoulos, and K. Stamatopoulos, “Stereotyped B Cell Receptor Immunoglobulins in B Cell Lymphomas,” in Lymphoma: Methods and Protocols, R. Küppers, Ed. New York, NY: Springer New York, 2019, pp. 139–155.
Comprehensive analysis of the clonotypic B cell receptor immunoglobulin (BcR IG) gene rearrangement sequences in patients with mature B cell neoplasms has led to the identification of significant repertoire restrictions, culminating in the discovery of subsets of patients expressing highly similar, stereotyped BcR IG. This finding strongly supports selection by common epitopes or classes of structurally similar epitopes in the ontogeny of these tumors. BcR IG stereotypy was initially described in chronic lymphocytic leukemia (CLL), where the stereotyped fraction of the disease accounts for a remarkable one-third of patients. However, subsequent studies showed that stereotyped BcR IG are also present in other neoplasms of mature B cells, including mantle cell lymphoma (MCL) and splenic marginal zone lymphoma (SMZL). Subsequent cross-entity comparisons led to the conclusion that stereotyped IG are mostly “disease-specific,” implicating distinct immunopathogenetic processes. Interestingly, mounting evidence suggests that a molecular subclassification of lymphomas based on BcR IG stereotypy is biologically and clinically relevant. Indeed, particularly in CLL, patients assigned to the same subset due to expressing a particular stereotyped BcR IG display remarkably consistent biological background and clinical course, at least for major and well-studied subsets. Thus, the robust assignment to stereotyped subsets may assist in the identification of mechanisms underlying disease onset and progression, while also refining risk stratification. In this book chapter, we provide an overview of the recent BcR IG stereotypy studies in mature B cell malignancies and outline previous and current methodological approaches used for the identification of stereotyped IG.
Journals
2025
- G. I. Gavriilidis, V. Vasileiou, S. Dimitsaki, G. Karakatsoulis, A. Giannakakis, G. A. Pavlopoulos, and F. Psomopoulos, “APNet, an explainable sparse deep learning model to discover differentially active drivers of severe COVID-19,” Bioinformatics, vol. 41, no. 3, Mar. 2025, doi: 10.1093/bioinformatics/btaf063.
MOTIVATION: Computational analyses of bulk and single-cell omics provide translational insights into complex diseases, such as COVID-19, by revealing molecules, cellular phenotypes, and signalling patterns that contribute to unfavourable clinical outcomes. Current in silico approaches dovetail differential abundance, biostatistics, and machine learning, but often overlook nonlinear proteomic dynamics, like post-translational modifications, and provide limited biological interpretability beyond feature ranking. RESULTS: We introduce APNet, a novel computational pipeline that combines differential activity analysis based on SJARACNe co-expression networks with PASNet, a biologically informed sparse deep learning model, to perform explainable predictions for COVID-19 severity. The APNet driver-pathway network ingests SJARACNe co-regulation and classification weights to aid result interpretation and hypothesis generation. APNet outperforms alternative models in patient classification across three COVID-19 proteomic datasets, identifying predictive drivers and pathways, including some confirmed in single-cell omics and highlighting under-explored biomarker circuitries in COVID-19. AVAILABILITY AND IMPLEMENTATION: APNet’s R and Python scripts and Cytoscape methodologies are available at https://github.com/BiodataAnalysisGroup/APNet.
- F. Psomopoulos et al., “Toward a unified approach: Considerations for bioinformatic and sequencing activities & data in wastewater surveillance of biologic public health threats,” Open Research Europe, vol. 5, p. 267, Sep. 2025, doi: 10.12688/openreseurope.20934.1.
Genomic technologies like PCR and next-generation sequencing (NGS) have greatly advanced public health surveillance, especially during COVID-19, by enabling detailed tracking of pathogen spread, origins, and variants. While PCR is vital for targeted detection, falling NGS costs have made large-scale, high-throughput sequencing more feasible, supporting broader pathogen monitoring, including the detection of vaccine escape variants and new strains. NGS applied to wastewater offers valuable population-level insights but faces challenges such as variable sample complexity, the need for skilled staff, suitable platforms, and robust IT infrastructure. Although there are currently many efforts to define guidelines for sampling, analysis, and integrating wastewater data into public health policy, such as the recently published International Cookbook for Wastewater Practitioners, they often lack universal applicability, emphasising analytical approaches over NGS-based ones. However, standardising protocols for sampling, sequencing, and analysis is crucial to ensure reliable, comparable data across surveillance systems worldwide. Pilot studies and continuous refinement are recommended to overcome implementation hurdles and fully realise the benefits of NGS in wastewater surveillance. This work attempts to outline these challenges and opportunities across the entire wastewater surveillance workflow, from data generation to reporting, and to provide some concrete suggestions and considerations across the spectrum of activities.
- E. Aßmann, T. Greiner, H. Richard, M. Wade, S. Agrawal, F. Amman, S. Böttcher, S. Lackner, M. Landthaler, S. Mangul, V. Munteanu, F. Psomopoulos, M. Smith, M. Trofimova, A. Ullrich, M. von Kleist, E. Wyler, M. Hölzer, and C. Irrgang, “Augmentation of wastewater-based epidemiology with machine learning to support global health surveillance,” Nature Water, vol. 3, no. 7, pp. 753–763, Jul. 2025, doi: 10.1038/s44221-025-00444-5.
Wastewater-based epidemiology (WBE) has proven to be a valuable tool for monitoring the evolution and spread of global health threats, from pathogens to antimicrobial resistances. Throughout the COVID-19 pandemic, multiple wastewater surveillance programmes have advanced statistical and machine learning methods for detecting pathogens from wastewater sequencing data and correlating measured targets with the represented population to infer meaningful conclusions for public health. Integrating contextual data can account for measurement uncertainties across the WBE workflow that affect the reliability of analyses. However, the broader availability and harmonization of data are major obstacles to method development. Here we review the benefits and limitations of wastewater-related data streams, highlighting the potential of machine learning to leverage these streams for normalization and other WBE applications. We emphasize the relevance of developing global frameworks for integrating WBE with other health surveillance systems and discuss next steps to address current and foreseeable challenges for robust and interpretable machine learning-enhanced WBE.
- K. B. Shiferaw, I. Balaur, G. Collins, C. Sharma, L. J. Castro, F. Psomopoulos, D. Garijo, R. Henkel, D. Waltemath, and A. A. Zeleke, “Calibrating CONSORT-AI with FAIR Principles to enhance reproducibility in AI-driven clinical trials,” medRxiv, 2025, doi: 10.1101/2025.07.07.25330987.
Artificial intelligence (AI) is increasingly embedded in clinical trials, yet poor reproducibility remains a critical barrier to trustworthy and transparent research. In this study, we propose a structured calibration of the CONSORT-AI reporting guideline using the FAIR (Findable, Accessible, Interoperable, Reusable) principles. We introduce the application of CALIFRAME, a framework designed to evaluate and align existing medical AI reporting standards with FAIR-compliant practices. Applying CALIFRAME to the CONSORT-AI checklist reveals specific gaps in data and code sharing, metadata use, and accessibility practices in current AI-driven clinical trials. Our results underscore the need for standardized metadata, clear licensing, and stakeholder-inclusive design in medical AI reporting. We demonstrate that FAIR-oriented calibration of reporting guidelines can bridge the reproducibility gap and support more transparent, efficient, and reusable AI interventions in healthcare. This work advocates for a shift toward reproducibility as a foundation for trustworthy AI in clinical research.
- F. A. Baltoumas, E. Karatzas, N. K. Venetsianou, E. Aplakidou, K. Giatras, M. N. Chasapi, I. N. Chasapi, I. Iliopoulos, V. A. Iconomidou, I. P. Trougakos, F. Psomopoulos, A. Giannakakis, I. Georgakopoulos-Soares, P. Kontou, P. G. Bagos, and G. A. Pavlopoulos, “Darling (v2.0): Mining disease-related databases for the detection of biomedical entity associations,” Computational and Structural Biotechnology Journal, vol. 27, pp. 2626–2637, 2025, doi: 10.1016/j.csbj.2025.06.025.
Darling is a web application that employs literature mining to detect disease-related biomedical entity associations. Darling can detect sentence-based co-occurrences of biomedical entities such as genes, proteins, chemicals, functions, tissues, diseases, environments, and phenotypes from biomedical literature found in six disease-centric databases. In this version, we deploy additional query channels focusing on COVID-19, GWAS studies, cardiovascular, neurodegenerative, and cancer diseases. Compared to its predecessor, users now have extended query options including searches with PubMed identifiers, disease records, entity names, titles, single nucleotide polymorphisms, or the Entrez syntax. Furthermore, after applying named entity recognition, one can retrieve and mine the relevant literature from recognized terms for a free input text. Term associations are captured in customizable networks which can be further filtered by either term or co-occurrence frequency and visualized in 2D as weighted graphs or in 3D as multi-layered networks. The fetched terms are organized in searchable tables and clustered annotated documents. The reported genes can be further analyzed for functional enrichment using external applications called from within Darling. The Darling databases, including terms and their associations, are updated annually. Darling is available at: https://www.darling-miner.org/.
- “ELIXIR Training Platform,” May 2025, doi: 10.7490/f1000research.1120171.1.
The ELIXIR Training Platform (TrP) is a key infrastructure of Europe’s bioinformatics training landscape, aiming to strengthen national training programmes and to grow bioinformatics training capacity and competence across Europe and beyond. Over the past decade, the TrP has actively collaborated with partners and experts to establish best practices, tools, and standards, resulting in consistent improvement in training quality and capacity across ELIXIR members. In its current 2024-2026 work programme, the TrP aims to consolidate these resources and achievements, intensify support for stakeholders, and disseminate these resources more effectively. This initiative offers strategic advantages by aligning training strategies across ELIXIR Communities, fostering cohesion and synergy to optimise training efforts and advance bioinformatics education and research across Europe. Central to this is SPLASH, a new digital hub built around the training lifecycle that embraces the whole ELIXIR training ecosystem. It guides training stakeholders through planning, designing, delivering, and evaluating training, all essential elements for fortifying a robust training strategy. It will showcase training resources and projects in ELIXIR, such as:
- Training eSupport System (TeSS) portal, disseminating training events and materials.
- ELIXIR-GOBLET Train-the-Trainer programme, building capacity in training skills.
- Learning Paths, spearheading the development of structured learning programmes.
- Training Metrics Database (TMD), providing training impact metrics.
- FAIR Training, spearheading the implementation of FAIR principles in training.
- Training Certification, establishing a certification process for training.
- ELIXIR Training Lesson Template, ELIXIR’s template for authoring and publishing lessons.
- E-learning, providing best practices for e-learning.
- ELIXIR-SI eLearning Platform (EeLP), providing e-learning management systems.
2024
- O. A. Attafi et al., “DOME Registry: implementing community-wide recommendations for reporting supervised machine learning in biology,” GigaScience, vol. 13, p. giae094, Dec. 2024, doi: 10.1093/gigascience/giae094.
Supervised machine learning (ML) is used extensively in biology and deserves closer scrutiny. The Data, Optimization, Model, Evaluation (DOME) recommendations aim to enhance the validation and reproducibility of ML research by establishing standards for key aspects such as data handling and processing, optimization, evaluation, and model interpretability. The recommendations help to ensure that key details are reported transparently by providing a structured set of questions. Here, we introduce the DOME registry (URL: registry.dome-ml.org), a database that allows scientists to manage and access comprehensive DOME-related information on published ML studies. The registry uses external resources like ORCID, APICURON, and the Data Stewardship Wizard to streamline the annotation process and ensure comprehensive documentation. By assigning unique identifiers and DOME scores to publications, the registry fosters a standardized evaluation of ML methods. Future plans include continuing to grow the registry through community curation, improving the DOME score definition, encouraging publishers to adopt DOME standards, and promoting transparency and reproducibility of ML in the life sciences.
- S.-C. Fragkouli, D. Solanki, L. Castro, F. Psomopoulos, N. Queralt-Rosinach, D. Cirillo, and L. Crossman, “Synthetic data: how could it be used in infectious disease research?,” Future Microbiology, vol. 0, no. 0, pp. 1–6, 2024, doi: 10.1080/17460913.2024.2400853.
Over the last 3–5 years, it has become possible to generate machine learning (ML) synthetic data (SD) for healthcare-related uses. However, concerns have been raised about potential negative factors associated with the possibilities of artificial dataset generation. These include the potential misuse of generative artificial intelligence (AI) in fields such as cybercrime, the use of deepfakes and fake news to deceive or manipulate, and displacement of human jobs across various market sectors. Here we consider both current and future positive advances and possibilities with synthetic datasets. SD offers several advantages in research and ML. First, it enhances privacy by creating datasets that avoid direct identification of individuals while preserving the statistical characteristics of the original data. Second, it improves representativity by enabling the creation of datasets that better reflect broader populations, thus addressing potential biases in real-world data. Finally, SD serves as a valuable tool for data augmentation by expanding existing datasets with additional examples that can boost the performance and robustness of ML models. Generative AI (GenAI) is a class of artificial intelligence capable of creating text, images, video, or other data using generative models. The recent explosion of interest in GenAI was heralded by the invention and rapid adoption of large language models (LLMs), computational models able to achieve general-purpose language generation and other natural language processing tasks. LLMs are based on transformer architectures, which improved on previous neural network architectures and were put forward by Vaswani et al., marking an evolutionary leap from recurrent neural networks. Fuelled by the advent of improved GenAI techniques and wide-scale usage, this is surely the time to consider how SD can be used to advance infectious disease research.
In this commentary, we aim to create an overview of the current and future position of SD in infectious disease research.
- N. Pechlivanis, G. Karakatsoulis, K. Kyritsis, M. Tsagiopoulou, S. Sgardelis, I. Kappas, and F. Psomopoulos, “Microbial co-occurrence network demonstrates spatial and climatic trends for global soil diversity,” Scientific Data, vol. 11, no. 1, p. 672, 2024, doi: 10.1038/s41597-024-03528-1.
Despite recent research efforts to explore the co-occurrence patterns of diverse microbes within soil microbial communities, a substantial knowledge gap persists regarding global climate influences on soil microbiota behaviour. Comprehending co-occurrence patterns within distinct geoclimatic groups is pivotal for unravelling the ecological structure of microbial communities, which are crucial for preserving ecosystem functions and services. Our study addresses this gap by examining global climatic patterns of microbial diversity. Using data from the Earth Microbiome Project, we analyse a meta-community co-occurrence network for bacterial communities. This method unveils substantial shifts in topological features, highlighting regional and climatic trends. Arid, Polar, and Tropical zones show lower diversity but maintain denser networks, whereas Temperate and Cold zones display higher diversity alongside more modular networks. Furthermore, it identifies significant co-occurrence patterns across diverse climatic regions. Central taxa associated with different climates are pinpointed, highlighting climate’s pivotal role in community structure. In conclusion, our study identifies significant correlations between microbial interactions in diverse climatic regions, contributing valuable insights into the intricate dynamics of soil microbiota.
- S. G. Sutcliffe et al., “Tracking SARS-CoV-2 variants of concern in wastewater: an assessment of nine computational tools using simulated genomic data,” Microbial Genomics, vol. 10, no. 5, 2024, doi: 10.1099/mgen.0.001249.
Wastewater-based surveillance (WBS) is an important epidemiological and public health tool for tracking pathogens across the scale of a building, neighbourhood, city, or region. WBS gained widespread adoption globally during the SARS-CoV-2 pandemic for estimating community infection levels by qPCR. Sequencing pathogen genes or genomes from wastewater adds information about pathogen genetic diversity, which can be used to identify viral lineages (including variants of concern) that are circulating in a local population. Capturing the genetic diversity by WBS sequencing is not trivial, as wastewater samples often contain a diverse mixture of viral lineages with real mutations and sequencing errors, which must be deconvoluted computationally from short sequencing reads. In this study we assess nine different computational tools that have recently been developed to address this challenge. We simulated 100 wastewater sequence samples consisting of SARS-CoV-2 BA.1, BA.2, and Delta lineages, in various mixtures, as well as a Delta–Omicron recombinant and a synthetic ‘novel’ lineage. Most tools performed well in identifying the true lineages present and estimating their relative abundances and were generally robust to variation in sequencing depth and read length. While many tools identified lineages present down to 1 % frequency, results were more reliable above a 5 % threshold. The presence of an unknown synthetic lineage, which represents an unclassified SARS-CoV-2 lineage, increases the error in relative abundance estimates of other lineages, but the magnitude of this effect was small for most tools. The tools also varied in how they labelled novel synthetic lineages and recombinants. While our simulated dataset represents just one of many possible use cases for these methods, we hope it helps users understand potential sources of error or bias in wastewater sequencing analysis and to appreciate the commonalities and differences across methods.
- G. I. Gavriilidis, V. Vasileiou, A. Orfanou, N. Ishaque, and F. Psomopoulos, “A mini-review on perturbation modelling across single-cell omic modalities,” Computational and Structural Biotechnology Journal, vol. 23, pp. 1886–1896, Dec. 2024, doi: 10.1016/j.csbj.2024.04.058.
Recent advances in single-cell omics technology have transformed the landscape of cellular and molecular research, enriching the scope and intricacy of cellular characterisation. Perturbation modelling seeks to comprehensively grasp the effects of external influences, like disease onset, molecular knock-outs, or external stimulants, on cellular physiology, specifically on transcription factors, signal transducers, biological pathways, and dynamic cell states. Machine and deep learning tools transform complex perturbational phenomena into algorithmically tractable tasks to formulate predictions based on various types of single-cell datasets. However, the recent surge in tools and datasets makes it challenging for experimental biologists and computational scientists to keep track of the recent advances in this rapidly expanding field of single-cell modelling. Here, we recapitulate the main objectives of perturbation modelling and summarise novel single-cell perturbation technologies based on genetic manipulation like CRISPR or compounds, spanning across omic modalities. We then concisely review a burgeoning group of computational methods extending from classical statistical inference methodologies to various machine and deep learning architectures like shallow models or autoencoders, to biologically informed approaches based on gene regulatory networks, and to combinatorial efforts reminiscent of ensemble learning. We also discuss the rising trend of large foundational models in single-cell perturbation modelling inspired by large language models. Lastly, we critically assess the challenges that underlie single-cell perturbation modelling while pointing towards relevant future perspectives like perturbation atlases, multi-omics and spatial datasets, causal machine learning for interpretability, multi-task learning for performance and explainability, as well as prospects for solving interoperability and benchmarking pitfalls.
- V. Makarov, C. Chabbert, E. Koletou, F. Psomopoulos, N. Kurbatova, S. Ramirez, C. Nelson, P. Natarajan, and B. Neupane, “Good machine learning practices: Learnings from the modern pharmaceutical discovery enterprise,” Computers in Biology and Medicine, vol. 177, p. 108632, Jul. 2024, doi: 10.1016/j.compbiomed.2024.108632.
Machine Learning (ML) and Artificial Intelligence (AI) have become an integral part of the drug discovery and development value chain. Many teams in the pharmaceutical industry nevertheless report challenges associated with the timely, cost-effective, and meaningful delivery of ML- and AI-powered solutions for their scientists. We sought to better understand what these challenges were and how to overcome them by performing an industry-wide assessment of practices in AI and Machine Learning. Here we report results of a systematic business analysis of the personas in the modern pharmaceutical discovery enterprise in relation to their work with AI and ML technologies. We identify 23 common business problems that individuals in these roles face when they encounter AI and ML technologies at work, and describe best practices (Good Machine Learning Practices) that address these issues.
- F. Psomopoulos, E. Capriotti, N. Queralt-Rosinach, L. Jael Castro, and S. Tosatto, “Current activities of the ELIXIR Machine Learning Focus Group,” Sep. 2024, doi: 10.7490/f1000research.1119845.1.
Motivations: Machine Learning (ML) has emerged as a discipline that enables computers to assist humans in making sense of large and complex data sets. With the drop in the cost of high-throughput technologies, large amounts of omics data are being generated and made accessible to researchers. Analyzing these complex, high-volume data is not trivial, and the use of classical statistics cannot explore their full potential. Machine Learning can thus be very useful in mining large omics datasets to uncover new insights that can consequently lead to the advancement of Life Sciences. Aims: The main aim of the ELIXIR Machine Learning Focus Group (MLFG) is to increase the reproducibility and transparency of ML methods for researchers, journal reviewers, and the wider community. One of the most pressing issues is to agree on a standardized data structure to describe the most relevant features of ML studies being published. The development of a standardized reporting model has the potential to make a major impact in increasing the quality of ML-based publications. Results: Our focus group was initiated in October 2019 to capture the emerging need for Machine Learning expertise across the network. The main result so far is the DOME recommendations, a set of community-wide recommendations for reporting supervised machine learning-based analyses applied to biological studies. Tasks: Currently the ELIXIR Machine Learning Focus Group is working on three main tasks: (1) using the DOME recommendations to annotate relevant literature in order to gain insights into the level of adherence to DOME; (2) evaluating the gold-standard datasets widely used in ML in order to define and describe the aspects of a gold standard, with particular focus on human data; and (3) reviewing the efforts around synthetic data in order to establish a set of best practices for their use and application in ML.
- M. van Baardwijk, G. Gavriilidis, N. Ishaque, O. Lazareva, V. Vasileiou, A. Orfanou, A. Stubbs, O. Stegle, and F. Psomopoulos, “Multitask perturbation modeling for single-cell omics,” Aug. 2024, doi: 10.7490/f1000research.1119837.1.
New paradigms of single-cell, spatial, and multi-modal omics have impacted the way and the resolution in which cellular and molecular research (CMR) is conducted. While these technologies increase the phenotypic depth and breadth of cellular characterization, there is also a need to model dynamic behaviour in these systems, such as predicting the effect of perturbations. However, these innovations also complicate data processing and analysis and require expertise across multiple domains. This workshop builds on the Mongoose ELIXIR Staff Exchange project (Multi-Objective Network Generator Of Optimized Single-cell Experiments) (INAB | CERTH & CHARITE & Erasmus MC), which investigates early integration of multi-modal omics data through multi-task learning (i.e., computations to exploit shared feature relationships between related tasks and learn better regularized latent representations, e.g., transcriptomics with imaging, or proteomics with epigenetics) together with perturbation modeling (e.g., cells treated with drugs or CRISPR knockouts) at the single-cell level. Thus far, we have seen the application of advanced deep learning strategies for multi-task learning (e.g., UnitedNet, scMTNI) and perturbation modeling (e.g., scGEN, CPA); still, the combination of both is currently lacking. Importantly, explainable AI will be paramount in understanding how these DL models can capture biological dynamics across different data modalities. Given the complex, multi-faceted nature of this proposal, there is a need for dynamic participant interaction across various domains. This workshop will bring together experts across Communities (e.g., Single-Cell Omics, Systems Biology, Proteomics), Platforms (Compute), and Focus Groups (e.g., Machine Learning, Cancer Data).
In the workshop, we will introduce the principles underpinning the Mongoose project (multi-task learning and perturbation modeling) and engage dynamically with participants to identify opportunities to exploit these technologies across domains. Considering in particular the ELIXIR Cellular and Molecular Research Priority Area of the new 2024-2028 work programme, whose focus is on standardisation and best practices in multi-modal methods and knowledge representation in molecular structure, imaging and multi-omics technologies, as well as on new services for cellular and molecular biology towards a semantically interoperable and complementary network of FAIR data and tools, a key outcome of this workshop will be a list of actionable activities that could be tied directly into future CMR efforts.
2023
- I. Gkekas, A.-C. Vagiona, N. Pechlivanis, G. Kastrinaki, K. Pliatsika, S. Iben, K. Xanthopoulos, F. E. Psomopoulos, M. A. Andrade-Navarro, and S. Petrakis, “Intranuclear inclusions of polyQ-expanded ATXN1 sequester RNA molecules,” Frontiers in Molecular Neuroscience, vol. 16, Dec. 2023, doi: 10.3389/fnmol.2023.1280546.
Spinocerebellar ataxia type 1 (SCA1) is an autosomal dominant neurodegenerative disease caused by a trinucleotide (CAG) repeat expansion in the ATXN1 gene. It is characterized by the presence of polyglutamine (polyQ) intranuclear inclusion bodies (IIBs) within affected neurons. In order to investigate the impact of polyQ IIBs in SCA1 pathogenesis, we generated a novel protein aggregation model by inducible overexpression of the mutant ATXN1(Q82) isoform in human neuroblastoma SH-SY5Y cells. Moreover, we developed a simple and reproducible protocol for the efficient isolation of insoluble IIBs. Biophysical characterization showed that polyQ IIBs are enriched in RNA molecules which were further identified by next-generation sequencing. Finally, a protein interaction network analysis indicated that sequestration of essential RNA transcripts within ATXN1(Q82) IIBs may affect the ribosome resulting in error-prone protein synthesis and global proteome instability. These findings provide novel insights into the molecular pathogenesis of SCA1, highlighting the role of polyQ IIBs and their impact on critical cellular processes.
- E. Sofou, G. Gkoliou, N. Pechlivanis, K. Pasentsis, K. Chatzistamatiou, F. Psomopoulos, T. Agorastos, and K. Stamatopoulos, “High risk HPV-positive women cervicovaginal microbial profiles in a Greek cohort: a retrospective analysis of the GRECOSELF study,” Frontiers in Microbiology, vol. 14, Nov. 2023, doi: 10.3389/fmicb.2023.1292230.
Increasing evidence supports a role for the vaginal microbiome (VM) in the severity of HPV infection and its potential link to cervical intraepithelial neoplasia. However, much remains unclear regarding the precise role of certain bacteria in the context of HPV positivity and persistence of infection. Here, using next generation sequencing (NGS), we comprehensively profiled the VM in a series of 877 women who tested positive for at least one high risk HPV (hrHPV) type with the COBAS® 4800 assay, after self-collection of a cervico-vaginal sample. Starting from gDNA, we PCR amplified the V3–V4 region of the bacterial 16S rRNA gene and applied a paired-end NGS protocol (Illumina). We report significant differences in the abundance of certain bacteria compared among different HPV types, particularly concerning species assigned to the Lacticaseibacillus, Megasphaera and Sneathia genera. Especially for Lacticaseibacillus, we observed significant depletion in the case of HPV16 and HPV18 versus hrHPVother. Overall, our results suggest that the presence or absence of specific cervicovaginal microbial genera may be linked to the observed severity in hrHPV infection, particularly in the case of the HPV16 and 18 types.
- K. A. Kyritsis, N. Pechlivanis, and F. Psomopoulos, “Software pipelines for RNA-Seq, ChIP-Seq and germline variant calling analyses in common workflow language (CWL),” Frontiers in Bioinformatics, vol. 3, Nov. 2023, doi: 10.3389/fbinf.2023.1275593.
Background: Automating data analysis pipelines is a key requirement to ensure reproducibility of results, especially when dealing with large volumes of data. Here we assembled automated pipelines for the analysis of High-throughput Sequencing (HTS) data originating from RNA-Seq, ChIP-Seq and Germline variant calling experiments. We implemented these workflows in the Common Workflow Language (CWL) and evaluated their performance by: i) reproducing the results of two previously published studies on Chronic Lymphocytic Leukemia (CLL), and ii) analyzing whole genome sequencing data from four Genome in a Bottle Consortium (GIAB) samples, comparing the detected variants against their respective gold standard truth sets. Findings: We demonstrated that the CWL-implemented workflows achieved high accuracy in reproducing previously published results, discovering significant biomarkers and detecting germline SNP and small INDEL variants. Conclusion: CWL pipelines are characterized by reproducibility and reusability; combined with containerization, they provide the ability to overcome issues of software incompatibility and laborious configuration requirements. In addition, they are flexible and can be used immediately or adapted to the specific needs of an experiment or study. The CWL-based workflows developed in this study, along with version information for all software tools, are publicly available on GitHub (https://github.com/BiodataAnalysisGroup/CWL_HTS_pipelines) under the MIT License. They are suitable for the analysis of short-read (such as Illumina-based) data and constitute an open resource that can facilitate automation, reproducibility and cross-platform compatibility for standard bioinformatic analyses.
- A. Iatrou et al., “N-Glycosylation of the Ig Receptors Shapes the Antigen Reactivity in Chronic Lymphocytic Leukemia Subset #201,” The Journal of Immunology, vol. 211, no. 5, pp. 743–754, Jul. 2023, doi: 10.4049/jimmunol.2300330.
Subset #201 is a clinically indolent subgroup of patients with chronic lymphocytic leukemia defined by the expression of stereotyped, mutated IGHV4-34/IGLV1-44 BCR Ig. Subset #201 is characterized by recurrent somatic hypermutations (SHMs) that frequently lead to the creation and/or disruption of N-glycosylation sites within the Ig H and L chain variable domains. To understand the relevance of this observation, using next-generation sequencing, we studied how SHM shapes the subclonal architecture of the BCR Ig repertoire in subset #201, particularly focusing on changes in N-glycosylation sites. Moreover, we profiled the Ag reactivity of the clonotypic BCR Ig expressed as rmAbs. We found that almost all analyzed cases from subset #201 carry SHMs potentially affecting N-glycosylation at the clonal and/or subclonal level and obtained evidence for N-glycan occupancy in SHM-induced novel N-glycosylation sites. These particular SHMs impact (auto)antigen recognition, as indicated by differences in Ag reactivity between the authentic rmAbs and germline revertants of SHMs introducing novel N-glycosylation sites in experiments entailing 1) flow cytometry for binding to viable cells, 2) immunohistochemistry against various human tissues, 3) ELISA against microbial Ags, and 4) protein microarrays testing reactivity against multiple autoantigens. On these grounds, N-glycosylation appears as relevant for the natural history of at least a fraction of Ig-mutated chronic lymphocytic leukemia. Moreover, subset #201 emerges as a paradigmatic case for the role of affinity maturation in the evolution of Ag reactivity of the clonotypic BCR Ig.
- E. A. Huerta et al., “FAIR for AI: An interdisciplinary and international community building perspective,” Scientific Data, vol. 10, no. 1, Jul. 2023, doi: 10.1038/s41597-023-02298-6.
A foundational set of findable, accessible, interoperable, and reusable (FAIR) principles were proposed in 2016 as prerequisites for proper data management and stewardship, with the goal of enabling the reusability of scholarly data. The principles were also meant to apply to other digital assets, at a high level, and over time, the FAIR guiding principles have been re-interpreted or extended to include the software, tools, algorithms, and workflows that produce data. FAIR principles are now being adapted in the context of AI models and datasets. Here, we present the perspectives, vision, and experiences of researchers from different countries, disciplines, and backgrounds who are leading the definition and adoption of FAIR principles in their communities of practice, and discuss outcomes that may result from pursuing and incentivizing FAIR AI research. The material for this report builds on the FAIR for AI Workshop held at Argonne National Laboratory on June 7, 2022.
- E. Sofou, E. Vlachonikola, L. Zaragoza-Infante, M. Brüggemann, N. Darzentas, P. J. T. A. Groenen, M. Hummel, E. A. Macintyre, F. Psomopoulos, F. Davi, A. W. Langerak, and K. Stamatopoulos, “Clonotype definitions for immunogenetic studies: proposals from the EuroClonality NGS Working Group,” Leukemia, vol. 37, no. 8, pp. 1750–1752, Jun. 2023, doi: 10.1038/s41375-023-01952-7.
Comprehensive study of immunoglobulin (IG) and T cell receptor (TR) gene rearrangements has proven instrumental for understanding immune responses in health and disease, while also offering information with direct clinical utility, e.g., for minimal residual disease detection or clonality assessment in patients with lymphoid malignancies [1]. However, immunogenetic analysis entails descriptive definitions that are often arbitrary. A particular challenge is posed by the term “clonotype”, generally referring to a unique antigen receptor gene rearrangement, for which different definitions have been proposed (Supplementary Table 1). The lack of a consistent definition of what a clonotype is can lead to different interpretations. This is especially true in Next Generation Sequencing (NGS) repertoire studies, where clustering of rearrangement sequences into clonotypes happens at the initial stages of data processing, thus affecting (meta)data interpretation. Therefore, it is crucial to reach a consensus on a stringent definition of a clonotype, while highlighting cases where an alternative definition could be appropriate, depending on a given context or specific research hypothesis.
- A. Sachinidis, M. Trachana, A. Taparkou, G. Gavriilidis, P. Verginis, F. Psomopoulos, C. Adamichou, D. Boumpas, and A. Garyfallos, “Investigating the Role of T-bet+ B Cells (ABCs/DN) in the Immunopathogenesis of Systemic Lupus Erythematosus,” Mediterranean Journal of Rheumatology, vol. 34, no. 1, p. 117, 2023, doi: 10.31138/mjr.34.1.117.
Background: Age-associated B cells (ABCs) constitute a B cell subset, defined as CD19+CD21−CD11c+, that expands continuously with age and accumulates strongly in individuals with autoimmune and/or infectious diseases. In humans, ABCs are principally IgD-CD27- double-negative (DN) B cells. Data from murine models of autoimmunity implicate ABCs/DN in the development of autoimmune disorders. T-bet, a transcription factor which is highly expressed in these cells, is considered to play a major role in various aspects of autoimmunity, such as the production of autoantibodies and the formation of spontaneous germinal centres. Aims of the study: Despite the available data, the functional features of ABCs/DN and their exact role in the pathogenesis of autoimmunity remain elusive. This project focuses on the investigation of the role of ABCs/DN in the pathogenesis of systemic lupus erythematosus (SLE) in humans, as well as the effects that various pharmacological agents may have on these cells. Methods: Samples from patients with active SLE will be used to enumerate and immunophenotype - via flow cytometry - the ABCs/DN found in the peripheral blood of the patients. Transcriptomic analysis and functional assays for the cells, both before and after in vitro pharmacological treatments, will also be performed. Anticipated benefits: The results of the study are expected to allow characterization of the pathogenetic role of ABCs/DN in SLE and could contribute, following careful association with the clinical state of the patients, towards the discovery and validation of novel prognostic and diagnostic markers of disease.
- M. Tsagiopoulou, V. Chapaprieta, N. Russiñol, B. García-Torre, N. Pechlivanis, F. Nadeu, N. Papakonstantinou, N. Stavroyianni, A. Chatzidimitriou, F. Psomopoulos, E. Campo, K. Stamatopoulos, and J. I. Martin-Subero, “CHROMATIN ACTIVATION PROFILING OF STEREOTYPED CHRONIC LYMPHOCYTIC LEUKEMIAS REVEALS A SUBSET #8 SPECIFIC SIGNATURE,” Blood, Mar. 2023, doi: 10.1182/blood.2022016587.
The chromatin activation landscape of chronic lymphocytic leukemia (CLL) with stereotyped B-cell receptor immunoglobulin is currently unknown. In this study, we report the results of whole-genome chromatin profiling of histone 3 lysine 27 acetylation in 22 CLLs from major subsets, which were compared against nonstereotyped CLLs and normal B-cell subpopulations. Although subsets 1, 2, and 4 did not differ much from their nonstereotyped CLL counterparts, subset 8 displayed a remarkably distinct chromatin activation profile. In particular, we identified 209 de novo active regulatory elements in this subset, which showed similar patterns to U-CLLs undergoing Richter transformation. These regions were enriched for binding sites of 9 overexpressed transcription factors. In 78 of the 209 regions, we identified 113 candidate overexpressed target genes, with 11 regions being associated with more than 2 adjacent genes. These included blocks of up to 7 genes, suggesting local co-upregulation within the same genome compartment. Our findings further underscore the uniqueness of subset 8 CLL, notable for the highest risk of Richter transformation among all CLLs, and provide additional clues to decipher the molecular basis of its clinical behavior.
- E. Vlachonikola et al., “T cell receptor gene repertoire profiles in subgroups of patients with chronic lymphocytic leukemia bearing distinct genomic aberrations,” Frontiers in Oncology, vol. 13, Feb. 2023, doi: 10.3389/fonc.2023.1097942.
Background: Microenvironmental interactions of the malignant clone with T cells are critical throughout the natural history of chronic lymphocytic leukemia (CLL). Indeed, clonal expansions of T cells and shared clonotypes exist between different CLL patients, strongly implying clonal selection by antigens. Moreover, immunogenic neoepitopes have been isolated from the clonotypic B cell receptor immunoglobulin sequences, offering a rationale for immunotherapeutic approaches. Here, we interrogated the T cell receptor (TR) gene repertoire of CLL patients with different genomic aberration profiles aiming to identify unique signatures that would point towards an additional source of immunogenic neoepitopes for T cells. Experimental design: TR gene repertoire profiling using next generation sequencing in groups of patients with CLL carrying one of the following copy-number aberrations (CNAs): del(11q), del(17p), del(13q), trisomy 12, or gene mutations in TP53 or NOTCH1. Results: Oligoclonal expansions were found in all patients with distinct recurrent genomic aberrations; these were more pronounced in cases bearing CNAs, particularly trisomy 12, rather than gene mutations. Shared clonotypes were found both within and across groups, which appeared to be CLL-biased based on extensive comparisons against TR databases from various entities. Moreover, in silico analysis identified TR clonotypes with high binding affinity to neoepitopes predicted to arise from TP53 and NOTCH1 mutations. Conclusions: Distinct TR repertoire profiles were identified in groups of patients with CLL bearing different genomic aberrations, alluding to distinct selection processes. Abnormal protein expression and gene dosage effects associated with recurrent genomic aberrations likely represent a relevant source of CLL-specific selecting antigens.
- S. Hiltemann et al., “Galaxy Training: A powerful framework for teaching!,” PLOS Computational Biology, vol. 19, no. 1, p. e1010752, Jan. 2023, doi: 10.1371/journal.pcbi.1010752.
There is an ongoing explosion of scientific datasets being generated, brought on by recent technological advances in many areas of the natural sciences. As a result, the life sciences have become increasingly computational in nature, and bioinformatics has taken on a central role in research studies. However, basic computational skills, data analysis, and stewardship are still rarely taught in life science educational programs, resulting in a skills gap in many of the researchers tasked with analysing these big datasets. In order to address this skills gap and empower researchers to perform their own data analyses, the Galaxy Training Network (GTN) has previously developed the Galaxy Training Platform (https://training.galaxyproject.org), an open access, community-driven framework for the collection of FAIR (Findable, Accessible, Interoperable, Reusable) training materials for data analysis utilizing the user-friendly Galaxy framework as its primary data analysis platform. Since its inception, this training platform has thrived, with the number of tutorials and contributors growing rapidly, and the range of topics extending beyond life sciences to include topics such as climatology, cheminformatics, and machine learning. While initially aimed at supporting researchers directly, the GTN framework has proven to be an invaluable resource for educators as well. We have focused our efforts in recent years on adding increased support for this growing community of instructors. New features have been added to facilitate the use of the materials in a classroom setting, simplifying the contribution flow for new materials, and have added a set of train-the-trainer lessons. Here, we present the latest developments in the GTN project, aimed at facilitating the use of the Galaxy Training materials by educators, and its usage in different learning environments.
- E. Alloza, J. Lindvall, K. F. Heil, M. Pitoulias, and F. Psomopoulos, “The ELIXIR training platform All Hands Meeting 2023,” May 2023, doi: 10.7490/f1000research.1119426.1.
Presenting high-level developments of the ELIXIR Training Platform during the past year.
- T. Manousaki, E. Pafilis, A. C. Papageorgiou, and F. Psomopoulos, “MBGC: the Molecular Biodiversity Greece Community: a network of networks,” May 2023, doi: 10.7490/f1000research.1119419.1.
In the face of the biodiversity crisis, concerted efforts towards understanding the effects of climate change and habitat loss and fragmentation, both locally and globally, are urgently needed. These are often attempted by leveraging the advances of modern genomics and bioinformatics methodologies. Especially in biodiversity hotspots, the need to understand, monitor and mitigate the loss of biodiversity is pivotal. Greece is a country with especially high endemism, and a large percentage of its endemic species is threatened by climate change and human activities. To this end, the national academic community in biodiversity genomics has established a corresponding network of scientists from various Greek research institutes and universities covering different disciplines of biodiversity research. In these slides we present the efforts of an established national Task Force that will channel the flow of information amongst researchers, policy makers, stakeholders and the local society. Our overarching goal is to build a sustainable community and infrastructure for the efficient management of the entire molecular biodiversity data cycle (i.e., from production and storage to the analysis and modelling of data, development of computational tools, and knowledge extraction). Using national and European infrastructures, such as ELIXIR and LifeWatch, we envision setting the ground for studying biodiversity through the lens of biodiversity genomics and offering evidence-based knowledge to guide management of the habitats and the biodiversity they host, as well as the implementation of appropriate policies.
2022
- M. Barker, N. P. Chue Hong, D. S. Katz, A.-L. Lamprecht, C. Martinez-Ortiz, F. Psomopoulos, J. Harrow, L. J. Castro, M. Gruenpeter, P. A. Martinez, and T. Honeyman, “Introducing the FAIR Principles for research software,” Sci. Data, vol. 9, no. 1, p. 622, Oct. 2022, doi: 10.1038/s41597-022-01710-x.
Research software is a fundamental and vital part of research, yet significant challenges to discoverability, productivity, quality, reproducibility, and sustainability exist. Improving the practice of scholarship is a common goal of the open science, open source, and FAIR (Findable, Accessible, Interoperable and Reusable) communities and research software is now being understood as a type of digital object to which FAIR should be applied. This emergence reflects a maturation of the research community to better understand the crucial role of FAIR research software in maximising research value. The FAIR for Research Software (FAIR4RS) Working Group has adapted the FAIR Guiding Principles to create the FAIR Principles for Research Software (FAIR4RS Principles). The contents and context of the FAIR4RS Principles are summarised here to provide the basis for discussion of their adoption. Examples of implementation by organisations are provided to share information on how to maximise the value of research outputs, and to encourage others to amplify the importance and impact of this work.
- L. Zaragoza-Infante, V. Junet, N. Pechlivanis, S.-C. Fragkouli, S. Amprachamian, T. Koletsa, A. Chatzidimitriou, M. Papaioannou, K. Stamatopoulos, A. Agathangelidis, and F. Psomopoulos, “IgIDivA: immunoglobulin intraclonal diversification analysis,” Briefings in Bioinformatics, Aug. 2022, doi: 10.1093/bib/bbac349.
Intraclonal diversification (ID) within the immunoglobulin (IG) genes expressed by B cell clones arises due to ongoing somatic hypermutation (SHM) in a context of continuous interactions with antigen(s). Defining the nature and order of appearance of SHMs in the IG genes can assist an improved understanding of the ID process, shedding light on the ontogeny and evolution of B cell clones in health and disease. Such an endeavor has been empowered by the introduction of high-throughput sequencing in the study of IG gene repertoires. However, only a few existing tools allow the identification, quantification and characterization of SHMs related to ID, and all have limitations in their analysis, highlighting the need for a purpose-built tool for the comprehensive analysis of the ID process. In this work, we present the immunoglobulin intraclonal diversification analysis (IgIDivA) tool, a novel methodology for the in-depth qualitative and quantitative analysis of the ID process from high-throughput sequencing data. IgIDivA identifies and characterizes SHMs that occur within the variable domain of the rearranged IG genes and studies in detail the connections between identified SHMs, establishing mutational pathways. Moreover, it combines established and new graph-based metrics for the objective determination of ID level, combined with statistical analysis for the comparison of ID-level features between different groups of samples. Of importance, IgIDivA also provides detailed visualizations of ID through the generation of purpose-built graph networks. Beyond the method design, IgIDivA has also been implemented as an R Shiny web application. IgIDivA is freely available at https://bio.tools/igidiva
- S. Laidou, D. Grigoriadis, S. Papanikolaou, S. Foutadakis, S. Ntoufa, M. Tsagiopoulou, G. Vatsellas, A. Anagnostopoulos, A. Kouvatsi, N. Stavroyianni, F. Psomopoulos, A. M. Makris, M. Agelopoulos, D. Thanos, A. Chatzidimitriou, N. Papakonstantinou, and K. Stamatopoulos, “The TAp63/BCL2 axis represents a novel mechanism of clinical aggressiveness in chronic lymphocytic leukemia,” Blood Advances, vol. 6, no. 8, pp. 2646–2656, Apr. 2022, doi: 10.1182/bloodadvances.2021006348.
The TA-isoform of the p63 transcription factor (TAp63) has been reported to contribute to clinical aggressiveness in chronic lymphocytic leukemia (CLL) in a hitherto elusive way. Here, we sought to further understand and define the role of TAp63 in the pathophysiology of CLL. First, we found that elevated TAp63 expression levels are linked with adverse clinical outcomes, including disease relapse and shorter time-to-first treatment and overall survival. Next, prompted by the fact that TAp63 participates in an NF-κB/TAp63/BCL2 antiapoptotic axis in activated mature, normal B cells, we explored molecular links between TAp63 and BCL2 also in CLL. We documented a strong correlation at both the protein and the messenger RNA (mRNA) levels, alluding to the potential prosurvival role of TAp63. This claim was supported by inducible downregulation of TAp63 expression in the MEC1 CLL cell line using clustered regularly interspaced short palindromic repeats (CRISPR) system, which resulted in downregulation of BCL2 expression. Next, using chromatin immunoprecipitation (ChIP) sequencing, we examined whether BCL2 might constitute a transcriptional target of TAp63 and identified a significant binding profile of TAp63 in the BCL2 gene locus, across a genomic region previously characterized as a super enhancer in CLL. Moreover, we identified high-confidence TAp63 binding regions in genes mainly implicated in immune response and DNA-damage procedures. Finally, we found that upregulated TAp63 expression levels render CLL cells less responsive to apoptosis induction with the BCL2 inhibitor venetoclax. On these grounds, TAp63 appears to act as a positive modulator of BCL2, hence contributing to the antiapoptotic phenotype that underlies clinical aggressiveness and treatment resistance in CLL.
- N. Pechlivanis, M. Tsagiopoulou, M. C. Maniou, A. Togkousidis, E. Mouchtaropoulou, S. Chassalevris, T. Chaintoutis, M. Petala, M. Kostoglou, T. Karapantsios, S. Laidou, E. Vlachonikola, A. Chatzidimitriou, A. Papadopoulos, N. Papaioannou, C. I. Dovas, A. Argiriou, and F. Psomopoulos, “Detecting SARS-CoV-2 lineages and mutational load in municipal wastewater and a use-case in the metropolitan area of Thessaloniki, Greece,” Scientific Reports, vol. 12, no. 1, p. 2659, Feb. 2022, doi: 10.1038/s41598-022-06625-6.
The COVID-19 pandemic represents an unprecedented global crisis necessitating novel approaches for, amongst others, early detection of emerging variants relating to the evolution and spread of the virus. Recently, the detection of SARS-CoV-2 RNA in wastewater has emerged as a useful tool to monitor the prevalence of the virus in the community. Here, we propose a novel methodology, called lineagespot, for the monitoring of mutations and the detection of SARS-CoV-2 lineages in wastewater samples using next-generation sequencing (NGS). Our proposed method was tested and evaluated using NGS data produced by the sequencing of 14 wastewater samples from the municipality of Thessaloniki, Greece, covering a 6-month period. The results showed the presence of SARS-CoV-2 variants in wastewater data. lineagespot was able to record the evolution and rapid domination of the Alpha variant (B.1.1.7) in the community, and allowed the correlation between the mutations evident through our approach and the mutations observed in patients from the same area and time periods. lineagespot is an open-source tool, implemented in R, and is freely available on GitHub and registered on bio.tools.
- The Galaxy Community, “The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2022 update,” Nucleic Acids Research, vol. 50, no. W1, pp. W345–W351, Apr. 2022, doi: 10.1093/nar/gkac247.
Galaxy is a mature, browser-accessible workbench for scientific computing. It enables scientists to share, analyze and visualize their own data, with minimal technical impediments. A thriving global community continues to use, maintain and contribute to the project, with support from multiple national infrastructure providers that enable freely accessible analysis and training services. The Galaxy Training Network supports free, self-directed, virtual training with >230 integrated tutorials. Project engagement metrics have continued to grow over the last 2 years, including source code contributions, publications, software packages wrapped as tools, registered users and their daily analysis jobs, and new independent specialized servers. Key Galaxy technical developments include an improved user interface for launching large-scale analyses with many files, interactive tools for exploratory data analysis, and a complete suite of machine learning tools. Important scientific developments enabled by Galaxy include Vertebrate Genome Project (VGP) assembly workflows and global SARS-CoV-2 collaborations.
- M. Tsagiopoulou, N. Pechlivanis, M. C. Maniou, and F. Psomopoulos, “InterTADs: integration of multi-omics data on topologically associated domains, application to chronic lymphocytic leukemia,” NAR Genomics and Bioinformatics, vol. 4, no. 1, Jan. 2022, doi: 10.1093/nargab/lqab121.
The integration of multi-omics data can greatly facilitate the advancement of research in Life Sciences by highlighting new interactions. However, there is currently no widespread procedure for meaningful multi-omics data integration. Here, we present a robust framework, called InterTADs, for integrating multi-omics data derived from the same sample, while considering the chromatin configuration of the genome, i.e. the topologically associating domains (TADs). Following the integration process, statistical analysis highlights the differences between the groups of interest (normal versus cancer cells) relating to (i) independent and (ii) integrated events through TADs. Finally, enrichment analysis using the KEGG database, Gene Ontology and transcription factor binding sites, as well as visualization approaches, are available. We applied InterTADs to multi-omics datasets from 135 patients with chronic lymphocytic leukemia (CLL) and found that the integration through TADs resulted in a dramatic reduction of heterogeneity compared to individual events. Significant differences for individual events and at the TAD level were identified between patients differing in the somatic hypermutation status of the clonotypic immunoglobulin genes, the core biological stratifier in CLL, attesting to the biomedical relevance of InterTADs. In conclusion, our approach suggests a new perspective towards analyzing multi-omics data, by offering reasonable execution time, biological benchmarking and potentially contributing to pattern discovery through TADs.
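The core idea of "integration through TADs" can be illustrated with a minimal sketch (a hypothetical illustration under assumed data shapes, not the InterTADs implementation): features from different omics layers are assigned to the TAD interval containing their genomic coordinate, and group differences are then summarized per TAD rather than per individual event.

```python
from bisect import bisect_right

# Hypothetical TADs on one chromosome as half-open (start, end) intervals, sorted by start.
tads = [(0, 500_000), (500_000, 1_200_000), (1_200_000, 2_000_000)]
starts = [s for s, _ in tads]

def tad_of(pos):
    """Return the index of the TAD containing a genomic position, or None."""
    i = bisect_right(starts, pos) - 1
    if i >= 0 and tads[i][0] <= pos < tads[i][1]:
        return i
    return None

# Toy multi-omics events from the same samples:
# (omics layer, genomic position, normal-vs-cancer difference score).
events = [
    ("methylation", 10_000, -0.4),
    ("expression", 450_000, 1.2),
    ("expression", 600_000, 0.3),
    ("chromatin", 1_500_000, -0.8),
]

# Integrate: pool per-event differences within each TAD and average them,
# so downstream statistics operate on TAD-level scores.
per_tad = {}
for layer, pos, diff in events:
    i = tad_of(pos)
    if i is not None:
        per_tad.setdefault(i, []).append(diff)

tad_scores = {i: sum(v) / len(v) for i, v in per_tad.items()}
# TAD 0 pools the methylation and expression events at 10 kb and 450 kb.
```

Aggregating heterogeneous per-event signals into one score per TAD is what drives the reduction of heterogeneity the abstract describes: events that would look discordant individually are summarized within their shared regulatory neighbourhood.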
- A. Nicolaidis and F. Psomopoulos, “DNA coding and Gödel numbering,” Physica A: Statistical Mechanics and its Applications, vol. 594, p. 127053, 2022, doi: 10.1016/j.physa.2022.127053.
We consider a DNA strand as a mathematical statement. Inspired by the work of Kurt Gödel, we attach to each DNA strand a Gödel number, a product of prime numbers raised to appropriate powers. To each DNA chain corresponds a single Gödel number G, and inversely, given a Gödel number G, we can specify the DNA chain it stands for. Next, considering a single DNA strand composed of N bases, we study the statistical distribution of g, the logarithm of G. Our assumption is that the choice of the mth term is random, with equal probability for the four possible outcomes. The ‘experiment’, to some extent, is similar to throwing a four-faced die N times. Through the moment generating function we obtain the discrete and then the continuum distribution of g. There is excellent agreement between our formalism and simulated data. Finally, we compare our formalism to actual data, to identify the presence of non-random fluctuations.
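The encoding described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code: the abstract specifies a product of primes raised to "appropriate powers", so the particular base-to-exponent mapping used here (A→1, C→2, G→3, T→4) is an assumption for demonstration; any injective mapping makes the encoding invertible by prime factorization.

```python
# Assumed base -> exponent mapping (illustrative; the paper does not fix it in the abstract).
BASE_EXPONENT = {"A": 1, "C": 2, "G": 3, "T": 4}

def first_n_primes(n):
    """Return the first n primes by trial division (adequate for short strands)."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

def godel_number(strand):
    """Encode a DNA strand: the m-th base contributes the m-th prime raised to its exponent."""
    g = 1
    for p, base in zip(first_n_primes(len(strand)), strand):
        g *= p ** BASE_EXPONENT[base]
    return g

def decode(g_number, length):
    """Invert the encoding: the exponent of each prime identifies the base at that position."""
    exponent_to_base = {v: k for k, v in BASE_EXPONENT.items()}
    strand = []
    for p in first_n_primes(length):
        exp = 0
        while g_number % p == 0:
            g_number //= p
            exp += 1
        strand.append(exponent_to_base[exp])
    return "".join(strand)

# "ACGT" -> 2^1 * 3^2 * 5^3 * 7^4 = 5402250; its natural log is the quantity g studied above.
G = godel_number("ACGT")
```

The roundtrip `decode(godel_number(s), len(s)) == s` holds for any strand under this mapping, which is exactly the one-to-one correspondence between chains and Gödel numbers that the abstract relies on.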
- O. Giraldo, R. Alves, D. Bampalikis, J. Fernandez, E. Martin del Pico, F. Psomopoulos, A. Via, and L. J. Castro, “A FAIRification roadmap for ELIXIR Software Management Plans.,” Research Ideas and Outcomes, vol. 8, p. e94608, 2022, doi: 10.3897/rio.8.e94608.
Academic research requires careful handling of data plus any means to collect, transform and publish it, activities commonly supported by research software (from scripts to end-user applications). Data Management Plans (DMPs) are nowadays commonly requested by funders as part of good research practices. A DMP describes the data management lifecycle for the data corresponding to a research project, covering activities from collection to publication and preservation. To support and improve transparency, open science, reproducibility (and other *ilities), data needs to be accompanied by the software transforming it. Similar to DMPs, Software Management Plans (SMPs) can help formalize a set of structures and goals ensuring that the software is accessible and reusable in the short, medium and long term. DMPs and SMPs can be presented as text-based documents, guided by a set of questions corresponding to key points related to the lifecycle of either data or software. A step forward for DMPs is the machine-actionable DMP (maDMP) proposed by the Research Data Alliance DMP Common Standards Working Group. A maDMP corresponds to a structured representation of the most common elements present in a DMP (Miksa et al. 2020b), overcoming some obstacles linked to text-based representation. Such a structured representation makes it easier for DMPs to become readable and reusable for both humans and machines alike. The DMP Common Standard ontology (DCSO) (Cardoso et al. 2022) further supports maDMPs, as it makes it easier to extend the original maDMP application profile to cover additional elements related to, for instance, SMPs or specific requirements from funders. maDMPs can be combined with the notion of Research Object Crates (RO-Crates) to automate and ease the management of research data (Miksa et al. 2020a). An RO-Crate (Soiland-Reyes et al. 2022) is an open, community-driven, and lightweight approach based on schema.org (Guha et al.
2016) annotations in JSON-LD to package research data (or any other research digital object) together with its metadata in a machine-readable manner. The ELIXIR SMP has been developed by the ELIXIR Software Development Best Practices Group in the ELIXIR Tools Platform to support researchers in life sciences (Alves et al. 2021). The ELIXIR SMP aims at making it easier to follow research software good practices aligned to the findable, accessible, interoperable and reusable principles for research software (FAIR4RS) (Chue Hong et al. 2022) while dealing with the lifecycle of research software. Its primary goal is encouraging a wider adoption by life science researchers, and being as inclusive as possible of the various levels of technical expertise. Here we present a roadmap for ELIXIR SMPs to become a FAIR digital object (FDO) (Schultes and Wittenburg 2019) based on the extension of maDMPs and DCSO and the use of RO-Crates. FDOs have been proposed as a way to package digital objects together with their metadata, types, identifiers and operations, so they become more machine-actionable and self-contained. The current version of the ELIXIR SMP includes seven sections: accessibility and licensing, documentation, testing, interoperability, versioning, reproducibility, and recognition. Each section includes questions guiding and supporting researchers so they cover key aspects of the software lifecycle relevant to their own case. To lower the barrier and make it easier for researchers, most questions are Yes/No, with a few offering a set of options. In some cases, a URL is also requested, for instance regarding the location of the documentation for end-users. Our roadmap for ELIXIR SMPs to move from a text-based questionnaire to an FDO comprises four main steps: creating a maSMP application profile, extending DCSO, mapping to schema.org, and using RO-Crates.
Our maSMP application profile will include the semantic representation of the structured metadata that comes from the ELIXIR SMP. We will add granularity to the current root of the DCSO (dcso:DMP) by proposing the term SMP. In addition, we will propose the term ResearchSoftware as a dcso:Dataset. Terminology related to documentation, such as “Objective”, will also be considered. The objective captures the why behind the research software, which is crucial for its comprehensibility. We will propose the term DatasetObjective as the reason for the creation of a dataset. Source-codeRepository and Source-codeTesting are also good candidates to be part of the DCSO extension. We will extend DCSO with new classes and properties as necessary to include the software-related elements mentioned in the maSMP application profile. As the ELIXIR SMP targets the life science community, we will analyze the need to add links from DCSO to ontologies describing common operations, activities, and types in this domain. One important aspect is the creation of a mapping from DCSO to schema.org. Schema.org has become a popular choice for adding lightweight semantics to web pages, but it can also be used on its own to provide metadata describing all sorts of objects. In life sciences, Bioschemas (Gray et al. 2017) offers guidelines on how to use some of the schema.org types aligned to this domain. Bioschemas includes a set of profiles, with minimum, recommended and optional properties, that have been agreed to and adopted by the community; for instance, the ComputationalTool profile provides a way to describe software tools and applications. Bioschemas promotes its adoption by key resources in Life Sciences and the development of tools such as the Bioschemas Markup Scraper and Extractor (BMUSE), used for harvesting the data (Gray et al. 2022). Our final step for ELIXIR SMPs to become an FDO is using RO-Crates to package research software together with its metadata and link it to/from its corresponding SMP.
To do so, we will create an RO-Crate profile capturing the metadata needed to describe software tools, including elements from the SMP. It will become a versioned living crate as research software evolves with time, particularly when new releases are published. Thanks to the RO-Crate bundling nature, where digital objects are packed together with their metadata, a software crate enriched with the elements from the SMP is a good example of an FDO, as all the critical information about a software tool is bound together in a unit that can be shared with peers via FAIR registries and repositories.
- C. Martinez-Ortiz, C. Goble, D. Katz, T. Honeyman, P. Martinez, M. Barker, L. J. Castro, N. Chue Hong, M. Gruenpeter, and J. Harrow, “How does software fit into the FDO landscape?,” Research Ideas and Outcomes, vol. 8, p. e95724, 2022, doi: 10.3897/rio.8.e95724.
In academic research virtually every field has increased its use of digital and computational technology, leading to new scientific discoveries, and this trend is likely to continue. Reliable and efficient scholarly research requires researchers to be able to validate and extend previously generated research results. In the digital era, this implies that digital objects (Kahn and Wilensky 2006) used in research should be Findable, Accessible, Interoperable and Reusable (FAIR). These objects include (but are not limited to) data, software, models (for example, machine learning), representations of physical objects, virtual research environments, workflows, etc. Leaving any of these digital objects out of the FAIR process may result in a loss of academic rigor and may have severe consequences in the long term for the field, such as a reproducibility crisis. In this extended abstract, we focus on research software as a FAIR digital object (FDO). The FDO framework (De Smedt et al. 2020) describes FDOs as actionable units of knowledge, which can be aggregated, analyzed, and processed by different types of algorithms. Such algorithms must be implemented by software in one form or another. The framework also describes large software stacks supporting FDOs, enabling responsible data science and increasing reproducibility. This implies that software is a key ingredient of the FDO framework and should adhere to the FAIR principles. Software plays multiple roles: it is a DO itself, it is responsible for creating new FDOs (e.g., data) and it helps to make them available to the public (e.g., via repositories and registries). However, there is a need to specify in more detail how non-data DOs, in particular software, fit in this framework. Different classes of digital objects have different intrinsic properties and ways to relate to other DOs.
This means that while they are, in principle, subject to the high-level FAIR principles, there are also differences depending on their type and properties, requiring an adaptation so that FAIR implementations are better aligned to the digital object itself. This holds true in particular for software. Software has intrinsic properties (executability, composite nature, development practices, continuous evolution and versioning, and packaging and distribution) and specific needs that must be considered by the FDO framework. For example, open source software is typically developed in the open on social coding platforms, where releases are distributed through package management systems, unlike data, which is typically published in archival repositories. These social coding platforms do not provide long-term archiving, permanent identifiers, or metadata; package management systems, while somewhat better, similarly make no commitment to long-term archiving, do not use identifiers that fit the scholarly publication system well, and provide metadata that may be missing key elements. The FAIR for research software (FAIR4RS, Chue Hong et al. 2021) working group has dedicated significant effort to building a community consensus around developing FAIR principles customized for research software, providing methods for researchers to understand and address these gaps. In this presentation we will highlight the importance of software for the FAIR landscape and why different (but related) FAIR principles are needed for software (versus those originally developed for data). Our goal here is to contribute to building an FDO landscape together, where we consider all the different types of digital objects that are essential in today’s research, and we are enthusiastic about contributing our expertise on research software to help shape this landscape.
- A. Mitsigkolas, N. Pechlivanis, and F. Psomopoulos, “Assessing SARS-CoV-2 evolution through the analysis of emerging mutations,” Oct. 2022, doi: 10.1101/2022.10.25.513701.
The number of studies on SARS-CoV-2 published on a daily basis is constantly increasing, in an attempt to better understand and address the challenges posed by the pandemic. Most of these studies also include a phylogeny of SARS-CoV-2 as background context, always taking into consideration the latest data in order to construct an updated tree. However, some of these studies have also revealed the difficulties of inferring a reliable phylogeny. [13] have shown that inferring a reliable phylogeny is an inherently complex task due to the large number of highly similar sequences, given the relatively low number of mutations evident in each sequence. From this viewpoint, there is indeed a challenge and an opportunity in identifying the evolutionary history of the SARS-CoV-2 virus, in order to assist the phylogenetic analysis process, support researchers in keeping track of the virus and the course of its characteristic mutations, and find patterns of the emerging mutations themselves and the interactions between them. The research question is formulated as follows: detecting new patterns of co-occurring mutations in SARS-CoV-2 data, beyond the strain-specific / strain-defining ones, through the application of ML methods. Going beyond traditional phylogenetic approaches, we will design and implement a clustering method that effectively creates a dendrogram of the involved sequences, based on a feature space defined on the present mutations rather than the entire sequence. Ultimately, this ML method is tested on sequences retrieved from public databases and validated using the available metadata as labels. The main goal of the project is to design, implement and evaluate a software tool that will automatically detect and cluster relevant mutations, which could potentially be used to identify trends in emerging variants.
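A mutation-based feature space of the kind described above can be sketched in a few lines. The use of Jaccard distance for presence/absence data, the mutation label format, and all function names are illustrative assumptions here, not the method of the paper:

```python
# Hypothetical sketch: each sequence becomes a set of observed mutations,
# and pairwise Jaccard distances over those sets feed a hierarchical
# clustering routine (not shown) to build the dendrogram.

def jaccard_distance(a: set, b: set) -> float:
    """1 - |intersection| / |union| of two mutation sets."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def mutation_distances(samples):
    """samples: {sample_id: set of mutation labels, e.g. 'S:D614G'}.
    Returns pairwise distances ready for hierarchical clustering."""
    ids = sorted(samples)
    return {(x, y): jaccard_distance(samples[x], samples[y])
            for i, x in enumerate(ids) for y in ids[i + 1:]}
```

Working on mutation sets rather than full sequences keeps the feature space small even when the sequences themselves are long and nearly identical.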
2021
- M. Osathanunkul, N. Sawongta, W. Pheera, N. Pechlivanis, F. Psomopoulos, and P. Madesis, “Exploring plant diversity through soil DNA in Thai national parks for influencing land reform and agriculture planning,” PeerJ, vol. 9, p. e11753, Aug. 2021, doi: 10.7717/peerj.11753.
The severe deforestation, as indicated in national forest data, is a recurring problem in many areas of Northern Thailand, including Doi Suthep-Pui National Park. Agricultural expansion in these areas is one of the major drivers of deforestation, having adverse consequences on local plant biodiversity. Conserving biodiversity is mainly dependent on the biological monitoring of species distribution and population sizes. However, the existing conventional approaches for monitoring biodiversity are rather limited. Here, we explored soil DNA at four forest types in Doi Suthep-Pui National Park in Northern Thailand. Three soil samples per sampling location, each composed of different soil cores mixed together, were collected. Soil biodiversity was investigated through eDNA metabarcoding analysis using primers targeting the P6 loop of the plastid DNA trnL (UAA) intron. The distribution of taxa for each sample was found to be similar between replicates. A strong congruence between the conventional morphology-based and eDNA-based data of plant diversity in the studied areas was observed. All species recorded by conventional survey with DNA data deposited in GenBank were detected through the eDNA analysis. Moreover, traces of crops, such as lettuce, maize, wheat and soybean, which were not expected and were not visually detected in the forest area, were identified. It is noteworthy that neighboring land and areas in the studied National Park were once used for crop cultivation, and even today there is still agricultural land within a 5–10 km radius of the forest sites where the soil samples were collected. The presence of cultivated areas near the forest may suggest that we are now facing agricultural intensification leading to deforestation. Land reform for agricultural use necessitates coordinated planning in order to preserve the forest area. In that context, the eDNA-based data would be useful for influencing policies and management towards this goal.
- A. C. Dimopoulos, K. Koukoutegos, F. E. Psomopoulos, and P. Moulos, “Combining Multiple RNA-Seq Data Analysis Algorithms Using Machine Learning Improves Differential Isoform Expression Analysis,” Methods and Protocols, vol. 4, no. 4, 2021, doi: 10.3390/mps4040068.
RNA sequencing has become the standard technique for high resolution genome-wide monitoring of gene expression. As such, it often comprises the first step towards understanding complex molecular mechanisms driving various phenotypes, spanning organ development to disease genesis, monitoring and progression. An advantage of RNA sequencing is its ability to capture complex transcriptomic events such as alternative splicing which results in alternate isoform abundance. At the same time, this advantage remains algorithmically and computationally challenging, especially with the emergence of even higher resolution technologies such as single-cell RNA sequencing. Although several algorithms have been proposed for the effective detection of differential isoform expression from RNA-Seq data, no widely accepted golden standards have been established. This fact is further compounded by the significant differences in the output of different algorithms when applied on the same data. In addition, many of the proposed algorithms remain scarce and poorly maintained. Driven by these challenges, we developed a novel integrative approach that effectively combines the most widely used algorithms for differential transcript and isoform analysis using state-of-the-art machine learning techniques. We demonstrate its usability by applying it on simulated data based on several organisms, and using several performance metrics; we conclude that our strategy outperforms the application of the individual algorithms. Finally, our approach is implemented as an R Shiny application, with the underlying data analysis pipelines also available as docker containers.
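The combination idea can be pictured with a toy example, hedged as an illustration rather than the published pipeline (which uses state-of-the-art machine learning techniques): each algorithm's differential-expression calls on labeled simulated data earn it an accuracy weight, and the weighted vote then produces the combined call. All names are assumptions:

```python
# Toy accuracy-weighted voting over the binary calls of several
# differential-expression algorithms. calls: {tool: {isoform_id: 0/1}};
# labels: {isoform_id: 0/1} ground truth from simulations.

def tool_weights(calls, labels):
    """Weight each algorithm by its accuracy on labeled (simulated) data."""
    weights = {}
    for tool, preds in calls.items():
        correct = sum(preds[i] == labels[i] for i in labels)
        weights[tool] = correct / len(labels)
    return weights

def combined_call(calls, weights, isoform, threshold=0.5):
    """Accuracy-weighted vote across tools for one isoform."""
    total = sum(weights.values())
    score = sum(w * calls[t][isoform] for t, w in weights.items())
    return score / total >= threshold
```

A real stacking approach would learn a richer combiner over the tools' scores, but the principle of letting validated performance drive the combination is the same.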
- M. Tsagiopoulou, A. Togkousidis, N. Pechlivanis, M. C. Maniou, A. Batsali, A. Matheakakis, C. Pontikoglou, and F. Psomopoulos, “miRkit: R Framework Analyzing miRNA PCR Array Data,” BMC Research Notes, vol. 14, no. 376, Sep. 2021, doi: 10.1186/s13104-021-05788-1.
The characterization of microRNAs (miRNA) in recent years is an important advance in the field of gene regulation. To this end, several approaches for miRNA expression analysis and various bioinformatics tools have been developed over the last few years. It is common practice to analyze miRNA PCR Array data using commercially available software, mostly due to its convenience and ease of use.
- I. Walsh et al., “DOME: recommendations for supervised machine learning validation in biology,” Nature Methods, Jul. 2021, doi: 10.1038/s41592-021-01205-4.
With the steep decline in the cost of many high-throughput technologies, large amounts of biological data are being generated and made accessible to researchers. Machine learning (ML) has come into the spotlight as a very useful approach for understanding cellular, genomic, proteomic, post-translational, metabolic and drug discovery data, with the potential to result in ground-breaking medical applications. This is clearly reflected in the corresponding growth of ML publications (Fig. 1), reporting a wide range of modeling techniques in biology. While ideally ML methods should be validated experimentally, this happens only in a fraction of the publications. We believe that the time is right for the ML community to develop standards for reporting ML-based analyses, to enable critical assessment and improve reproducibility.
- S. Ntoufa, M. Gerousi, S. Laidou, F. Psomopoulos, G. Tsiolas, T. Moysiadis, N. Papakonstantinou, L. Mansouri, A. Anagnostopoulos, N. Stavrogianni, S. Pospisilova, K. Plevova, A. M. Makris, R. Rosenquist, and K. Stamatopoulos, “RPS15 mutations rewire RNA translation in chronic lymphocytic leukemia,” Blood Advances, vol. 5, no. 13, pp. 2788–2792, Jul. 2021, doi: 10.1182/bloodadvances.2020001717.
Recent studies of chronic lymphocytic leukemia (CLL) have reported recurrent mutations in the RPS15 gene, which encodes the ribosomal protein S15 (RPS15), a component of the 40S ribosomal subunit. Despite some evidence about the role of mutant RPS15 (mostly obtained from the analysis of cell lines), the precise impact of RPS15 mutations on the translational program in primary CLL cells remains largely unexplored. Here, using RNA sequencing and ribosome profiling, a technique that involves measuring translational efficiency, we sought to obtain global insight into changes in translation induced by RPS15 mutations in CLL cells. To this end, we evaluated primary CLL cells from patients with wild-type or mutant RPS15 as well as MEC1 CLL cells transfected with mutant or wild-type RPS15. Our data indicate that RPS15 mutations rewire the translation program of primary CLL cells by reducing their translational efficiency, an effect not seen in MEC1 cells. In detail, RPS15 mutant primary CLL cells displayed altered translation efficiency of other ribosomal proteins and regulatory elements that affect key cell processes, such as the translational machinery and immune signaling, as well as genes known to be implicated in CLL, hence highlighting a relevant role for RPS15 in the natural history of CLL.
- N. Pechlivanis, A. Togkousidis, M. Tsagiopoulou, S. Sgardelis, I. Kappas, and F. Psomopoulos, “A Computational Framework for Pattern Detection on Unaligned Sequences: An Application on SARS-CoV-2 Data,” Frontiers in Genetics, vol. 12, May 2021, doi: 10.3389/fgene.2021.618170.
The exponential growth of available genome sequences has spurred research on pattern detection with the aim of extracting evolutionary signal. Traditional approaches, such as multiple sequence alignment, rely on positional homology in order to reconstruct the phylogenetic history of taxa. Yet, mining information from the plethora of biological data and delineating species on a genetic basis still proves to be an extremely difficult problem. Multiple algorithms and techniques have been developed in order to approach the problem multidimensionally. Here, we propose a computational framework for identifying potentially meaningful features based on k-mers retrieved from unaligned sequence data. Specifically, we have developed a process which makes use of unsupervised learning techniques in order to identify characteristic k-mers of the input dataset across a range of different k-values and within a reasonable time frame. We use these k-mers as features for clustering the input sequences and identifying differences between the distributions of k-mers across the dataset. The developed algorithm is part of an innovative and promising approach both to the problem of grouping sequence data based on their inherent characteristic features, and to the study of changes in the distributions of k-mers as the k-value varies within a range of values. Our framework is fully developed in Python as open source software licensed under the MIT License, and is freely available at https://github.com/BiodataAnalysisGroup/kmerAnalyzer.
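The k-mer feature space underlying this kind of analysis can be sketched in a few lines. This is a minimal illustration, not the kmerAnalyzer code; the choice of cosine distance and the function names are assumptions:

```python
# Represent each unaligned sequence by its k-mer count vector and
# compare sequences by cosine distance in that feature space.

from collections import Counter
from itertools import combinations
from math import sqrt

def kmer_counts(seq: str, k: int) -> Counter:
    """Count all overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity of two sparse k-mer count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0

def distance_matrix(seqs, k=4):
    """Pairwise distances usable by any hierarchical clustering routine."""
    profiles = [kmer_counts(s, k) for s in seqs]
    return {(i, j): cosine_distance(profiles[i], profiles[j])
            for i, j in combinations(range(len(seqs)), 2)}
```

Because no alignment is needed, the same pipeline applies unchanged across the whole range of k-values the framework explores.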
- M. Tsagiopoulou, M. C. Maniou, N. Pechlivanis, A. Togkousidis, M. Kotrová, T. Hutzenlaub, I. Kappas, A. Chatzidimitriou, and F. Psomopoulos, “UMIc: A Preprocessing Method for UMI Deduplication and Reads Correction,” Frontiers in Genetics, vol. 12, May 2021, doi: 10.3389/fgene.2021.660366.
A recent refinement in high-throughput sequencing involves the incorporation of unique molecular identifiers (UMIs), which are random oligonucleotide barcodes, during the library preparation steps. A UMI adds a unique identity to different DNA/RNA input molecules through polymerase chain reaction (PCR) amplification, thus reducing the bias of this step. Here, we propose an alignment-free framework, called UMIc, serving as a preprocessing step on FASTQ files for the deduplication and correction of reads, building a consensus sequence from each UMI group. Our approach takes into account the frequency and Phred quality of nucleotides, as well as the distances between the UMIs and the actual sequences. We have tested the tool using different scenarios of UMI-tagged library data, with wide applicability in mind. UMIc is an open-source tool implemented in R and is freely available from https://github.com/BiodataAnalysisGroup/UMIc.
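The core idea of UMI-based deduplication can be illustrated with a minimal sketch. This is not the UMIc implementation: the quality-weighted majority vote, the equal-length-reads assumption, and the function names are simplifications for the example:

```python
# Reads sharing a UMI are collapsed into one consensus sequence by a
# per-position vote, with each base weighted by its Phred quality score.

from collections import defaultdict

def consensus(reads):
    """reads: list of (sequence, [phred_scores]) of equal length."""
    out = []
    for pos in range(len(reads[0][0])):
        weight = defaultdict(int)
        for seq, quals in reads:
            weight[seq[pos]] += quals[pos]  # weight each base by its quality
        out.append(max(weight, key=weight.get))
    return "".join(out)

def deduplicate(tagged_reads):
    """tagged_reads: list of (umi, sequence, [phred_scores]).
    Returns one consensus read per UMI group."""
    groups = defaultdict(list)
    for umi, seq, quals in tagged_reads:
        groups[umi].append((seq, quals))
    return {umi: consensus(g) for umi, g in groups.items()}
```

Collapsing each UMI group to a consensus both removes PCR duplicates and corrects sporadic sequencing errors, since a low-quality mismatching base is outvoted by the rest of the group.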
- K. Gemenetzi, F. Psomopoulos, A. A. Carriles, M. Gounari, C. Minici, K. Plevova, L.-A. Sutton, M. Tsagiopoulou, P. Baliakas, K. Pasentsis, A. Anagnostopoulos, R. Sandaltzopoulos, R. Rosenquist, F. Davi, S. Pospisilova, P. Ghia, K. Stamatopoulos, M. Degano, and A. Chatzidimitriou, “Higher-order immunoglobulin repertoire restrictions in CLL: the illustrative case of stereotyped subsets 2 and 169,” Blood, vol. 137, no. 14, pp. 1895–1904, Apr. 2021, doi: 10.1182/blood.2020005216.
Chronic lymphocytic leukemia (CLL) major stereotyped subset 2 (IGHV3-21/IGLV3-21, ∼2.5% of all cases of CLL) is an aggressive disease variant, irrespective of the somatic hypermutation (SHM) status of the clonotypic IGHV gene. Minor stereotyped subset 169 (IGHV3-48/IGLV3-21, ∼0.2% of all cases of CLL) is related to subset 2, as it displays a highly similar variable antigen-binding site. We further explored this relationship through next-generation sequencing and crystallographic analysis of the clonotypic B-cell receptor immunoglobulin. Branching evolution of the predominant clonotype through intraclonal diversification in the context of ongoing SHM was evident in both heavy and light chain genes of both subsets. Molecular similarities between the 2 subsets were highlighted by the finding of shared SHMs within both the heavy and light chain genes in all analyzed cases at either the clonal or subclonal level. Particularly noteworthy in this respect was a ubiquitous SHM at the linker region between the variable and the constant domain of the IGLV3-21 light chains, previously reported as critical for immunoglobulin homotypic interactions underlying cell-autonomous signaling capacity. Notably, crystallographic analysis revealed that the IGLV3-21–bearing CLL subset 169 immunoglobulin retains the same geometry and contact residues for the homotypic intermolecular interaction observed in subset 2, including the SHM at the linker region, and, from a molecular standpoint, belongs to a common structural mode of autologous recognition. Collectively, our findings document that stereotyped subsets 2 and 169 are very closely related, displaying shared immunoglobulin features that can be explained only in the context of shared functional selection.
- M. Velegraki, N. Papakonstantinou, L. Kalaitzaki, S. Ntoufa, S. Laidou, M. Tsagiopoulou, N. Bizymi, A. Damianaki, I. Mavroudi, C. Pontikoglou, and H. A. Papadaki, “Increased proportion and altered properties of intermediate monocytes in the peripheral blood of patients with lower risk Myelodysplastic Syndrome,” Blood Cells, Molecules, and Diseases, vol. 86, p. 102507, Feb. 2021, doi: 10.1016/j.bcmd.2020.102507.
Immune deregulation has a critical role in the pathogenesis of lower risk myelodysplastic syndromes (MDS). The cells of the macrophage/monocyte lineage have been reported to contribute to the inflammatory process in MDS through impaired phagocytosis of the apoptotic hemopoietic cells and abnormal production of cytokines. In the present study we assessed the number of peripheral blood (PB) monocyte subsets, namely the classical CD14bright/CD16−, intermediate CD14bright/CD16+ and non-classical CD14dim/CD16+ cells, in patients with lower risk (low/intermediate-I) MDS (n = 32). We also assessed the production of tumor necrosis factor (TNF)α by patient PB monocytes in response to immune stimulus, as well as their transcriptome profile. Compared to age- and sex-matched healthy individuals (n = 19), MDS patients had a significantly lower number of classical and an increased number of intermediate monocytes. Patient intermediate monocytes displayed increased production of TNFα following stimulation with lipopolysaccharide, compared to healthy individuals. Transcriptional profiling comparison of CD16+ monocytes from patients and controls revealed 43 differentially expressed genes, mostly associated with biological pathways/processes relevant to hemopoiesis, immune signaling and cell adhesion. These data provide evidence for the first time that distinct monocyte subsets display abnormal quantitative and functional characteristics in lower risk MDS, substantiating their role in the immune deregulation associated with the disease.
- M. Gerousi, F. Psomopoulos, K. Kotta, M. Tsagiopoulou, N. Stavroyianni, A. Anagnostopoulos, A. Anastasiadis, M. Gkanidou, I. Kotsianidis, S. Ntoufa, and K. Stamatopoulos, “The Calcitriol/Vitamin D Receptor System Regulates Key Immune Signaling Pathways in Chronic Lymphocytic Leukemia,” Cancers, vol. 13, no. 2, 2021, doi: 10.3390/cancers13020285.
It has been proposed that vitamin D may play a role in the prevention and treatment of cancer, while epidemiological studies have linked vitamin D insufficiency to adverse disease outcomes in various B cell malignancies, including chronic lymphocytic leukemia (CLL). In this study, we sought to obtain deeper biological insight into the role of vitamin D and its receptor (VDR) in the pathophysiology of CLL. To this end, we performed expression analysis of the vitamin D pathway molecules, complemented by RNA-sequencing analysis in primary CLL cells that were treated in vitro with calcitriol, the biologically active form of vitamin D. In addition, we examined calcitriol effects ex vivo in CLL cells cultured in the presence of microenvironmental signals, namely anti-IgM/CD40L, or co-cultured with the supportive HS-5 cells, as well as in CLL cells from patients under ibrutinib treatment. Our study reports that the calcitriol/VDR system is functional in CLL, regulating signaling pathways critical for cell survival and proliferation, including the TLR and PI3K/AKT pathways. Moreover, calcitriol action is likely independent of the microenvironmental signals in CLL, since it was not significantly affected when combined with anti-IgM/CD40L or in the context of the co-culture system. This was also supported by our finding of preserved calcitriol signaling capacity in CLL patients under ibrutinib treatment. Overall, our results indicate a relevant biological role for vitamin D in CLL pathophysiology and allude to the potential clinical utility of vitamin D supplementation in patients with CLL.
2020
- A. Agathangelidis, C. Galigalidou, L. Scarfò, T. Moysiadis, A. Rovida, M. Gounari, F. Psomopoulos, P. Ranghetti, A. Galanis, F. Davi, K. Stamatopoulos, A. Chatzidimitriou, and P. Ghia, “Infrequent ‘chronic lymphocytic leukemia-specific’ immunoglobulin stereotypes in aged individuals with or without low-count monoclonal B-cell lymphocytosis,” Haematologica, vol. 106, no. 4, pp. 1178–1181, Jun. 2020, doi: 10.3324/haematol.2020.247908.
Chronic lymphocytic leukemia (CLL) is a chronic, incurable malignancy of antigen-experienced B cells, mainly affecting the aged population. Immunogenetic analysis in CLL revealed the existence of subsets of patients expressing stereotyped B-cell receptor immunoglobulins (BcR IG), which represent homogeneous CLL variants with distinct biological and clinical characteristics. Little is known regarding the presence of “CLL-specific”, stereotyped BcR IG within the repertoire of healthy individuals. Low-throughput studies led to the identification of cases with stereotyped BcR IG, followed by next-generation sequencing studies that found CLL stereotypes in normal B-cell populations, albeit at very low frequencies.
- A. Vardi et al., “T-Cell Dynamics in Chronic Lymphocytic Leukemia under Different Treatment Modalities,” Clinical Cancer Research, vol. 26, no. 18, pp. 4958–4969, 2020, doi: 10.1158/1078-0432.CCR-19-3827.
Purpose: Using next-generation sequencing (NGS), we recently documented T-cell oligoclonality in treatment-naïve chronic lymphocytic leukemia (CLL), with evidence indicating T-cell selection by restricted antigens. Experimental Design: Here, we sought to comprehensively assess T-cell repertoire changes during treatment in relation to (i) treatment type [fludarabine-cyclophosphamide-rituximab (FCR) versus ibrutinib (IB) versus rituximab-idelalisib (R-ID)], and (ii) clinical response, by combining NGS immunoprofiling, flow cytometry, and functional bioassays. Results: T-cell clonality significantly increased at (i) 3 months in the FCR and R-ID treatment groups, and (ii) over deepening clinical response in the R-ID group, with a similar trend detected in the IB group. Notably, in contrast to FCR, which induced T-cell repertoire reconstitution, B-cell receptor signaling inhibitors (BcRi) preserved pretreatment clones. Extensive comparisons both within CLL as well as against T-cell receptor sequence databases showed little similarity with other entities, but instead revealed major clonotypes shared exclusively by patients with CLL, alluding to selection by conserved CLL-associated antigens. We then evaluated the functional effect of treatments on T cells and found that (i) R-ID upregulated the expression of activation markers in effector memory T cells, and (ii) both BcRi improved antitumor T-cell immune synapse formation, in marked contrast to FCR. Conclusions: Taken together, our NGS immunoprofiling data suggest that BcRi retain T-cell clones that may have developed against CLL-associated antigens. Phenotypic and immune synapse bioassays support a concurrent restoration of functionality, mostly evident for R-ID, arguably contributing to clinical response.
- C. C. Austin et al., “Fostering global data sharing: highlighting the recommendations of the Research Data Alliance COVID-19 working group [version 1; peer review: 1 approved, 2 approved with reservations],” Wellcome Open Research, vol. 5, no. 267, 2020, doi: 10.12688/wellcomeopenres.16378.1.
The systemic challenges of the COVID-19 pandemic require cross-disciplinary collaboration in a global and timely fashion. Such collaboration needs open research practices and the sharing of research outputs, such as data and code, thereby facilitating research and research reproducibility and timely collaboration beyond borders. The Research Data Alliance COVID-19 Working Group recently published a set of recommendations and guidelines on data sharing and related best practices for COVID-19 research. These guidelines include recommendations for clinicians, researchers, policy- and decision-makers, funders, publishers, public health experts, disaster preparedness and response experts, infrastructure providers from the perspective of different domains (Clinical Medicine, Omics, Epidemiology, Social Sciences, Community Participation, Indigenous Peoples, Research Software, Legal and Ethical Considerations), and other potential users. Several overarching themes have emerged from this document such as the need to balance the creation of data adherent to FAIR principles (findable, accessible, interoperable and reusable), with the need for quick data release; the use of trustworthy research data repositories; the use of well-annotated data with meaningful metadata; and practices of documenting methods and software. The resulting document marks an unprecedented cross-disciplinary, cross-sectoral, and cross-jurisdictional effort authored by over 160 experts from around the globe.
This letter summarises key points of the Recommendations and Guidelines, highlights the relevant findings, shines a spotlight on the process, and suggests how these developments can be leveraged by the wider scientific community.
- M. T. Kotouza, K. Gemenetzi, C. Galigalidou, E. Vlachonikola, N. Pechlivanis, A. Agathangelidis, R. Sandaltzopoulos, P. A. Mitkas, K. Stamatopoulos, A. Chatzidimitriou, and F. E. Psomopoulos, “TRIP - T cell receptor/immunoglobulin profiler,” BMC Bioinformatics, vol. 21, no. 422, Sep. 2020, doi: 10.1186/s12859-020-03669-1.
Antigen receptors are characterized by an extreme diversity of specificities, which poses major computational and analytical challenges, particularly in the era of high-throughput immunoprofiling by next generation sequencing (NGS). The T cell Receptor/Immunoglobulin Profiler (TRIP) tool offers the opportunity for an in-depth analysis based on the processing of the output files of the IMGT/HighV-Quest tool, a standard in NGS immunoprofiling, through a number of interoperable modules. These provide detailed information about antigen receptor gene rearrangements, including variable (V), diversity (D) and joining (J) gene usage, CDR3 amino acid and nucleotide composition and clonality of both T cell receptors (TR) and B cell receptor immunoglobulins (BcR IG), and characteristics of the somatic hypermutation within the BcR IG genes. TRIP is a web application implemented in R Shiny.
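The abstract above mentions clonality computation without specifying a formula; one metric commonly used in repertoire analyses is 1 minus the normalized Shannon entropy of clonotype frequencies. A minimal Python sketch under that assumption (the clonotypes and the choice of metric are illustrative, not TRIP's actual implementation):

```python
import math
from collections import Counter

def clonality(clonotypes):
    """Return 1 - normalized Shannon entropy of clonotype frequencies.

    A common repertoire clonality index: 0 for a fully even repertoire,
    approaching 1 as a single clonotype dominates.
    """
    counts = Counter(clonotypes)
    total = sum(counts.values())
    freqs = [c / total for c in counts.values()]
    if len(freqs) < 2:
        return 1.0  # a single clonotype is maximally clonal by convention
    entropy = -sum(f * math.log(f) for f in freqs)
    return 1.0 - entropy / math.log(len(freqs))

# Hypothetical rearrangements keyed by (V gene, CDR3 amino acid sequence)
reads = [("IGHV4-34", "CARGGYW"), ("IGHV4-34", "CARGGYW"),
         ("IGHV1-69", "CARDPSFDYW"), ("IGHV3-21", "CARWLRGW")]
print(round(clonality(reads), 3))  # ≈ 0.054: a fairly even toy repertoire
```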
- A.-C. Vagiona, M. A. Andrade-Navarro, F. Psomopoulos, and S. Petrakis, “Dynamics of a Protein Interaction Network Associated to the Aggregation of polyQ-Expanded Ataxin-1,” Genes, vol. 11, no. 10, p. 1129, Sep. 2020, doi: 10.3390/genes11101129.
Background: Several experimental models of polyglutamine (polyQ) diseases have been previously developed that are useful for studying disease progression in the primarily affected central nervous system. However, there is a missing link between cellular and animal models that would indicate the molecular defects that occur in neurons and are responsible for the disease phenotype in vivo. Methods: Here, we used a computational approach to identify dysregulated pathways shared by an in vitro and an in vivo model of ATXN1(Q82) protein aggregation, the mutant protein that causes the neurodegenerative polyQ disease spinocerebellar ataxia type-1 (SCA1). Results: A set of common dysregulated pathways were identified, which were utilized to construct cerebellum-specific protein-protein interaction (PPI) networks at various time-points of protein aggregation. Analysis of a SCA1 network indicated important nodes which regulate its function and might represent potential pharmacological targets. Furthermore, a set of drugs interacting with these nodes and predicted to enter the blood–brain barrier (BBB) was identified. Conclusions: Our study points to molecular mechanisms of SCA1 linked from both cellular and animal models and suggests drugs that could be tested to determine whether they affect the aggregation of pathogenic ATXN1 and SCA1 disease progression.
- F. E. Psomopoulos, J. van Helden, C. Médigue, A. Chasapi, and C. A. Ouzounis, “Ancestral state reconstruction of metabolic pathways across pangenome ensembles,” Microbial Genomics, 2020, doi: 10.1099/mgen.0.000429.
As genome sequencing efforts are unveiling the genetic diversity of the biosphere with an unprecedented speed, there is a need to accurately describe the structural and functional properties of groups of extant species whose genomes have been sequenced, as well as their inferred ancestors, at any given taxonomic level of their phylogeny. Elaborate approaches for the reconstruction of ancestral states at the sequence level have been developed, subsequently augmented by methods based on gene content. While these approaches of sequence or gene-content reconstruction have been successfully deployed, there has been less progress on the explicit inference of functional properties of ancestral genomes, in terms of metabolic pathways and other cellular processes. Herein, we describe PathTrace, an efficient algorithm for parsimony-based reconstructions of the evolutionary history of individual metabolic pathways, pivotal representations of key functional modules of cellular function. The algorithm is implemented as a five-step process through which pathways are represented as fuzzy vectors, where each enzyme is associated with a taxonomic conservation value derived from the phylogenetic profile of its protein sequence. The method is evaluated with a selected benchmark set of pathways against collections of genome sequences from key data resources. By deploying a pangenome-driven approach for pathway sets, we demonstrate that the inferred patterns are largely insensitive to noise, as opposed to gene-content reconstruction methods. In addition, the resulting reconstructions are closely correlated with the evolutionary distance of the taxa under study, suggesting that a diligent selection of target pangenomes is essential for maintaining cohesiveness of the method and consistency of the inference, serving as an internal control for an arbitrary selection of queries. 
The PathTrace method is a first step towards the large-scale analysis of metabolic pathway evolution and our deeper understanding of functional relationships reflected in emerging pangenome collections.
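As a rough illustration of the fuzzy-vector representation described above, an enzyme's taxonomic conservation value could be derived from its presence/absence phylogenetic profile as the fraction of taxa in which the enzyme is detected. This is a simplified sketch under that assumption; the EC numbers and profiles are invented, and PathTrace's actual scoring may differ:

```python
def conservation(profile):
    """Fraction of taxa in which the enzyme is detected: one possible
    conservation score derived from a presence/absence phylogenetic profile."""
    return sum(profile) / len(profile)

# Hypothetical presence/absence profiles over five taxa for three enzymes
profiles = {
    "EC 2.7.1.1":  [1, 1, 1, 1, 0],   # widely conserved
    "EC 5.3.1.9":  [1, 1, 1, 1, 1],   # universal in this toy set
    "EC 1.1.1.27": [1, 0, 0, 1, 0],   # patchy distribution
}

# A pathway becomes a fuzzy vector: one conservation value per enzyme
pathway_vector = {ec: conservation(p) for ec, p in profiles.items()}
print(pathway_vector)
```

Parsimony-based reconstruction would then operate on such vectors rather than on raw gene content, which is what the abstract credits for the method's robustness to noise.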
- K. T. Gurwitz et al., “A framework to assess the quality and impact of bioinformatics training across ELIXIR,” PLOS Computational Biology, vol. 16, no. 7, pp. 1–12, Jul. 2020, doi: 10.1371/journal.pcbi.1007976.
ELIXIR is a pan-European intergovernmental organisation for life science that aims to coordinate bioinformatics resources in a single infrastructure across Europe; bioinformatics training is central to its strategy, which aims to develop a training community that spans all ELIXIR member states. In an evidence-based approach for strengthening bioinformatics training programmes across Europe, the ELIXIR Training Platform, led by the ELIXIR EXCELERATE Quality and Impact Assessment Subtask in collaboration with the ELIXIR Training Coordinators Group, has implemented an assessment strategy to measure quality and impact of its entire training portfolio. Here, we present ELIXIR’s framework for assessing training quality and impact, which includes the following: specifying assessment aims, determining what data to collect in order to address these aims, and our strategy for centralised data collection to allow for ELIXIR-wide analyses. In addition, we present an overview of the ELIXIR training data collected over the past 4 years. We highlight the importance of a coordinated and consistent data collection approach and the relevance of defining specific metrics and answer scales for consortium-wide analyses as well as for comparison of data across iterations of the same course.
- L. Garcia et al., “Ten simple rules for making training materials FAIR,” PLOS Computational Biology, vol. 16, no. 5, pp. 1–9, May 2020, doi: 10.1371/journal.pcbi.1007854.
Author summary: Everything we do today is becoming more and more reliant on the use of computers. The field of biology is no exception; but most biologists receive little or no formal preparation for the increasingly computational aspects of their discipline. In consequence, informal training courses are often needed to plug the gaps; and the demand for such training is growing worldwide. To meet this demand, some training programs are being expanded, and new ones are being developed. Key to both scenarios is the creation of new course materials. Rather than starting from scratch, however, it’s sometimes possible to repurpose materials that already exist. Yet finding suitable materials online can be difficult: They’re often widely scattered across the internet or hidden in their home institutions, with no systematic way to find them. This is a common problem for all digital objects. The scientific community has attempted to address this issue by developing a set of rules (which have been called the Findable, Accessible, Interoperable and Reusable [FAIR] principles) to make such objects more findable and reusable. Here, we show how to apply these rules to help make training materials easier to find, (re)use, and adapt, for the benefit of all.
- S. Laidou et al., “Nuclear inclusions of pathogenic ataxin-1 induce oxidative stress and perturb the protein synthesis machinery,” Redox Biology, vol. 32, p. 101458, 2020, doi: 10.1016/j.redox.2020.101458.
Spinocerebellar ataxia type-1 (SCA1) is caused by an abnormally expanded polyglutamine (polyQ) tract in ataxin-1. These expansions are responsible for protein misfolding and self-assembly into intranuclear inclusion bodies (IIBs) that are somehow linked to neuronal death. However, owing to lack of a suitable cellular model, the downstream consequences of IIB formation are yet to be resolved. Here, we describe a nuclear protein aggregation model of pathogenic human ataxin-1 and characterize IIB effects. Using an inducible Sleeping Beauty transposon system, we overexpressed the ATXN1(Q82) gene in human mesenchymal stem cells that are resistant to the early cytotoxic effects caused by the expression of the mutant protein. We characterized the structure and the protein composition of insoluble polyQ IIBs which gradually occupy the nuclei and are responsible for the generation of reactive oxygen species. In response to their formation, our transcriptome analysis reveals a cerebellum-specific perturbed protein interaction network, primarily affecting protein synthesis. We propose that insoluble polyQ IIBs cause oxidative and nucleolar stress and affect the assembly of the ribosome by capturing or down-regulating essential components. The inducible cell system can be utilized to decipher the cellular consequences of polyQ protein aggregation. Our strategy provides a broadly applicable methodology for studying polyQ diseases.
- M. Tsagiopoulou, V. Chapaprieta, M. Duran-Ferrer, T. Moysiadis, F. Psomopoulos, P. Kollia, N. Papakonstantinou, E. Campo, K. Stamatopoulos, and J. I. Martin-Subero, “Chronic lymphocytic leukemias with trisomy 12 show a distinct DNA methylation profile linked to altered chromatin activation,” Haematologica, 2020, doi: 10.3324/haematol.2019.240721.
Chronic lymphocytic leukemia (CLL) is a neoplasm derived from mature B cells showing a broad spectrum of clinico-biological features.1 The landscape of genetic alterations of CLL is well characterized2 and found to be extremely heterogeneous, with multiple chromosomal aberrations and dozens of driver genes mutated in relatively small proportions of the cases.3,4 In spite of this heterogeneity, four cytogenetic alterations, i.e., del(13q) (>50% of the patients), del(11q) (18%), +12 (16%), and less frequently del(17p) (7%), are collectively detected in at least 80% of patients.1 These copy number changes are part of the routine risk assessment of CLL, as they are robustly associated with treatment choices and the clinical course of the patients. At one end of the prognostic spectrum, isolated del(13q) is associated with a favorable prognosis, +12 with an intermediate prognosis, del(11q) with a poor prognosis, and del(17p) with the worst prognosis of all groups. This latter subgroup identifies patients with particular resistance to chemoimmunotherapy who, instead, benefit considerably from biological agents.
- A. Agathangelidis, C. Galigalidou, L. Scarfò, T. Moysiadis, A. Rovida, E. Vlachonikola, E. Sofou, F. Psomopoulos, A. Vardi, P. Ranghetti, A. Siorenta, A. Galanis, K. Stamatopoulos, A. Chatzidimitriou, and P. Ghia, “High-throughput analysis of the T cell receptor gene repertoire in low-count monoclonal B cell lymphocytosis reveals a distinct profile from chronic lymphocytic leukemia,” Haematologica, 2020, doi: 10.3324/haematol.2019.221275.
Monoclonal B-cell lymphocytosis (MBL) is an asymptomatic condition of monoclonal B-cell expansions in the blood of healthy, mostly elderly, individuals. MBL is classified into three distinct subtypes: (i) “chronic lymphocytic leukemia (CLL)-like” MBL (CD5+CD23+), which accounts for the vast majority of cases; (ii) “atypical CLL-like” MBL (CD5+CD23−CD20bright); and (iii) “non CLL-like” MBL (CD5−).3 “CLL-like” MBL is subdivided into two different categories based on clonal size; cases with 0.5–5×10⁹ cells/L are categorized as “high-count MBL” (HC-MBL), whereas those with <0.5×10⁹ cells/L as “low-count MBL” (LC-MBL).4 HC-MBL progresses to CLL requiring treatment at a rate of 1-2% per year, whereas the risk of progression for “CLL-like” LC-MBL is negligible despite persisting over time.
- E. Gavriilaki et al., “Pretransplant Genetic Susceptibility: Clinical Relevance in Transplant-Associated Thrombotic Microangiopathy,” Thrombosis and Haemostasis, vol. 120, no. 04, pp. 638–646, 2020, doi: 10.1055/s-0040-1702225.
Transplant-associated thrombotic microangiopathy (TA-TMA) is a life-threatening complication of allogeneic hematopoietic cell transplantation (HCT). We hypothesized that pretransplant genetic susceptibility is evident in adult TA-TMA and further investigated the association of TMA-associated variants with clinical outcomes. We studied 40 patients with TA-TMA, donors of 18 patients and 40 control non-TMA HCT recipients, without significant differences in transplant characteristics. Genomic DNA from pretransplant peripheral blood was sequenced for TMA-associated genes. Donors presented significantly lower frequency of rare variants and variants in exonic/splicing/untranslated region (UTR) regions, compared with TA-TMA patients. Controls also showed a significantly lower frequency of rare variants in ADAMTS13, CD46, CFH, and CFI. The majority of TA-TMA patients (31/40) presented with pathogenic or likely pathogenic variants. Patients refractory to conventional treatment (62%) and patients that succumbed to transplant-related mortality (65%) were significantly enriched for variants in exonic/splicing/UTR regions. In conclusion, the increased incidence of pathogenic variants, rare variants, and variants in exonic/splicing/UTR regions in TA-TMA patients suggests genetic susceptibility not evident in controls or donors. Notably, variants in exonic/splicing/UTR regions were associated with poor response and survival. Therefore, pretransplant genomic screening may be useful to intensify monitoring and early intervention in patients at high risk for TA-TMA.
- M. T. Kotouza, F. E. Psomopoulos, and P. A. Mitkas, “A dockerized framework for hierarchical frequency-based document clustering on cloud computing infrastructures,” Journal of Cloud Computing, vol. 9, no. 2, pp. 1–17, 2020, doi: 10.1186/s13677-019-0150-y.
Scalable big data analysis frameworks are of paramount importance in the modern web society, which is characterized by a huge number of resources, including electronic text documents. Document clustering is an important field in text mining and is commonly used for document organization, browsing, summarization and classification. Hierarchical clustering methods construct a hierarchy structure that, combined with the produced clusters, can be useful in managing documents, thus making the browsing and navigation process easier and quicker, and providing only relevant information to the users’ queries by leveraging the structure relationships. Nevertheless, the high computational cost and memory usage of baseline hierarchical clustering algorithms render them inappropriate for the vast number of documents that must be handled daily. In this paper, we propose a new scalable hierarchical clustering framework, which uses the frequency of the topics in the documents to overcome these limitations. Our work consists of a binary tree construction algorithm that creates a hierarchy of the documents using three metrics (Identity, Entropy, Bin Similarity), and a branch breaking algorithm which composes the final clusters by applying thresholds to each branch of the tree. The clustering algorithm is followed by a meta-clustering module which makes use of graph theory to gain insights into the leaf clusters’ connections. The feature vectors representing each document derive from topic modeling. At the implementation level, the clustering method has been dockerized in order to facilitate its deployment on cloud computing infrastructures. Finally, the proposed framework is evaluated on several datasets of varying size and content, achieving significant reduction in both memory consumption and computational time over existing hierarchical clustering algorithms. The experiments also include performance testing on cloud resources using different setups and the results are promising.
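The branch-breaking step described above can be sketched as a traversal that cuts the document hierarchy into flat clusters wherever a branch satisfies a threshold. The abstract does not define the Identity, Entropy, and Bin Similarity metrics, so this Python sketch substitutes a single made-up cohesion score per node:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    docs: List[str]                    # documents under this node
    similarity: float = 1.0            # illustrative within-node cohesion score
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def break_branches(node, threshold, clusters):
    """Cut the hierarchy into flat clusters: emit a cluster once a node's
    cohesion meets the threshold (or at a leaf), otherwise keep descending."""
    is_leaf = node.left is None and node.right is None
    if is_leaf or node.similarity >= threshold:
        clusters.append(node.docs)
        return
    for child in (node.left, node.right):
        if child is not None:
            break_branches(child, threshold, clusters)

# Toy hierarchy over four documents (cohesion values are made up)
tree = Node(docs=["d1", "d2", "d3", "d4"], similarity=0.3,
            left=Node(["d1", "d2"], similarity=0.8),
            right=Node(["d3", "d4"], similarity=0.4,
                       left=Node(["d3"]), right=Node(["d4"])))

clusters = []
break_branches(tree, 0.7, clusters)
print(clusters)  # [['d1', 'd2'], ['d3'], ['d4']]
```

The cohesive left branch survives as one cluster, while the loose right branch is broken down to its leaves.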
2019
- A.-L. Lamprecht, L. Garcia, M. Kuzak, C. Martinez, R. Arcila, E. Martin Del Pico, V. Dominguez Del Angel, S. van de Sandt, J. Ison, P. A. Martinez, P. McQuilton, A. Valencia, J. Harrow, F. Psomopoulos, J. L. Gelpi, N. Chue Hong, C. Goble, and S. Capella-Gutierrez, “Towards FAIR principles for research software,” Data Science, vol. 2, no. 2, pp. 1–23, 2019, doi: 10.3233/DS-190026.
The FAIR Guiding Principles, published in 2016, aim to improve the findability, accessibility, interoperability and reusability of digital research objects for both humans and machines. Until now the FAIR principles have been mostly applied to research data. The ideas behind these principles are, however, also directly relevant to research software. Hence there is a distinct need to explore how the FAIR principles can be applied to software. In this work, we aim to summarize the current status of the debate around FAIR and software, as a basis for the development of community-agreed principles for FAIR research software in the future. We discuss what makes software different from data with regard to the application of the FAIR principles, and which desired characteristics of research software go beyond FAIR. Then we present an analysis of where the existing principles can directly be applied to software, where they need to be adapted or reinterpreted, and where the definition of additional principles is required. Here interoperability has proven to be the most challenging principle, calling for particular attention in future discussions. Finally, we outline next steps on the way towards definite FAIR principles for research software.
- M. Kuzak, J. Harrow, P. A. Martinez, F. E. Psomopoulos, and A. Via, “ELIXIR Europe on the Road to Sustainable Research Software,” Biodiversity Information Science and Standards, vol. 3, p. e37677, 2019, doi: 10.3897/biss.3.37677.
ELIXIR (ELIXIR Europe 2019a) is an intergovernmental organization that brings together life science resources across Europe. These resources include databases, software tools, training materials, cloud storage, and supercomputers. One of the goals of ELIXIR is to coordinate these resources so that they form a single infrastructure. This infrastructure makes it easier for scientists to find and share data, exchange expertise, and agree on best practices. ELIXIR’s activities are divided into the following five areas, each known as a “platform”: Data, Tools, Interoperability, Compute and Training. The ELIXIR Tools Platform works to improve the discovery, quality and sustainability of software resources. The Software Development Best Practices task of the Tools Platform aims to raise the quality and sustainability of research software by producing, adopting, and promoting information standards and best practices relevant to the software development life cycle. We have published four (4OSS) simple recommendations to encourage best practices in research software (Jiménez et al. 2017) and the Top 10 metrics for recommended life science software practices (Artaza et al. 2016). The 4OSS simple recommendations are as follows: (1) develop publicly accessible, open-source code from day one; (2) make software easy to discover by providing software metadata via a popular community registry; (3) adopt a license and comply with the licenses of third-party dependencies; and (4) have clear and transparent contribution, governance and communication processes. To encourage researchers and developers to adopt the 4OSS recommendations and build FAIR (Findable, Accessible, Interoperable and Reusable) software, the best practices group, in partnership with the ELIXIR Training Platform, The Carpentries (Carpentries 2019, ELIXIR Europe 2019b), and other communities, is creating a collection of training materials (Kuzak et al. 2019).
The next step is to adopt, promote, and recognise these information standards and best practices. The group will address this by (i) developing comprehensive guidelines for software curation, (ii) through training researchers and developers towards the adoption of software best practices and (iii) improvement of the usability of Tools Platform products. Additionally, a direct outcome of this task will be a software management plan template, connected to a concise description of the guidelines for open research software; and production of a white paper for the software development management plan for ELIXIR, which can be consequently used to produce training materials. We will work with the newly formed ReSA (Research Software Alliance) to facilitate the adoption of this plan for the broader community.
- F. F. Parlapani, S. Michailidou, D. A. Anagnostopoulos, S. Koromilas, K. Kios, K. Pasentsis, F. Psomopoulos, A. Argiriou, S. A. Haroutounian, and I. S. Boziaris, “Bacterial communities and potential spoilage markers of whole blue crab (Callinectes sapidus) stored under commercial simulated conditions,” Food Microbiology, vol. 82, pp. 325–333, 2019, doi: 10.1016/j.fm.2019.03.011.
The bacterial community composition, determined by 16S Next Generation Sequencing (NGS), and the Volatile Organic Compounds (VOCs) profile of whole blue crabs (Callinectes sapidus) stored at 4 and 10 °C (proper and abuse temperatures), simulating real storage conditions, were characterized. Conventional microbiological and chemical analyses (Total Volatile Base-Nitrogen/TVB-N and Trimethylamine-Nitrogen/TMA-N) were also carried out. The rejection time point was 10 and 6 days for the whole crabs stored at 4 and 10 °C, respectively, as determined by the development of unpleasant odors, which coincided with crab death. Initially, the Aerobic Plate Count (APC) was 4.87 log cfu/g and increased by 3 logs at the rejection time. The 16S NGS analysis of DNA extracted directly from the crab tissue (culture-independent method) showed that the initial microbiota of the blue crab mainly consisted of Candidatus Bacilloplasma, while potential pathogens, e.g. Listeria monocytogenes, Pseudomonas aeruginosa and Acinetobacter baumannii, were also found. At the rejection point, bacteria of the Rhodobacteraceae family (52%) and Vibrio spp. (40.2%) dominated at 4 and 10 °C, respectively. TVB-N and TMA-N also increased, reaching higher values at the higher storage temperature. The relative concentrations of some VOCs, such as 1-octen-3-ol, trans-2-octenal, trans,trans-2,4-heptadienal, 2-butanone, 3-butanone, 2-heptanone, ethyl isobutyrate, ethyl acetate, ethyl-2-methylbutyrate, ethyl isovalerate, hexanoic acid ethyl ester and indole, exhibited an increasing trend during crab storage, making them promising spoilage markers. The composition of microbial communities at different storage temperatures was examined by 16S amplicon meta-barcoding analysis. This kind of analysis, in conjunction with the volatile profile, can be used to explore microbiological quality and further assist in applying appropriate strategies to extend crab shelf-life and protect consumers’ health.
- A. M. Kintsakis, F. E. Psomopoulos, and P. A. Mitkas, “Reinforcement Learning based scheduling in a workflow management system,” Engineering Applications of Artificial Intelligence, vol. 81, pp. 94–106, 2019, doi: 10.1016/j.engappai.2019.02.013.
Any computational process from simple data analytics tasks to training a machine learning model can be described by a workflow. Many workflow management systems (WMS) exist that undertake the task of scheduling workflows across distributed computational resources. In this work, we introduce a WMS that leverages machine learning to predict workflow task runtime and the probability of failure of task assignments to execution sites. The expected runtime of workflow tasks can be used to approximate the weight of the workflow graph branches with respect to the total workflow workload, and the ability to anticipate task failures can discourage task assignments that are unlikely to succeed. We demonstrate that the proposed machine learning models can lead to significantly more informed scheduling decisions that minimize task failures and utilize execution sites more efficiently, thus leading to reduced workflow runtime. Additionally, we train a modified sequence-to-sequence neural network architecture via reinforcement learning to perform scheduling decisions as part of a WMS. Our approach introduces a WMS that can drastically improve its scheduling performance by independently learning over time, without external intervention or reliance on any specific heuristic or optimization technique. Finally, we test our approach in real-world scenarios utilizing computationally demanding and data intensive workflows and evaluate its performance against existing scheduling methodologies traditionally used in WMSes. The performance evaluation outcome confirms that the proposed approach significantly outperforms the other scheduling algorithms in a consistent manner and achieves the best execution runtime with the lowest number of failed tasks and communication costs.
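As a toy illustration of how predicted runtime and failure probability can jointly inform a scheduling decision (a simplified scoring rule, not the paper's learned models or RL policy): if a failed task is simply retried, the number of attempts is geometric, so the expected completion time is runtime / (1 − p_fail). The site names and prediction values below are invented:

```python
def expected_cost(runtime, p_fail):
    """Expected completion time if a failed task is retried until it succeeds:
    a geometric number of attempts gives runtime / (1 - p_fail)."""
    if p_fail >= 1.0:
        return float("inf")
    return runtime / (1.0 - p_fail)

def pick_site(predictions):
    """Choose the execution site with the lowest failure-adjusted runtime.
    `predictions` maps site -> (predicted runtime in s, predicted failure prob.)."""
    return min(predictions, key=lambda s: expected_cost(*predictions[s]))

# Hypothetical model outputs for one workflow task on three sites
predictions = {
    "site-A": (120.0, 0.05),   # fast and reliable
    "site-B": (90.0, 0.40),    # faster but flaky
    "site-C": (200.0, 0.01),   # reliable but slow
}
print(pick_site(predictions))  # site-A: 120/0.95 ≈ 126 beats 90/0.60 = 150
```

The raw-fastest site loses once its failure probability is priced in, which is the intuition behind letting failure predictions discourage risky assignments.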
- M. Wu, F. Psomopoulos, S. J. Khalsa, and A. de Waard, “Data Discovery Paradigms: User Requirements and Recommendations for Data Repositories,” Data Science Journal, vol. 18, no. 1, p. 13, 2019, doi: 10.5334/dsj-2019-003.
As data repositories make more data openly available it becomes challenging for researchers to find what they need either from a repository or through web search engines. This study attempts to investigate data users’ requirements and the role that data repositories can play in supporting data discoverability by meeting those requirements. We collected 79 data discovery use cases (or data search scenarios), from which we derived nine functional requirements for data repositories through qualitative analysis. We then applied usability heuristic evaluation and expert review methods to identify best practices that data repositories can implement to meet each functional requirement. We propose the following ten recommendations for data repository operators to consider for improving data discoverability and users’ data search experience: 1. Provide a range of query interfaces to accommodate various data search behaviours. 2. Provide multiple access points to find data. 3. Make it easier for researchers to judge relevance, accessibility and reusability of a data collection from a search summary. 4. Make individual metadata records readable and analysable. 5. Enable sharing and downloading of bibliographic references. 6. Expose data usage statistics. 7. Strive for consistency with other repositories. 8. Identify and aggregate metadata records that describe the same data object. 9. Make metadata records easily indexed and searchable by major web search engines. 10. Follow API search standards and community adopted vocabularies for interoperability.
- M. Kuzak, J. Harrow, A. Via, and F. Psomopoulos, “ELIXIR tools platform: software best practices,” Jul. 2019, doi: 10.7490/f1000research.1117050.1.
ELIXIR is an intergovernmental organization that brings together life science resources across Europe. These resources include databases, software tools, training materials, cloud storage, and supercomputers. One of the goals of ELIXIR is to coordinate these resources so that they form a single infrastructure. This infrastructure makes it easier for scientists to find and share data, exchange expertise, and agree on best practices. ELIXIR’s activities are divided into the following five areas, known as “platforms”: Data, Tools, Interoperability, Compute and Training. The ELIXIR Tools Platform works to improve the discovery, quality and sustainability of software resources. The Software Best Practices task of the Tools Platform aims to raise the quality and sustainability of research software by producing, adopting, promoting and measuring information standards and best practices applied to the software development life cycle. We have published four (4OSS) simple recommendations to encourage best practices in research software and the Top 10 metrics for life science software good practices. The next step is to adopt, promote, and recognize these information standards and best practices, by developing comprehensive guidelines for software curation, and through workshops for training researchers and developers towards the adoption of software best practices and improvement of the usability of Tools Platform products. Additionally, a direct outcome of this task will be a software management plan template, connected to a concise description of the guidelines for open research software, and a white paper for the software development management plan for ELIXIR, which can subsequently be used to produce training material.
Conferences and Announcements
2024
- “ELIXIR Training Platform: SPLASH – Skills, Professional development, Learning Assessment, Support and Help,” May 2024, doi: 10.7490/f1000research.1119716.1.
The ELIXIR Training Platform (TrP) is a key infrastructure of Europe’s bioinformatics training landscape, aiming to strengthen national training programmes and grow bioinformatics training capacity and competence across Europe and beyond. Over the past decade, the TrP has actively collaborated with partners and experts to establish best practices, tools and standards, resulting in consistent improvement in training quality and capacity across ELIXIR members. In its new work programme, the TrP aims to consolidate these resources and achievements, intensify support for stakeholders, and disseminate these resources more effectively. This initiative provides several benefits for stakeholders, offering strategic advantages by aligning training strategies across ELIXIR Communities and fostering cohesion and synergy to optimise training efforts and advance bioinformatics education and research across Europe. Central to this is SPLASH, a new digital hub built around the training lifecycle that embraces the whole ELIXIR training ecosystem. It guides training stakeholders through planning, designing, delivering, and evaluating training, all essential elements for fortifying a robust training strategy. It will showcase training resources and projects in ELIXIR, such as: the Training eSupport System (TeSS) portal, disseminating training events and materials; the ELIXIR-GOBLET Train-the-Trainer programme, building capacity in training skills; Learning Paths, spearheading the development of structured learning programmes; the Training Metrics Database (TMD), providing training impact metrics; FAIR Training, spearheading the implementation of FAIR principles in training; Training Certification, establishing a certification process for training; the ELIXIR Training Lesson Template, ELIXIR’s template for authoring and publishing lessons; E-learning, providing best practices for e-learning; and the ELIXIR-SI eLearning Platform (EeLP), providing e-learning management systems.
- A. Orfanou, V. Vasileiou, G. Gavriilidis, M. Van Baardwijk, N. Ishaque, A. Stubbs, and F. Psomopoulos, “Poster: Enhancing Perturbation Modeling in Single-Cell Data through Advanced Deep Learning Approaches — Awarded 1st Place, Best Poster Award at the 23rd European Conference on Computational Biology (ECCB2024), Turku, Finland, 16–20 September 2024,” 2024, doi: 10.5281/ZENODO.17051956.
This poster was presented at the 23rd European Conference on Computational Biology (ECCB2024), held in Turku, Finland, 16–20 September 2024, hosted by the University of Turku and CSC – IT Center for Science, where it received the Best Poster Award (1st Place). Perturbation modelling in single-cell data is crucial for studying molecular changes elicited by molecular knockouts, chemical compounds, and biological stimulants across health and disease phenotypes. It is confounded by scarce biological explainability, statistical uncertainty of Deep Learning (DL) predictions in extreme perturbation scenarios, hyperparameter optimization, and limited scalability to multi-omic single-cell data. We present the Mongoose project (Multi-Objective Network Generator Of Optimized Single-cell Experiments) to explore enhancing perturbation modelling in complex single-cell datasets through Multi-Task Learning (MTL). Mongoose combines (i) UnitedNet, a DL framework that employs MTL to simultaneously perform joint group identification and cross-modal prediction, with Shapley values (SHAP) as an explainability component, and (ii) perturbation modelling tools like SCING and GenKI, which reconstruct and perturb cell type-specific gene-regulatory networks (GRNs). UnitedNet contains an encoder-decoder-discriminator structure which approximates the statistical characteristics of each modality without prior assumptions. Hence, we claim that UnitedNet can facilitate biologically informed decisions on conducting ensuing digital KOs across reverse-engineered GRNs by providing (i) better cell-type clusters and (ii) SHAP values of significant cross-modal/cell-type associations from complex single-cell and spatial omic datasets.
We will showcase the Mongoose approach on complex multi-omic mRNA/protein datasets like the Perturb-CITE-seq (CRISPR knock-outs) and the spatial DBiT-seq mouse embryo dataset, where mRNA-protein associations and spatial niche identification are expected to play pivotal roles in perturbations like in-silico GRN digital KOs. We anticipate that Mongoose will provide perturbational insights closer to ground truths, ultimately highlighting critical transcription factors and signalling pathways with potential translational value.
- V. Vasileiou, G. Gavriilidis, M. Mraz, P. Zeni, A. Giannakakis, G. Pavlopoulos, and F. Psomopoulos, “Biologically informed Deep Learning graph-based framework for unveiling critical lncRNAs in CLL,” 2024, doi: 10.5281/ZENODO.15294356.
Transcriptomic analyses elucidate complexities in diseases like Chronic Lymphocytic Leukemia (CLL), revealing specific molecules, cellular traits, and signalling patterns influencing clinical outcomes. The mutational status of the variable region of the IGHV genes expressed by the leukemic cells is crucial for classifying CLL cases, providing insights into their clinical condition and prognosis: mutated CLL (M-CLL) exhibits a more indolent course than the more aggressive unmutated CLL (U-CLL). Long non-coding RNAs (lncRNAs) play diverse roles in cancer but are often overlooked due to limitations in computational approaches. To address this, we propose a biologically informed Deep Learning framework for supervised patient classification, instructed by lncRNA-mRNA-pathway interactions and coupled with eXplainable Artificial Intelligence (XAI) read-outs. The pipeline utilizes NetBID2 for network-based integration of omics data and PASNet for the deep learning analysis, building on lncrnalyzer and EnrichR knowledge graphs as input for the biological priors. Complementing these, we deploy custom code integrating SHAP values and heatmaps for interpretability, combined with a bespoke graph architecture representing lncRNA-mRNA-pathway interactions, facilitating a comprehensive analysis of regulatory mechanisms. The framework is being tested on two extensive CLL transcriptomic datasets: data from the ICGC-CLLES project (257 samples) are used for training, with the resulting model validated on the BloodCancerMultiOmics2017 dataset (113 samples). Preliminary results showed that U-CLL cases are easily differentiated from M-CLL cases, as expected, additionally revealing both biologically established pathways and hitherto unknown lncRNAs of potential prognostic value in the disease.
- N. Pechlivanis, A. Anastasiadou, A. Papageorgiou, E. Pafilis, and F. Psomopoulos, “Odyssey: an Interactive R Shiny App Approach to explore Molecular Biodiversity in Greece,” Sep. 2024, doi: 10.5281/ZENODO.14186452.
Sustainable development and ecological protection depend on an understanding of molecular biodiversity. Greece, being one of the hotspots of European biodiversity, holds a wealth of genomic data waiting to be explored. However, traditional data analysis techniques remain difficult for many researchers to access.
- S.-C. Fragkouli, N. Pechlivanis, A. Anastasiadou, G. Karakatsoulis, A. Orfanou, P. Kollia, A. Agathangelidis, and F. Psomopoulos, “Synth4bench: generating synthetic genomics data for the evaluation of somatic variant callers,” Sep. 2024, doi: 10.5281/ZENODO.14186509.
Identifying cancer-related genomic alterations with algorithms is vital, but evaluating their accuracy is challenging due to the lack of high-quality datasets. To address this issue, we implemented synth4bench, a framework that uses the NEATv3.3 simulator to generate synthetic genomics data, runs somatic variant callers and then benchmarks their results.
- M. Gerousi, M. Fitopoulou, M. Galanou, P. Rock, G. Karakatsoulis, A. Anastasiadou, M. Karipidou, A. Iatrou, E. Karamanli, N. Vastarouchas, P. Ghia, C. C. Chu, C. S. Zent, R. Burack, A. Chatzidimitriou, and K. Stamatopoulos, “Distinct Signaling Profiles in Primary Splenic Small B-Cell Lymphomas,” Nov. 2024, doi: 10.1182/blood-2024-207057.
Splenic small B-cell lymphomas encompass a diverse group of rare entities that originate in or significantly involve the spleen, typically accompanied by bone marrow and peripheral blood involvement. Despite significant advancements in understanding their biology, the role of microenvironmental signals in shaping clonal behavior and clinical outcomes remains poorly understood, largely due to the scarcity of relevant primary patient samples. Here, we investigated for the first time the signaling capacity, both at the basal state and after triggering with microenvironmental stimuli, in primary splenic biopsy specimens of patients with splenic marginal zone lymphoma (SMZL; n=31), splenic diffuse red pulp lymphoma (SDRPL; n=7) and hairy cell leukemia variant (HCLv; n=10). Flow cytometric analysis of TLR1-10 expression in single-cell suspensions from splenectomy samples revealed distinct patterns, ranging from uniformly high in HCLv, through intermediate in SDRPL, to generally lower yet variable expression in SMZL, with significant differences between SMZL vs HCLv (e.g. TLR1, FD=8.6, p<0.001) and SDRPL vs HCLv (e.g. TLR5, FD=1.7, p<0.05). TLRs 2, 8 and 10 exhibited the lowest expression, whereas TLR7 was highly expressed in all entities (>98% TLR7+ cells). Flow cytometry also disclosed: i) an increased cell proliferation rate in HCLv vs SMZL (FD=6, p<0.01), assessed by Ki67 expression; ii) enhanced expression of the CD86 activation marker in HCLv vs SMZL (FD=3.2, p<0.01); and, iii) increased expression of the CD25 activation marker in SMZL vs either HCLv or SDRPL (FD=6.3 and FD=6.8, p<0.05, respectively). Next, we examined the functional capacity of TLRs after stimulation with specific ligands for the TLR1/2 and TLR2/6 heterodimers, TLR4 and TLR9 (Pam3CSK4, FSL-1, LPS and CpG, respectively).
Both CD25 and CD86 were significantly upregulated in response to FSL-1, CpG and LPS stimulation in SMZL (FD=1.23-1.81, p<0.01), while only CD25 was upregulated in HCLv (FD=2.1, p<0.05); SDRPL was unaffected. In all entities, TLR triggering also resulted in enhanced proliferation of splenic B cells compared to unstimulated control cells, albeit in a heterogeneous manner. In particular, i) all cases strongly responded to FSL-1, particularly HCLv and SDRPL (FD=3.1, p<0.01 and FD=5.75, p<0.05, respectively); ii) both HCLv and SDRPL showed a moderate response to Pam3CSK4 (FD=1.2 and FD=2, p<0.05, respectively); and, iii) all cases showed augmented proliferation after triggering with CpG, particularly SMZL and SDRPL (FD=11.3, p<0.01 and FD=7.1, p<0.05, respectively). Cell viability, assessed by Annexin V, was markedly augmented in HCLv after CpG and LPS stimulation (FD=1.21 and FD=1.31, p<0.05, respectively), whereas it was unaffected in SDRPL and SMZL. Next, we assessed signaling capacity after both isolated and combined BcR and TLR9 stimulation. SMZL cells displayed increased ERK and NF-κB phosphorylation after double stimulation (FD=1.44, p<0.01 and FD=1.96, p<0.05, respectively), while single BcR stimulation also induced ERK and NF-κB phosphorylation (FD=1.23, p<0.05 and FD=1.43, p=0.1, respectively), albeit to a lesser extent. Elevated pERK levels were also found in HCLv after single BcR and double BcR/TLR9 stimulation (FD=1.33 and FD=1.53, p<0.01, respectively). On the contrary, SDRPL cells did not present any change in phosphorylation status. Our prior studies in circulating SMZL B cells have implicated the histone methyltransferase EZH2 in the response to microenvironmental triggering.
Here, using flow cytometry, we found that: i) EZH2 was expressed in all entities, albeit significantly higher in SDRPL and HCLv compared to SMZL (FD=1.64, p<0.01 and FD=1.83, p<0.001, respectively); ii) H3K27me3, the main target of EZH2, was highly (95%) and uniformly expressed in all entities, indicating active EZH2; iii) BcR/TLR9 triggering resulted in significant upregulation of EZH2 expression in SMZL (FD=2.57, p<0.01); and, iv) neither SDRPL nor HCLv displayed significantly altered EZH2 expression after co-stimulation. Taken together, signaling through the BcR and TLRs is functional in splenic small B-cell lymphomas, leading to activation of downstream pathways and increased proliferation. Nevertheless, each entity presents a particular signaling capacity, with SMZL, in particular, appearing rather distinct from SDRPL and HCLv in terms of responsiveness to external triggering.
2023
- M. Gerousi, G. Gavriilidis, S. Keisaris, A. Kourouni, A. Orfanou, A. Iatrou, A. Pseftogkas, G. Mosialos, E. Theodosiou, A. Chatzidimitriou, F. Psomopoulos, P. Ghia, K. Stamatopoulos, and K. Xanthopoulos, “The Deubiquitinase CYLD Acts As an Oncogene in a Cellular Model of Chronic Lymphocytic Leukemia,” in Blood, Nov. 2023, vol. 142, no. Supplement 1, p. 3265, doi: 10.1182/blood-2023-188983.
The cylindromatosis protein (CYLD) is a functional deubiquitinase that regulates critical signaling pathways, e.g. NF-κB and Wnt, thus modulating several cellular functions. CYLD acts as a tumor suppressor gene in solid tumors, and is also involved in the pathogenesis of hematological malignancies, including B cell lymphomas, in as yet unclear ways. In chronic lymphocytic leukemia (CLL), preliminary evidence suggests that reduced expression of CYLD correlates with a worse clinical prognosis, which appears to be in line with its postulated role as a tumor suppressor. Here we sought to gain insight into the function of CYLD in CLL through genetic engineering, molecular characterization and bioenergetic profiling. To this end, we used CRISPR/Cas9 technology and a CYLD-targeting or an unrelated gRNA to generate stable CYLD-knockout (CYLDko) and control (CYLDwt) MEC1 cells. Phenotypic characterization of CYLDko versus CYLDwt MEC1 cells by flow cytometry showed (i) significantly reduced viability, assessed by Annexin V (fold difference, FD=1.2, p<0.05); (ii) lower cell proliferation rate, assessed by Ki67 expression (FD=1.4, p<0.05); (iii) increased apoptosis, determined by measuring active caspase 3 expression levels (FD=6.1, p<0.01); and, (iv) reduced expression of CD86 (FD=4.3, p<0.001) and CD40 (FD=4.1, p=0.05). Western blotting analysis of CYLDko versus CYLDwt MEC1 cells revealed down-regulation of the NF-κB pathway, evidenced by diminished expression of ΙΚΚβ (FD=2.4, p<0.01), phospho-ΙκΒα (FD=3.4, p<0.01) and phospho-p105 (FD=2.5, p=0.2); and of the Wnt pathway, evidenced by reduced β-catenin levels (FD=2, p<0.05). Transcriptome profiling by RNA-seq gave concordant results, documenting increased apoptosis and decreased NF-κΒ signaling in CYLDko versus CYLDwt MEC1 cells.
The former also showed downregulation of calcium, BcR and PI3K/AKT/mTOR signaling pathways and, in contrast, upregulation of the glutathione and KEAP1-NFE2L2 pathways, which contribute to antioxidant defense and nutrient metabolism, as well as regulation of redox balance and cellular metabolism, respectively. To evaluate the impact of CYLD deletion on CLL bioenergetics, we assessed the ATP production rate, as a marker of active cellular metabolism, using the Seahorse XF Real-Time ATP Rate assay. CYLDko MEC1 cells exhibited impaired ATP production, reflected in decreased OCR (oxygen consumption rate) and ECAR (extracellular acidification rate), which are proportional to OXPHOS (oxidative phosphorylation) and glycolysis, respectively. Regarding the metabolic phenotype, CYLDwt MEC1 cells displayed a shift towards mitochondrial OXPHOS, whereas in CYLDko MEC1 cells ATP production was mostly based on glycolysis. High-performance liquid chromatography was performed on culture supernatants obtained at sequential time points to quantify glucose uptake and lactate secretion rates. The analysis revealed that CYLDwt MEC1 cells primarily directed glucose carbon toward biomass biosynthetic pathways, as reflected in the higher cell number and proliferation rate achieved, while CYLD deletion redirected carbon flux towards enhanced lactate formation. In both cases, once glucose was consumed (day 4), the secreted lactate was re-used, yet this was more pronounced in CYLDko MEC1 cells. Finally, we explored whether CYLD knockout might impact MEC1 sensitivity to targeted agents, i.e. the BTK inhibitor ibrutinib and the BCL2 inhibitor venetoclax. We found that CYLDko MEC1 cells presented increased apoptosis compared to their CYLDwt counterparts when cultured in the presence of either drug.
Moreover, treatment with ibrutinib or venetoclax led to reduced ATP production rates in both CYLDko and CYLDwt MEC1 cells, albeit the reduction was more pronounced in the former (FD=4.5 for ibrutinib; FD=10 for venetoclax, compared to the respective treated CYLDwt MEC1 cells). Taken together, we demonstrate for the first time that CYLD can also act as an oncogene, at least in the context of CLL and in particular in the MEC1 cell line model of CLL, since its elimination leads to (i) lower proliferation and increased apoptosis rates coupled with diminished signaling capacity; (ii) metabolic rewiring toward enhanced lactate formation; and, (iii) augmented sensitivity to CLL therapeutic agents. It remains to be elucidated under which conditions this could also occur in patients with CLL or other B lymphoproliferative disorders.
- G. Gkoliou, N. Pechlivanis, S. Chatzileontiadou, C. Xydopoulou, C. Frouzaki, G. Karakatsoulis, E. Vlachonikola, M. Gerousi, F. Psomopoulos, A. Siorenta, M. Papaioannou, K. Chlichlia, K. Stamatopoulos, E. Hatjiharissi, and A. Chatzidimitriou, “P835: IN SILICO PREDICTION REVEALS PUTATIVE T-CELL CLASS I/II NEOEPITOPES WITHIN THE CLONOTYPIC IMMUNOGLOBULIN HEAVY AND LIGHT CHAINS IN PATIENTS WITH MULTIPLE MYELOMA,” in HemaSphere, Aug. 2023, vol. 7, no. S3, p. e2734671, doi: 10.1097/01.hs9.0000970244.27346.71.
Several T cell defects have been identified in multiple myeloma (MM), particularly pertaining to T cell exhaustion and senescence, likely as a result of chronic antigenic stimulation. Arguably, therefore, antigenic stimulation is highly relevant for shaping the T cell repertoire in MM; however, the nature of the implicated antigens remains elusive. Similar to other mature B cell malignancies, immunoglobulin (IG) gene rearrangements in MM lead to the expression of unique, novel peptide sequences that can serve as neoantigens for T cells.
- A. Iatrou, E. Sofou, E. Kotroni, L. Ann Sutton, M. Frenquelli, R. Sandaltzopoulos, I. Sakellari, N. Stavrogianni, F. Psomopoulos, P. Ghia, R. Rosenquist, A. Agathangelidis, A. Chatzidimitriou, and K. Stamatopoulos, “P605: IMMUNOGENETICS AND ANTIGEN REACTIVITY PROFILING CONTRIBUTE TO UNRAVELLING THE ONTOGENY OF CLL STEREOTYPED SUBSET #4,” in HemaSphere, Aug. 2023, vol. 7, no. S3, p. e7074446, doi: 10.1097/01.hs9.0000969324.70744.46.
CLL subset #4 is the largest stereotyped subset in IGHV-mutated CLL (M-CLL). The clonotypic B cell receptor immunoglobulin (BcR IG) in subset #4, encoded by the IGHV4-34/IGKV3-20 gene pair, displays a long heavy chain complementarity-determining region 3 (VH CDR3) enriched in positively charged residues; ubiquitous expression of the gamma heavy chain isotype; and a distinctive imprint of somatic hypermutation (SHM), characterized by the frequent introduction of acidic residues and pronounced intraclonal diversification. These features are reminiscent of edited autoantibodies, implicating ongoing (auto)antigen interactions in the natural history of CLL subset #4.
- G. Gavriilidis, S.-C. Fragkouli, E. Theodosiou, V. Vasileiou, S. Keisaris, and F. Psomopoulos, “Cell-wise fluxomics of Chronic Lymphocytic Leukemia single-cell data reveal novel metabolic adaptations to Ibrutinib therapy,” presented at the 31st Conference on Intelligent Systems for Molecular Biology and the 22nd European Conference on Computational Biology (ISMB-ECCB23), Jul. 2023, doi: TBA.
- S.-C. Fragkouli, N. Pechlivanis, A. Agathangelidis, and F. Psomopoulos, “Synthetic Genomics Data Generation and Evaluation for the Use Case of Benchmarking Somatic Variant Calling Algorithms,” Jul. 2023, doi: 10.7490/f1000research.1119575.1.
Somatic variant calling algorithms are widely used to detect genomic alterations associated with cancer. However, evaluating the performance of these algorithms can be challenging due to the lack of high-quality ground truth datasets. To address this issue, we developed a synthetic genomics data generation and evaluation framework for benchmarking somatic variant calling algorithms. We generated synthetic datasets based on data from the TP53 gene, using the NEAT simulator. We then thoroughly evaluated the performance of GATK-Mutect2 on these datasets, and compared the results to the BAM files produced by NEAT that contain the true variants. Our results demonstrate that the synthetic datasets generated using our framework can accurately capture the complexity and diversity of real cancer genome data. Moreover, the synthetic datasets provide an excellent ground truth for evaluating the performance of somatic variant calling algorithms. Our framework provides a valuable resource for testing the performance of somatic variant callers, enabling researchers to evaluate and improve the accuracy of these algorithms for cancer genomics applications.
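The benchmarking step described above can be sketched as a set comparison between caller output and the simulator's ground truth (an illustrative reduction, not the actual framework code; variants are represented as (chrom, pos, ref, alt) tuples and VCF/BAM parsing is omitted):

```python
def benchmark(truth, called):
    """Return precision, recall and F1 for a set of called variants
    against the ground-truth variants embedded by the read simulator."""
    truth, called = set(truth), set(called)
    tp = len(truth & called)   # true positives: called and real
    fp = len(called - truth)   # false positives: called but not real
    fn = len(truth - called)   # false negatives: real but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: one shared variant, one missed, one spurious.
truth = [("chr17", 7675088, "C", "T"), ("chr17", 7674220, "G", "A")]
called = [("chr17", 7675088, "C", "T"), ("chr17", 7674945, "A", "G")]
print(benchmark(truth, called))  # (0.5, 0.5, 0.5)
```

In practice the comparison would also need to normalize variant representations (left-alignment, multi-allelic splitting) before matching.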
2022
- V. Vasileiou, G. Gavriilidis, A. Giannakakis, and F. Psomopoulos, “Network-based Bayesian inference revealed critical ncRNAs signal drivers from transcriptomics in CLL,” 2022, doi: 10.6084/M9.FIGSHARE.22304044.V1.
The purpose of this study is to create a novel workflow based on the NetBID tool (Network-based Bayesian inference of signal drivers), in order to unveil ncRNA-gene associations with biological merit in a multifactorial disease such as Chronic Lymphocytic Leukemia (CLL), a B cell lymphoma. Preliminary analysis of the BloodCancerMultiOmics2017 database reveals central ncRNA drivers in CLL patients, comparing samples that differ in drug response to Ibrutinib. These initial findings hopefully pave the way to discovering, through a systems approach, more ncRNAs that participate in CLL pathobiology and could be of therapeutic interest.
- S.-C. Fragkouli, A. Agathangelidis, and F. Psomopoulos, “Shedding Light on Somatic Variant Calling,” 2022, doi: 10.13140/RG.2.2.12701.18402.
This work aims at shedding light on somatic variant calling. Owing to the low frequency of somatic mutations, their detection has proven quite onerous. As a first step, an attempt was made to benchmark currently available variant callers and compare their performance. Our findings strongly point towards the need for best practices and gold-standard datasets in order to exploit the full potential of existing algorithms and address the issue at large.
- K. Kyritsis, N. Pechlivanis, V. Vasileiou, A. Magklara, A. Kougioumtzi, P. Dafopoulos, E. Ntzioni, E. Tsarouchi, D. Sakellariou, M. Kotoulas, S. Arampatzis, P. Chatzikamaris, E. Siomou, M. Argyraki, D. Botskaris, I. Talianidis, and F. Psomopoulos, “Poster 3657: GENOPTICS: An Intuitive Platform of Visual Analytics for Integrative Analysis of Large-scale Multi-omics Data,” Sep. 2022, doi: 10.7490/f1000research.1119204.1.
During the past two decades, computational analysis has become paramount for biological research. Advancements in high-throughput methods and computational tools have resulted in the generation of large amounts of data from different omics fields (multi-omics), such as genomics, epigenomics, transcriptomics, and metabolomics. This plethora of large-scale and diverse omics data is being driven by the understanding that a single omic type does not provide adequate information, and that integrative analysis of multi-omics data is optimal to gain sufficiently meaningful insights into the actual biological mechanisms. Although various open-source tools have been developed for this purpose, multi-omics data integration and analysis are still beset by a number of problems, including software compatibility, complex parameter selection and the creation of functional pipelines with multiple steps of analyses. In this work we present GenOptics, a novel visual analytics platform that aims to facilitate the integration and subsequent analysis of diverse multi-omics datasets as well as metadata (e.g., clinical data), through a fully interactive environment. The platform comprises two separate parts. The first incorporates asynchronous analyses of Next-Generation Sequencing raw data, including RNA-seq, Whole Exome-seq, and ChIP-seq, using workflows implemented in the Common Workflow Language (doi:10.6084/m9.figshare.3115156.v2) and Docker containers to automate software installation and confer cross-platform portability. The second part constitutes the analytical platform itself, designed to facilitate the execution of robust bioinformatics analyses by life scientists with minimal or no knowledge of programming. GenOptics is an open-source (https://genoptics.github.io/) computational biology platform for novel pattern and biomarker detection.
- K. Kyritsis, G.-N. Kartanos, V. Siarkou, and F. Psomopoulos, “Poster 9113: k-mer and GWAS Approaches to Identify Host-Specific Genomic Determinants in Klebsiella Pneumoniae,” Sep. 2022, doi: 10.7490/f1000research.1119205.1.
Klebsiella pneumoniae is an important Gram-negative opportunistic bacterial pathogen that causes a variety of community and healthcare-associated infections in humans and animals. The emergence and spread of multidrug-resistant K. pneumoniae strains is now recognized as an urgent threat to public health worldwide. However, the epidemiology of K. pneumoniae has not been extensively studied, and reservoirs of the organism have rarely been investigated. Hence, understanding and monitoring K. pneumoniae transmission across host species is of paramount importance. In this study, we aimed to identify host-associated genomic differences across various K. pneumoniae strains originating from humans and animals. Machine Learning (ML) classification models were trained for host prediction on pre-processed 9-mer count data from 706 publicly available whole genomes (European Nucleotide Archive, ENA). Model performance reached over 85% (accuracy and F1-score), showcasing the presence of host-associated differences. Pangenome and genome-wide association analyses (GWAS) were further employed for interpretation. Several acquired accessory genes, implicated in pathways such as carbohydrate metabolism, iron binding, and antibiotic resistance, were identified, in agreement with recent studies. Interestingly, we also detected novel host-associated acquired genes related to stress response. Our results support the application of ML algorithms and k-mers in parallel with a GWAS workflow for the epidemiological surveillance of human and zoonotic transmission throughout K. pneumoniae outbreaks. Furthermore, we validate recently reported K. pneumoniae accessory genome variations and present novel ones that could associate with host specificity and/or reflect the selective pressure exerted on commensal and pathogenic bacteria by the excessive use of antibiotics.
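The k-mer feature-extraction step described above can be sketched as follows (an illustrative, stdlib-only reduction, not the study's actual code; a tiny k and toy sequences are used for brevity, whereas the study used 9-mer counts over whole genomes, feeding the resulting matrix into ML classifiers):

```python
from collections import Counter

def kmer_counts(seq, k=9):
    """Count all overlapping k-mers in a genome sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def feature_matrix(genomes, k=9):
    """Build a genomes x k-mers count matrix over the union of observed k-mers."""
    counts = [kmer_counts(g, k) for g in genomes]
    vocab = sorted(set().union(*counts))          # shared k-mer vocabulary
    X = [[c[m] for m in vocab] for c in counts]   # one count row per genome
    return vocab, X

# Toy example with k=4; rows of X would be the classifier's input features.
vocab, X = feature_matrix(["ACGTACGTA", "ACGTTCGTA"], k=4)
```

Each row of `X` is a fixed-length numeric vector, so any standard classifier (e.g. a random forest) can be trained on it with host labels as targets.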
- G. Gavriilidis, S. Dimitsaki, V. Vasileiou, and F. Psomopoulos, “Biologically informed neural network identifies Unfolded Protein Response as a key pathway in critical COVID-19,” 2022, doi: 10.7490/F1000RESEARCH.1119199.1.
COVID-19 multi-omics have been thoroughly analyzed with machine learning; however, the inference of therapeutically actionable insights has been hampered by highly dimensional biological data and poorly interpretable Artificial Intelligence components (10.1136/bmjinnov-2020-000648). Prior biomedical knowledge can significantly enhance neural networks applied to multi-omics, since it constrains the model from exploring unnecessary hypothesis spaces (https://doi.org/10.1093/bib/bbab454). Considering the above, we designed COV-PASnet, which combines a pathway-associated sparse deep neural network (PASnet) with explainable Artificial Intelligence Shapley values (https://doi.org/10.1371/journal.pone.0231166). COV-PASnet was able to robustly demarcate critical from non-critical COVID-19 cases (AUC: 92%, F1-score: 69%) when applied to two large plasma proteomic datasets (training MGH: https://doi.org/10.1016/j.xcrm.2021.100287, testing DC: https://doi.org/10.1016/j.cell.2020.10.037). Strikingly, one of the most predictive pathways via pathway-layer node ranking was the Unfolded Protein Response (UPR), mainly due to highly abundant Death Receptor 5 (DR5/TNFRSF10) (https://doi.org/10.1016/j.molcel.2017.06.017). Since the literature is scant on UPR signaling in coronaviruses, we next analyzed scRNA-seq data from COVID-19-patient-derived peripheral-blood mononuclear cells and discovered UPR-prone plasmablasts through Enrichr (GO:0036500, padj.<0.001) (doi: 10.1002/cpz1.90) and KEA3 (IRE1 kinase: 1st in critical COVID-19, 108th in non-critical cases based on the MeanRank metric) (https://doi.org/10.1093/nar/gkab359). Overall, this work shows that our biologically informed neural network, COV-PASnet, identified UPR signaling in circulating plasmablasts as a previously unexplored tenet of critical COVID-19.
Considering how important antibody secretion by plasmablasts is for COVID-19 clearance (https://doi.org/10.1016/j.cell.2020.08.025) and that persistent UPR via DR5 promotes apoptosis (10.1126/science.1254312), we believe this pathway merits further pharmacological investigation in the search for novel COVID-19 therapeutics.
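The core idea of a pathway-associated sparse (biologically informed) layer, as used by PASnet-style models, can be sketched in a few lines of NumPy: a binary gene-to-pathway mask zeroes out connections without prior support, so each pathway node only sees its member genes. This is an illustrative sketch with made-up gene/pathway memberships and shapes, not COV-PASnet itself:

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_pathways = 6, 2

# Prior knowledge: which genes belong to which pathway (from a pathway database).
# mask[i, j] == 1 means gene i is a member of pathway j.
mask = np.array([[1, 0], [1, 0], [1, 0],
                 [0, 1], [0, 1], [1, 1]], dtype=float)
W = rng.normal(size=(n_genes, n_pathways))  # trainable weights

def pathway_layer(x):
    """Forward pass: masked linear map followed by ReLU.
    Masking W enforces the sparse, biologically informed connectivity."""
    return np.maximum(x @ (W * mask), 0.0)

x = rng.normal(size=(1, n_genes))  # one sample's gene-level inputs
h = pathway_layer(x)               # pathway-level activations, shape (1, n_pathways)
```

During training, gradients only flow through unmasked entries, which is also what makes the pathway-layer node ranking interpretable: each hidden node corresponds to a named pathway.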
- N. Pechlivanis, G. Karakatsoulis, S. Sgardelis, I. Kappas, and F. Psomopoulos, “Microbial co-occurrence network reveals climate and geographic patterns for soil diversity on the planet,” Nov. 2022.
Soil microbiota plays an integral role in shaping the overall biodiversity of our planet. Yet systematic investigations of how microbial communities are affected by geographic and climatic factors, especially with the ongoing climate change, are still limited. Previous studies (10.1038/ismej.2015.261) have explored the microbial diversity of specific geographic areas, or focused more on the underlying microbial interactions than on structural changes (10.1186/s40168-020-00857-2). Here, we explore the effects of key climatic factors and geographic patterns on soil diversity and microbial interactions across the globe. To this end, we have used data from the Earth Microbiome Project (EMP, 10.1038/nature24621), as it offers a massive collection of planetary-scale microbiome datasets. In conjunction with the Köppen-Geiger climate classification (10.1038/sdata.2018.214) and other bioclimatic variables from the WorldClim database (10.1002/joc.5086), this dataset assembly can be used to identify important variations in soil diversity. Initial comparisons between different climate classifications revealed statistically significant differences amongst them, highlighting the role that temperature and precipitation play in microbiota presence. In addition, a microbial co-occurrence network was built to capture soil microbial interactions using the SpiecEasi workflow (10.1371/journal.pcbi.1004226). The resulting network was used to identify hub species in each climatic environment and revealed significant topological shifts at the network level for different climatic regions. Our study of soil diversity in different climate environments contributes to a better understanding of how climatic factors affect its distribution on Earth.
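Hub identification on a co-occurrence network, as described above, amounts to ranking nodes by connectivity; a minimal sketch follows (with made-up taxa and edges, and plain degree standing in for the richer topological measures used in practice):

```python
from collections import defaultdict

# Taxa are nodes; each edge is a significant co-occurrence inferred from
# abundance data (e.g. by SpiecEasi). These example taxa/edges are illustrative.
edges = [("Bacillus", "Pseudomonas"), ("Bacillus", "Streptomyces"),
         ("Bacillus", "Rhizobium"), ("Pseudomonas", "Rhizobium")]

degree = defaultdict(int)
for a, b in edges:
    degree[a] += 1
    degree[b] += 1

# Hub taxa = the most connected nodes; here we keep the single top node.
hubs = sorted(degree, key=degree.get, reverse=True)[:1]
print(hubs)  # ['Bacillus']
```

Repeating this per climatic region, and comparing which taxa emerge as hubs, is one simple way to expose the topological shifts mentioned above.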
- N. Pechlivanis, A. Mitsigkolas, F. Psomopoulos, and E. Bosdriesz, “Assessing SARS-CoV-2 evolution through the analysis of emerging mutations,” Nov. 2022.
The number of studies on SARS-CoV-2 published on a daily basis is constantly increasing, in an attempt to better understand and address the challenges posed by the pandemic. Most of these studies also include a phylogeny of SARS-CoV-2 as background context, always taking into consideration the latest data in order to construct an updated tree. However, some of these studies have also revealed the difficulties of inferring a reliable phylogeny; it has been shown [13] that this is an inherently complex task due to the large number of highly similar sequences and the relatively low number of mutations evident in each sequence. From this viewpoint, there is both a challenge and an opportunity in identifying the evolutionary history of the SARS-CoV-2 virus: assisting the phylogenetic analysis process, supporting researchers in keeping track of the virus and the course of its characteristic mutations, and finding patterns among the emerging mutations themselves and the interactions between them. The research question is formulated as follows: can ML methods detect new patterns of co-occurring mutations in SARS-CoV-2 data, beyond the strain-specific/strain-defining ones? Going beyond traditional phylogenetic approaches, we design and implement a clustering method that creates a dendrogram of the involved sequences based on a feature space defined by the present mutations, rather than the entire sequence. Ultimately, this ML method is tested on sequences retrieved from public databases and validated using the available metadata as labels. The main goal of the project is to design, implement and evaluate software that automatically detects and clusters relevant mutations, which could potentially be used to identify trends in emerging variants.
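The mutation-based feature space described above can be sketched by reducing each sequence to its set of mutations and measuring pairwise Jaccard distances, which could then feed any standard hierarchical clustering routine to produce the dendrogram (an illustrative sketch with made-up mutation labels, not the project's actual implementation):

```python
def jaccard_distance(a, b):
    """1 - |A∩B| / |A∪B| for two mutation sets; 0.0 for two empty sets."""
    a, b = set(a), set(b)
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

# Each sequence is represented only by its mutations relative to the reference,
# not by its full genome. These mutation labels are illustrative.
s1 = {"S:N501Y", "S:D614G"}
s2 = {"S:D614G", "ORF1a:T1001I"}
d = jaccard_distance(s1, s2)  # 1 - 1/3 ≈ 0.667
```

A full pairwise distance matrix built this way is far cheaper than whole-genome alignment and can be passed directly to an agglomerative clustering routine (e.g. `scipy.cluster.hierarchy.linkage`) to obtain the dendrogram.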
- N. Pechlivanis, M. Tsagiopoulou, M.-C. Maniou, A. Togkousidis, E. Mouchtaropoulou, T. Chassalevris, S. C. Chaintoutis, M. Petala, M. Kostoglou, T. Karapantsios, S. Laidou, E. Vlachonikola, A. Chatzidimitriou, A. Papadopoulos, N. Papaioannou, C. I. Dovas, A. Argiriou, and F. Psomopoulos, “lineagespot: Detecting SARS-CoV-2 lineages and mutational load in municipal wastewater,” Jul. 2022, doi: 10.7490/f1000research.1119052.1.
Nearly two years after the first report of SARS-CoV-2 in Wuhan, China, the virus has caused an unprecedented global crisis. The COVID-19 pandemic has affected more than four million people, making new approaches for monitoring its spread a necessity. Most laboratories rely on PCR-based methods for epidemiological investigation and early mutation detection of the virus. Yet, these methods are not easily scalable, especially in large urban areas or in cases where new mutations arise. Recently, the detection of SARS-CoV-2 RNA in wastewater has emerged as a useful tool to study the prevalence of the virus in the community. To this end, a novel methodology, called lineagespot, has been developed for the monitoring of mutations and the detection of SARS-CoV-2 lineages in wastewater samples using next-generation sequencing (NGS). The tool accepts as input a VCF file, which contains all nucleotide (and corresponding amino acid) mutations identified in a sample, along with a file containing all lineage-assignment mutations. After analyzing all inputs, a tab-delimited (TSV) file is produced containing the identified mutations that are related to SARS-CoV-2 lineages. In addition, the tool computes average allele frequencies to quantify lineage abundance. lineagespot depends on the source used for retrieving lineage definitions; currently, the package supports two such sources, i.e. the lineage-characteristic mutation profiles pre-calculated by outbreak.info and those derived from the trained Pangolin models. So far, the tool has been evaluated on NGS data covering a six-month period for the municipality of Thessaloniki, Greece. The results revealed the presence of SARS-CoV-2 variants in the wastewater data and have recently been published in Scientific Reports.
It is worth noting that lineagespot was able to record the evolution and rapid domination of the Alpha variant (B.1.1.7) in the community, and identified a strong correlation between the mutations evident through our approach and the mutations observed in patients from the same area and time periods. lineagespot has been developed as an open-source tool, implemented in R, and is freely available through GitHub and the Bioconductor repository.
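The core matching step described above can be illustrated as follows. lineagespot itself is an R/Bioconductor package, so this Python sketch, its function name and its mutation names are hypothetical and simplified:

```python
# Illustrative (hypothetical) sketch of matching sample mutations against
# lineage definitions and averaging allele frequencies, as lineagespot does.
def lineage_abundance(sample_variants, lineage_defs):
    """sample_variants: {mutation: allele frequency} parsed from a wastewater VCF.
    lineage_defs: {lineage: set of lineage-characteristic mutations}.
    Returns a crude abundance proxy per lineage: the mean allele frequency
    over the characteristic mutations actually observed in the sample."""
    abundances = {}
    for lineage, muts in lineage_defs.items():
        found = [sample_variants[m] for m in muts if m in sample_variants]
        abundances[lineage] = sum(found) / len(found) if found else 0.0
    return abundances

defs = {"B.1.1.7": {"N501Y", "P681H", "del69-70"},
        "B.1.351": {"N501Y", "E484K", "K417N"}}
sample = {"N501Y": 0.8, "P681H": 0.7, "E484K": 0.1}
print(lineage_abundance(sample, defs))
# B.1.1.7 -> (0.8 + 0.7) / 2 = 0.75; B.1.351 -> (0.8 + 0.1) / 2 = 0.45
```

Averaging over observed allele frequencies is what makes mixed wastewater samples tractable: a lineage's defining mutations tend to rise and fall together as its share of the community infections changes.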
2021
- N. Pechlivanis, A. Togkousidis, M. C. Maniou, M. Tsagiopoulou, and F. Psomopoulos, “Developing a novel feature space for sequence data analysis; a use-case on SARS-CoV-2 data,” 2021, doi: 10.5281/ZENODO.4897477.
We create a novel feature space based on k-mers retrieved from unaligned sequence data. The purpose of these new features is to facilitate the effective application of machine learning algorithms in various scenarios. The method examines all values of k within a user-defined range: starting from lower k-values, it assigns scores to k-mers, keeps those with the highest scores, and proceeds to higher k-values (pruning trees).
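The pruning scheme can be sketched as follows; scoring k-mers by raw frequency and the `keep` cut-off are illustrative assumptions, not necessarily the method's actual scoring function:

```python
# Hypothetical sketch of a pruned k-mer feature search: score k-mers at each
# k, keep the top scorers, and only extend those survivors to length k+1.
from collections import Counter

def top_kmers(seqs, k_min, k_max, keep=10):
    """Return the highest-scoring k-mers for each k in [k_min, k_max],
    scored here by frequency across the (unaligned) input sequences."""
    def count(seqs, k, prefixes):
        c = Counter()
        for s in seqs:
            for i in range(len(s) - k + 1):
                kmer = s[i:i + k]
                # Only count k-mers extending a surviving (k-1)-mer.
                if prefixes is None or kmer[:-1] in prefixes:
                    c[kmer] += 1
        return c

    survivors = None
    features = {}
    for k in range(k_min, k_max + 1):
        best = dict(count(seqs, k, survivors).most_common(keep))
        features.update(best)
        survivors = set(best)  # prune: the rest of the tree is never explored
    return features
```

Because each level only extends the surviving k-mers, the search explores a narrow branch of the full k-mer tree rather than all 4^k candidates, which is what keeps the feature space tractable for downstream ML.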
- D. S. Katz, M. Barker, L. J. Garcia Castro, N. P. Chue Hong, M. Gruenpeter, J. L. Harrow, C. Martinez Ortiz, P. A. Martinez, and F. Psomopoulos, “FAIR Research Software and Science Gateways,” May 2021, doi: 10.5281/zenodo.4923124.
In recent years, the scholarly community has examined its culture and practices, and found a set of overlapping areas in which to improve, including open science (making both the outputs and processes of scholarly research available), reproducibility (increasing trust in scholarly results by making them repeatable by others), and FAIR (making scholarly outputs, specifically data, findable, accessible, interoperable, and reusable). While the scholarly community is generally supportive of all of these efforts, the degree of support wanes with both the amount of extra work that is needed and the lack of clear details on how to achieve them, along with misaligned incentives. In this lightning talk, we will initially focus on FAIR and the details of how it can be applied to research software. This leads to a number of distinct challenges, including scope (defining both software and research software), principles (defining what findable, accessible, interoperable, and reusable mean for research software), implementation (developing guidelines and instructions for how to make research software FAIR), and metrics (providing a means to measure the FAIRness of research software). Science gateways include a number of different types of software, for example, the frameworks used to construct the gateways themselves, tools provided by the community that run in the gateway, and software implemented as services with which the gateways interact. The second part of this lightning talk will discuss how FAIR principles for research software apply to each of these types of software common in science gateways. We will close by explaining how members of the science gateway community can become more involved in the FAIR for research software process, to learn, to contribute, or to champion.
2020
- D. S. Katz, M. Barker, N. Chue Hong, L. J. Garcia-Castro, M. Gruenpeter, J. Harrow, M. Kuzak, P. Martinez Villegas, and F. E. Psomopoulos, “Toward defining and implementing FAIR for research software,” in AGU Fall Meeting Abstracts, Dec. 2020, vol. 2020, pp. IN037–01, doi: 10.5281/zenodo.4085311.
As of July 2020, a new FAIR For Research Software Working Group (FAIR4RS WG) is being jointly convened as a Research Data Alliance (RDA) Working Group, FORCE11 Working Group, and Research Software Alliance (ReSA) Taskforce, in recognition of the importance of this work for the advancement of the research sector. FAIR4RS WG will enable coordination of a range of existing community-led discussions on how to define and effectively apply FAIR principles to research software, to achieve adoption of these principles. The working group will deliver: 1) a document developed with community support defining FAIR principles for research software; 2) a document providing guidelines on how to apply the FAIR principles to research software (based on existing frameworks); and 3) a document summarising the definition of the FAIR principles for research software, implementation guidelines and adoption examples. Four initial subgroups are now: 1) defining research software, 2) taking a fresh look at the FAIR principles in the context of research software, 3) examining recent work in this area, and 4) looking at how FAIR is being applied to other types of objects. This talk provides an update on the results of these four subgroups, in the context of the entire working group’s activities and plans.
- L. J. Garcia Castro, M. Barker, N. P. Chue Hong, F. Psomopoulos, J. Harrow, D. S. Katz, M. Kuzak, P. A. Martinez, and A. Via, “Software as a first-class citizen in research,” Nov. 2020, doi: 10.4126/FRL01-006423290.
In recent years the importance of software in research has become increasingly recognized by the research community, but this journey still has a long way to go. Research data is currently backed by a variety of efforts to implement the FAIR principles and make them a reality, complemented by Data Management Plans. Both the FAIR data principles and management plans offer elements that could be useful for research software, but neither can be directly applied; in both cases there is a need for adaptation and then adoption. In this position paper we discuss current efforts around FAIR for research software that will also support the advancement of Software Management Plans (SMPs). In turn, the use of SMPs encourages researchers to make their datasets FAIR.
2019
- M. T. Kotouza, F. E. Psomopoulos, and P. A. Mitkas, “A Dockerized String Analysis Workflow for Big Data,” in 23rd European Conference on Advances in Databases and Information Systems, ADBIS 2019, Bled, Slovenia, September 8-11, 2019, 2019, pp. 564–569, doi: 10.1007/978-3-030-30278-8_55.
Nowadays, a wide range of sciences are moving towards the Big Data era, producing large volumes of data that require processing for new knowledge extraction. Scientific workflows are often the key tools for solving problems characterized by computational complexity and data diversity, whereas cloud computing can effectively facilitate their efficient execution. In this paper, we present a generative big data analysis workflow that can provide analytics, clustering, prediction and visualization services to datasets coming from various scientific fields, by transforming input data into strings. The workflow consists of novel algorithms for data processing and relationship discovery, that are scalable and suitable for cloud infrastructures. Domain experts can interact with the workflow components, set their parameters, run personalized pipelines and have support for decision-making processes. As case studies in this paper, two datasets consisting of (i) Documents and (ii) Gene sequence data are used, showing promising results in terms of efficiency and performance.
- A. Vardi, E. Vlachonikola, S. Mourati, F. Psomopoulos, N. Pantouloufos, A. Kouvatsi, N. Stavroyianni, A. Anagnostopoulos, K. Stamatopoulos, and A. Hadzidimitriou, “PS1131 High-Throughput B-Cell immunoprofiling at diagnosis and relapse offers further evidence of functional selection throughout the natural history of chronic lymphocytic leukemia,” in HemaSphere, 2019, vol. 3, no. S1, p. 512, doi: 10.1097/01.HS9.0000562808.48237.52.
Chronic lymphocytic leukemia (CLL) is divided into two broad prognostic categories, namely mutated (M) and unmutated (U) CLL, according to the somatic hypermutation (SHM) status of the clonotypic heavy chain immunoglobulin (IGHV) gene. This is perceived to remain stable over time, as evidenced by low-throughput studies, which, however, precluded investigation of subclonal architecture and evolution over time. Aims: Here, we sought to comprehensively assess the B cell receptor (BcR) IG gene repertoire at CLL diagnosis and 1st relapse after chemoimmunotherapy (FCR) by next-generation sequencing (NGS).
- K. Gemenetzi, A. Agathangelidis, F. Psomopoulos, K. Pasentsis, E. Koravou, M. Iskas, N. Stavroyianni, A. Anagnostopoulos, R. Sandaltzopoulos, K. Stamatopoulos, and A. Chatzidimitriou, “VH CDR3-Focused Somatic Hypermutation in CLL IGHV-IGHD-IGHJ Gene Rearrangements with 100% IGHV Germline Identity,” in Blood, Nov. 2019, vol. 134, no. Supplement_1, pp. 4277–4277, doi: 10.1182/blood-2019-127979.
Classification of patients with chronic lymphocytic leukemia (CLL) based on the immunoglobulin heavy variable (IGHV) gene somatic hypermutation (SHM) status has established predictive and prognostic relevance. The SHM status is assessed based on the number of mutations within the sequence of the rearranged IGHV gene, excluding the VH CDR3. This is mostly due to the difficulty in discriminating actual SHM from random nucleotides added between the recombined IGHV, IGHD and IGHJ genes. Hence, this approach may underestimate the true impact of SHM, in fact overlooking the most critical region for antigen-antibody interactions, i.e. the VH CDR3. Relevant to mention in this respect, studies from our group in CLL with mutated IGHV genes (M-CLL), particularly subset #4, have revealed considerable intra-VH CDR3 diversity attributed to ongoing SHM. Prompted by these findings, here we investigated whether SHM may also be present in cases bearing 'truly unmutated' IGHV genes (i.e. 100% germline identity across VH FR1-VH FR3), focusing on two well characterized stereotyped subsets, i.e. subset #1 (IGHV clan I/IGHD6-19/IGHJ4) and subset #6 (IGHV1-69/IGHD3-16/IGHJ3). These subsets carry germline-encoded amino acid (aa) motifs within the VH CDR3, namely QWL and YDYVWGSY, originating from the IGHD6-19 and IGHD3-16 gene, respectively. However, in both subsets, cases exist with variations in these motifs that could potentially represent SHM. The present study included 12 subset #1 and 5 subset #6 patients with clonotypic IGHV genes lacking any SHM (100% germline identity). IGHV-IGHD-IGHJ gene rearrangements were RT-PCR amplified with subgroup-specific leader primers and a high-fidelity polymerase in order to ensure high data quality. RT-PCR products were subjected to paired-end NGS on the MiSeq platform. Sequence annotation was performed with IMGT/HighV-QUEST and metadata analysis was undertaken using an in-house purpose-built bioinformatics pipeline.
Rearrangements with the same IGHV gene and identical VH CDR3 aa sequences were defined as clonotypes. Overall, we obtained 1,570,668 productive reads with V-region identity 99-100%; of these, 1,232,958 (mean: 102,746, range: 20,796-242,519) concerned subset #1 while 337,710 (mean: 67,542, range: 50,403-79,683) concerned subset #6. On average, 64.4% (range: 1.7-77.5%) of subset #1 reads and 49.2% (range: 0.7-70%) of subset #6 reads corresponded to rearrangements with IGHV genes lacking any SHM (100% germline identity). Clonotype computation revealed 1,831 and 1,048 unique clonotypes for subsets #1 and #6, respectively. Subset #1 displayed a mean of 157 distinct clonotypes per sample (range: 74-267), with the dominant clonotype having a mean frequency of 96.9% (range: 96-98.2%). Of note, 44 clonotypes were shared between different patients (albeit at varying frequencies), including the dominant clonotype of 11/12 cases, which was present in 2-6 additional subset #1 patients. Subset #6 cases carried a higher number of distinct clonotypes per sample (mean: 219, range: 189-243), while the dominant clonotype had a mean frequency of 95.6% (range: 94.5-96.5%). Shared clonotypes (n=30) were also identified in subset #6, and the dominant clonotype of each subset #6 case was present in 3-5 additional subset #6 patients. Focusing on the VH CDR3, in particular the IGHD-encoded part, the following observations were made: (1) in both subsets, extensive intra-VH CDR3 variation was detected at certain positions within the IGHD gene; (2) in most cases, the observed aa substitutions were conservative, i.e. concerned aa sharing similar physicochemical properties.
Particularly noteworthy in this respect were the observations in subset #6 that: (i) the valine residue (V) in the D-derived YDYVWGSY motif was very frequently mutated to another aliphatic residue (A, I, L); (ii) in cases where the predominant clonotype carried I (also in the Sanger-derived sequence), several minor clonotypes carried the germline-encoded V, compelling evidence that the observed substitution concerned true SHM. In conclusion, we provide immunogenetic evidence for intra-VH CDR3 variations, very likely attributed to SHM, in CLL patients carrying 'truly unmutated' IGHV genes. While the prognostic/predictive relevance of this observation is beyond the scope of the present work, our findings highlight the possible need to reappraise definitions ('semantics') regarding SHM status in CLL. Stamatopoulos: Janssen: Honoraria, Research Funding; Abbvie: Honoraria, Research Funding. Chatzidimitriou: Janssen: Honoraria.
- M. Gerousi, F. Psomopoulos, K. Kotta, N. Stavroyianni, A. Anagnostopoulos, I. Kotsianidis, S. Ntoufa, and K. Stamatopoulos, “Functional Calcitriol/Vitamin D Receptor Signaling in Chronic Lymphocytic Leukemia,” in Blood, Nov. 2019, vol. 134, no. Supplement_1, pp. 3019–3019, doi: 10.1182/blood-2019-127910.
Calcitriol, the biologically active form of vitamin D, modulates a plethora of cellular processes following ligation of its receptor, the vitamin D receptor (VDR), a nuclear transcription factor that regulates the transcription of diverse genes. It has been proposed that vitamin D may play a role in the prevention and treatment of cancer, while epidemiological studies have linked vitamin D insufficiency to adverse disease outcome in chronic lymphocytic leukemia (CLL). Recently, we reported that VDR is functional in CLL cells after calcitriol supplementation, as well as after stimulation through both the calcitriol/VDR signaling system and other prosurvival pathways triggered by the tumor microenvironment. In this study, we aimed at investigating key molecules and signaling pathways that are altered after calcitriol treatment and are known to play a relevant role in CLL pathophysiology. CD19+ primary CLL cells were negatively selected from peripheral blood samples of patients who were treatment-naïve at the time of sample collection. CLL cells were cultured in vitro with calcitriol or co-cultured with the HS-5 mesenchymal cell line for 24 hours. VDR+, CYP24A1+, phospho-ERK+ and phospho-NF-κB p65+ cells were determined by Flow Cytometry (FC). Total RNA was extracted from calcitriol-treated and non-treated CLL cells, while mRNA selection was performed using the NEBNext Poly(A) mRNA Magnetic Isolation Module. Library preparation for RNA-Sequencing (RNA-Seq) analysis was conducted with the NEBNext Ultra II Directional RNA Library Prep Kit. The libraries were paired-end sequenced on the NextSeq 500 Illumina platform. Differential expression analysis was performed using DESeq2; genes with |log2FC| > 1 and P ≤ 0.05 were considered differentially expressed. RNA-Seq analysis (n=6) confirmed our previous findings that the CYP24A1 gene is significantly upregulated by calcitriol, being the top upregulated gene, whereas the VDR gene remains unaffected by this treatment.
Overall, 85 genes were differentially expressed in unstimulated versus calcitriol-treated cells, of which 28 were overexpressed in the latter, contrasting with the remaining 57 which showed the opposite pattern. Pathway enrichment and gene ontology (GO) analysis of the differentially expressed genes revealed significant enrichment in the PI3K-Akt pathway and Toll-like receptor cascades, as well as in vitamin D metabolism and inflammatory response pathways. Additionally, flow cytometric analysis showed that calcitriol-treated CLL cells displayed increased pERK levels (FD=1.3, p<0.05) and, in contrast, decreased pNF-κB levels (FD=2.7, p<0.05), highlighting active VDR signaling in CLL. Aiming at placing our findings in a more physiological context, we co-cultured CLL cells with the HS-5 cell line. Based on our previous finding that co-cultured CLL cells showed induced CYP24A1 levels, we evaluated pNF-κB expression. pNF-κB levels were found to be increased in co-cultured CLL cells (FD=4.2, p<0.05), while the addition of calcitriol downregulated pNF-κB (FD=1.5, p<0.05). Moreover, ex vivo calcitriol exposure of CLL cells from patients under ibrutinib treatment (at baseline, +1 and +3-6 months, n=7) resulted in significant upregulation of pERK (FD=1.6, p<0.01; FD=1.4, p<0.01; FD=1.9, p<0.01; for each timepoint respectively) but significant downregulation of pNF-κΒ (FD=3.4, p<0.01; FD=3, p<0.05; FD=2.3, p<0.05; for each timepoint respectively), indicating preserved calcitriol/VDR signaling capacity. In conclusion, we provide evidence that the calcitriol/VDR system is active in CLL, modulating NF-κB and MAPK signaling as well as the expression of the CYP24A1 target gene. This observation is further supported by RNA-Seq analysis that confirms CYP24A1 upregulation and highlights new signaling pathways that need to be validated.
Interestingly, the calcitriol/VDR system appears relatively unaffected by either stimulation or inhibition (ibrutinib) of microenvironmental signals that promote CLL cell survival and/or proliferation, indicating context-independent signaling capacity. Kotsianidis: Celgene: Research Funding. Stamatopoulos: Janssen: Honoraria, Research Funding; Abbvie: Honoraria, Research Funding.
- K. Gemenetzi, A. Agathangelidis, F. Psomopoulos, K. Plevova, L.-A. Sutton, K. Pasentsis, A. Anagnostopoulos, R. Sandaltzopoulos, R. Rosenquist, F. Davi, S. Pospisilova, K. Stamatopoulos, and A. Chatzidimitriou, “Higher Order Restrictions of the Immunoglobulin Repertoire in CLL: The Illustrative Case of Stereotyped Subsets #2 and #169,” in Blood, Nov. 2019, vol. 134, no. Supplement_1, pp. 5453–5453, doi: 10.1182/blood-2019-128017.
Stereotyped subset #2 (IGHV3-21/IGLV3-21) is the largest subset in CLL (~3% of all patients). Membership in subset #2 is clinically relevant, since these patients experience aggressive disease irrespective of the somatic hypermutation (SHM) status of the clonotypic immunoglobulin heavy variable (IGHV) gene. Low-throughput evidence suggests that stereotyped subset #169, a minor CLL subset (~0.2% of all CLL), resembles subset #2 at the immunogenetic level. More specifically: (i) the clonotypic heavy chain (HC) of subset #169 is encoded by the IGHV3-48 gene, which is closely related to the IGHV3-21 gene; (ii) both subsets carry VH CDR3s comprising 9 amino acids (aa) with a conserved aspartic acid (D) at VH CDR3 position 3; (iii) both subsets bear light chains (LC) encoded by the IGLV3-21 gene with a restricted VL CDR3; and, (iv) both subsets have borderline SHM status. Here we comprehensively assessed the ontogenetic relationship between CLL subsets #2 and #169 by analyzing their immunogenetic signatures. Utilizing next-generation sequencing (NGS), we studied the HC and LC gene rearrangements of 6 subset #169 patients and 20 subset #2 cases. In brief, IGHV-IGHD-IGHJ and IGLV-IGLJ gene rearrangements were RT-PCR amplified using subgroup-specific leader primers as well as IGHJ and IGLC primers, respectively. Libraries were sequenced on the MiSeq Illumina instrument. IG sequence annotation was performed with IMGT/HighV-QUEST and metadata analysis was conducted using an in-house, validated bioinformatics pipeline. Rearrangements with identical CDR3 aa sequences were herein defined as clonotypes, whereas clonotypes with different aa substitutions within the V-domain were defined as subclones. For the HC analysis of subset #169, we obtained 894,849 productive sequences (mean: 127,836, range: 87,509-208,019).
On average, each analyzed sample carried 54 clonotypes (range: 44-68); the dominant clonotype had a mean frequency of 99.1% (range: 98.8-99.2%) and displayed considerable intraclonal heterogeneity, with a mean of 2,641 subclones/sample (range: 1,566-6,533). For the LCs of subset #169, we obtained 2,096,728 productive sequences (mean: 299,533, range: 186,637-389,258). LCs carried a higher number of distinct clonotypes/sample compared to their partner HCs (mean: 148, range: 110-205); the dominant clonotype had a mean frequency of 98.1% (range: 97.2-98.6%). Intraclonal heterogeneity was also observed in the LCs, with a mean of 6,325 subclones/sample (range: 4,651-11,444), hence more pronounced than in their partner HCs. Viewing each of the cumulative VH and VL CDR3 sequence datasets as a single entity branching through diversification enabled the identification of common sequences. In particular, 2 VH clonotypes were present in 3/6 cases, while a single VL clonotype was present in all 6 cases, albeit at varying frequencies; interestingly, this VL CDR3 sequence was also detected in all subset #2 cases, underscoring the molecular similarities between the two subsets.
Focusing on SHM, the following observations were made: (i) the frequent 3-nucleotide (AGT) deletion evidenced in the VH CDR2 of subset #2 (leading to the deletion of one of 5 consecutive serine residues) was also detected in all subset #169 cases at the subclonal level (average: 6% per sample, range: 0.1-10.8%); of note, the 5-serine stretch is also present in the germline VH CDR2 of the IGHV3-48 gene; (ii) the R-to-G substitution at the VL-CL linker, a ubiquitous SHM in subset #2 previously reported as critical for IG self-association leading to cell-autonomous signaling in this subset, was present in all subset #169 samples as a clonal event with a mean frequency of 98.3%; and, finally, (iii) the S-to-G substitution at position 6 of the VL CDR3, present in all subset #2 cases (mean: 44.2%, range: 6.3-87%), was also found in all subset #169 samples, representing a clonal event in 1 case (97.2% of all clonotypes) and a subclonal event in the remaining 5 cases (mean: 0.6%, range: 0.4-1.1%). In conclusion, the present high-throughput sequencing data cement the immunogenetic relatedness of CLL stereotyped subsets #2 and #169, further highlighting the role of antigen selection throughout their natural history. These findings also argue for a similar pathophysiology for these subsets that could be reflected in similar clonal behavior, with implications for risk stratification. Sutton: Abbvie: Honoraria; Gilead: Honoraria; Janssen: Honoraria. Stamatopoulos: Abbvie: Honoraria, Research Funding; Janssen: Honoraria, Research Funding. Chatzidimitriou: Janssen: Honoraria.
- M. Tsagiopoulou, V. Chapaprieta, N. Russiñol, F. Psomopoulos, N. Papakonstantinou, N. Stavroyianni, A. Anagnostopoulos, P. Kollia, E. Campo, K. Stamatopoulos, and J. I. Martin-Subero, “Genome-Wide Histone Acetylation Profiling in Chronic Lymphocytic Leukemia Reveals a Distinctive Signature in Stereotyped Subset #8,” in Blood, Nov. 2019, vol. 134, no. Supplement_1, pp. 1241–1241, doi: 10.1182/blood-2019-127817.
In CLL, subsets of patients carrying stereotyped B cell receptors (BcR) share similar biological and clinical features independently of IGHV gene somatic hypermutation status. Although the chromatin landscape of CLL as a whole has been recently characterized, it remains largely unexplored in stereotyped cases. Here, we analyzed the active chromatin regulatory landscape of 3 major CLL stereotyped subsets associated with clinical aggressiveness. We performed chromatin immunoprecipitation followed by sequencing (ChIP-Seq) with an antibody for the H3K27ac histone mark in sorted CLL cells from 19 cases, including clinically aggressive subsets #1 [clan I genes/IGKV(D)1-39, IG-unmutated CLL (U-CLL) (n=3)], #2 [IGHV3-21/IGLV3-21, IG-mutated CLL (M-CLL) (n=3)] and #8 [IGHV4-39/IGKV1(D)-39, U-CLL (n=3)], which we compared to non-stereotyped CLL cases [5 M-CLL, 5 U-CLL]. In addition, a series of 15 normal B cell samples from different stages of B-cell differentiation were analyzed [naive B cells from peripheral blood (n=3), tonsillar naive B cells (n=3), germinal centre (GC) B cells (n=3), memory B cells (n=3), tonsillar plasma cells (n=3)]. Initial unsupervised principal component analysis (PCA) disclosed a distinct chromatin acetylation pattern in CLL, regardless of stereotypy status, versus normal B cells. CLL as a whole was found to be closer to naive and memory B cells than to GC B cells and plasma cells. Detailed analysis of individual principal components (PC) revealed that PC4, which accounts for 5% of the total variability, segregated subset #8 cases and GC B cells from other CLLs and normal B cell subpopulations.
Although PC4 accounts for only a small part of the total variability (5%), this suggests that subset #8 cases may share some chromatin features with proliferating GC B cells, in line with the fact that subset #8 BcR are IgG-switched. We also investigated whether stereotyped CLLs have different chromatin acetylation features compared to non-stereotyped CLLs matched by IGHV somatic hypermutation status, and identified 878 Differential Regions (DR) in subset #8 vs. U-CLL, 84 DR in subset #1 vs. U-CLL and 66 DR in subset #2 vs. M-CLL. As subset #8 cases seemed to have the most distinct profile, we further characterized the detected regions. The 435 and 443 regions gaining and losing activation, respectively, mostly targeted promoters (29.5%) and regulatory elements located in introns (31%) and distal intergenic regions (21.8%). Hierarchical clustering based on the 878 DRs enabled the clear discrimination of subset #8 cases from U-CLL and normal B cells; however, it is worth noting that for several of these 878 DRs the acetylation patterns were shared between subset #8 and normal B cell subpopulations rather than between subset #8 and U-CLL. Of note, 11/435 regions gaining activity in subset #8 were found within the gene encoding the EBF1 transcription factor (TF); additional regions were associated with genes significant to CLL pathogenesis, e.g. TCF4 and E2F1. Moreover, 3 DRs losing activity in subset #8 were located within the CTLA4 gene and 2 DRs within the IL21R gene, which we have recently reported as hypermethylated and not expressed in subset #8. Next, we performed TF binding site analysis with the MEME/AME suite, separately for regions gaining or losing activity, and identified significant enrichment (adj-p<0.001) for TFs such as AP-1, FOX, GATA and IRF. The regions losing activity in subset #8 showed a higher number of enriched TFs versus those gaining activity (165 vs 93 TFs), particularly displaying enrichment for many HOX family members.
However, a cluster of TFs with enrichment on TF binding site analysis, such as FOXO1, FOXP1, MEF2D, PRDM1, RUNX1, RXRA and STAT6, were also located within the 878 DRs discriminating subset #8 from either U-CLL or normal B cell subpopulations. Taken together, subset #8 cases have a distinct chromatin acetylation signature which includes both loss and gain of active elements, shared features with proliferating GC B cells, and specific changes in chromatin activity of several genes and TFs relevant to B cell/CLL biology. These findings further underscore the concept that BcR stereotypy defines subsets of patients with a consistent biological profile, while they may also be relevant to the particular clinical behavior of subset #8, known to be associated with the highest risk of Richter’s transformation amongst all CLL. Stamatopoulos: Abbvie: Honoraria, Research Funding; Janssen: Honoraria, Research Funding.
Other (Posters/Slides/Datasets)
2025
- S.-C. Fragkouli, A. Agathangelidis, and F. E. Psomopoulos, “25 Synthetic TP53 Genomic Datasets for Benchmarking and Method Development.” Oct. 2025, doi: 10.5281/zenodo.16524193.
This collection contains 25 synthetic genomics datasets generated using NEAT v3, simulating the TP53 gene of Homo sapiens. These datasets are intended for benchmarking somatic variant calling algorithms, especially in tumor-only settings. Each dataset is composed of paired-end reads and was designed to systematically explore the impact of two intrinsic NGS parameters on variant detection performance: sequencing coverage (300×, 700×, 1000×, 3000× and 5000×) and read length (50 bp, 75 bp, 100 bp, 150 bp and 300 bp).
- S.-C. Fragkouli, S. Iqbal, L. Crossman, B. Gravel, N. Masued, M. Onders, D. Haseja, A. Stikkelman, A. Valencia, T. Lenaerts, F. Psomopoulos, P. Ó. Broin, N. Queralt-Rosinach, and D. Cirillo, “An ELIXIR scoping review on domain-specific evaluation metrics for synthetic data in life sciences.” 2025, [Online]. Available at: https://arxiv.org/abs/2506.14508.
Synthetic data has emerged as a powerful resource in life sciences, offering solutions for data scarcity, privacy protection and accessibility constraints. By creating artificial datasets that mirror the characteristics of real data, it allows researchers to develop and validate computational methods in controlled environments. Despite its promise, the adoption of synthetic data in Life Sciences hinges on rigorous evaluation metrics designed to assess its fidelity and reliability. To explore the current landscape of synthetic data evaluation metrics across several Life Sciences domains, the ELIXIR Machine Learning Focus Group performed a systematic review of the scientific literature following the PRISMA guidelines. Six critical domains were examined to identify current practices for assessing synthetic data. Findings reveal that, while generation methods are rapidly evolving, systematic evaluation is often overlooked, limiting researchers' ability to compare, validate, and trust synthetic datasets across different domains. This systematic review underscores the urgent need for robust, standardized evaluation approaches that not only bolster confidence in synthetic data but also guide its effective and responsible implementation. By laying the groundwork for establishing domain-specific yet interoperable standards, this scoping review paves the way for future initiatives aimed at enhancing the role of synthetic data in scientific discovery, clinical practice and beyond.
- G. Farrell et al., “Open and Sustainable AI: challenges, opportunities and the road ahead in the life sciences.” 2025, [Online]. Available at: https://arxiv.org/abs/2505.16619.
Artificial intelligence (AI) has recently seen transformative breakthroughs in the life sciences, expanding possibilities for researchers to interpret biological information at an unprecedented capacity, with novel applications and advances being made almost daily. In order to maximise return on the growing investments in AI-based life science research and accelerate this progress, it has become urgent to address the exacerbation of long-standing research challenges arising from the rapid adoption of AI methods. We review the increased erosion of trust in AI research outputs, driven by the issues of poor reusability and reproducibility, and highlight their consequent impact on environmental sustainability. Furthermore, we discuss the fragmented components of the AI ecosystem and lack of guiding pathways to best support Open and Sustainable AI (OSAI) model development. In response, this perspective introduces a practical set of OSAI recommendations directly mapped to over 300 components of the AI ecosystem. Our work connects researchers with relevant AI resources, facilitating the implementation of sustainable, reusable and transparent AI. Built upon life science community consensus and aligned to existing efforts, the outputs of this perspective are designed to aid the future development of policy and structured pathways for guiding AI implementation.
2024
- O. A. Attafi et al., “DOME Registry: Implementing community-wide recommendations for reporting supervised machine learning in biology.” 2024, [Online]. Available at: https://arxiv.org/abs/2408.07721.
Supervised machine learning (ML) is used extensively in biology and deserves closer scrutiny. The DOME recommendations aim to enhance the validation and reproducibility of ML research by establishing standards for key aspects such as data handling and processing, optimization, evaluation, and model interpretability. The recommendations help to ensure that key details are reported transparently by providing a structured set of questions. Here, we introduce the DOME Registry (URL: this http URL), a database that allows scientists to manage and access comprehensive DOME-related information on published ML studies. The registry uses external resources like ORCID, APICURON and the Data Stewardship Wizard to streamline the annotation process and ensure comprehensive documentation. By assigning unique identifiers and DOME scores to publications, the registry fosters a standardized evaluation of ML methods. Future plans include continuing to grow the registry through community curation, improving the DOME score definition and encouraging publishers to adopt DOME standards, promoting transparency and reproducibility of ML in the life sciences.
- F. Psomopoulos, E. Capriotti, N. Rosinach, D. Cirillo, L. Castro, S. Tosatto, and the ELIXIR ML Focus Group members, “The impact of the ELIXIR community in Machine Learning.” ELIXIR All Hands Meeting, Jun. 2024, doi: 10.7490/f1000research.1119794.1.
With the continuous generation, processing, and transformation of biological data, the application of Machine Learning (ML) has become one of the most effective approaches for extracting insights from this data and supporting decision-making processes. The ELIXIR Machine Learning Focus Group launched in October 2019, specifically to capture the emerging needs in ML expertise across the community, and has been consistently producing high-impact outputs over the past 5 years. Its overarching goals include all aspects of ML, from standards and reproducibility, to benchmarking and training. One key output of the FG is the publication of the DOME recommendations (Nat Methods, July 2021), a set of community-wide recommendations for reporting supervised ML-based analyses of biological studies. Broad adoption of these recommendations helps improve ML assessment and reproducibility. They have been effectively used in a more ambitious Strategic Implementation Study, leading to a coherent registry around DOME and a community-led publication annotation effort. Another key activity of the FG aims to tackle the challenges of synthetic data, creating frameworks and best practices for the generation, evaluation, and application of synthetic data. Achievements include a synthetic data catalogue, a FAIR metadata model, a data registry, an evaluation metrics review, and a community survey. Finally, as AI-ready data are a necessity in ML, a comprehensive effort around gold standard datasets is in place, collecting both paper candidates with datasets on selected domains (e.g. metabolomics/omics) and datasets that can be applied to ML in the Life Sciences, specifically for supervised learning, around omics.
- S.-C. Fragkouli, N. Pechlivanis, A. Anastasiadou, G. Karakatsoulis, A. Orfanou, P. Kollia, A. Agathangelidis, and F. E. Psomopoulos, “Exploring Somatic Variant Callers’ Behavior: A Synthetic Genomics Feature Space Approach.” ELIXIR All Hands Meeting, Jun. 2024, doi: 10.7490/f1000research.1119793.1.
The identification of somatic variants using algorithms is crucial for detecting genomic alterations associated with cancer. However, assessing their performance faces challenges due to the limited availability of high-quality ground truth datasets. To tackle this issue, we implemented synth4bench [1], a framework that calls the NEAT v3.3 simulator to generate synthetic genomics data, runs the somatic variant calling algorithms, and then benchmarks their results against the ground truth.
- S.-C. Fragkouli, D. Solanki, L. J. Castro, F. E. Psomopoulos, N. Queralt-Rosinach, D. Cirillo, and L. C. Crossman, “Synthetic data: How could it be used for infectious disease research?” 2024, [Online]. Available at: https://arxiv.org/abs/2407.06211.
Over the last three to five years, it has become possible to generate machine learning synthetic data for healthcare-related uses. However, concerns have been raised about potential negative factors associated with the possibilities of artificial dataset generation. These include the potential misuse of generative artificial intelligence (AI) in fields such as cybercrime, the use of deepfakes and fake news to deceive or manipulate, and displacement of human jobs across various market sectors. Here, we consider both current and future positive advances and possibilities with synthetic datasets. Synthetic data offers significant benefits, particularly for data privacy, research, balancing datasets, and reducing bias in machine learning models. Generative AI is an artificial intelligence genre capable of creating text, images, video or other data using generative models. The recent explosion of interest in GenAI was heralded by the invention and rapid adoption of large language models (LLMs). These computational models are able to achieve general-purpose language generation and other natural language processing tasks and are based on transformer architectures, which made an evolutionary leap from previous neural network architectures. Fuelled by the advent of improved GenAI techniques and wide-scale usage, this is surely the time to consider how synthetic data can be used to advance infectious disease research. In this commentary we aim to create an overview of the current and future position of synthetic data in infectious disease research.
- F. Psomopoulos, “FAIR for Machine Learning; Building on the Lessons from FAIR Software.” Zenodo, 2024, doi: 10.5281/ZENODO.10953108.
Ensuring that data are FAIR is nowadays a clear expectation across all science domains, as a result of many years of global efforts. Research software has only just started to receive the same level of attention in recent years, with targeted actions towards the definition of the FAIR principles as applied to research software, as well as concerted efforts around reproducibility, quality, and sustainability. Given the rapid rise of ML as a key technology across all science domains, it is important to build on our collective experience and already start addressing the challenges ahead of us, towards making ML FAIR.
- S.-C. Fragkouli, N. Pechlivanis, A. Anastasiadou, G. Karakatsoulis, A. Orfanou, P. Kollia, A. Agathangelidis, and F. E. Psomopoulos, “Synth4bench: a framework for generating synthetic genomics data for the evaluation of tumor-only somatic variant calling algorithms.” 2024, doi: 10.1101/2024.03.07.582313.
Motivation: Somatic variant calling algorithms are widely used to detect genomic alterations associated with cancer. Evaluating their performance, although crucial, can be challenging due to the lack of high-quality ground truth datasets. To address this issue, we developed a synthetic data generation framework for benchmarking these algorithms, focusing on the TP53 gene and utilizing the NEAT v3.3 simulator. We thoroughly evaluated the performance of Mutect2, Freebayes, VarDict, VarScan2 and LoFreq and compared their results with our synthetic ground truth, while observing their behavior. Synth4bench attempts to shed light on the underlying principles of each variant caller by presenting them with data from a given range across the genomics data feature space and inspecting their response. Results: Using a synthetic dataset as ground truth provides an excellent approach for evaluating the performance of tumor-only somatic variant calling algorithms. Our findings are supported by an independent statistical analysis that was performed on the same data and output from all callers. Overall, synth4bench eases the effort of benchmarking algorithms by offering the opportunity to utilize a generated ground truth dataset. This kind of framework is essential in the field of cancer genomics, where precision is an ultimate necessity, especially for variants of low frequency. In this context, our approach makes the comparison of various algorithms transparent and straightforward, and also enhances their comparability.
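The benchmarking step (comparing a caller's variants against the NEAT "golden" ground truth) can be illustrated with a minimal sketch; the simplified variant representation and the example values below are invented for illustration, and real VCF parsing (e.g. with pysam) is omitted.

```python
# Variants are keyed by (chrom, pos, ref, alt); a call is correct when
# the same key appears in the ground-truth set.

def benchmark(called, truth):
    """Return precision, recall and F1 for a set of called variants."""
    called, truth = set(called), set(truth)
    tp = len(called & truth)                              # true positives
    precision = tp / len(called) if called else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Invented example: one shared variant, one miss, one false positive.
golden = {("chr17", 7675088, "C", "T"), ("chr17", 7674220, "G", "A")}
calls  = {("chr17", 7675088, "C", "T"), ("chr17", 7670699, "A", "G")}
print(benchmark(calls, golden))  # (0.5, 0.5, 0.5)
```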
- F. Adriano, E. Parkinson, D. Bianchini, F. Psomopoulos, M. Varadi, M. Andrabi, S.-C. Fragkouli, and U. Vadadokhau, “RDMkit, Your Domain, Machine Learning.” 2024, [Online]. Available at: https://rdmkit.elixir-europe.org/machine_learning.
- F. Liberante, F. Psomopoulos, G. Farrell, S. Suchánek, P. Lieby, M. Maccallum, S. Gundersen, and W. Nyberg Åkerström, “ELIXIR Report of the 21st Plenary of the RDA, October 2023.” Zenodo, 2024, doi: 10.5281/ZENODO.10721761.
This report document has been prepared by ELIXIR’s RDA Activities Focus Group to showcase the synergies and activities of the Research Data Alliance (RDA) which may be useful for ELIXIR members operating in the life sciences domain. The RDA was launched as a community-driven initiative in 2013 with the goal of building the social and technical infrastructure to enable open sharing and re-use of data. The ELIXIR RDA Activities Focus Group has prepared twenty reports of RDA Plenary events to date, containing overviews of highlighted RDA recommendations and outputs. This report contains highlights from various sessions of the RDA 21st Plenary, which took place as a hybrid event in Austria, 23-26 October 2023.
- S.-C. Fragkouli, A. Agathangelidis, and F. E. Psomopoulos, “10 Synthetic Genomics Datasets.” Feb. 2024, doi: 10.5281/zenodo.10683211.
These are 10 synthetic genomics datasets generated with NEAT v3 (based on the TP53 gene of Homo sapiens) for the use case of benchmarking somatic variant callers. To find out more about our generation framework, please visit the synth4bench GitHub repository. The datasets explore intrinsic NGS data parameters for the use case of observing their effect on tumor-only somatic variant calling algorithms. Of the 10 datasets, 5 have different coverage (while keeping all other parameters fixed) and 5 have varying read length. The reads in all datasets are paired-end.
- F. Psomopoulos, S. Capella-Gutierrez, L. Portell-Silva, and N. Pechlivanis, “EOSC-EVERSE: Paving the way towards a European Virtual Institute for Research Software Excellence.” Zenodo, 2024, doi: 10.5281/ZENODO.10526785.
The EVERSE project aims to create a framework for research software and code excellence, collaboratively designed and championed by the research communities across five EOSC Science Clusters and national Research Software Expertise Centres, in pursuit of building a European network of Research Software Quality and setting the foundations of a future Virtual Institute for Research Software Excellence. This framework for research software excellence will incorporate aspects involving community curation, quality assessment, and best practices for research software. This collective knowledge will be captured in the Research Software Quality toolkit (RSQkit), a knowledge base to gather and curate expertise that will contribute to high-quality software and code across different disciplines. By embedding the RSQkit and services into the EOSC Science Clusters, EVERSE will demonstrate improvements in the quality of research software and maximise its reuse, leading to standardised software development practices and sustainable research software. Furthermore, we will drive recognition of software and support career progress for developers, from researchers who code to RSEs, raising their capacity to guarantee software quality. The European network for Research Software Quality aims to cross-fertilise different research domains, act as a lobbying organisation, and raise awareness of software as a key enabler in research, with the overall ambition to accelerate research and innovation through improving the quality of research software and code. EVERSE's ultimate ambition is to contribute towards a cultural change where research software is recognized as a first-class citizen of the scientific process and the people who contribute to it are credited for their efforts.
These slides were presented in the context of the ReSA Funders’ Forum meeting (January 16th/17th 2024), as well as at the International Research Software Engineering Research (IRSER) Community Meetup (January 17th 2024)
2023
- G. I. Gavriilidis, Sofoklis, Thomas, Konstantinos, and Fotis, “PertFlow: A cloud-based workflow to facilitate perturbational modeling on single-cell transcriptomics for pharmacological research.” Zenodo, 2023, doi: 10.5281/ZENODO.8350620.
Perturbational modeling in single-cell -omics computationally captures, at unprecedented cellular resolution, responses to molecular changes initiated by gene knockdowns or drug treatments. Nevertheless, relevant in silico tools are hindered by interoperability issues, hefty computational demands, and reliance on complex algorithms like Deep Learning that lack biological interpretation. Here, we introduce "PertFlow", a user-friendly, cloud-based workflow merging standard single-cell pipelines for scRNA-seq/ECCITE-seq with specialized perturbational modeling tools. PertFlow offers seamless Seurat and Scanpy interoperability through in-tandem Python and R coding (Rpy2 package). A Google Colab implementation of the method demonstrates the ease of deployment and allows for testing by other users. First, PertFlow enables pathway/transcription factor (TF) enrichment (DecoupleR) to establish the necessary biological context. At its core, PertFlow employs AugurPy for cell-type prioritization, the scGEN variational autoencoder for perturbation response prediction, and MixScape for assessing perturbations in single-cell pooled CRISPR screens (ECCITE-seq). Moreover, PertFlow also features the CPA compositional autoencoder for complex perturbational predictions and the ASGARD toolkit for drug repurposing based on LINCS L1000 project data. When applied to Chronic Lymphocytic Leukemia (CLL) scRNA-seq data from peripheral blood cells, pre/post-Ibrutinib therapy (PMID: 31996669), PertFlow was able to capture biological ground truths (suppression of oxidative phosphorylation) (DecoupleR), but also went beyond them, showing: (a) cell prioritization of monocytes 30 days post-Ibrutinib and implication of galectins in a poor CLL Ibrutinib responder (AugurPy), (b) perturbational predictions for CLL-geared TFs like IRF1 (MixScape), and (c) repurposed drugs mimicking Ibrutinib’s effects like auranofin, fostamatinib, parthenolide, vorinostat, idelalisib and sonidegib (ASGARD).
- F. Psomopoulos, G. Juckeland, G. A. Stewart, S. Roiser, S. Capella-Gutierrez, L. Portell-Silva, P. Bos, J. Maassen, T. Vuillaume, N. Chue Hong, D. Garijo, J. Tedds, C. Doglioni, and C. Goble, “EOSC EVERSE: Paving the way towards a European Virtual Institute for Research Software Excellence.” Zenodo, 2023, doi: 10.5281/ZENODO.10183077.
Extended abstract of the EOSC EVERSE project (https://everse.software/), submitted for presentation in the International Research Software Engineering Research (IRSER) Community Meetup in January 2024 (https://www.software.ac.uk/news/international-research-software-engineering-research-irser-community-meetup)
- S.-C. Fragkouli, N. Pechlivanis, A. Orfanou, A. Anastasiadou, A. Agathangelidis, and F. Psomopoulos, “Synth4bench: a framework for generating synthetic genomics data for the evaluation of somatic variant calling algorithms,” 17th Conference of Hellenic Society for Computational Biology and Bioinformatics (HSCBB). Oct. 2023, doi: 10.5281/zenodo.8432060.
Somatic variant calling algorithms are widely used to detect genomic alterations associated with cancer. Evaluating the performance of these algorithms can be challenging due to the lack of high-quality ground truth datasets. To address this issue, we developed a synthetic genomics data generation and evaluation framework for benchmarking somatic variant calling algorithms. We generated synthetic datasets based on sequence data from the TP53 gene, using the NEAT simulator. Subsequently, we thoroughly evaluated the performance of GATK-Mutect2 on these datasets, and compared the results to the “golden” files produced by NEAT containing the actual variations. Our results demonstrate that the synthetic datasets generated using our framework can accurately capture the complexity and diversity of real cancer genomic data. Moreover, the synthetic datasets provide an excellent ground truth for evaluating the performance of somatic variant calling algorithms. Altogether, our framework provides a valuable resource for testing the performance of somatic variant calling algorithms, enabling researchers to evaluate and improve the accuracy of these algorithms for cancer genomics applications.
- F. Psomopoulos, “From FAIR data to FAIR Research Software, towards FAIR Machine Learning.” Zenodo, 2023, doi: 10.5281/ZENODO.10212280.
FAIR principles are now integral to handling research data, ongoing challenges notwithstanding. However, the FAIR Principles, at a high level, are intended to apply to all research objects; both those used in research and those that are research outputs. Here we highlight the efforts around FAIR research software as well as Machine Learning, while emphasising the need for community-led standards and best practices in this area. This presentation was part of a panel discussion on Data Management at the Bio-IT World Conference & Expo Europe, 29-30 November 2023 (bio-itworldeurope.com).
- S. G. Sutcliffe et al., “Tracking SARS-CoV-2 variants of concern in wastewater: an assessment of nine computational tools using simulated genomic data,” bioRxiv. Cold Spring Harbor Laboratory, Dec. 2023, doi: 10.1101/2023.12.20.572426.
Wastewater-based surveillance (WBS) is an important epidemiological and public health tool for tracking pathogens across the scale of a building, neighbourhood, city, or region. WBS gained widespread adoption globally during the SARS-CoV-2 pandemic for estimating community infection levels by qPCR. Sequencing pathogen genes or genomes from wastewater adds information about pathogen genetic diversity which can be used to identify viral lineages (including variants of concern) that are circulating in a local population. Capturing the genetic diversity by WBS sequencing is not trivial, as wastewater samples often contain a diverse mixture of viral lineages with real mutations and sequencing errors, which must be deconvoluted computationally from short sequencing reads. In this study we assess nine different computational tools that have recently been developed to address this challenge. We simulated 100 wastewater sequence samples consisting of SARS-CoV-2 BA.1, BA.2, and Delta lineages, in various mixtures, as well as a Delta-Omicron recombinant and a synthetic “novel” lineage. Most tools performed well in identifying the true lineages present and estimating their relative abundances, and were generally robust to variation in sequencing depth and read length. While many tools identified lineages present down to 1% frequency, results were more reliable above a 5% threshold. The presence of an unknown synthetic lineage, which represents an unclassified SARS-CoV-2 lineage, increases the error in relative abundance estimates of other lineages, but the magnitude of this effect was small for most tools. The tools also varied in how they labelled novel synthetic lineages and recombinants. While our simulated dataset represents just one of many possible use cases for these methods, we hope it helps users understand potential sources of noise or bias in wastewater sequencing data and to appreciate the commonalities and differences across methods.
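As a simplified illustration of the deconvolution problem described above (not an implementation of any of the nine assessed tools), lineage abundances can be estimated from per-mutation read frequencies by least squares over a mutation-by-lineage signature matrix; all matrix entries and frequencies below are invented.

```python
import numpy as np

# Rows = marker mutations, columns = lineages; 1 if the lineage
# carries the mutation. Toy values, for illustration only.
signatures = np.array([
    [1, 0, 0],   # mutation A: carried by lineage 1 only
    [0, 1, 0],   # mutation B: carried by lineage 2 only
    [0, 0, 1],   # mutation C: carried by lineage 3 only
    [1, 1, 0],   # mutation D: shared by lineages 1 and 2
], dtype=float)

observed = np.array([0.6, 0.3, 0.1, 0.9])   # per-mutation read frequencies

# Unconstrained least squares, then clip and renormalise so the
# estimates form a valid composition (non-negative, summing to 1).
est, *_ = np.linalg.lstsq(signatures, observed, rcond=None)
est = np.clip(est, 0, None)
est /= est.sum()
print(np.round(est, 2))
```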
- F. E. Psomopoulos, K. A. Kyritsis, I. Topolsky, B. Batut, A. H. Fitzpatrick, and G. Leoni, “Exploring the landscape of the genomic wastewater surveillance ecosystem: a roadmap towards standardization.” Center for Open Science, Nov. 2023, doi: 10.37044/osf.io/rtgk9.
Nearly two years after the initial report of SARS-CoV-2 in Wuhan, China, the COVID-19 pandemic has affected over 485 million individuals. Wastewater surveillance has garnered substantial attention as a passive monitoring system to complement clinical genomic surveillance activities during the SARS-CoV-2 pandemic. Several effective methods are now in place for detecting and quantifying viral RNA in wastewater samples, and it is evident that RNA concentrations in wastewater correlate with reported case trends. Exploratory projects have demonstrated the potential of Wastewater Based Epidemiology (WBE); nevertheless, it is imperative to coordinate efforts, establish standards, and create a catalog of available software tools and services. This coordination will streamline the deployment of end-to-end genomic wastewater surveillance pipelines and promote the adoption of these monitoring methods within the broader scientific community. The initial step involves identifying and cataloging the challenges of working with wastewater HTS and the pertinent methodologies and bioinformatics workflows essential for managing genomic data from wastewater samples, thus forming a coherent structure. The primary objective of this project was to systematically review, compile, and initiate the integration, standardization, and documentation of diverse approaches for genomic wastewater surveillance. Drawing on the expertise of the ELIXIR Wastewater Surveillance Working Group, our focus was to create a comprehensive framework of components, including modules and tools, to facilitate the practical implementation of end-to-end genomic wastewater surveillance pipelines.
- R. M. Waterhouse, A.-F. Adam-Blondon, B. Balech, E. Barta, K. F. Heil, G. M. Hughes, L. S. Jermiin, M. Kalaš, J. Lanfear, E. Pafilis, A. C. Papageorgiou, F. Psomopoulos, N. Raes, J. Burgin, and T. Gabaldón, “The ELIXIR Biodiversity Community: Understanding short- and long-term changes in biodiversity,” F1000Research, 2023, doi: 10.12688/f1000research.133724.2.
Biodiversity loss is now recognised as one of the major challenges for humankind to address over the next few decades. Unless major actions are taken, the sixth mass extinction will lead to catastrophic effects on the Earth’s biosphere and human health and well-being. ELIXIR can help address the technical challenges of biodiversity science, through leveraging its suite of services and expertise to enable data management and analysis activities that enhance our understanding of life on Earth and facilitate biodiversity preservation and restoration. This white paper, prepared by the ELIXIR Biodiversity Community, summarises the current status and responses, and presents a set of plans, both technical and community-oriented, that should both enhance how ELIXIR Services are applied in the biodiversity field and how ELIXIR builds connections across the many other infrastructures active in this area. We discuss the areas of highest priority, how they can be implemented in cooperation with the ELIXIR Platforms, and their connections to existing ELIXIR Communities and international consortia. The article provides a preliminary blueprint for a Biodiversity Community in ELIXIR and is an appeal to identify and involve new stakeholders.
- B. Batut, F. E. Psomopoulos, A. Via, and P. Palagi, “Teaching and Hosting Galaxy training / Hands-on: Training techniques to enhance learner participation and engagement.” 2023, [Online]. Available at: https://training.galaxyproject.org/training-material/topics/teaching/tutorials/learner_participation_engagement/tutorial.html.
- B. Batut, F. E. Psomopoulos, A. Via, P. Palagi, and C. Gallardo, “Contributing to the Galaxy Training Material / Hands-on: Principles of learning and how they apply to training and teaching.” 2023, [Online]. Available at: https://training.galaxyproject.org/training-material/topics/contributing/tutorials/learning-principles/tutorial.html.
- B. Batut, F. E. Psomopoulos, A. Via, and P. Palagi, “Contributing to the Galaxy Training Material / Hands-on: Design and plan session, course, materials.” 2023, [Online]. Available at: https://training.galaxyproject.org/training-material/topics/contributing/tutorials/design/tutorial.html.
- F. E. Psomopoulos and Gallantries, “Statistics and machine learning / Hands-on: Introduction to Machine Learning using R.” 2023, [Online]. Available at: https://training.galaxyproject.org/training-material/topics/statistics/tutorials/intro-to-ml-with-r/tutorial.html.
- S.-C. Fragkouli, A. Agathangelidis, and F. E. Psomopoulos, “TP53 synthetic genomics data for benchmarking variant callers.” Jun. 2023, doi: 10.5281/zenodo.8095898.
This is a synthetic genomics dataset generated with NEAT for the TP53 gene, for the use case of benchmarking somatic variant callers. The reports for all bam files were created using bam-readcount.
- G. Nilsonne, G. O’Neill, S. Dahle, V. Gaillard, J. Priess-Buchheit, S. Birgit, and F. Psomopoulos, “EOSC as an Enabler of Research Assessment Reform: Position Paper from Task Force on Research Careers, Recognition, and Credit.” Zenodo, 2023, doi: 10.5281/ZENODO.10417069.
This position paper presents recommendations from the TF RCRC on reforming research assessment to support researchers engaging with EOSC. The recommendations are aimed at the EOSC Partnership and EOSC-A (including the members of EOSC-A) to strategically contribute to the reform of research assessment with respect to EOSC and Open Science.
- F. Psomopoulos, F. Al-Shahrour, and C. van Gelder, “Establishing the EOSC4Cancer Network of Experts.” Zenodo, 2023, doi: 10.5281/ZENODO.8073708.
In the context of the WP5 “Training and capacity building for key national and international stakeholders”, a key activity is to unlock the expertise in handling cancer-related data both within and outside of the EOSC4Cancer consortium, and bundle this in such a way that all researchers in need of training and support will be able to receive it. To this end, we aim to bring together a network of key cancer data support professionals, coming from within the respective RIs, relevant national initiatives and competency centers. These slides were used in the context of a 1-hour webinar on Monday, June 19th at 12:00-13:00 CEST, where we discussed, with all relevant stakeholders both within the EOSC4Cancer project (representatives of all EOSC4Cancer partners across all WPs) and beyond (notably the Cancer Landscape Partnering Group and the Stakeholder Forum): a short overview of WP5, which aims to ensure that biomedical researchers and healthcare professionals will have the knowledge and skills to best address their scientific questions and challenges by using the tools and infrastructure, towards accelerating and optimising their research; the foreseen scope and membership of the network, including the potential activities envisioned throughout the lifecycle of EOSC4Cancer and beyond; and ways of contributing to the network, aligned to ongoing activities within EOSC4Cancer.
- F. Psomopoulos, F. Al-Shahrour, C. van Gelder, S. Morgan, M. Andrabi, K. Majcen, and F. Schoots, “A guidance document for the EOSC4Cancer learning pathway.” Zenodo, 2023, doi: 10.5281/ZENODO.10200523.
The EOSC4Cancer initiative addresses the pressing need for advanced research and infrastructure in tackling cancer, particularly in Europe. This document focuses on the WP5 initiative within EOSC4Cancer, which aims to empower clinicians and cancer researchers through capacity building. Emphasizing the importance of tailored training, the document outlines the design process for learning paths, including short-term courses and traditional postgraduate options. Key concepts such as learning outcomes and FAIR training are introduced, and the systematic mapping of skills and roles within the target audience is detailed. Illustrative examples from both within and beyond the project demonstrate the application of these principles. The document concludes with a list of practical tools and services for constructing effective learning paths. Importantly, this document is a living resource, reflecting ongoing efforts and intended to contribute to a dynamic learning environment beyond the EOSC4Cancer project, benefiting the broader scientific community.
- L. J. Castro, F. Beuttenmüller, Z. Chen, S. Efeoglu, D. Garijo, F. Psomopoulos, B. Serrano-Solano, K. B. Shiferaw, D. Solanki, B. Wentzel, and Y. Zhang, “Towards metadata for machine learning - Crosswalk tables.” Zenodo, 2023, doi: 10.5281/ZENODO.10407320.
Despite the existence of recommendations to report Machine Learning outputs and datasets (e.g., ML model cards, Dataset Cards, etc.), there is currently no community-agreed metadata schema describing Machine Learning models (MLMs). Some communities have already expressed some interest in this regard, for instance, the Research Data Alliance (RDA) FAIR4ML Interest Group, the ELIXIR Machine Learning Focus Group and the National Research Data Infrastructure (NFDI) for Data Science and Artificial Intelligence (NFDI4DS), one of the NFDI consortia in Germany. A common ground for these three communities is the interest in creating a common metadata schema based on schema.org, as the scientific community has already turned to it as a low-barrier gluing point. Initially independently but now as a joint force, these three communities are currently working on developing such a metadata schema. To work towards this end, the Semantic Technologies team (SemTec) at ZB MED Information Centre for Life Sciences (ZB MED) organized a 2-day hackathon on the 23rd and 24th of November 2023. During this hackathon, a group of 11 participants from 11 different European-based institutions worked on an initial set of crosswalks from the ML model cards in HuggingFace to four other related resources, including the BioImage Model Zoo (also including a mapping for Datasets) and Schema.org.
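A crosswalk of this kind can be sketched as a simple field mapping; the specific HuggingFace-to-schema.org property pairs below are illustrative assumptions, not the hackathon's agreed mapping.

```python
# Hypothetical fragment of a crosswalk table: a few HuggingFace
# model-card fields mapped onto schema.org properties. The property
# choices are invented for illustration.
CROSSWALK = {
    "model_name":        "schema:name",
    "license":           "schema:license",
    "datasets":          "schema:isBasedOn",
    "language":          "schema:inLanguage",
    "model_description": "schema:description",
}

def to_schema_org(card: dict) -> dict:
    """Translate a flat model-card dict to schema.org keys,
    keeping only the fields present in the crosswalk."""
    return {CROSSWALK[k]: v for k, v in card.items() if k in CROSSWALK}

card = {"model_name": "tp53-classifier", "license": "MIT", "framework": "pytorch"}
print(to_schema_org(card))
# {'schema:name': 'tp53-classifier', 'schema:license': 'MIT'}
```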
- L. J. Castro, F. Psomopoulos, B. Serrano-Solano, C. Sharma, K. B. Shiferaw, D. Solanki, and Y. Zhang, “Lifecycle for FAIR Machine Learning.” Zenodo, 2023, doi: 10.5281/ZENODO.10407265.
Despite the advances in Machine Learning Operations and the availability of several variations of the Machine Learning lifecycle, none is yet aligned to the Findable, Accessible, Interoperable and Reusable (FAIR) principles. Here we present our proposal for such a lifecycle, including an initial analysis of which FAIR principles apply and how, together with additional information on reporting best practices and existing resources that could support the different phases of the lifecycle.
- R. M. Waterhouse, A.-F. Adam-Blondon, B. Balech, E. Barta, K. F. Heil, G. M. Hughes, L. S. Jermiin, M. Kalaš, J. Lanfear, E. Pafilis, A. C. Papageorgiou, F. Psomopoulos, N. Raes, J. Burgin, and T. Gabaldón, “The ELIXIR Biodiversity Community: Understanding short- and long-term changes in biodiversity,” F1000Research, vol. 12. F1000 Research Ltd, p. 499, May 2023, doi: 10.12688/f1000research.133724.1.
Biodiversity loss is now recognised as one of the major challenges for humankind to address over the next few decades. Unless major actions are taken, the sixth mass extinction will lead to catastrophic effects on the Earth’s biosphere and human health and well-being. ELIXIR can help address the technical challenges of biodiversity science, through leveraging its suite of services and expertise to enable data management and analysis activities that enhance our understanding of life on Earth and facilitate biodiversity preservation and restoration. This white paper, prepared by the ELIXIR Biodiversity Community, summarises the current status and responses, and presents a set of plans, both technical and community-oriented, that should both enhance how ELIXIR Services are applied in the biodiversity field and how ELIXIR builds connections across the many other infrastructures active in this area. We discuss the areas of highest priority, how they can be implemented in cooperation with the ELIXIR Platforms, and their connections to existing ELIXIR Communities and international consortia. The article provides a preliminary blueprint for a Biodiversity Community in ELIXIR and is an appeal to identify and involve new stakeholders.
Short articles and Preprints
2022
- J. Tedds, S. Capella-Gutierrez, J. Clark-Casey, F. Coppens, G. Farrell, C. Van Gelder, B. Grüning, K. Heil, J. Lindvall, P. Maccallum, L. Matyska, F. Psomopoulos, P. Ruch, and S.-A. Sansone, “ELIXIR EOSC Strategy 2022.” Zenodo, 2022, doi: 10.5281/ZENODO.7120997.
Document detailing ELIXIR Europe’s strategy for engagement with the European Open Science Cloud (EOSC) as produced by the ELIXIR-EOSC Focus Group, 2022.
- N. Pechlivanis, A. Mitsigkolas, F. Psomopoulos, and E. Bosdriesz, “Assessing SARS-CoV-2 evolution through the analysis of emerging mutations.” Nov. 2022, doi: 10.7490/f1000research.1119191.1.
The number of studies on SARS-CoV-2 published on a daily basis is constantly increasing, in an attempt to understand and address the challenges posed by the pandemic in a better way. Most of these studies also include a phylogeny of SARS-CoV-2 as background context, always taking into consideration the latest data in order to construct an updated tree. However, some of these studies have also revealed the difficulties of inferring a reliable phylogeny. The authors of [13] have shown that inferring a reliable phylogeny is an inherently complex task due to the large number of highly similar sequences, given the relatively low number of mutations evident in each sequence. From this viewpoint, there is indeed a challenge and an opportunity in identifying the evolutionary history of the SARS-CoV-2 virus, in order to assist the phylogenetic analysis process as well as support researchers in keeping track of the virus and the course of its characteristic mutations, and in finding patterns of the emerging mutations themselves and the interactions between them. The research question is formulated as follows: detecting new patterns of co-occurring mutations beyond the strain-specific / strain-defining ones, in SARS-CoV-2 data, through the application of ML methods. Going beyond the traditional phylogenetic approaches, we will be designing and implementing a clustering method that will effectively create a dendrogram of the involved sequences, based on a feature space defined on the present mutations, rather than the entire sequence. Ultimately, this ML method is tested on sequences retrieved from public databases and validated using the available metadata as labels. The main goal of the project is to design, implement and evaluate software that will automatically detect and cluster relevant mutations, which could potentially be used to identify trends in emerging variants.
- E. A. Huerta et al., “FAIR for AI: An interdisciplinary, international, inclusive, and diverse community building perspective.” arXiv, 2022, doi: 10.48550/ARXIV.2210.08973.
A foundational set of findable, accessible, interoperable, and reusable (FAIR) principles were proposed in 2016 as prerequisites for proper data management and stewardship, with the goal of enabling the reusability of scholarly data. The principles were also meant to apply to other digital assets, at a high level, and over time, the FAIR guiding principles have been re-interpreted or extended to include the software, tools, algorithms, and workflows that produce data. FAIR principles are now being adapted in the context of AI models and datasets. Here, we present the perspectives, vision, and experiences of researchers from different countries, disciplines, and backgrounds who are leading the definition and adoption of FAIR principles in their communities of practice, and discuss outcomes that may result from pursuing and incentivizing FAIR AI research. The material for this report builds on the FAIR for AI Workshop held at Argonne National Laboratory on June 7, 2022.
- N. P. Chue Hong et al., “FAIR Principles for Research Software (FAIR4RS Principles).” Zenodo, May 2022, doi: 10.15497/RDA00068.
To improve the sharing and reuse of research software, the FAIR for Research Software (FAIR4RS) Working Group has applied the FAIR Guiding Principles for scientific data management and stewardship to research software, bringing together existing and new community efforts. Many of the FAIR Guiding Principles can be directly applied to research software by treating software and data as similar digital research objects. However, specific characteristics of software — such as its executability, composite nature, and continuous evolution and versioning — make it necessary to revise and extend the principles. This document presents the first version of the FAIR Principles for Research Software (FAIR4RS Principles), and includes explanatory text to aid adoption. It is an outcome of the FAIR for Research Software Working Group (FAIR4RS WG) based on community consultations that started in 2019. The FAIR for Research Software Working Group was jointly convened as a Research Data Alliance (RDA) Working Group, FORCE11 Working Group, and Research Software Alliance (ReSA) Task Force. Going forward, the RDA Software Source Code Interest Group is the maintenance home for the principles. Concerns or queries about the principles can be raised at RDA plenary events organized by the SSC IG, where there may be opportunities for adopters to report back on progress. The full maintenance and retirement plan for the principles can be found on the RDA website.
2021
- M. C. Maniou, N. Pechlivanis, A. Togkousidis, and F. Psomopoulos, “k – taxatree: An alignment-free multi-label classification workflow for efficient taxonomic assignment of metagenomic NGS data.” Zenodo, 2021, doi: 10.5281/zenodo.5769944.
Annotating NGS sequences by assigning taxa labels is a key component for the majority of metagenomic studies, and is often a prerequisite in effectively assessing biodiversity in a given environment. In this work we introduce k-taxatree, an alignment-free machine learning method that enables robust assignment of taxonomic labels to short reads, utilizing a multi-label Random Forest approach as the underlying model. We demonstrate the effectiveness of the method by applying it to data from the V4 hypervariable region of 16S rRNA reads, retrieved from the Earth Microbiome Project, displaying accuracy scores over 95% in the validation set. The workflow has been fully developed in R and is freely available at https://github.com/BiodataAnalysisGroup/k-taxatree.
- N. Pechlivanis, M. Tsagiopoulou, M. C. Maniou, A. Togkousidis, E. Mouchtaropoulou, T. Chassalevris, S. Chaintoutis, C. Dovas, M. Petala, M. Kostoglou, T. Karapantsios, S. Laidou, E. Vlachonikola, A. Chatzidimitriou, A. Papadopoulos, N. Papaioannou, A. Argiriou, and F. Psomopoulos, “Detecting SARS-CoV-2 lineages and mutational load in municipal wastewater: a use-case in the metropolitan area of Thessaloniki, Greece.” Cold Spring Harbor Laboratory, Mar. 2021, doi: 10.1101/2021.03.17.21252673.
The SARS-CoV-2 pandemic represents an unprecedented global crisis necessitating novel approaches for, amongst others, early detection of emerging variants relating to the evolution and spread of the virus. Recently, the detection of SARS-CoV-2 RNA in wastewater has emerged as a useful tool to monitor the prevalence of the virus in the community. Here, we propose a novel methodology, called lineagespot, for the detection of SARS-CoV-2 lineages in wastewater samples using next-generation sequencing. Our proposed method was tested and evaluated using NGS data produced by the sequencing of three wastewater samples from the municipality of Thessaloniki, Greece, covering three distinct time periods. The results showed a clear identification of trends in the presence of SARS-CoV-2 mutations in sewage data, and allowed for a robust inference between the variants evident through our approach and the variants observed in patients from the same area over the same time periods. Lineagespot is an open-source tool, implemented in R, and is freely available on GitHub.
- R. Alves, D. Bampalikis, L. J. Castro, J. M. Fernández, J. Harrow, M. Kuzak, E. Martin, F. Psomopoulos, and A. Via, “ELIXIR Software Management Plan for Life Sciences.” BioHackrXiv, 2021, doi: 10.37044/osf.io/k8znb.
Data Management Plans are now considered a key element of Open Science. They describe the data management life cycle for the data to be collected, processed and/or generated within the lifetime of a particular project or activity. A Software Management Plan (SMP) plays the same role but for software. Beyond its management perspective, the main advantage of an SMP is that it both provides clear context to the software that is being developed and raises awareness. Although there are a few SMPs already available, most of them require significant technical knowledge to be effectively used. ELIXIR has developed a low-barrier SMP, specifically tailored for life science researchers, aligned to the FAIR Research Software principles. Starting from the Four Recommendations for Open Source Software, the ELIXIR SMP was iteratively refined by surveying the practices of the community and incorporating the received feedback. Currently available as a survey, future plans for the ELIXIR SMP include a human- and machine-readable version that can be automatically queried and connected to relevant tools and metrics within the ELIXIR Tools ecosystem and beyond.
2020
- F. Ballesio, A. H. Bangash, D. Barradas Bautista, J. Barton, A. Guarracino, L. Heumos, A. Panoli, M. Pietrosanto, A. Togkousidis, P. Davis, and F. E. Psomopoulos, “Determining a novel feature-space for SARS-CoV-2 sequence data.” Center for Open Science, 2020, doi: 10.37044/osf.io/xt7gw.
The pandemic nature of SARS-CoV-2 and its ability to reinfect recovered patients, among other damaging characteristics, took everyone by surprise. A global collaborative scientific effort was urgently required to bring together experts from different niches of medicine and data science. Such a platform was provided by the COVID-19 Virtual BioHackathon, organized from the 5th to the 11th of April 2020, to tackle pressing issues ranging in diversity from text mining to genomics. Under the "Machine Learning" track, we determined the optimal k-mer length for feature extraction, constructed continuous distributed representations of protein sequences to create phylogenetic trees in an alignment-free manner, and clustered predicted MHC class I and II binding affinities to aid in vaccine design. All related work is available in a GitHub repository under an MIT license for future research.
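The k-mer feature extraction explored in the Machine Learning track can be illustrated with a minimal sketch (this is not the BioHackathon code; the function name and the fixed-order count vector are our assumptions):

```python
from collections import Counter
from itertools import product

def kmer_vector(seq, k):
    """Count overlapping k-mers and return a fixed-length feature vector."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # One column per possible k-mer, in lexicographic order, so vectors
    # from different sequences are directly comparable.
    return [counts.get("".join(km), 0) for km in product("ACGT", repeat=k)]

vec = kmer_vector("ACGTACGT", 2)  # 7 overlapping 2-mers; "AC" occurs twice
```

Sweeping `k` and comparing downstream model performance is one simple way to pick the "optimal k-mer length" the abstract refers to.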
- F. Psomopoulos, C. W. G. van Gelder, P. Kahlem, B. Leskošek, and J. Lindvall, “ELIXIR Training Platform Task 2: Gap analysis, training materials development and training delivery,” F1000Research, vol. 9. 2020, doi: 10.7490/f1000research.1117955.1.
The need for bioinformatics training evolves constantly, due to the continuous development of new technologies, as well as the increasing number of ELIXIR Services and Communities. The ELIXIR Training Platform, jointly with all other ELIXIR Platforms, Communities and Nodes, will continue to identify emerging gaps in training provision across Europe, and ensure that appropriate training solutions are developed and delivered, either by ELIXIR or the Nodes, in order to tackle such gaps. The main objective of this effort is to assess the bioinformatics training needs within the wider ELIXIR community, with a particular focus on capturing the perspective of individual researchers across all ELIXIR Nodes, Platforms and Communities.
- M. Tsagiopoulou, N. Pechlivanis, and F. Psomopoulos, “InterTADs: Integration of Multi-Omics Data on Topological Associated Domains.” Aug. 2020, doi: 10.21203/rs.3.rs-54194/v1.
Background: The integration of multi-omics data can greatly facilitate the advancement of research in Life Sciences by providing new insights on how biological systems interact. However, there is currently no widespread procedure for a robust, efficient and meaningful multi-omics data integration; the approach presented here is a first attempt towards increasing the reliability of data discovery power compared to the processing of individual biodata sets. Results: Here, we propose a high-speed framework, called InterTADs, for integrating multi-omics data from the same physical source (e.g. patient) taking into account the chromatin configuration of the genome, i.e. the topologically associating domains (TADs). The main concept of the proposed methodology is to create a single matrix with all different events (e.g. DNA methylation, expression, mutation) combined with their genome coordinates and the respective quantitative metrics after application of the appropriate scaling. The events are divided into their related TADs according to the chromosomal location, and each TAD is evaluated for statistically significant differences between the groups of interest (e.g. normal cells vs cancer cells). Finally, several visualization approaches are available, including the mapping of the events on the chromosomal location of the TAD as well as the distribution of the counts within a given TAD across the different study groups. Conclusions: InterTADs provides a general framework for integrating multi-omics data and relating them with the TADs. This could lead to the extraction of new biological insights into the examined case study. InterTADs is an open-source tool implemented in R and licensed under the MIT License. The source code is freely available.
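The coordinate-based grouping step at the heart of the abstract can be sketched as follows (an illustrative Python analogue, not the InterTADs R implementation; the data layout and all names are assumptions):

```python
from bisect import bisect_right
from collections import defaultdict

def assign_to_tads(events, tads):
    """Group omics events into TADs by chromosomal coordinate.

    events: list of (chrom, pos, event_id) tuples.
    tads:   dict chrom -> sorted list of (start, end, tad_id) intervals.
    Returns dict tad_id -> list of event_ids falling inside that TAD.
    """
    starts = {c: [t[0] for t in ts] for c, ts in tads.items()}
    groups = defaultdict(list)
    for chrom, pos, event_id in events:
        ts = tads.get(chrom, [])
        # Find the last TAD whose start is <= pos, then check its end.
        i = bisect_right(starts.get(chrom, []), pos) - 1
        if i >= 0 and pos <= ts[i][1]:
            groups[ts[i][2]].append(event_id)
    return dict(groups)

tads = {"chr1": [(0, 1000, "TAD1"), (2000, 3000, "TAD2")]}
events = [("chr1", 500, "meth_1"), ("chr1", 2500, "mut_7"),
          ("chr1", 1500, "expr_3")]  # expr_3 falls between TADs, so it is dropped
groups = assign_to_tads(events, tads)
```

Each resulting group would then be tested for significant differences between the study groups, as the abstract describes.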
- RDA COVID-19 Working Group, “Recommendations and Guidelines on data sharing,” Research Data Alliance. 2020, doi: 10.15497/rda00052.
This is the final version of the Recommendations and Guidelines from the RDA COVID-19 Working Group, and has been endorsed through the official RDA process.
- S. Athanasiou et al., “National Plan for Open Science.” Zenodo, Jun. 2020, doi: 10.5281/zenodo.3908953.
This report proposes a series of goals, commitments, policies and actions for the adoption of Open Science in Greece. It is intended to serve as a reference point for national policy makers towards the establishment of a national strategy for Open Science, assist national organizations in embracing Open Science principles, and ensure national alignment with the European Open Science Cloud (EOSC). The report has been prepared by the ’Open Science Task Force’, a collaborative bottom-up initiative of eleven national academic & research organizations and twenty-six research infrastructures & civic initiatives.
2019
- A. Nicolaidis and F. Psomopoulos, “DNA coding and Gödel numbering.” 2019, doi: 10.48550/arXiv.1909.13574.
Evolution consists of distinct stages: cosmological, biological, linguistic. Since biology verges on both natural sciences and linguistics, we expect that it shares structures and features from both forms of knowledge. Indeed, in DNA we encounter the biological atoms, the four nucleotide molecules. At the same time, these four nucleotides may be considered as the letters of an alphabet. These four letters, through a genetic code, generate biological words, phrases, sentences (aminoacids, proteins, cells, living organisms). In this spirit we may equally well consider a DNA strand as a mathematical statement. Inspired by the work of Kurt Gödel, we attach to each DNA strand a Gödel number, a product of prime numbers raised to appropriate powers. To each DNA chain corresponds a single Gödel number G, and inversely, given a Gödel number G, we can specify the DNA chain it stands for. Next, considering a single DNA strand composed of N bases, we study the statistical distribution of g, the logarithm of G. Our assumption is that the choice of the m-th term is random, with equal probability for the four possible outcomes. The experiment, to some extent, resembles throwing a four-faced die N times. Through the moment generating function we obtain the discrete and then the continuous distribution of g. There is an excellent agreement between our formalism and simulated data. Finally, we compare our formalism to actual data, to detect the presence of traces of non-random dynamics.
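The encoding described above can be sketched in a few lines of Python (an illustrative reconstruction, not the authors' code; the specific nucleotide-to-digit mapping is an assumption):

```python
from math import log

# Hypothetical digit assignment: each base maps to an exponent 1..4.
CODE = {"A": 1, "C": 2, "G": 3, "T": 4}

def primes(n):
    """First n primes via trial division (fine for short illustrative strands)."""
    found, candidate = [], 2
    while len(found) < n:
        if all(candidate % p for p in found):
            found.append(candidate)
        candidate += 1
    return found

def godel_number(strand):
    """G = product over positions m of p_m ** code(base_m)."""
    g = 1
    for p, base in zip(primes(len(strand)), strand):
        g *= p ** CODE[base]
    return g

def decode(g, length):
    """Invert G: the exponent of the m-th prime recovers the m-th base."""
    inv = {v: k for k, v in CODE.items()}
    strand = []
    for p in primes(length):
        e = 0
        while g % p == 0:
            g, e = g // p, e + 1
        strand.append(inv[e])
    return "".join(strand)

G = godel_number("ACGT")   # 2**1 * 3**2 * 5**3 * 7**4
g = log(G)                 # the quantity whose distribution the paper studies
```

The one-to-one correspondence between strands and numbers follows from unique prime factorization, which is exactly what the decoding loop exploits.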
- E. A. Becker et al., “datacarpentry/wrangling-genomics: Data Carpentry: Genomics data wrangling and processing, June 2019.” Jun. 2019, doi: 10.5281/zenodo.3260609.
2016
- E. Stergiadis, A. M. Kintsakis, F. E. Psomopoulos, and P. A. Mitkas, “A Scalable Grid Computing Framework for Extensible Phylogenetic Profile Construction,” in Artificial Intelligence Applications and Innovations, Cham, 2016, pp. 455–462, doi: 10.1007/978-3-319-44944-9_39.
Current research in Life Sciences has without doubt been established as a Big Data discipline. Beyond the expected domain-specific requirements, this perspective has made scalability one of the most crucial aspects of any state-of-the-art bioinformatics framework. Sequence alignment and construction of phylogenetic profiles are common tasks evident in a wide range of life science analyses as, given an arbitrarily big volume of genomes, they can provide useful insights on the functionality and relationships of the involved entities. This process is often a computational bottleneck in existing solutions, due to its inherent complexity. Our proposed distributed framework manages to perform both tasks with significant speed-up by employing Grid Computing resources provided by EGI in an efficient and optimal manner. The overall workflow is both fully automated, thus making it user friendly, and fully detached from the end-user's terminal, since all computations take place on Grid worker nodes.
- F. E. Psomopoulos, A. M. Kintsakis, and P. A. Mitkas, “A pan-genome approach and application to species with photosynthetic capabilities,” Sep. 2016, doi: 10.7490/f1000research.1112964.1.
The abundance of genome data being produced by the new sequencing techniques is providing the opportunity to investigate gene diversity at a new level. A pan-genome analysis can provide the framework for estimating the genomic diversity of the data set at hand and give insights towards the understanding of its observed characteristics. Currently, there exist several tools for pan-genome studies, mostly focused on prokaryote genomes and their respective attributes. Here we provide a systematic approach for constructing the groups inherently associated with a pan-genome analysis, using the complete proteome data of photosynthetic genomes as the driving case study. As opposed to similar studies, the presented method requires a complete information system (i.e. complete genomes) in order to produce meaningful results. The method was applied to 95 genomes with photosynthetic capabilities, including cyanobacteria and green plants, as retrieved from UniProt and Plaza. Due to the significant computational requirements of the analysis, we utilized the Federated Cloud computing resources provided by the EGI infrastructure. The analysis ultimately produced 37,680 protein families, with a core genome comprising 102 families. An investigation of the families’ distribution revealed two underlying but expected subsets, roughly corresponding to bacteria and eukaryotes. Finally, an automated functional annotation of the produced clusters, through assignment of PFAM domains to the participating protein sequences, allowed the identification of the key characteristics present in the core genome, as well as of selected multi-member families.
- F. E. Psomopoulos, E. Korpelainen, K. Mattila, and D. Scardaci, “Bioinformatics resources on EGI Federated Cloud,” Sep. 2016.
Data can be “big” for three reasons, often referred to as the three V’s: volume of data, velocity of processing the data, and variability of data sources. If any of these key features are present, then big-data tools are necessary, often combined with high network bandwidth and massive compute systems. As NGS technologies are revolutionizing life science research, established workflows facilitating the first steps in data analysis are increasingly being employed. Cloud computing provides a robust and cost-efficient solution towards supporting the computational demands of such workflows. In particular, NGS data analysis tools are constantly becoming available as resources within EGI’s Federated Cloud. The European Grid Infrastructure (EGI) is the result of pioneering work that has, over the last decade, built a collaborative production infrastructure of uniform services through the federation of national resource providers that supports multi-disciplinary science across Europe and around the world. EGI currently supports an extensive list of services available for life sciences and has been working together with the community to implement further support. The EGI Federated Cloud (FedCloud), the latest infrastructure and technological offering of EGI, is a prime example of a flexible environment to support both discipline and use case through Big Data services. Finally, in addition to providing access to advanced tools and applications, e-infrastructures like EGI provide the opportunity to create training tools for life science researchers and to create synergies between life sciences and ICT researchers, which is fundamental in moving research forward.
2015
- O. T. Vrousgou, F. E. Psomopoulos, and P. A. Mitkas, “A Grid-Enabled Modular Framework for Efficient Sequence Analysis Workflows,” in Engineering Applications of Neural Networks, Cham, 2015, pp. 47–56, doi: 10.1007/978-3-319-23983-5_5.
In the era of Big Data in Life Sciences, efficient processing and analysis of vast amounts of sequence data is becoming an ever more daunting challenge. Among such analyses, sequence alignment is one of the most commonly used procedures, as it provides useful insights on the functionality and relationship of the involved entities. Sequence alignment is one of the most common computational bottlenecks in several bioinformatics workflows. We have designed and implemented a time-efficient distributed modular application for sequence alignment, phylogenetic profiling and clustering of protein sequences, by utilizing the European Grid Infrastructure. The optimal utilization of the Grid with regards to the respective modules allowed us to achieve significant speedups on the order of 1400%.
- F. Psomopoulos, O. Vrousgou, and P. Mitkas, “Large-scale modular comparative genomics: the Grid approach,” Jul. 2015, doi: 10.7490/f1000research.1110127.1.
In the era of Big Data in Life Sciences, efficient processing and analysis of vast amounts of sequence data is becoming an ever more daunting challenge. Among such analyses, sequence alignment is one of the most commonly used procedures, as it provides useful insights on the functionality and relationship of the involved entities. At the same time, however, it is one of the most common computational bottlenecks in several bioinformatics workflows, especially when combined with the construction of families and phylogenetic profiles. Current approaches in Life Science research favour the use of established workflows which have been proven to facilitate the first steps in data analysis (such as BLAST, MCL, phylogenetic profiles, etc.). However, such workflows are computationally expensive and usually do not scale well. The guiding requirements for our approach are: a) maximizing the efficiency of a given workflow using the computational resources provided by the European Grid Infrastructure (EGI, http://www.egi.eu/), b) providing an automated approach and therefore a more user-friendly interface for researchers with no technical experience, and c) using the established (vanilla) applications and tools in order to maintain backwards compatibility and ease maintenance, which is a common issue in most custom approaches. We have designed and implemented a time-efficient distributed modular application for sequence alignment, phylogenetic profiling and clustering of protein sequences, by utilizing the European Grid Infrastructure. Specifically, the application comprises three main components: a) BLAST alignment, b) construction of phylogenetic profiles based on the produced alignment scores, and c) clustering of entities using the MCL algorithm. These modules have been selected as they represent a common aspect of a vast majority of bioinformatics workflows.
It is important to note that the modules can be combined independently, ultimately providing four different modes of operation:
1. MCL clusters of the protein query and database sequences, where the clustering criterion is the BLAST output (identity or e-value), based on the preference of the user.
2. Phylogenetic profiles of each query sequence, where the genomes taken into consideration are the ones whose proteins form the database.
3. MCL clusters of the protein query sequences and database genomes, together with phylogenetic profiles; here the MCL clustering criterion is the phylogenetic profiles.
4. A combination of the output produced in modes 1 and 3.
There is also a fifth mode that generates the same output as the fourth one, with the only difference being that the same file is used both as database and query. This is the case of an all-vs-all sequence comparison, widely used when performing a pan-genome analysis. Our proposed framework distributes both processes and data across the provided resources. The distribution is performed automatically, based on the selected mode as well as the data under study. The required input comprises the following files:
- two files containing the query protein sequences and the database protein sequences to be aligned, in FASTA format, a text-based format for representing nucleotide or peptide sequences,
- a text file listing the genomes whose protein sequences form the database file, and
- a configuration file specifying the mode in which the application should run.
We have evaluated the application through several different scenarios, ranging from targeted investigations of enzymes participating in selected pathways against a custom database to produce functional groups, to large-scale comparisons at the pan-genome level.
In all cases, the optimal utilization of the Grid with regards to the respective modules allowed us to achieve a significant speedup, on the order of 14x with respect to traditional approaches. Source code is available at https://github.com/BioDAG/BPM
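The profile-construction component (b) described above can be sketched as follows (a minimal Python analogue of the idea, not the BioDAG/BPM code; the e-value cutoff, names and data layout are illustrative assumptions):

```python
def phylogenetic_profiles(hits, genomes, evalue_cutoff=1e-5):
    """Build binary phylogenetic profiles from alignment hits.

    hits:    list of (query, genome, evalue) tuples, e.g. parsed BLAST output.
    genomes: ordered list of genome identifiers (the profile columns).
    Returns dict query -> 0/1 vector: 1 if the query has a hit in that
    genome passing the cutoff, else 0.
    """
    profiles = {}
    for query, genome, evalue in hits:
        # A row is created on the first hit, even if that hit fails the cutoff.
        row = profiles.setdefault(query, [0] * len(genomes))
        if evalue <= evalue_cutoff:
            row[genomes.index(genome)] = 1
    return profiles

genomes = ["eco", "bsu", "sce"]
hits = [("q1", "eco", 1e-30), ("q1", "sce", 1e-3), ("q2", "bsu", 1e-12)]
profiles = phylogenetic_profiles(hits, genomes)
# q1 -> [1, 0, 0] (the sce hit fails the cutoff), q2 -> [0, 1, 0]
```

Queries with identical or similar profiles would then be grouped by the MCL clustering component (c), since co-occurrence across genomes suggests functional relationship.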