- A. C. Dimopoulos, K. Koukoutegos, F. E. Psomopoulos, and P. Moulos, “Combining Multiple RNA-Seq Data Analysis Algorithms Using Machine Learning Improves Differential Isoform Expression Analysis,” Methods and Protocols, vol. 4, no. 4, 2021, doi: 10.3390/mps4040068.
RNA sequencing has become the standard technique for high resolution genome-wide monitoring of gene expression. As such, it often comprises the first step towards understanding complex molecular mechanisms driving various phenotypes, spanning organ development to disease genesis, monitoring and progression. An advantage of RNA sequencing is its ability to capture complex transcriptomic events such as alternative splicing which results in alternate isoform abundance. At the same time, this advantage remains algorithmically and computationally challenging, especially with the emergence of even higher resolution technologies such as single-cell RNA sequencing. Although several algorithms have been proposed for the effective detection of differential isoform expression from RNA-Seq data, no widely accepted golden standards have been established. This fact is further compounded by the significant differences in the output of different algorithms when applied on the same data. In addition, many of the proposed algorithms remain scarce and poorly maintained. Driven by these challenges, we developed a novel integrative approach that effectively combines the most widely used algorithms for differential transcript and isoform analysis using state-of-the-art machine learning techniques. We demonstrate its usability by applying it on simulated data based on several organisms, and using several performance metrics; we conclude that our strategy outperforms the application of the individual algorithms. Finally, our approach is implemented as an R Shiny application, with the underlying data analysis pipelines also available as docker containers.
- M. Tsagiopoulou et al., “miRkit: R Framework Analyzing miRNA PCR Array Data,” BMC Research Notes, vol. 14, no. 376, Sep. 2021, doi: 10.1186/s13104-021-05788-1.
- I. Walsh et al., “DOME: recommendations for supervised machine learning validation in biology,” Nature Methods, Jul. 2021, doi: 10.1038/s41592-021-01205-4.
- S. Ntoufa et al., “RPS15 mutations rewire RNA translation in chronic lymphocytic leukemia,” Blood Advances, vol. 5, no. 13, pp. 2788–2792, Jul. 2021, doi: 10.1182/bloodadvances.2020001717.
- N. Pechlivanis, A. Togkousidis, M. Tsagiopoulou, S. Sgardelis, I. Kappas, and F. Psomopoulos, “A Computational Framework for Pattern Detection on Unaligned Sequences: An Application on SARS-CoV-2 Data,” Frontiers in Genetics, vol. 12, May 2021, doi: 10.3389/fgene.2021.618170.
- M. Tsagiopoulou et al., “UMIc: A Preprocessing Method for UMI Deduplication and Reads Correction,” Frontiers in Genetics, vol. 12, May 2021, doi: 10.3389/fgene.2021.660366.
- K. Gemenetzi et al., “Higher-order immunoglobulin repertoire restrictions in CLL: the illustrative case of stereotyped subsets 2 and 169,” Blood, vol. 137, no. 14, pp. 1895–1904, Apr. 2021, doi: 10.1182/blood.2020005216.
- M. Velegraki et al., “Increased proportion and altered properties of intermediate monocytes in the peripheral blood of patients with lower risk Myelodysplastic Syndrome,” Blood Cells, Molecules, and Diseases, vol. 86, p. 102507, Feb. 2021, doi: 10.1016/j.bcmd.2020.102507.
- M. Gerousi et al., “The Calcitriol/Vitamin D Receptor System Regulates Key Immune Signaling Pathways in Chronic Lymphocytic Leukemia,” Cancers, vol. 13, no. 2, 2021, doi: 10.3390/cancers13020285.
It has been proposed that vitamin D may play a role in prevention and treatment of cancer while epidemiological studies have linked vitamin D insufficiency to adverse disease outcomes in various B cell malignancies, including chronic lymphocytic leukemia (CLL). In this study, we sought to obtain deeper biological insight into the role of vitamin D and its receptor (VDR) in the pathophysiology of CLL. To this end, we performed expression analysis of the vitamin D pathway molecules; complemented by RNA-Sequencing analysis in primary CLL cells that were treated in vitro with calcitriol, the biologically active form of vitamin D. In addition, we examined calcitriol effects ex vivo in CLL cells cultured in the presence of microenvironmental signals, namely anti-IgM/CD40L, or co-cultured with the supportive HS-5 cells; and, CLL cells from patients under ibrutinib treatment. Our study reports that the calcitriol/VDR system is functional in CLL regulating signaling pathways critical for cell survival and proliferation, including the TLR and PI3K/AKT pathways. Moreover, calcitriol action is likely independent of the microenvironmental signals in CLL, since it was not significantly affected when combined with anti-IgM/CD40L or in the context of the co-culture system. This finding was also supported by our finding of preserved calcitriol signaling capacity in CLL patients under ibrutinib treatment. Overall, our results indicate a relevant biological role for vitamin D in CLL pathophysiology and allude to the potential clinical utility of vitamin D supplementation in patients with CLL.
- A. Agathangelidis et al., “Infrequent ‘chronic lymphocytic leukemia-specific’ immunoglobulin stereotypes in aged individuals with or without low-count monoclonal B-cell lymphocytosis,” Haematologica, vol. 106, no. 4, pp. 1178–1181, Jun. 2020, doi: 10.3324/haematol.2020.247908.
- A. Vardi et al., “T-Cell Dynamics in Chronic Lymphocytic Leukemia under Different Treatment Modalities,” Clinical Cancer Research, vol. 26, no. 18, pp. 4958–4969, 2020, doi: 10.1158/1078-0432.CCR-19-3827.
Purpose: Using next-generation sequencing (NGS), we recently documented T-cell oligoclonality in treatment-naïve chronic lymphocytic leukemia (CLL), with evidence indicating T-cell selection by restricted antigens. Experimental Design: Here, we sought to comprehensively assess T-cell repertoire changes during treatment in relation to (i) treatment type [fludarabine-cyclophosphamide-rituximab (FCR) versus ibrutinib (IB) versus rituximab-idelalisib (R-ID)], and (ii) clinical response, by combining NGS immunoprofiling, flow cytometry, and functional bioassays. Results: T-cell clonality significantly increased at (i) 3 months in the FCR and R-ID treatment groups, and (ii) over deepening clinical response in the R-ID group, with a similar trend detected in the IB group. Notably, in contrast to FCR, which induced T-cell repertoire reconstitution, B-cell receptor signaling inhibitors (BcRi) preserved pretreatment clones. Extensive comparisons both within CLL as well as against T-cell receptor sequence databases showed little similarity with other entities, but instead revealed major clonotypes shared exclusively by patients with CLL, alluding to selection by conserved CLL-associated antigens. We then evaluated the functional effect of treatments on T cells and found that (i) R-ID upregulated the expression of activation markers in effector memory T cells, and (ii) both BcRi improved antitumor T-cell immune synapse formation, in marked contrast to FCR. Conclusions: Taken together, our NGS immunoprofiling data suggest that BcRi retain T-cell clones that may have developed against CLL-associated antigens. Phenotypic and immune synapse bioassays support a concurrent restoration of functionality, mostly evident for R-ID, arguably contributing to clinical response.
- C. C. Austin et al., “Fostering global data sharing: highlighting the recommendations of the Research Data Alliance COVID-19 working group [version 1; peer review: 1 approved, 2 approved with reservations],” Wellcome Open Research, vol. 5, no. 267, 2020, doi: 10.12688/wellcomeopenres.16378.1.
- M. T. Kotouza et al., “TRIP - T cell receptor/immunoglobulin profiler,” BMC Bioinformatics, vol. 21, no. 422, Sep. 2020, doi: 10.1186/s12859-020-03669-1.
- A.-C. Vagiona, M. A. Andrade-Navarro, F. Psomopoulos, and S. Petrakis, “Dynamics of a Protein Interaction Network Associated to the Aggregation of polyQ-Expanded Ataxin-1,” Genes, vol. 11, no. 10, p. 1129, Sep. 2020, doi: 10.3390/genes11101129.
- F. E. Psomopoulos, J. van Helden, C. Médigue, A. Chasapi, and C. A. Ouzounis, “Ancestral state reconstruction of metabolic pathways across pangenome ensembles,” 2020, doi: 10.1099/mgen.0.000429.
As genome sequencing efforts are unveiling the genetic diversity of the biosphere with an unprecedented speed, there is a need to accurately describe the structural and functional properties of groups of extant species whose genomes have been sequenced, as well as their inferred ancestors, at any given taxonomic level of their phylogeny. Elaborate approaches for the reconstruction of ancestral states at the sequence level have been developed, subsequently augmented by methods based on gene content. While these approaches of sequence or gene-content reconstruction have been successfully deployed, there has been less progress on the explicit inference of functional properties of ancestral genomes, in terms of metabolic pathways and other cellular processes. Herein, we describe PathTrace, an efficient algorithm for parsimony-based reconstructions of the evolutionary history of individual metabolic pathways, pivotal representations of key functional modules of cellular function. The algorithm is implemented as a five-step process through which pathways are represented as fuzzy vectors, where each enzyme is associated with a taxonomic conservation value derived from the phylogenetic profile of its protein sequence. The method is evaluated with a selected benchmark set of pathways against collections of genome sequences from key data resources. By deploying a pangenome-driven approach for pathway sets, we demonstrate that the inferred patterns are largely insensitive to noise, as opposed to gene-content reconstruction methods. In addition, the resulting reconstructions are closely correlated with the evolutionary distance of the taxa under study, suggesting that a diligent selection of target pangenomes is essential for maintaining cohesiveness of the method and consistency of the inference, serving as an internal control for an arbitrary selection of queries. 
The PathTrace method is a first step towards the large-scale analysis of metabolic pathway evolution and our deeper understanding of functional relationships reflected in emerging pangenome collections.
- K. T. Gurwitz et al., “A framework to assess the quality and impact of bioinformatics training across ELIXIR,” PLOS Computational Biology, vol. 16, no. 7, pp. 1–12, Jul. 2020, doi: 10.1371/journal.pcbi.1007976.
ELIXIR is a pan-European intergovernmental organisation for life science that aims to coordinate bioinformatics resources in a single infrastructure across Europe; bioinformatics training is central to its strategy, which aims to develop a training community that spans all ELIXIR member states. In an evidence-based approach for strengthening bioinformatics training programmes across Europe, the ELIXIR Training Platform, led by the ELIXIR EXCELERATE Quality and Impact Assessment Subtask in collaboration with the ELIXIR Training Coordinators Group, has implemented an assessment strategy to measure quality and impact of its entire training portfolio. Here, we present ELIXIR’s framework for assessing training quality and impact, which includes the following: specifying assessment aims, determining what data to collect in order to address these aims, and our strategy for centralised data collection to allow for ELIXIR-wide analyses. In addition, we present an overview of the ELIXIR training data collected over the past 4 years. We highlight the importance of a coordinated and consistent data collection approach and the relevance of defining specific metrics and answer scales for consortium-wide analyses as well as for comparison of data across iterations of the same course.
- L. Garcia et al., “Ten simple rules for making training materials FAIR,” PLOS Computational Biology, vol. 16, no. 5, pp. 1–9, May 2020, doi: 10.1371/journal.pcbi.1007854.
Author summary: Everything we do today is becoming more and more reliant on the use of computers. The field of biology is no exception; but most biologists receive little or no formal preparation for the increasingly computational aspects of their discipline. In consequence, informal training courses are often needed to plug the gaps; and the demand for such training is growing worldwide. To meet this demand, some training programs are being expanded, and new ones are being developed. Key to both scenarios is the creation of new course materials. Rather than starting from scratch, however, it’s sometimes possible to repurpose materials that already exist. Yet finding suitable materials online can be difficult: They’re often widely scattered across the internet or hidden in their home institutions, with no systematic way to find them. This is a common problem for all digital objects. The scientific community has attempted to address this issue by developing a set of rules (which have been called the Findable, Accessible, Interoperable and Reusable [FAIR] principles) to make such objects more findable and reusable. Here, we show how to apply these rules to help make training materials easier to find, (re)use, and adapt, for the benefit of all.
- S. Laidou et al., “Nuclear inclusions of pathogenic ataxin-1 induce oxidative stress and perturb the protein synthesis machinery,” Redox Biology, vol. 32, p. 101458, 2020, doi: 10.1016/j.redox.2020.101458.
Spinocerebellar ataxia type-1 (SCA1) is caused by an abnormally expanded polyglutamine (polyQ) tract in ataxin-1. These expansions are responsible for protein misfolding and self-assembly into intranuclear inclusion bodies (IIBs) that are somehow linked to neuronal death. However, owing to lack of a suitable cellular model, the downstream consequences of IIB formation are yet to be resolved. Here, we describe a nuclear protein aggregation model of pathogenic human ataxin-1 and characterize IIB effects. Using an inducible Sleeping Beauty transposon system, we overexpressed the ATXN1(Q82) gene in human mesenchymal stem cells that are resistant to the early cytotoxic effects caused by the expression of the mutant protein. We characterized the structure and the protein composition of insoluble polyQ IIBs which gradually occupy the nuclei and are responsible for the generation of reactive oxygen species. In response to their formation, our transcriptome analysis reveals a cerebellum-specific perturbed protein interaction network, primarily affecting protein synthesis. We propose that insoluble polyQ IIBs cause oxidative and nucleolar stress and affect the assembly of the ribosome by capturing or down-regulating essential components. The inducible cell system can be utilized to decipher the cellular consequences of polyQ protein aggregation. Our strategy provides a broadly applicable methodology for studying polyQ diseases.
- M. Tsagiopoulou et al., “Chronic lymphocytic leukemias with trisomy 12 show a distinct DNA methylation profile linked to altered chromatin activation,” Haematologica, 2020, doi: 10.3324/haematol.2019.240721.
- A. Agathangelidis et al., “High-throughput analysis of the T cell receptor gene repertoire in low-count monoclonal B cell lymphocytosis reveals a distinct profile from chronic lymphocytic leukemia,” Haematologica, 2020, doi: 10.3324/haematol.2019.221275.
- E. Gavriilaki et al., “Pretransplant Genetic Susceptibility: Clinical Relevance in Transplant-Associated Thrombotic Microangiopathy,” Thrombosis and Haemostasis, vol. 120, no. 04, pp. 638–646, 2020, doi: 10.1055/s-0040-1702225.
- M. T. Kotouza, F. E. Psomopoulos, and P. A. Mitkas, “A dockerized framework for hierarchical frequency-based document clustering on cloud computing infrastructures,” Journal of Cloud Computing, vol. 9, no. 2, pp. 1–17, 2020, doi: 10.1186/s13677-019-0150-y.
Scalable big data analysis frameworks are of paramount importance in the modern web society, which is characterized by a huge number of resources, including electronic text documents. Document clustering is an important field in text mining and is commonly used for document organization, browsing, summarization and classification. Hierarchical clustering methods construct a hierarchy structure that, combined with the produced clusters, can be useful in managing documents, thus making the browsing and navigation process easier and quicker, and providing only relevant information to the users’ queries by leveraging the structure relationships. Nevertheless, the high computational cost and memory usage of baseline hierarchical clustering algorithms render them inappropriate for the vast number of documents that must be handled daily. In this paper, we propose a new scalable hierarchical clustering framework, which uses the frequency of the topics in the documents to overcome these limitations. Our work consists of a binary tree construction algorithm that creates a hierarchy of the documents using three metrics (Identity, Entropy, Bin Similarity), and a branch breaking algorithm which composes the final clusters by applying thresholds to each branch of the tree. The clustering algorithm is followed by a meta-clustering module which makes use of graph theory to obtain insights in the leaf clusters’ connections. The feature vectors representing each document derive from topic modeling. At the implementation level, the clustering method has been dockerized in order to facilitate its deployment on cloud computing infrastructures. Finally, the proposed framework is evaluated on several datasets of varying size and content, achieving significant reduction in both memory consumption and computational time over existing hierarchical clustering algorithms. The experiments also include performance testing on cloud resources using different setups and the results are promising.
- A.-L. Lamprecht et al., “Towards FAIR principles for research software,” Data Science, vol. 2, no. 2, pp. 1–23, 2019, doi: 10.3233/DS-190026.
The FAIR Guiding Principles, published in 2016, aim to improve the findability, accessibility, interoperability and reusability of digital research objects for both humans and machines. Until now the FAIR principles have been mostly applied to research data. The ideas behind these principles are, however, also directly relevant to research software. Hence there is a distinct need to explore how the FAIR principles can be applied to software. In this work, we aim to summarize the current status of the debate around FAIR and software, as basis for the development of community-agreed principles for FAIR research software in the future. We discuss what makes software different from data with regard to the application of the FAIR principles, and which desired characteristics of research software go beyond FAIR. Then we present an analysis of where the existing principles can directly be applied to software, where they need to be adapted or reinterpreted, and where the definition of additional principles is required. Here interoperability has proven to be the most challenging principle, calling for particular attention in future discussions. Finally, we outline next steps on the way towards definite FAIR principles for research software.
- M. Kuzak, J. Harrow, P. A. Martinez, F. E. Psomopoulos, and A. Via, “ELIXIR Europe on the Road to Sustainable Research Software,” Biodiversity Information Science and Standards, vol. 3, p. e37677, 2019, doi: 10.3897/biss.3.37677.
ELIXIR (ELIXIR Europe 2019a) is an intergovernmental organization that brings together life science resources across Europe. These resources include databases, software tools, training materials, cloud storage, and supercomputers. One of the goals of ELIXIR is to coordinate these resources so that they form a single infrastructure. This infrastructure makes it easier for scientists to find and share data, exchange expertise, and agree on best practices. ELIXIR’s activities are divided into the following five areas: Data, Tools, Interoperability, Compute and Training, each known as “platform”. The ELIXIR Tools Platform works to improve the discovery, quality and sustainability of software resources. The Software Development Best Practices task of the Tools Platform aims to raise the quality and sustainability of research software by producing, adopting, and promoting information standards and best practices relevant to the software development life cycle. We have published four (4OSS) simple recommendations to encourage best practices in research software (Jiménez et al. 2017) and the Top 10 metrics for recommended life science software practices (Artaza et al. 2016). The 4OSS simple recommendations are as follows: (1) Develop a publicly accessible open source code from day one, (2) Make software easy to discover by providing software metadata via a popular community registry, (3) Adopt a license and comply with the licenses of third-party dependencies, and (4) Have clear and transparent contribution, governance and communication processes. In order to encourage researchers and developers to adopt the 4OSS recommendations and build FAIR (Findable, Accessible, Interoperable and Reusable) software, the best practices group, in partnership with the ELIXIR Training platform, The Carpentries (Carpentries 2019, ELIXIR Europe 2019b), and other communities, are creating a collection of training materials (Kuzak et al. 2019). 
The next step is to adopt, promote, and recognise these information standards and best practices. The group will address this by (i) developing comprehensive guidelines for software curation, (ii) training researchers and developers towards the adoption of software best practices, and (iii) improving the usability of Tools Platform products. Additionally, a direct outcome of this task will be a software management plan template, connected to a concise description of the guidelines for open research software, and the production of a white paper for the software development management plan for ELIXIR, which can subsequently be used to produce training materials. We will work with the newly formed ReSA (Research Software Alliance) to facilitate the adoption of this plan by the broader community.
- F. F. Parlapani et al., “Bacterial communities and potential spoilage markers of whole blue crab (Callinectes sapidus) stored under commercial simulated conditions,” Food Microbiology, vol. 82, pp. 325–333, 2019, doi: 10.1016/j.fm.2019.03.011.
Bacterial community composition, using 16S Next Generation Sequencing (NGS), and the Volatile Organic Compounds (VOCs) profile of whole blue crabs (Callinectes sapidus) stored at 4 and 10 °C (proper and abuse temperature), simulating real storage conditions, were analyzed. Conventional microbiological and chemical analyses (Total Volatile Base-Nitrogen/TVB-N and Trimethylamine-Nitrogen/TMA-N) were also carried out. The rejection time point was 10 and 6 days for the whole crabs stored at 4 and 10 °C, respectively, as determined by development of unpleasant odors, which coincided with crab death. Initially, the Aerobic Plate Count (APC) was 4.87 log cfu/g and increased by 3 logs at the rejection time. The 16S NGS analysis of DNA extracted directly from the crab tissue (culture-independent method) showed that the initial microbiota of the blue crab mainly consisted of Candidatus Bacilloplasma, while potential pathogens, e.g. Listeria monocytogenes, Pseudomonas aeruginosa and Acinetobacter baumannii, were also found. At the rejection point, bacteria of the Rhodobacteraceae family (52%) and Vibrio spp. (40.2%) dominated at 4 and 10 °C, respectively. TVB-N and TMA-N also increased, reaching higher values at the higher storage temperature. The relative concentrations of some VOCs such as 1-octen-3-ol, trans-2-octenal, trans,trans-2,4-heptadienal, 2-butanone, 3-butanone, 2-heptanone, ethyl isobutyrate, ethyl acetate, ethyl-2-methylbutyrate, ethyl isovalerate, hexanoic acid ethyl ester and indole exhibited an increasing trend during crab storage, making them promising spoilage markers. The composition of microbial communities at different storage temperatures was examined by 16S amplicon meta-barcoding analysis. This kind of analysis, in conjunction with the volatile profile, can be used to explore the microbiological quality and further assist towards the application of the appropriate strategies to extend crab shelf-life and protect consumers’ health.
- A. M. Kintsakis, F. E. Psomopoulos, and P. A. Mitkas, “Reinforcement Learning based scheduling in a workflow management system,” Engineering Applications of Artificial Intelligence, vol. 81, pp. 94–106, 2019, doi: 10.1016/j.engappai.2019.02.013.
Any computational process from simple data analytics tasks to training a machine learning model can be described by a workflow. Many workflow management systems (WMS) exist that undertake the task of scheduling workflows across distributed computational resources. In this work, we introduce a WMS that leverages machine learning to predict workflow task runtime and the probability of failure of task assignments to execution sites. The expected runtime of workflow tasks can be used to approximate the weight of the workflow graph branches in respect to the total workflow workload and the ability to anticipate task failures can discourage task assignments that are unlikely to succeed. We demonstrate that the proposed machine learning models can lead to significantly more informed scheduling decisions that minimize task failures and utilize execution sites more efficiently, thus leading to reduced workflow runtime. Additionally, we train a modified sequence-to-sequence neural network architecture via reinforcement learning to perform scheduling decisions as part of a WMS. Our approach introduces a WMS that can drastically improve its scheduling performance by independently learning over time, without external intervention or reliance on any specific heuristic or optimization technique. Finally, we test our approach in real-world scenarios utilizing computationally demanding and data intensive workflows and evaluate its performance against existing scheduling methodologies traditionally used in WMSes. The performance evaluation outcome confirms that the proposed approach significantly outperforms the other scheduling algorithms in a consistent manner and achieves the best execution runtime with the lowest number of failed tasks and communication costs.
- A. Agathangelidis, F. Psomopoulos, and K. Stamatopoulos, “Stereotyped B Cell Receptor Immunoglobulins in B Cell Lymphomas,” Methods in Molecular Biology: "Lymphoma: Methods and Protocols", pp. 139–155, 2019, doi: 10.1007/978-1-4939-9151-8_7.
Comprehensive analysis of the clonotypic B cell receptor immunoglobulin (BcR IG) gene rearrangement sequences in patients with mature B cell neoplasms has led to the identification of significant repertoire restrictions, culminating in the discovery of subsets of patients expressing highly similar, stereotyped BcR IG. This finding strongly supports selection by common epitopes or classes of structurally similar epitopes in the ontogeny of these tumors. BcR IG stereotypy was initially described in chronic lymphocytic leukemia (CLL), where the stereotyped fraction of the disease accounts for a remarkable one-third of patients. However, subsequent studies showed that stereotyped BcR IG are also present in other neoplasms of mature B cells, including mantle cell lymphoma (MCL) and splenic marginal zone lymphoma (SMZL). Subsequent cross-entity comparisons led to the conclusion that stereotyped IG are mostly “disease-specific,” implicating distinct immunopathogenetic processes. Interestingly, mounting evidence suggests that a molecular subclassification of lymphomas based on BcR IG stereotypy is biologically and clinically relevant. Indeed, particularly in CLL, patients assigned to the same subset due to expressing a particular stereotyped BcR IG display remarkably consistent biological background and clinical course, at least for major and well-studied subsets. Thus, the robust assignment to stereotyped subsets may assist in the identification of mechanisms underlying disease onset and progression, while also refining risk stratification. In this book chapter, we provide an overview of the recent BcR IG stereotypy studies in mature B cell malignancies and outline previous and current methodological approaches used for the identification of stereotyped IG.
- M. Wu, F. Psomopoulos, S. J. Khalsa, and A. de Waard, “Data Discovery Paradigms: User Requirements and Recommendations for Data Repositories,” Data Science Journal, vol. 18, no. 1, p. 13, 2019, doi: 10.5334/dsj-2019-003.
As data repositories make more data openly available it becomes challenging for researchers to find what they need either from a repository or through web search engines. This study attempts to investigate data users’ requirements and the role that data repositories can play in supporting data discoverability by meeting those requirements. We collected 79 data discovery use cases (or data search scenarios), from which we derived nine functional requirements for data repositories through qualitative analysis. We then applied usability heuristic evaluation and expert review methods to identify best practices that data repositories can implement to meet each functional requirement. We propose the following ten recommendations for data repository operators to consider for improving data discoverability and user’s data search experience: 1. Provide a range of query interfaces to accommodate various data search behaviours. 2. Provide multiple access points to find data. 3. Make it easier for researchers to judge relevance, accessibility and reusability of a data collection from a search summary. 4. Make individual metadata records readable and analysable. 5. Enable sharing and downloading of bibliographic references. 6. Expose data usage statistics. 7. Strive for consistency with other repositories. 8. Identify and aggregate metadata records that describe the same data object. 9. Make metadata records easily indexed and searchable by major web search engines. 10. Follow API search standards and community adopted vocabularies for interoperability.
Conferences and Announcements
- N. Pechlivanis, A. Togkousidis, M. C. Maniou, M. Tsagiopoulou, and F. Psomopoulos, “Developing a novel feature space for sequence data analysis; a use-case on SARS-CoV-2 data,” 2021, doi: 10.5281/ZENODO.4897477.
- D. S. Katz et al., “Toward defining and implementing FAIR for research software,” in AGU Fall Meeting Abstracts, Dec. 2020, vol. 2020, pp. IN037–01.
- K. Gemenetzi et al., “Truly unmutated IGHV-IGHD-IGHJ gene rearrangements in CLL: do they really exist?,” in Leukemia & Lymphoma, 2020, vol. 61, pp. 212–213.
- M. T. Kotouza, F. E. Psomopoulos, and P. A. Mitkas, “A Dockerized String Analysis Workflow for Big Data,” in 23rd European Conference on Advances in Databases and Information Systems, ADBIS 2019, Bled, Slovenia, September 8-11, 2019, 2019, pp. 564–569, doi: 10.1007/978-3-030-30278-8_55.
- A. Vardi et al., “PS1131 High-Throughput B-Cell immunoprofiling at diagnosis and relapse offers further evidence of functional selection throughout the natural history of chronic lymphocytic leukemia,” in HemaSphere, 2019, vol. 3, no. S1, p. 512, doi: 10.1097/01.HS9.0000562808.48237.52.
- K. Gemenetzi et al., “VH CDR3-Focused Somatic Hypermutation in CLL IGHV-IGHD-IGHJ Gene Rearrangements with 100% IGHV Germline Identity,” in Blood, Nov. 2019, vol. 134, no. Supplement_1, pp. 4277–4277, doi: 10.1182/blood-2019-127979.
Classification of patients with chronic lymphocytic leukemia (CLL) based on the immunoglobulin heavy variable (IGHV) gene somatic hypermutation (SHM) status has established predictive and prognostic relevance. The SHM status is assessed based on the number of mutations within the sequence of the rearranged IGHV gene excluding the VH CDR3. This is mostly due to the difficulty in discriminating actual SHM from random nucleotides added between the recombined IGHV, IGHD and IGHJ genes. Hence, this approach may underestimate the true impact of SHM, in fact overlooking the most critical region for antigen-antibody interactions i.e. the VH CDR3. Relevant to mention in this respect, studies from our group in CLL with mutated IGHV genes (M-CLL), particularly subset #4, have revealed considerable intra-VH CDR3 diversity attributed to ongoing SHM. Prompted by these findings, here we investigated whether SHM may also be present in cases bearing ‘truly unmutated’ IGHV genes (i.e. 100% germline identity across VH FR1-VH FR3), focusing on two well characterized stereotyped subsets i.e. subset #1 (IGHV clan I/IGHD6-19/IGHJ4) and subset #6 (IGHV1-69/IGHD3-16/IGHJ3). These subsets carry germline-encoded amino acid (aa) motifs within the VH CDR3, namely QWL and YDYVWGSY, originating from the IGHD6-19 and IGHD3-16 gene, respectively. However, in both subsets, cases exist with variations in these motifs that could potentially represent SHM. The present study included 12 subset #1 and 5 subset #6 patients with clonotypic IGHV genes lacking any SHM (100% germline identity). IGHV-IGHD-IGHJ gene rearrangements were RT-PCR amplified by subgroup-specific leader primers and a high-fidelity polymerase in order to ensure high data quality. RT-PCR products were subjected to paired-end NGS on the MiSeq platform. Sequence annotation was performed with IMGT/HighV-QUEST and metadata analysis was undertaken using an in-house purpose-built bioinformatics pipeline. 
Rearrangements with the same IGHV gene and identical VH CDR3 aa sequences were defined as clonotypes. Overall, we obtained 1,570,668 productive reads with V-region identity 99-100\%; of these, 1,232,958 (mean: 102,746, range: 20,796-242,519) concerned subset #1 while 337,710 (mean: 67,542, range: 50,403-79,683) concerned subset #6. On average, 64.4\% (range: 1.7-77.5\%) of subset #1 reads and 49.2\% (range: 0.7-70\%) of subset #6 reads corresponded to rearrangements with IGHV genes lacking any SHM (100\% germline identity). Clonotype computation revealed 1,831 and 1,048 unique clonotypes for subset #1 and #6, respectively. Subset #1 displayed a mean of 157 distinct clonotypes per sample (range: 74-267), with the dominant clonotype having a mean frequency of 96.9\% (range: 96-98.2\%). Of note, 44 clonotypes were shared between different patients (albeit at varying frequencies), including the dominant clonotype of 11/12 cases, which was present in 2-6 additional subset #1 patients. Subset #6 cases carried a higher number of distinct clonotypes per sample (mean: 219, range: 189-243) while the dominant clonotype had a mean frequency of 95.6\% (range: 94.5-96.5\%). Shared clonotypes (n=30) were also identified in subset #6 and the dominant clonotype of each subset #6 case was present in 3-5 additional subset #6 patients. Focusing on the VH CDR3, in particular the IGHD-encoded part, the following observations were made: (1) in both subsets, extensive intra-VH CDR3 variation was detected at certain positions within the IGHD gene; (2) in most cases, the observed aa substitutions were conservative, i.e. concerned aa sharing similar physicochemical properties. 
Particularly noteworthy in this respect were the observations in subset #6 that: (i) the valine residue (V) in the D-derived YDYVWGSY motif was very frequently mutated to another aliphatic residue (A, I, L); (ii) in cases where the predominant clonotype carried I (also in the Sanger-derived sequence), several minor clonotypes carried the germline-encoded V, compelling evidence that the observed substitution concerned true SHM. In conclusion, we provide immunogenetic evidence for intra-VH CDR3 variations, very likely attributed to SHM, in CLL patients carrying ‘truly unmutated’ IGHV genes. While the prognostic/predictive relevance of this observation is beyond the scope of the present work, our findings highlight the possible need to reappraise definitions (‘semantics’) regarding SHM status in CLL. Disclosures: Stamatopoulos: Janssen: Honoraria, Research Funding; Abbvie: Honoraria, Research Funding. Chatzidimitriou: Janssen: Honoraria.
- M. Gerousi et al., “Functional Calcitriol/Vitamin D Receptor Signaling in Chronic Lymphocytic Leukemia,” in Blood, Nov. 2019, vol. 134, no. Supplement_1, pp. 3019–3019, doi: 10.1182/blood-2019-127910.
Calcitriol, the biologically active form of vitamin D, modulates a plethora of cellular processes following ligation of its receptor, the vitamin D receptor (VDR), a nuclear transcription factor that regulates the transcription of diverse genes. It has been proposed that vitamin D may play a role in prevention and treatment of cancer while epidemiological studies have linked vitamin D insufficiency to adverse disease outcome in chronic lymphocytic leukemia (CLL). Recently, we reported that VDR is functional in CLL cells after calcitriol supplementation, as well as after stimulation through both the calcitriol/VDR signaling system and other prosurvival pathways triggered from the tumor microenvironment. In this study, we aimed at investigating key molecules and signaling pathways that are altered after calcitriol treatment and are known to play a relevant role in CLL pathophysiology. CD19+ primary CLL cells were negatively selected from peripheral blood samples of patients that were treatment naïve at the time of sample collection. CLL cells were cultured in vitro with calcitriol or co-cultured with the HS-5 mesenchymal cell line for 24 hours. VDR+, CYP24A1+, phospho-ERK+ and phospho-NF-κB p65+ cells were determined by Flow Cytometry (FC). Total RNA was extracted from calcitriol-treated and non-treated CLL cells, while mRNA selection was performed using NEBNext Poly(A) mRNA Magnetic Isolation Module. Library preparation for RNA-Sequencing (RNA-Seq) analysis was conducted with the NEBNext Ultra II Directional RNA Library Prep Kit. The libraries were paired-end sequenced on the NextSeq 500 Illumina platform. Differential expression analysis was performed using DESeq2; genes with |log2FC|\>1 and P≤0.05 were considered as differentially expressed. RNA-Seq analysis (n=6) confirmed our previous findings that the CYP24A1 gene is significantly upregulated by calcitriol, being the top upregulated gene, whereas the VDR gene remains unaffected by this treatment. 
Overall, 85 genes were differentially expressed in unstimulated versus calcitriol-treated cells, of which 28 were overexpressed in the latter, whereas the remaining 57 showed the opposite pattern. Pathway enrichment and gene ontology (GO) analysis of the differentially expressed genes revealed significant enrichment in PI3K-Akt pathway and Toll-like receptor cascades, as well as in vitamin D metabolism and inflammatory response pathways. Additionally, flow cytometric analysis showed that calcitriol-treated CLL cells displayed increased pERK levels (FD=1.3, p\<0.05) and, in contrast, decreased pNF-κB levels (FD=2.7, p\<0.05), highlighting active VDR signaling in CLL. Aiming at placing our findings in a more physiological context, we co-cultured CLL cells with the HS-5 cell line. Based on our previous finding that co-cultured CLL cells showed induced CYP24A1 levels, we evaluated pNF-κB expression. pNF-κB levels were found to be increased in co-cultured CLL cells (FD=4.2, p\<0.05), while the addition of calcitriol downregulated pNF-κB (FD=1.5, p\<0.05). Moreover, ex vivo calcitriol exposure of CLL cells from patients under ibrutinib treatment (at baseline, +1 and +3-6 months, n=7) resulted in significant upregulation of pERK (FD=1.6, p\<0.01; FD=1.4, p\<0.01; FD=1.9, p\<0.01; for each timepoint respectively) but significant downregulation of pNF-κB (FD=3.4, p\<0.01; FD=3, p\<0.05; FD=2.3, p\<0.05; for each timepoint respectively), indicating preserved calcitriol/VDR signaling capacity. In conclusion, we provide evidence that the calcitriol/VDR system is active in CLL, modulating NF-κB and MAPK signaling as well as the expression of the CYP24A1 target gene. This observation is further supported by RNA-Seq analysis that confirms CYP24A1 upregulation and highlights new signaling pathways that need to be validated. 
Interestingly, the calcitriol/VDR system appears relatively unaffected by either stimulation or inhibition (ibrutinib) of microenvironmental signals that promote CLL cell survival and/or proliferation, indicating context-independent signaling capacity. Disclosures: Kotsianidis: Celgene: Research Funding. Stamatopoulos: Janssen: Honoraria, Research Funding; Abbvie: Honoraria, Research Funding.
- K. Gemenetzi et al., “Higher Order Restrictions of the Immunoglobulin Repertoire in CLL: The Illustrative Case of Stereotyped Subsets #2 and #169,” in Blood, Nov. 2019, vol. 134, no. Supplement_1, pp. 5453–5453, doi: 10.1182/blood-2019-128017.
Stereotyped subset #2 (IGHV3-21/IGLV3-21) is the largest subset in CLL (~3\% of all patients). Membership in subset #2 is clinically relevant since these patients experience an aggressive disease irrespective of the somatic hypermutation (SHM) status of the clonotypic immunoglobulin heavy variable (IGHV) gene. Low-throughput evidence suggests that stereotyped subset #169, a minor CLL subset (~0.2\% of all CLL), resembles subset #2 at the immunogenetic level. More specifically: (i) the clonotypic heavy chain (HC) of subset #169 is encoded by the IGHV3-48 gene which is closely related to the IGHV3-21 gene; (ii) both subsets carry VH CDR3s comprising 9 amino acids (aa) with a conserved aspartic acid (D) at VH CDR3 position 3; (iii) both subsets bear light chains (LC) encoded by the IGLV3-21 gene with a restricted VL CDR3; and, (iv) both subsets have borderline SHM status. Here we comprehensively assessed the ontogenetic relationship between CLL subsets #2 and #169 by analyzing their immunogenetic signatures. Utilizing next-generation sequencing (NGS) we studied the HC and LC gene rearrangements of 6 subset #169 patients and 20 subset #2 cases. In brief, IGHV-IGHD-IGHJ and IGLV-IGLJ gene rearrangements were RT-PCR amplified using subgroup-specific leader primers as well as IGHJ and IGLC primers, respectively. Libraries were sequenced on the MiSeq Illumina instrument. IG sequence annotation was performed with IMGT/HighV-QUEST and metadata analysis conducted using an in-house, validated bioinformatics pipeline. Rearrangements with identical CDR3 aa sequences were herein defined as clonotypes, whereas clonotypes with different aa substitutions within the V-domain were defined as subclones. For the HC analysis of subset #169, we obtained 894,849 productive sequences (mean: 127,836, range: 87,509-208,019). 
On average, each analyzed sample carried 54 clonotypes (range: 44-68); the dominant clonotype had a mean frequency of 99.1\% (range: 98.8-99.2\%) and displayed considerable intraclonal heterogeneity with a mean of 2,641 subclones/sample (range: 1,566-6,533). For the LCs of subset #169, we obtained 2,096,728 productive sequences (mean: 299,533, range: 186,637-389,258). LCs carried a higher number of distinct clonotypes/sample compared to their partner HCs (mean: 148, range: 110-205); the dominant clonotype had a mean frequency of 98.1\% (range: 97.2-98.6\%). Intraclonal heterogeneity was also observed in the LCs, with a mean of 6,325 subclones/sample (range: 4,651-11,444), hence more pronounced than in their partner HCs. Viewing each of the cumulative VH and VL CDR3 sequence datasets as a single entity branching through diversification enabled the identification of common sequences. In particular, 2 VH clonotypes were present in 3/6 cases, while a single VL clonotype was present in all 6 cases, albeit at varying frequencies; interestingly, this VL CDR3 sequence was also detected in all subset #2 cases, underscoring the molecular similarities between the two subsets. 
Focusing on SHM, the following observations were made: (i) the frequent 3-nucleotide (AGT) deletion evidenced in the VH CDR2 of subset #2 (leading to the deletion of one of 5 consecutive serine residues) was also detected in all subset #169 cases at subclonal level (average: 6\% per sample, range: 0.1-10.8\%); of note, the 5-serine stretch is also present in the germline VH CDR2 of the IGHV3-48 gene; (ii) the R-to-G substitution at the VL-CL linker, a ubiquitous SHM in subset #2 and previously reported as critical for IG self-association leading to cell autonomous signaling in this subset, was present in all subset #169 samples as a clonal event with a mean frequency of 98.3\%; and, finally, (iii) the S-to-G substitution at position 6 of the VL CDR3, present in all subset #2 cases (mean: 44.2\%, range: 6.3-87\%), was also found in all #169 samples, representing a clonal event in 1 case (97.2\% of all clonotypes) and a subclonal event in the remaining 5 cases (mean: 0.6\%, range: 0.4-1.1\%). In conclusion, the present high-throughput sequencing data cement the immunogenetic relatedness of CLL stereotyped subsets #2 and #169, further highlighting the role of antigen selection throughout their natural history. These findings also argue for a similar pathophysiology for these subsets that could also be reflected in a similar clonal behavior, with implications for risk stratification. Disclosures: Sutton: Abbvie: Honoraria; Gilead: Honoraria; Janssen: Honoraria. Stamatopoulos: Abbvie: Honoraria, Research Funding; Janssen: Honoraria, Research Funding. Chatzidimitriou: Janssen: Honoraria.
- M. Tsagiopoulou et al., “Genome-Wide Histone Acetylation Profiling in Chronic Lymphocytic Leukemia Reveals a Distinctive Signature in Stereotyped Subset #8,” in Blood, Nov. 2019, vol. 134, no. Supplement_1, pp. 1241–1241, doi: 10.1182/blood-2019-127817.
In CLL, subsets of patients carrying stereotyped B cell receptors (BcR) share similar biological and clinical features independently of IGHV gene somatic hypermutation status. Although the chromatin landscape of CLL as a whole has been recently characterized, it remains largely unexplored in stereotyped cases. Here, we analyzed the active chromatin regulatory landscape of 3 major CLL stereotyped subsets associated with clinical aggressiveness. We performed chromatin-immunoprecipitation followed by sequencing (ChIP-Seq) with an antibody for the H3K27ac histone mark in sorted CLL cells from 19 cases, including clinically aggressive subsets #1 [clan I genes/IGKV1(D)-39, IG-unmutated CLL (U-CLL) (n=3)], #2 [IGHV3-21/IGLV3-21, IG-mutated CLL (M-CLL) (n=3)] and #8 [IGHV4-39/IGKV1(D)-39, U-CLL (n=3)] which we compared to non-stereotyped CLL cases [5 M-CLL, 5 U-CLL]. In addition, a series of 15 normal B cell samples from different stages of B-cell differentiation were analyzed [naive B cells from peripheral blood (n=3), tonsillar naive B cells (n=3), germinal centre (GC) B cells (n=3), memory B cells (n=3), tonsillar plasma cells (n=3)]. Initial unsupervised principal component analysis (PCA) disclosed a distinct chromatin acetylation pattern in CLL, regardless of stereotypy status, versus normal B cells. CLL as a whole was found to be closer to naive and memory B cells rather than GC B cells and plasma cells. Detailed analysis of individual principal components (PC) revealed that PC4, which accounts for 5\% of the total variability, segregated subset #8 cases and GC B cells from other CLLs and normal B cell subpopulations. 
Although PC4 accounts for only a small part of the total variability (5\%), this suggests that subset #8 cases may share some chromatin features with proliferating GC B cells, in line with the fact that subset #8 BcR are IgG-switched. We also investigated whether stereotyped CLLs have different chromatin acetylation features compared to non-stereotyped CLLs matched by IGHV somatic hypermutation status and identified 878 Differential Regions (DR) in subset #8 vs. U-CLL, 84 DR in subset #1 vs. U-CLL and 66 DR in subset #2 vs. M-CLL. As subset #8 cases seemed to have the most distinct profile, we further characterized the detected regions. The 435 and 443 regions gaining and losing activation, respectively, mostly targeted promoters (29.5\%) and regulatory elements located in introns (31\%) and distal intergenic regions (21.8\%). Hierarchical clustering based on the 878 DRs enabled the clear discrimination of subset #8 cases from U-CLL and normal B cells; however, it is worth noting that for several of these 878 DRs the acetylation patterns were shared between subset #8 and normal B cell subpopulations rather than subset #8 and U-CLL. Of note, 11/435 regions gaining activity in subset #8 were found within the gene encoding for the EBF1 transcription factor (TF); additional regions were associated with genes significant to CLL pathogenesis, e.g. TCF4 and E2F1. Moreover, 3 DRs losing activity in subset #8 were located within the CTLA4 gene and 2 DRs within the IL21R gene, which we have recently reported as hypermethylated and not expressed in subset #8. Next, we performed TF binding site analysis with the MEME/AME suite, separately for regions gaining or losing activity, and identified significant enrichment (adj-p\<0.001) for TFs such as AP-1, FOX, GATA and IRF. The regions losing activity in subset #8 showed a higher number of enriched TFs versus those gaining activity (165 vs 93 TFs), particularly displaying enrichment for many HOX family members. 
However, a cluster of TFs with enrichment on TF binding site analysis, such as FOXO1, FOXP1, MEF2D, PRDM1, RUNX1, RXRA, STAT6, were also located within the 878 DRs discriminating subset #8 from either U-CLL or normal B cell subpopulations. Taken together, subset #8 cases have a distinct chromatin acetylation signature which includes both loss and gain of active elements, shared features with proliferating GC B cells, and specific changes in chromatin activity of several genes and TFs relevant to B cell/CLL biology. These findings further underscore the concept that BcR stereotypy defines subsets of patients with consistent biological profile, while they may also be relevant to the particular clinical behavior of subset #8, known to be associated with the highest risk of Richter’s transformation amongst all CLL. Disclosures: Stamatopoulos: Abbvie: Honoraria, Research Funding; Janssen: Honoraria, Research Funding.
- N. Pechlivanis et al., “Detecting SARS-CoV-2 lineages and mutational load in municipal wastewater; a use-case in the metropolitan area of Thessaloniki, Greece.” Cold Spring Harbor Laboratory, Mar. 2021, doi: 10.1101/2021.03.17.21252673.
- R. Alves et al., “ELIXIR Software Management Plan for Life Sciences.” BioHackrXiv, 2021, doi: 10.37044/osf.io/k8znb.
Data Management Plans are now considered a key element of Open Science. They describe the data management life cycle for the data to be collected, processed and/or generated within the lifetime of a particular project or activity. A Software Management Plan (SMP) plays the same role but for software. Beyond its management perspective, the main advantage of an SMP is that it both provides clear context to the software that is being developed and raises awareness. Although there are a few SMPs already available, most of them require significant technical knowledge to be effectively used. ELIXIR has developed a low-barrier SMP, specifically tailored for life science researchers, aligned to the FAIR Research Software principles. Starting from the Four Recommendations for Open Source Software, the ELIXIR SMP was iteratively refined by surveying the practices of the community and incorporating the received feedback. Currently available as a survey, future plans of the ELIXIR SMP include a human- and machine-readable version, that can be automatically queried and connected to relevant tools and metrics within the ELIXIR Tools ecosystem and beyond.
- F. Ballesio et al., “Determining a novel feature-space for SARS-CoV-2 sequence data.” Center for Open Science, 2020, doi: 10.37044/osf.io/xt7gw.
- F. Psomopoulos, C. W. G. van Gelder, P. Kahlem, B. Leskošek, and J. Lindvall, “ELIXIR Training Platform Task 2: Gap analysis, training materials development and training delivery,” F1000Research, vol. 9. 2020.
- M. Tsagiopoulou, N. Pechlivanis, and F. Psomopoulos, “InterTADs: Integration of Multi-Omics Data on Topological Associated Domains.” Aug. 2020, doi: 10.21203/rs.3.rs-54194/v1.
- RDA COVID-19 Working Group, “Recommendations and Guidelines on data sharing,” Research Data Alliance. 2020, doi: 10.15497/rda00052.
- S. Athanasiou et al., “National Plan for Open Science.” Zenodo, Jun. 2020, doi: 10.5281/zenodo.3908953.
- A. Nicolaidis and F. Psomopoulos, “DNA coding and Gödel numbering.” 2019, [Online]. Available at: https://arxiv.org/abs/1909.13574.
- E. A. Becker et al., “datacarpentry/wrangling-genomics: Data Carpentry: Genomics data wrangling and processing, June 2019.” Jun. 2019, doi: 10.5281/zenodo.3260609.