Our research activities

Our projects fall into the following areas:

Machine Learning  


Core Bioinformatics tool development  


Machine Learning

Machine Learning approaches are widely used in this era of high-dimensional biological “Big Data” to derive actionable biological and potentially therapeutic insights.

Approaches we apply:

  • Principal Component Analysis (PCA) and generalized Principal Component Analysis (GLM-PCA)
  • Random Forest
  • Support vector machine
  • Cluster Analysis
  • t-SNE / UMAP
  • Feature selection
  • Ensemble methods (metaDIEA)

Leading efforts on standards:


Next Generation Sequencing (NGS) pipelines generate large volumes of biological data. Analyzing omics data usually yields a high-dimensional matrix, with the different cases listed as columns and the genomic locations at which the examined event occurred (e.g. mutation, gene expression) as rows.
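The matrix layout described above can be illustrated with a minimal Python sketch. The function name `build_event_matrix`, the record format, and the toy values are assumptions made for illustration only; they are not taken from any of our pipelines.

```python
def build_event_matrix(records):
    """Pivot (sample, locus, value) records into a loci x samples matrix.

    Rows are genomic loci and columns are cases (samples), matching the
    layout described in the text. Missing combinations are filled with 0.0.
    """
    samples = sorted({sample for sample, _, _ in records})
    loci = sorted({locus for _, locus, _ in records})
    lookup = {(locus, sample): value for sample, locus, value in records}
    matrix = [
        [lookup.get((locus, sample), 0.0) for sample in samples]
        for locus in loci
    ]
    return loci, samples, matrix


# Hypothetical toy records: (sample id, gene or locus, measured value)
records = [
    ("patient_1", "TP53", 5.2),
    ("patient_2", "TP53", 4.8),
    ("patient_1", "NOTCH1", 1.1),
    ("patient_3", "SF3B1", 7.4),
]

loci, samples, matrix = build_event_matrix(records)
for locus, row in zip(loci, matrix):
    print(locus, row)
```

In real data the number of rows (loci) typically far exceeds the number of columns (cases), which is why dimensionality reduction and feature selection are usually applied before modelling.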

Relevant computational approaches in the biomedical domain include:

  1. High throughput immunogenetic analysis in health and disease
  2. Integrated omics analysis of hematologic malignancies: DNA methylome and Transcriptome Profiling in patients with hematologic malignancies treated with chemotherapy versus novel agents; Whole exome sequencing of pre-malignant conditions
  3. Targeted NGS analysis for improved risk stratification in cancer: novel recurrent gene mutations in hematologic malignancies

Analysis of raw NGS data (.fastq)

  • Immuno-sequencing (IMGT)
  • RNAseq (HISAT2)
  • WES (GATK)
  • ChIP-seq (MACS2)

Downstream analysis

  • Immunogenetics analysis (TRIP - T cell receptor/immunoglobulin profiler, IgIDivA)
  • Differential analysis (DESeq2, limma)
  • Enrichment analysis (Over Representation Analysis/GSEA, KEGG, GO, Reactome, WikiPathways, Appyters)
  • Transcription factor binding site analysis (PWA)
  • Network-based Bayesian inference of signaling drivers and transcription factors (NetBID, Cytoscape, STRINGdb)
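To make the Over Representation Analysis step in the list above more concrete, here is a minimal sketch of the upper-tail hypergeometric test that ORA is commonly based on. The function name `ora_p_value` and the example numbers are hypothetical; this is not the code of any tool listed above, and real analyses would also apply multiple-testing correction.

```python
from math import comb


def ora_p_value(n_background, n_in_set, n_selected, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_background: number of genes in the background (universe)
    n_in_set:     number of background genes annotated to the gene set
    n_selected:   number of genes in the list of interest (e.g. DE genes)
    n_overlap:    number of selected genes that belong to the gene set
    """
    max_overlap = min(n_in_set, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_background - n_in_set, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical example: 20,000 background genes, a pathway with 100 genes,
# 200 differentially expressed genes, 5 of which fall in the pathway.
p = ora_p_value(20_000, 100, 200, 5)
print(f"ORA p-value: {p:.4g}")
```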

Data Integration

We developed a fast framework, called InterTADs, for integrating multi-omics data from the same physical source (e.g. a patient) while taking into account the chromatin configuration of the genome, i.e. the topologically associating domains (TADs). Check it out on our GitHub.
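The general idea of grouping events from different omics layers by the TAD that contains them can be sketched as follows. This is a simplified illustration under stated assumptions (non-overlapping TADs, half-open coordinates, hypothetical names and toy data), not the InterTADs implementation.

```python
from bisect import bisect_right


def assign_to_tads(tads, events):
    """Group events by the TAD whose interval contains their position.

    tads:   list of (chrom, start, end, tad_id); assumed non-overlapping,
            with half-open intervals [start, end)
    events: list of (chrom, position, label) from any omics layer
    Returns a dict mapping tad_id -> list of event labels.
    Events that fall outside every TAD are skipped.
    """
    by_chrom = {}
    for chrom, start, end, tad_id in sorted(tads):
        by_chrom.setdefault(chrom, []).append((start, end, tad_id))
    starts = {c: [s for s, _, _ in ivs] for c, ivs in by_chrom.items()}

    grouped = {}
    for chrom, pos, label in events:
        intervals = by_chrom.get(chrom, [])
        # Index of the last TAD starting at or before the event position
        idx = bisect_right(starts.get(chrom, []), pos) - 1
        if idx >= 0 and pos < intervals[idx][1]:
            grouped.setdefault(intervals[idx][2], []).append(label)
    return grouped


# Hypothetical toy data
tads = [
    ("chr1", 0, 1000, "TAD1"),
    ("chr1", 1000, 2500, "TAD2"),
    ("chr2", 0, 500, "TAD3"),
]
events = [
    ("chr1", 150, "expr:GENE_A"),
    ("chr1", 900, "meth:cg0001"),
    ("chr1", 1200, "expr:GENE_B"),
    ("chr2", 700, "meth:cg0002"),  # outside every TAD, so it is skipped
]
grouped = assign_to_tads(tads, events)
print(grouped)
```

Once events are grouped this way, signals from different layers (for example expression and methylation) can be compared within the same domain.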

Single-cell RNA-seq (scRNA-seq) analysis: A Systems approach

We are assembling a computational pipeline with various in silico tools for the analysis of scRNA-seq data to facilitate the inference of actionable biological information for Systems Immunology studies. Some indicative tools from our pipeline are cited below:

  1. Pre-processing (batch FASTQ to count matrices) (STARsolo)
  2. Basic downstream analysis (quality control, normalization, selection of variable features, scaling, dimensionality reduction, integration/batch correction, clustering, visualization) (Seurat, Monocle3, Scanpy)
  3. Functional studies:
    • Cell annotation (SingleR)
    • Trajectory pseudotime (SCORPIUS, Monocle3)
    • Gene regulatory networks (SCENIC)
    • Pathway enrichment (VAM)
    • Cell-cell interactions (CellChat, NicheNet)
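As a small illustration of the normalization step mentioned in the basic downstream analysis above, the sketch below applies library-size scaling followed by a log(1 + x) transform, a widely used approach for scRNA-seq counts. The function name, the `target_sum` default, and the toy counts are assumptions for illustration; in practice we rely on the toolkits listed above rather than hand-written code.

```python
from math import log1p


def normalize_counts(counts, target_sum=10_000):
    """Library-size normalize and log-transform a cells x genes count table.

    counts: list of per-cell lists of raw gene counts
    Each cell is scaled so that its counts sum to target_sum, then
    log(1 + x) is applied. Cells with zero total counts stay all-zero.
    """
    normalized = []
    for cell in counts:
        total = sum(cell)
        if total == 0:
            normalized.append([0.0] * len(cell))
            continue
        scale = target_sum / total
        normalized.append([log1p(c * scale) for c in cell])
    return normalized


# Hypothetical toy matrix: 3 cells x 4 genes
raw = [
    [10, 0, 5, 5],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
]
norm = normalize_counts(raw)
for row in norm:
    print([round(v, 3) for v in row])
```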

Pan-genome analysis and microbial Genome-wide Association studies (GWAS)

We are implementing a computational workflow that integrates various software tools to detect significant associations between genomic features, such as homologous gene clusters or SNPs, and phenotypic traits of interest (e.g., antimicrobial resistance) across different microorganisms. Some of the steps of our workflow are listed below:

  • Pipeline for SNP calling in core genes (100% read coverage) (BWA-MEM, GATK4 HaplotypeCaller, samtools, bcftools, vcftools)
  • SNP concatenation & phylogenetic tree (RAxML)
  • Genome assembly (SPAdes, Quast)
  • Annotation of genes related to antibiotic resistance, virulence, and stress (AMRFinderPlus)
  • Genome annotation (PROKKA)
  • Pan-genome analysis (Panaroo)
  • Genome-wide association study (pyseer)
  • Microbial gene annotation (InterProScan)
  • Custom R scripts for further analysis and interpretation of the results

Core Bioinformatics tool development

We are also active in developing tailored bioinformatics tools that address specific challenges. These include:

  1. NGS wastewater analysis of SARS-CoV-2 mutations (lineagespot)
  2. Demultiplexing of UMIs (Umic)
  3. Tools for analyzing miRNA data (mirkit)
  4. Using k-mer based representation of omics data (Goedel, k-taxa, kmeranalyzer)
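To illustrate what a k-mer based representation of sequence data looks like, here is a minimal sketch that counts overlapping k-mers in a DNA string. It is a generic illustration with a hypothetical function name and toy input; it does not reproduce how Goedel, k-taxa or kmeranalyzer work internally.

```python
from collections import Counter

DNA_ALPHABET = set("ACGT")


def kmer_counts(sequence, k):
    """Count overlapping k-mers in a DNA sequence.

    Windows containing characters other than A, C, G, T (e.g. N) are skipped.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) <= DNA_ALPHABET:
            counts[kmer] += 1
    return counts


# Hypothetical toy sequence
profile = kmer_counts("ACGTACGT", 3)
print(profile.most_common())
```

The resulting count vector has a fixed vocabulary of at most 4^k entries regardless of sequence length, which makes it convenient as input to the machine learning methods listed earlier.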


Introduction to Machine Learning: Opportunities for advancing omics data analysis


Machine learning has emerged as a discipline that enables computers to assist humans in making sense of large and complex data sets. With the drop in the cost of sequencing technologies, large amounts of omics data are being generated and made accessible to researchers. Analyzing these complex, high-volume data is not trivial, and classical tools cannot exploit their full potential. Machine learning can thus be very useful in mining large omics datasets to uncover new insights that can advance the field of medicine and improve health care.

The aim of this tutorial is to introduce participants to the Machine Learning (ML) taxonomy and common machine learning algorithms. The tutorial will cover the methods used to analyze different omics data sets, providing a practical context through basic but widely used R and Python libraries. The tutorial will comprise a number of hands-on exercises and challenges, through which participants will acquire a first understanding of the standard ML processes as well as practical skills in applying them to familiar problems and publicly available real-world data sets.

ELIXIR / CODATA-RDA Research Data Science Advanced Workshop on Bioinformatics


As part of the CODATA-RDA Research Data Science Summer School in Trieste (6-17 Aug 2018 and 5-16 Aug 2019), ELIXIR Training contributed to the organization of the flanking Advanced Bioinformatics workshop (20-24 Aug 2018 and 19-23 Aug 2019), with a particular focus on Machine Learning applications, as part of the ELIXIR Implementation Study on Learning Paths.

This advanced bioinformatics course provided an overview of the current status of different NGS workflows (variant calling, RNA-Seq, ChIP-Seq, Metagenomics etc.) and combined them with the appropriate Machine Learning and Data Mining approaches. After providing a strong foundation in the underlying theory and concepts, the course relied heavily on hands-on exercises and tutorials so that participants could directly apply and practice the methods and techniques presented each day.

The course was led by me (INAB/CERTH, ELIXIR Greece), with Amel Ghouila (Institut Pasteur de Tunis / H3ABioNet), Gabriele Schweikert (Cyber Valley Initiative, University of Tübingen, DE / Computational Biology, University of Dundee, UK) and Phelelani Mpangase (Sydney Brenner Institute for Molecular Bioscience, University of the Witwatersrand, Johannesburg, South Africa) as co-instructors. Moreover, the daily training activities were supported by three helpers: Maria Tsagiopoulou (INAB / CERTH), Ola Karrar (University of Khartoum, and a past learner of the CODATA-RDA Research Data Science Summer School), and David Helekal (University of Dundee).

Workshop on Reproducible analysis and Research Transparency


This workshop was part of the Open Science Tools, Data & Technologies for Efficient Ecological & Evolutionary Research Symposium, organized by NIOO-KNAW and DANS-KNAW on 7 & 8 December 2017 at the Amsterdam Science Park.

This workshop provided an overview of the current state of reproducible analysis as a means of achieving transparency in research. The workshop covered methodological topics (such as the use of the Open Science Framework and reporting guidelines) as well as software tools (such as Git, Docker, RMarkdown / knitr and Jupyter). Going beyond simple listings and presentations, the workshop focused on hands-on skill building, with exercises and tutorials covering most of the software aspects.