Data & Metadata

Data Cleaning

Bioinformatics tools are very strict about filenames, sample IDs, and metadata.

Even one wrong character can:

  • break pipelines

  • mismatch samples

  • cause failed downstream analyses (e.g. DESeq2, Seurat)

  • make reproducibility impossible

So, we require standardized naming.

❌ Common Naming Problems

Common errors found in file names and metadata:

  • Duplicate sample names (e.g., sample1 appearing twice)

  • Greek characters (α, β, Α, Τ); note that Greek capitals such as Α and Τ look identical to Latin A and T

  • Spaces in names (Sample 1, Control rep 2)

  • Special characters (!, %, @, #, (, ), /)

  • Different names for the same sample (CTRL1, Ctrl1, control_1)
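The problems listed above can be detected automatically before any pipeline runs. The following is a minimal sketch, not a definitive implementation: the function name `find_naming_problems`, the problem labels, and the example names are hypothetical and chosen for illustration only.

```python
import string
from collections import Counter

# ASCII punctuation other than the underscore (assumed to be the only
# punctuation character that is acceptable in a sample name).
SPECIAL_CHARS = set(string.punctuation) - {"_"}


def find_naming_problems(names):
    """Return {name: [problem, ...]} for every name with at least one problem."""
    counts = Counter(names)

    # Group distinct names that differ only by letter case (CTRL1 vs Ctrl1).
    case_groups = {}
    for name in counts:
        case_groups.setdefault(name.lower(), set()).add(name)

    problems = {}
    for name, count in counts.items():
        found = []
        if count > 1:
            found.append("duplicate")
        if not name.isascii():
            found.append("non-ASCII character")
        if any(ch.isspace() for ch in name):
            found.append("whitespace")
        if any(ch in SPECIAL_CHARS for ch in name):
            found.append("special character")
        if len(case_groups[name.lower()]) > 1:
            found.append("case-only variant")
        if found:
            problems[name] = found
    return problems


# Hypothetical sample names, for illustration only.
example_names = ["sample1", "sample1", "Control rep 2", "IFNα_1",
                 "A01-Rep(1)", "CTRL1", "Ctrl1", "control_1", "A01_Rep1"]

for name, found in find_naming_problems(example_names).items():
    print(f"{name!r}: {', '.join(found)}")
```

Comparing lowercased names only catches variants that differ by case. A variant such as control_1 versus Ctrl1 is not detected by this sketch and still needs a manual check.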

✅ Naming Rules

Sample names MUST:

  • use English letters only

  • stay exactly the same in FASTQ files, metadata, and results

  • use only letters, numbers, and underscores (A–Z, a–z, 0–9, _)

    ✅ A01_Rep1

    ❌ A01-Rep(1)

  • be unique

  • be consistent across files and folders
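These rules can be enforced with a short check. The sketch below assumes that sample IDs have already been extracted from the metadata table and from the FASTQ file names; how that extraction is done depends on the project and is not shown. The names `is_valid_name` and `check_sample_ids` and the error messages are hypothetical.

```python
import re
from collections import Counter

# Allowed characters per the naming rules: A-Z, a-z, 0-9 and underscore.
VALID_NAME = re.compile(r"[A-Za-z0-9_]+")


def is_valid_name(name):
    """True if the whole name consists only of allowed characters."""
    return VALID_NAME.fullmatch(name) is not None


def check_sample_ids(metadata_ids, fastq_ids):
    """Return a list of rule violations; an empty list means all checks passed."""
    errors = []
    for name, count in Counter(metadata_ids).items():
        if not is_valid_name(name):
            errors.append(f"invalid characters: {name!r}")
        if count > 1:
            errors.append(f"not unique: {name!r} appears {count} times")

    # Names must match between the metadata and the FASTQ files.
    for name in sorted(set(metadata_ids) - set(fastq_ids)):
        errors.append(f"in metadata but no FASTQ file: {name!r}")
    for name in sorted(set(fastq_ids) - set(metadata_ids)):
        errors.append(f"FASTQ file but not in metadata: {name!r}")
    return errors


# Hypothetical IDs, for illustration only.
for message in check_sample_ids(["A01_Rep1", "A01-Rep(1)"],
                                ["A01_Rep1", "A02_Rep1"]):
    print(message)
```

Running a check like this once, before the analysis starts, is cheaper than tracing a mismatched sample after a downstream tool has failed.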

📄 Project Documentation (The README file)

Standardized names are only useful if the context of the project is known. To prevent the loss of critical project details (for example, who did what, which is hard to reconstruct a year later), every project folder MUST contain a README file.