Data & Metadata
Data Cleaning
Bioinformatics tools are very strict about filenames, sample IDs, and metadata.
Even one wrong character can:
break pipelines
mismatch samples
cause failed downstream analyses (e.g. DESeq2, Seurat)
make reproducibility impossible
So, we require standardized naming.
❌ Common Naming Problems
Common errors found in file names and metadata:
Duplicate sample names (e.g., sample1 appearing twice)
Greek characters (α, β, Α, Τ)
Spaces in names (Sample 1, Control rep 2)
Special characters (!, %, @, #, (, ), /)
Different names for the same sample (CTRL1, Ctrl1, control_1)
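Two of the problems above, duplicates and inconsistent spellings, can be screened for automatically. The following is a minimal sketch, not an official tool of this workflow: the function names find_duplicates and find_variants are hypothetical, and the variant check only compares names after lowercasing and stripping non-alphanumeric characters.

```python
import re
from collections import Counter, defaultdict


def find_duplicates(names):
    """Return the names that occur more than once, in first-seen order."""
    counts = Counter(names)
    return [name for name in counts if counts[name] > 1]


def find_variants(names):
    """Group distinct spellings that become identical once case and
    non-alphanumeric characters (spaces, underscores, ...) are ignored."""
    groups = defaultdict(set)
    for name in names:
        key = re.sub(r"[^a-z0-9]", "", name.lower())
        groups[key].add(name)
    # Keep only keys that were reached by more than one distinct spelling.
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}


if __name__ == "__main__":
    samples = ["sample1", "CTRL1", "Ctrl1", "control_1", "sample1"]
    print(find_duplicates(samples))
    print(find_variants(samples))
```

This kind of check catches case and separator differences (CTRL1 vs Ctrl1, Sample 1 vs sample_1). It does not recognize abbreviations, so control_1 and CTRL1 would still need to be spotted by eye.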
✅ Naming Rules
Sample names MUST:
use English letters only
never change between FASTQ files, metadata, and results
use only letters, numbers, and underscores (A–Z, a–z, 0–9, _)
✔ A01_Rep1
❌ A01-Rep(1)
be unique
be consistent across files and folders
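The rules above can be checked before any pipeline is run. The sketch below is one possible way to do it, assuming the sample IDs have already been read into Python lists; the names is_valid_sample_name, invalid_names, and compare_ids are hypothetical helpers, not part of any existing tool.

```python
import re

# Explicit character class: \w is avoided on purpose, because in Python 3
# it also matches non-English letters such as Greek alpha.
VALID_NAME = re.compile(r"[A-Za-z0-9_]+")


def is_valid_sample_name(name):
    """True if the name uses only A-Z, a-z, 0-9 and underscore."""
    return VALID_NAME.fullmatch(name) is not None


def invalid_names(names):
    """Return the names that break the character rule, in input order."""
    return [name for name in names if not is_valid_sample_name(name)]


def compare_ids(fastq_ids, metadata_ids):
    """Return (IDs only in FASTQ, IDs only in metadata), each sorted."""
    fastq, meta = set(fastq_ids), set(metadata_ids)
    return sorted(fastq - meta), sorted(meta - fastq)


if __name__ == "__main__":
    print(invalid_names(["A01_Rep1", "A01-Rep(1)", "Sample 1"]))
    print(compare_ids(["A01_Rep1", "A02_Rep1"], ["A01_Rep1", "A02_rep1"]))
```

Both lists returned by compare_ids should be empty when the FASTQ names and the metadata agree. The comparison is case-sensitive, so A02_Rep1 and A02_rep1 are reported as a mismatch.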
📄 Project Documentation (The README file)
Standardized names are only useful if the context of the project is recorded as well. To prevent critical details from being lost (for example, who did what, once a year has passed), every project folder MUST contain a README file.
Recommended Template
Create a file named README.txt in your main project folder and use this structure:
Project: Short Title
Project owner: Name
Bioinformatician: Name
Date Started: YYYY-MM-DD
Status: In Progress / Completed / Archived
Experimental Details: A brief description of the project's context (e.g. comparisons, groups)
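If many projects are set up, the template can be filled in by a short script instead of by hand. This is only a sketch under stated assumptions: render_readme and write_readme are hypothetical helper names, the field values in the example are placeholders, and the script refuses to overwrite an existing README.txt.

```python
from datetime import date
from pathlib import Path

# Field names follow the recommended template above.
TEMPLATE = """\
Project: {title}
Project owner: {owner}
Bioinformatician: {bioinformatician}
Date Started: {started}
Status: {status}
Experimental Details: {details}
"""


def render_readme(title, owner, bioinformatician, details,
                  status="In Progress", started=None):
    """Fill in the template; the start date defaults to today (YYYY-MM-DD)."""
    started = started or date.today().isoformat()
    return TEMPLATE.format(title=title, owner=owner,
                           bioinformatician=bioinformatician,
                           started=started, status=status, details=details)


def write_readme(project_dir, text):
    """Write README.txt into project_dir without overwriting an existing one."""
    path = Path(project_dir) / "README.txt"
    if path.exists():
        raise FileExistsError(f"{path} already exists")
    path.write_text(text, encoding="utf-8")
    return path


if __name__ == "__main__":
    # Placeholder values for illustration only.
    text = render_readme("Short Title", "Name", "Name",
                         "Treated vs control, 3 replicates per group")
    print(write_readme(".", text))
```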