Scalable FASTQ QC:
Merging, Automation, and MultiQC Reporting
_A practical guide to managing sequencing quality control at scale_
Introduction
You just got your sequencing data back. Instead of a few tidy files, you're staring at 300 FASTQ files. Running FastQC on each one means 300 HTML reports to open manually. Click, scroll, close. Click, scroll, close. Repeat 298 more times.
I once spent an entire afternoon doing exactly this before someone told me about MultiQC. _Learn from my pain._
This guide shows you how to merge those files intelligently, automate FastQC across your entire dataset, and aggregate everything into one clean report. We'll go from chaos to a reproducible workflow you can use on every project.
Key Takeaways
If you only have 30 seconds, here's what you need to know:
- Modern Illumina sequencing splits output into many files per sample (lanes, tiles, legacy demultiplexing via CASAVA/bcl2fastq); this is normal, not an error
- FASTQ files from the same sample and read direction (R1 or R2) can be safely merged with cat BEFORE alignment
- Always verify your merge with line counts (zcat file | wc -l) or stream checksums; silent corruption is real
- FastQC can be parallelized with the -t flag, GNU parallel, or simple bash loops
- MultiQC aggregates hundreds of FastQC reports into one interactive HTML dashboard
- Ask your sequencing facility about --no-lane-splitting (bcl2fastq2/BCL Convert) to avoid fragmentation at the source
- A standardized, documented QC pipeline saves hours on every project and prevents rookie mistakes
Prerequisites
Who is this guide for?
This guide is written for early-career bioinformaticians, students in computational biology, and biologists analyzing sequencing data. If you're an expert, feel free to skip to the code examples. If terms like "R1/R2" or "demultiplexing" are new to you, start with the glossary below.
Software Requirements
- Linux/macOS terminal (or WSL on Windows)
- FastQC (v0.11.9 or later)
- MultiQC (v1.14 or later)
- Python 3.6+ (required for MultiQC)
- Basic bash/command-line knowledge
Glossary of Key Terms
If you're new to sequencing data, here's what these terms mean. Experts can skip this.
| Term | Definition |
| --- | --- |
| FASTQ | Text-based format for storing biological sequences and their quality scores. Each read consists of 4 lines: identifier, sequence, separator (+), and quality scores (ASCII-encoded). |
| R1/R2 Files | Paired-end sequencing produces two files per sample: R1 (forward reads) and R2 (reverse reads). These must be kept synchronized. R2 often has lower quality due to sequencing chemistry; this is normal. |
| Lanes | Physical divisions on an Illumina flow cell. Samples are often split across multiple lanes for throughput, creating multiple files per sample that need to be merged. |
| FastQC | Widely used tool that generates quality metrics for sequencing data, producing an HTML report with visualizations for each input file. Does not aggregate across files. |
| MultiQC | Aggregation tool that combines outputs from FastQC and 150+ other bioinformatics tools into a single interactive HTML report. The solution to the "300 reports" problem. |
| Demultiplexing | Process of separating pooled samples based on their unique barcode sequences after sequencing. Happens before you receive your data. |
| bcl2fastq | Illumina's software for converting raw BCL files to FASTQ format. The --no-lane-splitting option can prevent file fragmentation. |
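If the four-line record structure is new to you, it only takes a second to see it for yourself. A minimal sketch, assuming a gzipped file named sample001_R1.fastq.gz; the record shown in the comments is made up for illustration:

```bash
# Peek at the first record (4 lines) of a gzipped FASTQ file
zcat sample001_R1.fastq.gz | head -4
# Example output (illustrative, not real data):
# @SEQ_ID_001 1:N:0:ACGTACGT       <- identifier
# GATTTGGGGTTCAAAGCAGTATCGATCAAA   <- sequence
# +                                <- separator
# IIIIIIIIIIIIIIIIIIIIIIIIIIIIII   <- quality scores (ASCII-encoded)
```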
Why So Many Files?
Three reasons this keeps happening:
1. Sequencing output is fragmented. Historically, Illumina software (CASAVA, older bcl2fastq) split data by tiles, lanes, or file size limits. Old habits die hard, and many facilities still use these legacy configurations.
2. FastQC treats every file independently. There's no built-in aggregation. 300 input files = 300 output reports. It scales linearly with your suffering.
3. Labs inherit outdated practices. Directory structures and workflows from 2015 are still running in 2025. Nobody wants to touch "the pipeline that works."
Here's a real question from Biostars that captures the frustration perfectly:
_"Each sample (human genome) have about 250-300 fastq.gz files... I have to manually check 250-300 fastqc folder to know the quality by opening .html page. Is there any way where I can have summary of overall quality?"_
- Biostars user
Yes. Yes there is.
Verifying Your Installation
Before we start, let's make sure the tools are actually installed.
Check FastQC:
```bash
fastqc --version
```

```
FastQC v0.12.1
```
Check MultiQC:
```bash
multiqc --version
```

```
multiqc, version 1.21
```
If you need to install them:
```bash
# Conda (recommended for reproducibility)
conda install -c bioconda fastqc multiqc

# Or separately with pip/apt/brew
pip install multiqc
sudo apt-get install fastqc   # Ubuntu/Debian
brew install fastqc           # macOS
```
Merging FASTQ Files
Why merge at all?
When your sample is split across multiple lanes, those lanes are technical replicates of the same biological material, not separate experiments. Merging brings all that data together so downstream tools (aligners, variant callers) see the complete picture for each sample.
Without merging, you get fragmented data management and unnecessary I/O overhead on hundreds of files.
The rules:
- Merge R1 with R1, R2 with R2 (never mix them)
- Merge AFTER demultiplexing, BEFORE alignment
- Same sample only
Basic merge with cat:
```bash
cat sample001_*_R1_*.fastq.gz > sample001_R1_merged.fastq.gz
cat sample001_*_R2_*.fastq.gz > sample001_R2_merged.fastq.gz
```
Alternative: zcat for better compression (from Biostars):
A Biostars user noted that directly concatenating gzipped files can produce larger output. For better compression efficiency:
```bash
# Decompress, concatenate, and recompress for a smaller file
zcat sample001_L00*_R1_*.fastq.gz | gzip > sample001_R1_merged.fastq.gz
```
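If pigz is installed, the recompression step can also run on multiple cores. This is an optional variation on the command above, not something the Biostars thread prescribes:

```bash
# Same idea, but recompress with 8 threads using pigz (a parallel gzip replacement)
zcat sample001_L00*_R1_*.fastq.gz | pigz -p 8 > sample001_R1_merged.fastq.gz
```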
Batch merge script:
```bash
#!/bin/bash
# Merge lane-split FASTQ files by sample and read direction
mkdir -p merged

SAMPLES=$(ls *_R1_*.fastq.gz | cut -d'_' -f1 | sort -u)

for SAMPLE in $SAMPLES; do
    echo "Merging $SAMPLE..."
    cat "${SAMPLE}"_*_R1_*.fastq.gz > merged/"${SAMPLE}"_R1.fastq.gz
    cat "${SAMPLE}"_*_R2_*.fastq.gz > merged/"${SAMPLE}"_R2.fastq.gz
done
echo "Done!"
```
How to verify the merge worked (DO NOT SKIP THIS):
Never skip verification. Silent corruption will ruin your downstream analysis and you won't know until days later when your alignment fails or produces garbage.
Method 1: Line count verification (recommended)
```bash
# Count lines in the original files (pipe through zcat to decompress)
zcat sample001_L001_R1_*.fastq.gz sample001_L002_R1_*.fastq.gz | wc -l
# Output: 40000000

# Count lines in the merged file
zcat sample001_R1_merged.fastq.gz | wc -l
# Output: 40000000

# These numbers MUST match. If they don't, your merge failed.
```
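With many samples, a small loop can run this comparison for you. A minimal sketch, assuming the per-lane originals sit in the current directory and the merged files live in merged/, named as in the batch script above; adjust the glob patterns to your own naming scheme:

```bash
#!/bin/bash
# Compare read counts (lines / 4) of each merged R1 file against its per-lane originals
for MERGED in merged/*_R1.fastq.gz; do
    SAMPLE=$(basename "$MERGED" _R1.fastq.gz)
    ORIG_READS=$(( $(zcat "${SAMPLE}"_*_R1_*.fastq.gz | wc -l) / 4 ))
    MERGED_READS=$(( $(zcat "$MERGED" | wc -l) / 4 ))
    if [ "$ORIG_READS" -eq "$MERGED_READS" ]; then
        echo "OK:       $SAMPLE ($MERGED_READS reads)"
    else
        echo "MISMATCH: $SAMPLE (originals: $ORIG_READS, merged: $MERGED_READS)"
    fi
done
```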
Method 2: Stream checksum verification
Note: this check only applies if you merged with plain cat. Recompressing with zcat | gzip changes the bytes, so use line counts in that case.
```bash
# Checksum the concatenated stream from the originals
cat sample001_L00*_R1_*.fastq.gz | md5sum
# Output: a1b2c3d4e5f6...  -

# Checksum the merged file
md5sum sample001_R1_merged.fastq.gz
# Output: a1b2c3d4e5f6...  sample001_R1_merged.fastq.gz

# These checksums MUST be identical.
```
Method 3: Gzip integrity check
```bash
gzip -t sample001_R1_merged.fastq.gz && echo "OK" || echo "CORRUPTED"
```
Running FastQC at Scale
There are several ways to run FastQC efficiently. Here are the options from simplest to fastest.
Option 1: Single file (the slow way)
```bash
fastqc sample001_R1.fastq.gz
# Output: sample001_R1_fastqc.html and sample001_R1_fastqc.zip
```
Option 2: Built-in threading with -t flag
FastQC can process multiple files in parallel using its -t option:
```bash
mkdir -p qc_results   # FastQC won't create the output directory for you
fastqc -t 8 -o ./qc_results/ *.fastq.gz
```
```
Started analysis of sample001_R1.fastq.gz
Approx 5% complete for sample001_R1.fastq.gz
...
Analysis complete for sample001_R1.fastq.gz
```
Option 3: GNU parallel (even faster)
For large datasets, GNU parallel gives you more control:
```bash
# Install parallel if needed: sudo apt-get install parallel
ls *.fastq.gz | parallel -j 8 fastqc -o ./qc_results/
```
Option 4: Simple bash loop (beginner-friendly)
If you're just starting out, a basic loop with progress reporting works fine:
```bash
#!/bin/bash
mkdir -p qc_results

FILES=(*.fastq.gz)
TOTAL=${#FILES[@]}
COUNT=0

for FILE in "${FILES[@]}"; do
    ((COUNT++))
    echo "[$COUNT/$TOTAL] Processing $FILE..."
    fastqc -q -o qc_results/ "$FILE"
done
echo "FastQC complete!"
```
Aggregating with MultiQC
This is where the magic happens. MultiQC scans a directory for FastQC outputs and combines them into a single interactive report.
```bash
cd qc_results/
multiqc .
```
```
MultiQC v1.21
| multiqc | Search path : /home/user/project/qc_results
| searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 48/48
| fastqc | Found 48 reports
| multiqc | Report : multiqc_report.html
| multiqc | Data : multiqc_data
| multiqc | MultiQC complete
```
48 reports → 1 interactive HTML file. That's the whole point.
Custom output with title and comments:
```bash
multiqc . \
    --filename my_project_qc \
    --title "WGS Batch Dec 2025" \
    --comment "24 samples, 30X coverage"
```
Complete Workflow Script
Note for beginners: If you're new to bash, work through the sections above first to understand each step. This script ties everything together into a single, reproducible pipeline.
```bash
#!/bin/bash
# fastq_qc_pipeline.sh
# Complete FASTQ QC workflow: merge, verify, QC, aggregate

set -e  # Exit on error

INPUT_DIR="./raw_reads"
MERGED_DIR="./merged"
QC_DIR="./qc_results"
THREADS=8

echo "=== FASTQ QC Pipeline ==="
echo "Input: $INPUT_DIR"
echo "Threads: $THREADS"
echo ""

# Create directories
mkdir -p "$MERGED_DIR" "$QC_DIR"

# Step 1: Merge FASTQ files by sample
echo "[1/4] Merging FASTQ files..."
cd "$INPUT_DIR"
SAMPLES=$(ls *_R1_*.fastq.gz 2>/dev/null | cut -d'_' -f1 | sort -u)
for SAMPLE in $SAMPLES; do
    echo "  Merging $SAMPLE..."
    cat "${SAMPLE}"_*_R1_*.fastq.gz > "../$MERGED_DIR/${SAMPLE}_R1.fastq.gz"
    cat "${SAMPLE}"_*_R2_*.fastq.gz > "../$MERGED_DIR/${SAMPLE}_R2.fastq.gz"
done
cd ..

# Step 2: Verify merge integrity
echo "[2/4] Verifying file integrity..."
for FILE in "$MERGED_DIR"/*.fastq.gz; do
    if gzip -t "$FILE" 2>/dev/null; then
        echo "  OK: $(basename "$FILE")"
    else
        echo "  ERROR: $(basename "$FILE") is corrupted!"
        exit 1
    fi
done

# Step 3: Run FastQC
echo "[3/4] Running FastQC..."
fastqc -t "$THREADS" -o "$QC_DIR" "$MERGED_DIR"/*.fastq.gz

# Step 4: Generate MultiQC report
echo "[4/4] Generating MultiQC report..."
multiqc "$QC_DIR" -o "$QC_DIR" --filename final_qc_report

echo ""
echo "=== Pipeline Complete ==="
echo "Report: $QC_DIR/final_qc_report.html"
```
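To use it, save the script as fastq_qc_pipeline.sh in your project directory (it expects raw reads in ./raw_reads; adjust the variables at the top for your layout), make it executable, and run it:

```bash
chmod +x fastq_qc_pipeline.sh
./fastq_qc_pipeline.sh
```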
This gives you a solid, reproducible QC foundation. For production environments where you need real-time visibility into complex pipelines at scale, error monitoring tools designed for bioinformatics (like Tracer.cloud) can help you catch failures before they propagate through your entire analysis.
Common Mistakes
These come from real Biostars threads and community forums. They're embarrassingly common; I've made most of them myself.
Mistake 1: Merging R1 with R2
```bash
# WRONG - breaks paired-end data!
cat sample_R1.fastq.gz sample_R2.fastq.gz > sample_merged.fastq.gz

# CORRECT - keep R1 and R2 separate
cat sample_*_R1_*.fastq.gz > sample_R1.fastq.gz
cat sample_*_R2_*.fastq.gz > sample_R2.fastq.gz
```
This mistake will silently destroy your paired-end analysis. Aligners expect R1 and R2 files to have matching read pairs in the same order.
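A quick sanity check is to confirm that R1 and R2 contain the same number of reads. A sketch, using the merged file names from above; equal counts don't prove the reads are in matching order, but a mismatch always means the pairing is broken:

```bash
# Read counts (lines / 4) must match between R1 and R2
R1_READS=$(( $(zcat sample_R1.fastq.gz | wc -l) / 4 ))
R2_READS=$(( $(zcat sample_R2.fastq.gz | wc -l) / 4 ))
echo "R1: $R1_READS reads | R2: $R2_READS reads"
[ "$R1_READS" -eq "$R2_READS" ] && echo "Pair counts match" || echo "Pair counts differ!"
```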
Mistake 2: Not verifying after merge
As shown in Section 6, always verify with line counts:
```bash
# Before merge
zcat original_L00*_R1_*.fastq.gz | wc -l

# After merge
zcat merged_R1.fastq.gz | wc -l

# Numbers must match!
```
Silent corruption happens more often than you think, especially with network file systems or interrupted transfers.
Mistake 3: Running FastQC before merging
You'll create hundreds of unnecessary reports. Merge first, QC second. Your MultiQC report will also be cleaner; per-lane variation can look like quality problems when it's just normal technical noise.
Mistake 4: Not checking both R1 and R2
R2 quality is often worse than R1 (sequencing chemistry degrades over the run). This is normal! But you need to check both files and compare them in MultiQC. If R2 looks significantly worse than usual, you may need trimming or re-sequencing.
Mistake 5: Skipping gzip integrity checks
```bash
# Takes 2 seconds, saves 2 days of debugging
gzip -t file.fastq.gz && echo "OK" || echo "CORRUPTED"
```
Corrupted gzip files will cause cryptic errors in downstream tools that are nearly impossible to diagnose.
Mistake 6: Not requesting consolidated output from your facility
From the Biostars thread: "Rather than accepting hundreds of individual files per sample, request that sequencing facilities use bcl2fastq with --no-lane-splitting or --fastq-cluster-count 0." This prevents the problem at the source.
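For context, --no-lane-splitting is a flag on the bcl2fastq command line itself, so this is something your facility configures when converting BCL files. A sketch of the kind of invocation involved; the paths are placeholders:

```bash
# Illustrative only - the sequencing facility runs this, and the paths are placeholders
bcl2fastq \
    --runfolder-dir /path/to/run_folder \
    --output-dir /path/to/fastq_output \
    --sample-sheet /path/to/SampleSheet.csv \
    --no-lane-splitting
```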
Conclusion
No more clicking through 300 HTML files. No more guessing which lane had the weird quality dip. One merged file per sample, one MultiQC report for everything.
The workflow:
1. Merge by sample and read direction
2. Verify integrity with line counts (zcat | wc -l) or checksums
3. Run FastQC in parallel
4. Aggregate with MultiQC
5. Document and move on
_What QC nightmares have you run into? Drop them in the comments, maybe we can save someone else the headache._
Summary
Modern Illumina sequencing often produces hundreds of FASTQ files per sample due to lane splitting and legacy demultiplexing practices (CASAVA, older bcl2fastq). Running FastQC on each file individually creates an overwhelming number of reports that are impossible to review manually. This guide presents a scalable solution: merge FASTQ files by sample and read direction using simple cat commands, automate FastQC execution with parallel processing, and aggregate all results into a single interactive report using MultiQC. As noted in the Seqera documentation, MultiQC has become the standard for QC aggregation with over 1.5 million downloads and 25,000 daily runs.
The workflow reduces hours of manual QC review to minutes and ensures reproducibility across projects. Key steps include verifying merge integrity with line counts (zcat file | wc -l) or stream checksums, using the -t flag for parallel FastQC execution, and running MultiQC to generate clean, publication-ready reports. By standardizing this process, bioinformatics teams can catch quality issues early, like R2 quality degradation or adapter contamination, without drowning in individual HTML files. The complete workflow is captured in a single bash script that can be adapted for any sequencing project.
References
FastQC Documentation: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
MultiQC Documentation: https://multiqc.info/
Seqera MultiQC: https://seqera.io/multiqc/
Biostars: Running FastQC on multiple files: https://www.biostars.org/p/141797/
Illumina: Concatenating FASTQ files: Illumina Knowledge Base
nf-core pipelines: https://nf-co.re/