
# Scalable FASTQ QC: Merging, Automation, and MultiQC Reporting

A guide to scalable FASTQ quality control, including merging strategies, automation techniques, and MultiQC reporting.

_A practical guide to managing sequencing quality control at scale_

## Introduction

You just got your sequencing data back. Instead of a few tidy files, you're staring at 300 FASTQ files. Running FastQC on each one means 300 HTML reports to open manually. Click, scroll, close. Click, scroll, close. Repeat 298 more times.

I once spent an entire afternoon doing exactly this before someone told me about MultiQC. _Learn from my pain._

This guide shows you how to merge those files intelligently, automate FastQC across your entire dataset, and aggregate everything into one clean report. We'll go from chaos to a reproducible workflow you can use on every project.

**Download the FASTQ QC Checklist**: a printable checklist with key steps and quick-reference commands.

## Key Takeaways

If you only have 30 seconds, here's what you need to know:

- Modern Illumina sequencing splits output into many files per sample (lanes, tiles, legacy demultiplexing via CASAVA/bcl2fastq). This is normal, not an error.
- FASTQ files from the same sample and read direction (R1 or R2) can be safely merged with cat BEFORE alignment.
- Always verify your merge with line counts (`zcat file | wc -l`) or stream checksums; silent corruption is real.
- FastQC can be parallelized with the -t flag, GNU parallel, or simple bash loops.
- MultiQC aggregates hundreds of FastQC reports into one interactive HTML dashboard.
- Ask your sequencing facility about --no-lane-splitting (bcl2fastq2/BCL Convert) to avoid fragmentation at the source.
- A standardized, documented QC pipeline saves hours on every project and prevents rookie mistakes.

## Prerequisites

Who is this guide for? It's written for early-career bioinformaticians, students in computational biology, and biologists analyzing sequencing data. If you're an expert, feel free to skip to the code examples. If terms like "R1/R2" or "demultiplexing" are new to you, start with the glossary below.

Software requirements:

- Linux/macOS terminal (or WSL on Windows)
- FastQC (v0.11.9 or later)
- MultiQC (v1.14 or later)
- Python 3.6+ (required for MultiQC)
- Basic bash/command-line knowledge

## Glossary of Key Terms

If you're new to sequencing data, here's what these terms mean. Experts can skip this.

| Term | Definition |
| --- | --- |
| FASTQ | Text-based format for storing biological sequences and their quality scores. Each read consists of 4 lines: identifier, sequence, separator (+), and quality scores (ASCII-encoded). |
| R1/R2 files | Paired-end sequencing produces two files per sample: R1 (forward reads) and R2 (reverse reads). These must be kept synchronized. R2 often has lower quality due to sequencing chemistry; this is normal. |
| Lanes | Physical divisions on an Illumina flow cell. Samples are often split across multiple lanes for throughput, creating multiple files per sample that need to be merged. |
| FastQC | Widely used tool that generates quality metrics for sequencing data, producing an HTML report with visualizations for each input file. Does not aggregate across files. |
| MultiQC | Aggregation tool that combines outputs from FastQC and 150+ other bioinformatics tools into a single interactive HTML report. The solution to the "300 reports" problem. |
| Demultiplexing | Process of separating pooled samples based on their unique barcode sequences after sequencing. Happens before you receive your data. |
| bcl2fastq | Illumina's software for converting raw BCL files to FASTQ format. The --no-lane-splitting option can prevent file fragmentation. |
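To make the FASTQ entry in the glossary concrete, here is what a single read record looks like. The identifier fields vary by instrument and run, so treat the values below as illustrative only:

```text
@A00123:45:HXXXXXXXX:1:1101:1000:1000 1:N:0:ACGTACGT
GATTTGGGGTTCAAAGCAGT
+
IIIIIIIIIIIIIIIIIIII
```

Every read occupies exactly four lines, which is why the `zcat file | wc -l` checks later in this guide always produce counts divisible by four.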
## Why So Many Files?

Three reasons this keeps happening:

1. **Sequencing output is fragmented.** Historically, Illumina software (CASAVA, older bcl2fastq) split data by tiles, lanes, or file size limits. Old habits die hard, and many facilities still use these legacy configurations.
2. **FastQC treats every file independently.** There's no built-in aggregation: 300 input files = 300 output reports. It scales linearly with your suffering.
3. **Labs inherit outdated practices.** Directory structures and workflows from 2015 are still running in 2025. Nobody wants to touch "the pipeline that works."

Here's a real question from Biostars that captures the frustration perfectly:

> "Each sample (human genome) have about 250-300 fastq.gz files... I have to manually check 250-300 fastqc folder to know the quality by opening .html page. Is there any way where I can have summary of overall quality?" — Biostars user

Yes. Yes there is.

## Verifying Your Installation

Before we start, let's make sure the tools are actually installed.

Check FastQC:

```bash
fastqc --version
```

```
FastQC v0.12.1
```

Check MultiQC:

```bash
multiqc --version
```

```
multiqc, version 1.21
```

If you need to install them:

```bash
# Conda (recommended for reproducibility)
conda install -c bioconda fastqc multiqc

# Or separately with pip/apt/brew
pip install multiqc
sudo apt-get install fastqc   # Ubuntu/Debian
brew install fastqc           # macOS
```

## Merging FASTQ Files

Why merge at all? When your sample is split across multiple lanes, those lanes are technical replicates of the same biological material, not separate experiments. Merging brings all that data together so downstream tools (aligners, variant callers) see the complete picture for each sample. Without merging, you get fragmented data management and unnecessary I/O overhead on hundreds of files.

The rules:

- Merge R1 with R1 and R2 with R2 (never mix them)
- Merge AFTER demultiplexing, BEFORE alignment
- Merge files from the same sample only

Basic merge with cat:

```bash
cat sample001_*_R1_*.fastq.gz > sample001_R1_merged.fastq.gz
cat sample001_*_R2_*.fastq.gz > sample001_R2_merged.fastq.gz
```

Alternative: zcat for better compression (from Biostars). A Biostars user noted that directly concatenating gzipped files can produce larger output. For better compression efficiency:

```bash
# Decompress, concatenate, recompress for a smaller file
zcat sample001_L00*_R1_*.fastq.gz | gzip > sample001_R1_merged.fastq.gz
```

Batch merge script:

```bash
#!/bin/bash
mkdir -p merged

# Derive sample names from the part of the filename before the first underscore
SAMPLES=$(ls *_R1_*.fastq.gz | cut -d'_' -f1 | sort -u)

for SAMPLE in $SAMPLES; do
    echo "Merging $SAMPLE..."
    cat ${SAMPLE}_*_R1_*.fastq.gz > merged/${SAMPLE}_R1.fastq.gz
    cat ${SAMPLE}_*_R2_*.fastq.gz > merged/${SAMPLE}_R2.fastq.gz
done

echo "Done!"
```

How to verify the merge worked (DO NOT SKIP THIS): never skip verification. Silent corruption will ruin your downstream analysis, and you won't know until days later when your alignment fails or produces garbage.

Method 1: Line count verification (recommended)

```bash
# Count lines in the original files (pipe through zcat to decompress)
zcat sample001_L001_R1_*.fastq.gz sample001_L002_R1_*.fastq.gz | wc -l
# Output: 40000000

# Count lines in the merged file
zcat sample001_R1_merged.fastq.gz | wc -l
# Output: 40000000

# These numbers MUST match. If they don't, your merge failed.
```

Method 2: Stream checksum verification

```bash
# Checksum the concatenated stream from the originals
cat sample001_L00*_R1_*.fastq.gz | md5sum
# Output: a1b2c3d4e5f6...  -

# Checksum the merged file
md5sum sample001_R1_merged.fastq.gz
# Output: a1b2c3d4e5f6...  sample001_R1_merged.fastq.gz

# These checksums MUST be identical.
```

Method 3: Gzip integrity check

```bash
gzip -t sample001_R1_merged.fastq.gz && echo "OK" || echo "CORRUPTED"
```
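One more check worth running once both mates are merged: R1 and R2 must contain the same number of reads, or aligners will refuse (or silently mispair) the data. Here is a minimal sketch, assuming the `merged/<SAMPLE>_R1.fastq.gz` / `<SAMPLE>_R2.fastq.gz` naming used by the batch merge script above:

```bash
#!/bin/bash
# Compare read counts (lines / 4) between each merged R1/R2 pair
for R1 in merged/*_R1.fastq.gz; do
    R2="${R1%_R1.fastq.gz}_R2.fastq.gz"
    N1=$(( $(zcat "$R1" | wc -l) / 4 ))
    N2=$(( $(zcat "$R2" | wc -l) / 4 ))
    if [ "$N1" -eq "$N2" ]; then
        echo "OK:       $(basename "$R1" _R1.fastq.gz) ($N1 read pairs)"
    else
        echo "MISMATCH: $(basename "$R1" _R1.fastq.gz) R1=$N1 R2=$N2"
    fi
done
```

It decompresses each file once, so on large datasets you may want to fold it into the same pass as your line-count verification.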
## Running FastQC at Scale

There are several ways to run FastQC efficiently. Here are the options from simplest to fastest.

Option 1: Single file (the slow way)

```bash
fastqc sample001_R1.fastq.gz
# Output: sample001_R1_fastqc.html and sample001_R1_fastqc.zip
```

Option 2: Built-in threading with the -t flag. FastQC can process multiple files in parallel using its -t option:

```bash
mkdir -p qc_results   # the output directory must exist
fastqc -t 8 -o ./qc_results/ *.fastq.gz
```

```
Started analysis of sample001_R1.fastq.gz
Approx 5% complete for sample001_R1.fastq.gz
...
Analysis complete for sample001_R1.fastq.gz
```

Option 3: GNU parallel (even faster). For large datasets, GNU parallel gives you more control:

```bash
# Install parallel if needed: sudo apt-get install parallel
ls *.fastq.gz | parallel -j 8 fastqc -o ./qc_results/
```

Option 4: Simple bash loop (beginner-friendly). If you're just starting out, a basic loop with progress reporting works fine:

```bash
#!/bin/bash
mkdir -p qc_results

FILES=(*.fastq.gz)
TOTAL=${#FILES[@]}
COUNT=0

for FILE in "${FILES[@]}"; do
    ((COUNT++))
    echo "[$COUNT/$TOTAL] Processing $FILE..."
    fastqc -q -o qc_results/ "$FILE"
done

echo "FastQC complete!"
```

## Aggregating with MultiQC

This is where the magic happens. MultiQC scans a directory for FastQC outputs and combines them into a single interactive report.

```bash
cd qc_results/
multiqc .
```

```
          MultiQC v1.21

| multiqc | Search path : /home/user/project/qc_results
|           searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 48/48
|  fastqc | Found 48 reports
| multiqc | Report      : multiqc_report.html
| multiqc | Data        : multiqc_data
| multiqc | MultiQC complete
```

48 reports → 1 interactive HTML file. That's the whole point.

Custom output with a title and comment:

```bash
multiqc . \
    --filename my_project_qc \
    --title "WGS Batch Dec 2025" \
    --comment "24 samples, 30X coverage"
```

## Complete Workflow Script

Note for beginners: if you're new to bash, work through the sections above first to understand each step. This script ties everything together into a single, reproducible pipeline.

```bash
#!/bin/bash
# fastq_qc_pipeline.sh
# Complete FASTQ QC workflow: merge, verify, QC, aggregate

set -e  # Exit on error

INPUT_DIR="./raw_reads"
MERGED_DIR="./merged"
QC_DIR="./qc_results"
THREADS=8

echo "=== FASTQ QC Pipeline ==="
echo "Input: $INPUT_DIR"
echo "Threads: $THREADS"
echo ""

# Create directories
mkdir -p "$MERGED_DIR" "$QC_DIR"

# Step 1: Merge FASTQ files by sample
echo "[1/4] Merging FASTQ files..."
cd "$INPUT_DIR"
SAMPLES=$(ls *_R1_*.fastq.gz 2>/dev/null | cut -d'_' -f1 | sort -u)
for SAMPLE in $SAMPLES; do
    echo "  Merging $SAMPLE..."
    cat ${SAMPLE}_*_R1_*.fastq.gz > "../$MERGED_DIR/${SAMPLE}_R1.fastq.gz"
    cat ${SAMPLE}_*_R2_*.fastq.gz > "../$MERGED_DIR/${SAMPLE}_R2.fastq.gz"
done
cd ..

# Step 2: Verify merge integrity
echo "[2/4] Verifying file integrity..."
for FILE in "$MERGED_DIR"/*.fastq.gz; do
    if gzip -t "$FILE" 2>/dev/null; then
        echo "  OK: $(basename $FILE)"
    else
        echo "  ERROR: $(basename $FILE) is corrupted!"
        exit 1
    fi
done

# Step 3: Run FastQC
echo "[3/4] Running FastQC..."
fastqc -t "$THREADS" -o "$QC_DIR" "$MERGED_DIR"/*.fastq.gz

# Step 4: Generate MultiQC report
echo "[4/4] Generating MultiQC report..."
multiqc "$QC_DIR" -o "$QC_DIR" --filename final_qc_report

echo ""
echo "=== Pipeline Complete ==="
echo "Report: $QC_DIR/final_qc_report.html"
```
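Before you even open the HTML report, a quick terminal-level triage can flag problem samples. Each FastQC `.zip` bundle contains a tab-separated `summary.txt` with a PASS/WARN/FAIL status per module. The sketch below assumes the `qc_results/` layout produced by the pipeline above:

```bash
# Tally module statuses across all FastQC results
for ZIP in qc_results/*_fastqc.zip; do
    unzip -p "$ZIP" "*/summary.txt"
done | cut -f1 | sort | uniq -c

# List every failed module together with the file it came from
for ZIP in qc_results/*_fastqc.zip; do
    unzip -p "$ZIP" "*/summary.txt" | grep "^FAIL" || true
done
```

The `|| true` keeps the loop from aborting under `set -e` when a file has no failed modules.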
This gives you a solid, reproducible QC foundation. For production environments where you need real-time visibility into complex pipelines at scale, error monitoring tools designed for bioinformatics (like Tracer.cloud) can help you catch failures before they propagate through your entire analysis.

## Common Mistakes

These come from real Biostars threads and community forums. They're embarrassingly common; I've made most of them myself.

### Mistake 1: Merging R1 with R2

```bash
# WRONG - breaks paired-end data!
cat sample_R1.fastq.gz sample_R2.fastq.gz > sample_merged.fastq.gz

# CORRECT - keep R1 and R2 separate
cat sample_*_R1_*.fastq.gz > sample_R1.fastq.gz
cat sample_*_R2_*.fastq.gz > sample_R2.fastq.gz
```

This mistake will silently destroy your paired-end analysis. Aligners expect R1 and R2 files to have matching read pairs in the same order (a quick pair spot check is sketched after this section).

### Mistake 2: Not verifying after the merge

As shown in the merge verification section above, always verify with line counts:

```bash
# Before merge
zcat original_L00*_R1_*.fastq.gz | wc -l

# After merge
zcat merged_R1.fastq.gz | wc -l

# Numbers must match!
```

Silent corruption happens more often than you think, especially with network file systems or interrupted transfers.

### Mistake 3: Running FastQC before merging

You'll create hundreds of unnecessary reports. Merge first, QC second. Your MultiQC report will also be cleaner: per-lane variation can look like quality problems when it's just normal technical noise.

### Mistake 4: Not checking both R1 and R2

R2 quality is often worse than R1 (sequencing chemistry degrades over the run). This is normal! But you need to check both files and compare them in MultiQC. If R2 looks significantly worse than usual, you may need trimming or re-sequencing.

### Mistake 5: Skipping gzip integrity checks

```bash
# Takes 2 seconds, saves 2 days of debugging
gzip -t file.fastq.gz && echo "OK" || echo "CORRUPTED"
```

Corrupted gzip files cause cryptic errors in downstream tools that are nearly impossible to diagnose.

### Mistake 6: Not requesting consolidated output from your facility

From the Biostars thread: "Rather than accepting hundreds of individual files per sample, request that sequencing facilities use bcl2fastq with --no-lane-splitting or --fastq-cluster-count 0." This prevents the problem at the source.
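Related to Mistakes 1 and 4: a properly paired dataset has matching read counts and matching fragment IDs between mates. Here is a quick spot check of the first record, assuming gzipped files named `sample_R1.fastq.gz` / `sample_R2.fastq.gz` with modern Illumina headers (older reads that use `/1` and `/2` suffixes need the suffix stripped instead):

```bash
# Compare the fragment ID (everything before the first space) of the first read in each mate
ID1=$(zcat sample_R1.fastq.gz | head -n 1 | cut -d' ' -f1)
ID2=$(zcat sample_R2.fastq.gz | head -n 1 | cut -d' ' -f1)
if [ "$ID1" = "$ID2" ]; then
    echo "First pair in sync: $ID1"
else
    echo "OUT OF SYNC: $ID1 vs $ID2"
fi
```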
## Conclusion

No more clicking through 300 HTML files. No more guessing which lane had the weird quality dip. One merged file per sample, one MultiQC report for everything.

The workflow:

1. Merge by sample and read direction
2. Verify integrity with line counts (`zcat | wc -l`) or checksums
3. Run FastQC in parallel
4. Aggregate with MultiQC
5. Document and move on

_What QC nightmares have you run into? Drop them in the comments; maybe we can save someone else the headache._

## Summary

Modern Illumina sequencing often produces hundreds of FASTQ files per sample due to lane splitting and legacy demultiplexing practices (CASAVA, older bcl2fastq). Running FastQC on each file individually creates an overwhelming number of reports that are impossible to review manually. This guide presents a scalable solution: merge FASTQ files by sample and read direction using simple cat commands, automate FastQC execution with parallel processing, and aggregate all results into a single interactive report using MultiQC. As noted in the Seqera documentation, MultiQC has become the standard for QC aggregation, with over 1.5 million downloads and 25,000 daily runs. The workflow reduces hours of manual QC review to minutes and ensures reproducibility across projects.

Key steps include verifying merge integrity with line counts (`zcat file | wc -l`) or stream checksums, using the -t flag for parallel FastQC execution, and running MultiQC to generate clean, publication-ready reports. By standardizing this process, bioinformatics teams can catch quality issues early, like R2 quality degradation or adapter contamination, without drowning in individual HTML files. The complete workflow is captured in a single bash script that can be adapted for any sequencing project.

## References

- FastQC documentation: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- MultiQC documentation: https://multiqc.info/
- Seqera MultiQC: https://seqera.io/multiqc/
- Biostars, "Running FastQC on multiple files": https://www.biostars.org/p/141797/
- Illumina Knowledge Base: Concatenating FASTQ files
- nf-core pipelines: https://nf-co.re/