Why We're Building Tracer: The Future of Scientific Computing Observability
More

Differential Gene Expression Analysis: RNA-Seq with Pathway Enrichment

A comprehensive tutorial on RNA-seq differential expression analysis using DESeq2 and pathway enrichment with clusterProfiler. Learn to identify genes and pathways responding to ALDH1A1 inhibition in ovarian cancer cells.

const metadata = ; Differential Gene Expression Analysis: RNA-Seq with Pathway Enrichment Use Case: ALDH1A1 inhibitor (compound 974) on OVCAR3 ovarian cancer cells Design: 3 DMSO control vs 3 treated samples (5µM, 48h) Goal: Identify genes and pathways responding to ALDH1A1 inhibition Why RNA-Seq Analysis Differs from Microarray The fundamental difference: RNA-seq produces discrete count data and microarray produces continuous intensities. Key Implications: Aspect Microarray RNA-Seq Data type Continuous intensities Discrete read counts Distribution Normal (after log) Negative binomial Zero values Rare (background signal) Common (true non-expression) Library size Same for all samples Varies per sample (needs normalization) Statistical test t-test on log values GLM with count distribution Why this matters: Using t-tests directly on count data is statistically inappropriate since count distributions do not meet the t-test assumptions. In RNA-seq, the variance increases with the mean; therefore, genes with low counts have significantly different levels of uncertainty than genes with high counts. Proper differential expression analysis requires specialized methods that appropriately reflect the mean-variance relationship. Tool Selection DESeq2 for Differential Expression Why DESeq2: DESeq2 is widely considered an industry-standard tool for differential expression analysis due to its robustness and extensive validation. It accurately models the variance-mean relationship by using dispersion shrinkage, which stabilises variance estimates across genes. The approach also incorporates built-in normalisation procedures to account for variations in library size between samples. Overall, DESeq2 is conservative in its statistical testing, thereby reducing the likelihood of false-positive results. clusterProfiler for Pathway Analysis Why clusterProfiler: ClusterProfiler is a robust and well-maintained Bioconductor package for functional enrichment and pathway analysis. It supports a wide range of annotation sources, such as GO, KEGG, Reactome, and MSigDb, making it extremely useful for biological interpretation. The package includes excellent visualisation choices, making it simple to express functional insights. Furthermore, clusterProfiler supports many organisms, boosting its adaptability to various sorts of genomic studies. Part 1: Differential Expression 1. Environment Setup `r Install (run once) if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install(c("DESeq2", "apeglm", "org.Hs.eg.db", "clusterProfiler", "enrichplot", "DOSE", "pheatmap", "ggplot2", "ggrepel")) Load library(DESeq2) library(ggplot2) library(pheatmap) library(org.Hs.eg.db) library(clusterProfiler) library(enrichplot) dir.create("~/rna_seq_analysis", showWarnings = FALSE) setwd("~/rna_seq_analysis") set.seed(42) ` 2. Load Data `r Counts counts all(colnames(counts) == rownames(metadata)) Must be TRUE [1] TRUE > head(counts) GSM6040526 GSM6040527 GSM6040528 GSM6040529 GSM6040530 GSM6040531 ENSG00000223972 0 0 0 1 0 0 ENSG00000227232 5 5 4 6 1 9 ENSG00000278267 0 0 0 0 0 0 ENSG00000243485 0 0 0 0 0 0 ENSG00000284332 0 0 0 0 0 0 ENSG00000237613 0 0 0 0 0 0 > dim(counts) [1] 58395 6 > head(metadata) Sample_name Condition GSM6040526 OVCAR3_DMSO control_1 control GSM6040527 OVCAR3_DMSO control_2 control GSM6040528 OVCAR3_DMSO control_3 control GSM6040529 OVCAR3_974_1 treated GSM6040530 OVCAR3_974_2 treated GSM6040531 OVCAR3_974_3 treated ` This dataset (GSE200641) was collected from the NCBI GEO database and contains RNA-seq read counts for 58,395 genes from six OVCAR3 ovarian cancer cell lines. The experimental design includes three control samples treated with DMSO and three samples treated with the ALDH1A1 inhibitor compound 974, which allows for examination of treatment-induced transcriptional alterations. The counts file is a table that contains gene-level RNA-seq read counts for each sample. Each row names a gene by its Ensembl ID, and the columns correspond to the samples. The metadata file is a distinct table with sample names and experimental conditions (control or treated). Check: - All counts are integers (no decimals) - Gene IDs as rownames (Ensembl format) - 6 sample columns matching metadata All validation checks should return TRUE, indicating that the counts are integers, gene IDs follow the Ensembl format, and sample columns match the metadata. `r all(counts %% 1 == 0) TRUE: all integer counts all(grepl("^ENSG", rownames(counts))) TRUE: all Ensembl IDs all(colnames(counts) == metadata$GSM_ID) TRUE: column-metadata match ` 3. Create DESeq Object `r dds dds class: DESeqDataSet dim: 58395 6 metadata(1): version assays(1): counts rownames(58395): ENSG00000223972 ENSG00000227232 ... ENSG00000277475 ENSG00000268674 rowData names(0): colnames(6): GSM6040526 GSM6040527 ... GSM6040530 GSM6040531 colData names(2): Sample_name Condition ` Design = ~ Condition will test treated vs control 4. Pre-filter `r Keep genes with ≥10 total reads keep = 10 dds 0 (up) : 1165, 5.8% LFC head(as.data.frame(res_shrink)) baseMean log2FoldChange lfcSE pvalue padj symbol entrez ENSG00000227232 4.850715 0.014664765 0.1684407 0.63098526 NA ENSG00000269981 3.011504 0.015300304 0.1698089 0.50022705 NA ENSG00000241860 2.805192 -0.007335869 0.1688956 0.75444793 NA ENSG00000279457 42.122024 0.318923411 0.4603402 0.02122829 0.09357522 ENSG00000228463 4.103909 -0.008105332 0.1682606 0.76834074 NA RPL23AP21 728481 ENSG00000237094 2.882101 0.012887623 0.1695576 0.56407724 NA ` The dataset uses Ensembl gene IDs as rownames. For easy interpretability, we map these to gene symbols. Because most enrichment and pathway analysis tools perform best with Entrez IDs, we additionally convert the genes to their corresponding Entrez identifiers. 9. Filter Significant Genes `r Thresholds padj_cutoff 2-fold change Filter sig_genes lfc_cutoff) sig_genes 0), "\n") cat("Downregulated:", sum(sig_genes$log2FoldChange head(as.data.frame(sig_genes)) baseMean log2FoldChange lfcSE pvalue padj symbol entrez ENSG00000163347 14629.5449 -1.449817 0.1245526 1.797600e-32 2.679862e-28 CLDN1 9076 ENSG00000168209 2094.5182 -1.349249 0.1201882 1.723087e-30 1.284389e-26 DDIT4 54541 ENSG00000134954 2532.5570 -1.061181 0.1002469 2.065706e-27 8.490630e-24 ETS1 2113 ENSG00000139289 280.2012 -2.156727 0.2039613 2.278141e-27 8.490630e-24 PHLDA1 22822 ENSG00000138166 1978.4654 -1.163860 0.1155208 4.127276e-25 1.230589e-21 DUSP5 1847 ` Thresholds explained: - padj 1: >2-fold change (biologically meaningful) Why both? Statistical significance does not imply biological relevance. A tiny fold change (e.g., 1.02-fold) may be statistically significant but biologically meaningless. We used a significance threshold of padj 1, "Upregulated", ifelse(volcano_df$log2FoldChange 0, ] up_entrez head(as.data.frame(go_bp)) ID Description GeneRatio BgRatio RichFactor FoldEnrichment zScore pvalue p.adjust qvalue GO:0050900 GO:0050900 leukocyte migration 20/111 395/18805 0.05063291 8.577945 11.728935 1.617415e-13 3.704243e-10 2.657306e-10 GO:0006935 GO:0006935 chemotaxis 21/111 466/18805 0.04506438 7.634555 11.175126 3.585098e-13 3.704243e-10 2.657306e-10 GO:0042330 GO:0042330 taxis 21/111 468/18805 0.04487179 7.601929 11.144616 3.893739e-13 3.704243e-10 2.657306e-10 GO:1990266 GO:1990266 neutrophil migration 11/111 107/18805 0.10280374 17.416435 13.122261 3.529102e-11 2.518014e-08 1.806343e-08 GO:0097530 GO:0097530 granulocyte migration 12/111 149/18805 0.08053691 13.644114 11.940094 7.878583e-11 4.497095e-08 3.226072e-08 GO:0042060 GO:0042060 wound healing 18/111 450/18805 0.04000000 6.776577 9.557297 1.453951e-10 6.915959e-08 4.961288e-08 geneID Count GO:0050900 THBS1/IL6/ICAM1/ITGB3/CXCL8/S100A8/CCL20/IL1A/CXCL6/EDN1/TNF/CD34/PECAM1/SAA1/TNFAIP6/CXCL5/CD69/S100A9/S1PR1/CXCL3 20 GO:0006935 THBS1/IL6/SEMA3C/FOSL1/ITGB3/CXCL1/CXCL8/S100A8/CCL20/FGF2/CXCL6/GPNMB/CXCL2/EDN1/SAA1/TNFAIP6/CXCL5/S100A9/S1PR1/ANGPT1/CXCL3 21 GO:0042330 THBS1/IL6/SEMA3C/FOSL1/ITGB3/CXCL1/CXCL8/S100A8/CCL20/FGF2/CXCL6/GPNMB/CXCL2/EDN1/SAA1/TNFAIP6/CXCL5/S100A9/S1PR1/ANGPT1/CXCL3 21 GO:1990266 CXCL8/S100A8/IL1A/CXCL6/EDN1/PECAM1/SAA1/TNFAIP6/CXCL5/S100A9/CXCL3 11 GO:0097530 THBS1/CXCL8/S100A8/IL1A/CXCL6/EDN1/PECAM1/SAA1/TNFAIP6/CXCL5/S100A9/CXCL3 12 GO:0042060 CLDN1/THBS1/IL6/SERPINE2/TNFAIP3/FN1/PLPP3/ITGB3/EREG/CD44/PLAT/IL1A/FGF2/EDN1/TNF/CD34/SAA1/ANGPT1 18 ` Gene enrichment analysis revealed significant enrichment for pathways related to immune cell movement and chemotaxis. Terms such as leukocyte migration, chemotaxis, neutrophil migration, and granulocyte migration exhibit high fold enrichment and very significant adjusted p-values, indicating that many differentially expressed genes play a role in immune cell-directed movement. Wound healing has also been significantly improved, with a focus on both tissue repair and immune responses. 3. KEGG Pathways `r kegg head(as.data.frame(kegg)) category subcategory ID Description GeneRatio hsa04657 Organismal Systems Immune system hsa04657 IL-17 signaling pathway 16/81 hsa05323 Human Diseases Immune disease hsa05323 Rheumatoid arthritis 14/81 hsa04668 Environmental Information Processing Signal transduction hsa04668 TNF signaling pathway 13/81 hsa04064 Environmental Information Processing Signal transduction hsa04064 NF-kappa B signaling pathway 10/81 hsa04061 Environmental Information Processing Signaling molecules and interaction hsa04061 Viral protein interaction with cytokine and cytokine receptor 9/81 hsa04060 Environmental Information Processing Signaling molecules and interaction hsa04060 Cytokine-cytokine receptor interaction 14/81 BgRatio RichFactor FoldEnrichment zScore pvalue p.adjust qvalue hsa04657 95/9497 0.16842105 19.746849 17.031762 5.371127e-17 1.101081e-14 8.311113e-15 hsa05323 95/9497 0.14736842 17.278493 14.789228 3.947424e-14 4.046110e-12 3.054060e-12 hsa04668 119/9497 0.10924370 12.808486 12.022429 1.769296e-11 1.209019e-09 9.125840e-10 hsa04064 105/9497 0.09523810 11.166373 9.715413 1.728659e-08 8.859378e-07 6.687181e-07 hsa04061 100/9497 0.09000000 10.552222 8.906138 1.540710e-07 6.208221e-06 4.686051e-06 hsa04060 298/9497 0.04697987 5.508244 7.333730 1.817040e-07 6.208221e-06 4.686051e-06 geneID Count hsa04657 IL6/TNFAIP3/PTGS2/FOSL1/CXCL1/CXCL8/S100A8/CCL20/CXCL6/MAPK15/CXCL2/TNF/LCN2/CXCL5/S100A9/CXCL3 16 hsa05323 IL6/ICAM1/CXCL1/CXCL8/CCL20/IL1A/CXCL6/CXCL2/LTB/TNF/CXCL5/ATP6V1B1/ANGPT1/CXCL3 14 hsa04668 IL6/TNFAIP3/PTGS2/ICAM1/CXCL1/CCL20/BIRC3/CXCL6/CXCL2/EDN1/TNF/CXCL5/CXCL3 13 hsa04064 TNFAIP3/PTGS2/ICAM1/CXCL1/CXCL8/BIRC3/CXCL2/LTB/TNF/CXCL3 10 hsa04061 IL6/CXCL1/CXCL8/CCL20/CXCL6/CXCL2/TNF/CXCL5/CXCL3 9 hsa04060 IL6/CXCL1/CXCL8/IL7R/CCL20/IL1A/CXCL6/CXCL2/INHBE/IL32/LTB/TNF/CXCL5/CXCL3 14 ` KEGG pathway analysis reveals that differentially expressed genes are highly enriched in immunological and inflammatory pathways. IL-17 signalling, TNF signalling, NF-kappa B signalling, cytokine-cytokine receptor interaction, and rheumatoid arthritis are all important pathways with high fold enrichment and very significant adjusted p-values. This suggests that the gene set is mostly engaged in immunological response, inflammation, and cytokine-mediated signalling. 4. Compare Up vs Down `r go_up = 10 dds 2-fold change Filter sig_genes lfc_cutoff) sig_genes 1, "Upregulated", ifelse(volcano_df$log2FoldChange 0, ] up_entrez 1):", nrow(sig_genes), "\n") cat("- Upregulated genes:", sum(sig_genes$log2FoldChange > 0), "\n") cat("- Downregulated genes:", sum(sig_genes$log2FoldChange Data loaded (integer counts) Metadata matches sample order Pre-filtered (≥10 reads) DESeq2 Run completed Significant genes identified Results exported GO/KEGG enrichment done Upregulated Vs Downregulated enrichment comparison
Background

Get Started Now

Ready to See
Tracer In Action?

Start for free or
Tracer Logo

Tracer is the first pipeline monitoring system purpose-built for high-compute workloads that lives in the OS.

2025 The Forge Software Inc. | A US Delaware Corporation, registered at 99 Wall Street, Suite 168 New York, NY 10005 | Terms & Conditions | Privacy Policy | Cookies Policy