Differential Gene Expression Analysis: RNA-Seq with Pathway Enrichment
A comprehensive tutorial on RNA-seq differential expression analysis using DESeq2 and pathway enrichment with clusterProfiler. Learn to identify genes and pathways responding to ALDH1A1 inhibition in ovarian cancer cells.
const metadata = ;
Differential Gene Expression Analysis: RNA-Seq with Pathway Enrichment
Use Case: ALDH1A1 inhibitor (compound 974) on OVCAR3 ovarian cancer cells
Design: 3 DMSO control vs 3 treated samples (5µM, 48h)
Goal: Identify genes and pathways responding to ALDH1A1 inhibition
Why RNA-Seq Analysis Differs from Microarray
The fundamental difference: RNA-seq produces discrete count data and microarray produces continuous intensities.
Key Implications:
Aspect
Microarray
RNA-Seq
Data type
Continuous intensities
Discrete read counts
Distribution
Normal (after log)
Negative binomial
Zero values
Rare (background signal)
Common (true non-expression)
Library size
Same for all samples
Varies per sample (needs normalization)
Statistical test
t-test on log values
GLM with count distribution
Why this matters:
Using t-tests directly on count data is statistically inappropriate since count distributions do not meet the t-test assumptions. In RNA-seq, the variance increases with the mean; therefore, genes with low counts have significantly different levels of uncertainty than genes with high counts. Proper differential expression analysis requires specialized methods that appropriately reflect the mean-variance relationship.
Tool Selection
DESeq2 for Differential Expression
Why DESeq2: DESeq2 is widely considered an industry-standard tool for differential expression analysis due to its robustness and extensive validation. It accurately models the variance-mean relationship by using dispersion shrinkage, which stabilises variance estimates across genes. The approach also incorporates built-in normalisation procedures to account for variations in library size between samples. Overall, DESeq2 is conservative in its statistical testing, thereby reducing the likelihood of false-positive results.
clusterProfiler for Pathway Analysis
Why clusterProfiler: ClusterProfiler is a robust and well-maintained Bioconductor package for functional enrichment and pathway analysis. It supports a wide range of annotation sources, such as GO, KEGG, Reactome, and MSigDb, making it extremely useful for biological interpretation. The package includes excellent visualisation choices, making it simple to express functional insights. Furthermore, clusterProfiler supports many organisms, boosting its adaptability to various sorts of genomic studies.
Part 1: Differential Expression
1. Environment Setup
`r
Install (run once)
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c("DESeq2", "apeglm", "org.Hs.eg.db",
"clusterProfiler", "enrichplot", "DOSE",
"pheatmap", "ggplot2", "ggrepel"))
Load
library(DESeq2)
library(ggplot2)
library(pheatmap)
library(org.Hs.eg.db)
library(clusterProfiler)
library(enrichplot)
dir.create("~/rna_seq_analysis", showWarnings = FALSE)
setwd("~/rna_seq_analysis")
set.seed(42)
`
2. Load Data
`r
Counts
counts all(colnames(counts) == rownames(metadata)) Must be TRUE
[1] TRUE
> head(counts)
GSM6040526 GSM6040527 GSM6040528 GSM6040529 GSM6040530 GSM6040531
ENSG00000223972 0 0 0 1 0 0
ENSG00000227232 5 5 4 6 1 9
ENSG00000278267 0 0 0 0 0 0
ENSG00000243485 0 0 0 0 0 0
ENSG00000284332 0 0 0 0 0 0
ENSG00000237613 0 0 0 0 0 0
> dim(counts)
[1] 58395 6
> head(metadata)
Sample_name Condition
GSM6040526 OVCAR3_DMSO control_1 control
GSM6040527 OVCAR3_DMSO control_2 control
GSM6040528 OVCAR3_DMSO control_3 control
GSM6040529 OVCAR3_974_1 treated
GSM6040530 OVCAR3_974_2 treated
GSM6040531 OVCAR3_974_3 treated
`
This dataset (GSE200641) was collected from the NCBI GEO database and contains RNA-seq read counts for 58,395 genes from six OVCAR3 ovarian cancer cell lines. The experimental design includes three control samples treated with DMSO and three samples treated with the ALDH1A1 inhibitor compound 974, which allows for examination of treatment-induced transcriptional alterations.
The counts file is a table that contains gene-level RNA-seq read counts for each sample. Each row names a gene by its Ensembl ID, and the columns correspond to the samples. The metadata file is a distinct table with sample names and experimental conditions (control or treated).
Check:
- All counts are integers (no decimals)
- Gene IDs as rownames (Ensembl format)
- 6 sample columns matching metadata
All validation checks should return TRUE, indicating that the counts are integers, gene IDs follow the Ensembl format, and sample columns match the metadata.
`r
all(counts %% 1 == 0) TRUE: all integer counts
all(grepl("^ENSG", rownames(counts))) TRUE: all Ensembl IDs
all(colnames(counts) == metadata$GSM_ID) TRUE: column-metadata match
`
3. Create DESeq Object
`r
dds dds
class: DESeqDataSet
dim: 58395 6
metadata(1): version
assays(1): counts
rownames(58395): ENSG00000223972 ENSG00000227232 ... ENSG00000277475 ENSG00000268674
rowData names(0):
colnames(6): GSM6040526 GSM6040527 ... GSM6040530 GSM6040531
colData names(2): Sample_name Condition
`
Design = ~ Condition will test treated vs control
4. Pre-filter
`r
Keep genes with ≥10 total reads
keep = 10
dds 0 (up) : 1165, 5.8%
LFC head(as.data.frame(res_shrink))
baseMean log2FoldChange lfcSE pvalue padj symbol entrez
ENSG00000227232 4.850715 0.014664765 0.1684407 0.63098526 NA
ENSG00000269981 3.011504 0.015300304 0.1698089 0.50022705 NA
ENSG00000241860 2.805192 -0.007335869 0.1688956 0.75444793 NA
ENSG00000279457 42.122024 0.318923411 0.4603402 0.02122829 0.09357522
ENSG00000228463 4.103909 -0.008105332 0.1682606 0.76834074 NA RPL23AP21 728481
ENSG00000237094 2.882101 0.012887623 0.1695576 0.56407724 NA
`
The dataset uses Ensembl gene IDs as rownames. For easy interpretability, we map these to gene symbols. Because most enrichment and pathway analysis tools perform best with Entrez IDs, we additionally convert the genes to their corresponding Entrez identifiers.
9. Filter Significant Genes
`r
Thresholds
padj_cutoff 2-fold change
Filter
sig_genes lfc_cutoff)
sig_genes 0), "\n")
cat("Downregulated:", sum(sig_genes$log2FoldChange head(as.data.frame(sig_genes))
baseMean log2FoldChange lfcSE pvalue padj symbol entrez
ENSG00000163347 14629.5449 -1.449817 0.1245526 1.797600e-32 2.679862e-28 CLDN1 9076
ENSG00000168209 2094.5182 -1.349249 0.1201882 1.723087e-30 1.284389e-26 DDIT4 54541
ENSG00000134954 2532.5570 -1.061181 0.1002469 2.065706e-27 8.490630e-24 ETS1 2113
ENSG00000139289 280.2012 -2.156727 0.2039613 2.278141e-27 8.490630e-24 PHLDA1 22822
ENSG00000138166 1978.4654 -1.163860 0.1155208 4.127276e-25 1.230589e-21 DUSP5 1847
`
Thresholds explained:
- padj 1: >2-fold change (biologically meaningful)
Why both? Statistical significance does not imply biological relevance. A tiny fold change (e.g., 1.02-fold) may be statistically significant but biologically meaningless.
We used a significance threshold of padj 1, "Upregulated",
ifelse(volcano_df$log2FoldChange 0, ]
up_entrez head(as.data.frame(go_bp))
ID Description GeneRatio BgRatio RichFactor FoldEnrichment zScore pvalue p.adjust qvalue
GO:0050900 GO:0050900 leukocyte migration 20/111 395/18805 0.05063291 8.577945 11.728935 1.617415e-13 3.704243e-10 2.657306e-10
GO:0006935 GO:0006935 chemotaxis 21/111 466/18805 0.04506438 7.634555 11.175126 3.585098e-13 3.704243e-10 2.657306e-10
GO:0042330 GO:0042330 taxis 21/111 468/18805 0.04487179 7.601929 11.144616 3.893739e-13 3.704243e-10 2.657306e-10
GO:1990266 GO:1990266 neutrophil migration 11/111 107/18805 0.10280374 17.416435 13.122261 3.529102e-11 2.518014e-08 1.806343e-08
GO:0097530 GO:0097530 granulocyte migration 12/111 149/18805 0.08053691 13.644114 11.940094 7.878583e-11 4.497095e-08 3.226072e-08
GO:0042060 GO:0042060 wound healing 18/111 450/18805 0.04000000 6.776577 9.557297 1.453951e-10 6.915959e-08 4.961288e-08
geneID Count
GO:0050900 THBS1/IL6/ICAM1/ITGB3/CXCL8/S100A8/CCL20/IL1A/CXCL6/EDN1/TNF/CD34/PECAM1/SAA1/TNFAIP6/CXCL5/CD69/S100A9/S1PR1/CXCL3 20
GO:0006935 THBS1/IL6/SEMA3C/FOSL1/ITGB3/CXCL1/CXCL8/S100A8/CCL20/FGF2/CXCL6/GPNMB/CXCL2/EDN1/SAA1/TNFAIP6/CXCL5/S100A9/S1PR1/ANGPT1/CXCL3 21
GO:0042330 THBS1/IL6/SEMA3C/FOSL1/ITGB3/CXCL1/CXCL8/S100A8/CCL20/FGF2/CXCL6/GPNMB/CXCL2/EDN1/SAA1/TNFAIP6/CXCL5/S100A9/S1PR1/ANGPT1/CXCL3 21
GO:1990266 CXCL8/S100A8/IL1A/CXCL6/EDN1/PECAM1/SAA1/TNFAIP6/CXCL5/S100A9/CXCL3 11
GO:0097530 THBS1/CXCL8/S100A8/IL1A/CXCL6/EDN1/PECAM1/SAA1/TNFAIP6/CXCL5/S100A9/CXCL3 12
GO:0042060 CLDN1/THBS1/IL6/SERPINE2/TNFAIP3/FN1/PLPP3/ITGB3/EREG/CD44/PLAT/IL1A/FGF2/EDN1/TNF/CD34/SAA1/ANGPT1 18
`
Gene enrichment analysis revealed significant enrichment for pathways related to immune cell movement and chemotaxis. Terms such as leukocyte migration, chemotaxis, neutrophil migration, and granulocyte migration exhibit high fold enrichment and very significant adjusted p-values, indicating that many differentially expressed genes play a role in immune cell-directed movement. Wound healing has also been significantly improved, with a focus on both tissue repair and immune responses.
3. KEGG Pathways
`r
kegg head(as.data.frame(kegg))
category subcategory ID Description GeneRatio
hsa04657 Organismal Systems Immune system hsa04657 IL-17 signaling pathway 16/81
hsa05323 Human Diseases Immune disease hsa05323 Rheumatoid arthritis 14/81
hsa04668 Environmental Information Processing Signal transduction hsa04668 TNF signaling pathway 13/81
hsa04064 Environmental Information Processing Signal transduction hsa04064 NF-kappa B signaling pathway 10/81
hsa04061 Environmental Information Processing Signaling molecules and interaction hsa04061 Viral protein interaction with cytokine and cytokine receptor 9/81
hsa04060 Environmental Information Processing Signaling molecules and interaction hsa04060 Cytokine-cytokine receptor interaction 14/81
BgRatio RichFactor FoldEnrichment zScore pvalue p.adjust qvalue
hsa04657 95/9497 0.16842105 19.746849 17.031762 5.371127e-17 1.101081e-14 8.311113e-15
hsa05323 95/9497 0.14736842 17.278493 14.789228 3.947424e-14 4.046110e-12 3.054060e-12
hsa04668 119/9497 0.10924370 12.808486 12.022429 1.769296e-11 1.209019e-09 9.125840e-10
hsa04064 105/9497 0.09523810 11.166373 9.715413 1.728659e-08 8.859378e-07 6.687181e-07
hsa04061 100/9497 0.09000000 10.552222 8.906138 1.540710e-07 6.208221e-06 4.686051e-06
hsa04060 298/9497 0.04697987 5.508244 7.333730 1.817040e-07 6.208221e-06 4.686051e-06
geneID Count
hsa04657 IL6/TNFAIP3/PTGS2/FOSL1/CXCL1/CXCL8/S100A8/CCL20/CXCL6/MAPK15/CXCL2/TNF/LCN2/CXCL5/S100A9/CXCL3 16
hsa05323 IL6/ICAM1/CXCL1/CXCL8/CCL20/IL1A/CXCL6/CXCL2/LTB/TNF/CXCL5/ATP6V1B1/ANGPT1/CXCL3 14
hsa04668 IL6/TNFAIP3/PTGS2/ICAM1/CXCL1/CCL20/BIRC3/CXCL6/CXCL2/EDN1/TNF/CXCL5/CXCL3 13
hsa04064 TNFAIP3/PTGS2/ICAM1/CXCL1/CXCL8/BIRC3/CXCL2/LTB/TNF/CXCL3 10
hsa04061 IL6/CXCL1/CXCL8/CCL20/CXCL6/CXCL2/TNF/CXCL5/CXCL3 9
hsa04060 IL6/CXCL1/CXCL8/IL7R/CCL20/IL1A/CXCL6/CXCL2/INHBE/IL32/LTB/TNF/CXCL5/CXCL3 14
`
KEGG pathway analysis reveals that differentially expressed genes are highly enriched in immunological and inflammatory pathways. IL-17 signalling, TNF signalling, NF-kappa B signalling, cytokine-cytokine receptor interaction, and rheumatoid arthritis are all important pathways with high fold enrichment and very significant adjusted p-values. This suggests that the gene set is mostly engaged in immunological response, inflammation, and cytokine-mediated signalling.
4. Compare Up vs Down
`r
go_up = 10
dds 2-fold change
Filter
sig_genes lfc_cutoff)
sig_genes 1, "Upregulated",
ifelse(volcano_df$log2FoldChange 0, ]
up_entrez 1):", nrow(sig_genes), "\n")
cat("- Upregulated genes:", sum(sig_genes$log2FoldChange > 0), "\n")
cat("- Downregulated genes:", sum(sig_genes$log2FoldChange
Data loaded (integer counts)
Metadata matches sample order
Pre-filtered (≥10 reads)
DESeq2 Run completed
Significant genes identified
Results exported
GO/KEGG enrichment done
Upregulated Vs Downregulated enrichment comparison