Handling Multi-Mapped Reads in STAR-Aligned Ribo-seq Data for TE Analysis

Multi-mapping is a key challenge in Ribo-seq data analysis, particularly when using STAR for alignment. This article explores whether multi-mapped reads should be kept or discarded when calculating translation efficiency.

Introduction

Multi-mapping is a key challenge in Ribo-seq data analysis, particularly when using STAR for alignment. Ribo-seq, a technique that profiles ribosome-bound mRNA fragments, generates short reads (~25–30 nt), which are prone to multi-mapping, that is, aligning to more than one location in the genome or transcriptome. The question arises: should multi-mapped reads be kept or discarded when calculating translation efficiency (TE) in Ribo-seq data aligned with STAR? This article explores this issue and provides practical guidance for bioinformatics engineers working with Ribo-seq data.

Handling Multi-Mapped Reads: RNA-seq vs Ribo-seq

In RNA-seq workflows, multi-mapped reads are relatively rare because RNA-seq reads are much longer (~75–150 nt). Only about 5–15% of RNA-seq reads typically align to multiple loci. For this reason, most RNA-seq analysis tools (e.g., STAR, featureCounts) exclude multi-mappers from gene quantification without much impact on downstream analysis ([Reddit, r/bioinformatics](https://www.reddit.com/r/bioinformatics/comments/hbyag2/how_do_you_deal_with_multi_mapped_reads_in_rna_seq/)).

Ribo-seq, on the other hand, generates shorter reads (~28 nt), and the majority of reads often multi-map. A human HeLa Ribo-seq dataset showed that only ~14% of mapped reads were uniquely located, with the remaining reads mapping to multiple loci. Similarly, users report that with STAR, only 13–15% of reads map uniquely in human Ribo-seq ([Biostars](https://www.biostars.org/p/9581037/)).

Simply discarding all multi-mapped reads could remove a substantial portion of the data, potentially losing valuable biological information, especially for genes in repetitive regions or gene families. However, including multi-mapped reads without handling them properly could inflate gene counts, because the same fragment is counted multiple times across loci.
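The trade-off described above can be made concrete with a toy example. The following is a minimal sketch, not taken from any particular tool: the read IDs, gene names, and the `count_reads` helper are all hypothetical. It compares unique-only counting, naive counting of every alignment, and fractional counting (each read contributes 1/n to each of its n loci).

```python
from collections import defaultdict

# Hypothetical toy alignments as (read_id, gene) pairs.
# r2 aligns to three genes, so it is a multi-mapped read.
alignments = [
    ("r1", "GENE_A"),
    ("r2", "GENE_A"),
    ("r2", "GENE_B"),
    ("r2", "GENE_C"),
    ("r3", "GENE_B"),
]


def count_reads(alignments, mode):
    """Count reads per gene under one of three illustrative strategies."""
    # Collect the set of genes each read aligns to.
    hits = defaultdict(set)
    for read_id, gene in alignments:
        hits[read_id].add(gene)

    counts = defaultdict(float)
    for genes in hits.values():
        n = len(genes)
        if mode == "unique":
            # Keep only reads with a single target; drop multi-mappers.
            if n == 1:
                counts[next(iter(genes))] += 1.0
        elif mode == "naive":
            # Every alignment counts as a full read (over-counts).
            for gene in genes:
                counts[gene] += 1.0
        elif mode == "fractional":
            # Each read contributes a total weight of 1, split evenly.
            for gene in genes:
                counts[gene] += 1.0 / n
        else:
            raise ValueError(f"unknown mode: {mode}")
    return dict(counts)


unique_counts = count_reads(alignments, "unique")
naive_counts = count_reads(alignments, "naive")
fractional_counts = count_reads(alignments, "fractional")

for name, counts in [("unique", unique_counts),
                     ("naive", naive_counts),
                     ("fractional", fractional_counts)]:
    print(name, round(sum(counts.values()), 2))
```

With three input reads, unique-only counting keeps two of them, naive counting reports a total of five, and fractional counting preserves the total of three. The naive total exceeding the number of reads is the inflation the paragraph above warns about.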
In practice, many Ribo-seq workflows discard multi-mapped reads for clarity, relying on uniquely mapped reads as the most reliable measure of ribosome occupancy per gene. The downside is potential bias: genes with repetitive elements or conserved paralogs lose more reads when multi-mappers are discarded. It is therefore essential to consider the biological context when deciding how to handle multi-mapped reads.

Alignment Tweaks vs. Downstream Filtering

One important question is whether alignment adjustments in STAR can reduce multi-mapping, or whether it is better to handle multi-mappers downstream. Alignment tweaks can reduce some multi-mapping, but they don't solve the problem entirely.

Possible Technical Tweaks

rRNA contamination: If rRNA reads were not removed during library preparation, they can dominate the data and often multi-map to multiple rDNA copies. Address this by removing rRNA reads, either through experimental depletion or by filtering reads that align to known rRNA sequences ([Biostars](https://www.biostars.org/p/9581037/)).

Adapter contamination and low-quality reads: Trimming adapter sequences and low-quality tails can sometimes increase the unique mapping rate. For example, adapter trimming with bbduk has been reported to improve the unique fraction in some Ribo-seq datasets.

Alignment parameters: More stringent alignment settings, such as end-to-end alignment (--alignEndsType EndToEnd) and strict mismatch limits (--outFilterMismatchNmax 2), can help reduce spurious alignments that lead to multi-mapping. However, these tweaks don't resolve the biological causes of multi-mapping, such as repetitive sequences.

Biological Multi-Mapping

In Ribo-seq, many multi-mapped reads come from genuine repetitive sequences or gene families. STAR reports up to 10 alignments per read by default, and reads mapping to more than 10 loci are considered unmapped.
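As a rough illustration of the alignment parameters mentioned above, the sketch below assembles a STAR command line in Python. The flags are standard STAR options, but the helper name `build_star_command`, the index directory, the FASTQ file name, the output prefix, and the thread count are placeholders assumed for this example, and the values should be tuned to your own data.

```python
import shlex


def build_star_command(genome_dir, fastq, out_prefix,
                       threads=8, mismatch_max=2, multimap_max=10):
    """Return a STAR argument list with stricter Ribo-seq settings.

    All paths are hypothetical placeholders supplied by the caller.
    """
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn", fastq,
        "--readFilesCommand", "zcat",            # input is gzip-compressed
        "--alignEndsType", "EndToEnd",           # no soft-clipping of read ends
        "--outFilterMismatchNmax", str(mismatch_max),
        "--outFilterMultimapNmax", str(multimap_max),  # STAR's default is 10
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--outFileNamePrefix", out_prefix,
    ]


cmd = build_star_command("star_index/", "riboseq_trimmed.fastq.gz", "riboseq_")
print(shlex.join(cmd))
```

The list form can be passed to `subprocess.run(cmd, check=True)` once STAR is installed and the placeholder paths point to real files. Applying the same settings to the matching RNA-seq library keeps the two datasets comparable.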
Lowering the --outFilterMultimapNmax option to limit the number of allowed alignments simply forces STAR to discard multi-mapped reads; it does not improve the unique mapping rate. Mapping to a transcriptome reference (instead of the whole genome) might reduce some multi-mapping by removing non-coding and repetitive intronic regions, but ambiguity between similar transcripts remains, so this does not fully address the issue either.

What is a "Normal" Multi-Mapping Rate in Ribo-seq?

In RNA-seq, roughly 85–95% of reads are typically uniquely aligned, with the remaining 5–15% multi-mapping. In Ribo-seq, much lower unique mapping rates are common, with values of 10–30% unique even after cleaning up contaminants like rRNA. For example, a human Ribo-seq dataset showed ~50% of reads mapped somewhere in the genome, but only 14% were uniquely mapped to a single location ([PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC8323483/)).

If your dataset shows extremely low unique mapping rates (e.g., close to 0%), it could indicate rRNA contamination or adapter issues. Multi-mapping rates above 50% are common in Ribo-seq, especially with short reads that align to repetitive regions like rRNAs or tRNAs. After quality control (QC), such rates are not a red flag in themselves but should be interpreted carefully. The key is to process the Ribo-seq and matching RNA-seq datasets consistently so that any multi-mapping biases are comparable between the two.

How Different Tools Handle Multi-Mapped Reads

Modern alignment and quantification tools take different approaches to multi-mapping:

STAR and HISAT2 (spliced genome aligners): These aligners report multi-mapped reads but do not decide how to treat them downstream. By default, STAR ignores multi-mappers in the GeneCounts mode, while HISAT2 reports a limited number of alignments per read, which can be adjusted with the -k option.
Pseudo-aligners (Salmon, Kallisto): These tools retain multi-mapped reads and apply an Expectation-Maximization (EM) algorithm to distribute the contribution of each multi-mapped read across its potential sources. This captures data from multi-mapped reads without over-counting them ([PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC8323483/)).

RSEM and MMSEQ (alignment-based quantifiers): These tools compute assignment probabilities for each read, distributing its count fractionally across genes or isoforms based on how many other reads map to each possible location.

featureCounts and HTSeq: These tools typically drop multi-mapped reads from counting by default. Options are available to either include multi-mappers or use fractional counting (e.g., a read aligned to five loci contributes 0.2 to each).

mmquant: This tool aggregates genes that share multi-mapped reads and distributes the count so that these reads aren't double-counted, which is particularly useful for gene families.

For most Ribo-seq TE studies, unique-only counting is the most straightforward and conservative method, but understanding how other tools treat multi-mappers helps if you want to include them.

Should Multi-Mapped Reads Be Included in TE Quantification?

When calculating TE (translation efficiency), it's important to handle multi-mapped reads consistently for both Ribo-seq and RNA-seq datasets. You can take one of three approaches:

1. Exclude multi-mapped reads entirely: Count only uniquely mapped reads for both RNA-seq and Ribo-seq. This is a common and simple approach that avoids ambiguous read assignments, but you lose coverage, especially for genes in repetitive regions.
2. Count multi-mappers naively: Count each alignment of a multi-mapped read as a separate read. This approach overcounts genes and is not recommended for TE analysis, as it can falsely inflate counts for multi-mapped regions.
3. Assign fractional weights to multi-mappers: Tools like Salmon, RSEM, and mmquant can distribute the read count of a multi-mapped read across all the genes or transcripts it maps to. This approach uses more data without double-counting, though it introduces complexity and requires careful normalization across datasets.

For most gene-level TE studies, the safest approach is option 1: use only uniquely mapped reads for both Ribo-seq and RNA-seq, so that the same counting rule applies to both sides of the ratio.

When Do Multi-Mapped Reads Matter Biologically?

There are several biological contexts in which multi-mapped reads are essential:

Repetitive elements and transposons: Reads from repetitive sequences such as transposons multi-map to many or all copies in the genome. If you discard multi-mappers, you lose the ability to track transposon translation. Instead, aggregate these reads at the family level.

Gene families and paralogs: Highly similar genes within a family (e.g., histone genes, heat-shock proteins) share large portions of their sequence, causing multi-mapping. Multi-mappers are relevant for gene family-level analysis, where you might sum the counts across the family instead of treating individual genes separately.

Isoform-specific translation: For isoform-level analysis, multi-mapping is inevitable. Reads from exons shared by multiple isoforms map to all of those isoforms. Specialized tools like Ribotricer and isoform-aware quantifiers handle this by assigning multi-mapped reads to individual isoforms probabilistically.

Polymorphic or duplicated genomic regions: In studies of copy number variation (CNV) or pseudogenes, multi-mapped reads could indicate new duplications or genomic segments that are not captured in the reference. These reads are important for understanding duplication or variation in the genome.

In summary, multi-mapped reads matter biologically when dealing with repetitive sequences, gene families, isoform expression, or polymorphisms.
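The family-level aggregation suggested above for gene families and repeats can be sketched as follows. This is an illustrative example only: the gene names, counts, and the helpers `aggregate_by_family` and `translation_efficiency` are hypothetical, and TE is assumed here to be a simple ratio of depth-normalized (counts-per-million) Ribo-seq to RNA-seq counts, which is one common simplification and not the only definition in use.

```python
from collections import defaultdict


def aggregate_by_family(gene_counts, gene_to_family):
    """Sum per-gene counts into family-level counts.

    Genes without a family entry are kept under their own name.
    """
    family_counts = defaultdict(float)
    for gene, count in gene_counts.items():
        family_counts[gene_to_family.get(gene, gene)] += count
    return dict(family_counts)


def translation_efficiency(ribo_counts, rna_counts, min_rna=1.0):
    """Assumed definition: TE = Ribo-seq CPM / RNA-seq CPM per feature."""
    ribo_total = sum(ribo_counts.values())
    rna_total = sum(rna_counts.values())
    if ribo_total == 0 or rna_total == 0:
        raise ValueError("both libraries need at least one counted read")
    te = {}
    for feature, rna in rna_counts.items():
        if rna < min_rna:
            continue  # skip features with too little RNA-seq support
        ribo_cpm = ribo_counts.get(feature, 0.0) / ribo_total * 1e6
        rna_cpm = rna / rna_total * 1e6
        te[feature] = ribo_cpm / rna_cpm
    return te


# Hypothetical counts; HIST_A and HIST_B stand in for two close paralogs.
gene_to_family = {"HIST_A": "HIST_family", "HIST_B": "HIST_family"}
ribo = {"HIST_A": 30, "HIST_B": 20, "GENE_X": 50}
rna = {"HIST_A": 100, "HIST_B": 100, "GENE_X": 200}

ribo_family = aggregate_by_family(ribo, gene_to_family)
rna_family = aggregate_by_family(rna, gene_to_family)
te = translation_efficiency(ribo_family, rna_family)
print(te)
```

Summing at the family level sidesteps the question of which paralog a shared read came from, at the cost of losing per-gene resolution. The same aggregation has to be applied to both the Ribo-seq and RNA-seq counts before taking the ratio.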
If your biological question involves any of these areas, don't discard multi-mappers; instead, model them appropriately.

Table: TE Distortion With vs Without Multi-Mappers

Note that in this table, "TE" refers to transposable elements (as quantified by tools such as TEtranscripts), not translation efficiency.

| Feature | With Multi-Mappers (Multi-mapper-Aware Strategy) | Without Multi-Mappers (Discarded or Ignored) |
| --- | --- | --- |
| Read Assignment | Ambiguously mapped reads are systematically assigned to their most probable source TE loci, often using statistical models or iterative algorithms (e.g., TEtranscripts). | Multi-mapped reads are simply discarded from the analysis. |
| TE Quantification | Provides a more accurate, comprehensive, and less biased estimation of TE expression or abundance across the genome. | Leads to significant under-quantification of TEs, as many reads originating from highly repetitive elements are ignored. |
| Data Interpretation | Enables the detection of a substantial portion of regulatory events and gene functions associated with TEs. | Can result in a skewed view of the genome, potentially hindering the identification of relevant biological processes and causing artificial "distortion" in the results. |
| Visual Appearance | A coverage plot or a scatter plot of expression levels would show a more complete and potentially higher signal for TEs, distributed across various loci. | The visual would show low or no signal in many repetitive regions, with "gaps" or "dropouts" where multi-mappers were prevalent, creating a distorted picture of TE activity. |

Conclusion

In Ribo-seq TE analysis, multi-mapped reads present both challenges and opportunities. For most analyses, discarding multi-mappers is the safest route, as it reduces ambiguity and supports reproducible TE estimates. However, multi-mapped reads are essential in certain biological contexts, particularly when analyzing gene families, isoform translation, or repetitive elements. Therefore, always handle multi-mappers deliberately and consistently in both your Ribo-seq and RNA-seq datasets, and document your strategy. Doing so makes your TE calculations both more reliable and more biologically meaningful.

Summary

This article addresses the challenge of multi-mapping in Ribo-seq data and its impact on translation efficiency (TE) analysis. It compares the handling of multi-mapped reads in RNA-seq versus Ribo-seq workflows, and discusses alignment strategies such as STAR adjustments versus post-alignment filtering. It outlines how various tools approach multi-mappers and presents a set of best practices for robust TE quantification. It also highlights biological cases where multi-mappers should not be discarded, such as repetitive sequences and gene families. By understanding how to handle multi-mapped reads in Ribo-seq data, bioinformatics engineers can make informed decisions about multi-mappers in their pipelines and improve the accuracy and reproducibility of their TE analyses.

References

1. Biostars. "Multi-mapped reads in Ribo-Seq data, discard or keep?"
2. Reddit, r/bioinformatics. "STAR Aligner - High Percentage of Reads Mapped to Multiple Loci"
3. Reddit, r/bioinformatics. "How do you deal with multi-mapped reads in RNA-seq?"
4. PMC, NPW Wu et al., 2021. "A tool for analyzing and visualizing ribo-seq data at the isoform level."
5. Reddit, r/bioinformatics. "STAR alignments with very similar genes"