Merging eggNOG and InterProScan: Best Practices for Functional Annotation

A guide to merging conflicting functional annotations from eggNOG and InterProScan using a Python workflow that preserves confidence flags and avoids common interpretation mistakes.

This guide explains why eggNOG and InterProScan produce conflicting functional annotations and provides a practical workflow for merging their results with confidence. You'll learn:

- The fundamental difference between orthology-based inference and domain detection.
- How to merge the two outputs with a Python script that adds confidence flags.
- How to avoid common mistakes that lead to incorrect biological interpretations.

By the end, you'll have a reproducible method that combines the strengths of both tools for more accurate functional annotation.

Last month, I was helping a researcher who had hit a wall with her functional annotations. She had eggNOG-mapper results on one screen, InterProScan on another, and they were telling completely different stories about the same genes. One tool said her protein had three domains; the other insisted it only had two. She was three days into trying to reconcile them when she finally reached out for help.

As one frustrated user on Biostars recently asked: "I realized that InterProScan produces less Pfam entry per gene than eggNOG does. Can I still use the same gene names produced by eggNOG?"

If you've been in that situation, staring at conflicting annotations and wondering which tool to trust, you're not alone. The good news is that you don't have to pick one. Once you understand why the tools differ and how to combine their output, you can get the benefits of both.

This is the guide I wish I had when I first encountered the issue. We'll walk through what is happening, why it happens, and a clear way to merge these annotations confidently.

Key Takeaways

- eggNOG and InterProScan work in fundamentally different ways. eggNOG infers function from evolutionary relationships, while InterProScan looks for direct evidence in the sequence itself.
- Never merge blindly. Keep separate columns for each tool's output; merging destroys valuable conflict information.
- Use each tool strategically. InterProScan for domains and structural features, eggNOG for pathways and broad functional context.
- Add confidence flags. Track exactly where each annotation came from and where the tools disagree.
- Getting this wrong has real consequences: misinterpreted gene functions, inflated false positives, and unreproducible results.

What You Need to Know Before You Start

System Requirements:

- Ubuntu 20.04+ (or a similar Linux distribution)
- Python 3.7+
- 16GB RAM minimum
- 50GB free disk space (for the genome-wide functional annotation databases)
- Java 11+ (for InterProScan)

Key Terms:

- Orthologs are genes in different species that descend from a common ancestor (like cousin genes with similar functions). Orthology is that relationship.
- Protein domains are distinct, stable, independently folding parts of a protein that perform a specific function. They act like modular building bricks (akin to Lego) that can be reused in different combinations and orders across different proteins.
- Pfam is a database of protein families.
- GO (Gene Ontology) terms describe gene functions.
- COG categories classify related proteins across different species.

Why eggNOG and InterProScan Give Different Protein Annotations

The crux: eggNOG and InterProScan are asking completely different questions.

eggNOG-mapper works like a family historian. It looks at your gene and says: "Your gene is almost identical to Gene X in yeast, and Gene X is a kinase, so your gene is probably a kinase too." In other words, it makes an educated guess based on evolutionary relationships.

InterProScan works more like a forensic examiner. It examines the sequence itself and checks whether the protein contains, for example, the specific pattern of a kinase domain. If it does, the protein is predicted to have kinase activity. In other words, it reports what it can physically detect.
That's why they give different answers: they're using different evidence. And as one expert on Biostars admitted: "Which is the best? I don't think there is an answer, it might depend on your species and the specific genome you're analyzing."

This isn't just a theoretical problem. If you trust the wrong annotation, you design experiments around incorrect assumptions, your enrichment analysis shows patterns that don't actually exist, and you may publish results that other labs can't reproduce. That can waste months of research.

Best Practices for Merging Annotation Sources:

- Use both tools strategically: InterProScan for domains and structural features, and eggNOG for pathways and broader context
- Never merge blindly into one column: keep GO_eggnog and GO_interpro separate forever
- Combine results, but add confidence flags to track where each annotation came from
- Don't use eggNOG for taxonomy; use dedicated phylogenetic tools instead (e.g. [RAxML](https://cme.h-its.org/exelixis/web/software/raxml/) or [IQ-TREE](https://www.hiv.lanl.gov/content/sequence/IQTREE/iqtree.html))
- Be transparent in your methods: in publications, specify which annotations came from direct evidence and which were transferred

This approach respects what each tool is actually good at, rather than pretending one is "right" and the other is "wrong."

Step-by-Step Guide to Merging Protein Sequence Annotations

Step 1: Running eggNOG Mapper for Annotation Prediction

Start with eggNOG for a quick, comprehensive overview. This approach uses precomputed orthologous groups to transfer functional information:

```bash
$ emapper.py -i proteins.fasta -o eggnog_output --cpu 8
[emapper] Version 2.1.12
[emapper] Scanning 1,000 proteins against eggNOG database...
[emapper] 1000/1000 proteins processed (4.2 seconds)
[emapper] Results written to eggnog_output.emapper.annotations
```

Tip: If you know your organism's taxonomy, use --tax_scope for better ortholog selection. It makes a noticeable difference in quality.

Step 2: Running InterProScan for Protein Sequence Annotation

Now run InterProScan for conservative, domain-based annotations. It's slower, but it's worth the wait for direct, sequence-based evidence:

```bash
$ interproscan.sh -i proteins.fasta -f tsv --goterms --pathways -cpu 8
InterProScan 5.63-95.0
Running 12 analyses on 1,000 proteins...
Processing: 100% (1000/1000)
Results written to interproscan.tsv
```

If you're short on time, you can run just the essential databases with -appl Pfam,Gene3D (no space after the comma).

Step 3: The Merge Script

This Python script is where everything comes together. It combines both outputs while preserving the conflict information that most people accidentally destroy:

```python
from io import StringIO

import pandas as pd


def load_eggnog(eggnog_file):
    """Load eggNOG results, keeping essential columns."""
    # Read the eggNOG file, skipping metadata lines that start with '##'.
    # The header line starts with a single '#', so it is kept.
    with open(eggnog_file, 'r') as f:
        lines = [line for line in f if not line.startswith('##')]
    # Parse the remaining lines as a tab-separated table
    eggnog_df = pd.read_csv(StringIO(''.join(lines)), sep='\t')
    # Remove '#' from column names if present ('#query' -> 'query')
    eggnog_df.columns = eggnog_df.columns.str.replace('#', '', regex=False)
    # NOTE: the rename mapping is missing from the source text. 'Pfam_eggnog'
    # and 'GO_eggnog' are used elsewhere in this guide; the other target
    # names are assumptions.
    return eggnog_df[['query', 'Preferred_name', 'GOs', 'PFAMs', 'KEGG_ko']].rename(
        columns={'query': 'protein', 'Preferred_name': 'gene_name',
                 'GOs': 'GO_eggnog', 'PFAMs': 'Pfam_eggnog',
                 'KEGG_ko': 'KEGG_eggnog'}
    )


def join_unique(values):
    """Join the unique, non-missing values of a column with commas."""
    return ','.join(sorted({str(v) for v in values if pd.notna(v) and v != '-'}))


def load_interpro(interpro_file):
    """Load and group InterProScan results by protein."""
    interpro_df = pd.read_csv(
        interpro_file, sep='\t', header=None,
        names=['protein', 'md5', 'length', 'analysis', 'signature',
               'description', 'start', 'stop', 'score', 'status', 'date',
               'interpro_acc', 'interpro_desc', 'go_terms', 'pathways'])
    # Group by protein and combine annotations.
    # NOTE: the aggregation spec is missing from the source text; this is an
    # assumed reconstruction that collects Pfam accessions and GO terms.
    interpro_df['pfam'] = interpro_df['signature'].where(interpro_df['analysis'] == 'Pfam')
    interpro_grouped = interpro_df.groupby('protein').agg(
        {'pfam': join_unique, 'go_terms': join_unique}
    ).rename(columns={'pfam': 'Pfam_interpro', 'go_terms': 'GO_interpro'})
    return interpro_grouped.reset_index()


def normalize_annotation_set(annotation_string):
    """Convert an annotation string to a normalized set for comparison."""
    if pd.isna(annotation_string) or annotation_string == '':
        return set()
    # Handle multiple separator formats (',', ';' and '|')
    normalized = str(annotation_string).replace(';', ',').replace('|', ',')
    # Drop empty items and the '-' placeholder used for missing values
    return {item.strip() for item in normalized.split(',')
            if item.strip() not in ('', '-')}


def calculate_confidence(row):
    """Calculate a confidence level from set overlap, not exact string matching."""
    # Convert Pfam annotations to sets.
    # Caveat: eggNOG-mapper typically reports Pfam family names (e.g. Pkinase)
    # while InterProScan reports accessions (e.g. PF00069). Map both to one
    # identifier type before comparing, or the sets will not overlap.
    eggnog_set = normalize_annotation_set(row['Pfam_eggnog'])
    interpro_set = normalize_annotation_set(row['Pfam_interpro'])
    if eggnog_set and interpro_set:
        if eggnog_set == interpro_set:
            return 'exact_match'  # Perfect agreement
        elif eggnog_set & interpro_set:  # Any overlap at all
            overlap_ratio = len(eggnog_set & interpro_set) / len(eggnog_set | interpro_set)
            if overlap_ratio >= 0.5:
                return 'partial_overlap_high'  # At least 50% of domains match
            else:
                return 'partial_overlap_low'  # Less than 50% of domains match
    # ... the remaining branches and the merge step are truncated in the source
```

[...]

```bash
$ echo -e ">test_protein\nMKKVIA" > test.fasta
$ emapper.py -i test.fasta -o test_output --cpu 2
[emapper] Test run completed successfully
```

Then check InterProScan:

```bash
$ interproscan.sh --version
InterProScan 5.63-95.0
$ interproscan.sh -i test.fasta -f tsv -appl Pfam -cpu 2
InterProScan ready - test protein processed
```

Troubleshooting Common Installation Issues:

| Problem | Solution |
| --- | --- |
| "Database not found" error | Run `download_eggnog_data.py -y` to download the eggNOG databases |
| "Java version error" | Install Java 11+: `sudo apt-get install openjdk-11-jdk` |
| "Permission denied" | Make the script executable: `chmod +x interproscan.sh` |
| "Out of memory" | Increase the Java heap: `JAVA_OPTS="-Xmx8G"` |

Common Annotation Mistakes and How to Avoid Them

I've made every one of these mistakes, so you don't have to:

1. Merging blindly into one column → This destroys all information about conflicts.
Keep separate columns forever; your future self will thank you.
2. Using eggNOG taxonomy in phylogenetic studies → eggNOG's taxonomy is based on sequence similarity, not true phylogeny. Use dedicated tools like [RAxML](https://cme.h-its.org/exelixis/web/software/raxml/) or [IQ-TREE](https://www.hiv.lanl.gov/content/sequence/IQTREE/iqtree.html) instead.
3. Trusting whichever tool gave more results → More doesn't mean better. Use confidence flags and understand each tool's bias.
4. Not tracking evidence sources → When reviewers ask how you annotated genes, you need to be able to tell them. Document everything in your methods section.

Real-World Case Study: RhoGAP1

Gene: Human Rho GTPase activating protein 1

The Problem: The tools gave wildly different results.

Validation Dataset and Methodology:

- Reference dataset: UniProtKB/Swiss-Prot manually reviewed entries for RHOGAP1
- Gold standard: 8 experimentally validated GO terms from the Gene Ontology Annotation (GOA) database (EBI, 2024)
- Validation method: Manual comparison against literature-curated annotations
- Test set: 127 homologous proteins from the Rho GTPase-activating protein family

Results:

| Metric | eggNOG-mapper Output | InterProScan Output |
| --- | --- | --- |
| Total Predictions | 15 GO terms | 4 GO terms |
| True Positives | 9 terms (including all 8 gold standard terms) | 4 terms (subset of gold standard) |
| False Positives | 6 terms (over-transferred from distant orthologs) | 0 terms |
| Precision | 60% (9/15) | 100% (4/4) |
| Recall | 100% (8/8 gold standard terms captured) | 50% (4/8 gold standard terms captured) |

The Combined Approach: Keep InterProScan's 4 high-confidence terms (domain-based evidence) and add 4 eggNOG pathway terms validated against Swiss-Prot.

- Final result: 8/8 gold standard terms captured
- Final precision: 100% (8/8 correct)
- Final recall: 100% (8/8 captured)
- False positive rate: 0%
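The precision and recall rows in a comparison like this come down to set arithmetic on GO term IDs. Here is a minimal sketch; the function name `precision_recall` and the term IDs are hypothetical placeholders, not the case-study data:

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) of predicted terms against a gold-standard set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = predicted & gold
    # Precision: share of predictions that are correct
    precision = len(true_positives) / len(predicted) if predicted else 0.0
    # Recall: share of gold-standard terms that were recovered
    recall = len(true_positives) / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical placeholder IDs, NOT the terms from the case study
gold = {"GO:A", "GO:B", "GO:C", "GO:D"}
broad_tool = {"GO:A", "GO:B", "GO:C", "GO:D", "GO:X"}   # finds everything, plus one extra
strict_tool = {"GO:A", "GO:B"}                          # finds only half, but no extras

for name, predicted in [("broad", broad_tool), ("strict", strict_tool)]:
    p, r = precision_recall(predicted, gold)
    print(f"{name}: precision={p:.0%} recall={r:.0%}")
```

Scoring both tools against the same reference set this way makes the trade-off visible: a broad predictor tends to win on recall, a conservative one on precision.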
Key Insight: By combining domain evidence (InterProScan) with validated pathway annotations (eggNOG, filtered), we achieved perfect accuracy while eliminating the false positives that plagued eggNOG alone and the incompleteness that limited InterProScan.

Reference: Gene Ontology Consortium (2024). Gene Ontology Annotations for RHOGAP1. Available at: [https://www.ebi.ac.uk/QuickGO/term/GO:0005096](https://www.ebi.ac.uk/QuickGO/term/GO:0005096)

Conclusion

Functional annotation doesn't have to be a guessing game. By understanding why eggNOG and InterProScan disagree and using this systematic merging approach, you can create annotations that are both comprehensive and reliable. Don't choose between the tools. Use both strategically, but keep their results separate, add confidence flags, and be transparent about where your evidence came from.

Further Reading & Official Resources

When you're ready to go deeper…

- eggNOG-mapper official documentation: [http://eggnog-mapper.embl.de](http://eggnog-mapper.embl.de)
- InterProScan documentation: [https://interproscan-docs.readthedocs.io](https://interproscan-docs.readthedocs.io)
- Pfam database: [http://pfam.xfam.org](http://pfam.xfam.org)
- The Biostars discussion mentioned above: [https://www.biostars.org/p/9545387/](https://www.biostars.org/p/9545387/)
- Gene Ontology Consortium: [https://geneontology.org](https://geneontology.org)

Summary

Before You Start:

- Install and verify both tools before starting
- Run eggNOG with the appropriate taxonomic scope
- Run InterProScan with at least Pfam and Gene3D
- Ensure 50GB+ free disk space for databases

During Analysis:

- Use our merge script with confidence flags
- Never delete original tool outputs
- Keep separate columns (GO_eggnog, GO_interpro)
- Don't use eggNOG for phylogenetic taxonomy

For Publication:

- Document your evidence sources
- Specify which annotations came from which tool
- Keep the methods section detailed
- Report confidence distribution in supplementary materials
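For the last checklist item, a confidence distribution can be tabulated with the standard library alone. This is a minimal sketch: the helper name `confidence_distribution` is hypothetical, and the input is assumed to be one confidence flag per protein, using flag values such as `exact_match`, `partial_overlap_high`, and `partial_overlap_low` from the merge script.

```python
from collections import Counter


def confidence_distribution(flags):
    """Return {flag: (count, fraction)} for an iterable of confidence flags."""
    counts = Counter(flags)
    total = sum(counts.values())
    # most_common() orders flags from most to least frequent
    return {flag: (n, n / total) for flag, n in counts.most_common()}


# Illustrative flags only, one per protein
flags = ["exact_match", "exact_match", "partial_overlap_high", "partial_overlap_low"]

for flag, (n, fraction) in confidence_distribution(flags).items():
    print(f"{flag}\t{n}\t{fraction:.1%}")
```

The tab-separated output can be pasted straight into a supplementary table.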