How to Properly Filter Missing Genotypes in bcftools
A practical guide to filtering missing genotypes in VCF files using bcftools, with examples and best practices for handling missing data in variant calling.
Introduction
Missing genotypes are a common challenge in variant calling and downstream analysis. Whether you're working with whole-genome sequencing, exome data, or genotyping arrays, understanding how to properly filter missing data in bcftools is essential for maintaining data quality while preserving valuable variants.
In this guide, we'll walk through the different approaches to filtering missing genotypes using bcftools, explain the syntax, and provide practical examples you can apply to your own data.
What Are Missing Genotypes?
In VCF files, missing genotypes are represented by ./. (for diploid) or . (for haploid). These occur when:
- The sequencing depth at a position was too low to make a confident call
- The variant caller couldn't determine the genotype
- The sample was not sequenced at that position
- Quality filters excluded the genotype
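To make the representation concrete, here is a minimal sketch that counts fully missing calls per site with plain awk, without bcftools. The three records are made-up toy data and missing_per_site is a hypothetical helper name. Only calls in which every allele is . are counted; how bcftools itself treats half-missing calls such as ./1 may differ.

```bash
# Toy VCF records (made-up data): three samples, GT is the first FORMAT field.
toy_vcf() {
  printf 'chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:12\t./.:0\t1/1:20\n'
  printf 'chr1\t200\t.\tC\tT\t45\tPASS\t.\tGT:DP\t./.:0\t.|.:0\t0/0:15\n'
  printf 'chrX\t300\t.\tG\tA\t60\tPASS\t.\tGT:DP\t.:0\t0:9\t1:11\n'
}

# missing_per_site (hypothetical helper): reads VCF text on stdin and prints
# CHROM, POS, number of missing calls, and fraction of missing calls.
missing_per_site() {
  awk -F'\t' '
    /^#/ { next }                      # skip header lines
    {
      n = NF - 9                       # sample columns start at field 10
      if (n < 1) next                  # no samples, nothing to count
      miss = 0
      for (i = 10; i <= NF; i++) {
        split($i, f, ":")              # GT comes first in each sample column
        if (f[1] ~ "^[.]([/|][.])*$") miss++   # ".", "./." or ".|."
      }
      printf "%s\t%s\t%d\t%.3f\n", $1, $2, miss, miss / n
    }'
}

missing_table=$(toy_vcf | missing_per_site)
printf '%s\n' "$missing_table"
```

On the toy data this reports 1, 2, and 1 missing calls for the three sites, which is the same per-site quantity that the bcftools missing-data expressions summarize.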
Basic bcftools Filter Syntax
The bcftools filter and bcftools view commands provide powerful options for handling missing genotypes. Here's the basic syntax:
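```bash
# Filter sites by missing-genotype rate, site quality, and depth
# (-Oz writes bgzip-compressed output to match the .vcf.gz name)
bcftools view -i 'F_MISSING < 0.1 && QUAL > 30 && INFO/DP > 10' -Oz -o high_quality.vcf.gz input.vcf.gz
```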
This command keeps only sites with:
- Less than 10% missing genotypes (F_MISSING < 0.1)
- Site quality (QUAL) above 30
- Total read depth across samples (INFO/DP) greater than 10
Setting Missing Genotypes
You can also set low-quality genotypes to missing before filtering:
```bash
# Set genotypes with GQ < 20 to missing, then keep sites with <20% missing genotypes
bcftools filter -S . -e 'FMT/GQ < 20' input.vcf.gz | \
bcftools view -i 'F_MISSING < 0.2' -Oz -o filtered.vcf.gz
```
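The -S . flag (long form --set-GTs) sets the genotype of every sample that matches the exclusion expression (-e), here GQ below 20, to missing. The sites themselves are kept, and the second command then drops sites where 20% or more of the genotypes are missing.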
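A plugin-based alternative is sometimes used for the same masking step. The sketch below assumes your bcftools installation ships the setGT plugin; mask_low_gq and its argument order are hypothetical, chosen only for illustration.

```bash
# mask_low_gq IN GQ_MIN MAX_MISSING OUT  (hypothetical helper)
# 1) set genotypes with FMT/GQ below GQ_MIN to missing via the setGT plugin
# 2) keep sites whose missing fraction is below MAX_MISSING
mask_low_gq() {
  bcftools +setGT "$1" -- -t q -n . -i "FMT/GQ<$2" |
    bcftools view -i "F_MISSING<$3" -Oz -o "$4" -
}

# Example call (file names are placeholders):
#   mask_low_gq input.vcf.gz 20 0.2 filtered.vcf.gz
```

Wrapping the two steps in a function keeps both thresholds in one place, which makes it easier to rerun the pipeline with different cutoffs and compare the results.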
Best Practices
1. Know your data: Check the missing rate distribution before choosing thresholds
2. Document your filters: Keep track of how many variants are removed at each step
3. Consider downstream analysis: GWAS may tolerate more missing data than phylogenetics
4. Use appropriate thresholds: a per-site maximum of 5-10% missing genotypes is common in population genetics; clinical applications are usually stricter
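For the first point, one way to inspect the per-site missing-rate distribution is sketched below. It assumes the fill-tags plugin is installed and supports the F_MISSING tag; missing_histogram is a hypothetical helper, and the bin width of 0.1 is an arbitrary choice.

```bash
# missing_histogram FILE  (hypothetical helper)
# Prints "bin_start<TAB>site_count" for per-site missing rates in 0.1-wide bins.
missing_histogram() {
  bcftools +fill-tags "$1" -- -t F_MISSING |
    bcftools query -f '%INFO/F_MISSING\n' - |
    awk '
      $1 == "." { next }                    # tag absent for this record
      { bin[int($1 * 10 + 1e-9)]++ }        # [0.0,0.1) -> 0, [0.1,0.2) -> 1, ...
      END { for (b in bin) printf "%.1f\t%d\n", b / 10, bin[b] }
    ' | sort -n
}

# Example call (file name is a placeholder):
#   missing_histogram input.vcf.gz
```

If most sites fall in the lowest bin, a strict threshold costs few variants; a long tail toward 1.0 suggests checking individual samples or regions before choosing a cutoff.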
Quick Reference
| Expression | Description |
|---|---|
| F_MISSING | Fraction of samples with missing genotypes (0-1) |
| N_MISSING | Number of samples with missing genotypes |
| F_PASS(expr) | Fraction of samples for which the expression in parentheses is true, e.g. F_PASS(FMT/GQ>=20) |
| N_PASS(expr) | Number of samples for which the expression in parentheses is true |
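These expressions can be used directly with -i or -e. The sketch below shows two hypothetical helpers; the thresholds are placeholders, and the F_PASS form assumes a bcftools release recent enough to support it.

```bash
# keep_max_missing FILE N OUT  (hypothetical helper)
# Keep sites with at most N samples that have a missing genotype.
keep_max_missing() {
  bcftools view -i "N_MISSING<=$2" -Oz -o "$3" "$1"
}

# keep_well_genotyped FILE MIN_FRACTION OUT  (hypothetical helper)
# Keep sites where at least MIN_FRACTION of samples have a called
# genotype with GQ >= 20.
keep_well_genotyped() {
  bcftools view -i "F_PASS(GT!=\"mis\" & FMT/GQ>=20)>=$2" -Oz -o "$3" "$1"
}

# Example calls (file names are placeholders):
#   keep_max_missing input.vcf.gz 2 max2missing.vcf.gz
#   keep_well_genotyped input.vcf.gz 0.9 well_genotyped.vcf.gz
```

A count-based cutoff such as N_MISSING is convenient for small cohorts, while a fraction-based cutoff scales with the number of samples.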
Conclusion
Properly handling missing genotypes is crucial for variant analysis quality. bcftools provides flexible and efficient tools for filtering based on missing data rates. Start with exploratory analysis to understand your data's missing patterns, then apply appropriate thresholds based on your specific analysis requirements.
For more complex filtering needs, consider combining bcftools with other tools in your pipeline, and always validate your results by checking variant counts before and after filtering.
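Checking counts before and after filtering can be as simple as the sketch below. Both helper names are hypothetical, and the file names in the usage comment are placeholders.

```bash
# count_records FILE  (hypothetical helper)
# Number of variant records in FILE, header lines excluded.
count_records() {
  bcftools view -H "$1" | wc -l | tr -d ' '
}

# report_step LABEL FILE  (hypothetical helper)
# Prints one tab-separated line per filtering step: label, record count.
report_step() {
  printf '%s\t%s\n' "$1" "$(count_records "$2")"
}

# Example calls:
#   report_step raw      input.vcf.gz
#   report_step filtered filtered.vcf.gz
```

Appending these lines to a log file after each step gives a simple record of how many variants every filter removed.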