Bioinformatics Tutorial

Variant Calling (GATK/BCFtools): From Alignments to VCF

Variant calling converts aligned reads into hypotheses about where a sample differs from the reference. The hard part is not producing a VCF; it is distinguishing true variants from artifacts (mapping bias, PCR errors, strand bias, low depth, contamination).

Typical DNA variant pipeline (high level)
  1. QC & trim
  2. Align to reference
  3. Mark duplicates (if applicable)
  4. Call variants
  5. Filter (hard filters or VQSR)
  6. Annotate + interpret

Example commands (illustrative)

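# Basic calling with bcftools (example)
# Genotype likelihoods come from bcftools mpileup; plain samtools mpileup
# writes text pileup, which bcftools call cannot read.
bcftools mpileup -Ou -f reference.fa sample.bam \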
  | bcftools call -mv -Oz -o sample.vcf.gz
bcftools index sample.vcf.gz

# Basic filtering idea (thresholds depend on experiment!)
bcftools filter -e 'DP<10 || QUAL<30' sample.vcf.gz -Oz -o sample.filtered.vcf.gz
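To make the exclusion expression concrete, here is a minimal Python sketch of the same predicate applied to one VCF data line. The function name `fails_hard_filter` and its parameters are hypothetical, and the handling of missing values (a missing QUAL or DP never triggers exclusion) is an assumption of this sketch rather than a statement about bcftools internals.

```python
def fails_hard_filter(vcf_line, min_dp=10, min_qual=30.0):
    """Return True if the record would be excluded by 'DP<10 || QUAL<30'.

    Assumes a tab-separated VCF data line with the standard fixed columns
    (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) and site-level INFO/DP.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    qual_text, info_text = fields[5], fields[7]
    # INFO is a semicolon-separated list of KEY=VALUE pairs; flags have no "=".
    info = dict(item.split("=", 1) for item in info_text.split(";") if "=" in item)
    # Assumption: missing values ("." or an absent DP tag) do not fail the filter.
    low_qual = qual_text != "." and float(qual_text) < min_qual
    low_depth = "DP" in info and int(info["DP"]) < min_dp
    return low_qual or low_depth


if __name__ == "__main__":
    record = "chr1\t300\t.\tC\tT\t60\t.\tDP=4"
    print(fails_hard_filter(record))
```

Keeping the thresholds as parameters reflects the point above: the cutoffs are experiment-specific, not universal.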

How to read a VCF genotype

Genotypes in diploid samples are commonly:

  • 0/0 homozygous reference
  • 0/1 heterozygous
  • 1/1 homozygous alternate
  • ./. missing
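The list above can be sketched as a small Python classifier. The function name `classify_genotype` is hypothetical, and the sketch assumes diploid calls; it treats any genotype with a missing allele as missing and any pair of different alleles (including two different ALT alleles such as 1/2) as heterozygous.

```python
import re


def classify_genotype(gt):
    """Classify a diploid VCF GT string such as '0/1' or '1|1'."""
    # "/" separates unphased alleles and "|" separates phased alleles.
    alleles = re.split(r"[/|]", gt)
    if any(allele == "." for allele in alleles):
        return "missing"
    if len(set(alleles)) > 1:
        return "heterozygous"
    return "homozygous reference" if alleles[0] == "0" else "homozygous alternate"


if __name__ == "__main__":
    for gt in ("0/0", "0|1", "1/1", "./."):
        print(gt, classify_genotype(gt))
```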

Key fields

Field  Meaning
DP     Total depth
AD     Allelic depths (REF, ALT…)
GQ     Genotype quality
AF     Allele fraction (caller-defined)
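These per-sample fields are stored as a colon-separated string whose order is given by the FORMAT column. The sketch below, with a hypothetical helper named `parse_sample`, pairs the two columns and converts AD, DP, and GQ to numbers. It assumes the keys use these exact names and that DP and GQ are integers; other keys are left as strings.

```python
def parse_sample(format_field, sample_field):
    """Pair FORMAT keys (e.g. 'GT:AD:DP:GQ') with one sample's values."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    # Trailing fields may be omitted in a sample column; zip stops at the
    # shorter list, so omitted keys are simply absent from the result.
    call = dict(zip(keys, values))
    if "AD" in call:
        # AD holds one depth per allele, REF first, then each ALT.
        call["AD"] = None if call["AD"] == "." else [
            None if depth == "." else int(depth) for depth in call["AD"].split(",")
        ]
    for key in ("DP", "GQ"):
        if key in call:
            call[key] = None if call[key] == "." else int(call[key])
    return call


if __name__ == "__main__":
    print(parse_sample("GT:AD:DP:GQ", "0/1:12,9:21:99"))
```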

Ti/Tv ratio across samples (example)

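Ti/Tv (the ratio of transitions to transversions) is a quick plausibility metric, especially for exomes. For human germline calls, values of roughly 2.0 to 2.1 genome-wide and around 3.0 in exomes are typical; values far from these can indicate calling issues or contamination.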
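As a rough illustration of how the metric is computed, the following Python sketch counts transitions (A<->G, C<->T) and transversions (all other single-base changes) over a list of (REF, ALT) pairs. The function names are hypothetical, and the sketch assumes multi-allelic sites have already been split; indels and non-ACGT alleles are skipped.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """A transition stays within purines or within pyrimidines."""
    return {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES


def titv_ratio(snvs):
    """Compute Ti/Tv from an iterable of (ref, alt) allele pairs."""
    transitions = transversions = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        # Keep only single-base substitutions between standard bases.
        if len(ref) != 1 or len(alt) != 1 or ref == alt:
            continue
        if ref not in "ACGT" or alt not in "ACGT":
            continue
        if is_transition(ref, alt):
            transitions += 1
        else:
            transversions += 1
    # With no transversions the ratio is undefined.
    return transitions / transversions if transversions else math.nan


if __name__ == "__main__":
    calls = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("AT", "A")]
    print(titv_ratio(calls))
```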

Variant allele fraction histogram (example)

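The shape of the VAF distribution depends on whether the experiment is germline or somatic, and on tumor purity, copy number, and the filters applied. In a diploid germline sample, heterozygous calls cluster near 0.5 and homozygous alternate calls near 1.0.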
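A minimal sketch of how such a histogram could be built from allelic depths: VAF is taken here as ALT depth divided by REF plus ALT depth for a biallelic site, which is one common definition and an assumption of this example (callers may define AF differently). The function names and the bin count are hypothetical.

```python
def allele_fraction(ref_depth, alt_depth):
    """Return ALT / (REF + ALT), or None when no reads cover the site."""
    total = ref_depth + alt_depth
    return alt_depth / total if total else None


def vaf_histogram(vafs, n_bins=10):
    """Count VAF values in n_bins equal-width bins covering [0, 1]."""
    counts = [0] * n_bins
    for vaf in vafs:
        if vaf is None:
            continue
        # A VAF of exactly 1.0 belongs to the last bin.
        index = min(int(vaf * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


if __name__ == "__main__":
    depths = [(12, 12), (11, 9), (0, 20), (0, 0)]  # (REF, ALT) read counts
    vafs = [allele_fraction(ref, alt) for ref, alt in depths]
    print(vaf_histogram(vafs, n_bins=4))
```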

Common artifact checklist

Technical
  • Low complexity regions, repeats, segmental duplications
  • Strand bias (ALT supported mostly on one strand)
  • Position bias (ALT mostly at read ends)
  • Read-mapping ambiguity (low MAPQ)
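Strand bias from the list above is often assessed with Fisher's exact test on a 2x2 table of read counts (REF/ALT by forward/reverse strand). The sketch below is a standard-library implementation of the two-sided test, offered as an illustration rather than a reproduction of any caller's annotation; the function name and the example counts are hypothetical. It requires Python 3.8 or later for `math.comb`.

```python
from math import comb


def strand_bias_p_value(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Two-sided Fisher's exact test p-value for a 2x2 strand table."""
    ref_total = ref_fwd + ref_rev
    alt_total = alt_fwd + alt_rev
    fwd_total = ref_fwd + alt_fwd
    n = ref_total + alt_total
    if n == 0:
        return 1.0
    denominator = comb(n, fwd_total)

    def table_probability(x):
        # Hypergeometric probability that x of the forward reads are REF.
        return comb(ref_total, x) * comb(alt_total, fwd_total - x) / denominator

    observed = table_probability(ref_fwd)
    lowest = max(0, fwd_total - alt_total)
    highest = min(ref_total, fwd_total)
    # Sum all tables at least as unlikely as the observed one; the small
    # tolerance guards against floating-point ties.
    p_value = sum(
        p
        for p in (table_probability(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-7)
    )
    return min(1.0, p_value)


if __name__ == "__main__":
    # ALT reads seen only on the forward strand: a small p-value flags bias.
    print(strand_bias_p_value(ref_fwd=10, ref_rev=10, alt_fwd=10, alt_rev=0))
```

A small p-value only says the strand distribution differs between REF and ALT reads; whether that warrants filtering depends on depth and library design.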

Biological / design
  • Sample contamination / swaps
  • Unexpected ploidy / sex chromosomes
  • Tumor purity and subclonality (somatic)
  • Batch effects in capture/coverage