Bioinformatics Tutorial

Alignment (BWA/Bowtie2) and What the Numbers Mean

Alignment places reads onto a reference genome or transcriptome. Treat each alignment as a probabilistic statement about where a read originated, especially in repeats and low-complexity regions, where several placements can score equally well.

Key concepts
  • Primary vs secondary alignments for multi-mapping reads
  • Soft clipping indicates partial matches (adapters, SVs, errors)
  • MAPQ reflects placement ambiguity, not base-call quality
  • Proper pair depends on orientation and insert size expectations
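
The flag and MAPQ concepts above can be made concrete with a minimal shell sketch. It assumes the FLAG bit values and the MAPQ definition (MAPQ = -10 * log10 of the probability that the reported position is wrong) from the SAM specification. The function names describe_flag and mapq_to_perr are hypothetical helpers, not part of samtools or BWA. Aligners scale and cap MAPQ differently, so the converted probability is best read as a rough guide rather than a calibrated value.

```shell
# Hypothetical helpers for illustration; not part of samtools or BWA.

# Print the names of selected SAM FLAG bits that are set in an integer flag.
# Bit values follow the SAM specification (decimal here; hex in comments).
describe_flag() {
  flag=$1
  out=""
  # 0x1 paired, 0x2 proper pair, 0x4 unmapped,
  # 0x100 secondary, 0x400 duplicate, 0x800 supplementary
  for pair in 1:paired 2:proper_pair 4:unmapped \
              256:secondary 1024:duplicate 2048:supplementary; do
    bit=${pair%%:*}
    name=${pair#*:}
    if [ $(( flag & bit )) -ne 0 ]; then
      out="$out $name"
    fi
  done
  echo "${out# }"
}

# Convert a MAPQ value to the implied probability that the placement is wrong,
# assuming MAPQ = -10 * log10(P_wrong).
mapq_to_perr() {
  awk -v q="$1" 'BEGIN { printf "%.4g\n", 10 ^ (-q / 10) }'
}

describe_flag 99     # paired proper_pair
mapq_to_perr 20      # 0.01
```

Only a subset of FLAG bits is decoded here; flag 99 also carries the mate-reverse and first-in-pair bits, which this sketch does not report.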

Typical commands

# Index reference (BWA)
bwa index reference.fa

# Align paired-end reads
bwa mem -t 8 reference.fa trimmed_R1.fastq.gz trimmed_R2.fastq.gz \
  | samtools sort -@ 4 -o sample.bam

# Quick stats
samtools flagstat sample.bam
# Summary numbers only (the SN section of the samtools stats report)
samtools stats sample.bam | grep ^SN | cut -f 2-
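
The flagstat report can be reduced to a couple of headline percentages. The sketch below assumes the usual flagstat text layout, where each line reads "N + M label" (QC-passed count, plus sign, QC-failed count, then the label); flagstat_summary is a hypothetical helper name. The duplicate count is only meaningful if duplicates were marked upstream (for example with samtools markdup), which the commands above do not do.

```shell
# Hypothetical helper: summarise `samtools flagstat` text read from stdin.
# Assumes lines of the form "<passed> + <failed> <label ...>".
flagstat_summary() {
  awk '
    $4 == "in" && $5 == "total" { total = $1 }   # all records, incl. secondary/supplementary
    $4 == "mapped"              { mapped = $1 }  # the plain "mapped" line, not "primary mapped"
    $4 == "duplicates"          { dup = $1 }     # zero unless duplicates were marked
    END {
      if (total > 0)
        printf "mapped_pct=%.2f dup_pct=%.2f\n", 100 * mapped / total, 100 * dup / total
      else
        print "mapped_pct=NA dup_pct=NA"
    }'
}

# Usage (requires samtools and an existing BAM):
#   samtools flagstat sample.bam | flagstat_summary
```

Percentages here use QC-passed counts over all records, so secondary and supplementary alignments are included in both numerator and denominator.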

Sanity checks

Metric               Interpretation
% mapped             Low mapping may indicate contamination, wrong reference, or low quality
% duplicates         High duplicates suggest low library complexity or over-amplification
Insert size          Unexpected distribution can signal library prep issues
Coverage uniformity  Bias suggests GC/capture effects or mapping problems
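
For the insert size row, a quick summary can be derived from the TLEN column (field 9) of SAM records. This is a minimal sketch, assuming tab-separated SAM text on stdin; insert_size_summary is a hypothetical helper name. Only positive TLEN values are kept so that each pair is counted once, and restricting to proper pairs is left to samtools view -f 0x2 upstream.

```shell
# Hypothetical helper: summarise template lengths (SAM field 9) from stdin.
# Header lines (starting with @) are skipped; only TLEN > 0 is used so that
# each pair contributes one value.
insert_size_summary() {
  awk -F '\t' '!/^@/ && $9 > 0 { print $9 }' \
    | sort -n \
    | awk '
        { v[NR] = $1; sum += $1 }
        END {
          if (NR == 0) { print "n=0"; exit }
          mid = int((NR + 1) / 2)
          median = (NR % 2) ? v[mid] : (v[mid] + v[mid + 1]) / 2
          printf "n=%d mean=%.1f median=%.1f\n", NR, sum / NR, median
        }'
}

# Usage (requires samtools and an existing BAM):
#   samtools view -f 0x2 sample.bam | insert_size_summary
```

A median far from the mean hints at a skewed or multi-modal distribution, which is worth inspecting in the full histogram from samtools stats.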

  • Healthy: High mapping, expected insert sizes, stable coverage.
  • Investigate: Low MAPQ reads pile up in repeats; filter carefully.
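
Before filtering on MAPQ, it can help to see how many alignments a given threshold would remove. The sketch below counts records whose MAPQ (SAM field 5) falls below a cutoff. The helper name low_mapq_fraction and the default cutoff of 20 are illustrative assumptions, not recommendations; a sensible threshold depends on the aligner and the analysis.

```shell
# Hypothetical helper: report the fraction of SAM records on stdin whose
# MAPQ (field 5) is below a threshold. The default of 20 is illustrative only.
low_mapq_fraction() {
  awk -F '\t' -v min="${1:-20}" '
    /^@/ { next }                          # skip header lines
    { total++; if ($5 < min) low++ }
    END {
      if (total == 0) { print "total=0"; exit }
      printf "total=%d low=%d fraction=%.3f\n", total, low, low / total
    }'
}

# Usage (requires samtools and an existing BAM); -F 0x4 drops unmapped reads:
#   samtools view -F 0x4 sample.bam | low_mapq_fraction 20
# The filter itself can then be applied with samtools directly:
#   samtools view -b -q 20 -o sample.q20.bam sample.bam
```

If a large fraction would be dropped, check where those reads fall before discarding them, since removing them also removes all coverage in repetitive regions.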

[Figure: Mapping summary (example)]
[Figure: Insert size distribution (example)]

A note on reference choice

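Always align to the correct genome build and use the matching annotation version. For RNA-seq, either align to the genome with a splice-aware aligner (STAR or HISAT2) or use pseudoalignment against a transcriptome (Salmon or kallisto). Mixing builds, for example reads aligned to one build and annotation taken from another, makes coordinates incomparable and invalidates downstream interpretation.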
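
One cheap guard against mixed builds is to compare the contig names and lengths recorded in a BAM header with those of the reference you intend to use. The sketch below assumes a SAM header saved to a file (from samtools view -H) and a FASTA index in .fai format (from samtools faidx); check_contigs and the file names are hypothetical. Matching names and lengths do not prove the sequences are identical, so treat a pass as necessary rather than sufficient.

```shell
# Hypothetical helper: compare @SQ lines in a SAM header with a .fai index.
#   $1: SAM header text file   $2: FASTA index (.fai: name, length, ...)
# Prints one line per problem, or "OK"; exits non-zero if any problem is found.
check_contigs() {
  awk -F '\t' '
    BEGIN { bad = 0 }
    NR == FNR { len[$1] = $2; next }       # first file: the .fai index
    $1 == "@SQ" {
      name = ""; ln = ""
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^SN:/) name = substr($i, 4)
        if ($i ~ /^LN:/) ln = substr($i, 4)
      }
      if (!(name in len)) {
        print "MISSING " name; bad = 1
      } else if ((len[name] + 0) != (ln + 0)) {
        print "LENGTH_MISMATCH " name " bam=" ln " ref=" len[name]; bad = 1
      }
    }
    END { if (!bad) print "OK"; exit bad }
  ' "$2" "$1"
}

# Usage (requires samtools; file names are placeholders):
#   samtools faidx reference.fa
#   samtools view -H sample.bam > header.sam
#   check_contigs header.sam reference.fa.fai
```

The check only looks at contigs present in the BAM header, so extra contigs in the reference index are not reported.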