Read QC & Trimming
The goal of QC is not to make plots; it is to decide whether the data is usable and what preprocessing it needs. Good QC catches sample swaps, adapter contamination, low-complexity reads, and systematic quality decay.
- Total reads and yield per sample/lane
- Per-base quality profiles
- Adapter/primer contamination signatures
- Per-read GC distribution and outliers
- Overrepresented sequences (possible contamination)
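As a rough illustration of what the first and fourth metrics above mean, the sketch below computes read count, yield in bases, and per-read GC fraction from FASTQ records. The function names (`parse_fastq`, `gc_fraction`, `basic_metrics`) and the three-read in-memory example are hypothetical; it assumes plain four-line FASTQ records and is not a substitute for a real QC tool.

```python
import io
from statistics import mean


def parse_fastq(handle):
    """Yield (name, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of stream (or blank line)
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual


def gc_fraction(seq):
    """Fraction of G/C bases in one read (0.0 for an empty read)."""
    if not seq:
        return 0.0
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s)


def basic_metrics(records):
    """Summarise read count, total yield, and per-read GC fractions."""
    n_reads = 0
    n_bases = 0
    gc_per_read = []
    for _name, seq, _qual in records:
        n_reads += 1
        n_bases += len(seq)
        gc_per_read.append(gc_fraction(seq))
    return {
        "reads": n_reads,
        "yield_bases": n_bases,
        "gc_per_read": gc_per_read,
        "mean_gc": mean(gc_per_read) if gc_per_read else 0.0,
    }


# Hypothetical three-read example held in memory.
example = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nGGCC\n+\nIIII\n"
    "@r3\nATAT\n+\nIIII\n"
)
metrics = basic_metrics(parse_fastq(example))
print(metrics["reads"], metrics["yield_bases"], metrics["mean_gc"])
```

For a compressed file, the same functions should work on a handle opened with `gzip.open(path, "rt")`.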
Typical tools
- `fastqc` / `multiqc` for reporting
- `cutadapt`, `fastp`, `trimmomatic` for trimming
Example commands
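# QC: per-file FastQC reports (4 threads), then one aggregated MultiQC report
fastqc -t 4 sample_R1.fastq.gz sample_R2.fastq.gz
multiqc .
# Adapter trimming (illustrative): replace ADAPTER_FWD / ADAPTER_REV with your kit's adapter sequences
# -a / -A: 3' adapter for R1 / R2; -q 20,20: quality-trim the 5' and 3' ends at Q20
# --minimum-length 50: discard pairs in which a read ends up shorter than 50 bp
cutadapt -j 4 -a ADAPTER_FWD -A ADAPTER_REV \
  -q 20,20 --minimum-length 50 \
  -o trimmed_R1.fastq.gz -p trimmed_R2.fastq.gz \
  sample_R1.fastq.gz sample_R2.fastq.gz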
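The `-q` option in the cutadapt command trims low-quality read ends. As a rough sketch of how 3' quality trimming of this kind can work, the function below follows the partial-sum rule described in the cutadapt documentation: subtract the cutoff from each quality, sum from the 3' end, and cut where the running sum is lowest. The function name `quality_trim_3prime` is hypothetical, and the real implementation may differ in details.

```python
def quality_trim_3prime(qualities, cutoff):
    """Return how many bases to keep from the 5' end.

    qualities: list of integer Phred scores for one read.
    cutoff:    quality threshold (e.g. 20).
    """
    running = 0            # partial sum of (quality - cutoff) from the 3' end
    best = 0               # lowest partial sum seen so far
    keep = len(qualities)  # default: keep the whole read
    for i in range(len(qualities) - 1, -1, -1):
        running += qualities[i] - cutoff
        if running < best:
            best = running
            keep = i       # cut so that bases i.. are removed
    return keep


quals = [42, 40, 26, 27, 8, 7, 11, 4, 2, 3]
keep = quality_trim_3prime(quals, cutoff=10)
print(keep)          # 4
print(quals[:keep])  # [42, 40, 26, 27]
```

Note that the isolated Q11 base is still removed: one base just above the cutoff does not outweigh the low-quality bases around it. This is why the method is less jumpy than cutting at the first base below the threshold.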
Phred scores express error probabilities on a logarithmic scale:
Q = -10 * log10(p_error)
Going from Q20 to Q30 is therefore a 10× reduction in error probability, from 1 in 100 to 1 in 1,000. But trimming too aggressively can remove real signal, especially for low-input RNA-seq or ancient DNA.
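The conversion can be checked with a few lines of Python. This is a minimal sketch: the helper names are hypothetical, and the quality-string decoding assumes the Phred+33 ASCII encoding used by current Illumina FASTQ files.

```python
import math


def phred_to_error_prob(q):
    """Invert Q = -10 * log10(p_error)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p_error):
    """Q = -10 * log10(p_error)."""
    return -10 * math.log10(p_error)


def decode_quality_string(qual, offset=33):
    """Convert FASTQ quality characters to integer Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in qual]


def expected_errors(qual, offset=33):
    """Sum of per-base error probabilities for one read."""
    return sum(phred_to_error_prob(q) for q in decode_quality_string(qual, offset))


for q in (10, 20, 30, 40):
    print(f"Q{q}: p_error = {phred_to_error_prob(q):.4f}")

print(decode_quality_string("5?I"))  # [20, 30, 40]
```

Summing error probabilities per read, as `expected_errors` does, is one way to judge a read as a whole instead of reacting to individual low-quality bases.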
Trim adapters confidently; trim qualities conservatively unless you have a reason.
If most reads require heavy trimming, check library prep and run quality.
Quality often decays toward the end of reads. The exact shape depends on platform and run conditions.
A rising adapter fraction at later cycles suggests short inserts relative to read length.
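Both per-cycle views can be approximated directly from reads. The sketch below is illustrative only: the function names are hypothetical, adapter detection is exact string matching against a toy 4-base adapter (`AGAT`), and qualities are assumed to be Phred+33. Once an adapter is found in a read, it is counted as present from that cycle to the end of the read, so the curve is cumulative.

```python
def mean_quality_per_cycle(quality_strings, offset=33):
    """Mean Phred score at each cycle (read position); assumes Phred+33."""
    totals, counts = [], []
    for qual in quality_strings:
        for i, ch in enumerate(qual):
            if i == len(totals):  # first read that reaches this cycle
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


def adapter_fraction_per_cycle(sequences, adapter):
    """Cumulative fraction of reads in which the adapter has started by each cycle."""
    started, covered = [], []
    for seq in sequences:
        while len(started) < len(seq):  # grow arrays to the longest read seen
            started.append(0)
            covered.append(0)
        pos = seq.find(adapter)  # exact match only; -1 if absent
        for i in range(len(seq)):
            covered[i] += 1
            if pos != -1 and i >= pos:
                started[i] += 1
    return [s / c for s, c in zip(started, covered)]


# Toy data: two of four reads run into the adapter, at different cycles.
reads = ["ACGTACGT", "ACGTAGAT", "ACAGATCG", "TTTTTTTT"]
fractions = adapter_fraction_per_cycle(reads, adapter="AGAT")
print(fractions)  # [0.0, 0.0, 0.25, 0.25, 0.5, 0.5, 0.5, 0.5]

print(mean_quality_per_cycle(["II55", "II?5"]))  # [40.0, 40.0, 25.0, 20.0]
```

In the toy data the adapter fraction climbs from 0 to 0.5 across the read, which is the pattern expected when some inserts are shorter than the read length.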
| Observation | Likely cause | Action |
|---|---|---|
| Sharp quality drop after cycle N | End-of-read decay | Consider trimming the last cycles or applying a quality cutoff |
| Adapter peaks | Short inserts | Trim adapters; evaluate minimum length |
| Unexpected GC peak | Contamination or biased capture | Screen contamination; check sample metadata |
| Many identical reads | PCR duplicates / low complexity | Consider deduplication; assess library complexity |
1. The "Sawtooth" Pattern
If per-base quality oscillates every few bases, it suggests a mechanical or optical failure during the run. This data is often unreliable for variant calling.
2. >90% Duplication Rate
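If you have 50M reads but only 5M unique sequences, 90% of your reads are duplicates: most of the run re-sequenced the same molecules instead of adding new information. This often means the input DNA amount was too low or the library was over-amplified by PCR.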
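The arithmetic behind that figure is simple enough to sketch. The helper names below are hypothetical, and counting duplicates by exact sequence identity is a simplification; alignment-based tools define duplicates differently.

```python
from collections import Counter


def duplication_rate(total_reads, unique_sequences):
    """Fraction of reads that are copies of an already-seen sequence."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 1 - unique_sequences / total_reads


def duplication_rate_from_reads(sequences):
    """Same rate, computed from the reads themselves by exact sequence match."""
    counts = Counter(sequences)
    return duplication_rate(sum(counts.values()), len(counts))


rate = duplication_rate(50_000_000, 5_000_000)
print(f"{rate:.0%}")  # 90%

toy_rate = duplication_rate_from_reads(["ACGT", "ACGT", "ACGT", "TTTT"])
print(f"{toy_rate:.0%}")  # 50%
```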