Read QC & Trimming

The goal of QC is not to make plots; it is to decide whether the data are usable and what preprocessing they need. Good QC catches sample swaps, adapter contamination, low-complexity reads, and systematic quality decay.

What to look at first

Total reads and yield per sample/lane
Per-base quality profiles
Adapter/primer contamination signatures
Per-read GC distribution and outliers
Overrepresented sequences (possible contamination)
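
Read count and yield, the first item above, can be tallied straight from a FASTQ file before any report exists. The sketch below assumes 4-line FASTQ records (no wrapped sequence lines) on standard input. fastq_yield is an illustrative helper name, not part of any tool.

```shell
# Print "<reads><TAB><bases>" for a FASTQ stream on stdin.
# Assumption: every record is exactly 4 lines (header, sequence, "+", quality).
# fastq_yield is a hypothetical helper name used only for this example.
fastq_yield() {
  awk 'NR % 4 == 2 { reads++; bases += length($0) }
       END { printf "%.0f\t%.0f\n", reads, bases }'   # %.0f avoids 32-bit %d limits in some awks
}

# Usage (file name as in the example commands in this section):
# gzip -cd sample_R1.fastq.gz | fastq_yield
```

Comparing these two numbers across samples and lanes is often enough to spot a failed or badly underloaded library.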

Typical tools

fastqc / multiqc for reporting
cutadapt, fastp, trimmomatic for trimming

Example commands

# QC
fastqc -t 4 sample_R1.fastq.gz sample_R2.fastq.gz
multiqc .

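# Adapter trimming (illustrative)
# ADAPTER_FWD / ADAPTER_REV are placeholders for your kit's 3' adapter sequences.
# -q 20,20 quality-trims both the 5' and 3' ends at Q20; --minimum-length 50
# drops a pair if either read is shorter than 50 nt after trimming.
cutadapt -j 4 -a ADAPTER_FWD -A ADAPTER_REV \
  -q 20,20 --minimum-length 50 \
  -o trimmed_R1.fastq.gz -p trimmed_R2.fastq.gz \
  sample_R1.fastq.gz sample_R2.fastq.gz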

How to think about Phred scores

Phred scores express base-call error probabilities on a logarithmic scale:

Q = -10 * log10(p_error)

This means going from Q20 to Q30 is a 10× reduction in error probability, from 1 error in 100 bases to 1 in 1,000. But trimming too aggressively can remove real signal, especially for low-input RNA-seq or ancient DNA.
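
To make the scale concrete, the formula can be inverted (p_error = 10^(-Q/10)) and evaluated for a few scores. This is a minimal sketch. The function names phred_to_perror and expected_errors are illustrative, and the 150-base read length is an assumed example value.

```shell
# Convert a Phred score to an error probability: p_error = 10^(-Q/10).
phred_to_perror() {
  awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

# Expected number of miscalled bases in a read of length $2
# if every base had the same quality $1 (a simplifying assumption).
expected_errors() {
  awk -v q="$1" -v len="$2" 'BEGIN { printf "%g\n", len * 10 ^ (-q / 10) }'
}

phred_to_perror 20      # prints 0.01
phred_to_perror 30      # prints 0.001
expected_errors 20 150  # prints 1.5
expected_errors 30 150  # prints 0.15
```

At a uniform Q20, a 150-base read carries about 1.5 expected errors. At Q30 it carries about 0.15, which is why a modest quality cutoff is often sufficient.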

Rule of thumb

Trim adapters confidently; trim qualities conservatively unless you have a reason.

Watch out

If most reads require heavy trimming, check library prep and run quality.

Per-base quality summary (example)

Quality often decays toward the end of reads. The exact shape depends on platform and run conditions.
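
A per-cycle quality profile like the one described can be approximated without a plotting tool. The sketch below assumes 4-line FASTQ records and Phred+33 encoding, which is common for current Illumina output but should be verified for your data. per_cycle_mean_quality is an illustrative name.

```shell
# Print "<cycle><TAB><mean Phred quality>" for a FASTQ stream on stdin.
# Assumptions: 4-line records, Phred+33 quality encoding.
per_cycle_mean_quality() {
  awk '
    # Build a character -> ASCII code lookup table for printable characters.
    BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
    NR % 4 == 0 {
      n = length($0)
      if (n > maxlen) maxlen = n
      for (c = 1; c <= n; c++) {
        sum[c] += ord[substr($0, c, 1)] - 33
        cnt[c]++
      }
    }
    END {
      for (c = 1; c <= maxlen; c++)
        printf "%d\t%.2f\n", c, sum[c] / cnt[c]
    }'
}

# Usage:
# gzip -cd sample_R1.fastq.gz | per_cycle_mean_quality
```

The mean hides the spread that a FastQC box plot shows, so treat this as a quick look and not a replacement for the full report.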

Adapter signal across cycles (example)

A rising adapter fraction at later cycles suggests short inserts relative to read length.
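
The adapter signal can be estimated roughly by recording where an adapter prefix first appears in each read. The sketch below uses exact string matching only, so it undercounts compared with real trimmers, which tolerate mismatches and partial matches at the read end. adapter_fraction_by_cycle is an illustrative name, and the adapter prefix passed in is a placeholder you must supply.

```shell
# Print "<cycle><TAB><fraction of reads in which the adapter has started by that cycle>".
# $1 = adapter prefix to search for (exact match; a short prefix of about 12 nt is an assumption).
# Input: 4-line FASTQ records on stdin.
adapter_fraction_by_cycle() {
  awk -v adapter="$1" '
    NR % 4 == 2 {
      reads++
      n = length($0)
      if (n > maxlen) maxlen = n
      pos = index($0, adapter)      # 1-based start of first exact match, 0 if absent
      if (pos > 0) start[pos]++
    }
    END {
      for (c = 1; c <= maxlen; c++) {
        cum += start[c]
        printf "%d\t%.3f\n", c, cum / reads
      }
    }'
}

# Usage (ADAPTER_FWD is the same placeholder used in the cutadapt example):
# gzip -cd sample_R1.fastq.gz | adapter_fraction_by_cycle ADAPTER_FWD
```

A curve that stays near zero and then climbs over the last cycles is the short-insert pattern described above.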

Decision matrix

Observation	Likely cause	Action
Sharp quality drop after cycle N	End-of-read decay	Consider trimming the last cycles or using a quality cutoff
Adapter peaks	Short inserts	Trim adapters; evaluate minimum length
Unexpected GC peak	Contamination or biased capture	Screen contamination; check sample metadata
Many identical reads	PCR duplicates / low complexity	Consider deduplication; assess library complexity
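
For the "Unexpected GC peak" row, a coarse per-read GC histogram is often enough to see a secondary peak. This is a minimal sketch assuming 4-line FASTQ records. gc_histogram and the 10% bin width are illustrative choices.

```shell
# Print a 10-bin histogram of per-read GC content for a FASTQ stream on stdin.
# Output lines look like "40-50%<TAB><read count>".
gc_histogram() {
  awk '
    NR % 4 == 2 {
      n = length($0)
      if (n == 0) next
      gc = gsub(/[GCgc]/, "")          # gsub returns how many G/C bases it replaced
      bin = int(10 * gc / n)
      if (bin > 9) bin = 9              # reads with 100% GC go in the top bin
      hist[bin]++
    }
    END {
      for (b = 0; b <= 9; b++)
        printf "%d-%d%%\t%d\n", b * 10, b * 10 + 10, hist[b]
    }'
}

# Usage:
# gzip -cd sample_R1.fastq.gz | gc_histogram
```

A single broad peak near the expected GC content of the organism is the usual picture. A second peak elsewhere is a reason to screen for contamination.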

QC Nightmare Scenarios (When to re-sequence)

1. The "Sawtooth" Pattern

A per-base quality profile that oscillates every few cycles suggests a mechanical or optical failure during the run. Such data are often unreliable for variant calling.

2. >90% Duplication Rate

If you have 50M reads but only 5M unique sequences, 90% of your reads are duplicates: you mostly re-sequenced the same molecules instead of sampling new ones. This often means the input DNA was too low or PCR was over-cycled.
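
The arithmetic behind that example is duplication = 1 - unique/total = 1 - 5M/50M = 90%. The sketch below computes the same ratio from exact sequence identity in a single FASTQ file. It will not match tool-reported estimates or alignment-based duplicate marking exactly, and it keeps every distinct sequence in memory, so subsample large files first. duplication_rate is an illustrative name.

```shell
# Report total reads, unique sequences, and duplication rate for FASTQ on stdin.
# Assumptions: 4-line records; duplicates are defined as byte-identical sequences.
duplication_rate() {
  awk '
    NR % 4 == 2 { total++; if (!seen[$0]++) unique++ }
    END {
      if (total == 0) { print "no reads"; exit 1 }
      printf "total=%d unique=%d duplication=%.1f%%\n", total, unique, 100 * (1 - unique / total)
    }'
}

# Usage:
# gzip -cd sample_R1.fastq.gz | duplication_rate
```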