Bioinformatics Tutorial

Read QC & Trimming

QC is not a beauty contest for plots. The real question is whether your data are trustworthy enough for the biological task ahead, and if not, which intervention is justified. Good QC combines plot reading, sample context, and a bias toward minimal but effective preprocessing.

What to inspect first
  • Total yield and read count per sample or lane
  • Per-base quality decay across cycles
  • Adapter or primer contamination signatures
  • GC distribution and unusual secondary peaks
  • Overrepresented sequences and duplication patterns

Common tools

  • fastqc and multiqc for standardized reports
  • cutadapt, fastp, and trimmomatic for trimming and filtering
  • FastQLab for quick on-device FASTQ inspection and educational review when a workstation is not available

Example commands

# Generate QC reports
fastqc -t 4 sample_R1.fastq.gz sample_R2.fastq.gz
multiqc .

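# Adapter trimming (illustrative): ADAPTER_FWD and ADAPTER_REV are placeholders
# for the adapter sequences of your library kit.
# -q 20,20 quality-trims the 5' and 3' ends at Q20; pairs in which either
# read ends up shorter than 50 bases are discarded.
cutadapt -j 4 -a ADAPTER_FWD -A ADAPTER_REV \
  -q 20,20 --minimum-length 50 \
  -o trimmed_R1.fastq.gz -p trimmed_R2.fastq.gz \
  sample_R1.fastq.gz sample_R2.fastq.gz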

Phred score intuition

Phred scores compress error probabilities onto a logarithmic scale:

Q = -10 * log10(p_error)

That means Q30 is not just a little better than Q20: it corresponds to a tenfold lower error probability (0.001 versus 0.01). Still, trimming solely to maximize average Q can destroy useful data. Think in terms of downstream risk, not cosmetic improvement.
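
To make the scale concrete, here is a minimal Python sketch of the conversion in both directions. It assumes quality strings use the Phred+33 ASCII offset; the function names and the two example reads are hypothetical and chosen only for illustration.

```python
import math


def phred_to_error(q):
    """Error probability implied by Phred score q."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Phred score for error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def decode_phred33(quality_string):
    """Decode a FASTQ quality string, assuming the Phred+33 ASCII offset."""
    return [ord(ch) - 33 for ch in quality_string]


def expected_errors(quality_string):
    """Sum of per-base error probabilities across one read."""
    return sum(phred_to_error(q) for q in decode_phred33(quality_string))


for q in (10, 20, 30, 40):
    print(f"Q{q}: p_error = {phred_to_error(q):g}")

# Two hypothetical reads with the same mean Q (30) but different risk:
uniform = "?" * 10          # ten bases at Q30
mixed = "I" * 5 + "5" * 5   # five bases at Q40, five at Q20
print(expected_errors(uniform), expected_errors(mixed))
```

Both example reads average Q30, yet the mixed read carries roughly five times as many expected errors. That is one reason an average Q value is a weak guide on its own.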

Usually safe

Trim adapters decisively because they create artificial sequence content.

Use judgment

Quality trimming should be conservative when read length is already precious.
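
To see how much read length a cutoff can cost, the sketch below implements a running-sum 3' quality trimmer. It is modeled on the approach described in the cutadapt documentation, but it is an approximation for teaching, not the tool's actual code. The function name and the example qualities are hypothetical.

```python
def quality_trim_3prime(quals, cutoff):
    """Return how many bases to keep after 3' quality trimming.

    Walk backwards from the 3' end, summing (quality - cutoff).
    Cut at the position where that running sum is lowest, and stop
    scanning once the sum becomes positive.
    """
    running = 0
    lowest = 0
    keep = len(quals)
    for i in range(len(quals) - 1, -1, -1):
        running += quals[i] - cutoff
        if running > 0:
            break
        if running < lowest:
            lowest = running
            keep = i
    return keep


# Hypothetical read with a gentle end-of-read decline
quals = [38, 38, 37, 36, 34, 32, 29, 27, 24, 22, 19, 12]
for cutoff in (20, 30):
    kept = quality_trim_3prime(quals, cutoff)
    print(f"cutoff Q{cutoff}: keep {kept} of {len(quals)} bases")
```

On this example, raising the cutoff from Q20 to Q30 shortens the retained read from 10 bases to 6. That is the kind of trade-off to weigh when read length is scarce.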

Interactive FastQC-style report reader

Work through the modules below. The goal is to learn what each panel means, what common failure patterns look like, and when the correct answer is "investigate more" instead of trimming blindly.

Per-base quality

Look for where median quality falls and whether the lower tail becomes dangerously broad near the end of reads.

Adapter content

A late-cycle increase usually means the inserts were shorter than the read length, so the sequencer read through into adapter sequence.

GC content

A single clean peak is not always required, but unexpected extra peaks often justify contamination checks or metadata review.

Duplication levels

High duplication can reflect targeted sequencing, over-amplification, or low-complexity libraries. Interpretation depends on context.

Overrepresented sequences

These may be adapters, primers, contamination, or genuinely abundant biological reads. Sequence identity matters.

How to read per-base quality

A gentle end-of-read decline is normal. What matters is whether the low tail becomes severe enough to disrupt mapping, overlap merging, or variant confidence.

  • If the drop is late and modest, trimming may be minimal or unnecessary.
  • If the whole run looks weak, trimming will not rescue a fundamentally poor library.
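
The distinction between a late, modest drop and a weak run can be checked numerically. The following sketch summarizes each cycle by its median and an approximate lower decile, then reports the first cycle whose lower tail falls below a threshold. It assumes Phred+33 quality strings already held in memory; the function names, the Q20 threshold, and the toy input are illustrative assumptions.

```python
from statistics import median


def per_cycle_summary(quality_strings, offset=33):
    """Return (cycle, median, lower_decile) for each cycle, 1-based."""
    max_len = max((len(q) for q in quality_strings), default=0)
    summary = []
    for cycle in range(max_len):
        scores = sorted(
            ord(q[cycle]) - offset for q in quality_strings if len(q) > cycle
        )
        # Simple index-based approximation of the 10th percentile
        lower = scores[int(0.1 * (len(scores) - 1))]
        summary.append((cycle + 1, median(scores), lower))
    return summary


def first_weak_cycle(summary, min_lower=20):
    """First cycle whose lower decile drops below min_lower, else None."""
    for cycle, _median, lower in summary:
        if lower < min_lower:
            return cycle
    return None


# Toy input: three 5-base reads whose final cycle degrades
reads = ["IIII5", "IIII+", "III?#"]
summary = per_cycle_summary(reads)
for row in summary:
    print(row)
print("first weak cycle:", first_weak_cycle(summary))
```

If the first weak cycle sits near the end of the read, light trimming or none may be enough. If it sits near cycle 1, the problem is the run or the library rather than the read tail.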

Adapter content is often more actionable than average quality

Adapter sequence introduces false bases. Removing it is usually high value because it directly improves alignment and feature assignment.

  • Rising late-cycle adapter content suggests inserts shorter than read length.
  • Choose a minimum length so heavily trimmed fragments do not become misleading noise.
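
The two points above can be sketched as a toy 3' adapter trimmer followed by a minimum-length filter. This version uses exact matching only, whereas real trimmers such as cutadapt tolerate mismatches, so treat it as an illustration of the idea and not a replacement. The adapter shown is a commonly cited Illumina adapter prefix used purely as an example; the function names and thresholds are hypothetical.

```python
def trim_3prime_adapter(read, adapter, min_overlap=3):
    """Remove a 3' adapter using exact matching.

    First look for the full adapter anywhere in the read. If it is
    absent, look for a partial adapter (a prefix of the adapter) at
    the very end of the read, longest overlap first.
    """
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    longest = min(len(adapter) - 1, len(read))
    for overlap in range(longest, max(min_overlap, 1) - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]
    return read


def trim_and_filter(reads, adapter, min_length=50):
    """Trim each read, then drop reads shorter than min_length."""
    kept = []
    for read in reads:
        trimmed = trim_3prime_adapter(read, adapter)
        if len(trimmed) >= min_length:
            kept.append(trimmed)
    return kept


ADAPTER = "AGATCGGAAGAGC"  # example sequence; substitute your kit's adapter
reads = [
    "ACGTACGTAC" + ADAPTER + "TTT",  # full adapter inside the read
    "ACGTACGTACAGATC",               # partial adapter at the 3' end
    "ACGT" + ADAPTER,                # very short insert
]
print(trim_and_filter(reads, ADAPTER, min_length=8))
```

The third read has only four insert bases left after trimming, so the length filter removes it. Such a fragment could align almost anywhere and would add noise, not signal.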

GC content should be interpreted relative to experiment type

Amplicon, capture, metagenomic, and RNA-seq libraries naturally have different GC behavior. The question is whether the observed profile matches the expected biology and protocol.

  • Unexpected bimodality often suggests mixed organisms or contamination.
  • Highly shifted distributions can also reflect targeted enrichment bias.
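
A per-read GC histogram is straightforward to compute, which makes it easy to compare an observed profile against what the protocol should produce. The sketch below is a minimal version that assumes sequences are already in memory and that the bin width divides 100 evenly. The function names and the toy reads are hypothetical.

```python
from collections import Counter


def gc_counts(seq):
    """Return (GC bases, called bases); N and other symbols are ignored."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc, gc + at


def gc_fraction(seq):
    """GC fraction of one read, or None if it has no called bases."""
    gc, called = gc_counts(seq)
    return gc / called if called else None


def gc_histogram(seqs, bin_width=10):
    """Count reads per GC% bin; keys are the lower bound of each bin."""
    hist = Counter()
    for seq in seqs:
        gc, called = gc_counts(seq)
        if called == 0:
            continue
        percent = 100 * gc // called
        # Reads at exactly 100% GC fall into the top bin
        lower = min(percent // bin_width * bin_width, 100 - bin_width)
        hist[lower] += 1
    return dict(sorted(hist.items()))


# Toy reads: one AT-rich, one balanced, two GC-rich
reads = ["AAAATTTTGC", "ACGTACGTAC", "GGGGCCCCAT", "GGGCCCGCAT"]
for lower, count in gc_histogram(reads).items():
    print(f"{lower:3d}-{lower + 10:3d}% GC: {'#' * count}")
```

Two well-separated clusters of bins in such a histogram correspond to the bimodality described above. Whether that reflects contamination or expected biology still depends on the experiment type.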

Duplication is a context-dependent warning, not a universal failure

Whole-genome sequencing, RNA-seq, targeted sequencing, and amplicon data have different expectations. Use duplication together with library complexity and biological context.

  • Extremely high duplication in unbiased libraries often means low input or over-PCR.
  • In targeted assays, duplication may reflect true enrichment rather than failure.
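
As a rough illustration of what a duplication figure measures, the sketch below counts exact-duplicate sequences and reports how many distinct sequences occur once, twice, and so on. It assumes whole-sequence identity on reads held in memory; real tools typically subsample or use alignment coordinates, so their numbers will differ. The function names and the toy input are hypothetical.

```python
from collections import Counter


def duplication_summary(seqs):
    """Total reads, distinct sequences, and fraction of duplicate reads."""
    counts = Counter(seqs)
    total = sum(counts.values())
    unique = len(counts)
    rate = 1 - unique / total if total else 0.0
    return {"total": total, "unique": unique, "duplication_rate": rate}


def duplication_levels(seqs):
    """Map copy number -> how many distinct sequences have that many copies."""
    counts = Counter(seqs)
    return dict(sorted(Counter(counts.values()).items()))


# Toy library: six reads, three distinct sequences
reads = ["AAA", "AAA", "AAA", "CCC", "GGG", "GGG"]
print(duplication_summary(reads))
print(duplication_levels(reads))
```

Here six reads carry only three distinct sequences, so half of the reads add no new information. Whether that is alarming depends on the assay: it is expected in amplicon data and suspicious in an unbiased whole-genome library.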

Overrepresented sequences are clues

Always identify what the sequences are before deciding what to do. Some are technical artifacts; others may be real biology such as rRNA or highly abundant transcripts.

  • Match overrepresented sequences to adapter or primer databases when possible.
  • Cross-check with experiment type so you do not trim away expected signal.
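
The identify-before-acting workflow can be sketched as two small steps: find sequences above a frequency threshold, then try to label them against a list of known technical sequences. The dictionary below holds a single example entry and stands in for a real adapter or primer database. The 0.1% default threshold, the function names, and the toy reads are assumptions for illustration.

```python
from collections import Counter

# Example entry only; a real check would use a curated adapter/primer list
KNOWN_TECHNICAL = {
    "Illumina adapter (example prefix)": "AGATCGGAAGAGC",
}


def overrepresented(seqs, min_fraction=0.001):
    """Return (sequence, fraction) pairs at or above min_fraction."""
    counts = Counter(seqs)
    total = sum(counts.values())
    return [
        (seq, n / total)
        for seq, n in counts.most_common()
        if n / total >= min_fraction
    ]


def annotate(seq, known=KNOWN_TECHNICAL):
    """Label a sequence if it contains a known technical motif."""
    for name, motif in known.items():
        if motif in seq:
            return name
    return "unidentified"


# Toy input: ten reads, two dominant sequences
reads = (
    ["GATTACA"] * 6
    + ["ACGTAGATCGGAAGAGCTT"] * 3
    + ["TTTTGGGGCCCCAAAA"]
)
for seq, fraction in overrepresented(reads, min_fraction=0.25):
    print(f"{fraction:.0%}  {annotate(seq)}  {seq}")
```

A sequence labeled "unidentified" is not cleared for removal. It is a prompt to search further, since it may be rRNA or another abundant transcript that belongs in the data.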

Per-base quality summary (example)

Median, upper, and lower quantiles reveal when only a subset of reads is deteriorating versus when the whole run degrades.

Adapter signal across cycles (example)

A late-cycle surge is typical when fragments are shorter than the read length and sequencing enters adapter sequence.

Duplication profile (example)

This view helps separate healthy complexity from libraries dominated by repeated molecules.

Decision matrix

Observation | Likely cause | Good next step
Quality collapses only at the tail | Typical end-of-read decay | Consider light trimming or leave untouched if downstream tools tolerate it
Strong adapter rise late in reads | Short inserts | Trim adapters and confirm post-trim length distribution
Unexpected GC peak | Contamination or mixed library composition | Screen contamination and revisit sample metadata
Very high duplication in unbiased library | Low complexity / over-amplification | Assess whether re-sequencing or better library prep is needed
Same overrepresented sequence across many samples | Technical contaminant | Map the sequence identity before trimming or discarding

When re-sequencing is more honest than heroic trimming

Systemic run failure

If quality is poor from the beginning or oscillates in unusual patterns, no amount of trimming will create trustworthy data.

Library complexity collapse

If most reads are duplicates, your effective unique information is far smaller than the total read count suggests.

Severe contamination

If contaminant signatures dominate large fractions of the library, downstream interpretation may become more misleading than useful.