Bioinformatics Tutorial

RNA-seq: quantification, design, and differential expression

RNA-seq is not just about finding a list of significant genes. The quality of the biological design, the consistency of the library preparation, and the clarity of the model formula usually matter more than the choice of one aligner or one plotting style.

Common analysis strategies
Strategy | Typical tools | Why choose it
Genome alignment | STAR, HISAT2 | Excellent for splice junctions, coverage QC, and novel feature discovery
Transcript quantification | Salmon, Kallisto | Fast, efficient, and robust when a good transcriptome reference exists
Lightweight / mobile alignment | JapalitySplice | Resource-conscious splice-aware alignment for constrained environments and educational experimentation

Minimal conceptual workflow

  1. QC and trim if needed
  2. Quantify genes or transcripts
  3. Build a design matrix that reflects the biology
  4. Normalize counts and inspect sample structure
  5. Fit differential expression models
  6. Interpret with pathways, markers, and experiment context

Statistics you cannot skip
  • Replicates: without biological replicates, DE claims are often weak or indefensible.
  • Batch effects: prep date, lane, donor, and operator can dominate expression structure.
  • Multiple testing: with thousands of genes tested at once, raw p-values are not enough; report adjusted values that control the false discovery rate (FDR).
  • Effect size: tiny but significant differences are not automatically biologically meaningful.
  • Normalization: library size correction helps, but composition bias and unwanted variation still matter.
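To make the multiple-testing point concrete, here is a minimal Python sketch of the Benjamini-Hochberg adjustment, one common way to control the FDR. The function name and the five p-values are made up for illustration; DE packages apply an adjustment like this for you.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Illustrative raw p-values (not real data).
raw = [0.001, 0.008, 0.039, 0.041, 0.6]
padj = benjamini_hochberg(raw)
for p, q in zip(raw, padj):
    print(f"raw={p:.3f}  adjusted={q:.5f}")
```

In this toy input, the genes with raw p-values 0.039 and 0.041 would pass a 0.05 cutoff on raw values but not on adjusted ones.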
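
# Salmon quant (illustrative)
# Build the transcriptome index once per reference.
salmon index -t transcripts.fa -i tx_index
# Quantify one paired-end sample; -l A lets Salmon infer the library type.
salmon quant -i tx_index -l A \
  -1 trimmed_R1.fastq.gz -2 trimmed_R2.fastq.gz \
  -p 8 -o sample_salmon
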
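Salmon writes its per-transcript estimates to a tab-separated quant.sf file in the output directory, with the columns Name, Length, EffectiveLength, TPM, and NumReads. The Python sketch below sums NumReads to gene level. The transcript-to-gene mapping and the numbers are hypothetical, and a plain sum is a simplification of what dedicated import tools do.

```python
import csv
import io
from collections import defaultdict


def gene_counts_from_quant(handle, tx2gene):
    """Sum Salmon NumReads per gene from an open quant.sf handle."""
    totals = defaultdict(float)
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        gene = tx2gene.get(row["Name"])
        if gene is None:
            continue  # transcript absent from the mapping; skipped in this sketch
        totals[gene] += float(row["NumReads"])
    return dict(totals)


# In-memory stand-in for sample_salmon/quant.sf (values are made up).
example = io.StringIO(
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "tx1\t1000\t800.0\t10.0\t120.0\n"
    "tx2\t1500\t1300.0\t2.0\t30.5\n"
    "tx3\t900\t700.0\t1.0\t7.0\n"
    "tx4\t500\t300.0\t0.5\t2.0\n"
)
# Hypothetical mapping; tx4 is deliberately left unmapped.
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
gene_counts = gene_counts_from_quant(example, tx2gene)
print(gene_counts)
```

For a real run, the handle would come from something like open("sample_salmon/quant.sf", newline="").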
RNA-seq interpretation guide

Each stage below is paired with the main question it answers and the most common way learners get misled. The stages are design and replicates, counts or TPM, normalization, batch inspection, and DE with interpretation.

Design decides what you are allowed to conclude

Before thinking about software, define your comparison: treatment vs control, paired vs unpaired, donor effects, time points, and replicates. A weak design cannot be repaired statistically afterward.

  • Include replicates whenever you want differential expression claims.
  • Write the experimental factors explicitly before you run the pipeline.
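Writing the factors down can be as literal as building the design matrix by hand. The Python sketch below encodes a two-level condition plus a batch factor, roughly what a formula such as ~ batch + condition expands to, and checks whether batch and condition are fully confounded. The sample sheet, field names, and level names are hypothetical.

```python
def build_design(samples, reference="control", treatment="treated"):
    """Return (column names, rows) encoding the model ~ batch + condition."""
    batches = sorted({s["batch"] for s in samples})
    extra_batches = batches[1:]  # the first batch is the baseline level
    columns = ["intercept"] + [f"batch_{b}" for b in extra_batches]
    columns.append(f"condition_{treatment}")
    rows = []
    for s in samples:
        if s["condition"] not in (reference, treatment):
            raise ValueError(f"unexpected condition: {s['condition']}")
        row = [1] + [int(s["batch"] == b) for b in extra_batches]
        row.append(int(s["condition"] == treatment))
        rows.append(row)
    return columns, rows


def fully_confounded(samples):
    """True when every batch contains only one condition."""
    seen = {}
    for s in samples:
        seen.setdefault(s["batch"], set()).add(s["condition"])
    return all(len(conditions) == 1 for conditions in seen.values())


# Hypothetical sample sheet: both conditions appear in both batches.
samples = [
    {"sample": "s1", "condition": "control", "batch": "A"},
    {"sample": "s2", "condition": "treated", "batch": "A"},
    {"sample": "s3", "condition": "control", "batch": "B"},
    {"sample": "s4", "condition": "treated", "batch": "B"},
]
columns, rows = build_design(samples)
print(columns)
for row in rows:
    print(row)
print("fully confounded:", fully_confounded(samples))
```

If fully_confounded returns True, the batch and condition columns carry the same information and no model can separate them, which is the sense in which a weak design cannot be repaired afterward.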

Quantification converts reads into comparable summaries

Gene-level counts are commonly used for DE, while transcript-level abundance can be useful for isoform questions. Choose a summary that matches the question, not just the easiest tool.

  • Raw, un-normalized counts are the expected input for count-based DE models such as DESeq2 or edgeR.
  • TPM is useful for within-sample expression profiles, not as a drop-in replacement for DE counts.
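A short Python sketch shows why TPM is a within-sample measure. It assumes plain transcript lengths in base pairs (quantifiers such as Salmon use effective lengths instead), and the counts are invented.

```python
def counts_to_tpm(counts, lengths_bp):
    """Convert counts to transcripts per million for a single sample."""
    # Reads per kilobase removes the length effect within the sample.
    rpk = [c / (length / 1000) for c, length in zip(counts, lengths_bp)]
    scale = sum(rpk) / 1e6
    return [r / scale for r in rpk]


lengths = [1000, 2000, 1000]
# Two made-up samples: only the third gene changes between them.
tpm_a = counts_to_tpm([100, 200, 300], lengths)
tpm_b = counts_to_tpm([100, 200, 1300], lengths)
print("gene 1 TPM in sample A:", round(tpm_a[0]))
print("gene 1 TPM in sample B:", round(tpm_b[0]))
```

Gene 1 has the same count in both samples, yet its TPM falls in sample B because TPM always sums to one million per sample. That coupling is why TPM does not substitute for counts in a DE model.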

Normalization reduces unfair sample-to-sample differences

Different libraries have different depths and composition. Normalization tries to make expression more comparable without erasing true biology.

  • Large composition shifts can distort naive interpretations of raw counts.
  • Always inspect sample-level plots after normalization.
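One widely used answer to composition bias is median-of-ratios scaling. The Python sketch below illustrates the idea on a made-up count matrix; it is a teaching version, not the implementation of any particular package.

```python
import math
from statistics import median


def size_factors(count_matrix):
    """count_matrix[g][s] is the count of gene g in sample s."""
    n_samples = len(count_matrix[0])
    # Keep genes with no zero counts so every logarithm is defined.
    usable = [row for row in count_matrix if all(c > 0 for c in row)]
    log_ratios = [[] for _ in range(n_samples)]
    for row in usable:
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for s, c in enumerate(row):
            log_ratios[s].append(math.log(c) - log_geo_mean)
    # The median ratio is robust to a few strongly changing genes.
    return [math.exp(median(r)) for r in log_ratios]


# Made-up counts: sample 2 is sequenced twice as deep, and gene 4 is
# strongly induced on top of that.
counts = [
    [10, 20],
    [100, 200],
    [50, 100],
    [30, 600],
    [0, 5],
]
sf = size_factors(counts)
depth_ratio = sum(r[1] for r in counts) / sum(r[0] for r in counts)
print("size factor ratio:", round(sf[1] / sf[0], 3))
print("library size ratio:", round(depth_ratio, 3))
```

The size factors recover the twofold depth difference, while the raw library-size ratio is inflated by the single induced gene.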

Batch structure can overwhelm biology

If samples cluster by run date or operator rather than condition, the model and interpretation must account for that. Ignoring batch effects can create elegant but false results.

  • PCA is often the fastest way to see hidden sample structure.
  • Do not blindly remove variation you do not understand.
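To show what a PCA plot is summarizing, the Python sketch below computes the direction of the first principal component across samples. It uses power iteration on the sample-by-sample Gram matrix rather than a full PCA routine, and the log-expression values are invented so that batch dominates.

```python
import math


def first_pc_direction(log_expr, n_iter=200):
    """log_expr[g][s]: log-scale expression. Returns a unit vector with one
    entry per sample, proportional to the PC1 scores."""
    n = len(log_expr[0])
    # Center each gene across samples.
    centered = []
    for row in log_expr:
        mean = sum(row) / n
        centered.append([x - mean for x in row])
    # Sample-by-sample Gram matrix of the centered data.
    gram = [[sum(r[i] * r[j] for r in centered) for j in range(n)]
            for i in range(n)]
    # Power iteration converges to the leading eigenvector.
    v = [float(i + 1) for i in range(n)]
    for _ in range(n_iter):
        w = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Made-up data. Samples: control/batch1, treated/batch1,
# control/batch2, treated/batch2.
log_expr = [
    [5, 5, 8, 8],  # follows batch
    [6, 6, 9, 9],  # follows batch
    [4, 5, 4, 5],  # follows condition
    [7, 7, 4, 4],  # follows batch (opposite direction)
]
pc1 = first_pc_direction(log_expr)
print([round(x, 3) for x in pc1])
```

Here the two batch 1 samples share one sign and the two batch 2 samples share the other, so PC1 tracks batch rather than condition.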

Differential expression is the start of interpretation, not the end

A volcano plot is only an overview. The real work is understanding whether the changing genes fit the biology, known pathways, cell composition shifts, or technical confounding.

  • Read effect size and FDR together.
  • Validate key genes with annotation and biological knowledge.
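Reading effect size and FDR together can be expressed as a simple two-condition filter. In the Python sketch below, the thresholds, field names, and results table are all illustrative assumptions rather than recommended defaults.

```python
def call_de_genes(results, max_padj=0.05, min_abs_lfc=1.0):
    """Keep genes that pass both an FDR cutoff and an effect-size cutoff."""
    hits = []
    for r in results:
        if r["padj"] is None:
            continue  # no adjusted p-value available for this gene
        if r["padj"] <= max_padj and abs(r["log2fc"]) >= min_abs_lfc:
            hits.append(r["gene"])
    return hits


# Made-up DE results.
results = [
    {"gene": "geneA", "log2fc": 2.3, "padj": 0.001},
    {"gene": "geneB", "log2fc": 0.1, "padj": 0.0001},  # significant but tiny
    {"gene": "geneC", "log2fc": -3.0, "padj": 0.4},    # large but unsupported
    {"gene": "geneD", "log2fc": -1.5, "padj": 0.02},
    {"gene": "geneE", "log2fc": 4.0, "padj": None},
]
hits = call_de_genes(results)
print(hits)
```

The filter only narrows the list; whether the surviving genes matter is still a biological question.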

Volcano plot (example)

Genes with large effect size and strong statistical support appear in the upper corners, but biology still decides which of them matter.

Marker overview (example)

A small marker set can be visualized quickly, but this is a communication aid, not a substitute for proper modeling.

PCA-style sample structure (example)

The safest RNA-seq projects show samples separating primarily by biology, not by library date or lane.

MA plot (example)

MA plots help you see whether expression changes are balanced, intensity-dependent, or dominated by low-count noise.
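The two axes of an MA plot are easy to compute directly. This Python sketch assumes per-gene mean normalized counts for two conditions and adds a pseudocount of 1 to avoid taking the log of zero; both choices are assumptions for illustration.

```python
import math


def ma_values(mean_a, mean_b, pseudocount=1.0):
    """Return (A, M) pairs: A is the average log2 expression and
    M is the log2 fold change of condition b over condition a."""
    pairs = []
    for a, b in zip(mean_a, mean_b):
        log_a = math.log2(a + pseudocount)
        log_b = math.log2(b + pseudocount)
        pairs.append((0.5 * (log_a + log_b), log_b - log_a))
    return pairs


# Made-up mean counts for three genes.
ma = ma_values([3, 1023, 0], [7, 1023, 3])
for a_val, m_val in ma:
    print(f"A={a_val:.2f}  M={m_val:.2f}")
```

The third gene has the largest M but the smallest A, which is the low-count region where large fold changes are least trustworthy.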

Design checklist before you trust a DE result

Replicates and metadata

Every sample should have condition labels, batch information, and any paired/blocked structure clearly recorded.
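A metadata check like this can run before any quantification. The Python sketch below assumes a sample sheet with sample, condition, and batch fields and a minimum of three replicates per condition; the field names and the threshold are hypothetical.

```python
from collections import Counter

REQUIRED_FIELDS = ("sample", "condition", "batch")


def check_sample_sheet(rows, min_replicates=3):
    """Return a list of human-readable problems found in the sample sheet."""
    problems = []
    for i, row in enumerate(rows):
        for field in REQUIRED_FIELDS:
            if not row.get(field):
                problems.append(f"row {i}: missing {field}")
    per_condition = Counter(r["condition"] for r in rows if r.get("condition"))
    for condition, n in sorted(per_condition.items()):
        if n < min_replicates:
            problems.append(f"condition {condition}: only {n} replicate(s)")
    return problems


# Made-up sample sheet with two deliberate problems.
sheet = [
    {"sample": "s1", "condition": "control", "batch": "A"},
    {"sample": "s2", "condition": "control", "batch": "A"},
    {"sample": "s3", "condition": "control", "batch": "B"},
    {"sample": "s4", "condition": "treated", "batch": "A"},
    {"sample": "s5", "condition": "treated", "batch": ""},
]
problems = check_sample_sheet(sheet)
for problem in problems:
    print(problem)
```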

QC consistency

Outlier samples with bad QC can dominate DE results if left in without explanation.
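One simple way to surface such samples is to flag any QC metric that sits far from the median. The Python sketch below uses the median absolute deviation on a hypothetical per-sample mapping rate. The metric, the threshold of three MADs, and the values are assumptions, and the function only flags samples for review rather than removing them.

```python
from statistics import median


def flag_outliers(metric_by_sample, n_mads=3.0):
    """Return sample names whose metric lies more than n_mads median
    absolute deviations from the median."""
    values = list(metric_by_sample.values())
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        return []  # no spread to compare against
    return [name for name, v in metric_by_sample.items()
            if abs(v - center) > n_mads * mad]


# Made-up mapping rates.
mapping_rate = {"s1": 0.91, "s2": 0.89, "s3": 0.90, "s4": 0.92, "s5": 0.55}
flagged = flag_outliers(mapping_rate)
print(flagged)
```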

Interpretation guardrails

Significant genes should be checked for annotation quality, known pathways, and plausibility within the experiment.