Bioinformatics Tutorial

Single-cell RNA-seq: filtering cells, defining clusters, and avoiding false stories

Single-cell workflows are powerful because they separate heterogeneous cell states, but they are also easy to over-interpret. QC thresholds, batch handling, doublet handling, and annotation strategy all shape the final biological story.

Typical workflow
Stage | Main output
Barcode / UMI processing | Cell-by-gene count matrix
Cell QC filtering | High-quality cell subset
Normalization + HVG selection | Comparable expression matrix
Dimensionality reduction | PCA / UMAP coordinates
Clustering + annotation | Cell states / identities
Marker analysis | Genes that separate clusters or conditions
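To make the normalization stage concrete, here is a minimal sketch of library-size normalization followed by a log transform, which is one common way to make cells with different sequencing depth comparable. The function name, the target of 10,000 counts per cell, and the toy count matrix are assumptions for illustration, not values taken from this tutorial.

```python
import math

def normalize_log1p(counts, target_sum=10_000):
    """Scale each cell to the same total count, then apply log(1 + x).

    counts: list of cells, each a list of raw UMI counts per gene.
    target_sum is an illustrative choice, not a recommendation.
    """
    normalized = []
    for cell in counts:
        total = sum(cell)
        if total == 0:
            # An empty barcode carries no information; keep it as zeros.
            normalized.append([0.0] * len(cell))
            continue
        scale = target_sum / total
        normalized.append([math.log1p(c * scale) for c in cell])
    return normalized

# Toy matrix: cells 0 and 1 have the same composition at different depth.
counts = [[10, 0, 90], [1, 0, 9], [0, 0, 0]]
norm = normalize_log1p(counts)
for row in norm:
    print([round(v, 3) for v in row])
```

After normalization, the first two cells have identical profiles even though one was sequenced ten times deeper, which is the sense in which the matrix becomes "comparable".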
Artifacts you must rule out
  • Ambient RNA: free-floating RNA contaminates droplets and distorts weakly expressed markers
  • Doublets: two cells captured together can look like a fake hybrid cell type
  • High mitochondrial fraction: the share of mitochondrial reads often increases in stressed or damaged cells
  • Batch effects: chemistry, run date, and donor handling can dominate clustering
  • Over-clustering: fine clusters are not always distinct biological cell types
Interactive cell filtering sandbox

Adjust the QC thresholds and see how many synthetic cells remain. This helps build intuition for the tradeoff between removing low-quality cells and accidentally discarding rare but real populations.

  • Minimum genes per cell: very low-gene cells are often empty droplets or damaged cells.
  • Minimum UMIs per cell: low UMI counts reduce the stability of downstream clustering and marker calls.
  • Maximum mitochondrial fraction: a high mitochondrial fraction often marks stressed or dying cells.
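The threshold tradeoff can also be explored offline. The sketch below generates synthetic per-cell QC metrics and counts how many cells survive two threshold settings. Everything here is hypothetical: the simulated distributions, the field names (`n_genes`, `n_umis`, `pct_mito`), and both sets of cutoffs are chosen only to illustrate the mechanics, not to suggest defaults.

```python
import random

def simulate_cells(n, seed=0):
    """Generate synthetic per-cell QC metrics (illustrative values only)."""
    rng = random.Random(seed)
    cells = []
    for _ in range(n):
        n_genes = rng.randint(100, 6000)
        n_umis = n_genes * rng.randint(2, 5)   # UMIs loosely track gene count
        pct_mito = rng.uniform(0.0, 40.0)      # percent mitochondrial reads
        cells.append({"n_genes": n_genes, "n_umis": n_umis, "pct_mito": pct_mito})
    return cells

def filter_cells(cells, min_genes, min_umis, max_pct_mito):
    """Keep cells that pass all three QC thresholds."""
    return [
        c for c in cells
        if c["n_genes"] >= min_genes
        and c["n_umis"] >= min_umis
        and c["pct_mito"] <= max_pct_mito
    ]

cells = simulate_cells(1000)
lenient = filter_cells(cells, min_genes=200, min_umis=500, max_pct_mito=20.0)
strict = filter_cells(cells, min_genes=1000, min_umis=2000, max_pct_mito=5.0)

print(f"all cells: {len(cells)}")
print(f"lenient thresholds keep: {len(lenient)}")
print(f"strict thresholds keep: {len(strict)}")
```

Because every strict cutoff is at least as tight as its lenient counterpart, the strict set is always a subset of the lenient one. On real data, the cells lost between the two settings are the ones worth inspecting before deciding they are low quality rather than a rare population.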
UMAP-like embedding (example)

Cluster separation in 2D is helpful for intuition, but cluster boundaries are still shaped by preprocessing, resolution choice, and batch handling.

Cluster sizes (example)

Very small clusters can represent rare populations, but they can also reflect doublets, broken cells, or over-clustering.

How to annotate clusters without fooling yourself

Use marker combinations

One marker gene is rarely enough. Look for coherent marker sets and pathway logic, not a single famous gene.
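As a sketch of what "marker combinations" can mean in practice, the example below scores each cluster against small marker sets and refuses to assign a label when only one marker of the set is detected. The marker sets use well-known T cell and B cell genes, but the cluster means, the detection threshold, and the minimum-fraction rule are made-up assumptions for illustration.

```python
# Canonical marker sets (small on purpose; real panels are usually larger).
MARKER_SETS = {
    "T cell": ["CD3D", "CD3E", "TRAC"],
    "B cell": ["MS4A1", "CD79A", "CD79B"],
}

# Hypothetical mean normalized expression per cluster.
cluster_means = {
    "cluster_0": {"CD3D": 2.1, "CD3E": 1.8, "TRAC": 2.4, "MS4A1": 0.0, "CD79A": 0.1, "CD79B": 0.0},
    "cluster_1": {"CD3D": 0.1, "CD3E": 0.0, "TRAC": 0.0, "MS4A1": 2.5, "CD79A": 2.2, "CD79B": 1.9},
    "cluster_2": {"CD3D": 0.0, "CD3E": 0.1, "TRAC": 0.0, "MS4A1": 1.9, "CD79A": 0.1, "CD79B": 0.0},
}

def score_marker_sets(means, marker_sets, detect_threshold=0.5):
    """Return mean expression and fraction of markers detected per set."""
    scores = {}
    for label, genes in marker_sets.items():
        values = [means.get(g, 0.0) for g in genes]
        detected = sum(v > detect_threshold for v in values)
        scores[label] = {
            "mean": sum(values) / len(values),
            "fraction_detected": detected / len(genes),
        }
    return scores

def suggest_label(means, marker_sets, min_fraction=2 / 3):
    """Pick the best-scoring set, but only if most of its markers agree."""
    scores = score_marker_sets(means, marker_sets)
    best = max(scores, key=lambda label: scores[label]["mean"])
    if scores[best]["fraction_detected"] < min_fraction:
        return "unresolved"
    return best

for name, means in cluster_means.items():
    print(name, "->", suggest_label(means, MARKER_SETS))
```

Here `cluster_2` expresses MS4A1 strongly but neither CD79A nor CD79B, so it stays "unresolved" instead of being called a B cell on the strength of one famous gene.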

Check QC overlays

If one cluster is mostly high-mito cells or low-count cells, it may be technical, not biological.
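A QC overlay can be reduced to a simple per-cluster summary: what fraction of each cluster's cells would fail a QC flag? The sketch below assumes hypothetical per-cell records and illustrative cutoffs (10% mitochondrial reads, 1,000 UMIs, and a 50% "suspect" cutoff); none of these values come from the tutorial.

```python
from collections import defaultdict

# Hypothetical per-cell records after clustering.
cells = [
    {"cluster": "A", "pct_mito": 3.0, "n_umis": 5000},
    {"cluster": "A", "pct_mito": 4.5, "n_umis": 4200},
    {"cluster": "A", "pct_mito": 12.0, "n_umis": 3900},
    {"cluster": "A", "pct_mito": 2.0, "n_umis": 6100},
    {"cluster": "B", "pct_mito": 25.0, "n_umis": 800},
    {"cluster": "B", "pct_mito": 18.0, "n_umis": 1500},
    {"cluster": "B", "pct_mito": 6.0, "n_umis": 600},
]

def flagged_fraction_by_cluster(cells, max_pct_mito=10.0, min_umis=1000):
    """Fraction of cells per cluster that are high-mito or low-count."""
    total = defaultdict(int)
    flagged = defaultdict(int)
    for c in cells:
        total[c["cluster"]] += 1
        if c["pct_mito"] > max_pct_mito or c["n_umis"] < min_umis:
            flagged[c["cluster"]] += 1
    return {cluster: flagged[cluster] / total[cluster] for cluster in total}

fractions = flagged_fraction_by_cluster(cells)
# Clusters where most cells are flagged deserve a second look.
suspect = sorted(cl for cl, frac in fractions.items() if frac > 0.5)

print(fractions)
print("possibly technical clusters:", suspect)
```

In this toy data, cluster B consists entirely of flagged cells, so it would be treated as a likely technical cluster until other evidence says otherwise.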

Respect donor / batch structure

A cluster that appears in only one batch may be real, or it may be just a processing artifact.
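One quick check is to tabulate batch composition per cluster and flag clusters dominated by a single batch. The sketch below uses hypothetical cluster and batch labels and an arbitrary 90% cutoff, both assumptions made for illustration.

```python
from collections import Counter, defaultdict

def batch_fractions(assignments):
    """Map each cluster to the fraction of its cells from each batch.

    assignments: iterable of (cluster, batch) pairs, one per cell.
    """
    counts = defaultdict(Counter)
    for cluster, batch in assignments:
        counts[cluster][batch] += 1
    result = {}
    for cluster, per_batch in counts.items():
        total = sum(per_batch.values())
        result[cluster] = {b: n / total for b, n in per_batch.items()}
    return result

def single_batch_clusters(assignments, max_fraction=0.9):
    """Clusters in which one batch contributes more than max_fraction of cells."""
    fractions = batch_fractions(assignments)
    return sorted(
        cluster for cluster, per_batch in fractions.items()
        if max(per_batch.values()) > max_fraction
    )

# Hypothetical data: c0 is mixed, c1 is 95% batch1, c2 is only batch2.
assignments = (
    [("c0", "batch1")] * 40 + [("c0", "batch2")] * 35
    + [("c1", "batch1")] * 19 + [("c1", "batch2")] * 1
    + [("c2", "batch2")] * 12
)

print(batch_fractions(assignments))
print("dominated by one batch:", single_batch_clusters(assignments))
```

A flag here is a prompt to investigate, not a verdict. When batches differ greatly in size, or when a donor genuinely carries a population the others lack, a skewed cluster can still be biological.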

Single-cell interpretation checklist
  • Report QC thresholds and justify why they were chosen.
  • Describe how doublets and ambient RNA were handled.
  • State whether integration / batch correction was applied and why.
  • Use multiple markers and known biology when naming clusters.
  • Do not treat cluster-level differential expression (DE) as cell-type discovery without independent validation.