Bioinformatics Tutorial

Getting Started

Bioinformatics is applied data engineering for biology. You will spend time on (1) file formats, (2) command-line tooling, and (3) checking assumptions with plots and summary statistics.

What problems does bioinformatics solve?

Common questions include:

  • DNA: Which variants are present (SNPs/indels/CNVs)?
  • RNA: Which genes are expressed and which change across conditions?
  • Microbiome: Which taxa are present and how do communities differ?
  • Evolution: How are sequences related (trees), and which sites show signals of selection?

Core skills

Skill          Why it matters
Shell          Most tools are CLI-first; reproducible pipelines rely on scripts.
Statistics     Calling results “significant” without sound design and controls is a fast path to errors.
Data formats   FASTQ/BAM/VCF/GTF conventions determine correctness.
Visualization  Plots catch issues that summary numbers hide.
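
To make the point about format conventions concrete, here is a minimal sketch that counts reads in a FASTQ file and checks one basic invariant. It assumes the common four-line record layout (header, sequence, "+", qualities) with no wrapped lines; the file name and the two reads are made up for illustration.

```shell
set -eu

# Toy FASTQ with two made-up reads (hypothetical data, for illustration only).
workdir=$(mktemp -d)
cat > "$workdir/toy.fastq" <<'EOF'
@read1
ACGTACGT
+
IIIIIIII
@read2
GGCCTTAA
+
IIIIHHHH
EOF

# Assuming four lines per record, reads = lines / 4.
n_lines=$(wc -l < "$workdir/toy.fastq")
n_reads=$(( n_lines / 4 ))

# Sanity check: sequence (line 2) and quality (line 4) must have equal length.
bad_records=$(awk 'NR % 4 == 2 { seq_len = length($0) }
                   NR % 4 == 0 && length($0) != seq_len { bad++ }
                   END { print bad + 0 }' "$workdir/toy.fastq")

echo "reads=$n_reads bad_records=$bad_records"
```

A non-zero bad_records count usually points to a truncated or corrupted file, which is worth catching before any downstream tool sees it.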

Prerequisites & Environment

We assume familiarity with the command line (Unix/Linux/macOS). Bioinformatics relies heavily on text streams and pipes.

  • Terminal Basics: cd, ls, grep, pipe |, redirection >.
  • Package Manager: Conda / Mamba (recommended) or Homebrew.
  • Hardware: These tutorials run on a laptop; real datasets often require a server or HPC cluster.
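
As a quick self-check on the terminal basics listed above, the following sketch combines grep, a pipe, and redirection on a tiny FASTA file. The file names and sequences are invented for the example.

```shell
set -eu
workdir=$(mktemp -d)

# Redirection (>) writes a toy FASTA file; the sequences are made up.
printf '>seqA\nACGTAC\n>seqB\nGGGCCC\nTTT\n>seqC\nAT\n' > "$workdir/toy.fasta"

# grep -c counts matching lines; every FASTA record starts with '>'.
n_seqs=$(grep -c '^>' "$workdir/toy.fasta")

# Pipe (|): drop header lines, then sum the remaining line lengths.
n_bases=$(grep -v '^>' "$workdir/toy.fasta" \
    | awk '{ total += length($0) } END { print total + 0 }')

# Pipe plus redirection: strip the '>' and save the sequence IDs to a file.
grep '^>' "$workdir/toy.fasta" | sed 's/^>//' > "$workdir/ids.txt"

echo "sequences=$n_seqs bases=$n_bases"
```

If each step here reads naturally, you have the shell background the rest of the tutorial assumes.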

Quick Setup

Create an environment with common tools:

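# Check that conda is installed
conda --version

# Create a clean environment (recommended).
# These tools are distributed through the bioconda channel, so name the channels explicitly.
mamba create -n bioinfo -c conda-forge -c bioconda fastqc cutadapt bwa samtools bcftools multiqc

# If `mamba activate` is not initialized in your shell, `conda activate bioinfo` also works.
mamba activate bioinfo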
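
After activating the environment, it can help to confirm that the tools are actually on your PATH. The sketch below is one way to do that; check_tools is a hypothetical helper name, not a conda or mamba command.

```shell
# check_tools is a hypothetical helper, not part of conda or mamba.
# It prints one status line per tool and returns the number of missing tools.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "ok       $tool"
        else
            echo "MISSING  $tool"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

# Tool names taken from the setup command in this section.
check_tools fastqc cutadapt bwa samtools bcftools multiqc \
    || echo "Some tools are missing; is the bioinfo environment active?"
```

Because the function returns a non-zero status when anything is missing, it can also serve as a guard at the top of a pipeline script.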

Data growth (illustrative)

Public archives grow quickly; automation and QC become essential.

Figure: illustrative plot of data growth; not an official archive metric.

A reproducible workflow checklist

Do

  • Pin reference versions (FASTA + annotation)
  • Save software versions (--version)
  • Keep an analysis manifest (samples, groups, lanes)
  • Track parameters (trim, align, filters)
  • Plot QC metrics before interpreting results
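
One way to act on the "save software versions" item is to write each tool's reported version to a small table alongside your results. This is a sketch under stated assumptions: record_versions is a hypothetical helper name, and it assumes each listed tool accepts a --version flag. bwa is left out on the assumption that it reports its version in its usage text rather than through --version; check your installed release.

```shell
# record_versions is a hypothetical helper name. It assumes each tool accepts
# --version; a tool that does not will have its error message recorded instead.
record_versions() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            # Keep only the first line; some tools print multi-line banners.
            version=$("$tool" --version 2>&1 | head -n 1)
        else
            version="NOT_FOUND"
        fi
        printf '%s\t%s\n' "$tool" "$version"
    done
}

# Write one tab-separated line per tool into a scratch directory.
outdir=$(mktemp -d)
record_versions fastqc cutadapt samtools bcftools multiqc > "$outdir/versions.tsv"
cat "$outdir/versions.tsv"
```

In a real project you would write versions.tsv next to the analysis outputs rather than in a temporary directory, so the record travels with the results.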

Avoid

  • Mixing genome builds (e.g., GRCh37 vs GRCh38)
  • Comparing groups without biological replicates
  • Filtering variants without considering depth/bias
  • Using a tool's output without reading the field definitions
  • Over-trusting a single metric (always triangulate)
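
Two of the pitfalls above can be screened for mechanically. The sketch below counts samples per group in a manifest and compares contig lengths between two reference index tables. The manifest layout (sample, group, lane) and all file names are assumptions made for this example. The first check treats each row as one biological sample, which will not hold if a sample appears once per lane. The second check reads only the first two columns (name, length), as found in a samtools .fai index; the chr1 lengths are the published GRCh37 and GRCh38 values, used here as toy inputs.

```shell
set -eu
workdir=$(mktemp -d)

# Check 1: replicates per group.
# Hypothetical manifest: sample, group, lane (tab-separated, with a header row).
printf 'sample\tgroup\tlane\n' > "$workdir/manifest.tsv"
printf 's1\tcontrol\tL001\ns2\tcontrol\tL001\ns3\ttreated\tL002\n' >> "$workdir/manifest.tsv"

# Count rows per group (skipping the header) and list groups with fewer than two.
under_replicated=$(awk -F '\t' 'NR > 1 { n[$2]++ }
    END { for (g in n) if (n[g] < 2) print g }' "$workdir/manifest.tsv" | sort)
echo "groups with <2 samples: ${under_replicated:-none}"

# Check 2: do two references agree on contig lengths?
# Toy index tables with name<TAB>length, one contig each.
printf 'chr1\t249250621\n' > "$workdir/ref_a.fai"
printf 'chr1\t248956422\n' > "$workdir/ref_b.fai"

# Load lengths from the first file, then report contigs whose length differs
# in the second file. Contigs present in only one file are not reported.
mismatched=$(awk -F '\t' 'NR == FNR { len[$1] = $2; next }
    ($1 in len) && len[$1] != $2 { print $1 }' \
    "$workdir/ref_a.fai" "$workdir/ref_b.fai")
echo "contigs with different lengths: ${mismatched:-none}"
```

Neither check proves an analysis is sound, but a flagged group or a length mismatch is a cheap signal to stop and look before interpreting results.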