Getting Started
Bioinformatics is applied data engineering for biology. You will spend time on (1) file formats, (2) command-line tooling, and (3) checking assumptions with plots and summary statistics.
What problems does bioinformatics solve?
Common questions include:
- DNA: Which variants are present (SNPs/indels/CNVs)?
- RNA: Which genes are expressed and which change across conditions?
- Microbiome: Which taxa are present and how do communities differ?
- Evolution: How are sequences related (trees), and where are the signals of selection?
Core skills
| Skill | Why it matters |
|---|---|
| Shell | Most tools are CLI-first; reproducible pipelines rely on scripts. |
| Statistics | Calling results “significant” without sound experimental design and controls is a fast path to errors. |
| Data formats | FASTQ/BAM/VCF/GTF conventions determine correctness. |
| Visualization | Plots catch issues that summary numbers hide. |
Prerequisites & Environment
We assume familiarity with the command line (Unix/Linux/macOS). Bioinformatics relies heavily on text streams and pipes.
- Terminal Basics: cd, ls, grep, pipe (|), redirection (>).
- Package Manager: Conda / Mamba (recommended) or Homebrew.
- Hardware: These tutorials run on a laptop; real datasets often require a server or HPC cluster.
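The pipe-and-redirect pattern mentioned above can be sketched on a tiny, made-up FASTQ file. The file names, read names, and sequences are invented for illustration; the only assumption about the format is that each record spans exactly four lines, which holds for standard unwrapped FASTQ.

```shell
# Work in a scratch directory so nothing is overwritten
workdir=$(mktemp -d)
cd "$workdir"

# Write a toy FASTQ with three reads
# (4 lines per record: header, sequence, '+', qualities)
printf '%s\n' \
  '@read1' 'ACGTACGT' '+' 'IIIIIIII' \
  '@read2' 'ACGTNNGT' '+' 'IIII##II' \
  '@read3' 'TTGACCAA' '+' 'IIIIIIII' > toy.fastq

# List the file to confirm it exists
ls -l toy.fastq

# Count reads: total lines divided by 4
n_lines=$(wc -l < toy.fastq)
n_reads=$((n_lines / 4))
echo "reads: $n_reads"

# Pipe: take the sequence lines (line 2 of every record),
# keep those containing an N, and count them
n_with_n=$(awk 'NR % 4 == 2' toy.fastq | grep -c 'N')
echo "reads with N: $n_with_n"

# Redirect: save the read names (header lines minus the leading '@') to a file
awk 'NR % 4 == 1' toy.fastq | sed 's/^@//' > read_names.txt
```

Selecting lines by position (awk with NR % 4) rather than with grep '^@' is deliberate: quality strings may also begin with '@', so a pattern match on headers can overcount.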
Quick Setup
Create an environment with common tools:
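```shell
# Check that conda and mamba are installed
conda --version
mamba --version
# Create a clean environment (recommended).
# These tools are distributed through the bioconda channel, so name the channels explicitly.
mamba create -n bioinfo -c conda-forge -c bioconda fastqc cutadapt bwa samtools bcftools multiqc
# Activate it ("conda activate bioinfo" also works if your shell is initialized for conda only)
mamba activate bioinfo
```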
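After activating the environment, it can help to confirm that each tool actually resolves on your PATH. The following is a minimal sketch: the function name check_tools is hypothetical, and the tool list simply mirrors the packages installed above.

```shell
# Report which of the given commands are found on PATH.
# Returns 0 if all are present, 1 if any is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "ok       $tool -> $(command -v "$tool")"
    else
      echo "MISSING  $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Check the tools from the environment created above.
# The message after '||' is printed only when at least one tool is missing.
check_tools fastqc cutadapt bwa samtools bcftools multiqc \
  || echo "Some tools are missing: is the bioinfo environment active?"
```

A MISSING line right after installation usually means the environment is not active in the current shell, rather than that the install failed.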
Data growth (illustrative)
Public archives grow quickly, so automation and QC become essential.
(Figure: illustrative data-growth plot, not an official archive metric.)
A reproducible workflow checklist
Do
- Pin reference versions (FASTA + annotation)
- Save software versions (--version)
- Keep an analysis manifest (samples, groups, lanes)
- Track parameters (trim, align, filters)
- Plot QC metrics before interpreting results
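Two of the items above, saving software versions and keeping a manifest, can be sketched in a few lines of shell. The file names (versions.txt, manifest.tsv), the function name record_versions, and the sample rows are all hypothetical. The sketch assumes each tool prints its version when called with --version, as the checklist suggests; this does not hold for every tool, so inspect the log rather than trusting it blindly.

```shell
# Append "tool<TAB>first line of --version output" to a log file.
# Tools that are not on PATH are recorded as "unavailable".
record_versions() {
  outfile=$1
  shift
  for tool in "$@"; do
    version="unavailable"
    if command -v "$tool" >/dev/null 2>&1; then
      # Keep only the first line; some tools print multi-line banners
      first_line=$("$tool" --version 2>&1 | head -n 1) || true
      if [ -n "$first_line" ]; then
        version=$first_line
      fi
    fi
    printf '%s\t%s\n' "$tool" "$version" >> "$outfile"
  done
}

# Work in a scratch directory
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical manifest: one row per sample, with its group and sequencing lane.
# Groups are spread across lanes so that lane is not confounded with group.
printf 'sample\tgroup\tlane\n' > manifest.tsv
printf 'S1\tcontrol\tL001\n' >> manifest.tsv
printf 'S2\tcontrol\tL002\n' >> manifest.tsv
printf 'S3\ttreated\tL001\n' >> manifest.tsv
printf 'S4\ttreated\tL002\n' >> manifest.tsv

# Log tool versions next to the manifest
record_versions versions.txt fastqc cutadapt samtools bcftools multiqc
cat versions.txt
```

If a tool is installed but rejects --version, the log will contain the first line of its error message instead of a version, which is still a useful prompt to look up the right flag for that tool.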
Avoid
- Mixing genome builds (e.g., GRCh37 vs GRCh38)
- Comparing groups without biological replicates
- Filtering variants without considering depth/bias
- Using a tool's output without reading its field definitions
- Over-trusting a single metric (always triangulate)
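Two of the pitfalls above, mixed genome builds and missing replicates, can be caught mechanically. The sketch below is one possible approach, not a standard tool: the function name guess_build and all file names are hypothetical, and the toy inputs are invented. The build check relies on the length of human chromosome 1, which is 249,250,621 bp in GRCh37 and 248,956,422 bp in GRCh38, read from a FASTA index (.fai) whose first two columns are sequence name and length. The replicate check assumes each manifest row is one biological replicate.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# --- Check 1: genome build, judged by the length of chromosome 1 ---
# Prints GRCh37, GRCh38, or unknown for a given .fai index file.
guess_build() {
  awk '($1 == "chr1" || $1 == "1") {
         if ($2 == 249250621) build = "GRCh37"
         else if ($2 == 248956422) build = "GRCh38"
       }
       END { print (build == "" ? "unknown" : build) }' "$1"
}

# Toy .fai files (columns: name, length, offset, line bases, line width);
# the offset and line-width values here are placeholders.
printf '1\t249250621\t52\t60\t61\n' > ref_a.fa.fai
printf 'chr1\t248956422\t112\t60\t61\n' > ref_b.fa.fai

build_a=$(guess_build ref_a.fa.fai)
build_b=$(guess_build ref_b.fa.fai)
echo "ref_a: $build_a, ref_b: $build_b"
if [ "$build_a" != "$build_b" ]; then
  echo "WARNING: references appear to come from different builds"
fi

# --- Check 2: biological replicates per group ---
printf 'sample\tgroup\n' > manifest.tsv
printf 'S1\tcontrol\nS2\tcontrol\nS3\ttreated\n' >> manifest.tsv

# Count samples per group (skipping the header) and list groups with fewer than 2
low_rep_groups=$(awk -F '\t' 'NR > 1 { n[$2]++ }
  END { for (g in n) if (n[g] < 2) print g }' manifest.tsv)
echo "groups without replicates: $low_rep_groups"
```

Comparing contig lengths is more reliable than comparing contig names, since the "chr" prefix reflects a naming convention rather than a build. An "unknown" result only means chromosome 1 did not match either length, for example with a non-human reference.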