Getting Started

Bioinformatics is applied data engineering for biology. You will spend time on (1) file formats, (2) command-line tooling, and (3) checking assumptions with plots and summary statistics.

What problems does bioinformatics solve?

Common questions include:

DNA: Which variants are present (SNPs/indels/CNVs)?
RNA: Which genes are expressed and which change across conditions?
Microbiome: Which taxa are present and how do communities differ?
Evolution: How are sequences related (trees), and where are the signals of selection?

Core skills

Skill	Why it matters
Shell	Most tools are CLI-first; reproducible pipelines rely on scripts.
Statistics	Calling results “significant” without sound experimental design and controls is a fast path to errors.
Data formats	FASTQ/BAM/VCF/GTF conventions determine correctness.
Visualization	Plots catch issues that summary numbers hide.
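
As a small illustration of why format conventions determine correctness: a FASTQ record is four lines (a header starting with @, the sequence, a separator starting with +, and a quality string as long as the sequence). The sketch below checks those rules with awk. The function name check_fastq and the demo file are made up for this example, and it assumes an uncompressed FASTQ whose sequences are not wrapped across lines.

```shell
# check_fastq FILE: print "OK <n> reads" or the first problem found.
# Assumes uncompressed FASTQ with exactly four lines per record.
check_fastq() {
  awk '
    NR % 4 == 1 && substr($0, 1, 1) != "@" { print "line " NR ": header must start with @"; bad = 1; exit }
    NR % 4 == 2 { seqlen = length($0) }
    NR % 4 == 3 && substr($0, 1, 1) != "+" { print "line " NR ": separator must start with +"; bad = 1; exit }
    NR % 4 == 0 && length($0) != seqlen { print "line " NR ": quality length differs from sequence length"; bad = 1; exit }
    END {
      if (bad) exit 1
      if (NR % 4 != 0) { print "truncated: " NR " lines is not a multiple of 4"; exit 1 }
      print "OK " (NR / 4) " reads"
    }
  ' "$1"
}

# Demo: the second record has 4 bases but only 3 quality characters.
demo=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCA\n+\nIII\n' > "$demo"
check_fastq "$demo" || echo "FASTQ check failed"
```

A check like this only catches structural damage (truncated downloads, mangled line endings); it says nothing about read quality, which is what FastQC is for.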

Prerequisites & Environment

We assume familiarity with the command line (Unix/Linux/macOS). Bioinformatics relies heavily on text streams and pipes.

Terminal Basics: cd, ls, grep, pipe |, redirection >.
Package Manager: Conda / Mamba (recommended) or Homebrew.
Hardware: These tutorials run on a laptop; real datasets often require a server or HPC cluster.
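
If the terminal basics above are rusty, the following sketch exercises grep, a pipe, and redirection on a tiny FASTA file. The file names (demo.fa, ids.txt) and the "sample=" tag in the headers are invented for the example; only standard Unix tools are used.

```shell
# Work in a scratch directory; demo.fa and ids.txt are placeholder names.
workdir=$(mktemp -d)
printf '>seq1 sample=A\nACGTACGT\n>seq2 sample=B\nGGCC\n>seq3 sample=A\nTTAA\n' > "$workdir/demo.fa"

# grep -c counts matching lines; every FASTA record starts with ">".
n_seqs=$(grep -c '^>' "$workdir/demo.fa")
echo "sequences: $n_seqs"

# Pipe (|): keep header lines, strip ">", keep the first word (the ID).
# Redirection (>): write the result to a file instead of the screen.
grep '^>' "$workdir/demo.fa" | sed 's/^>//' | cut -d ' ' -f 1 > "$workdir/ids.txt"

# Another pipe: count how many records carry each sample tag.
grep '^>' "$workdir/demo.fa" | cut -d ' ' -f 2 | sort | uniq -c
```

Each command reads text and writes text, which is why they chain so easily; most bioinformatics CLI tools follow the same pattern.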

Quick Setup

Create an environment with common tools:

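# Check that conda and mamba are installed
conda --version
mamba --version

# Create a clean environment (recommended); these tools are distributed via the bioconda channel
mamba create -n bioinfo -c conda-forge -c bioconda fastqc cutadapt bwa samtools bcftools multiqc
conda activate bioinfo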
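
Once the environment is active, it is worth confirming that the tools are actually on PATH and noting which versions were installed. The sketch below is one way to do that; the function name record_versions and the output file versions.tsv are placeholders, and it assumes each listed tool accepts --version. bwa is left out on that basis: it is assumed here to report its version only in the usage text it prints when run without arguments.

```shell
# record_versions OUTFILE TOOL...: write one "tool<TAB>version" line per tool.
# Tools that are not on PATH are recorded as "not found" instead of failing.
record_versions() {
  out=$1
  shift
  : > "$out"                         # start with an empty file
  for tool in "$@"; do
    if command -v "$tool" > /dev/null 2>&1; then
      # Keep only the first line; some tools print several.
      ver=$("$tool" --version 2>&1 | head -n 1)
      printf '%s\t%s\n' "$tool" "$ver" >> "$out"
    else
      printf '%s\tnot found\n' "$tool" >> "$out"
    fi
  done
}

# versions.tsv is a placeholder name; keep the file next to your results.
record_versions versions.tsv fastqc cutadapt samtools bcftools multiqc
cat versions.tsv
```

A "not found" line usually means the environment was not activated in the current shell.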

Data growth (illustrative)

Public archives grow quickly; automation and QC become essential.

[Figure: archive growth over time. The plot is illustrative, not an official archive metric.]

A reproducible workflow checklist

Pin reference versions (FASTA + annotation)
Record software versions (most tools accept --version)
Keep an analysis manifest (samples, groups, lanes)
Track parameters for every step (trimming, alignment, filtering)
Plot QC metrics before interpreting results
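
A manifest is most useful when it is machine-readable. As a sketch, assume a tab-separated file with a header row and the columns sample, group, and lane (that layout is an assumption for this example, not a standard). The hypothetical helper count_replicates counts distinct samples per group, so a sample sequenced on two lanes is counted once.

```shell
# Hypothetical manifest: tab-separated, header row, columns sample/group/lane.
manifest=$(mktemp)
printf 'sample\tgroup\tlane\n' > "$manifest"
printf 'S1\tcontrol\tL001\nS1\tcontrol\tL002\nS2\tcontrol\tL001\nS3\ttreated\tL001\n' >> "$manifest"

# count_replicates FILE: print "group<TAB>n_samples", counting each sample
# once even when it appears on several lanes.
count_replicates() {
  awk -F '\t' '
    NR == 1 { next }                       # skip the header row
    !seen[$2 SUBSEP $1]++ { n[$2]++ }      # first time this sample appears in this group
    END { for (g in n) print g "\t" n[g] }
  ' "$1" | sort
}

count_replicates "$manifest"
# control	2
# treated	1

# Flag groups with fewer than two samples (nothing to estimate variability from).
count_replicates "$manifest" |
  awk -F '\t' '$2 < 2 { print "WARNING: group " $1 " has " $2 " sample(s)" }'
```

Running a check like this before alignment is cheaper than discovering a mislabeled or missing sample after the analysis is done.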

Avoid

Mixing genome builds (e.g., GRCh37 vs GRCh38)
Comparing groups without biological replicates
Filtering variants without considering depth/bias
Using a tool's output without reading its field definitions
Over-trusting a single metric (always triangulate)
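
Mixed genome builds can often be caught mechanically, because chromosome lengths differ between builds: chromosome 1 is 249,250,621 bp in GRCh37 and 248,956,422 bp in GRCh38. The sketch below reads a FASTA index (.fai, as written by samtools faidx) and reports which build the chr1 length matches. The function name guess_build and the two demo index files are invented for illustration, and the check is only a heuristic for human references.

```shell
# guess_build FAI: print GRCh37, GRCh38, or unknown based on the chr1 length
# in a FASTA index (.fai columns: name, length, offset, line bases, line width).
guess_build() {
  awk -F '\t' '
    $1 == "chr1" || $1 == "1" {
      if ($2 == 249250621) build = "GRCh37"
      else if ($2 == 248956422) build = "GRCh38"
    }
    END { print (build == "" ? "unknown" : build) }
  ' "$1"
}

# Two made-up index files standing in for the reference used for alignment
# and the reference an annotation was built on.
ref_a=$(mktemp); printf 'chr1\t248956422\t6\t60\t61\n' > "$ref_a"
ref_b=$(mktemp); printf '1\t249250621\t3\t60\t61\n' > "$ref_b"

build_a=$(guess_build "$ref_a")
build_b=$(guess_build "$ref_b")
if [ "$build_a" != "$build_b" ]; then
  echo "Build mismatch: $build_a vs $build_b" >&2
fi
```

The same comparison can be made against the contig lengths recorded in BAM and VCF headers; the differing chromosome names in the demo ("chr1" vs "1") are a second common sign that two files come from different reference distributions.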