Data Formats: FASTA, FASTQ, SAM/BAM/CRAM, VCF
Bioinformatics is file-format driven. If you understand what each file represents, every tool becomes easier to reason about.
FASTA stores sequences without quality scores. Each record is:
>sequence_id optional description
ACGTTGCA...
Common uses: reference genomes, transcriptomes, contigs, protein sequences.
Common gotchas
- Line breaks are irrelevant to sequence meaning.
- Ambiguous bases appear as N (or IUPAC codes).
- Headers are not standardized; pipelines often parse IDs.
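The gotchas above can be illustrated with a minimal FASTA reader. This is a sketch, not a replacement for an established parser: the function name `parse_fasta` and the example records are hypothetical, and it assumes the ID is the first whitespace-delimited token of the header, which is a common convention but not a guarantee.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {sequence_id: sequence}, joining wrapped sequence lines."""
    records: Dict[str, List[str]] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Assumption: the ID is the first whitespace-delimited token;
            # the rest of the header is a free-text description.
            fields = line[1:].split(None, 1)
            current_id = fields[0] if fields else ""
            records[current_id] = []
        elif current_id is not None:
            # Line breaks inside a record carry no meaning, so just collect.
            records[current_id].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in records.items()}


# Hypothetical two-record example; seq1 is wrapped over two lines.
example = """>seq1 example description
ACGT
NNAC
>seq2
GGCC
""".splitlines()

sequences = parse_fasta(example)
n_count = sequences["seq1"].count("N")  # number of ambiguous N bases in seq1
```

Because the sequence lines are concatenated, re-wrapping the file at a different width would produce the same dictionary.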
FASTQ has 4 lines per read:
@read_id
ACGT...
+
FFF... (ASCII-encoded Phred qualities)
Phred score $Q$ relates to error probability $p$ by $Q=-10\log_{10}(p)$.
Example: $Q=30$ means $p=10^{-3}$ (≈0.1% error).
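The relationship between $Q$ and $p$ can be checked with a few lines of Python. This sketch assumes the Phred+33 ASCII offset (the encoding used by Sanger and recent Illumina output); older files may use a different offset, so treat `PHRED_OFFSET` and the helper names as illustrative assumptions.

```python
import math
from typing import List

PHRED_OFFSET = 33  # assumption: Phred+33 encoding


def phred_to_error_prob(q: float) -> float:
    """Invert Q = -10 * log10(p) to get p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Apply Q = -10 * log10(p)."""
    return -10 * math.log10(p)


def decode_qualities(quality_line: str, offset: int = PHRED_OFFSET) -> List[int]:
    """Turn a FASTQ quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality_line]


quals = decode_qualities("FFF#")  # 'F' is ASCII 70, '#' is ASCII 35
probs = [phred_to_error_prob(q) for q in quals]
```

Under the Phred+33 assumption, `F` decodes to Q37 and `#` to Q2, so the last base in this toy string is far less reliable than the first three.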
SAM is text; BAM/CRAM are compressed binary equivalents. The key idea: reads are aligned to a reference with coordinates and a CIGAR string.
| Field | Meaning |
|---|---|
| RNAME, POS | Reference contig and 1-based start position |
| MAPQ | Mapping quality (alignment confidence) |
| CIGAR | Match/insert/delete/clip operations |
| FLAG | Bitwise flags (paired, reverse, secondary, duplicate, …) |
Practical tip: base quality answers “is this base call reliable?” while mapping quality answers “is this alignment placement reliable?”. They are different failure modes.
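To make the FLAG and CIGAR fields concrete, here is a small sketch that decodes a FLAG integer into named bits and uses the CIGAR string to compute where an alignment ends on the reference. The bit values and operation letters follow the SAM specification; the short names in `FLAG_BITS` and the function names are hypothetical choices for this example.

```python
import re
from typing import List, Tuple

# FLAG bit values from the SAM specification; the names are informal labels.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def decode_flag(flag: int) -> List[str]:
    """List the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]


def reference_end(pos: int, cigar: str) -> int:
    """1-based inclusive end coordinate of the alignment on the reference."""
    # Only M, D, N, = and X consume reference bases; I, S, H and P do not.
    ref_len = sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")
    return pos + ref_len - 1


flags = decode_flag(99)
end = reference_end(100, "5S10M2I3D20M")
```

In the example, the soft clip (`5S`) and insertion (`2I`) do not advance the reference coordinate, while the deletion (`3D`) does, so an alignment starting at position 100 spans 33 reference bases.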
VCF is a table of genomic positions with alleles plus annotations and per-sample genotypes.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1
chr1 123 . A G 99 PASS DP=42 GT:DP 0/1:40
If you only remember one thing: always read the INFO and FORMAT definitions in the VCF header; fields differ by caller.
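The example record above can be split into its parts with plain string handling. This is a sketch that assumes tab-separated columns (as in real VCF files) and keeps every INFO and FORMAT value as a string, because the correct types are only known from the header definitions; `parse_vcf_record` is a hypothetical helper, not part of any library.

```python
from typing import Any, Dict, List


def parse_vcf_record(line: str, sample_names: List[str]) -> Dict[str, Any]:
    """Parse one tab-separated VCF data line into a nested dictionary."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt, qual, filt, info = fields[:8]

    info_dict: Dict[str, Any] = {}
    for item in info.split(";"):
        if item in ("", "."):
            continue  # "." means no INFO annotations
        key, sep, value = item.partition("=")
        # Keys without "=" are presence/absence flags.
        info_dict[key] = value if sep else True

    samples: Dict[str, Dict[str, str]] = {}
    if len(fields) > 9:
        format_keys = fields[8].split(":")
        for name, column in zip(sample_names, fields[9:]):
            samples[name] = dict(zip(format_keys, column.split(":")))

    return {
        "chrom": chrom,
        "pos": int(pos),  # VCF positions are 1-based
        "id": var_id,
        "ref": ref,
        "alt": alt.split(","),  # multiple ALT alleles are comma-separated
        "qual": qual,
        "filter": filt,
        "info": info_dict,
        "samples": samples,
    }


header = "\t".join(
    ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT", "SAMPLE1"]
)
data = "\t".join(["chr1", "123", ".", "A", "G", "99", "PASS", "DP=42", "GT:DP", "0/1:40"])

sample_names = header.split("\t")[9:]
record = parse_vcf_record(data, sample_names)
```

Note that `DP` appears twice with different meanings: once in INFO (42, site-level) and once per sample (40), which is one reason the header definitions matter.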
k-mers help detect contamination and estimate genome characteristics (coverage/heterozygosity) in assembly workflows.
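As a rough illustration of what "counting k-mers" means, the sketch below tallies every length-k substring of a sequence and then builds the histogram of those counts (the k-mer spectrum). It is deliberately simplified: it skips k-mers containing non-ACGT characters and does not merge a k-mer with its reverse complement, which dedicated k-mer counters typically do. The function name and toy sequence are hypothetical.

```python
from collections import Counter


def count_kmers(seq: str, k: int) -> Counter:
    """Count all k-mers made only of A, C, G, T in a sequence."""
    seq = seq.upper()
    valid = set("ACGT")
    counts: Counter = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= valid:  # skip k-mers with N or other ambiguity codes
            counts[kmer] += 1
    return counts


counts = count_kmers("ACGTACGTNAC", 3)

# k-mer spectrum: how many distinct k-mers occur once, twice, and so on.
spectrum = Counter(counts.values())
```

On real read sets, the position of the main peak in such a spectrum is what tools use to estimate coverage, and a secondary peak at about half that depth is a typical sign of heterozygosity.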
Coverage dips can indicate repeats, GC bias, or mapping ambiguity.
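One simple way to locate such dips is to flag runs of positions whose depth falls below some fraction of the median. This is a minimal sketch under stated assumptions: the per-position depths are already available as a list (for example, exported from a depth-reporting tool), and the 0.5 cutoff is an arbitrary illustrative threshold, not a recommended value.

```python
from statistics import median
from typing import List, Sequence, Tuple


def find_coverage_dips(
    depths: Sequence[int], fraction: float = 0.5
) -> List[Tuple[int, int]]:
    """Return 0-based half-open (start, end) runs with depth < fraction * median."""
    if not depths:
        return []
    threshold = fraction * median(depths)
    dips: List[Tuple[int, int]] = []
    start = None
    for i, depth in enumerate(depths):
        if depth < threshold:
            if start is None:
                start = i  # a new low-coverage run begins here
        elif start is not None:
            dips.append((start, i))
            start = None
    if start is not None:
        dips.append((start, len(depths)))  # run extends to the end
    return dips


# Hypothetical per-position depths with two low-coverage stretches.
depths = [30, 31, 29, 10, 8, 12, 30, 32, 28, 5]
dips = find_coverage_dips(depths)
```

The flagged intervals only say where coverage is low; deciding whether the cause is a repeat, GC bias, or mapping ambiguity still requires looking at MAPQ and sequence content in that region.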