SAM format summary

I wrote the following in my old blog back in 2012, but the SAM / BAM format is still the de facto standard for working with (aligned) sequence data…

The Sequence Alignment/Map (SAM) format is a generic format for storing read alignments against reference sequences. It is a text format that stores the data in a series of tab-delimited ASCII columns and is commonly used in next-generation sequencing data processing. It is the human-readable (non-binary) counterpart of the BAM format and contains information about the read, its aligned position in the genome and the quality of the alignment. It was developed by Heng Li (in Richard Durbin’s group at the Wellcome Trust Sanger Institute) and others, and is described in their paper (Li et al., 2009, Bioinformatics).

After an optional header section (lines starting with @), the alignment section lists one line per alignment. In the file the fields of each line are separated by tabs. The format is best explained with an example line:

1:497:R:-272+13M17D24M 113 1 497 37 37M 15 100338662 0 CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG 0;==-==9;>>>>>=>>>>>>>>>>>=>>>>>>>>>> XT:A:U NM:i:0 SM:i:37 AM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:37

Fieldname	Description	Example-data

QNAME	read name	1:497:R:-272+13M17D24M
FLAG	alignment flag	113
RNAME	alignment chromosome	1
POS	alignment start position (1-based, leftmost)	497
MAPQ	overall mapping quality	37
CIGAR	alignment CIGAR string	37M
MRNM/RNEXT	reference name of the next alignment in the group (mate)	15
MPOS/PNEXT	position of the next alignment in the group (mate)	100338662
ISIZE/TLEN	observed template length (insert size)	0
SEQ	sequence	CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG
QUAL	quality per base	0;==-==9;>>>>>=>>>>>>>>>>>=>>>>>>>>>>
TAGs	further tags with alignment info	XT:A:U NM:i:0 SM:i:37 AM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:37
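As a minimal sketch of how the table above maps onto code, the following Python splits one alignment line on tabs and converts the numeric columns. The names `SamRecord` and `parse_sam_line` are made up for this illustration and are not part of any library; the example line is the one shown above, rebuilt with tab separators.

```python
from typing import List, NamedTuple


class SamRecord(NamedTuple):
    """The 11 mandatory SAM columns plus the optional tags (hypothetical helper type)."""
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: List[str]


def parse_sam_line(line: str) -> SamRecord:
    """Split one alignment line (not a header line) into its columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 tab-separated fields")
    return SamRecord(
        qname=fields[0], flag=int(fields[1]), rname=fields[2],
        pos=int(fields[3]), mapq=int(fields[4]), cigar=fields[5],
        rnext=fields[6], pnext=int(fields[7]), tlen=int(fields[8]),
        seq=fields[9], qual=fields[10], tags=fields[11:],
    )


# The example line from above, joined with tabs as it would appear in a file.
EXAMPLE = "\t".join([
    "1:497:R:-272+13M17D24M", "113", "1", "497", "37", "37M",
    "15", "100338662", "0",
    "CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG",
    "0;==-==9;>>>>>=>>>>>>>>>>>=>>>>>>>>>>",
    "XT:A:U", "NM:i:0", "SM:i:37", "AM:i:0", "X0:i:1",
    "X1:i:0", "XM:i:0", "XO:i:0", "XG:i:0", "MD:Z:37",
])

record = parse_sam_line(EXAMPLE)
print(record.rname, record.pos, record.mapq, record.cigar)
```

For real work a dedicated library is usually the better choice; the sketch is only meant to make the column layout concrete.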

The read name QNAME (at least for reads from Illumina machines) is constructed as:

[instrument-name]:[run ID]:[flowcell ID]:[lane-number]:[tile-number]:[x-pos]:[y-pos] [read number]:[is filtered]:[control number]:[barcode sequence]

for example: @M01117:25:000000000-A37B9:1:1101:14984:1386 1:N:0:4
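The naming scheme can be taken apart with a few string operations. This is a rough sketch that assumes the name follows exactly the layout given above (seven colon-separated fields, a space, then four more); the function name `parse_illumina_name` and the dictionary keys are invented for illustration.

```python
def parse_illumina_name(header: str) -> dict:
    """Split an Illumina-style read name into its parts (illustrative helper)."""
    # Drop the leading "@" of a FASTQ header and separate the part after the space.
    name, _, comment = header.lstrip("@").partition(" ")
    instrument, run_id, flowcell, lane, tile, x_pos, y_pos = name.split(":")
    parsed = {
        "instrument": instrument,
        "run_id": int(run_id),
        "flowcell": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x_pos": int(x_pos),
        "y_pos": int(y_pos),
    }
    if comment:
        read_number, is_filtered, control_number, barcode = comment.split(":")
        parsed.update({
            "read_number": int(read_number),
            "is_filtered": is_filtered == "Y",  # "Y" marks a read that failed the filter
            "control_number": int(control_number),
            "barcode": barcode,  # kept as text; may be a sequence or a sample number
        })
    return parsed


info = parse_illumina_name("@M01117:25:000000000-A37B9:1:1101:14984:1386 1:N:0:4")
print(info["instrument"], info["flowcell"], info["lane"], info["tile"])
```

Names produced by simulators or older pipelines will not match this layout, so a production version should validate the field count before unpacking.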

To decode the meaning of the FLAG in the above example, and to filter reads using these flags, there is a great page at the Broad Institute.
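The FLAG is a bit field, so it can also be decoded with a few lines of code. The sketch below uses the bit values defined in the SAM specification; the short descriptions and the function name `decode_flag` are my own wording, not an official API.

```python
# Bit values from the SAM specification; descriptions are paraphrased.
FLAG_BITS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "not primary alignment",
    0x200: "read fails platform/vendor quality checks",
    0x400: "read is PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag: int) -> list:
    """Return the description of every bit that is set in the FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


print(decode_flag(113))
# ['read paired', 'read reverse strand', 'mate reverse strand', 'first in pair']
```

The example value 113 is 1 + 16 + 32 + 64, which is why those four properties are reported. Filtering works the same way: `flag & 0x4` is non-zero for unmapped reads, for instance.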

The TAGs are optional and may vary between alignment programs. The examples shown are from BWA. The tags most often used for filtering are X0:i (number of best hits of this read in the genome) and XM:i (number of mismatches in the alignment).

TAG	Meaning
NM	Edit distance
MD	Mismatching positions/bases
AS	Alignment score
BC	Barcode sequence
X0	Number of best hits
X1	Number of suboptimal hits found by BWA
XN	Number of ambiguous bases in the reference
XM	Number of mismatches in the alignment
XO	Number of gap opens
XG	Number of gap extensions
XT	Type: Unique/Repeat/N/Mate-sw
XA	Alternative hits; format: (chr,pos,CIGAR,NM;)*
XS	Suboptimal alignment score
XF	Support from forward/reverse alignment
XE	Number of supporting seeds
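Each optional field has the form TAG:TYPE:VALUE, which makes the tags easy to turn into a dictionary and filter on. The following is a minimal sketch that handles only the type codes i (integer), f (float) and text-like types; the helper names and the mismatch threshold of 2 are arbitrary choices for illustration, not a recommendation.

```python
def parse_tags(tag_fields: list) -> dict:
    """Convert fields such as 'X0:i:1' into a {tag: value} dictionary."""
    tags = {}
    for field in tag_fields:
        # Split at most twice so that values containing ':' stay intact.
        name, type_code, value = field.split(":", 2)
        if type_code == "i":
            value = int(value)
        elif type_code == "f":
            value = float(value)
        tags[name] = value
    return tags


def is_unique_hit(tags: dict, max_mismatches: int = 2) -> bool:
    """Keep reads with exactly one best hit (X0) and few mismatches (XM).

    The default threshold is a hypothetical value chosen for this example.
    """
    return tags.get("X0") == 1 and tags.get("XM", 0) <= max_mismatches


example_tags = parse_tags([
    "XT:A:U", "NM:i:0", "SM:i:37", "AM:i:0", "X0:i:1",
    "X1:i:0", "XM:i:0", "XO:i:0", "XG:i:0", "MD:Z:37",
])
print(example_tags["XT"], example_tags["X0"], is_unique_hit(example_tags))
```

Because X0 and XM are specific to BWA, a read aligned with another program has no X0 tag and is rejected by `is_unique_hit`; a filter for mixed input would need a different criterion, such as MAPQ.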