SAM format summary

I wrote the following in my old blog back in 2012, but the SAM / BAM format is still the de facto standard for working with (aligned) sequence data…

The Sequence Alignment/Map (SAM) format is a generic format for storing read alignments against reference sequences. It is a text format that stores the data in a series of tab-delimited ASCII columns and is commonly used in next-generation sequencing data processing. It is the human-readable (non-binary) counterpart of the BAM format and contains information about the read, its aligned position in the genome, and its quality. It was developed by Heng Li (in Richard Durbin’s group at the Wellcome Trust Sanger Institute) and others, who describe it in their paper "The Sequence Alignment/Map format and SAMtools" (Li et al., Bioinformatics, 2009).

After an optional header section (lines starting with @), the alignment section holds one line per aligned read. The format is best explained with an example line:

1:497:R:-272+13M17D24M 113 1 497 37 37M 15 100338662 0 CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG 0;==-==9;>>>>>=>>>>>>>>>>>=>>>>>>>>>> XT:A:U NM:i:0 SM:i:37 AM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:37

Field        Description                                      Example value
QNAME        read name                                        1:497:R:-272+13M17D24M
FLAG         alignment flag                                   113
RNAME        alignment chromosome                             1
POS          alignment start position                         497
MAPQ         overall mapping quality                          37
CIGAR        alignment CIGAR string                           37M
MRNM/RNEXT   name of next alignment in group (mate)           15
MPOS/PNEXT   position of next alignment in group (mate)       100338662
ISIZE/TLEN   observed template length                         0
SEQ          read sequence                                    CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG
QUAL         quality per base                                 0;==-==9;>>>>>=>>>>>>>>>>>=>>>>>>>>>>
TAGs         further tags with alignment info                 XT:A:U NM:i:0 SM:i:37 AM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:37
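
To make the column layout concrete, here is a minimal Python sketch that splits one alignment line into the mandatory columns (including the read sequence SEQ, which sits just before QUAL in the example line) plus the optional tags. The names `SamRecord` and `parse_sam_line` are made up for this illustration and are not part of any library; for real work a dedicated parser such as pysam is the safer choice.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    """One alignment line; field names follow the SAM column names."""
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: list  # optional TAG:TYPE:VALUE strings, left unparsed here


def parse_sam_line(line: str) -> SamRecord:
    """Split a tab-delimited alignment line into mandatory fields and tags."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError(f"expected at least 11 columns, got {len(cols)}")
    return SamRecord(
        qname=cols[0], flag=int(cols[1]), rname=cols[2], pos=int(cols[3]),
        mapq=int(cols[4]), cigar=cols[5], rnext=cols[6], pnext=int(cols[7]),
        tlen=int(cols[8]), seq=cols[9], qual=cols[10], tags=cols[11:],
    )


# The example line from this post. The tabs were lost in the text above,
# so the whitespace-separated columns are re-joined with tabs here.
example_text = (
    "1:497:R:-272+13M17D24M 113 1 497 37 37M 15 100338662 0 "
    "CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG "
    "0;==-==9;>>>>>=>>>>>>>>>>>=>>>>>>>>>> "
    "XT:A:U NM:i:0 SM:i:37 AM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:37"
)
example_line = "\t".join(example_text.split())

record = parse_sam_line(example_line)
print(record.rname, record.pos, record.cigar)
```

Header lines (those starting with @) are not handled by this sketch and would have to be skipped before calling the function.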

The read name QNAME (at least for reads from Illumina machines) is constructed as:

[instrument-name]:[run ID]:[flowcell ID]:[lane-number]:[tile-number]:[x-pos]:[y-pos] [read number]:[is filtered]:[control number]:[barcode sequence]

for example: @M01117:25:000000000-A37B9:1:1101:14984:1386 1:N:0:4
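
The naming scheme above can be pulled apart with a few lines of Python. This is only a sketch: `parse_illumina_header` and the dictionary keys are names invented here, and the code assumes the header follows exactly the layout shown. Note that the leading "@" is the FASTQ header marker, and that the part after the space usually does not end up in the SAM QNAME, because QNAME cannot contain whitespace.

```python
def parse_illumina_header(header: str) -> dict:
    """Split an Illumina-style read header into named parts (all kept as strings)."""
    if header.startswith("@"):
        header = header[1:]  # drop the FASTQ header marker

    # Everything before the first space is the read name proper.
    name, _, comment = header.partition(" ")

    name_keys = ["instrument", "run_id", "flowcell_id", "lane",
                 "tile", "x_pos", "y_pos"]
    name_parts = name.split(":")
    if len(name_parts) != len(name_keys):
        raise ValueError(f"unexpected read name layout: {name!r}")
    fields = dict(zip(name_keys, name_parts))

    # The second block is optional and describes the read itself.
    if comment:
        comment_keys = ["read_number", "is_filtered", "control_number", "barcode"]
        comment_parts = comment.split(":")
        if len(comment_parts) != len(comment_keys):
            raise ValueError(f"unexpected comment layout: {comment!r}")
        fields.update(zip(comment_keys, comment_parts))
    return fields


info = parse_illumina_header("@M01117:25:000000000-A37B9:1:1101:14984:1386 1:N:0:4")
print(info["instrument"], info["flowcell_id"], info["lane"], info["tile"])
# prints: M01117 000000000-A37B9 1 1101
```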

The FLAG is a sum of bit values, each of which describes one property of the read. To decode the FLAG in the above example, or to work out which flags to filter reads on, there is a great page at the Broad Institute.
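
If you would rather decode a FLAG yourself, the bit values are defined in the SAM specification. The sketch below tests each bit in turn; the function name `decode_flag` and the short descriptions are my own wording, not an official API.

```python
# Bit values as defined in the SAM specification; descriptions are paraphrased.
FLAG_BITS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "read fails quality checks"),
    (0x400, "PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def decode_flag(flag: int) -> list:
    """Return the description of every bit that is set in the FLAG."""
    return [description for bit, description in FLAG_BITS if flag & bit]


print(decode_flag(113))
# prints: ['read paired', 'read reverse strand', 'mate reverse strand', 'first in pair']
```

The same bit test works for filtering: for example, `flag & 0x4` is non-zero for unmapped reads, so keeping only lines where it is zero keeps the mapped reads.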

The TAGs are optional and may vary between alignment programs. The examples shown are from BWA. The tags that usually matter most for filtering are X0:i (the number of genome alignments of this read, i.e. best hits) and XM:i (the number of mismatches in the alignment).

Tag   Meaning
NM    Edit distance
MD    Mismatching positions/bases
AS    Alignment score
BC    Barcode sequence
X0    Number of best hits
X1    Number of suboptimal hits found by BWA
XN    Number of ambiguous bases in the reference
XM    Number of mismatches in the alignment
XO    Number of gap opens
XG    Number of gap extensions
XT    Type: Unique/Repeat/N/Mate-sw
XA    Alternative hits; format: (chr,pos,CIGAR,NM;)*
XS    Suboptimal alignment score
XF    Support from forward/reverse alignment
XE    Number of supporting seeds
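
As a sketch of how such tags could be used for filtering, the snippet below turns the TAG:TYPE:VALUE strings into a dictionary and keeps a read only if it has exactly one best hit (X0) and few mismatches (XM). The helper names and the default threshold of two mismatches are assumptions made for this example, not recommended settings.

```python
def parse_tags(tag_fields):
    """Turn TAG:TYPE:VALUE strings into a dict, converting numeric types."""
    tags = {}
    for field in tag_fields:
        # Split at most twice, since the value itself may contain colons.
        name, type_code, value = field.split(":", 2)
        if type_code == "i":
            value = int(value)
        elif type_code == "f":
            value = float(value)
        tags[name] = value
    return tags


def is_unique_hit(tags, max_mismatches=2):
    """True if BWA reported one best hit and at most max_mismatches mismatches.

    max_mismatches is a hypothetical threshold chosen for this illustration.
    """
    return tags.get("X0") == 1 and tags.get("XM", 0) <= max_mismatches


# Optional tags of the example line in this post.
example_tags = "XT:A:U NM:i:0 SM:i:37 AM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:37".split()

tags = parse_tags(example_tags)
print(tags["XT"], tags["X0"], tags["XM"], is_unique_hit(tags))
# prints: U 1 0 True
```

Reads from other aligners may not carry X0 or XM at all; in that case `is_unique_hit` returns False because the X0 tag is missing.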

See genome.sph.umich.ed for further useful details, and the full specs.