I wrote the following in my old blog back in 2012, but the SAM / BAM format is still the de facto standard for working with (aligned) sequence data…
The Sequence Alignment/Map (SAM) format is a generic format for storing read alignments against reference sequences. It is a text format that stores sequence data in a series of tab-delimited ASCII columns and is commonly used in next-generation sequencing data processing. It is the human-readable (non-binary) counterpart of the BAM format and contains information about the read, its aligned position in the genome, and its quality. It was developed by Heng Li (in Richard Durbin’s group at the Wellcome Trust Sanger Institute) and others, and is described in their paper, "The Sequence Alignment/Map format and SAMtools" (Li et al., Bioinformatics, 2009).
After an optional header section, the alignment section lists the aligned reads, one alignment per line. The format is best explained with an example line:
1:497:R:-272+13M17D24M 113 1 497 37 37M 15 100338662 0 CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG 0;==-==9;>>>>>=>>>>>>>>>>>=>>>>>>>>>> XT:A:U NM:i:0 SM:i:37 AM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:37
Fieldname | Description | Example-data |
---|---|---|
QNAME | read name | 1:497:R:-272+13M17D24M |
FLAG | alignment flag | 113 |
RNAME | alignment chromosome | 1 |
POS | alignment start position | 497 |
MAPQ | overall mapping quality | 37 |
CIGAR | alignment CIGAR string | 37M |
MRNM/RNEXT | reference name of the next alignment in the group (mate) | 15 |
MPOS/PNEXT | position of the next alignment in the group (mate) | 100338662 |
ISIZE/TLEN | observed Template LENgth | 0 |
SEQ | sequence | CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG |
QUAL | quality per base | 0;==-==9;>>>>>=>>>>>>>>>>>=>>>>>>>>>> |
TAGs | further tags with alignment info | XT:A:U NM:i:0 SM:i:37 AM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:37 |
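To make the column layout concrete, here is a minimal Python sketch that splits one alignment line into the fields from the table. The names `SamRecord` and `parse_sam_line` are made up for this illustration and are not part of any library. It assumes the line is tab-delimited, as in a real SAM file (the example line above is shown with spaces), so the example is rebuilt with explicit tabs. For real work a dedicated library is usually the better choice.

```python
from typing import List, NamedTuple


class SamRecord(NamedTuple):
    """The 11 mandatory SAM columns plus the optional tags (kept as raw strings)."""
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: List[str]


def parse_sam_line(line: str) -> SamRecord:
    """Split one tab-delimited alignment line into named fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError(f"expected at least 11 columns, got {len(fields)}")
    return SamRecord(
        qname=fields[0],
        flag=int(fields[1]),
        rname=fields[2],
        pos=int(fields[3]),
        mapq=int(fields[4]),
        cigar=fields[5],
        rnext=fields[6],
        pnext=int(fields[7]),
        tlen=int(fields[8]),
        seq=fields[9],
        qual=fields[10],
        tags=fields[11:],  # everything after column 11 is an optional tag
    )


# The example line from above, rebuilt with explicit tabs.
example_line = "\t".join([
    "1:497:R:-272+13M17D24M", "113", "1", "497", "37", "37M",
    "15", "100338662", "0",
    "CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG",
    "0;==-==9;>>>>>=>>>>>>>>>>>=>>>>>>>>>>",
    "XT:A:U", "NM:i:0", "SM:i:37", "AM:i:0", "X0:i:1",
    "X1:i:0", "XM:i:0", "XO:i:0", "XG:i:0", "MD:Z:37",
])

record = parse_sam_line(example_line)
print(record.rname, record.pos, record.cigar)  # prints: 1 497 37M
```

Header lines (those starting with `@`) are not handled here and would need to be skipped before calling the parser.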
The read name QNAME (at least for reads from Illumina machines) is constructed as:
[instrument-name]:[run ID]:[flowcell ID]:[lane-number]:[tile-number]:[x-pos]:[y-pos] [read number]:[is filtered]:[control number]:[barcode sequence]
for example: @M01117:25:000000000-A37B9:1:1101:14984:1386 1:N:0:4
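The naming scheme above can be taken apart with a few string splits. The following is only a sketch: the function name `parse_illumina_read_name` and the dictionary keys are invented for this example, and it assumes the name follows exactly the layout shown (seven colon-separated parts, optionally followed by a space and four more).

```python
def parse_illumina_read_name(name: str) -> dict:
    """Split an Illumina-style read name into its components."""
    name = name.strip()
    if name.startswith("@"):
        name = name[1:]  # drop the leading '@' of a FASTQ header line

    # The part before the first space locates the cluster on the flowcell;
    # the part after it (if present) describes the read itself.
    location, _, comment = name.partition(" ")
    parts = location.split(":")
    if len(parts) != 7:
        raise ValueError(f"expected 7 colon-separated parts, got {len(parts)}")
    instrument, run_id, flowcell, lane, tile, x_pos, y_pos = parts

    parsed = {
        "instrument": instrument,
        "run_id": run_id,
        "flowcell": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x_pos": int(x_pos),
        "y_pos": int(y_pos),
    }
    if comment:
        read_number, is_filtered, control_number, barcode = comment.split(":", 3)
        parsed.update(
            read_number=int(read_number),
            is_filtered=(is_filtered == "Y"),  # 'Y' means the read was filtered out
            control_number=int(control_number),
            barcode=barcode,
        )
    return parsed


info = parse_illumina_read_name(
    "@M01117:25:000000000-A37B9:1:1101:14984:1386 1:N:0:4"
)
print(info["instrument"], info["lane"], info["tile"])  # prints: M01117 1 1101
```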
The FLAG is a bitwise combination of properties of the alignment. To decode FLAG values such as the one in the example above, and to work out which flags to filter reads on, there is a great page at the Broad Institute.
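The FLAG can also be decoded by hand, since each property is one bit of the number. Below is a small sketch; the bit values are those defined in the SAM specification, while the short labels and the names `SAM_FLAGS` and `decode_flag` are my own shorthand for this example.

```python
# Bit values as defined in the SAM specification; the labels are informal.
SAM_FLAGS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "read fails quality checks"),
    (0x400, "PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def decode_flag(flag: int) -> list:
    """Return the label of every bit that is set in the FLAG value."""
    return [label for bit, label in SAM_FLAGS if flag & bit]


# 113 = 64 + 32 + 16 + 1
print(decode_flag(113))
# ['read paired', 'read reverse strand', 'mate reverse strand', 'first in pair']

# Filtering on a single property is a bitwise AND with the relevant bit.
is_unmapped = bool(113 & 0x4)  # False for the example read
```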
The TAGs are optional and may vary between alignment programs. The examples shown are from BWA. The tags that matter most for filtering are usually X0:i (the number of best hits of this read in the genome) and XM:i (the number of mismatches in the alignment).
TAG | Meaning |
---|---|
NM | Edit distance |
MD | Mismatching positions/bases |
AS | Alignment score |
BC | Barcode sequence |
X0 | Number of best hits |
X1 | Number of suboptimal hits found by BWA |
XN | Number of ambiguous bases in the reference |
XM | Number of mismatches in the alignment |
XO | Number of gap opens |
XG | Number of gap extensions |
XT | Type: Unique/Repeat/N/Mate-sw |
XA | Alternative hits; format: (chr,pos,CIGAR,NM;)* |
XS | Suboptimal alignment score |
XF | Support from forward/reverse alignment |
XE | Number of supporting seeds |
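Filtering on X0 and XM, as mentioned above, could look roughly like this. It is a sketch under a few assumptions: tags come as `TAG:TYPE:VALUE` strings, only the `i` (integer) and `f` (float) types are converted and everything else stays a string, and the threshold of two mismatches is an arbitrary value picked for illustration. The function names `parse_tags` and `is_unique_low_mismatch` are hypothetical.

```python
def parse_tags(raw_tags):
    """Turn ['NM:i:0', 'MD:Z:37', ...] into {'NM': 0, 'MD': '37', ...}."""
    parsed = {}
    for raw in raw_tags:
        # Split at most twice so that values containing ':' stay intact.
        tag, value_type, value = raw.split(":", 2)
        if value_type == "i":
            parsed[tag] = int(value)
        elif value_type == "f":
            parsed[tag] = float(value)
        else:
            parsed[tag] = value
    return parsed


def is_unique_low_mismatch(tags, max_mismatches=2):
    """Keep reads with exactly one best hit (X0) and few mismatches (XM).

    Reads lacking either tag are rejected, since these are BWA-specific
    tags and other aligners may not write them.
    """
    if "X0" not in tags or "XM" not in tags:
        return False
    return tags["X0"] == 1 and tags["XM"] <= max_mismatches


example_tags = parse_tags([
    "XT:A:U", "NM:i:0", "SM:i:37", "AM:i:0", "X0:i:1",
    "X1:i:0", "XM:i:0", "XO:i:0", "XG:i:0", "MD:Z:37",
])
print(is_unique_low_mismatch(example_tags))  # prints: True
```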
Sources:
genome.sph.umich.ed with further useful details, full specs.