
CRAM format notes

CRAM files store (aligned) sequencing reads more compactly than BAM files. They offer a further file size reduction for a type of data that is generated in ever-increasing quantities. Where SAM files are human-readable text files designed for short-read storage, BAM files are their binary equivalent, and CRAM files use a restructured, column-oriented binary container format for even more efficient storage.

The key components of the approach are that positions are encoded relatively (i.e., the difference between successive positions is stored rather than the absolute value) and written as a Golomb code. In addition, only differences from the reference genome are stored instead of the full sequence.
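To make the two ideas concrete, here is a minimal Python sketch of delta-encoding positions, Golomb-coding the gaps, and listing only the bases that differ from the reference. It illustrates the principle only and is not the actual CRAM bitstream. The function names, the divisor m = 4, and the restriction to ungapped reads (substitutions only) are assumptions made for this example.

```python
def delta_encode(positions):
    """Turn sorted alignment positions into gaps between successive reads."""
    deltas, prev = [], 0
    for pos in positions:
        deltas.append(pos - prev)
        prev = pos
    return deltas


def golomb_encode(n, m):
    """Golomb code of a non-negative integer n with divisor m, as a bit string."""
    if n < 0 or m < 1:
        raise ValueError("need n >= 0 and m >= 1")
    q, r = divmod(n, m)
    bits = "1" * q + "0"        # quotient in unary, terminated by a 0
    b = (m - 1).bit_length()    # ceil(log2(m))
    cutoff = (1 << b) - m       # remainders below this use one bit less
    if r < cutoff:
        width, value = b - 1, r
    else:
        width, value = b, r + cutoff
    if width > 0:               # m == 1 has no remainder bits at all
        bits += format(value, "b").zfill(width)
    return bits


def diff_to_reference(read, reference, start):
    """List (offset, base) substitutions of an ungapped read placed at 0-based start."""
    window = reference[start:start + len(read)]
    return [(i, base)
            for i, (base, ref_base) in enumerate(zip(read, window))
            if base != ref_base]


if __name__ == "__main__":
    positions = [100, 105, 105, 112, 130]        # hypothetical sorted read starts
    deltas = delta_encode(positions)
    print(deltas)                                # [100, 5, 0, 7, 18]
    print([golomb_encode(d, 4) for d in deltas[1:]])
    print(diff_to_reference("GTTC", "ACGTACGTAC", 2))
```

Small gaps between neighbouring reads turn into short codes, which is why the approach works well on position-sorted data; a read that matches the reference exactly yields an empty difference list.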

The compression rates achieved are shown in the graph below generated by Uppsala University:

[Figure: File size comparison of SAM, BAM, and CRAM]

Comparing speed: Using the C implementation of CRAM (James K. Bonfield), decoding CRAM is 1.5–1.7× slower than decoding BAM, but encoding is 1.8–2.6× faster. (File size savings are reported at 34–55%.)

Additional compression can be achieved by reducing the granularity of the quality values, although this makes the compression lossy. Illumina suggested a binning of Q-scores that does not significantly reduce calling performance.
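A minimal Python sketch of what such binning does to a quality string is shown below. The bin boundaries and representative values are assumptions modelled on a commonly cited eight-level scheme and should be checked against the Illumina white paper. The names BINS, bin_qscore, and bin_quality_string are made up for this example, and the quality string is assumed to use the Phred+33 encoding.

```python
# (lowest Q, highest Q, representative Q) per bin; illustrative values only.
BINS = [
    (2, 9, 6),
    (10, 19, 15),
    (20, 24, 22),
    (25, 29, 27),
    (30, 34, 33),
    (35, 39, 37),
]
TOP_BIN = 40  # every score of 40 or more collapses to this value


def bin_qscore(q):
    """Map one Phred quality score to the representative value of its bin."""
    if q < 2:
        return q  # leave no-call scores untouched
    for low, high, representative in BINS:
        if low <= q <= high:
            return representative
    return TOP_BIN


def bin_quality_string(qual, offset=33):
    """Apply the binning to a FASTQ/SAM quality string (Phred+33 by default)."""
    return "".join(chr(bin_qscore(ord(c) - offset) + offset) for c in qual)


if __name__ == "__main__":
    print(bin_quality_string("II5+#"))
```

Because far fewer distinct symbols remain after binning, a downstream compressor can encode the quality column in fewer bits, at the price of no longer being able to restore the original scores.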

Binning of similar Q-scores (Illumina):

[Figure: Q-score binning]

Compression achieved by Q-score binning (Illumina):

[Figure: Q-score compression]

Sources and further reading:

  1. Format definition and usage
  2. cram-toolkit
  3. Detailed report from Uppsala University
  4. SAMtools with CRAM support
  5. Original article by Markus Hsi-Yang Fritz, Rasko Leinonen, Guy Cochrane and Ewan Birney
  6. Article about the implementation in C
  7. Illumina white paper on Q-score compression