CRAM format notes

July 8, 2019

CRAM files are compressed versions of BAM files containing (aligned) sequencing reads. They reduce file size further for a type of data that is generated in ever-increasing quantities. Where SAM files are human-readable text files optimized for short-read storage, BAM files are their binary equivalent, and CRAM files are a restructured, column-oriented binary container format for even more efficient storage.

The key components of the approach are that positions are encoded in a relative way (i.e., the difference between successive positions is stored rather than the absolute value) and stored as a Golomb code. Also, only differences to the reference genome are listed instead of the full sequence.
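To make these two ideas concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the actual CRAM codec: the function names, the Golomb parameter m = 8, and the toy read and reference are all made up for this example, and real CRAM files choose and store their encodings per data series.

```python
def delta_encode(positions):
    """Replace sorted absolute positions by the gap to the previous position."""
    previous = 0
    deltas = []
    for pos in positions:
        deltas.append(pos - previous)
        previous = pos
    return deltas


def golomb_encode(n, m):
    """Golomb code of a non-negative integer n with parameter m, as a bit string."""
    q, r = divmod(n, m)
    bits = "1" * q + "0"            # quotient in unary, terminated by a 0
    b = (m - 1).bit_length()        # ceil(log2(m))
    cutoff = (1 << b) - m           # remainders below this get the shorter code
    if b == 0:
        return bits                 # m == 1: the remainder is always 0
    if r < cutoff:
        if b > 1:
            bits += format(r, "0{}b".format(b - 1))
    else:
        bits += format(r + cutoff, "0{}b".format(b))
    return bits


def golomb_decode(bits, m):
    """Decode a concatenation of Golomb codes back into a list of integers."""
    b = (m - 1).bit_length()
    cutoff = (1 << b) - m
    values, i = [], 0
    while i < len(bits):
        q = 0
        while bits[i] == "1":       # read the unary quotient
            q += 1
            i += 1
        i += 1                      # skip the terminating 0
        r = 0
        if b > 0:
            if b > 1:
                r = int(bits[i:i + b - 1], 2)
                i += b - 1
            if r >= cutoff:         # long code: one more bit follows
                r = ((r << 1) | int(bits[i])) - cutoff
                i += 1
        values.append(q * m + r)
    return values


def diff_to_reference(read, reference, start):
    """List (offset, base) pairs where the read differs from the reference."""
    window = reference[start:start + len(read)]
    return [(i, base) for i, (base, ref) in enumerate(zip(read, window)) if base != ref]


positions = [100, 105, 105, 112, 130]          # hypothetical alignment starts
deltas = delta_encode(positions)               # small gaps instead of large values
encoded = "".join(golomb_encode(d, 8) for d in deltas)
decoded = golomb_decode(encoded, 8)

reference = "ACGTACGTACGT"                      # toy reference
mismatches = diff_to_reference("ACGAAC", reference, 4)
```

Small gaps give short Golomb codes, which is why sorting reads by position before encoding pays off. Likewise, a read that matches the reference almost everywhere reduces to a start position plus a short list of mismatches.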

The compression rates achieved are shown in the graph below generated by Uppsala University:

Comparing speed: the C implementation of CRAM (James K. Bonfield) is 1.5–1.7× slower than BAM at decoding, but 1.8–2.6× faster at encoding. (File size savings are reported at 34–55%.)

Additional compression can be achieved by reducing the granularity of the quality values, although this makes the compression lossy. Illumina suggested a binning of Q-scores that does not significantly reduce variant-calling performance.
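As a rough illustration of what such binning does, the Python sketch below maps each Phred quality value to one representative value per bin. The bin boundaries are an assumption modelled on the 8-level scheme commonly attributed to Illumina and should be checked against the white paper before use. The function names and the Phred+33 offset are likewise choices made for this example.

```python
# Illustrative bins: (lowest Q, highest Q, representative Q).
# Assumed values, not taken from the article; verify before relying on them.
Q_BINS = [
    (2, 9, 6),
    (10, 19, 15),
    (20, 24, 22),
    (25, 29, 27),
    (30, 34, 33),
    (35, 39, 37),
    (40, 93, 40),
]


def bin_quality(q):
    """Map a single Phred score to the representative value of its bin."""
    for low, high, representative in Q_BINS:
        if low <= q <= high:
            return representative
    return q  # scores 0 and 1 are left unchanged


def bin_quality_string(qual, offset=33):
    """Apply the binning to a FASTQ/SAM quality string (Phred+33 by default)."""
    return "".join(chr(bin_quality(ord(c) - offset) + offset) for c in qual)


original = "IHGFEDCBA@?>=<;:9876543210/.-,+*)('&%$#"   # Q40 down to Q2
binned = bin_quality_string(original)
distinct_before = len(set(original))
distinct_after = len(set(binned))
```

Fewer distinct symbols and longer runs of identical values are what make the binned quality stream compress better. The original scores cannot be recovered afterwards.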

Binning of similar Q-scores (Illumina):

Compression achieved by Q-score binning (Illumina):

Sources and further reading:

Format definition and usage
cram-toolkit
Detailed report at the Uppsala University
SAMtools with CRAM support
Original article from Markus Hsi-Yang Fritz, Rasko Leinonen, Guy Cochrane and Ewan Birney
Article about the implementation in C
Illumina white paper on Q-score compression

Felix Kokocinski