Sequence Mappability & Alignability

Posted in bioinformatics

Sequence uniqueness within the genome plays an important part when attempting to map short sequences, e.g. next-generation short sequencing reads. It is one of the factors that can introduce a bias in sequencing or its analysis. The other important factor is GC content: GC-rich sequences (e.g. genic/exonic regions) as well as very GC-poor regions are often under-represented (Bentley et al. 2008), mainly because of amplification steps in the protocol. Reads that map to multiple regions are often discarded, so genomic regions with high sequence degeneracy or low sequence complexity show lower mapped read coverage than unique regions, creating a systematic bias.

The CRG Alignability tracks at the UCSC genome browser display how uniquely k-mer sequences align to a region of the genome. As you can see from the tracks, the mappability increases with read length:

CRG mappability tracks for different read lengths at the UCSC browser

For each window (of sizes 36, 40, 50, 75 or 100 nts), a mappability score was computed:
S = 1 / (number of matches found in the genome),
so S = 1 means one match in the genome, S = 0.5 means two matches in the genome, and so on. A further description can be found in the publication by Thomas Derrien, Paolo Ribeca, et al. The data for these tracks can be downloaded. If you are working with other read lengths or genomes, you can run the software to generate the data yourself: get the GEM library (latest version at GitHub), unpack it with tar xjvf GEM-libraries-Linux-x86_64.tbz2, and create an index:

# genome.fa is a placeholder for your reference FASTA file
gem-indexer -i genome.fa -o gem_index

Then run the mappability step, e.g. with a read length of 250:

gem-mappability -I gem_index -l 250 -o mappability_250.gem
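
To make the score S = 1 / (number of matches found in the genome) concrete, here is a minimal Python sketch on a toy sequence. It is not how GEM computes mappability: it counts exact matches on the forward strand only, whereas real tools typically also account for things like the reverse strand and mismatches. The function name mappability_scores and the toy sequence are made up for illustration.

```python
from collections import Counter


def mappability_scores(genome: str, k: int) -> list[float]:
    """Return S = 1 / (number of exact occurrences) for each k-mer start position.

    Simplifying assumptions (not GEM's algorithm): exact matches only,
    forward strand only, and the whole sequence fits in memory.
    """
    genome = genome.upper()
    # All overlapping windows of length k, one per start position.
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    # How often each distinct k-mer occurs in the sequence.
    counts = Counter(kmers)
    return [1.0 / counts[kmer] for kmer in kmers]


if __name__ == "__main__":
    toy = "ACGTACGTTTGCA"  # hypothetical toy "genome"
    for k in (4, 8):
        scores = mappability_scores(toy, k)
        print(f"k={k}:", [round(s, 2) for s in scores])
```

With k = 4 the repeated ACGT windows get a score of 0.5 while all other windows score 1; with k = 8 every window in this toy sequence is unique, which mirrors the observation that mappability increases with read length.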

References:

  • Fast computation and applications of genome mappability. Derrien T, et al. PLoS One. 2012
  • The uniqueome: a mappability resource for short-tag sequencing. Koehler et al. Bioinformatics. 2011; 27(2): 272–274.
  • Blog post at MassGenomics
  • Systematic bias in high-throughput sequencing data and its correction by BEADS. Cheung et al. 2011
  • Accurate Whole Human Genome Sequencing using Reversible Terminator Chemistry. Bentley et al., Nature 2008