RNA-Seq data quality scores

Posted Posted in bioinformatics, sequencing

There are different ways to encode the quality scores in FASTQ files from Next-generation sequencing machines. It is important to find out before using the data and to convert between formats if necessary.

  • Sanger format can encode a Phred quality score from 0 to 93 using ASCII characters 33 to 126 (although in raw read data the Phred quality score rarely exceeds 60, higher scores are possible in assemblies or read maps).
  • Illumina 1.3+ format can encode a Phred quality score from 0 to 62 using ASCII characters 64 to 126 (although in raw read data Phred scores from 0 to 40 only are expected).
  • Solexa/Illumina 1.0 format can encode a Solexa/Illumina quality score from -5 to 62 using ASCII characters 59 to 126 (although in raw read data Solexa scores from -5 to 40 only are expected)
  |                         |    |        |                              |                     |
 33                        59   64       73                            104                   126

 S - Sanger        Phred+33,  raw reads typically (0, 40)
 X - Solexa        Solexa+64, raw reads typically (-5, 40)
 I - Illumina 1.3+ Phred+64,  raw reads typically (0, 40)
 J - Illumina 1.5+ Phred+64,  raw reads typically (3, 40)
    with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator
 L - Illumina 1.8+ Phred+33,  raw reads typically (0, 41)

Source: wikipedia

For a simple look-up from ASCII to numeric scores you can use the following list:

ASCII	numeric		ASCII	numeric
!	0		@	31
"	1		A	32
#	2		B	33
$	3		C	34
%	4		D	35
&	5		E	36
'	6		F	37
(	7		G	38
)	8		H	39
*	9		I	40
+	10		J	41
,	11		K	42
-	12		L	43
.	13		M	44
/	14		N	45
0	15		O	46
1	16		P	47
2	17		Q	48
3	18		R	49
4	19		S	50
5	20		T	51
6	21		U	52
7	22		V	53
8	23		W	54
9	24		X	55
:	25		Y	56
;	26		Z	57
<	27		[	58
=	28		\	59
>	29		]	60
?	30		^	61

You can convert the Solexa read quality to Sanger read quality with Maq:

maq sol2sanger s_1_sequence.txt s_1_sequence.fastq

where s_1_sequence.txt is the Solexa read sequence file. Missing this step will lead to unreliable SNP calling when aligning reads with Maq.

Source: maq-manual

Phred itself is a base calling program for DNA sequence traces developed during the initial automation phase of the sequencing of the human genome.
After calling bases, Phred examines the peaks around each base call to assign a quality score to each base call. Quality scores range from 4 to about 60, with higher values corresponding to higher quality. The quality scores are logarithmically linked to error probabilities, as shown in the following table:

Phred quality	Probability of		Accuracy of
score		wrong base call		base call
10 		1 in 10 		90%
20 		1 in 100 		99%
30 		1 in 1,000 		99.9%
40 		1 in 10,000 		99.99%
50 		1 in 100,000 		99.999%

“High quality bases” are usually scores of 20 and above (“Phred20 score”).

You can read the original publications about the Phred program and scoring by Brent Ewing et al. from Phil Green’s lab here and here.
Source: www.phrap.com

The CCDS project

Posted Posted in bioinformatics, genome informatics

The Consensus CoDing Sequence (CCDS) project is a collaborative effort to identify a core set of human and mouse protein coding regions that are consistently annotated and of high quality. The long term goal is to support convergence towards a standard set of gene annotations.
The IDs are helpful when referring to specific genes and annotation versions in publications as they can be tracked and found even after the underlying genome has changed.

Participating institutes:

  • European Bioinformatics Institute (EBI)
  • National Center for Biotechnology Information (NCBI)
  • Wellcome Trust Sanger Institute (WTSI)
  • University of California, Santa Cruz (UCSC)
  • HUGO Gene Nomenclature Committee (HGNC)
  • Mouse Genome Informatics (MGI)

Project page at NCBI

CCDS Identifiers and Tracking
Annotated genes are given a unique identifier number and version number (e.g. CCDS1.1, CCDS234.1). The version number will update if the CDS structure changes, or if the underlying genome sequence changes at that location. With annotation and sequence based genome browser update cycles, the CCDS set will be mapped forward, maintaining identifiers. All changes to existing CCDS genes are done by collaboration agreement; no single group will change the set unilaterally.

Genomic Start Coordinates

Posted Posted in bioinformatics, genome informatics

Adding to the confusion about different notations of phases/frames, the start coordinates of genomic features are also noted differently between different genome browsers and file formats.

  1. One-based
    Counting bases starting with “1” at the first position.
    Regions are specified by a “closed interval.” Used e.g. by the Ensembl genome browser and annotation system, the GFF/GTF, SAM and wiggle file formats.
  2. Zero-based
    The interbase system counts spaces starting with “0” at the first position.
    Regions are specified by a “half-closed-half-open interval”. Used by the UCSC genome browser, Chado (the fruitfly browser), the BED, BAM and PSL file formats.

An example:

     1 2 3 4 5 6
     | | | | | |
     C G A T G C
    | | | | | | |
    0 1 2 3 4 5 6

The ATG interval would be described from 3-5 in the first, from 2-5 in the second system.

Coding Phases / Frames

Posted Posted in bioinformatics, genome informatics

The phase (or sometimes called frame) gives information on how to translate individual parts of a gene, the coding exons. Phases 1 & 2 have a different definition in GFF and EnsEMBL format!
In EnsEMBL, the phase is defined for exon objects like this:
The Ensembl phase convention can be thought of as “the number of bases of the first codon which are on the previous exon”. It is therefore 0, 1 or 2 (-1 means the exon is non-coding).
In ascii art, with alternate codons represented by ### and +++:

       Previous Exon   Intron   This Exon

    ...-------------            -------------...

    5'                  Phase                3'

    ...#+++###+++###     0      +++###+++###+...

    ...+++###+++###+     1      ++###+++###++...

    ...++###+++###++     2      +###+++###+++...

In the GFF format, the 8th column gives phase information for CDS features. The definition of phases is here:

For features of type “CDS”, the phase indicates where the feature (i.e. exon) begins with reference to the reading frame. The phase is one of the integers 0, 1, or 2, indicating the number of bases that should be removed from the beginning of this feature to reach the first base of the next codon. In other words, a phase of “0” indicates that the next codon begins at the first base of the region described by the current line, a phase of “1” indicates that the next codon begins at the second base of this region, and a phase of “2” indicates that the codon begins at the third base of this region.
For forward strand features, phase is counted from the start field. For reverse strand features, phase is counted from the end field.


In effect, you can usually translate the phase from Ensembl to GFF-style like this:

  • 0 to 0
  • 1 to 2, the initial first base is added to last exon’s codon
  • 2 to 1, the initial first two bases are added to last exon’s codon

The DAS protocol defines the phase as the GFF format:
The tag indicates the position of the feature relative to open reading frame, if any. It may be one of the integers 0, 1 or 2, corresponding to each of the three reading frames, or “-” if the feature is unrelated to a reading frame.

[Some more infos on different formats]

NoSQL Databases

Posted Posted in bioinformatics, database, software

Originally developed for large-scale web applications, nosql databases like to see themselves as next generation dbs and are using “not only sql“. Moving away from the strict model and query approach of traditional database management system (dbs), some nosql dbs offer more flexibility and easier scalability. They can be described as being non-relational, distributed, open-source, horizontally scalable, schema-free, easy replication support, simple API, eventually consistent.

There are many different types of nosql dbs being developed, here are some examples:

  1. Document store dbs
    Systems like Apache’s Couch DB store unstructured text and create views on them. The data is stored in JSON and binary formats in a distributed manner, queries often use reduce operations.
  2. Graph dbs
    Neo4j and FlockDB are example systems that store data as nodes with relationships between them.
  3. Key-Value / tuple store dbs
    An even simpler approach is taken by in-memory or on-disk key-value dbs like Redis where simple text or string sets are stored and queried e.g. via a web service.

To try your hands on an example system, you could e.g. use mongodb. This is a type of document-store dbs with the data stored in JSON-like files. The installation on an Ubuntu 16.4 system would follow the instructions here:

sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 0C49F3730359A14518585931BC711F9BA15703C6
echo "deb [ arch=amd64,arm64 ] http://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/3.4 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-3.4.list
sudo apt-get update
sudo apt-get install -y mongodb-org
sudo service mongod start

By default the service will start on port 27017. To try out some data operations:

> use no_sql_db_test
> db.person.insert({"name":"Andrea","age":19})
> db.person.insert({"name":"Paul","age":23,"pet":"dog"})
> db.person.find()
  { "_id" : ObjectId("5a12dcdf81636f8f3070ce13"), "name" : "Andrea", "age" : 19 }
  { "_id" : ObjectId("5a12dd0181636f8f3070ce14"), "name" : "Paul", "age" : 23, "pet" : "dog" }
> db.person.find({"age":{$lt:20}})
  { "_id" : ObjectId("5a12dcdf81636f8f3070ce13"), "name" : "Andrea", "age" : 19 }

An overview and intro tutorial can be found here. To stop the db service run:

sudo service mongod stop

Sources & further reading:
c’t (German), Wikipedia, list of systems