RNA-Seq data quality scores

December 13, 2017

bioinformatics sequencing

There are different ways to encode the quality scores in FASTQ files from Next-generation sequencing machines. It is important to find out before using the data and to convert between formats if necessary.

Sanger format can encode a Phred quality score from 0 to 93 using ASCII characters 33 to 126 (although in raw read data the Phred quality score rarely exceeds 60, higher scores are possible in assemblies or read maps).
Illumina 1.3+ format can encode a Phred quality score from 0 to 62 using ASCII characters 64 to 126 (although in raw read data Phred scores from 0 to 40 only are expected).
Solexa/Illumina 1.0 format can encode a Solexa/Illumina quality score from -5 to 62 using ASCII characters 59 to 126 (although in raw read data Solexa scores from -5 to 40 only are expected)

  SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
  ..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX......................
  ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
  .................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ......................
  LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL....................................................
  !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
  |                         |    |        |                              |                     |
 33                        59   64       73                            104                   126


 S - Sanger        Phred+33,  raw reads typically (0, 40)
 X - Solexa        Solexa+64, raw reads typically (-5, 40)
 I - Illumina 1.3+ Phred+64,  raw reads typically (0, 40)
 J - Illumina 1.5+ Phred+64,  raw reads typically (3, 40)
    with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator
 L - Illumina 1.8+ Phred+33,  raw reads typically (0, 41)

Source: wikipedia

For a simple look-up from ASCII to numeric scores you can use the following list:

ASCII	numeric		ASCII	numeric
!	0		@	31
"	1		A	32
#	2		B	33
$	3		C	34
%	4		D	35
&	5		E	36
'	6		F	37
(	7		G	38
)	8		H	39
*	9		I	40
+	10		J	41
,	11		K	42
-	12		L	43
.	13		M	44
/	14		N	45
0	15		O	46
1	16		P	47
2	17		Q	48
3	18		R	49
4	19		S	50
5	20		T	51
6	21		U	52
7	22		V	53
8	23		W	54
9	24		X	55
:	25		Y	56
;	26		Z	57
<	27		[	58
=	28		\	59
>	29		]	60
?	30		^	61

You can convert the Solexa read quality to Sanger read quality with Maq:

maq sol2sanger s_1_sequence.txt s_1_sequence.fastq

where s_1_sequence.txt is the Solexa read sequence file. Missing this step will lead to unreliable SNP calling when aligning reads with Maq.

Source: maq-manual

Phred itself is a base calling program for DNA sequence traces developed during the initial automation phase of the sequencing of the human genome.
After calling bases, Phred examines the peaks around each base call to assign a quality score to each base call. Quality scores range from 4 to about 60, with higher values corresponding to higher quality. The quality scores are logarithmically linked to error probabilities, as shown in the following table:

Phred quality	Probability of		Accuracy of
score		wrong base call		base call
10 		1 in 10 		90%
20 		1 in 100 		99%
30 		1 in 1,000 		99.9%
40 		1 in 10,000 		99.99%
50 		1 in 100,000 		99.999%

“High quality bases” are usually scores of 20 and above (“Phred20 score”).

You can read the original publications about the Phred program and scoring by Brent Ewing et al. from Phil Green’s lab here and here.
Source: www.phrap.com

Felix Kokocinski

Related Articles

Unlocking Business Potential: Exploring the Value of External Bioinformatics Consultants

SAM format summary

Decoding REST: Simplifying Interconnected IT Systems