There are several ways to encode the quality scores in FASTQ files from next-generation sequencing machines. It is important to find out which encoding your data uses before working with it, and to convert between formats if necessary.
- Sanger format can encode a Phred quality score from 0 to 93 using ASCII characters 33 to 126 (although in raw read data the Phred quality score rarely exceeds 60, higher scores are possible in assemblies or read maps).
- Illumina 1.3+ format can encode a Phred quality score from 0 to 62 using ASCII characters 64 to 126 (although in raw read data only Phred scores from 0 to 40 are expected).
- Solexa/Illumina 1.0 format can encode a Solexa/Illumina quality score from -5 to 62 using ASCII characters 59 to 126 (although in raw read data only Solexa scores from -5 to 40 are expected).
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX......................
...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
.................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ......................
LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL....................................................
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
|                         |    |        |                              |                     |
33                        59   64       73                             104                   126

S - Sanger         Phred+33,  raw reads typically (0, 40)
X - Solexa         Solexa+64, raw reads typically (-5, 40)
I - Illumina 1.3+  Phred+64,  raw reads typically (0, 40)
J - Illumina 1.5+  Phred+64,  raw reads typically (3, 40)
    with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator
L - Illumina 1.8+  Phred+33,  raw reads typically (0, 41)
Source: Wikipedia
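One way to find out which encoding a file uses is to look at the range of ASCII characters in its quality lines. The following is a minimal Python sketch of that idea and is not taken from any of the tools mentioned here. The function name guess_offset is hypothetical, and the cut-off at character 74 ('J', the top of the typical Illumina 1.8+ raw-read range) is an assumption that only holds for raw reads.

```python
def guess_offset(quality_lines):
    """Guess the ASCII offset (33 or 64) from the observed character range.

    Returns 33, 64, or None when the observed characters fit both.
    Heuristic only: it assumes raw reads, where Phred+33 scores rarely
    go above 41 (character 'J', ASCII 74).
    """
    codes = [ord(ch) for line in quality_lines for ch in line.strip()]
    if not codes:
        raise ValueError("no quality characters given")
    lowest, highest = min(codes), max(codes)
    if lowest < 33 or highest > 126:
        raise ValueError("characters outside the printable FASTQ range")
    if lowest < 59:
        # Characters below ';' (59) cannot occur in any +64 format.
        return 33
    if highest > 74:
        # Above 'J' is unusual for raw Phred+33 reads, so assume +64.
        return 64
    # Everything lies between 59 and 74: both offsets are plausible.
    return None


print(guess_offset(["!!''*5CF"]))   # low characters present
print(guess_offset(["hhhhgfeB"]))   # only high characters present
print(guess_offset(["IIIIHHGG"]))   # ambiguous range
```

A result of 64 does not distinguish Solexa scores from Illumina 1.3+ Phred scores; characters between 59 and 63 point to Solexa, but their absence proves nothing. Scanning more reads makes an ambiguous result less likely.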
For a simple look-up from ASCII characters to numeric scores in the Sanger (Phred+33) encoding, you can use the following list:
ASCII  numeric    ASCII  numeric
!      0          @      31
"      1          A      32
#      2          B      33
$      3          C      34
%      4          D      35
&      5          E      36
'      6          F      37
(      7          G      38
)      8          H      39
*      9          I      40
+      10         J      41
,      11         K      42
-      12         L      43
.      13         M      44
/      14         N      45
0      15         O      46
1      16         P      47
2      17         Q      48
3      18         R      49
4      19         S      50
5      20         T      51
6      21         U      52
7      22         V      53
8      23         W      54
9      24         X      55
:      25         Y      56
;      26         Z      57
<      27         [      58
=      28         \      59
>      29         ]      60
?      30         ^      61
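The list above is simply each character's ASCII code minus 33, so the same look-up can be done in code. This is a small illustrative Python sketch; the function name decode_quality is hypothetical.

```python
def decode_quality(quality, offset=33):
    """Turn a FASTQ quality string into a list of numeric scores.

    offset=33 matches the list above (Sanger, Illumina 1.8+);
    offset=64 is used by the Solexa and Illumina 1.3+/1.5+ formats.
    """
    return [ord(ch) - offset for ch in quality]


print(decode_quality("!5I^"))        # [0, 20, 40, 61]
print(decode_quality("@h", 64))      # [0, 40]
```

With offset 64 on Solexa data the results are Solexa scores, which can be negative (down to -5), not Phred scores.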
You can convert the Solexa read quality to Sanger read quality with Maq:
maq sol2sanger s_1_sequence.txt s_1_sequence.fastq
where s_1_sequence.txt is the Solexa read sequence file. Skipping this step will lead to unreliable SNP calling when aligning reads with Maq.
Source: maq-manual
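To illustrate what such a conversion involves: a Solexa score relates to a Phred score by Q_phred = 10 * log10(10^(Q_solexa / 10) + 1). The Python sketch below applies that formula to a single quality string. It is not Maq's implementation; the function names are hypothetical, and rounding to the nearest integer is an assumption.

```python
import math


def solexa_to_phred(q_solexa):
    """Convert one Solexa quality score to the equivalent Phred score."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def solexa_string_to_sanger(quality):
    """Re-encode a Solexa+64 quality string as a Sanger Phred+33 string."""
    out = []
    for ch in quality:
        q_solexa = ord(ch) - 64                      # decode Solexa+64
        q_phred = round(solexa_to_phred(q_solexa))   # nearest integer Phred
        out.append(chr(q_phred + 33))                # encode Phred+33
    return "".join(out)


# Solexa scores -5, 0, 10, 40 become Phred scores 1, 3, 10, 40.
print(solexa_string_to_sanger(";@Jh"))
```

The two scales differ noticeably only at low quality; from about 10 upwards the Solexa and Phred scores are nearly identical, so the main change for good reads is the shift from offset 64 to offset 33.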
Phred itself is a base calling program for DNA sequence traces developed during the initial automation phase of the sequencing of the human genome.
After calling bases, Phred examines the peaks around each base call to assign a quality score to each base call. Quality scores range from 4 to about 60, with higher values corresponding to higher quality. The quality scores are logarithmically linked to error probabilities, as shown in the following table:
Phred quality score    Probability of wrong base call    Accuracy of base call
10                     1 in 10                           90%
20                     1 in 100                          99%
30                     1 in 1,000                        99.9%
40                     1 in 10,000                       99.99%
50                     1 in 100,000                      99.999%
“High quality bases” are usually those with scores of 20 and above (“Phred20 score”).
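The logarithmic relationship in the table can be written as Q = -10 * log10(P), where P is the probability that the base call is wrong. A short Python sketch of both directions follows; the function names are hypothetical.

```python
import math


def phred_to_error_prob(q):
    """Probability of a wrong base call for Phred score q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40, 50):
    p = phred_to_error_prob(q)
    print(f"Q{q}: 1 in {round(1 / p):,} wrong, accuracy {100 * (1 - p):.3f}%")
```

For example, the Phred20 threshold corresponds to an error probability of 1 in 100, which is the second row of the table.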
You can read the original publications about the Phred program and scoring by Brent Ewing et al. from Phil Green’s lab here and here.
Source: www.phrap.com