Allow another user to connect to my EC2 cloud machine

Posted Posted in cloud computing, unix

There might be times when you either have something fascinating happening on the Amazon cloud machine that you set up, or – more likely – you got stuck with a problem a friend might be able to help you with. Here are notes on how to set up a shared access for a standard (Ubuntu linux) computer.

If you set up your EC2 machine securely, it will not allow anyone but you to access it though: The “security group” used allows your IP only.  (If you set it up insecurely with Source = 0.0.0.0/0 your friend – and anyone else – will be able to access it directly!) There is however the option to modify this security group, and after a few minutes delay it will be applied even to running machines! So all you need to do to allow your friend to work alongside of you is to add his IP address and add his public SSH key:

Part 1 (see screenshot below):

  • Find out the additional IP address, e.g. with www.iplocation.net/find-ip-address from your friend’s computer, e.g. 79.217.24.86
  • Go to the EC2 management console
  • In the navigation pane, choose Network Interfaces.
  • Select the network interface and choose Actions >> Edit inbound rules
  • Add a new SSH rule with the additional IP address, e.g. 79.217.24.86/32 and save it. 

Part 2:

  • copy his public SSH key to the machine (using your own private key):

    scp -i "~/.ssh/your-key-region1.pem" friends_key.pub ubuntu@ec2-54-236-163-221.compute-1.amazonaws.com:~/
  • log into the machine (with your own private key):

    ssh -i "~/.ssh/your-key.pem" ubuntu@ec2-54-236-163-221.compute-1.amazonaws.com
  • add the new key to your existing keys:

    cat friends_key.pub >> .ssh/authorized_keys

He can now log in with his own (private) key:

ssh -i "~/.ssh/friends_key.pem" ubuntu@ec2-54-236-163-221.compute-1.amazonaws.com

 

This description was based on this help page.

RNA-Seq data quality scores

Posted Posted in bioinformatics, sequencing

There are different ways to encode the quality scores in FASTQ files from Next-generation sequencing machines. It is important to find out before using the data and to convert between formats if necessary.

  • Sanger format can encode a Phred quality score from 0 to 93 using ASCII characters 33 to 126 (although in raw read data the Phred quality score rarely exceeds 60, higher scores are possible in assemblies or read maps).
  • Illumina 1.3+ format can encode a Phred quality score from 0 to 62 using ASCII characters 64 to 126 (although in raw read data Phred scores from 0 to 40 only are expected).
  • Solexa/Illumina 1.0 format can encode a Solexa/Illumina quality score from -5 to 62 using ASCII characters 59 to 126 (although in raw read data Solexa scores from -5 to 40 only are expected)
  SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
  ..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX......................
  ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
  .................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ......................
  LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL....................................................
  !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
  |                         |    |        |                              |                     |
 33                        59   64       73                            104                   126


 S - Sanger        Phred+33,  raw reads typically (0, 40)
 X - Solexa        Solexa+64, raw reads typically (-5, 40)
 I - Illumina 1.3+ Phred+64,  raw reads typically (0, 40)
 J - Illumina 1.5+ Phred+64,  raw reads typically (3, 40)
    with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator
 L - Illumina 1.8+ Phred+33,  raw reads typically (0, 41)

Source: wikipedia

For a simple look-up from ASCII to numeric scores you can use the following list:

ASCII	numeric		ASCII	numeric
!	0		@	31
"	1		A	32
#	2		B	33
$	3		C	34
%	4		D	35
&	5		E	36
'	6		F	37
(	7		G	38
)	8		H	39
*	9		I	40
+	10		J	41
,	11		K	42
-	12		L	43
.	13		M	44
/	14		N	45
0	15		O	46
1	16		P	47
2	17		Q	48
3	18		R	49
4	19		S	50
5	20		T	51
6	21		U	52
7	22		V	53
8	23		W	54
9	24		X	55
:	25		Y	56
;	26		Z	57
<	27		[	58
=	28		\	59
>	29		]	60
?	30		^	61

You can convert the Solexa read quality to Sanger read quality with Maq:

maq sol2sanger s_1_sequence.txt s_1_sequence.fastq

where s_1_sequence.txt is the Solexa read sequence file. Missing this step will lead to unreliable SNP calling when aligning reads with Maq.

Source: maq-manual

Phred itself is a base calling program for DNA sequence traces developed during the initial automation phase of the sequencing of the human genome.
After calling bases, Phred examines the peaks around each base call to assign a quality score to each base call. Quality scores range from 4 to about 60, with higher values corresponding to higher quality. The quality scores are logarithmically linked to error probabilities, as shown in the following table:

Phred quality	Probability of		Accuracy of
score		wrong base call		base call
10 		1 in 10 		90%
20 		1 in 100 		99%
30 		1 in 1,000 		99.9%
40 		1 in 10,000 		99.99%
50 		1 in 100,000 		99.999%

“High quality bases” are usually scores of 20 and above (“Phred20 score”).

You can read the original publications about the Phred program and scoring by Brent Ewing et al. from Phil Green’s lab here and here.
Source: www.phrap.com

The CCDS project

Posted Posted in bioinformatics, genome informatics

The Consensus CoDing Sequence (CCDS) project is a collaborative effort to identify a core set of human and mouse protein coding regions that are consistently annotated and of high quality. The long term goal is to support convergence towards a standard set of gene annotations.
The IDs are helpful when referring to specific genes and annotation versions in publications as they can be tracked and found even after the underlying genome has changed.

Participating institutes:

  • European Bioinformatics Institute (EBI)
  • National Center for Biotechnology Information (NCBI)
  • Wellcome Trust Sanger Institute (WTSI)
  • University of California, Santa Cruz (UCSC)
  • HUGO Gene Nomenclature Committee (HGNC)
  • Mouse Genome Informatics (MGI)

Project page at NCBI

CCDS Identifiers and Tracking
Annotated genes are given a unique identifier number and version number (e.g. CCDS1.1, CCDS234.1). The version number will update if the CDS structure changes, or if the underlying genome sequence changes at that location. With annotation and sequence based genome browser update cycles, the CCDS set will be mapped forward, maintaining identifiers. All changes to existing CCDS genes are done by collaboration agreement; no single group will change the set unilaterally.

Genomic Start Coordinates

Posted Posted in bioinformatics, genome informatics

Adding to the confusion about different notations of phases/frames, the start coordinates of genomic features are also noted differently between different genome browsers and file formats.

  1. One-based
    Counting bases starting with “1” at the first position.
    Regions are specified by a “closed interval.” Used e.g. by the Ensembl genome browser and annotation system, the GFF/GTF, SAM and wiggle file formats.
  2. Zero-based
    The interbase system counts spaces starting with “0” at the first position.
    Regions are specified by a “half-closed-half-open interval”. Used by the UCSC genome browser, Chado (the fruitfly browser), the BED, BAM and PSL file formats.

An example:

  One-based
     1 2 3 4 5 6
     | | | | | |
     C G A T G C
    | | | | | | |
    0 1 2 3 4 5 6
  Zero-based

The ATG interval would be described from 3-5 in the first, from 2-5 in the second system.

Coding Phases / Frames

Posted Posted in bioinformatics, genome informatics

The phase (or sometimes called frame) gives information on how to translate individual parts of a gene, the coding exons. Phases 1 & 2 have a different definition in GFF and EnsEMBL format!
In EnsEMBL, the phase is defined for exon objects like this:
The Ensembl phase convention can be thought of as “the number of bases of the first codon which are on the previous exon”. It is therefore 0, 1 or 2 (-1 means the exon is non-coding).
In ascii art, with alternate codons represented by ### and +++:

       Previous Exon   Intron   This Exon

    ...-------------            -------------...

    5'                  Phase                3'

    ...#+++###+++###     0      +++###+++###+...

    ...+++###+++###+     1      ++###+++###++...

    ...++###+++###++     2      +###+++###+++...

In the GFF format, the 8th column gives phase information for CDS features. The definition of phases is here:

For features of type “CDS”, the phase indicates where the feature (i.e. exon) begins with reference to the reading frame. The phase is one of the integers 0, 1, or 2, indicating the number of bases that should be removed from the beginning of this feature to reach the first base of the next codon. In other words, a phase of “0” indicates that the next codon begins at the first base of the region described by the current line, a phase of “1” indicates that the next codon begins at the second base of this region, and a phase of “2” indicates that the codon begins at the third base of this region.
For forward strand features, phase is counted from the start field. For reverse strand features, phase is counted from the end field.

[Ref]

In effect, you can usually translate the phase from Ensembl to GFF-style like this:

  • 0 to 0
  • 1 to 2, the initial first base is added to last exon’s codon
  • 2 to 1, the initial first two bases are added to last exon’s codon

The DAS protocol defines the phase as the GFF format:
The tag indicates the position of the feature relative to open reading frame, if any. It may be one of the integers 0, 1 or 2, corresponding to each of the three reading frames, or “-” if the feature is unrelated to a reading frame.

[Some more infos on different formats]