The CCDS project

Posted in bioinformatics, genome informatics

The Consensus CoDing Sequence (CCDS) project is a collaborative effort to identify a core set of human and mouse protein-coding regions that are consistently annotated and of high quality. The long-term goal is to support convergence towards a standard set of gene annotations.
CCDS IDs are helpful when referring to specific genes and annotation versions in publications, as they can be tracked and found even after the underlying genome has changed.

Participating institutes:

  • European Bioinformatics Institute (EBI)
  • National Center for Biotechnology Information (NCBI)
  • Wellcome Trust Sanger Institute (WTSI)
  • University of California, Santa Cruz (UCSC)
  • HUGO Gene Nomenclature Committee (HGNC)
  • Mouse Genome Informatics (MGI)

Project page at NCBI

CCDS Identifiers and Tracking
Annotated genes are given a unique identifier number and version number (e.g. CCDS1.1, CCDS234.1). The version number is incremented if the CDS structure changes, or if the underlying genome sequence changes at that location. When genome browsers update their annotation or sequence, the CCDS set is mapped forward and keeps its identifiers. All changes to existing CCDS genes are made by collaborative agreement; no single group will change the set unilaterally.
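
To compare IDs across versions, it helps to split them into the stable number and the version. The following is a minimal sketch; the pattern is inferred only from the two examples above, and the function names and the ID "CCDS1.2" are hypothetical.

```python
import re

# Pattern inferred from the examples in the text (CCDS1.1, CCDS234.1);
# this is an assumption, not an official specification of the ID format.
CCDS_ID = re.compile(r"^CCDS(\d+)\.(\d+)$")


def parse_ccds_id(ccds_id):
    """Split a versioned CCDS ID into (stable number, version)."""
    match = CCDS_ID.match(ccds_id)
    if match is None:
        raise ValueError(f"not a versioned CCDS ID: {ccds_id!r}")
    return int(match.group(1)), int(match.group(2))


def same_cds(id_a, id_b):
    """True if both IDs refer to the same CCDS entry, ignoring the version."""
    return parse_ccds_id(id_a)[0] == parse_ccds_id(id_b)[0]


print(parse_ccds_id("CCDS234.1"))
# "CCDS1.2" is a made-up later version, used only for illustration.
print(same_cds("CCDS1.1", "CCDS1.2"))
```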

Genomic Start Coordinates

Posted in bioinformatics, genome informatics

Adding to the confusion about different notations of phases/frames, the start coordinates of genomic features are also noted differently between different genome browsers and file formats.

  1. One-based
    Counting bases starting with “1” at the first position.
    Regions are specified by a “closed interval.” Used e.g. by the Ensembl genome browser and annotation system, the GFF/GTF, SAM and wiggle file formats.
  2. Zero-based
    The interbase system counts the positions between bases, starting with “0” before the first base.
    Regions are specified by a “half-closed-half-open interval”. Used by the UCSC genome browser (in its database tables and downloads), Chado (the GMOD database schema that originated at FlyBase), and the BED, BAM and PSL file formats.

An example:

  One-based
     1 2 3 4 5 6
     | | | | | |
     C G A T G C
    | | | | | | |
    0 1 2 3 4 5 6
  Zero-based

The ATG interval is written as 3-5 in the one-based system and as 2-5 in the zero-based system.
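
The conversion between the two systems only touches the start coordinate. Here is a small sketch using the example sequence above; the function names are made up for illustration. Python slices happen to be zero-based and half-open, so the converted interval can be used as a slice directly.

```python
def one_to_zero_based(start, end):
    """1-based closed interval -> 0-based half-open interval."""
    return start - 1, end


def zero_to_one_based(start, end):
    """0-based half-open interval -> 1-based closed interval."""
    return start + 1, end


seq = "CGATGC"  # the example sequence from the diagram

start0, end0 = one_to_zero_based(3, 5)
print(start0, end0)        # interval in zero-based, half-open notation
print(seq[start0:end0])    # Python slicing uses the same convention
print(end0 - start0)       # in this system the length is simply end - start
```

In the one-based closed system the length is `end - start + 1` instead, which is a common source of off-by-one errors when mixing file formats.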

Coding Phases / Frames

Posted in bioinformatics, genome informatics

The phase (sometimes called frame) describes how to translate the individual parts of a gene, the coding exons. Note that phases 1 and 2 are defined differently in the GFF and Ensembl formats!
In Ensembl, the phase is defined for exon objects like this:
The Ensembl phase convention can be thought of as “the number of bases of the first codon which are on the previous exon”. It is therefore 0, 1 or 2 (-1 means the exon is non-coding).
In ascii art, with alternate codons represented by ### and +++:

       Previous Exon   Intron   This Exon

    ...-------------            -------------...

    5'                  Phase                3'

    ...#+++###+++###     0      +++###+++###+...

    ...+++###+++###+     1      ++###+++###++...

    ...++###+++###++     2      +###+++###+++...

In the GFF format, the 8th column gives phase information for CDS features. The definition of phases is here:

For features of type “CDS”, the phase indicates where the feature (i.e. exon) begins with reference to the reading frame. The phase is one of the integers 0, 1, or 2, indicating the number of bases that should be removed from the beginning of this feature to reach the first base of the next codon. In other words, a phase of “0” indicates that the next codon begins at the first base of the region described by the current line, a phase of “1” indicates that the next codon begins at the second base of this region, and a phase of “2” indicates that the codon begins at the third base of this region.
For forward strand features, phase is counted from the start field. For reverse strand features, phase is counted from the end field.

[Ref]
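
The GFF definition can be sketched in a few lines: remove `phase` bases from the start of the CDS segment, then read complete codons. This is only an illustration under stated assumptions: the function name is hypothetical, and the sequence is assumed to be already oriented 5' to 3' on the coding strand (for a reverse strand feature it would have to be reverse-complemented first).

```python
def in_frame_codons(cds_segment, phase):
    """Return the complete codons of a CDS segment, given its GFF phase.

    cds_segment: bases of one CDS feature, read 5'->3' on the coding strand.
    phase: GFF phase (0, 1 or 2), i.e. the number of bases to skip
           before the first base of the next codon.
    """
    if phase not in (0, 1, 2):
        raise ValueError("GFF phase must be 0, 1 or 2")
    trimmed = cds_segment[phase:]
    # Ignore a trailing partial codon; it is completed by the next segment.
    usable = len(trimmed) - len(trimmed) % 3
    return [trimmed[i:i + 3] for i in range(0, usable, 3)]


print(in_frame_codons("ATGGC", 0))
print(in_frame_codons("GATGGCCTAA", 1))
print(in_frame_codons("GGATGGCC", 2))
```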

In effect, you can usually translate the phase from Ensembl to GFF-style like this:

  • 0 to 0
  • 1 to 2, the first two bases of this exon complete the codon started on the previous exon
  • 2 to 1, the first base of this exon completes the codon started on the previous exon
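
The mapping in the list can be written as `(3 - phase) % 3`. The sketch below uses a hypothetical function name and assumes that a non-coding exon (Ensembl phase -1) should be written as ".", the GFF placeholder for an undefined field.

```python
def ensembl_to_gff_phase(ensembl_phase):
    """Translate an Ensembl exon phase into a GFF-style phase string.

    Ensembl: number of bases of the first codon that lie on the previous exon.
    GFF:     number of bases to skip to reach the first base of the next codon.
    """
    if ensembl_phase == -1:
        # Assumption: non-coding exons get "." (undefined) in the phase column.
        return "."
    if ensembl_phase not in (0, 1, 2):
        raise ValueError("Ensembl phase must be -1, 0, 1 or 2")
    return str((3 - ensembl_phase) % 3)


for phase in (0, 1, 2, -1):
    print(phase, "->", ensembl_to_gff_phase(phase))
```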

The DAS protocol defines the phase in the same way as the GFF format:
The tag indicates the position of the feature relative to open reading frame, if any. It may be one of the integers 0, 1 or 2, corresponding to each of the three reading frames, or “-” if the feature is unrelated to a reading frame.

[More information on different formats]

NoSQL Databases

Posted in bioinformatics, database, software

Originally developed for large-scale web applications, NoSQL database systems (dbs) present themselves as next-generation databases; the name is commonly read as “not only SQL”. Moving away from the strict data model and query approach of traditional relational databases, some NoSQL dbs offer more flexibility and easier scalability. They are typically described as non-relational, distributed, open-source, horizontally scalable and schema-free, with easy replication support, a simple API and eventual consistency.

Many different types of NoSQL dbs are being developed. Here are some examples:

  1. Document store dbs
    Systems like Apache CouchDB store loosely structured documents and create views on them. The data is stored in JSON and binary formats in a distributed manner, and queries often use map/reduce operations.
  2. Graph dbs
    Neo4j and FlockDB are example systems that store data as nodes with relationships between them.
  3. Key-Value / tuple store dbs
    An even simpler approach is taken by in-memory or on-disk key-value dbs like Redis where simple text or string sets are stored and queried e.g. via a web service.

To try your hand at an example system, you could e.g. use MongoDB. This is a document-store db with the data stored as JSON-like documents. The installation on an Ubuntu 16.04 system would follow the instructions here:

sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 0C49F3730359A14518585931BC711F9BA15703C6
echo "deb [ arch=amd64,arm64 ] http://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/3.4 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-3.4.list
sudo apt-get update
sudo apt-get install -y mongodb-org
sudo service mongod start

By default, the service listens on port 27017. To try out some data operations:

mongo
> use no_sql_db_test
> db.person.insert({"name":"Andrea","age":19})
> db.person.insert({"name":"Paul","age":23,"pet":"dog"})
> db.person.find()
  { "_id" : ObjectId("5a12dcdf81636f8f3070ce13"), "name" : "Andrea", "age" : 19 }
  { "_id" : ObjectId("5a12dd0181636f8f3070ce14"), "name" : "Paul", "age" : 23, "pet" : "dog" }
> db.person.find({"age":{$lt:20}})
  { "_id" : ObjectId("5a12dcdf81636f8f3070ce13"), "name" : "Andrea", "age" : 19 }
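
To make the schema-free idea and the `$lt` filter more tangible without a running server, here is a toy in-memory sketch in plain Python. It is not how MongoDB is implemented and does not use any MongoDB driver; the `matches` and `find` helpers are hypothetical and support only equality and `$lt`.

```python
# Documents in one "collection" do not need to share the same fields.
people = [
    {"name": "Andrea", "age": 19},
    {"name": "Paul", "age": 23, "pet": "dog"},
]


def matches(document, query):
    """Check a document against a tiny subset of the query syntax."""
    for field, condition in query.items():
        if field not in document:
            return False
        value = document[field]
        if isinstance(condition, dict):
            # Operator form, e.g. {"age": {"$lt": 20}}
            for operator, operand in condition.items():
                if operator == "$lt":
                    if not value < operand:
                        return False
                else:
                    raise ValueError(f"unsupported operator: {operator}")
        elif value != condition:
            # Plain form, e.g. {"pet": "dog"}, means equality.
            return False
    return True


def find(collection, query=None):
    """Return all documents matching the query (all documents if empty)."""
    query = query or {}
    return [doc for doc in collection if matches(doc, query)]


print(find(people))
print(find(people, {"age": {"$lt": 20}}))
```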

An overview and intro tutorial can be found here. To stop the db service run:

sudo service mongod stop

Sources & further reading:
c’t (German), Wikipedia, list of systems

Comparing instance prices on the Amazon cloud

Posted in cloud computing

As the largest cloud computing provider, Amazon Web Services (AWS) offers various options to use compute power on an “as-needed” basis. You can choose the size, type and number of machines, and you can choose a pricing model (spot instances) in which you “bid” for the resource. This means you might have to wait longer to get it, but you can get an impressive discount. You can choose your machines from the AWS dashboard.

Here is a comparison of the current prices for “General Purpose – Current Generation” AWS machines in the EU (Frankfurt) region (as of 13/04/2017):

| Instance type | vCPU | ECU | Memory (GiB) | Instance Storage (GB) | Linux/UNIX On-Demand Price per Hour | Spot Price per Hour | Saving % |
|---|---|---|---|---|---|---|---|
| m4.large | 2 | 6.5 | 8 | EBS Only | $0.129 | $0.0336 | 74 |
| m4.xlarge | 4 | 13 | 16 | EBS Only | $0.257 | $0.0375 | 85 |
| m4.2xlarge | 8 | 26 | 32 | EBS Only | $0.513 | $0.1199 | 77 |
| m4.4xlarge | 16 | 53.5 | 64 | EBS Only | $1.026 | $0.3536 | 66 |
| m4.10xlarge | 40 | 124.5 | 160 | EBS Only | $2.565 | $1.1214 | 56 |
| m4.16xlarge | 64 | 188 | 256 | EBS Only | $4.104 | $0.503 | 88 |
| m3.medium | 1 | 3 | 3.75 | 1×4 SSD | $0.079 | $0.0114 | 86 |
| m3.large | 2 | 6.5 | 7.5 | 1×32 SSD | $0.158 | $0.0227 | 86 |
| m3.xlarge | 4 | 13 | 15 | 2×40 SSD | $0.315 | $0.047 | 85 |
| m3.2xlarge | 8 | 26 | 30 | 2×80 SSD | $0.632 | $0.1504 | 76 |

This only shows a selection of machine options and the prices obviously change over time – but the message should be clear…
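
The saving column can be reproduced from the two price columns. The sketch below assumes the saving is the relative difference between on-demand and spot price, rounded to a whole percent; the function name is made up, and only a few rows of the table are included.

```python
# (on-demand price, spot price) in USD per hour, copied from the table above.
prices = {
    "m4.large": (0.129, 0.0336),
    "m4.4xlarge": (1.026, 0.3536),
    "m4.16xlarge": (4.104, 0.503),
    "m3.medium": (0.079, 0.0114),
}


def spot_saving_percent(on_demand, spot):
    """Saving of the spot price relative to the on-demand price, in percent."""
    return round((1 - spot / on_demand) * 100)


for instance, (on_demand, spot) in prices.items():
    print(f"{instance}: {spot_saving_percent(on_demand, spot)} % saving")
```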

Machine categories / families

To give an idea of what the different machines are, here are the current categories:

| Instance Family | Current Generation Instance Types |
|---|---|
| General purpose | t2.nano, t2.micro, t2.small, t2.medium, t2.large, t2.xlarge, t2.2xlarge, m4.large, m4.xlarge, m4.2xlarge, m4.4xlarge, m4.10xlarge, m4.16xlarge, m5.large, m5.xlarge, m5.2xlarge, m5.4xlarge, m5.12xlarge, m5.24xlarge |
| Compute optimized | c4.large, c4.xlarge, c4.2xlarge, c4.4xlarge, c4.8xlarge, c5.large, c5.xlarge, c5.2xlarge, c5.4xlarge, c5.9xlarge, c5.18xlarge |
| Memory optimized | r4.large, r4.xlarge, r4.2xlarge, r4.4xlarge, r4.8xlarge, r4.16xlarge, x1.16xlarge, x1.32xlarge, x1e.xlarge, x1e.2xlarge, x1e.4xlarge, x1e.8xlarge, x1e.16xlarge, x1e.32xlarge |
| Storage optimized | d2.xlarge, d2.2xlarge, d2.4xlarge, d2.8xlarge, h1.2xlarge, h1.4xlarge, h1.8xlarge, h1.16xlarge, i3.large, i3.xlarge, i3.2xlarge, i3.4xlarge, i3.8xlarge, i3.16xlarge |
| Accelerated computing | f1.2xlarge, f1.16xlarge, g3.4xlarge, g3.8xlarge, g3.16xlarge, p2.xlarge, p2.8xlarge, p2.16xlarge, p3.2xlarge, p3.8xlarge, p3.16xlarge |

 
And the slightly older, previous-generation models:

| Instance Family | Previous Generation Instance Types |
|---|---|
| General purpose | m1.small, m1.medium, m1.large, m1.xlarge, m3.medium, m3.large, m3.xlarge, m3.2xlarge |
| Compute optimized | c1.medium, c1.xlarge, cc2.8xlarge, c3.large, c3.xlarge, c3.2xlarge, c3.4xlarge, c3.8xlarge |
| Memory optimized | m2.xlarge, m2.2xlarge, m2.4xlarge, cr1.8xlarge, r3.large, r3.xlarge, r3.2xlarge, r3.4xlarge, r3.8xlarge |
| Storage optimized | hs1.8xlarge, i2.xlarge, i2.2xlarge, i2.4xlarge, i2.8xlarge |
| GPU optimized | g2.2xlarge, g2.8xlarge |
| Micro | t1.micro |

Source: AWS