Comparing instance prices on the Amazon cloud

Posted in cloud computing

As the largest cloud computing provider, Amazon Web Services (AWS) offers various options to use compute power on an “as-needed” basis. You can choose the size and type of machine, the number of machines – and you can choose a pricing model where you “bid” for spare capacity (so-called Spot Instances). This means you might have to wait longer to get the resource, but you will get an impressive discount! You can choose your machines from the AWS dashboard.

Here is a comparison of the current prices for “General Purpose – Current Generation” AWS machines in the EU (Frankfurt) region (as of 13/04/2017):

Instance Type | vCPU | ECU | Memory (GiB) | Instance Storage (GB) | On-Demand Price per Hour (Linux/UNIX) | Spot Price per Hour | Saving %
m4.large | 2 | 6.5 | 8 | EBS Only | $0.129 | $0.0336 | 74
m4.xlarge | 4 | 13 | 16 | EBS Only | $0.257 | $0.0375 | 85
m4.2xlarge | 8 | 26 | 32 | EBS Only | $0.513 | $0.1199 | 77
m4.4xlarge | 16 | 53.5 | 64 | EBS Only | $1.026 | $0.3536 | 66
m4.10xlarge | 40 | 124.5 | 160 | EBS Only | $2.565 | $1.1214 | 56
m4.16xlarge | 64 | 188 | 256 | EBS Only | $4.104 | $0.503 | 88
m3.medium | 1 | 3 | 3.75 | 1×4 SSD | $0.079 | $0.0114 | 86
m3.large | 2 | 6.5 | 7.5 | 1×32 SSD | $0.158 | $0.0227 | 86
m3.xlarge | 4 | 13 | 15 | 2×40 SSD | $0.315 | $0.047 | 85
m3.2xlarge | 8 | 26 | 30 | 2×80 SSD | $0.632 | $0.1504 | 76

This only shows a selection of machine options and the prices obviously change over time – but the message should be clear…
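
Since both On-Demand and Spot prices change frequently, you can query the current Spot price yourself, for example with the AWS CLI. A minimal sketch (region and instance type are just examples):

# current Spot prices for one instance type in eu-central-1 (Frankfurt)
aws ec2 describe-spot-price-history \
    --region eu-central-1 \
    --instance-types m4.large \
    --product-descriptions "Linux/UNIX" \
    --start-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
    --query 'SpotPriceHistory[*].[AvailabilityZone,SpotPrice]' \
    --output table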

Determining Coverage for NGS Data

Posted in bioinformatics, genome informatics, sequencing

One of the most common questions after a Next-Generation Sequencing (NGS) run is: “What coverage did I achieve?” Unfortunately there are different definitions of the term “coverage”, but in general I would suggest using “How many different reads are aligned at any position of the target genome, on average?”

To answer this question, the straightforward calculation is the Lander-Waterman equation:

C = L * N / G

meaning the coverage C is the length of the reads (L) multiplied by the number of reads (N) divided by the length of the target genome (G).
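
For example, 100 million reads of length 150 bp aligned to the roughly 3.2 Gb human genome give C = 150 * 100,000,000 / 3,200,000,000 ≈ 4.7x.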

This should be done after duplicates and non-aligning reads have been removed. From a BAM file you can use samtools and Unix awk magic to get an average with standard deviation:

samtools depth -a aligned_reads.bam | awk '{sum+=$3; sumsq+=$3*$3} END {  print "Average = ",sum/NR; print "Stdev = ",sqrt(sumsq/NR - (sum/NR)**2) }'
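
If duplicates are already marked in your BAM file, one way to exclude them together with unmapped reads before running the command above is the following sketch (flag 1028 combines “unmapped” (4) and “PCR/optical duplicate” (1024); the filtered file name is just an example):

# drop unmapped reads and marked duplicates, then compute depth as above
samtools view -b -F 1028 aligned_reads.bam > filtered_reads.bam
samtools depth -a filtered_reads.bam | awk '{sum+=$3; sumsq+=$3*$3} END {  print "Average = ",sum/NR; print "Stdev = ",sqrt(sumsq/NR - (sum/NR)**2) }'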

It is, however, important to remember that this is still only an estimate of the average coverage per base of the genome. There are regions in most genomes that are harder to generate sequencing reads for, and regions where it is harder to obtain a reliable alignment (mappability). This means that even if you reach the average coverage of 30x required for many applications in (human) genome analysis, there will still be regions with no reads at all!

Further details, in particular for Illumina NGS runs and how to calculate the target depth, can be found at the Illumina site; there is also a discussion of available tools at Biostars.

Detect user’s OS using bash

Posted in software, unix

Here is an approach to detecting the user’s operating system in a bash script, which might be useful during installation routines of your software. Inspired by StackOverflow answers.


# determine operating system, alternative: use $OSTYPE
printf "Detecting OS\n"
platform='unknown'
OS=$(uname)
case "$OS" in
    'Linux')
        platform='Linux'
        # do something for Linux...
        ;;
    'FreeBSD')
        platform='FreeBSD'
        ;;
    'WindowsNT')
        platform='Windows'
        ;;
    'Darwin')
        platform='Mac'
        ;;
    'SunOS')
        platform='Solaris'
        ;;
    *)
        # uname gave an unexpected value, so ask the user instead
        printf "The operating system could not be determined.\nPlease select\n \
            1 - Apple OS X\n\
            2 - Microsoft Windows\n\
            3 - Linux\n"
        read os_num
        case $os_num in
            '1')
                platform='Mac'
                ;;
            '2')
                platform='Windows'
                ;;
            '3')
                platform='Linux'
                ;;
            *)
                printf "Unknown OS, exiting.\n\n"
                exit 1
                ;;
        esac
        ;;
esac
printf "Found OS ${platform}\n\n"

Create BigWig files with DeepTools

Posted in genome informatics, software

To display data as bar graphs along the genome, e.g. in the UCSC genome browser, you can create BigWig files. The underlying Wig format is described in more detail here.
BigWig is the binary version (described here) that allows compressing the data and streaming it on demand from a remote location to the machine running the display (i.e. the genome browser).

To create and work with the data, various software options are available; my current recommendation is:

DeepTools, a “suite of python tools particularly developed for the efficient analysis of high-throughput sequencing data”. The documentation pages show more options, but to get started, install with:

conda install -c bioconda deeptools

or

pip install --user deepTools

and create a BigWig from a Bam file:

bamCoverage -b Seqdata.bam -o Seqdata.bigWig
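
Note that bamCoverage expects a coordinate-sorted and indexed BAM file. A sketch of the preparation (the sorted file name and the bin size are just examples; deepTools offers various normalization options, see bamCoverage --help):

# sort and index the BAM file so bamCoverage can read it
samtools sort -o Seqdata.sorted.bam Seqdata.bam
samtools index Seqdata.sorted.bam
# create the BigWig with smaller bins (illustrative parameter)
bamCoverage -b Seqdata.sorted.bam -o Seqdata.bigWig --binSize 10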

Originally developed as a Galaxy workflow, deepTools2 can also be run from the command line or via its API. It was published by Fidel Ramírez et al.

You can display the data in the UCSC browser by adding a custom track (more details) in the form on the track control page, using a line like the following to point to your file, which needs to be accessible over the internet:

track type=bigWig name="My Big Wig" description="Seqdata BigWig file" bigDataUrl=https://your.server.com/Seqdata.bigWig
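
One simple way to make the file internet-accessible is to put it on a web server or, for example, into an S3 bucket with public read access; the resulting https URL then goes into bigDataUrl. A sketch with the AWS CLI (bucket and file names are placeholders):

# upload the BigWig to S3 and make it publicly readable
aws s3 cp Seqdata.bigWig s3://your-bucket/Seqdata.bigWig --acl public-read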

Allow others to upload data to your Amazon S3 bucket

Posted in cloud computing

When processing data on the cloud for somebody else (e.g. a customer), or when you want to share large amounts of data for other reasons, a good option is to use cloud storage directly for this. Besides DropBox (which requires a contract with a monthly fee for this kind of data) and other offerings, Amazon provides S3: You already get 5 GB of free S3 storage as part of the “Free Tier”; for anything else you only pay for what you use. At the current price (August 2017) you pay $0.023 per GB for standard storage. Additionally you pay for access operations: PUT, COPY, POST, or LIST requests cost $0.005 per 1,000 requests; GET and all other requests cost $0.004 per 10,000 requests. Data transfers between S3 buckets or from S3 to any Amazon cloud service within the same region are free; transfers to other regions or downloads via the internet incur additional charges. Please check back on the Amazon pages for current prices and conditions.
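
As a rough example at these rates, keeping 100 GB in standard storage for a month costs about 100 * $0.023 = $2.30, plus the request and transfer charges.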

Using the excellent Walk-through example from AWS, here is a condensed version of the set-up that worked for me. I needed to process large data files for a customer (let’s call them DataHeros Ltd. here); downloading the data to my machine and uploading it to the cloud would have been highly inefficient, and FTP or other direct access to the customer data was not possible.

The main steps are:

  1. Create IAM profile for the customer
    – Sign in to AWS, create a new IAM user (let’s call him/her “data-heros-ltd”) with password but no permissions yet.
    – Also create a group (called “customers”)
  2. Create S3 bucket with folder for the customer
    – Go to the S3 console.
    – Create a bucket “my-company-bucket” and folders “customer-A”, “customer-B”, etc.; within “customer-A” I create a subfolder “DataHeros-Ltd”. All customers will see the top-level listing, but they don’t need to see the actual names of other customers. At the same time each customer can be sure that the second-level folder is the correct one for his/her data.
    Let’s also create a folder “other-data” for just that, to which only I have access.
  3. Set permissions to allow access to this area only
    All customers (the group) need list access to the bucket; the current customer DataHeros-Ltd needs write access to the “customer-A / DataHeros-Ltd” area.
    The permissions are set using:
    – the inline policy of the user data-heros-ltd: Policy 1 below
    – a policy attached to the group customers: Policy 2
    – the general Bucket Policy in the Permissions tab of the bucket: Policy 3, which includes your AWS account ID. You can find this at the top of the billing console (or via the CLI snippet after this list).
  4. From the console you can now customize the sign-in link to give to customers along with their username and password. They should then be able to upload data to their folder using the web interface, from where I can process it.
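
As an aside: if you have the AWS CLI configured, the account ID needed for Policy 3 below can also be printed directly (an alternative to looking it up in the billing console):

# print the twelve-digit AWS account ID of the current credentials
aws sts get-caller-identity --query Account --output text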

Policy 1:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowListBucketIfSpecificPrefixIsIncludedInRequest",
            "Action": [
                "s3:ListBucket"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::my-company-bucket"
            ],
            "Condition": {
                "StringLike": {
                    "s3:prefix": [
                        "customer-A/*"
                    ]
                }
            }
        },
        {
            "Sid": "AllowUserToReadWriteObjectDataInDevelopmentFolder",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::my-company-bucket/customer-A/*"
            ]
        }
    ]
}

Policy 2:

{
  "Version": "2012-10-17",                 
  "Statement": [
    {
      "Sid": "AllowGroupToSeeBucketListAndAlsoAllowGetBucketLocationRequiredForListBucket",
      "Action": ["s3:ListAllMyBuckets", "s3:GetBucketLocation"],
      "Effect": "Allow",
      "Resource": ["arn:aws:s3:::*"]
    },
    {
      "Sid": "AllowRootLevelListingOfCompanyBucket",
      "Action": ["s3:ListBucket"],
      "Effect": "Allow",
      "Resource": ["arn:aws:s3:::my-company-bucket"],
      "Condition":{
          "StringEquals":{"s3:prefix":[""]}
       }
    },
    {
      "Sid": "RequireFolderStyleList",
      "Action": ["s3:ListBucket"],
      "Effect": "Deny",
      "Resource": ["arn:aws:s3:::*"],
      "Condition":{
          "StringNotEquals":{"s3:delimiter":"/"}
       }
     },
    {
      "Sid": "ExplictDenyAccessToPrivateFolderToEveryoneInTheGroup",
      "Action": ["s3:*"],
      "Effect": "Deny",
      "Resource":["arn:aws:s3:::my-company-bucket/other-data/*"]
    },
    {
      "Sid": "DenyListBucketOnPrivateFolder",
      "Action": ["s3:ListBucket"],
      "Effect": "Deny",
      "Resource": ["arn:aws:s3:::*"],
      "Condition":{
          "StringLike":{"s3:prefix":["other-data/"]}
       }
    }
  ]
}

Policy 3:

{
    "Version": "2012-10-17",
    "Id": "Policy1502460168202",
    "Statement": [
        {
            "Sid": "Stmt1502460165827",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::MY-AWS-ID:root"
            },
            "Action": "s3:*",
            "Resource": "arn:aws:s3:::my-company-bucket/*"
        }
    ]
}
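
With the user, group, bucket and policies in place, the customer can upload through the S3 web interface as described above. If you additionally create access keys for the IAM user, the same upload also works via the AWS CLI; a sketch (the file name is just an example):

# upload a data file into the customer's folder
aws s3 cp big_data_file.fastq.gz s3://my-company-bucket/customer-A/DataHeros-Ltd/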