Determining Coverage for NGS Data

Posted in bioinformatics, genome informatics, sequencing

One of the most common questions after your Next-Generation Sequencing (NGS) run is: “What is the coverage I achieved?” Unfortunately, there are different definitions of the term “coverage” in use, but in general I would suggest using: “How many different reads are aligned at any position of the target genome, on average?”

To answer this question, the straightforward calculation is the Lander-Waterman equation:

C = L * N / G

meaning the coverage C is the length of the reads (L) multiplied by the number of reads (N) divided by the length of the target genome (G).
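
As a quick sanity check with illustrative numbers: 600 million reads of 150 bp against the roughly 3.1 Gb human genome give about 29x coverage:

awk 'BEGIN { L=150; N=600000000; G=3100000000; printf "C = %.1fx\n", L*N/G }'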

This should be done after duplicates and non-aligning reads have been removed. From a BAM file you can use samtools and Unix awk magic to get an average with standard deviation:

samtools depth -a aligned_reads.bam | awk '{sum+=$3; sumsq+=$3*$3} END {  print "Average = ",sum/NR; print "Stdev = ",sqrt(sumsq/NR - (sum/NR)**2) }'
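
If duplicates and unmapped reads are still present in your BAM file, one way to remove them before running the command above is a filter on the SAM flags (0x4 = unmapped, 0x400 = duplicate); this sketch assumes duplicates have already been marked, e.g. with samtools markdup or Picard MarkDuplicates:

samtools view -b -F 0x404 aligned_reads.bam > filtered_reads.bam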

It is however important to remember that this will still only be an estimate of the average coverage per base of the genome. Most genomes contain regions that are harder to generate sequencing reads for, and regions for which it is harder to obtain a reliable alignment (mappability). This means that even though you might reach the average coverage of 30x required for many analyses of the (human) genome, there will be regions with no reads at all! To assess this, the term “breadth of coverage” has been established to describe how much of the genome is covered by reads.
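
The same samtools depth output can be used to estimate the breadth of coverage; a minimal sketch counting the fraction of positions covered at a minimum depth of 10x (the threshold is an illustrative choice):

samtools depth -a aligned_reads.bam | awk -v MIN=10 '$3 >= MIN { covered++ } END { printf "Breadth at >=%dx: %.1f%%\n", MIN, 100*covered/NR }'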

Further details, in particular for Illumina NGS runs and how to calculate the target depth, can be found on the Illumina site.
A discussion of available tools is on Biostars.

Detect user’s OS using bash

Posted in software, unix

Here is an approach to detecting the user’s operating system with a bash script, which might be useful in the installation routines of your software. Inspired by StackOverflow answers.

# determine operating system, alternative: use $OSTYPE
printf "Detecting OS\n"
platform='unknown'
OS=$(uname)
case $OS in
    'Linux')
        platform='Linux'
        # do something for linux...
        ;;
    'FreeBSD')
        platform='FreeBSD'
        ;;
    'WindowsNT')
        platform='Windows'
        ;;
    'Darwin') 
        platform='Mac'
        ;;
    'SunOS')
        platform='Solaris'
        ;;
    *)
        printf "The operating system could not be determined.\nPlease select\n \
            1 - Apple OS X\n\
            2 - Microsoft Windows\n\
            3 - Linux\n"
        read -r os_num
        case $os_num in
            '1')
                platform='Mac'
                ;;
            '2')
                platform='Windows'
                ;;
            '3')
                platform='Linux'
                ;;
            *)
                printf "Unknown OS, exiting.\n\n"
                exit
                ;;
        esac
        ;;
esac
printf 'Found OS %s\n\n' "$platform"
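
As the comment in the script suggests, an alternative is bash’s $OSTYPE variable, which saves the call to uname; a minimal sketch (glob patterns, because values like “linux-gnu” or “darwin17” can carry version suffixes):

case "$OSTYPE" in
    linux-gnu*)    platform='Linux' ;;
    darwin*)       platform='Mac' ;;
    freebsd*)      platform='FreeBSD' ;;
    solaris*)      platform='Solaris' ;;
    msys*|cygwin*) platform='Windows' ;;
    *)             platform='unknown' ;;
esac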

Create BigWig files with DeepTools

Posted in genome informatics, software

To display data as bar graphs along the genome, e.g. in the UCSC genome browser, you can create BigWig files. The underlying Wig format is described in more detail here.
BigWig is the binary version (described here), which compresses the data and allows streaming it on demand from a remote location to the machine running the display (i.e. the genome browser).

To create and work with the data various software options are available, my current recommendation is:

DeepTools, a “suite of python tools particularly developed for the efficient analysis of high-throughput sequencing data”. The documentation pages show more options, but to get started install with:

conda install -c bioconda deeptools

or

pip install --user deepTools

and create a BigWig from a BAM file:

bamCoverage -b Seqdata.bam -o Seqdata.bigWig
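
bamCoverage offers many options; a sketch with a few commonly used ones (deepTools 3 syntax; bin size, normalization method and thread count are illustrative choices, and the BAM file needs an index, e.g. created with samtools index):

bamCoverage -b Seqdata.bam -o Seqdata.bigWig \
    --binSize 10 \
    --normalizeUsing RPKM \
    --numberOfProcessors 4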

Originally a Galaxy workflow, DeepTools2 can also be run on the command line or via an API. It was published by Fidel Ramírez et al.

You can display the data in the UCSC browser by adding a custom track (more details) via the form on the track control page, using a line like the following to point to your file, which needs to be internet-accessible:

track type=bigWig name="My Big Wig" description="Seqdata BigWig file" bigDataUrl=https://your.server.com/Seqdata.bigWig

Allow others to upload data to your Amazon S3 bucket

Posted in cloud computing

When processing data on the cloud for somebody else (e.g. a customer), or when you want to share large amounts of data for other reasons, a good option is to use cloud storage directly. Besides DropBox (which requires a contract with a monthly fee for this kind of data) and other offerings, Amazon provides S3: you already get 5 GB of free S3 storage as part of the “Free Tier”; beyond that you only pay for what you use. At the current price (August 2017) you pay $0.023 per GB for standard storage. Additionally, you pay for access operations: PUT, COPY, POST, or LIST requests cost $0.005 per 1,000 requests; GET and all other requests cost $0.004 per 10,000 requests. Data transfers between S3 buckets or from S3 to any Amazon cloud service within the same region are free; transfers to other regions or downloads via the internet incur additional charges. Please check back on the Amazon pages for current prices and conditions.

Using the excellent walk-through example from AWS, here is a condensed version of the set-up that worked for me. I needed to process large data files for a customer (let’s call them DataHeros Ltd. here); downloading to my machine and uploading to the cloud would have been highly inefficient, and FTP or other access to the customer data was not possible.

The main steps are:

  1. Create an IAM profile for the customer
    – Sign in to AWS and create a new IAM user (let’s call him/her “data-heros-ltd”) with a password but no permissions yet.
    – Also create a group (called “customers”).
  2. Create S3 bucket with folder for the customer
    – Go to the S3 console.
    – Create a bucket “my-company-bucket” with folders “customer-A”, “customer-B”, etc.; within “customer-A” I create a subfolder “DataHeros-Ltd”. All customers will see the initial listing, but they don’t need to see the actual names of other customers. At the same time each customer can be sure that the second-level folder is the correct one for his/her data.
    Let’s also create a folder “other-data” for just that, to which only I have access.
  3. Set permissions to allow access to this area only
    All customers (the group) need list access to the bucket; the current customer DataHeros-Ltd needs write access to the “customer-A/DataHeros-Ltd” area.
    The permissions are set using:
    – the inline policy of the user data-heros-ltd: Policy 1 below
    – a policy attached to the group customers: Policy 2
    – the general Bucket Policy in the Permissions tab of the bucket: Policy 3, which includes your AWS account ID. You can find this at the top of the billing console.
  4. From the console you can now customize the link to give to customers along with their username and password. They should be able to upload data to their folder using the web interface, and I can then process it. (A condensed AWS CLI version of these steps is sketched after the policies below.)

Policy 1:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowListBucketIfSpecificPrefixIsIncludedInRequest",
            "Action": [
                "s3:ListBucket"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::my-company-bucket"
            ],
            "Condition": {
                "StringLike": {
                    "s3:prefix": [
                        "customer-A/*"
                    ]
                }
            }
        },
        {
            "Sid": "AllowUserToReadWriteObjectDataInDevelopmentFolder",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::my-company-bucket/customer-A/*"
            ]
        }
    ]
}

Policy 2:

{
  "Version": "2012-10-17",                 
  "Statement": [
    {
      "Sid": "AllowGroupToSeeBucketListAndAlsoAllowGetBucketLocationRequiredForListBucket",
      "Action": ["s3:ListAllMyBuckets", "s3:GetBucketLocation"],
      "Effect": "Allow",
      "Resource": ["arn:aws:s3:::*"]
    },
    {
      "Sid": "AllowRootLevelListingOfCompanyBucket",
      "Action": ["s3:ListBucket"],
      "Effect": "Allow",
      "Resource": ["arn:aws:s3:::my-company-bucket"],
      "Condition":{
          "StringEquals":{"s3:prefix":[""]}
       }
    },
    {
      "Sid": "RequireFolderStyleList",
      "Action": ["s3:ListBucket"],
      "Effect": "Deny",
      "Resource": ["arn:aws:s3:::*"],
      "Condition":{
          "StringNotEquals":{"s3:delimiter":"/"}
       }
     },
    {
      "Sid": "ExplicitDenyAccessToPrivateFolderToEveryoneInTheGroup",
      "Action": ["s3:*"],
      "Effect": "Deny",
      "Resource":["arn:aws:s3:::my-company-bucket/other-data/*"]
    },
    {
      "Sid": "DenyListBucketOnPrivateFolder",
      "Action": ["s3:ListBucket"],
      "Effect": "Deny",
      "Resource": ["arn:aws:s3:::*"],
      "Condition":{
          "StringLike":{"s3:prefix":["other-data/"]}
       }
    }
  ]
}

Policy 3:

{
    "Version": "2012-10-17",
    "Id": "Policy1502460168202",
    "Statement": [
        {
            "Sid": "Stmt1502460165827",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::MY-AWS-ID:root"
            },
            "Action": "s3:*",
            "Resource": "arn:aws:s3:::my-company-bucket/*"
        }
    ]
}
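
The same set-up can also be scripted with the AWS CLI instead of the console; here is a condensed, hypothetical sketch of steps 1-4 (the policy names and initial password are illustrative, and it assumes the three policies above are saved locally as policy1.json, policy2.json and policy3.json):

aws iam create-user --user-name data-heros-ltd
aws iam create-login-profile --user-name data-heros-ltd --password 'Initial-Password-123' --password-reset-required
aws iam create-group --group-name customers
aws iam add-user-to-group --user-name data-heros-ltd --group-name customers
aws s3 mb s3://my-company-bucket
aws s3api put-object --bucket my-company-bucket --key customer-A/DataHeros-Ltd/
aws iam put-user-policy --user-name data-heros-ltd --policy-name customer-A-access --policy-document file://policy1.json
aws iam put-group-policy --group-name customers --policy-name customers-bucket-list --policy-document file://policy2.json
aws s3api put-bucket-policy --bucket my-company-bucket --policy file://policy3.json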

aCGH array QC measures

Posted in microarrays

The within-array quality for (genomic) microarrays is often measured using the following metrics:

  1. Standard Deviation Autosome / Robust (SD autosome): Measure of the dispersion of the Log2 ratios of all clones on the array, giving an overall picture of the noise on the array. It is calculated on the normalised but un-smoothed data. The SD robust is based on the middle 58%/66% of the data. By excluding outliers, large changes such as trisomies will not cause this number to change significantly. (The SD robust is the number we use when we say “3 SDs away from the noise” in the calling algorithm.) Both measures are given after all data processing but excluding any smoothing. For BlueFuse Multi processed data the values should be 0.07-0.15 and 0.05-0.11 for the autosome and robust measures, respectively.
  2. Signal to Background Ratio (SBR): Brightness of the mean signal (after the background has been subtracted) divided by the raw background signal (global signal).
  3. Derivative Log2 Ratio / Fused (DLR): Measure of the probe-to-probe variability. In an ideal world, probes within a region would have essentially the same ratio; on a noisy array, adjacent probes can have a very large ratio difference. The DLR raw is calculated before any data processing; DLR fused is calculated after normalisation and data correction BUT always on un-smoothed data, so it is independent of user settings and cannot be adjusted by the user, thereby giving a consistent array-to-array measure of noise (a small sketch of the underlying calculation follows after this list). BlueFuse results should be < 0.2.
  4. % included clones: Percentage of all clones that were not excluded on a BAC array due to inconsistencies between clone replicates. For BlueFuse results this should be > 95 %.
  5. Mean Spot Amplitude: The mean fluorescent signal intensities for the two channels; channel 1 = sample (typically Cy3; ex 550 nm, em 570 nm) and channel 2 = reference (typically Cy5; ex 650 nm, em 670 nm). This metric is variable due to the differences between available scanners. The mean spot amplitude can give an indication of how well the DNA has been labelled with the fluorescent dyes; more importantly, very high values can indicate over-scanning of the microarray image OR poor washing that leaves a lot of non-specific signal. The balance between the channels can also be assessed: the Cy5 signal tends to give a higher intensity than Cy3, but major differences between the channels may indicate a labelling or scanner problem.
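
To illustrate the idea behind the DLR metric, assuming the common “derivative log ratio spread” definition (the standard deviation of the log2-ratio differences between neighbouring probes, divided by the square root of 2), here is a minimal awk sketch; the input file name is hypothetical and should contain one log2 ratio per line, sorted by genomic position:

awk 'NR > 1 { d = $1 - prev; sum += d; sumsq += d*d; n++ } { prev = $1 } END { mean = sum/n; print "DLR =", sqrt(sumsq/n - mean*mean) / sqrt(2) }' log2_ratios_sorted.txt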

Source: BlueGnome user docs

See also: Microarray Scanners and PGS consulting in the UK & Ireland