Encoding information with barcodes

Barcodes are not just for supermarkets! Any medium- to high-throughput laboratory can benefit greatly from employing barcodes. Using them to identify and track samples and all kinds of other items (please don’t barcode your technician!) through locations and processes can:

  • Save time typing the entries manually
  • Greatly reduce the risks of mis-identification
  • Allow longer IDs (i.e. more samples) or additional data without extra effort

Traditional barcodes exist as one-dimensional and two-dimensional black-and-white images (some examples in fig. 1). Both types are read using an optical sensor at a set distance; this could be a dedicated barcode scanner or the camera of your smartphone. One-dimensional barcodes typically encode just one piece of data, such as the sample number. Two-dimensional barcodes can encode more data, including sample ID, customer ID, lot numbers, and more.

There are around 100 different encoding schemes for barcodes. The original standard is the “EAN/UPC” family of linear barcodes, but “Code-128” is becoming more established as it allows letters as well as digits (all 128 ASCII characters) and includes start and stop identifiers. The latter, together with a “quiet zone” (white space) around the barcode and a checksum digit at the end of the code, make reading more reliable and further reduce the risk of mis-identification.
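
As a small illustration of the checksum idea, here is a minimal Bash sketch that computes the Code-128 (subset B) check value for a string; the example string “Gene-Test” matches the figures below, and the script itself is just an illustration, not part of any barcode tool:

#!/bin/bash
# Code-128 subset B: symbol value = ASCII code - 32; the start code B has value 104.
# Check value = (start value + sum of symbol value * position) modulo 103.
data="Gene-Test"
sum=104          # start code B
for (( i=0; i<${#data}; i++ )); do
    char="${data:$i:1}"
    val=$(( $(printf '%d' "'${char}") - 32 ))
    sum=$(( sum + val * (i + 1) ))
done
echo "Check symbol value: $(( sum % 103 ))"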
An online resource to see all the different types of barcodes and to try encoding information with them can be found here; a specification document is also provided. A whole lecture about the science of barcodes is available here. Two examples encoding “Gene-Test” can be seen in fig. 2 & 3:

Fig. 2: 2-D Codablock-F encoding
Fig. 3: Linear EAN-128 barcode

In the EAN-128 scheme there are four different bar widths, and one character is represented by 3 bars and 3 spaces (six elements in total). Using the table here we can decode this barcode:

 

Sources and further information:
www.csolsinc.com, barcode.tec-it.com

 

 

Using Container Software

One of the amazingly useful current trends in software development is “containerization“. This means setting up a self-contained environment on a host computer system that can run a separate operating system (OS) and contain data and software not usually available otherwise. I find this very appealing, e.g. for

  • testing a software package that is not available for my usual OS
  • testing my own software on a different OS
  • developing or running an analysis in fully reproducible settings
  • sharing software or analysis with reproducible settings

The main players in the field are Docker and Singularity. There are pros and cons for each; Singularity might be better suited for shared environments, as you can run the containers with standard user rights.
The general idea of containerization is similar to virtual machines; this Docker page explains the differences.

Docker

To install Docker on an Ubuntu system, currently the following commands will do the trick:

# install prerequisites for using an HTTPS repository:
sudo apt-get update
sudo apt-get install -y apt-transport-https ca-certificates curl software-properties-common
# add Docker's official GPG key and repository ("xenial" = Ubuntu 16.04; adjust for your release):
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu xenial stable"
# install the Docker community edition:
sudo apt-get update
sudo apt-get install -y docker-ce

Installation of Docker on Mac OSX is straightforward with the download of the free community edition (CE) from here. Make sure you either add the software as a log-in item or start it manually before you attempt to use it. (You will see the little Docker whale in the menu bar.)

Some key Docker commands are

# show all images available locally:
docker images
# show all containers that are already running:
docker ps
# start a new container from an image that can be fetched from the remote Docker repository.
# a simple test:
docker run hello-world
# or a small Linux system:
docker run --rm -it phusion/baseimage:0.9.22 /sbin/my_init -- bash 

A good way to move data in and out of the container is by mounting a specific directory, e.g. /home/felix/testproject to the /tmp folder:

docker run --rm -it --mount type=bind,source=/home/felix/testproject,target=/tmp phusion/baseimage:0.9.22 /sbin/my_init -- bash

The standard way to create new images is by defining all installation steps in a Dockerfile. However, to share a pre-built environment together with your own data it is sometimes easier to freeze the container you have built up. You can do this by starting a base system, e.g. the phusion/baseimage mentioned above, installing all the software and data you like, and exporting the container:

# find container ID:
docker ps
CONTAINER ID        IMAGE                      COMMAND                  CREATED       
c24c11050fc5        phusion/baseimage:0.9.22   "/sbin/my_init -- ..."   4 seconds ago 
# export the current state in a compressed archive:
docker export c24c11050fc5 | gzip > my_new_image.tar.gz
# import and run the image later or on a different computer:
docker import my_new_image.tar.gz my_container
docker run -it my_container /sbin/my_init -- bash
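
For completeness, here is a minimal sketch of the Dockerfile route mentioned above; the directory, image name and installed package are just examples:

# write a small Dockerfile and build an image from it:
mkdir -p ~/docker-test && cd ~/docker-test
cat > Dockerfile <<'EOF'
FROM phusion/baseimage:0.9.22
RUN apt-get update && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/*
CMD ["/sbin/my_init", "--", "bash"]
EOF
docker build -t my_tool_image .
docker run --rm -it my_tool_image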

 

Singularity

Installation on Mac OSX requires Vagrant and VirtualBox first. Instead of using the often-recommended brew route, I found it better to install the dmg packages directly from the Vagrant site and the Oracle VirtualBox site. Good instructions for different systems are also given here.
After installation you can start the VirtualBox Manager and set up an Ubuntu image:

# get the image:
vagrant init ubuntu/trusty64
# start the virtual machine:
vagrant up
# get to the command line:
vagrant ssh

or directly start a Singularity image with:

vagrant init singularityware/singularity-2.4
vagrant up
vagrant ssh

You can stop a VM running in the background with

vagrant halt

It is possible to use Docker images in Singularity; you can pull from the Docker hub and build a local image:

sudo singularity build phusion-singularity-image.simg docker://phusion/baseimage:0.9.22
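
Once built, you can step inside the image; a quick sketch using the image name from the build command above:

# open an interactive shell inside the image:
singularity shell phusion-singularity-image.simg
# or run a single command non-interactively:
singularity exec phusion-singularity-image.simg cat /etc/os-release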

 

This should get you started with containers.
Make sure to keep an eye on disk consumption; in particular, Docker data seems to grow significantly in the background (see issue here)! I prefer to move the Docker data (the “qcow2” file) to a fast external disk.
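
Two commands that help keep track of this on recent Docker versions:

# show how much space images, containers and volumes are using:
docker system df
# remove stopped containers, unused networks and dangling images:
docker system prune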

Checking files & directories in your Bash script

It’s good practice to perform all kinds of tests on files that you are processing in a Bash script, in particular if it is a file that the user has provided. Here is a reminder of the test flags you can use as part of the test command:

Switches to check:

-b filename – Block special file
-c filename – Character special file
-d directoryname – Check for directory existence
-e filename – Check for file existence, regardless of type (node, directory, socket, etc.)
-f filename – Check for regular file existence, not a directory
-G filename – Check if file exists and is owned by the effective group ID
-g filename – True if file exists and its set-group-id bit is set
-k filename – Sticky bit
-L filename – Symbolic link
-O filename – True if file exists and is owned by the effective user ID
-r filename – Check if file is readable
-S filename – Check if file is a socket
-s filename – Check if file is of nonzero size
-u filename – Check if file’s set-user-id bit is set
-w filename – Check if file is writable
-x filename – Check if file is executable

How to use:

#!/bin/bash
file=./file
if [ -e "${file}" ] ; then
    echo "File exists"
else 
    echo "File does not exist"
fi 

A test expression can be negated by using the ! operator:

#!/bin/bash
file=./file
if [ ! -e "${file}" ] ; then
    echo "File does not exist"
else 
    echo "File exists"
fi 

Source: GroundZero @ StackOverflow
Note: To be risk-averse I like to use double brackets [[ ]] around the tests and braces {} around the variable names.
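
A minimal sketch of that style, combining two of the test flags from the list above (the file name is just an example):

#!/bin/bash
file="./file"
# double brackets and braces, as recommended above:
if [[ -r "${file}" && -s "${file}" ]] ; then
    echo "File is readable and non-empty"
fi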

Additionally, it might be useful to learn some further comparison operators:

[ ( EXPR ) ] – Returns the value of EXPR; may be used to override the normal precedence of operators.
[ EXPR1 -a EXPR2 ] – True if both EXPR1 and EXPR2 are true.
[ EXPR1 -o EXPR2 ] – True if either EXPR1 or EXPR2 is true.
[ -z STRING ] – True if the length of “STRING” is zero.
[ -n STRING ] or [ STRING ] – True if the length of “STRING” is non-zero.
[ STRING1 == STRING2 ] – True if the strings are equal. “=” may be used instead of “==” for strict POSIX compliance.
[ STRING1 != STRING2 ] – True if the strings are not equal.
[ ARG1 OP ARG2 ] – “OP” is one of -eq, -ne, -lt, -le, -gt or -ge. These arithmetic binary operators return true if “ARG1” is equal to, not equal to, less than, less than or equal to, greater than, or greater than or equal to “ARG2”, respectively. “ARG1” and “ARG2” are integers.

Source
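
And a short sketch combining a string test with an arithmetic comparison (the variable values are just examples):

#!/bin/bash
name="sample_01"
count=3
# string length test combined with an integer comparison:
if [[ -n "${name}" && "${count}" -ge 1 ]] ; then
    echo "Processing ${count} file(s) for ${name}"
else
    echo "Nothing to do"
fi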

Indrops config 2

An example of running the OneCellPipe pipeline from OneCellBio with a pre-defined configuration file, here for a sequencing run with a single part and a single library.
Files:

  • R1.fastq.gz
  • R2.fastq.gz

Command line:

nextflow onecellpipe.nf  --config /data/onecellpipe/data_results/indrop_config_to_use.yaml --out /data/onecellpipe/data_results

indrop_config_to_use.yaml:

# project and library settings
project_name : "libA5"
project_dir : "/data/onecellpipe/data_results"
sequencing_runs :
  - name : "libA5"
    version : "v2"
    dir : "/data/onecellpipe/data"
    fastq_path : "{read}.fastq.gz"
    library_name: "libA5"
# standard indrops config
# part 1: general software paths within the container, do not change
paths : 
  bowtie_index : '/home/onecellbio/ref/Homo_sapiens.GRCh38.91.annotated'
  bowtie_dir : '/home/onecellbio/bowtie'
  rsem_dir : '/home/onecellbio/RSEM/bin'
  python_dir : '/home/onecellbio/pyndrops/bin'
  indrops_dir : '/home/onecellbio/indrops'
  java_dir : '/usr/bin'
  samtools_dir : '/home/onecellbio/samtools-1.3.1'
# part 2: analysis parameters
parameters : 
  umi_quantification_arguments:
    m : 10 #Ignore reads with more than M alignments, after filtering on distance from transcript end.
    u : 1 #Ignore counts from UMI that should be split among more than U genes.
    d : 600 #Maximal distance from transcript end, NOT INCLUDING THE POLYA TAIL
    split-ambigs : False  #If a UMI is assigned to m genes, add 1/m to each gene's count (instead of 1)
    min_non_polyA : 15  #Require reads to align to this much non-polyA sequence. (Set to 0 to disable filtering on this parameter.)
  output_arguments :
    output_unaligned_reads_to_other_fastq : False
    filter_alignments_to_softmasked_regions : False
  bowtie_arguments :
    m : 200
    n : 1
    l : 15
    e : 80
  trimmomatic_arguments :
    LEADING : "28"
    SLIDINGWINDOW : "4:20"
    MINLEN : "30"
    argument_order : ['LEADING', 'SLIDINGWINDOW', 'MINLEN']
  low_complexity_filter_arguments :
    max_low_complexity_fraction : 0.50
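
Before launching the pipeline it can be worth a quick sanity check that the FASTQ files are where the config expects them; a minimal Bash sketch using the paths and file names from the example above:

# check that both reads are present in the configured data directory:
for read in R1 R2; do
    f="/data/onecellpipe/data/${read}.fastq.gz"
    [[ -s "${f}" ]] && echo "OK: ${f}" || echo "MISSING: ${f}"
done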

Another, more complicated example for a single part but multiple libraries follows.
Files:

  • A5_S12_L001_R1_001.fastq.gz
  • A5_S12_L001_R2_001.fastq.gz
  • A5_S12_L002_R1_001.fastq.gz
  • A5_S12_L002_R2_001.fastq.gz
  • A6_S1_L001_R1_001.fastq.gz
  • A6_S1_L001_R2_001.fastq.gz
  • A6_S1_L002_R1_001.fastq.gz
  • A6_S1_L002_R2_001.fastq.gz

Command line:

nextflow onecellpipe.nf  --config /data/onecellpipe/data_results/indrop_config_to_use.yaml --out /data/onecellpipe/data_results_2

indrop_config_to_use.yaml:

# project and library settings
project_name : "libA5"
project_dir : "/data/onecellpipe/data_results_2"
sequencing_runs :
  - name : "libA5"
    version : "v2"
    dir : "/data/onecellpipe/more_data"
    fastq_path : "{read}.fastq.gz"
    split_affixes : ["L001", "L002"]
    libraries : 
      - {library_name: "A5", library_prefix: "A5_S12"}
      - {library_name: "A6", library_prefix: "A6_S1"}
# standard indrops config
# part 1: general software paths within the container, do not change
paths : 
  bowtie_index : '/home/onecellbio/ref/Homo_sapiens.GRCh38.91.annotated'
  bowtie_dir : '/home/onecellbio/bowtie'
  rsem_dir : '/home/onecellbio/RSEM/bin'
  python_dir : '/home/onecellbio/pyndrops/bin'
  indrops_dir : '/home/onecellbio/indrops'
  java_dir : '/usr/bin'
  samtools_dir : '/home/onecellbio/samtools-1.3.1'
# part 2: analysis parameters
parameters : 
  umi_quantification_arguments:
    m : 10 #Ignore reads with more than M alignments, after filtering on distance from transcript end.
    u : 1 #Ignore counts from UMI that should be split among more than U genes.
    d : 600 #Maximal distance from transcript end, NOT INCLUDING THE POLYA TAIL
    split-ambigs : False  #If a UMI is assigned to m genes, add 1/m to each gene's count (instead of 1)
    min_non_polyA : 15  #Require reads to align to this much non-polyA sequence. (Set to 0 to disable filtering on this parameter.)
  output_arguments :
    output_unaligned_reads_to_other_fastq : False
    filter_alignments_to_softmasked_regions : False
  bowtie_arguments :
    m : 200
    n : 1
    l : 15
    e : 80
  trimmomatic_arguments :
    LEADING : "28"
    SLIDINGWINDOW : "4:20"
    MINLEN : "30"
    argument_order : ['LEADING', 'SLIDINGWINDOW', 'MINLEN']
  low_complexity_filter_arguments :
    max_low_complexity_fraction : 0.50

 

Even more complicated: an example for multiple runs (the same samples in different directories) and multiple libraries follows.
Files:

  • Run1/A5_S12_L001_R1_001.fastq.gz
  • Run1/A5_S12_L001_R2_001.fastq.gz
  • Run1/A5_S12_L002_R1_001.fastq.gz
  • Run1/A5_S12_L002_R2_001.fastq.gz
  • Run1/A6_S1_L001_R1_001.fastq.gz
  • Run1/A6_S1_L001_R2_001.fastq.gz
  • Run1/A6_S1_L002_R1_001.fastq.gz
  • Run1/A6_S1_L002_R2_001.fastq.gz

and

  • Run2/A5_S12_L001_R1_001.fastq.gz
  • Run2/A5_S12_L001_R2_001.fastq.gz
  • Run2/A5_S12_L002_R1_001.fastq.gz
  • Run2/A5_S12_L002_R2_001.fastq.gz
  • Run2/A6_S1_L001_R1_001.fastq.gz
  • Run2/A6_S1_L001_R2_001.fastq.gz
  • Run2/A6_S1_L002_R1_001.fastq.gz
  • Run2/A6_S1_L002_R2_001.fastq.gz

Command line:

nextflow onecellpipe.nf  --config /data/onecellpipe/data_results/indrop_config_to_use.yaml

indrop_config_to_use.yaml:

# project and library settings
project_name : "libA5"
project_dir : "/data/onecellpipe/data_results_2"
sequencing_runs :
  - name : "Run1"
    version : "v2"
    dir : "/data/onecellpipe/more_data/Run1"
    fastq_path : "{read}.fastq.gz"
    split_affixes : ["L001", "L002"]
    libraries : 
      - {library_name: "A5", library_prefix: "A5_S12"}
      - {library_name: "A6", library_prefix: "A6_S1"}
  - name : "Run2"
    version : "v2"
    dir : "/data/onecellpipe/more_data/Run1"
    fastq_path : "{read}.fastq.gz"
    split_affixes : ["L001", "L002"]
    libraries : 
      - {library_name: "A5", library_prefix: "A5_S12"}
      - {library_name: "A6", library_prefix: "A6_S1"}
# standard indrops config
# part 1: general software paths within the container, do not change
paths : 
  bowtie_index : '/home/onecellbio/ref/Homo_sapiens.GRCh38.91.annotated'
  bowtie_dir : '/home/onecellbio/bowtie'
  rsem_dir : '/home/onecellbio/RSEM/bin'
  python_dir : '/home/onecellbio/pyndrops/bin'
  indrops_dir : '/home/onecellbio/indrops'
  java_dir : '/usr/bin'
  samtools_dir : '/home/onecellbio/samtools-1.3.1'
# part 2: analysis parameters
parameters : 
  umi_quantification_arguments:
    m : 10 #Ignore reads with more than M alignments, after filtering on distance from transcript end.
    u : 1 #Ignore counts from UMI that should be split among more than U genes.
    d : 600 #Maximal distance from transcript end, NOT INCLUDING THE POLYA TAIL
    split-ambigs : False  #If a UMI is assigned to m genes, add 1/m to each gene's count (instead of 1)
    min_non_polyA : 15  #Require reads to align to this much non-polyA sequence. (Set to 0 to disable filtering on this parameter.)
  output_arguments :
    output_unaligned_reads_to_other_fastq : False
    filter_alignments_to_softmasked_regions : False
  bowtie_arguments :
    m : 200
    n : 1
    l : 15
    e : 80
  trimmomatic_arguments :
    LEADING : "28"
    SLIDINGWINDOW : "4:20"
    MINLEN : "30"
    argument_order : ['LEADING', 'SLIDINGWINDOW', 'MINLEN']
  low_complexity_filter_arguments :
    max_low_complexity_fraction : 0.50

Allow another user to connect to my EC2 cloud machine

There might be times when you either have something fascinating happening on the Amazon cloud machine that you set up, or – more likely – you got stuck with a problem a friend might be able to help you with. Here are notes on how to set up shared access to a standard (Ubuntu Linux) machine.

If you set up your EC2 machine securely, it will not allow anyone but you to access it: the “security group” used allows your IP only. (If you set it up insecurely with Source = 0.0.0.0/0, your friend – and anyone else – will be able to access it directly!) There is, however, the option to modify this security group, and after a few minutes’ delay it will be applied even to running machines! So all you need to do to allow your friend to work alongside you is to add their IP address and their public SSH key:

Part 1 (see screenshot below):

  • Find out your friend’s IP address, e.g. by opening www.iplocation.net/find-ip-address on their computer (in this example 79.217.24.86)
  • Go to the EC2 management console
  • In the navigation pane, choose Network Interfaces.
  • Select the network interface and choose Actions >> Edit inbound rules
  • Add a new SSH rule with the additional IP address, e.g. 79.217.24.86/32 and save it. 
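
The same inbound rule can also be added from the command line with the AWS CLI; a minimal sketch in which the security-group ID is a placeholder you need to replace with your own:

# allow SSH from the friend's IP (sg-0123456789abcdef0 is a placeholder):
aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 22 \
    --cidr 79.217.24.86/32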

Part 2:

  • copy their public SSH key to the machine (using your own private key):

    scp -i "~/.ssh/your-key-region1.pem" friends_key.pub ubuntu@ec2-54-236-163-221.compute-1.amazonaws.com:~/
  • log into the machine (with your own private key):

    ssh -i "~/.ssh/your-key.pem" ubuntu@ec2-54-236-163-221.compute-1.amazonaws.com
  • add the new key to your existing keys:

    cat friends_key.pub >> .ssh/authorized_keys

Your friend can now log in with their own (private) key:

ssh -i "~/.ssh/friends_key.pem" ubuntu@ec2-54-236-163-221.compute-1.amazonaws.com

 

This description was based on this help page.