Running the OneCellPipe software in the Amazon cloud

This is a sub-page of the multi-page documentation for the OneCellPipe pipeline of OneCellBio.

Here are more detailed instructions for running OneCellPipe on a single machine in the Amazon cloud.

Please note that AWS will charge for running this type of machine and for storage; the costs are usually fairly low, though.

We are assuming you have created an account for Amazon Web Services (AWS). (To get started, you can use the free tier offer. You will need an address, an email address and credit card information for that.)
Here are the steps to set up a machine with a spot request (which is cheaper than a regular request, but also might terminate unexpectedly) and either

  1. Use a standard Ubuntu system and install the software required (option A) or
  2. Use the OneCellPipe system we provide (option B)

to run the single-cell inDrops pipeline there.

We strongly recommend option B, i.e. our prepared system (AMI), as this is the easiest way to get started and has been tested by us.

 

Part 1: Starting a machine in the Amazon cloud

  1. Log in to your AWS account at https://aws.amazon.com
  2. In the header bar, go to “Services” / “EC2” (fig. 1)

  3. Choose “Spot Request” on the left side menu and click on “Request Spot Instances”
  4. We will now set the machine details. Please note that you cannot use the machines offered as part of the “Free Tier”, as the alignment requires more than 1 GB of RAM.

    1. AMI:
      – option A: select “Canonical, Ubuntu 16.04 LTS”
      – option B: Click on “Search for AMI”, select “AMIs shared with me”, select “OneCellPipe-Image”
    2. Instance type: choose an instance with 2 CPUs and 8 GB of memory for a test data set. For real data sets with fastq files of a few GB, choose a machine with e.g. 8 CPUs and 30 GB of RAM. You can then sort by the spot price to choose a cheap one, e.g. “m4.large” (fig. 2). At the time of writing this costs about 3 cents per started hour in our example. You can also check the “Pricing History” to make sure this machine type does not tend to spike in price in your chosen zone.
    3. This machine has no disk space included, so we add e.g. 200 to 500 GB of HDD disk space (fig. 3). If your data set is small, 60 GB might be enough. Expect to use about four times the space your raw (compressed) fastq files occupy, e.g. for 50 GB of compressed fastq files plan for at least 200 GB. (Disk space adds additional costs.)

    4. Security groups: Make sure you choose a security group that allows SSH traffic; otherwise you won’t be able to connect to your instance. You can easily create a new security group using the link on the current AWS page.
      There, make sure you allow “All outbound traffic” and port 22 for inbound SSH traffic; you can allow everyone to connect or restrict access to specific IP addresses (fig. 4).

    5. “Key pair name”: Choose an SSH key file that you have access to or create a new one. You will need to refer to it in the SSH command.
  5. “Launch” the request. It will take a few seconds or minutes to be fulfilled and a minute to set up the machine. You can refresh the display with the arrows icon at the top right.
  6. As soon as the request changes from “pending_fulfillment” to “fulfilled” you can click on “Instances” in the left menu to see your new machine. Clicking “Connect” here will show you the “Example” line to use on the SSH command line, and the “Public DNS” to specify in your SSH client (e.g. PuTTY). (If you prefer the command line, a way to look up the public DNS with the AWS CLI is sketched after this list.)
  7. The machine will go through a few checks; once done, you can connect using this DNS name, your SSH key file and the user name “ubuntu”.
    1. Option 1: If you are connecting via the command-line you can simply type something like:

      ssh -i "/users/fred/.ssh/aws-key.pem" ubuntu@ec2-32-239-1-148.compute-1.amazonaws.com

         /users/fred/.ssh/aws-key.pem is the path to your SSH key that you have specified in step 4.5.
         ec2-32-239-1-148.compute-1.amazonaws.com is the machine address given in step 6.

    2. Option 2: If you prefer SSH client software like PuTTY (fig. 5), here is some help for the installation and here are instructions for the correct connection.

  8. You are now in the cloud!
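
  Optionally, instead of reading the public DNS from the console (step 6), you can look it up from your local command line. This is a minimal sketch and assumes the AWS CLI is installed and configured with your credentials:

    aws ec2 describe-instances \
        --filters "Name=instance-state-name,Values=running" \
        --query "Reservations[].Instances[].PublicDnsName" \
        --output text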

 

Part 2: Installations

For option A only:
    1. Install the required software
    2. Install optional software:
      Use s3fs-fuse to connect to an S3 cloud storage location (recommended):

      sudo apt install -y automake autotools-dev fuse g++ git libcurl4-gnutls-dev libfuse-dev libssl-dev libxml2-dev make pkg-config
      git clone https://github.com/s3fs-fuse/s3fs-fuse.git
      cd s3fs-fuse; ./autogen.sh; ./configure; make; sudo make install
      cd ..
    3. Increase memory tolerance (if necessary)
      Add a 2 gigabyte “swap” file to the system (adjust the size to suit your machine):

      sudo dd if=/dev/zero of=/var/swapfile bs=1M count=2048
      sudo chmod 600 /var/swapfile
      sudo mkswap /var/swapfile
      echo /var/swapfile none swap defaults 0 0 | sudo tee -a /etc/fstab
      sudo swapon -a

      Add a system setting to limit the memory Java processes should use (here: minimum 1 GB, maximum 9 GB; adjust if necessary):

      echo 'export NXF_OPTS="-Xms1G -Xmx9G"' >> ~/.bashrc; source ~/.bashrc
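
      To confirm that the swap file and the Java memory setting are active (a quick sanity check):

      free -h            # the "Swap" line should now show about 2.0G
      swapon --show      # lists the active swap file(s)
      echo $NXF_OPTS     # should print -Xms1G -Xmx9G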

For both options, if you want to access your S3 data:

  1. Add security credentials for AWS / S3 and mount your S3 bucket into a directory within the onecellpipe directory.
    Replace all CAPITAL words (AWS access key, secret password and bucket name) with your own settings.

    # make the S3 password available:
    echo ACCESS-KEY:SECRET-KEY > /home/ubuntu/onecellpipe/s3.pass
    chmod 600 /home/ubuntu/onecellpipe/s3.pass
    # mount S3 as a new directory "s3data":
    mkdir /home/ubuntu/onecellpipe/s3data
    s3fs YOUR-BUCKET-NAME /home/ubuntu/onecellpipe/s3data -o passwd_file=/home/ubuntu/onecellpipe/s3.pass -o allow_other -o umask=000

    You can manage your access keys via users: click on your user name at the top of the AWS browser page and select “My Security Credentials”, then “Users”. Create a new access key and note down the Access key ID and Secret access key. They should look something like IASDJASD8SAFD6SADF and s98JSAD7ASDdssd7UASDASDya4jh3aS. (A quick check of the finished mount is sketched after this list.)

  2. If you save this machine as your own AMI you can skip all of these steps next time you want to run the analysis and just launch your own image! (Additional AWS storage costs may occur.)
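
  To check that the S3 mount from step 1 works, and to unmount it again when you are done (a quick sanity check using the paths from above):

    ls /home/ubuntu/onecellpipe/s3data              # should list the contents of your bucket
    fusermount -u /home/ubuntu/onecellpipe/s3data   # unmounts the bucket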

 

Part 3: Run the test pipeline

Use the sample data provided to test that everything is working.
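
The command is the same one used in the installation walk-through further down in this documentation; from the pipeline directory:

cd ~/onecellpipe
nextflow onecellpipe.nf --dir sampledata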

Here is a real-time screencast of the processing of the sample data:

Part 4: Process your own data

  1. Upload your data if it is not in your S3 bucket
    To upload your data to the cloud machine you can use scp from your local machine. Some alternatives are provided here. In this example I have compressed all fastq files from the directory run1-fasta-files into an archive fasta.tgz:

    tar czvf fasta.tgz run1-fasta-files

    Open a new terminal window and slightly modify the ssh line you used to connect, using scp to upload fasta.tgz to your pipeline directory /home/ubuntu/onecellpipe/:

    scp -i "/users/fred/.ssh/aws-key.pem" fasta.tgz ubuntu@ec2-32-239-1-148.compute-1.amazonaws.com:/home/ubuntu/onecellpipe/
  2. Extract your data and run the pipeline:

    tar xzvf fasta.tgz
    nextflow onecellpipe.nf --dir run1-fasta-files

    or with a config file in the format expected by indrops:

    nextflow onecellpipe.nf --config run1.yaml

    It is usually a good idea to start the process within a screen session so that an interrupted connection does not end your pipeline (see the sketch after this list)!

  3. Optional: Use an S3 bucket
    Data you have stored in the Amazon S3 cloud storage can be accessed directly on the command line of the cloud machine if you have mounted the bucket as described in Part 2. It is recommended to direct the output to a directory on the machine (using the --out option), though:

    nextflow onecellpipe.nf --dir /home/ubuntu/onecellpipe/s3data/YOUR-BUCKET-NAME/run1-fasta-files --out /home/ubuntu/pipe-results
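
  As mentioned in step 2, running the pipeline inside a screen session protects it from dropped connections. A minimal sketch (the session name "pipe" is arbitrary):

    screen -S pipe                                   # start a named screen session
    nextflow onecellpipe.nf --dir run1-fasta-files   # start the pipeline inside it
    # detach with <Ctrl><a><d>; reattach later with:
    screen -r pipe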

Part 5: Shut down the machine

Never forget to go back to the list of instances on the AWS page and shut down the machine by selecting it and choosing “Actions” / “Instance State” / “Terminate”.
This will stop the costs of running the machine. It will also interrupt your connection(s) and delete any data you put on the cloud machine, so make sure you copy your results back to your own machine or to S3.
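
Before terminating, you can copy the results back to your local machine with scp, reusing the key file and address from Part 1 (a sketch using the example values from above):

scp -i "/users/fred/.ssh/aws-key.pem" -r ubuntu@ec2-32-239-1-148.compute-1.amazonaws.com:/home/ubuntu/pipe-results .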

 

 

Installing the software required for the OneCellPipe system

Here are more detailed instructions for installing and running the OneCellPipe software and its requirements. These instructions are for a Linux Ubuntu system where you have sudo rights; please adjust accordingly if you are using a different flavor of a Unix-based operating system. This is a sub-page of the multi-page documentation for the OneCellPipe pipeline of OneCellBio.

Installation

  1. If necessary, update your system and install Python 2 and build packages – or ask your system administrator for it if you are not allowed to run sudo commands.
    python --version should show something like:
    Python 2.7.12

    sudo apt update
    sudo apt install -y python build-essential
  2. If necessary, update / install Java 8 or ask your system administrator for it.
    sudo apt install -y default-jre --fix-missing
  3. Install NextFlow
    wget -qO- get.nextflow.io | bash
    

    You can of course use curl if wget is not available. In order to run the software from anywhere in the file system, move it to a directory specified in your path settings:

    sudo mv nextflow /usr/local/bin/

    or add the current location to your path if you do not have sudo permissions. For the bash shell this could be:

    echo "export PATH=\$PATH:$PWD" >> ~/.bashrc; source ~/.bashrc

    You can print out your current path settings with: echo $PATH

  4. Install Singularity
    We are using version 2.5.1 at this time (check other versions here, but format incompatibilities might occur). The installation of Singularity requires sudo rights to work properly.

    VERSION=2.5.1
    wget https://github.com/singularityware/singularity/releases/download/$VERSION/singularity-$VERSION.tar.gz
    tar xvf singularity-$VERSION.tar.gz
    cd singularity-$VERSION
    ./configure --prefix=/usr/local
    make
    sudo make install
    cd ..
    rm singularity-$VERSION.tar.gz

    In case there is an error like “configure: error: Unable to find the libarchive headers”, please also install:

    sudo apt install -y libarchive-dev
  5. Install the OneCellBio pipeline
    We are using the current version 1.21; adjust if necessary.

    VERSION=1.21
    wget https://s3.amazonaws.com/gt-datastorage/onecellpipe/onecellpipe.$VERSION.tgz
    tar xzf onecellpipe.$VERSION.tgz
    rm onecellpipe.$VERSION.tgz
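
To quickly confirm that the main tools are installed and on your PATH before running the test (a minimal check):

nextflow -version        # prints the Nextflow version information
singularity --version    # should report 2.5.1
python --version         # should report Python 2.7.x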

Run the test pipeline

Use the sample data provided to test that everything is working:

cd ~/onecellpipe
nextflow onecellpipe.nf --dir sampledata

This will start the process, producing output like the following:

N E X T F L O W  ~  version 0.27.0
Launching `onecellpipe.nf` [lonely_ekeblad] - revision: 4daa120f75

======================================
O N E C E L L B I O    P I P E L I N E
======================================
Pipeline  version : 1.21
Pipeline directory: /home/ubuntu/onecellpipe

Project directory : sampledata
Log file path     : /home/ubuntu/onecellpipe/sampledata/indrop_log.txt
Timestamp         : Mon Jan 15 12:29:08 UTC 2018

[warm up] executor > local
[18/fe30f8] Submitted process > setup (Setup procedure)

This will run in about 5-20 minutes, depending on your number of CPUs/cores. An example timeline for a machine with 8 cores can be seen here.
You can interrupt and end the pipeline by hitting <Ctrl>C.
You can try to resume the pipeline by adding -resume to your command.
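
For example, to resume the interrupted test run from the last successful step:

nextflow onecellpipe.nf --dir sampledata -resume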

Common questions about the OneCellPipe system

This is a sub-page of the multi-page documentation for the OneCellPipe pipeline of OneCellBio.

Here is a collection of frequently asked questions and potential error messages when running the single-cell processing pipeline.

Questions

  1. How do I free up disk space after running the pipeline?
    Answer:
    Once you are done, you can remove temporary files in your analysis folder. For the sample data set this would be:

    rm -rf sampledata/A5/{filtered_parts,quant_dir}

    You can also remove the Nextflow cache:

    nextflow clean
  2. How do I save / store / backup my results?
    Answer:
    The easiest would be to compress and copy the entire output folder (if specified with --out) or working folder (if only --dir was specified), e.g. with

    tar czvf onecellpipe-results.tgz result-folder

    Alternatively just save the main result files, replacing LIB-NAME with your library name:

    tar czvf onecellpipe-results.tgz result-folder/resources result-folder/*.* result-folder/LIB-NAME/*.*
  3. I started the pipeline, but nothing seems to happen.
    Answer:
    When run for the first time the container image has to be downloaded. This can take several minutes.
    This can also happen when the container image was removed from the system.
  4. I have a lot of fastq files to process in the cloud. How can I get them into S3?
    Answer:
    Have a look at these options.
  5. Should I use Docker or Singularity?
    Answer:
    We provide both options to accommodate integration into different IT infrastructures. There is no significant performance difference. If you have local system administrators, ask for their preference. For compute clusters it is sometimes better to use Singularity to avoid giving sudo permissions. The memory footprint of Singularity is also slightly smaller.
  6. How can I speed up the pipeline?
    – Increase parallel processing by using a compute cluster or a machine with more CPUs / cores. You can then adjust the number of parallel jobs.
    – Don’t run the QC steps if you don’t need them: --qc 0 (default)
    – Don’t create the transposed count matrix if you don’t need it: --transpose 0 (default)
    – Don’t create the BAM files after quantification if you don’t need them: --bam 0 (default)
  7. I accidentally stopped the pipeline, what can I do?
    As long as no files in the cache of Nextflow have changed, it is often possible to jump right back to the last successful step by repeating the same command and adding -resume.
  8. I lost the connection to my server, what can I do?
    Reconnect, use the screen command and try to resume the pipeline (by repeating the same command and adding -resume). Disconnect the screen by pressing <Ctrl><a><d> to avoid another interruption.
  9. I specified --email <email@address.com> but did not receive a notification!
    – Does your system support sendmail?
    – Unless you set up SMTP details at the bottom of the nextflow.config file, mails often get blocked as spam! Try a Gmail address. Have a look in your sendmail folder, e.g. with less /var/mail/<username>, if your notification got stuck. (A sketch of such an SMTP block is shown below.)
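
A minimal sketch of such an SMTP block, appended to the bottom of nextflow.config (the host, user and password are placeholders; check the Nextflow documentation and your mail provider for the correct values):

cat >> nextflow.config <<'EOF'
mail {
    smtp.host = 'smtp.example.com'      // placeholder SMTP server
    smtp.port = 587
    smtp.user = 'user@example.com'      // placeholder account
    smtp.password = 'app-password'      // placeholder password
    smtp.auth = true
    smtp.starttls.enable = true
}
EOF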

 

Potential Error Messages

    1. Bowtie error 1
      Error: Could not allocate ChunkPool of 1048576000 bytes
      Warning: Exhausted best-first chunk memory for read 
      ...
      Exception: 
      === Error on piping data to bowtie ===

      Solution:
      Bowtie requires at least 2 GB of RAM just to load the genome index.
      Please use a machine with at least 4 GB of RAM.

    2. Container software error
      Container software singularity could not be found, please make sure it is running or download 
       and install it ...
      

      Solution 1:
      You did not install container software on the machine you are running the pipeline on or you did not activate / start it.
      Solution 2:
      You are trying to use Docker as container software, but you did not specify --docker 1 on the command line.

    3. Container software error
      Pipeline execution stopped with the following message: Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get  (...) dial unix /var/run/docker.sock: connect: permission denied
      

      Solution:
      You are running on a system where the Docker software requires higher permissions.
      Add the following to your command line:

      --sudo 1
    4. Input error 1
      INFO:	Scanning directory testdata2
      ERROR:	Could not find fastq files.
      

      Solution:
      Provide the absolute path to your project directory with the parameter --dir /home/jdoe/data/fastqs and make sure there are fastq files in there.

    5. Input error 2

      ERROR:	Could not identify file name pattern.
        Please use configuration file

      Solution:
      The names of your fastq files are different from the default pattern expected by the automatic setup script:

      {library_prefix}_{split_affix}_{read}_001.fastq.gz

      Either change the file names or provide a configuration file using the --config option.

    6. Alignment error
      Traceback (most recent call last):
        File "/home/onecellbio/indrops/indrops.py", line 1724, in 
          no_bam=args.no_bam, run_filter=target_runs)
        File "/home/onecellbio/indrops/indrops.py", line 818, in quantify_expression
          min_counts = min_counts, run_filter=run_filter)
        File "/home/onecellbio/indrops/indrops.py", line 928, in quantify_expression_for_barcode
          raise Exception("\n === No aligned bam was output for barcode %s ===" % barcode)
      Exception: 
       === No aligned bam was output for barcode bcDFJI ===
      

      Solution:
      This indrops error seems to occur when the number of jobs for the last step is not appropriate for the amount of data. Try reducing the --workers2 number.

    7. AWS connection error
      When trying to connect to an Amazon cloud machine using your SSH key file you might see the following error:

      @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      @         WARNING: UNPROTECTED PRIVATE KEY FILE!          @
      @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      Permissions 0644 for '/Users/fred/.ssh/aws-key.pem' are too open.
      

      Solution:
      Change permission so that only you are allowed to access this file:

      chmod 400 /Users/fred/.ssh/aws-key.pem
    8. Setup error
      INFO: Setting up analysis container onecellpipe-25-2.tar.gz.
      ERROR: There was a problem importing the Docker image.Command error:
      Error response from daemon: Error processing tar file(exit status 1):
        write /home/onecellbio/ref/Mus_musculus.GRCm38.91.annotated.n2g.idx.fa: no space left on device
      

      Solution:
      You need more disk space in order to run the pipeline since the genome index files are large and you will need additional space for your results!

    9. Parameter error
      Unknown option: XXXX -- Check the available commands and options and syntax with 'help'
         OR
      ERROR ~ Unknown parameter "XXXX"

      Solution:
      Check which parameters can be used.
      General Nextflow parameters are passed with a single dash, e.g. -with-timeline.
      OneCellPipe parameters are passed using double dashes, e.g. --dir /fastq/dir

    10. Input problems

      gzip: /home/ubuntu/somefolder/sampledata/A5_S12_L001_R2_001.fastq.gz: No such file or directory
      ...
      Command error:
        .command.run.1: line 99:    12 Terminated              nxf_trace "$pid" .command.trace

      Solution:
      On some systems there are problems if the input folder is not at the same level as the Nextflow pipeline. Move the fastq folder to the current directory and start again.
      This can also happen if the fastq files are only provided via links (a way to copy the actual files is sketched below).
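
      If the fastq files in your input folder are only symbolic links, one option is to copy the actual files into a local folder first (a sketch; the source path is a placeholder):

      cp -rL /path/to/linked-fastq-folder ./run1-fasta-files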

Creating a Singularity image when Docker is misbehaving

Today’s task was to update a docker image by adding some data, committing the changes to the Docker-Hub and creating image files for Docker and for Singularity (version 2.4-dist).
I usually perform these tasks on an Ubuntu machine in the AWS cloud. Following my steps from a previous time, I first start my container and perform the update within it:

sudo docker run -it --rm -m 4g --mount type=bind,source=/home/ubuntu,target=/tmp  --name projectname-1 $DOCKER_ID_USER/projectname:2 /bin/bash
# ...

Checking the container ID and committing and pushing the changes to Docker-Hub all seems fine:

sudo docker ps
sudo docker commit 6a024ac35ab5 $DOCKER_ID_USER/projectname:2
sudo docker push $DOCKER_ID_USER/projectname:2

But trying to pull down the image using Docker or Singularity fails repeatedly:

sudo docker pull $DOCKER_ID_USER/projectname:2
invalid reference format
singularity pull docker://$DOCKER_ID_USER/projectname:2
Importing: base Singularity environment
Importing: /home/ubuntu/.singularity/docker/sha256:223cbef2a1193a2c9ab9dac0195ff0dcbbe2067e724f46a5fbe8473dda842b71.tar.gz
gzip: /home/ubuntu/.singularity/docker/sha256:223cbef2a1193a2c9ab9dac0195ff0dcbbe2067e724f46a5fbe8473dda842b71.tar.gz: not in gzip format
tar: This does not look like a tar archive
tar: Exiting with failure status due to previous errors

The name and format of my project have not changed, however! After some head-scratching I found that I could perform the task of creating the Singularity image on my local machine (an Apple MacBook Pro running High Sierra), following this Singularity page:

cd /Users/fsk/singularity-vm/
vagrant destroy
rm Vagrantfile
vagrant up --provider virtualbox

Checking with singularity --version shows that version 2.4-dist is running successfully!
Pulling with Singularity nicely creates my Singularity img file from the Docker image:

sudo singularity pull docker://$DOCKER_ID_USER/projectname:2
...
Singularity container built: ./projectname-2.img
...

What are the alternatives?

  • Using docker2singularity created a file twice the size and was therefore not useful!
  • Re-building from a Dockerfile or from a Singularity recipe file would have been very tedious, but would probably work (a minimal recipe is sketched below).
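
For reference, a minimal Singularity recipe that bootstraps from the Docker Hub image is short (the Docker Hub user name below is a placeholder; requires Singularity 2.4+ and sudo); the tedious part is that the build re-installs everything:

cat > Singularity <<'EOF'
Bootstrap: docker
From: your-dockerhub-user/projectname:2
EOF
sudo singularity build projectname-2.img Singularity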

To produce the Docker images, I was able to export from the running container on the Ubuntu machine:

sudo docker export 6v034ac35ab5 | gzip > projectname-2.tar.gz
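
On the target machine, the exported archive can be loaded back as an image with docker import (a sketch; note that docker export flattens the container and drops the image history):

sudo docker import projectname-2.tar.gz projectname:2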

All is good…