Big Data with SQLite

I recently wanted to process genome-wide sequence data that I had extracted into 24 files ranging in size from 3 GB (Gigabytes) for the small chromosome Y to 26 GB for the largest chromosome 1. Further processing this data was challenging as any software script reading the files has to do it in a stream, i.e. line by line or in chunks – or use immense amounts of memory. The actual task was to extract the set of sequences from all chromosome files that are unique across the genome. Command-line tools to sort / make unique / combine would be running for many days if they would finish at all.

The solution was to make use of a database management system (DBMS). There are more powerful DBMS, but the free and simple-to-use file-based database system SQLite can handle a lot of data as well. Theoretically its file can grow up to 140 terabytes and more (ref). I had to optimize the setup in the following ways:

1. Install the latest SQLite version, compiling it as described here. For the Mac OS X it included the following:

conda install -c conda-forge fossil
fossil clone sqlite.fossil
fossil open sqlite.fossil
fossil update release
mkdir bld
cd bld
make sqlite3.c
make test

2. Add the “virtual tables” extension, compiling csv.c from

gcc -g -I. -fPIC -dynamiclib /home/felix/sqlite_extensions/csv.c -o /home/felix/sqlite_extensions/csv.dylib

3. When setting up the database, use PRAGMA settings to improve the performance of the DBMS and then load the extension:

sqlite3 sequences.db
>PRAGMA main.page_size = 4096;
>PRAGMA main.cache_size = 10000;
>PRAGMA main.locking_mode = EXCLUSIVE;
>SELECT load_extension('/home/felix/sqlite_extensions/csv.dylib');

4. Load the csv data into a virtual / temporary table first – this is extremely fast. Perform any operations on this table while inserting the data into a final table in the database. This could for example involve pulling out only unique sequences:

>CREATE VIRTUAL TABLE temp.t1 USING csv(filename="/home/felix/data/chrom1.csv");
>CREATE TABLE sequences (
    seqstring VARCHAR NOT NULL
INSERT INTO sequences(seqstring) SELECT DISTINCT c0 from t1;

If you need indices, add them afterwards!
Trying to remove duplicates by loading into a table with a “UNIQUE” column was too slow, btw.

5. The next steps are then to load data from other chromosomes into this table “sequences” and then pull them out into a new table “unique_sequences” to remove duplicates between different chromosomes as well:

>DROP TABLE temp.t1;
>CREATE VIRTUAL TABLE temp.t1 USING csv(filename="/home/felix/data/chrom2.csv");
>CREATE TABLE sequences (
   seqstring VARCHAR NOT NULL
INSERT INTO sequences(seqstring) SELECT DISTINCT c0 from t1;

# and so on for other chromosomes
# ...

>CREATE TABLE unique_sequences (
    seqstring VARCHAR NOT NULL
INSERT INTO unique_sequences(seqstring) SELECT DISTINCT seqstring from sequences;

In my case it was better to use a number of intermediate databases to be able to run the processing in parallel and to keep the file size smaller. The data from the different databases can then be combined into the final db with statements like this:

sqlite3 sequences.db
>PRAGMA main.page_size = 4096;
>PRAGMA main.cache_size = 10000;
>PRAGMA main.locking_mode = EXCLUSIVE;

>ATTACH DATABASE 'sequences.1.db' AS db1;
>INSERT INTO unique_sequences(seqstring) SELECT DISTINCT seqstring from db1.unique_sequences;

The final sqlite database was around 150 GB, but the performance was fine to extract sequences as needed.

Telomeric Regions of the Human Genome

Telomeres form caps on the ends of chromosomes that prevent fusion of chromosomal ends and provide genomic stability.

During gametogenesis, reprogramming of the germ cells leads to elongation of telomeres up to their species-specific maximum.

In normal somatic cells, telomeres are progressively shortened with every cell division. This shortening in normal human cells limits the number of cell divisions. For human cells to proliferate beyond the senescence checkpoint, they need to stabilize telomere length. This is accomplished mainly by reactivation of the telomerase enzyme. Telomerase expression is under the control of many factors. Expression of telomerase can lead to cell immortalization and is activated during tumorigenesis, i.e. cancer.

Male Xq-telomeres are 1100 bp shorter than female Xq-telomeres.

The telomeric repeat found on all human chromosomes is “TTAGGG”.

The centromeres and telomeres of the human chromosomes are not defined as region attributes in the Ensembl perl API explicitely, so for checking these regions, one option is to pull them out of the UCSC table browser. For this, select the assembly and use the “Mapping and Sequencing tracks” group and the “Gap” table. The extracted locations of the human telomere regions is provided below for the genome assemblies GRCh37 (hg19) and GRCh38 (hg38). The coordinates are given in the 0-based UCSC coordinated system.

Telomere locations

Telomeres of chromosome 17 have not been defined for assembly GRCh37. They are short, but do exists nonetheless. An assembly patch will address this.

[table id=7 /]

Centromer locations

The centromers for assembly hg38 are defined in different ways in the UCSC system. According to an explanation by Brian Lee and Christopher Lee of UCSC there now is a specific browser track that “shows the location of Karen Miga’s centromere model sequences. However, these annotations can be smaller than the centromeres shown in the chromosome ideogram and Chromosome Bands track. (…) Depending on your purpose you could use the centromere model regions (red), or the broader Chromosome Bands Ideogram definition of centromere which overlap some annotations (cytoBandIdeo table), or the regions of the assembly that are just NNNNN’s (Gap track).” (link, link)
You can see this location in the screenshot below (from the browser):

As the gap extend is slightly larger than the defined regions, we will have to use the definition of “acen” ideogram regions from the gaps table in most cases where we actually need to use the underlying sequence. The combined centromer regions for hg38 are shown below:

[table id=11 /]


Read-counting for PGT-A

ICSI, source:

This is an introductory article about Pre-Implantation Genetic Testing for Aneuploidy (PGT-A, formerly known as PGS) using sequencing-based read-counting. The procedure is simplified and I am avoiding to talk about specific companies or products here.

Increasing the chances of a successful pregnancy by looking at the embryos chromosomes

During an IVF cycle there are – in most cases – multiple embryos available for transfer to the woman hoping to become pregnant. IVF treatment is expensive and a stressful experience for the patients. The goal of the treatment must therefor be to have success after one or only a few cycles. Factors contributing to success or failure of an IVF cycle are various, many of which are still poorly understood. One of the better-understood factors is the genomic make-up of the embryo(s) transferred: If the cells have a non-standard number of chromosomes, the chances of survival are low for the embryo. 

In order to select an embryo with a high chance of implantation and pregnancy, Pre-Implantation Genetic Testing for Aneuploidy (PGT-A) can be performed. This procedure can assess the genome of an embryo at a high level by identifying the number of chromosomes (or regions of the chromsomes) found in a few cells sampled. Deeper levels of genetic analysis are of course also possible, but the impact of these findings is usually poorly understood and would lead to a selection process that we should stay clear of!

Counting chromosomes by counting sequencing reads

Determining the copy-numbers of every chromosome can be reliably done with sequencing-based approaches nowadays. One way to achieve this is the read-counting method employed by many software solutions and commercial products. The method can be outlined as followed:

  1. Extract and amplify the DNA of a few embryonic cells
  2. Generate a sequencing library 
  3. Perform shallow-depth sequencing
  4. Align the sequencing reads back to the human reference genome
  5. Count how many reads you see in the different regions and chromosomes and
  6. Determine the copy-numbers from this measurement.

This is of course vastly simplified and cannot be discussed at every level of detail here, but we can focus on the last steps a bit further.
The theory we build on is that if the retrieval of the cells, the generation of the sequencing library and the experimental protocol are kept the same, the number of sequencing reads counted in a specific genomic region should only depend on two factors:

A. The characteristics of that genomic region: Some parts of the genome are amplified more rapidly than others and will be overrepresented in the mixture.
B. The amount of input material: If there was an extra copy of the chromosome we should have more DNA going into the sequencing process and also see more reads coming out for this specific region.

Factors underlying point A are mostly based on the characteristics of the amplification reaction mentioned in step 1 above (e.g. which polymerase enzyme is used) and the base-composition of the genome itself. The latter mostly comes into play when we try to place the sequencing reads found back on the reference genome: If the genomic region the read originates from is repeated a few times on different chromosomes, this alignment process cannot reliable determine where to place the read and it has to be discarded. However, as the genome is >99% identical between individual and because the amplification reaction should be performed in a standardised way, this type of bias is relatively constant. Point A above can therefor be analysed beforehand and this amplification and mappability bias can simply be removed.  This leaves us with the differences originating from point B – which are the actual differences that we are trying to assess in the embryo.

In a standard human cell there are 2 copies of each of the autosomes (chromosomes 1 to 22), this becomes our „normal“ level. If we look at the actual numbers of reads counted, we can derive the most likely copy-numbers of the region by comparing them to our normal level:

Theoretical numbers of reads0*50100150200
Assumed copy-number of the region01234

* There will usually be a few reads erroneously found in these regions

Even with very strict protocols there will be variation in read numbers though, making the the assessment more challenging. Additional processing steps can be applied to clean and smooth the data. For this to work it is assumed that most regions are at the “normal” level and that neighbouring measurement point behave similarly in most cases.

Visualising the measured copy-numbers for each region is a key part in the embryo assessment. Autosome read counts outside of the grey shaded area in the copy-number plot above would be considered gained / lost. The sample shown seems to have 3 copies of chromosome 16. Chromosomes X & Y show copy-number 1 as this would be a male sample.

This assessment can now guide the decision which of the embryos might have the highest change of survival bases on its genomic make-up and could be selected for implantation during an IVF cycle.

Microdeletion and Microduplication Syndromes in the Human Genome

Small changes in the sequence of human chromosomes can have detrimental effects in the person’s health and development. The often affect multiple genes but are too small to be visible with traditional karyotyping methods. These changes seem to appear near regions of low complexity (repetitive sequence regions) with some consistency as novel mutations.

The following table shows a list of known syndromes and their genomic locations in hg19 (GRCh37) coordinates. It was compiled by A.Weise et al. in “Microdeletion and Microduplication Syndromes”, Journal of Histochemistry and Cytochemistry (2012).

[table id=8 /]