Customize Consent Preferences

We use cookies to help you navigate efficiently and perform certain functions. You will find detailed information about all cookies under each consent category below.

The cookies that are categorized as "Necessary" are stored on your browser as they are essential for enabling the basic functionalities of the site. ... 

Always Active

Necessary cookies are required to enable the basic features of this site, such as providing secure log-in or adjusting your consent preferences. These cookies do not store any personally identifiable data.

No cookies to display.

Functional cookies help perform certain functionalities like sharing the content of the website on social media platforms, collecting feedback, and other third-party features.

No cookies to display.

Analytical cookies are used to understand how visitors interact with the website. These cookies help provide information on metrics such as the number of visitors, bounce rate, traffic source, etc.

No cookies to display.

Performance cookies are used to understand and analyze the key performance indexes of the website which helps in delivering a better user experience for the visitors.

No cookies to display.

Advertisement cookies are used to provide visitors with customized advertisements based on the pages you visited previously and to analyze the effectiveness of the ad campaigns.

No cookies to display.

Big Data with SQLite

I recently wanted to process genome-wide sequence data that I had extracted into 24 files ranging in size from 3 GB (Gigabytes) for the small chromosome Y to 26 GB for the largest chromosome 1. Further processing this data was challenging as any software script reading the files has to do it in a stream, i.e. line by line or in chunks – or use immense amounts of memory. The actual task was to extract the set of sequences from all chromosome files that are unique across the genome. Command-line tools to sort / make unique / combine would be running for many days if they would finish at all.

The solution was to make use of a database management system (DBMS). There are more powerful DBMS, but the free and simple-to-use file-based database system SQLite can handle a lot of data as well. Theoretically its file can grow up to 140 terabytes and more (ref). I had to optimize the setup in the following ways:

1. Install the latest SQLite version, compiling it as described here. For the Mac OS X it included the following:

conda install -c conda-forge fossil
fossil clone https://www.sqlite.org/src sqlite.fossil
fossil open sqlite.fossil
fossil update release
mkdir bld
cd bld
../configure
make
make sqlite3.c
make test

2. Add the “virtual tables” extension, compiling csv.c from https://github.com/sqlite/sqlite/tree/master/ext/misc

gcc -g -I. -fPIC -dynamiclib /home/felix/sqlite_extensions/csv.c -o /home/felix/sqlite_extensions/csv.dylib

3. When setting up the database, use PRAGMA settings to improve the performance of the DBMS and then load the extension:

sqlite3 sequences.db
>PRAGMA main.page_size = 4096;
>PRAGMA main.cache_size = 10000;
>PRAGMA main.locking_mode = EXCLUSIVE;
>SELECT load_extension('/home/felix/sqlite_extensions/csv.dylib');

4. Load the csv data into a virtual / temporary table first – this is extremely fast. Perform any operations on this table while inserting the data into a final table in the database. This could for example involve pulling out only unique sequences:

>CREATE VIRTUAL TABLE temp.t1 USING csv(filename="/home/felix/data/chrom1.csv");
>CREATE TABLE sequences (
    seqid INTEGER PRIMARY KEY,
    seqstring VARCHAR NOT NULL
);
INSERT INTO sequences(seqstring) SELECT DISTINCT c0 from t1;

If you need indices, add them afterwards!
Trying to remove duplicates by loading into a table with a “UNIQUE” column was too slow, btw.

5. The next steps are then to load data from other chromosomes into this table “sequences” and then pull them out into a new table “unique_sequences” to remove duplicates between different chromosomes as well:

>DROP TABLE temp.t1;
>CREATE VIRTUAL TABLE temp.t1 USING csv(filename="/home/felix/data/chrom2.csv");
>CREATE TABLE sequences (
   seqid INTEGER PRIMARY KEY,
   seqstring VARCHAR NOT NULL
);
INSERT INTO sequences(seqstring) SELECT DISTINCT c0 from t1;

# and so on for other chromosomes
# ...

>CREATE TABLE unique_sequences (
    seqid INTEGER PRIMARY KEY,
    seqstring VARCHAR NOT NULL
);
INSERT INTO unique_sequences(seqstring) SELECT DISTINCT seqstring from sequences;

In my case it was better to use a number of intermediate databases to be able to run the processing in parallel and to keep the file size smaller. The data from the different databases can then be combined into the final db with statements like this:

sqlite3 sequences.db
>PRAGMA main.page_size = 4096;
>PRAGMA main.cache_size = 10000;
>PRAGMA main.locking_mode = EXCLUSIVE;

>ATTACH DATABASE 'sequences.1.db' AS db1;
>INSERT INTO unique_sequences(seqstring) SELECT DISTINCT seqstring from db1.unique_sequences;

The final sqlite database was around 150 GB, but the performance was fine to extract sequences as needed.