Command-line NGS data munging

May 4, 2017

bioinformatics sequencing

These are various useful commands to process sequencing data files as created by Illumina machines.
Inspired by various sources on the internet and own tasks.

Split Fastq files with combined paired-end data into two separate file:

awk '{if(NR%4==1){ if($0 ~ /\/1$/){p=1; print $0} else{p=0}} else{ if(p==1){print $0}} }' sample.pairs.fastq > sample.R1.fastq
awk '{if(NR%4==1){ if($0 ~ /\/2$/){p=1; print $0} else{p=0}} else{ if(p==1){print $0}} }' sample.pairs.fastq > sample.R2.fastq

Convert fastq file to fasta in a single line:

sed '/^@/!d;s//>/;N' sample1.fq > sample1.fa

Convert multi-line fasta to single-line fasta:

awk '!/^>/ { printf "%s", $0; n = "\n" } /^>/ { print n $0; n = "" } END { printf "%s", n }' sample1.fa > sample1_singleline.fa

Extract sequences by their ID from fasta file:

perl -ne 'if(/^>(\S+)/){$c=grep{/^$1$/}qw(id1 id2)}print if $c' sample1.fa

For file with IDs:

perl -ne 'if(/^>(\S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV' ids.txt sample1.fa

BWA mapping (using piping for minimal disk I/O)

bwa aln -t 8 targetGenome.fa reads.fastq | bwa samse targetGenome.fa - reads.fastq | samtools view -bt targetGenome.fa - | samtools sort - reads.bwa.targetGenome
samtools index reads.bwa.targetGenome.bam

Retain only uniquely mapping reads from bwa alignment

samtools view reads.bam | grep 'XT:A:U' | samtools view -bS -T referenceSequence.fa - > reads.uniqueMap.bam

Felix Kokocinski

Related Articles

Unlocking Business Potential: Exploring the Value of External Bioinformatics Consultants

SAM format summary

Decoding REST: Simplifying Interconnected IT Systems