bioinformaticssequencing

Command-line NGS data munging

These are various useful commands to process sequencing data files as created by Illumina machines.
Inspired by various sources on the internet and own tasks.

Split Fastq files with combined paired-end data into two separate file:

awk '{if(NR%4==1){ if($0 ~ /\/1$/){p=1; print $0} else{p=0}} else{ if(p==1){print $0}} }' sample.pairs.fastq > sample.R1.fastq
awk '{if(NR%4==1){ if($0 ~ /\/2$/){p=1; print $0} else{p=0}} else{ if(p==1){print $0}} }' sample.pairs.fastq > sample.R2.fastq

Convert fastq file to fasta in a single line:

sed '/^@/!d;s//>/;N' sample1.fq > sample1.fa

Convert multi-line fasta to single-line fasta:

awk '!/^>/ { printf "%s", $0; n = "\n" } /^>/ { print n $0; n = "" } END { printf "%s", n }' sample1.fa > sample1_singleline.fa
Extract sequences by their ID from fasta file:
perl -ne 'if(/^>(\S+)/){$c=grep{/^$1$/}qw(id1 id2)}print if $c' sample1.fa

For file with IDs:

perl -ne 'if(/^>(\S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV' ids.txt sample1.fa

BWA mapping (using piping for minimal disk I/O)

bwa aln -t 8 targetGenome.fa reads.fastq | bwa samse targetGenome.fa - reads.fastq | samtools view -bt targetGenome.fa - | samtools sort - reads.bwa.targetGenome
samtools index reads.bwa.targetGenome.bam

Retain only uniquely mapping reads from bwa alignment

samtools view reads.bam | grep 'XT:A:U' | samtools view -bS -T referenceSequence.fa - > reads.uniqueMap.bam