These are various useful commands to process sequencing data files as created by Illumina machines.
Inspired by various sources on the internet and own tasks.
Split Fastq files with combined paired-end data into two separate file:
awk '{if(NR%4==1){ if($0 ~ /\/1$/){p=1; print $0} else{p=0}} else{ if(p==1){print $0}} }' sample.pairs.fastq > sample.R1.fastq awk '{if(NR%4==1){ if($0 ~ /\/2$/){p=1; print $0} else{p=0}} else{ if(p==1){print $0}} }' sample.pairs.fastq > sample.R2.fastq
Convert fastq file to fasta in a single line:
sed '/^@/!d;s//>/;N' sample1.fq > sample1.fa
Convert multi-line fasta to single-line fasta:
awk '!/^>/ { printf "%s", $0; n = "\n" } /^>/ { print n $0; n = "" } END { printf "%s", n }' sample1.fa > sample1_singleline.fa
Extract sequences by their ID from fasta file:
perl -ne 'if(/^>(\S+)/){$c=grep{/^$1$/}qw(id1 id2)}print if $c' sample1.fa
For file with IDs:
perl -ne 'if(/^>(\S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV' ids.txt sample1.fa
BWA mapping (using piping for minimal disk I/O)
bwa aln -t 8 targetGenome.fa reads.fastq | bwa samse targetGenome.fa - reads.fastq | samtools view -bt targetGenome.fa - | samtools sort - reads.bwa.targetGenome samtools index reads.bwa.targetGenome.bam
Retain only uniquely mapping reads from bwa alignment
samtools view reads.bam | grep 'XT:A:U' | samtools view -bS -T referenceSequence.fa - > reads.uniqueMap.bam