GAL file format

Posted in microarrays, screening

GenePix Array List (GAL) files are text files that describe the location, size, and name of each DNA spot on a microarray. They are therefore essential for analyzing scanned microarray images. The format defines a specific header, which is followed by the list of data columns:


ATF	1			

9	5			

Type=GenePix ArrayList V1.0				



"Block1=10000, 38780, 150, 20, 200, 18, 200"				


ArrayerSoftwareName=TAS Application Suite (MicroGrid II)				



Block	Column	Row	ID	Name

1	1	1	RP11-163J21	Clone 1

1	1	2	RP11-163J21	Clone 2


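ATF -> File conforms to the Axon Text File (ATF) format
1 -> Version number of ATF
9 -> Number of header lines before the “Block, Column, Row, …” line
5 -> Number of data columns (Block, Column, Row, Name, ID)
Type=GenePix ArrayList V1.0 -> Type of file, same for all GAL files
BlockCount=1 -> Number of blocks described in the file
BlockType=0 -> Type of block, 0 = rectangular block
BlockX=A, B, C, D, E, F, G -> The position and dimensions of block number X (Block1 in the example above):
A -> xOrigin (x position of the first feature, in µm)
B -> yOrigin (y position of the first feature, in µm)
C -> Feature diameter (in µm)
D -> xFeatures (number of features in x direction, i.e. columns)
E -> xSpacing (distance between columns, in µm)
F -> yFeatures (number of features in y direction, i.e. rows)
G -> ySpacing (distance between rows, in µm)
ScanResolution -> Optional parameter to scale the positions on higher-resolution images

Block arrangement (blocks are numbered row by row, shown here for a 4 × 4 layout):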

1	2	3	4

5	6	7	8

9	10	11	12

13	14	15	16
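As a rough sketch of how the block parameters translate into spot positions, the Python snippet below computes the centre of a feature from its column and row. It assumes that the origin refers to the centre of the first (top-left) feature of the block and that columns and rows are counted from 1. The names Block, parse_block and feature_center are made up for this illustration and are not part of any GenePix software.

```python
from typing import NamedTuple, Tuple


class Block(NamedTuple):
    """Geometry of one rectangular block, in the order of a BlockX= header line."""
    x_origin: float    # A: x position of the first feature
    y_origin: float    # B: y position of the first feature
    diameter: float    # C: feature diameter
    x_features: int    # D: number of columns
    x_spacing: float   # E: distance between neighbouring columns
    y_features: int    # F: number of rows
    y_spacing: float   # G: distance between neighbouring rows


def parse_block(value: str) -> Block:
    """Parse the part after 'BlockX=', e.g. '10000, 38780, 150, 20, 200, 18, 200'."""
    a, b, c, d, e, f, g = (field.strip() for field in value.split(","))
    return Block(float(a), float(b), float(c), int(d), float(e), int(f), float(g))


def feature_center(block: Block, column: int, row: int) -> Tuple[float, float]:
    """Return the (x, y) centre of a feature; column and row are 1-based."""
    if not (1 <= column <= block.x_features and 1 <= row <= block.y_features):
        raise ValueError("column/row outside the block")
    x = block.x_origin + (column - 1) * block.x_spacing
    y = block.y_origin + (row - 1) * block.y_spacing
    return x, y


block1 = parse_block("10000, 38780, 150, 20, 200, 18, 200")
print(feature_center(block1, 1, 1))    # (10000.0, 38780.0), the block origin
print(feature_center(block1, 20, 18))  # (13800.0, 42180.0), the last feature
```

With the Block1 line from the example file, the last of the 20 × 18 features ends up 19 column spacings and 17 row spacings away from the origin.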

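The data columns are:

  • Block: number of the block that contains the feature
  • Column: column of the feature within its block
  • Row: row of the feature within its block
  • Name: name of the substance spotted at this position
  • ID: identifier of the substance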
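To make the layout concrete, here is a minimal Python sketch that reads a GAL file into a header dictionary and a list of data rows. It assumes tab-separated fields, optional double quotes around values, and a header count that matches the number of non-empty header lines. The function name parse_gal and the shortened example text are illustrative only.

```python
import csv


def parse_gal(text: str):
    """Parse GAL text into (header dict, list of row dicts)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]  # skip empty lines
    reader = csv.reader(lines, delimiter="\t", quotechar='"')
    rows = [[field.strip() for field in row] for row in reader]

    if rows[0][0] != "ATF":
        raise ValueError("not an Axon Text File")
    # Second line: number of header lines and number of data columns
    n_header, n_columns = int(rows[1][0]), int(rows[1][1])

    # Header lines have the form key=value
    header = {}
    for row in rows[2:2 + n_header]:
        key, _, value = row[0].partition("=")
        header[key] = value

    # The line after the header names the data columns
    column_names = rows[2 + n_header][:n_columns]
    records = [dict(zip(column_names, row[:n_columns]))
               for row in rows[3 + n_header:]]
    return header, records


example = "\n".join([
    "ATF\t1",
    "3\t5",
    "Type=GenePix ArrayList V1.0",
    "BlockCount=1",
    '"Block1=10000, 38780, 150, 20, 200, 18, 200"',
    "Block\tColumn\tRow\tID\tName",
    "1\t1\t1\tRP11-163J21\tClone 1",
    "1\t1\t2\tRP11-163J21\tClone 2",
])
header, records = parse_gal(example)
print(header["Block1"])     # 10000, 38780, 150, 20, 200, 18, 200
print(records[1]["Name"])   # Clone 2
```

Because the rows are keyed by the column names found in the file, the sketch does not depend on whether Name or ID comes first.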

Command-line NGS data munging

Posted in bioinformatics, sequencing

These are useful commands for processing sequencing data files as created by Illumina machines.
They are inspired by various sources on the internet and by my own tasks.

Split Fastq files with combined paired-end data into two separate files (reads are assigned by the /1 or /2 suffix of their header line):

awk '{if(NR%4==1){ if($0 ~ /\/1$/){p=1; print $0} else{p=0}} else{ if(p==1){print $0}} }' sample.pairs.fastq > sample.R1.fastq
awk '{if(NR%4==1){ if($0 ~ /\/2$/){p=1; print $0} else{p=0}} else{ if(p==1){print $0}} }' sample.pairs.fastq > sample.R2.fastq

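Convert fastq file to fasta in a single line (selecting lines by their position in the four-line record, so that quality strings starting with @ are not mistaken for headers):

awk 'NR%4==1{print ">" substr($0,2)} NR%4==2{print}' sample1.fq > sample1.fa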
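The conversion rests on the four-line FASTQ record layout (header, sequence, "+", quality). Tracking the position within the record, rather than testing for a leading @, matters because quality strings may also begin with @. A minimal Python sketch of the same idea, assuming uncompressed input with strictly four lines per record (the function name fastq_to_fasta is made up here):

```python
from itertools import islice
from typing import Iterable, Iterator


def fastq_to_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines (header, sequence) for each four-line FASTQ record."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))  # header, sequence, '+', quality
        if not record:
            return
        if len(record) < 4 or not record[0].startswith("@"):
            raise ValueError("truncated or malformed FASTQ record")
        header = record[0].rstrip("\n")
        sequence = record[1].rstrip("\n")
        yield ">" + header[1:]
        yield sequence


fastq = [
    "@read1/1\n", "ACGT\n", "+\n", "@III\n",   # quality string starts with '@'
    "@read2/1\n", "TTGA\n", "+\n", "IIII\n",
]
print("\n".join(fastq_to_fasta(fastq)))
```

An open file handle can be passed instead of the list, since the function only iterates over lines.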

Convert multi-line fasta to single-line fasta:

awk '!/^>/ { printf "%s", $0; n = "\n" } /^>/ { print n $0; n = "" } END { printf "%s", n }' sample1.fa > sample1_singleline.fa

Extract sequences by their ID from fasta file:

perl -ne 'if(/^>(\S+)/){$c=grep{/^$1$/}qw(id1 id2)}print if $c' sample1.fa

With the IDs listed in a file (one ID per line):

perl -ne 'if(/^>(\S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV' ids.txt sample1.fa

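BWA mapping (using piping for minimal disk I/O). The samtools sort call uses the output-prefix syntax of samtools 0.1.x; samtools 1.3 and later expects the output file to be given with -o instead: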

bwa aln -t 8 targetGenome.fa reads.fastq | bwa samse targetGenome.fa - reads.fastq | samtools view -bt targetGenome.fa - | samtools sort - reads.bwa.targetGenome
samtools index reads.bwa.targetGenome.bam

Retain only uniquely mapping reads from a bwa alignment (bwa marks them with the tag XT:A:U):

samtools view reads.bam | grep 'XT:A:U' | samtools view -bS -T referenceSequence.fa - > reads.uniqueMap.bam

Laboratory Tests under CLIA

Posted in health, regulatory

Congress passed the Clinical Laboratory Improvement Amendments (CLIA) in 1988 to establish quality standards for all non-research laboratory testing that is:

  1. Performed on specimens derived from humans; and
  2. Intended to provide information for the diagnosis, prevention, and treatment of disease or impairment, or assessment of health.

The objective of the CLIA program is to ensure quality in laboratory testing and, specifically, to establish quality standards that ensure the accuracy, reliability, and timeliness of patients’ test results. The CLIA Quality System Regulations became effective on April 24, 2003. Since then, a laboratory has been required to check (verify) the manufacturer’s performance specifications provided in the package insert for:

  • Accuracy: Accuracy is verified if test results for previously tested samples fall within the stated acceptable limits.
  • Precision: Can the results be repeated multiple times on the same day and on different days by different operators?
  • Reportable range: Use known samples to confirm the upper and lower limits of the test.
  • Reference range or interval: Do the reference ranges provided by the test system’s manufacturer fit your patient population?

The number of samples needs to be established for every test; 20 samples are commonly used as a “rule of thumb”.
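As an illustration only, the accuracy and precision checks can be summarized with two simple numbers: the share of previously tested samples that fall within their acceptable limits, and the coefficient of variation of repeated measurements. CLIA does not prescribe these particular statistics; the function names and all values below are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def coefficient_of_variation(values: Sequence[float]) -> float:
    """Precision summary: sample standard deviation as a percentage of the mean."""
    return 100.0 * stdev(values) / mean(values)


def fraction_within_limits(results: Sequence[float],
                           limits: Sequence[Tuple[float, float]]) -> float:
    """Accuracy summary: share of results inside their (low, high) limits."""
    hits = sum(low <= r <= high for r, (low, high) in zip(results, limits))
    return hits / len(results)


# Hypothetical replicate measurements of one control sample
replicates = [98.0, 102.0, 100.0, 101.0, 99.0]
cv = coefficient_of_variation(replicates)
print(round(cv, 2))  # 1.58

# Hypothetical previously tested samples with their acceptable limits
results = [4.9, 7.2, 10.4]
limits = [(4.5, 5.5), (6.8, 7.6), (9.0, 10.0)]
share = fraction_within_limits(results, limits)
print(share)  # two of the three results are within limits
```

Which thresholds count as acceptable depends on the test and on the manufacturer’s stated specifications.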

The FDA defines a Laboratory Developed Test (LDT) as an in vitro diagnostic test that is manufactured by and used within a single laboratory (i.e. a laboratory with a single CLIA certificate). LDTs are also sometimes called in-house developed tests, or “home brew” tests. Similar to other in vitro diagnostic tests, LDTs are considered “devices,” as defined by the Federal Food, Drug, and Cosmetic Act (FFDCA), and are therefore subject to regulatory oversight by the FDA.

Sources: Centers for Medicare & Medicaid Services, Genohub

A DNA Sequencing History

Posted in genomics, sequencing

Major landmarks in DNA sequencing and molecular biology

Structural formula of a DNA segment (Wikipedia)

Discovery of the structure of the DNA double helix (Watson, Crick, Franklin).

Proof of the semi-conservative nature of DNA replication (Meselson, Stahl)

First DNA triplet is decoded (Matthaei, Nirenberg)

Development of recombinant DNA technology, which permits isolation of defined fragments of DNA; prior to this, the only accessible samples for sequencing were from bacteriophage or virus DNA.

The first gene is sequenced

The first complete DNA genome to be sequenced is that of bacteriophage φX174

Allan Maxam and Walter Gilbert publish “DNA sequencing by chemical degradation” [4].
Fred Sanger, independently, publishes “DNA sequencing by enzymatic synthesis”.

Fred Sanger and Wally Gilbert receive the Nobel Prize in Chemistry

GenBank starts as a public repository of DNA sequences.

Andre Marion and Sam Eletr from Hewlett Packard start Applied Biosystems in May, which comes to dominate automated sequencing.

Akiyoshi Wada proposes automated sequencing and gets support to build robots with help from Hitachi.

Restriction fragment length polymorphism fingerprinting (Jeffreys)

Medical Research Council scientists decipher the complete DNA sequence of the Epstein-Barr virus, 170 kb.

Kary Mullis and colleagues develop the polymerase chain reaction, a technique to replicate small fragments of DNA

Leroy E. Hood’s laboratory at the California Institute of Technology and Smith announce the first semi-automated DNA sequencing machine.

Applied Biosystems markets this first automated sequencing machine, the model ABI 370.

Walter Gilbert leaves the U.S. National Research Council genome panel to start Genome Corp., with the goal of sequencing and commercializing the data.

The U.S. National Institutes of Health (NIH) begins large-scale sequencing trials on Mycoplasma capricolum, Escherichia coli, Caenorhabditis elegans, and Saccharomyces cerevisiae (at 75 US cents per base).

BLAST algorithm for aligning sequences published (Lipman, Myers).

Capillary electrophoresis published (Barry Karger, Lloyd Smith, Norman Dovichi).

Official start of the Human Genome Project

Craig Venter develops strategy to find expressed genes with ESTs (Expressed Sequence Tags).

Uberbacher develops GRAIL, a gene-prediction program.

Craig Venter leaves NIH to set up The Institute for Genomic Research (TIGR).

William Haseltine heads Human Genome Sciences, to commercialize TIGR products.

Wellcome Trust begins participation in the Human Genome Project.

Simon et al. develop BACs (Bacterial Artificial Chromosomes) for cloning.

First chromosome physical maps published:

  • Page et al.: Y chromosome[28]
  • Cohen et al.: chromosome 21[29]
  • Lander: complete mouse genetic map[30]
  • Weissenbach: complete human genetic map[31]

Wellcome Trust and MRC open Sanger Centre, near Cambridge, UK.

The GenBank database migrates from Los Alamos (DOE) to NCBI (NIH).

Venter, Fraser and Smith publish the first genome sequence of a free-living organism, Haemophilus influenzae (genome size of 1.8 Mb).

Richard Mathies et al. publish on sequencing dyes (PNAS, May)[32].

Michael Reeve and Carl Fuller describe a thermostable polymerase for sequencing[8].

International HGP partners agree to release sequence data into public databases within 24 hours.

International consortium releases genome sequence of yeast S. cerevisiae (genome size of 12.1 Mb).

Yoshihide Hayashizaki’s group at RIKEN completes the first set of full-length mouse cDNAs.

Blattner, Plunkett et al. publish the sequence of E. coli (genome size of 5 Mb)[33]

The sheep “Dolly”, the first mammal cloned from an adult somatic cell, is born (Wilmut)

Phil Green and Brent Ewing of Washington University publish “phred” for interpreting sequencer data (in use since ’95)[34].

Venter starts a new company (Celera), which aims to sequence the human genome (HG) in 3 years for $300 million.

Wellcome Trust doubles support for the HGP to $330 million for 1/3 of the sequencing.

NIH & DOE goal: “working draft” of the human genome by 2001.

Sulston, Waterston et al. finish the sequence of C. elegans (genome size of 97 Mb)[35].

NIH moves up completion date for rough draft, to spring 2000.

NIH launches the mouse genome sequencing project.

First sequence of human chromosome 22 published[36].

Celera and collaborators sequence the fruit fly Drosophila melanogaster (genome size of 180 Mb), which validates Venter’s shotgun method. HGP and Celera debate issues related to data release.

HGP consortium publishes sequence of chromosome 21.[37]

HGP & Celera jointly announce working drafts of HG sequence, promise joint publication.

Estimates for the number of genes in the human genome range from 35,000 to 120,000.

International consortium completes first plant sequence, Arabidopsis thaliana (genome size of 125 Mb).

HGP consortium publishes Human Genome Sequence draft in Nature (15 Feb)[38].

Celera publishes the Human Genome sequence[39].

HapMap project initiated to decipher human genetic variation

420,000 VariantSEQr human resequencing primer sequences published on new NCBI Probe database.

Genographic project launched to study human migration

A set of closely related species (12 Drosophilidae) is sequenced, launching the era of phylogenomics.

Craig Venter publishes his full diploid genome


Source: Wikipedia and ABI

Epigenetics and Epigenomics

Posted in genomics

The human DNA sequence has been read, so now we know how the genome works and how to detect and cure genetic diseases, don’t we?

Unfortunately not, or fortunately if you are working in this area. We know the sequence of bases for a number of reference and other genomes, but we are far from knowing and understanding all the variations found between different people and their consequences. There are also other layers of information in the genome that we are only starting to understand. I am talking about the field of epigenetics here, which looks at molecular “tags” that are attached to the DNA at certain places and play a key role in activating or deactivating the genes in these places. In contrast to the actual DNA sequence, these markers are reversible and get altered during embryonic development and differentiation, i.e. when cells develop into a specific cell type, e.g. a skin cell. They also get modified in a less fortunate way as we get old and in certain disease conditions such as diabetes, inflammation or cancer. The study of these tags is called epigenetics, or epigenomics when applied to the entire genome.

More specifically, these tags are molecular modifications, mostly methyl groups, that are usually attached to the DNA base cytosine or to histones, the proteins that the DNA is wrapped around to “get in shape”.
