ORF finding: the value of cDNA libraries and good ORF-finding tools, including BLAST
Gene structure
If we knew all the signals, we could identify genes directly from DNA. Since we do not, cDNA sequence would be the best way to identify genes, provided that full-length cDNAs were easy to obtain, a single library covered the entire genome, and very long sequencing reads were possible.
Using signals:
Open Reading Frame:
the portion of a nucleic acid sequence that encodes a protein
Gene-Finding Algorithms
As genome sequencing projects have turned out large amounts of DNA sequence of unknown function, it has become more important to be able to ascertain the function of specific regions based only on the sequence.
Of particular interest is finding the protein-coding regions in the DNA.
Genome Size Comparisons
Organism                      Genome size (Mb)   # of genes        Genetic unit (average gene size)
Prokaryota:
  Mycoplasma genitalium       0.58               473               1,235 bp
  Haemophilus influenzae      1.8                1,709             1,042 bp
  Escherichia coli            4.6                4,288
  Myxococcus xanthus          9.5                8,000
Archaea:
  Methanococcus jannaschii    1.7                1,738
Eukaryota:
  Saccharomyces cerevisiae    13                 6,241             2,100 bp
  Neurospora crassa           42.9               10,000 - 13,000   3,000 - 4,000 bp
  Drosophila melanogaster     165                13,601            10,000 bp
  Caenorhabditis elegans      100                18,424
  Homo sapiens                2,910              30,000 - 40,000
  Arabidopsis thaliana        125                25,498
ORF finding:
Prokaryotic genes are relatively
easy to locate:
promoters are relatively well characterized
long open reading frames suggest the presence of genes
Eukaryotic genes are much more
difficult to locate:
many types of transcription factor binding sites
they work in conjunction with each other
most of them are not well characterized
because of splicing, open reading frames tend to be shorter and splice sites
are difficult to locate
Finding prokaryotic genes:
Quite different from gene finding in eukaryotic
sequences, due to:
promoters are relatively well characterized
the higher gene density typical of prokaryotes
absence of introns in their protein coding genes
These properties generally imply that, in a prokaryotic sequence, most ORFs longer than some reasonable threshold, such as 300 or 500 base pairs, are likely to be genes.
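The length-threshold idea can be sketched in a few lines of Python. This is a minimal illustration, not any particular program's method: the function names and the choice of ATG as the only start codon are assumptions made for the example, and the threshold defaults to the 300 bp mentioned above.

```python
# Minimal six-frame ORF scan (ATG ... stop), keeping ORFs above a length threshold.
# Assumptions: uppercase/lowercase ACGT input, ATG as the only start codon.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_len=300):
    """Return (strand, frame, start, end) for ATG-to-stop ORFs of at least min_len bp.

    start/end are 0-based, end-exclusive offsets on the scanned strand
    (for strand '-', offsets refer to the reverse complement).
    """
    seq = seq.upper()
    found = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # first ATG seen since the last stop codon
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    end = i + 3  # include the stop codon
                    if end - start >= min_len:
                        found.append((strand, frame, start, end))
                    start = None
    return found
```

With a low threshold such a scan reports many spurious ORFs; raising min_len trades sensitivity to small genes for specificity, which is exactly the first difficulty listed next.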
Primary difficulties:
Very small genes will be missed
shadow genes (overlapping long ORFs on opposite DNA strands)
readthrough ORFs
To solve these problems, several methods have been devised that use different types of Markov models to capture the compositional differences among coding regions, shadow coding regions and noncoding DNA.
ECOPARSE, GENMARK and Glimmer
appear to be able to identify most protein coding genes with good specificity,
but still have difficulties in predicting the precise position of the start
of translation.
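As a rough illustration of how a Markov model can capture compositional differences, the sketch below trains two first-order models (one on coding examples, one on noncoding examples) and scores a sequence by their log-odds ratio. It is a deliberately simplified assumption: the programs named above use more elaborate, frame-aware and higher-order models, and the function names here are hypothetical.

```python
import math

BASES = "ACGT"


def train_markov(seqs, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next base | previous base)."""
    counts = {a: {b: pseudocount for b in BASES} for a in BASES}
    for s in seqs:
        s = s.upper()
        for prev, nxt in zip(s, s[1:]):
            # Skip pairs containing ambiguous characters such as N.
            if prev in counts and nxt in counts[prev]:
                counts[prev][nxt] += 1
    probs = {}
    for a in BASES:
        total = sum(counts[a].values())
        probs[a] = {b: counts[a][b] / total for b in BASES}
    return probs


def log_odds(seq, coding, noncoding):
    """Sum of log2(P_coding / P_noncoding) over adjacent base pairs in seq."""
    seq = seq.upper()
    score = 0.0
    for prev, nxt in zip(seq, seq[1:]):
        if prev in coding and nxt in coding[prev]:
            score += math.log2(coding[prev][nxt] / noncoding[prev][nxt])
    return score
```

A positive score means the sequence looks more like the coding training set than the noncoding one; a third model trained on reverse-complemented coding sequence could be added in the same way to represent shadow coding regions.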
Eukaryotic gene finding:
Types of exons
initial exons (initiation codon to 5' splice site)
internal exons (3' splice site to 5' splice site)
terminal exons (3' splice site to stop codon)
single-exon (intronless) genes (initiation codon to stop codon)
These four types of exons present different challenges for gene-finding methods, and the methods differ significantly in their ability to predict the four exon types.
The most natural way to find
genes computationally would be to mimic as closely as possible the processes
of transcription and RNA processing (splicing and polyadenylation) that define
genes biologically.
Signals in the sequence: particular
sites that are necessary
for coding the protein
Transcriptional signals
Translational signals
Splicing signals
Content measures: non-random
nature of the coding and non-coding regions themselves
codon bias: TestCode statistic (Fickett's)
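A content measure based on codon bias can be sketched as follows. This is not Fickett's TestCode statistic itself; it is a simpler, assumed illustration in which codon frequencies are estimated from a known coding sequence and then used to ask which forward reading frame of a new sequence looks most "coding-like". The function names and the floor value for unseen codons are hypothetical.

```python
import math
from collections import Counter


def codon_frequencies(cds):
    """Relative frequency of each codon in an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: n / total for codon, n in counts.items()}


def frame_scores(seq, codon_freq, floor=1e-3):
    """Average log-probability of the codons in each of the three forward frames.

    Codons absent from codon_freq get the small probability `floor`
    (an arbitrary choice for this sketch).
    """
    seq = seq.upper()
    scores = []
    for frame in range(3):
        codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
        if not codons:
            scores.append(float("-inf"))
            continue
        total = sum(math.log(codon_freq.get(c, floor)) for c in codons)
        scores.append(total / len(codons))
    return scores
```

The frame with the highest score is the one whose codons best match the reference usage; in practice the reference table would come from many known genes of the same organism.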
The transcriptional signals most
often used in gene finding are
the initiator or cap signal located at the transcription start site
the A+T-rich TATA-box signal, typically located about 30 bp upstream of the
transcription start site (TSS).
Other features known to play a
role in promoter function such as
transcriptional enhancers and silencers
Polyadenylation signal (a consensus AATAAA hexamer sequence followed by a
more complex signal (not yet characterized) located 20 to 30 bp downstream).
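The simplest use of such a signal is an exact-match scan. The sketch below looks for the AATAAA hexamer on the forward strand only; the function name is hypothetical, and real signals vary around the consensus, so exact matching is an assumption made to keep the example short.

```python
import re


def find_polya_signals(seq, hexamer="AATAAA"):
    """Return 0-based start positions of the hexamer, including overlapping hits."""
    # A lookahead pattern matches at every position where the hexamer starts,
    # without consuming characters, so overlapping occurrences are reported.
    pattern = re.compile("(?=" + re.escape(hexamer) + ")")
    return [m.start() for m in pattern.finditer(seq.upper())]
```

For example, find_polya_signals("ggAATAAAccAATAAA") reports positions 2 and 10.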
Using simple weight matrix descriptions
of the Kozak and translation termination signals in the context of the integrated
gene-finding program GENSCAN, about two thirds (66%) of translation initiation
sites and about three quarters (78%) of termination codons have been correctly
predicted.
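A weight matrix description of a signal can be sketched as below. This is a generic log-odds matrix built from aligned example sites, not GENSCAN's actual matrices; the function names, the uniform background of 0.25 and the pseudocount are all assumptions for illustration, and the input is assumed to contain only A, C, G and T.

```python
import math

BASES = "ACGT"


def build_pwm(sites, background=0.25, pseudocount=1.0):
    """Build a log2-odds weight matrix from equal-length aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for pos in range(length):
        column = [s[pos].upper() for s in sites]
        total = len(column) + pseudocount * len(BASES)
        pwm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm


def score_site(pwm, window):
    """Sum the per-position weights for one window of the matrix width."""
    return sum(col[base] for col, base in zip(pwm, window.upper()))


def best_hit(pwm, seq):
    """Return (score, position) of the highest-scoring window in seq."""
    width = len(pwm)
    seq = seq.upper()
    return max((score_site(pwm, seq[i:i + width]), i)
               for i in range(len(seq) - width + 1))
```

Given a handful of aligned initiation sites (made-up ones such as "ACCATGG" and "GCCATGG" will do for a test), best_hit slides the matrix along a sequence and returns the best-scoring position; a gene finder would instead keep every position scoring above a threshold as a candidate site.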
Splicing signals:
5' and 3' ends of the
intron (the donor and acceptor splice sites, respectively)
an internal site known as the branch point
GT-AG rule:
With a few interesting exceptions,
virtually all spliceosomal introns begin with GT and end with AG, and this
nearly invariant rule is used by the majority of gene-finding programs to
narrow the search space of possible exon and intron boundaries.
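How the GT-AG rule narrows the search space can be shown with a small sketch that enumerates every stretch beginning with GT and ending with AG. The function name and the length limits are hypothetical; a real program would also score each candidate site rather than accept every dinucleotide match.

```python
def candidate_introns(seq, min_len=4, max_len=None):
    """List (start, end) pairs where seq[start:end] begins with GT and ends with AG.

    Coordinates are 0-based and end-exclusive. min_len and max_len bound the
    candidate intron length; the defaults are arbitrary for this sketch.
    """
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [j for j in range(len(seq) - 1) if seq[j:j + 2] == "AG"]
    pairs = []
    for d in donors:
        for a in acceptors:
            end = a + 2  # include the AG dinucleotide
            length = end - d
            if length < min_len:
                continue  # acceptor lies upstream of, or too close to, the donor
            if max_len is not None and length > max_len:
                continue
            pairs.append((d, end))
    return pairs
```

Even on short sequences the number of GT/AG pairs grows quickly, which is why programs combine this rule with signal scores and content measures.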
Fickett's TestCode statistic:
examines all six reading frames
creates the complement to the sequence submitted
sees ORFs only if there is a methionine (ATG) start codon
a problem with EST (partial cDNA) sequences as the ATG might not be in the part of the gene being studied
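One way around the missing-ATG problem for partial sequences is to report stop-free stretches in all six frames without requiring a start codon. The sketch below is an assumed illustration of that idea, with a hypothetical function name and an arbitrary default length cutoff; it is not a description of any specific tool.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def open_segments(seq, min_len=90):
    """Return (strand, frame, start, end) for stop-free stretches in six frames.

    No ATG is required, so segments may begin at the edge of a partial
    sequence. Coordinates are 0-based, end-exclusive, on the scanned strand
    (for strand '-', they refer to the reverse complement).
    """
    seq = seq.upper()
    rc = seq.translate(COMPLEMENT)[::-1]
    found = []
    for strand, s in (("+", seq), ("-", rc)):
        for frame in range(3):
            seg_start = frame
            for i in range(frame, len(s) - 2, 3):
                if s[i:i + 3] in STOP_CODONS:
                    if i - seg_start >= min_len:
                        found.append((strand, frame, seg_start, i))
                    seg_start = i + 3  # next segment starts after the stop
            # Trailing segment that runs to the last complete codon.
            end = frame + ((len(s) - frame) // 3) * 3
            if end - seg_start >= min_len:
                found.append((strand, frame, seg_start, end))
    return found
```

Segments that touch either end of the sequence are the ones most likely to be fragments of a longer ORF whose start or stop codon lies outside the EST.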
Using cDNAs to identify genes:
TIGR Human Gene Index (includes Tentative Human Consensus (THC) sequences, assembled using the TIGR assembler)
TIGR EGAD - Expressed Gene Anatomy Database