Discovery of functional elements in the Drosophila genome using evolutionary signatures
Stark, et al.
What did they want to do?
They wanted to improve/determine new methods for extracting information about genes directly from the DNA sequence: exon/introns, RNA, regulatory elements, miRNA, etc.
How might you go about doing this? What organism(s) would you choose?

How did we use to do it?
People looked for highly conserved regions of DNA and assumed they were important (coding regions, promoter sites, etc.).
What are some problems with that method?
?
What are “evolutionary signatures”?
Protein coding
Specific codon substitution frequency patterns
Insertions/deletions are more often a multiple of three nucleotides long (why?)
RNA genes
Common mutations are ones that conserve base pairing
MicroRNA
Conserved in the stem, variable in the loop regions (draw it out)
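A rough illustration of the indel signature above: in a protein-coding region, a gap whose length is a multiple of three leaves the reading frame intact, so such gaps should dominate. The sketch below counts gap lengths in one row of an alignment. The function names and the toy sequence are made up for illustration; this is not the authors' pipeline.

```python
import re

def gap_lengths(aligned_seq):
    """Return the length of every run of '-' in one row of an alignment."""
    return [len(m.group()) for m in re.finditer(r"-+", aligned_seq)]

def frame_preserving_fraction(aligned_seq):
    """Fraction of gaps whose length is a multiple of three (None if no gaps)."""
    lengths = gap_lengths(aligned_seq)
    if not lengths:
        return None
    return sum(1 for n in lengths if n % 3 == 0) / len(lengths)

# Toy example (hypothetical sequence): gaps of length 3, 6, and 1.
toy = "ATG---GCT------AAA-TGA"
print(gap_lengths(toy))                # [3, 6, 1]
print(frame_preserving_fraction(toy))  # 2 of the 3 gaps preserve the frame
```

A fraction well above one third (what random gap lengths would give) hints at coding sequence; a real analysis would pool gaps over many aligned species.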

Why is it difficult to determine genes in metazoans (= “animals”)?
Multiple short exons
Alternative splicing
Complex gene structure
Goal: To produce a computational method for recognition and annotation of protein-coding genes that rivals FlyBase
What methods did they use to overcome these issues?
Compared their results to FlyBase: 13,733 genes

How well did these methods work?

What did this automated curation contribute?
1,193 new putative coding exons
192 new gene models
438 revised gene models
120/184 = 65% cDNA confirmed predictions
83% were incorporated into FlyBase
Did it discover novel protein-coding gene features?
Polycistronic transcripts (207 genes)
Stop-codon readthrough (149 genes) (not selenoproteins)
Ribosomal release factors, A to I editing, alternative splicing
“Programmed” frameshifts (switch reading frames with no intron) (4 genes)

Works pretty well with protein coding genes. What about RNA genes?
These sequences aren’t translated into proteins, so why should we care? What do non-translated RNAs do?
?
How do we find ncRNAs?
Double substitutions of paired nucleotides
G.U pair structure preservation
394 predicted genes: 54% in intergenic regions, 32% in introns, 11% in 3’ UTRs, and 3% in 5’ UTRs; 200 in protein-coding regions
UTRs more commonly on transcribed strand
What are some possible functions for a UTR mRNA structure?
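Going back to the ncRNA signatures listed above (double substitutions of paired nucleotides, G.U pairs preserving structure), here is a minimal sketch of how paired columns could be scored between two species. The dot-bracket structure, both sequences, and the category names are hypothetical; the paper's actual scoring is not described in these notes.

```python
# Watson-Crick pairs plus the G.U wobble pair, which RNA helices tolerate.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def paired_positions(dot_bracket):
    """Return sorted (i, j) index pairs from a dot-bracket structure string."""
    stack, pairs = [], []
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs.append((stack.pop(), i))
    return sorted(pairs)

def classify_pairs(structure, ref, other):
    """Count how each paired column of ref fares in the other species."""
    counts = {"unchanged": 0, "compensatory": 0, "single_preserving": 0, "broken": 0}
    for i, j in paired_positions(structure):
        changed = (ref[i] != other[i]) + (ref[j] != other[j])
        if (other[i], other[j]) not in PAIRS:
            counts["broken"] += 1             # pairing lost in the other species
        elif changed == 0:
            counts["unchanged"] += 1
        elif changed == 2:
            counts["compensatory"] += 1       # double substitution, still pairs
        else:
            counts["single_preserving"] += 1  # one change, pairing kept via G.U
    return counts

# Toy hairpin (hypothetical): 4-bp stem, 3-nt loop.
structure = "((((...))))"
ref       = "GGCAUUUUGCC"
other     = "GAUACAUAGUC"
print(classify_pairs(structure, ref, other))
```

Many compensatory and single-preserving changes with few broken pairs is the pattern that suggests selection on the structure rather than on the sequence itself.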

mRNA features
A to I editing (adenosine to inosine (deamination))
Untranslated region (UTR) structures (2x more RNA motifs than previously known)
Many in regulatory genes (regulation of regulation)
Many implicated in localization/timing (oogenesis)
How about miRNA?
Algorithms were designed to focus on specific RNA classes – increased accuracy
101 miRNA structures

How did they verify them?
Compared putative sequences to RNA short sequence libraries
Found them in introns
Caveat – 30 sequences were not found computationally
What were the miRNA signatures?
Direct -> on the 5’ end
Indirect -> homology with target sequence
Novel miRNAs were found with less efficiency, but were also shown to be processed less efficiently
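One way to picture the indirect signature: the 5’ end of the mature miRNA pairs with sites in target transcripts, so candidate target sites can be found by searching for the reverse complement of that 5’ region. In the sketch below, treating nucleotides 2-8 as the "seed" is an assumption borrowed from common practice, not something stated in these notes, and both sequences are invented.

```python
def revcomp_rna(seq):
    """Reverse complement of an RNA string (A, C, G, U only)."""
    comp = {"A": "U", "U": "A", "G": "C", "C": "G"}
    return "".join(comp[b] for b in reversed(seq))

def seed_sites(mirna, utr, start=1, end=8):
    """0-based positions in utr that match the reverse complement of mirna[start:end].

    The defaults select nucleotides 2-8 of the miRNA (an assumed seed definition).
    """
    site = revcomp_rna(mirna[start:end])
    k = len(site)
    return [i for i in range(len(utr) - k + 1) if utr[i:i + k] == site]

# Hypothetical miRNA and 3' UTR fragment.
mirna = "UACGGAUCCAGUUCGAACGUAG"
utr = "AAGAUCCGUAAUUGAUCCGUCC"
print(seed_sites(mirna, utr))  # [2, 13]
```

In a comparative setting, one would then ask whether those sites are conserved across the aligned species more often than expected.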
What about other regulatory motifs? (transcription factors)
Checked conservation levels with various methods (Motif Excess Conservation)
145 motifs discovered
How did this compare to FlyBase?
46% of known factors found; the factors missed were not highly conserved or occurred only once
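To make "excess conservation" concrete, here is a simplified stand-in: count motif instances in a reference sequence, count how many are intact at the same columns in the other species, and compare against a background conservation rate with a binomial z-score. This is an assumption-laden sketch, not the paper's definition of the score; the function names, the ungapped toy alignment, and the background rate are all hypothetical.

```python
import math

def count_instances(motif, ref, others):
    """Return (total hits in ref, hits also intact at the same columns in all others)."""
    total = conserved = 0
    k = len(motif)
    for i in range(len(ref) - k + 1):
        if ref[i:i + k] == motif:
            total += 1
            if all(o[i:i + k] == motif for o in others):
                conserved += 1
    return total, conserved

def conservation_z(conserved, total, background_rate):
    """Binomial z-score: conserved instances vs. what the background rate predicts."""
    expected = total * background_rate
    sd = math.sqrt(total * background_rate * (1 - background_rate))
    return (conserved - expected) / sd

# Toy ungapped alignment (hypothetical): two hits in ref, one conserved.
ref = "TTGATAAGCCGATAAGTT"
others = ["ATGATAAGCCGATTAGTA"]
print(count_instances("GATAAG", ref, others))  # (2, 1)

# Genome-scale counts would be plugged in like this (made-up numbers).
print(round(conservation_z(conserved=60, total=200, background_rate=0.1), 2))
```

The background rate would normally come from control motifs of matched length and composition, which is also why motifs that occur only once are hard to score.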

What else might we ask about the motif?
Tissue enrichment
75% of motifs were enriched or depleted in a specific tissue
Motifs more frequent in specific tissues (e.g. ME93 in neuroblasts)
What is their location in the genome?
Near transcription start sites (10%)
Some have specific location/orientation (promoter vs. enhancer)
What level are these conserved elements regulating at?
Post-transcriptional regulation – if it were to act on the RNA, which strand would you expect it to occur on?
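Returning to the tissue enrichment point above: one common way to ask whether a motif is over-represented near genes expressed in a tissue is a hypergeometric test. The notes do not say which test the authors used, so the choice of test, the function name, and the numbers below are assumptions for illustration.

```python
from math import comb

def hypergeom_enrichment_p(N, K, n, k):
    """P(X >= k) when n genes are drawn from N, of which K carry the motif.

    N: all genes, K: genes with the motif, n: genes expressed in the tissue,
    k: tissue genes that carry the motif.
    """
    upper = min(K, n)
    tail = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return tail / comb(N, n)

# Toy numbers (hypothetical): 20 genes, 5 with the motif, 4 tissue genes, 2 overlap.
p = hypergeom_enrichment_p(N=20, K=5, n=4, k=2)
print(p)
```

Depletion would be tested with the lower tail, P(X <= k), instead.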

And motifs in protein-coding elements?
Overlapping selective pressures (what pressures are there in a protein-coding region?)
Used frame-invariant searches (to avoid reading-frame bias)
Many discovered motifs paired with non-3’ UTR sequences
What if a motif is not part of a group?
Used ChIP data to confirm
Great, I want to do this with my favorite organism. Can I get the information I want?
More species used = better prediction
More clades = better prediction
Degree of relatedness = depends what you’re looking for (long, medium, short)

Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project
ENCODE = Encyclopedia Of DNA Elements
What did they want to do?
Obtain more detailed information on protein-coding human DNA
What would you do if you could gather information on protein coding DNA? What questions would you ask?
?
How did they approach this?
Looked at ~30,000 kb of DNA (1% of genome)
~1/2 in well-studied regions
~1/2 in randomly chosen coding regions
Used redundancy to confirm data/confirm methods
What different levels might you want to think about if you were studying the genome?

Transcription: What did they discover?
Techniques:
Hybridization to unbiased tiling arrays (txFrags)
Tag sequencing of 5’/3’ cap-selected RNA (CAGE/PETs)
cDNA and EST databases (GENCODE)
How much of the DNA is transcribed?
14.7% of bases (unbiased tiling arrays)
And all of that is translated, right?
Just 47%
Are these transcripts expressed this way universally?
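A figure like "14.7% of bases" comes down to the union of all detected transcribed intervals divided by the region length. A minimal sketch, assuming half-open (start, end) coordinates; the intervals and region length are invented.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def covered_fraction(intervals, region_length):
    """Fraction of the region covered by at least one interval."""
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return covered / region_length

# Toy example (hypothetical): two overlapping fragments plus one separate fragment.
frags = [(0, 100), (50, 150), (300, 347)]
print(covered_fraction(frags, 1000))  # 197 of 1000 bases covered
```

Merging first matters: summing raw fragment lengths would count overlapping bases twice and inflate the percentage.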

So much unannotated transcription. What would you ask about it?
?


RACE = rapid amplification of cDNA ends (amplification from an internal position to the 5' end)

Are pseudogenes transcribed?
At least 19% (TxFrag, RACE/tiling)
What about functional RNAs that don’t encode proteins (ncRNAs)?
EvoFold and RNAz
269 potential loci, 56-63% confirmed (out of 50, using RACE/tiling)
Unspliced Transcripts (figure 4)
93% of bases found to be transcribed via two independent observations
Complex/intercalated transcription appears common

So transcription is more complex than we assumed. What would you explore next?

http://www.vincibiochem.it/images/chip_flow.jpg
Where does transcription start? (Table 3)
Chromatin structure/Transcription Factor binding
Identified inter-histone regions (DNaseI-sensitive)
Supported the novel transcription start sites
The regulatory sequences aren’t biased toward the upstream region around TSS
TSSs were correlated with active chromatin/CpG islands
DNaseI hypersensitive sites that were far from the gene show consistent histone modification patterns (some seem to be insulators).
Regulatory factor binding was enriched (skewed) toward the 5’ end of transcripts

What about regulatory factor localization? Is it random?
Looked for clusters: Found 25% of clusters around transcription start sites (TSS)

Can we predict where transcription will occur based on chromatin structure?
They modeled TSSs based on DHS information with 83% accuracy
On predicting new TSSs with this model, 74% (of 110) were near a novel TSS
For outright expression levels, the model predicts with 91% accuracy whether a region will be expressed (CpG island information did not improve predictions)
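The "74% were near a novel TSS" kind of number can be computed by asking, for each predicted site, how far the nearest observed TSS is. A sketch under stated assumptions: the notes do not define "near", so the 250 bp window, the function names, and all coordinates are hypothetical.

```python
from bisect import bisect_left

def nearest_distance(pos, sorted_sites):
    """Distance from pos to the closest site in a non-empty sorted list."""
    i = bisect_left(sorted_sites, pos)
    candidates = sorted_sites[max(i - 1, 0):i + 1]  # neighbours on either side
    return min(abs(pos - s) for s in candidates)

def fraction_near(predictions, sites, window):
    """Fraction of predictions lying within `window` bases of any site."""
    sites = sorted(sites)
    hits = sum(1 for p in predictions if nearest_distance(p, sites) <= window)
    return hits / len(predictions)

# Toy coordinates (hypothetical).
observed_tss = [1000, 5000, 9000]
predicted = [1100, 3000, 5200, 8990]
print(fraction_near(predicted, observed_tss, window=250))  # 0.75
```

The answer depends heavily on the window, so any such percentage should be read together with the distance cutoff used.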
Summary of transcription data:
The number of modeled transcription start sites is ~10x the number of known genes
Overall transcription levels are greater than expected
They identified many regulation elements that may be distal from TSS
They fairly successfully modeled transcription based on chromatin structure
So now we’ve studied DNA being transcribed into RNA. How about DNA to DNA? Can we model that with chromatin structure?

Chromatin structure appears to be highly predictive and therefore quite important. Is histone modification correlated with gene expression/DNaseI hypersensitivity?
H3K4me1, H3K4me2, and H3ac vs. DNaseI hypersensitivity
Different correlations depending on the scale
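Why would the correlation depend on scale? Two signals can disagree base by base yet rise and fall together over larger windows. The sketch below bins two made-up signal tracks at different window sizes and computes Pearson's r for each; it illustrates the idea only and does not reproduce the ENCODE analysis.

```python
from math import sqrt

def bin_means(signal, size):
    """Average the signal in consecutive windows of `size` values."""
    return [sum(signal[i:i + size]) / len(signal[i:i + size])
            for i in range(0, len(signal), size)]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

# Toy tracks (hypothetical): alternate out of phase locally, share a regional trend.
mark = [3, 0, 3, 0, 5, 2, 5, 2]
dhs  = [0, 3, 0, 3, 2, 5, 2, 5]
for size in (1, 2):
    print(size, round(pearson(bin_means(mark, size), bin_means(dhs, size)), 3))
```

Here the correlation is negative at the finest scale and positive once neighbouring positions are averaged, which is the kind of scale dependence the notes refer to.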

Since all of these reactions are going on in the nucleus simultaneously, are they all correlated in the grand scheme of things?

Fantastic, all these elements fit together! But how did it get to be this way? How are all these factors produced/constrained by evolution?
Given how important this all is, do you think most of these functional elements will be conserved? How widely?
?


If known functional regions aren’t constrained, then what about the rest of the ENCODE region?
40% of nucleotides are constrained, but have no known function
Why do you think we might see these results?
?
Conclusions
A wealth of data can be obtained from such higher-level analyses
Always check your assumptions. It’s important to think outside the box and to listen to your data!