Discovery of functional elements in the Drosophila genome using evolutionary signatures
      Stark, et al.

What did they want to do?
            They wanted to develop and improve methods for extracting functional information directly from the DNA sequence: exons/introns, RNA genes, regulatory elements, miRNAs, etc.

How might you go about doing this? What organism(s) would you choose?

 

How did we used to do it?
            People looked for highly conserved regions of DNA and assumed they were important (coding regions, promoter sites, etc.).
Problems: What are some problems with that method?
?
What are “evolutionary signatures”?
Protein coding
           Specific codon substitution frequency patterns
           Insertions/deletions are more often a multiple of three nucleotides long (why?)
RNA genes
           Common mutations are ones that conserve base pairing
MicroRNA
           Conserved in the stem, variable in the loop regions (draw it out)
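
The protein-coding indel signature above can be made concrete with a small sketch. It assumes one row of a gapped alignment in which '-' marks an insertion/deletion; the function names and the two example rows are invented for illustration and are not from the paper.

```python
import re

def gap_lengths(aligned_seq: str) -> list[int]:
    """Return the length of every run of '-' in one row of an alignment."""
    return [len(run) for run in re.findall(r"-+", aligned_seq)]

def frame_preserving_fraction(aligned_seq: str) -> float:
    """Fraction of gap runs whose length is a multiple of three."""
    gaps = gap_lengths(aligned_seq)
    if not gaps:
        return 1.0  # no indels, so nothing shifts the reading frame
    return sum(1 for g in gaps if g % 3 == 0) / len(gaps)

# Hypothetical aligned rows (toy data)
coding_like = "ATGGCT---GAAACC------TTTTAA"      # gaps of 3 and 6
noncoding_like = "ATGGCT-GAAACC--TTTT----TAA"    # gaps of 1, 2 and 4

print(gap_lengths(coding_like), frame_preserving_fraction(coding_like))
print(gap_lengths(noncoding_like), frame_preserving_fraction(noncoding_like))
```

A gap whose length is a multiple of three removes or adds whole codons and leaves the downstream reading frame intact, which is one way to approach the "why?" above.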

 

Why is it difficult to identify genes in metazoans (= “animals”)?
            Multiple short exons
            Alternative splicing
            Complex gene structure

Goal: To produce a computational method for recognizing and annotating protein-coding genes that rivals FlyBase

 What methods did they use to overcome these issues?

  1. Reading frame conservation (RFC) (insertions/deletions)
  2. Codon Substitution Frequencies (CSF) (synonymous/conservative aa changes)
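
To see why codon substitution patterns are informative, here is a minimal sketch that classifies aligned codon pairs as identical, synonymous, or nonsynonymous in a chosen reading frame. This is not the paper's CSF score; it only counts synonymous changes, ignores conservative amino acid changes, and uses two invented sequences. The function name and the example data are assumptions.

```python
from itertools import product

# Standard genetic code, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def classify_codon_changes(ref: str, other: str, frame: int = 0) -> dict[str, int]:
    """Count identical, synonymous and nonsynonymous codon pairs in one frame."""
    counts = {"identical": 0, "synonymous": 0, "nonsynonymous": 0}
    for i in range(frame, min(len(ref), len(other)) - 2, 3):
        a, b = ref[i:i + 3], other[i:i + 3]
        if a not in CODON_TABLE or b not in CODON_TABLE:
            continue  # skip codons containing gaps or ambiguous bases
        if a == b:
            counts["identical"] += 1
        elif CODON_TABLE[a] == CODON_TABLE[b]:
            counts["synonymous"] += 1
        else:
            counts["nonsynonymous"] += 1
    return counts

# Hypothetical ungapped alignment of two species (toy data)
ref_seq = "ATGGCTGAAACCTTT"
other_seq = "ATGGCCGAGACGTTC"

in_frame = classify_codon_changes(ref_seq, other_seq, frame=0)
out_of_frame = classify_codon_changes(ref_seq, other_seq, frame=1)
print(in_frame)
print(out_of_frame)
```

In this toy case every difference is synonymous when read in frame 0 but nonsynonymous in frame 1, which is the kind of contrast a codon-based score can exploit to pick out the true coding frame.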

Compared their results to FlyBase: 13,733 genes


How well did these methods work?


What did this automated curation contribute?
            1,193 new putative coding exons
            192 new gene models
            438 revised gene models
            120/184 = 65% of tested predictions confirmed by cDNA
            83% were incorporated into FlyBase

Did it discover novel protein-coding gene features?
            Polycistronic transcripts (207 genes)
            Stop-codon readthrough (149 genes) (not selenoproteins)
            Ribosomal release factors, A to I editing, alternative splicing
            “Programmed” frameshifts (switch reading frames with no intron) (4 genes)

Works pretty well with protein coding genes. What about RNA genes?
These sequences aren’t translated into proteins, so why should we care? What do non-translated RNAs do?
?

How do we find ncRNAs?
            Double substitutions of paired nucleotides
            G.U pair structure preservation
            394 predicted genes: 54% in intergenic regions, 32% in introns, 11% in 3’ UTRs, and 3% in 5’ UTRs; 200 in protein-coding regions
            UTRs more commonly on transcribed strand
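
The two structural signatures above (double substitutions of paired nucleotides, and preservation of G.U pairs) can be sketched as follows. The sketch assumes the secondary structure is already known and given in dot-bracket notation; the function names, the structure, and both sequences are hypothetical.

```python
from collections import Counter

# Watson-Crick pairs plus the G.U wobble pair
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def paired_positions(structure: str) -> list[tuple[int, int]]:
    """Parse dot-bracket notation into (i, j) index pairs."""
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs.append((stack.pop(), i))
    return pairs

def classify_pair_change(ref_pair: tuple[str, str], other_pair: tuple[str, str]) -> str:
    """Describe how one base pair differs between two species."""
    if ref_pair == other_pair:
        return "unchanged"
    changed = sum(a != b for a, b in zip(ref_pair, other_pair))
    if other_pair in PAIRS:
        return "compensatory double" if changed == 2 else "pair-preserving single"
    return "pair-breaking"

def summarize(ref: str, other: str, structure: str) -> Counter:
    """Tally change types over all paired positions."""
    counts = Counter()
    for i, j in paired_positions(structure):
        counts[classify_pair_change((ref[i], ref[j]), (other[i], other[j]))] += 1
    return counts

# Toy hairpin: four stem pairs around a four-base loop
structure = "((((....))))"
ref_rna = "GGCAUUCGUGCC"
other_rna = "AGCAUACGCGUU"
summary = summarize(ref_rna, other_rna, structure)
print(summary)
```

An excess of compensatory and pair-preserving changes over pair-breaking ones is the kind of evidence that suggests selection on the structure rather than on the sequence itself.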
What are some possible functions for a UTR mRNA structure?

mRNA features
A-to-I editing (adenosine deaminated to inosine)
Untranslated region (UTR) structures (2x more RNA motifs than previously known)
Many in regulatory genes (regulation of regulation)
Many implicated in localization/timing (oogenesis)

How about miRNA?
            Algorithms were designed to focus on specific RNA classes – increased accuracy
            101 miRNA structures
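
One class-specific signal such an algorithm could use is the hairpin conservation profile noted earlier (conserved stem, variable loop). The sketch below compares mean per-column identity in stem versus loop columns. It is a toy illustration, not the authors' predictor; the alignment, the structure, and the function names are assumptions.

```python
def column_identity(alignment: list[str]) -> list[float]:
    """Fraction of rows matching the first (reference) row at each column."""
    ref = alignment[0]
    n = len(alignment)
    return [sum(seq[i] == ref[i] for seq in alignment) / n for i in range(len(ref))]

def stem_loop_conservation(alignment: list[str], structure: str) -> tuple[float, float]:
    """Mean identity over stem columns and over loop columns (dot-bracket input)."""
    identity = column_identity(alignment)
    stem = [v for v, ch in zip(identity, structure) if ch in "()"]
    loop = [v for v, ch in zip(identity, structure) if ch == "."]

    def mean(values: list[float]) -> float:
        return sum(values) / len(values) if values else float("nan")

    return mean(stem), mean(loop)

# Hypothetical three-species alignment of a short hairpin (toy data)
hairpin_structure = "((((....))))"
hairpin_alignment = [
    "GGCAUUCGUGCC",
    "GGCAAUAGUGCC",
    "GGCAUCGAUGCC",
]
stem_cons, loop_cons = stem_loop_conservation(hairpin_alignment, hairpin_structure)
print(stem_cons, loop_cons)
```

Here the stem columns are fully conserved while the loop columns are not, matching the profile described for miRNA hairpins.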

 

How did they verify them?
            Compared putative sequences to RNA short sequence libraries
                        Found them in introns
            Caveat – 30 sequences were not found computationally
What were the miRNA signatures?
            Direct -> on the 5’ end
            Indirect -> homology with target sequence
            Novel miRNAs were found with less efficiency, but were also shown to be processed less efficiently

What about other regulatory motifs? (transcription factors)
            Checked conservation levels with various methods (Motif Excess Conservation)
            145 motifs discovered
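
A simplified way to think about motif conservation: count how often instances of a motif in the reference species are also present at the aligned position in every other species, then compare that rate to a control motif. This is only a sketch of the idea, not the paper's Motif Excess Conservation statistic; the function name, the motifs, and the ungapped alignment are all invented.

```python
def conserved_fraction(motif: str, alignment: list[str]) -> float:
    """Fraction of reference motif instances found at the same columns in all rows.

    alignment: equal-length, ungapped aligned sequences; row 0 is the reference.
    """
    ref, others = alignment[0], alignment[1:]
    k = len(motif)
    total = conserved = 0
    start = ref.find(motif)
    while start != -1:
        total += 1
        if all(seq[start:start + k] == motif for seq in others):
            conserved += 1
        start = ref.find(motif, start + 1)
    return conserved / total if total else 0.0

# Hypothetical three-species alignment (toy data)
motif_alignment = [
    "AATGACGTTTTGACGAA",
    "ACTGACGTATTGACGCA",
    "AATGACGTTTTCACGAA",
]
candidate_rate = conserved_fraction("TGACG", motif_alignment)
control_rate = conserved_fraction("GTTTT", motif_alignment)
print(candidate_rate, control_rate)
```

A candidate motif whose instances are conserved more often than comparable control motifs is more likely to be under selection, which is the intuition behind conservation-based motif discovery.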
How did this compare to FlyBase?
            46% of known factors were found; the factors missed were not highly conserved or occurred only once

What else might we ask about the motif?
            Tissue enrichment
                        75% of motifs were enriched or depleted in a specific tissue
            Motifs more frequent in specific tissues (e.g. ME93 in neuroblasts)

What is their location in the genome?
            Near transcription start sites (10%)
            Some have specific location/orientation (promoter vs. enhancer)

What level are these conserved elements regulating at?
            Post-transcriptional regulation – if it were to act on the RNA, which strand would you expect it to occur on?


 
And motifs in protein-coding elements?
            Overlapping selective pressures (what pressures are there in a protein-coding region?)
            Used frame-invariant searches (to avoid reading-frame bias)
            Many discovered motifs paired with non-3’ UTR sequences

What if a motif is not part of a group?
            Used ChIP data to confirm

Great, I want to do this with my favorite organism. Can I get the information I want?
            More species used = better prediction
            More clades = better prediction
            Degree of relatedness = depends what you’re looking for (long, medium, short)

Identification and Analysis of functional elements in 1% of the human genome by the ENCODE pilot project

ENCODE = Encyclopedia Of DNA Elements

What did they want to do?
            Obtain more detailed information on protein-coding human DNA

What would you do if you could gather information on protein coding DNA? What questions would you ask?
?

How did they approach this?
            Looked at ~30,000 kb of DNA (1% of genome)
            ~1/2 in well-studied regions
            ~1/2 in randomly chosen coding regions
            Used redundancy to confirm data/confirm methods

What different levels might you want to think about if you were studying the genome?

Transcription: What did they discover?
            Techniques:
            Hybridization to unbiased tiling arrays (TxFrags)
            Tag sequencing of 5’/3’ cap-selected RNA (CAGE/PETS)
            cDNA and EST databases (GENCODE)

How much of the DNA is transcribed?
            14.7% of bases (unbiased tiling arrays)
And all of that is translated, right?
            Just 47%
            Are these transcripts expressed this way universally?

 

 

So much unannotated transcription. What would you ask about it?
?

 


RACE = rapid amplification of cDNA ends (amplification from an internal position to the 5' end)

Are pseudogenes transcribed?
            At least 19% (TxFrag, RACE tiling)

What about functional RNAs that don’t encode proteins (ncRNAs)?
            EvolFold and RNAz
            269 potential loci; 56-63% confirmed (of 50 tested, using RACE/tiling)

Unspliced Transcripts (figure 4)
            93% of bases found to be transcribed via two independent observations
            Complex/intercalated transcription appears common

So transcription is more complex than we assumed. What would you explore next?


http://www.vincibiochem.it/images/chip_flow.jpg

Where does transcription start? (Table 3)

Chromatin structure/Transcription Factor binding
            Identified inter-histone regions (DNaseI sensitive)
            Supported the novel transcription start sites
            The regulatory sequences aren’t biased toward the upstream region around TSS
            TSSs were correlated with active chromatin/CpG islands
            DNaseI hypersensitive sites that were far from the gene show consistent histone modification patterns (some seem to be insulators).
            Regulatory factor binding was enriched (skewed) toward the 5’ end of transcripts

What about regulatory factor localization? Is it random?
            Looked for clusters: Found 25% of clusters around transcription start sites (TSS)

 

Can we predict where transcription will occur based on chromatin structure?
            They modeled TSSs based on DHS information with 83% accuracy
            On predicting new TSSs with this model, 74% (of 110) were near a novel TSS
            For outright expression levels, the model predicts with 91% accuracy whether an area will be expressed (CpG island information did not improve predictions)
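
A figure such as "74% were near a novel TSS" comes from comparing predicted positions against observed ones. A minimal sketch of that bookkeeping is below. The distance cutoff, the function names, and all coordinates are assumptions for illustration; the notes do not record how "near" was defined in the paper.

```python
from bisect import bisect_left

def nearest_distance(pos: int, sorted_sites: list[int]) -> int:
    """Distance from pos to the closest site in a non-empty sorted list."""
    i = bisect_left(sorted_sites, pos)
    candidates = sorted_sites[max(i - 1, 0):i + 1]  # neighbors on either side
    return min(abs(pos - s) for s in candidates)

def fraction_near(predicted: list[int], observed: list[int], max_dist: int) -> float:
    """Fraction of predicted sites within max_dist of any observed site."""
    if not predicted or not observed:
        return 0.0
    sites = sorted(observed)
    hits = sum(nearest_distance(p, sites) <= max_dist for p in predicted)
    return hits / len(predicted)

# Hypothetical coordinates (toy data), with an assumed 250 bp cutoff
observed_tss = [1000, 5000, 12000]
predicted_tss = [950, 5200, 8000, 12010]
near_fraction = fraction_near(predicted_tss, observed_tss, max_dist=250)
print(near_fraction)
```

The reported percentage depends on the cutoff chosen, so any such number is best read together with its definition of "near".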

Summary of transcription data:
            The number of modeled transcription start sites is ~10x the number of known genes
            Overall transcription levels are greater than expected
            They identified many regulatory elements that may be distal from the TSS
            They fairly successfully modeled transcription based on chromatin structure

So now we’ve studied DNA being transcribed into RNA. How about DNA to DNA (replication)? Can we model that with chromatin structure?

 

DNA structure appears to be highly predictive and therefore quite important. Is histone modification correlated with gene expression/DNaseI hypersensitivity?
            H3K4me1, H3K4me2, and H3ac vs. DNaseI hypersensitivity
            Different correlations depending on the scale
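
To illustrate how a correlation can depend on scale, the sketch below averages two signals in windows of different sizes and computes the Pearson correlation at each size. The two signals are invented so that they are uncorrelated base by base yet track each other in larger windows; they are not ENCODE data, and the function names are hypothetical.

```python
from math import sqrt

def bin_means(signal: list[float], window: int) -> list[float]:
    """Average a signal in consecutive, non-overlapping windows."""
    bins = [signal[i:i + window] for i in range(0, len(signal), window)]
    return [sum(b) / len(b) for b in bins]

def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Toy signals: alternate out of phase locally, rise together over larger spans
histone_signal = [1, 0, 1, 0, 2, 1, 2, 1]
dnase_signal = [0, 1, 0, 1, 1, 2, 1, 2]

r_fine = pearson(histone_signal, dnase_signal)
r_coarse = pearson(bin_means(histone_signal, 2), bin_means(dnase_signal, 2))
print(r_fine, r_coarse)
```

The same pair of tracks gives no correlation at single-position resolution and a perfect correlation after averaging in windows of two, so the window size should always be stated alongside the correlation.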

Since all of these reactions are going on in the nucleus simultaneously, are they all correlated in the grand scheme of things?

Fantastic, all these elements fit together! But how did it get to be this way? How are all these factors produced/constrained by evolution?

Given how important this all is, do you think most of these functional elements will be conserved? How widely?
?

 

If known functional regions aren’t constrained, then what about the rest of the ENCODE region?
            40% of constrained nucleotides have no known function
 
Why do you think we might see these results?
?

Conclusions
            A wealth of data can be obtained from such higher-level analyses
            Always check your assumptions: it’s important to think outside the box and to listen to your data!