Notes:

 

Whole genome shotgun sequencing:

Look at the Wikipedia page - it's pretty good, really.

Chromosome walking is the method originally chosen by the public human genome effort. WGS works like this (from Wikipedia):

Strand                    Sequence
Original                  XXXAGCATGCTGCAGTCATGCTTAGGCTAXXXX
First shotgun sequence    XXXAGCATGCTGCAGTCATGCTXXXXXXXXXXX
                          XXXXXXXXXXXXXXXXXXXXXXTAGGCTAXXXX
Second shotgun sequence   XXXAGCATGXXXXXXXXXXXXXXXXXXXXXXXX
                          XXXXXXXXXCTGCAGTCATGCTTAGGCTAXXXX
Reconstruction            XXXAGCATGCTGCAGTCATGCTTAGGCTAXXXX
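The reconstruction step in the table above can be sketched as a greedy overlap merge. This is only a toy illustration, not how production assemblers work: the function names and the minimum overlap of 3 bases are assumptions chosen for this example, and real reads contain errors and repeats that this ignores.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    # Drop duplicates and reads wholly contained in another read.
    reads = list(dict.fromkeys(reads))
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no overlaps left; remaining reads stay as separate contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


# The four fragments from the table, with the X padding removed.
first_pass = ["AGCATGCTGCAGTCATGCT", "TAGGCTA"]
second_pass = ["AGCATG", "CTGCAGTCATGCTTAGGCTA"]

contigs_one_pass = greedy_assemble(first_pass)
contigs_both = greedy_assemble(first_pass + second_pass)
```

With only the first pass the two fragments do not overlap, so they stay as two contigs. Adding the second pass, whose break point falls in a different place, lets the pieces join into one sequence.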

"To apply the strategy, high-molecular-weight DNA is sheared into random fragments, size-selected (usually 2, 10, 50, and 150 kb), and cloned into an appropriate vector." (Note from MWW: be very careful in using the word "vector".)

It is the random nature of the shearing process that makes this work. You will find overlapping sequences if you sequence enough DNA. You will find overlaps only in the smallest sequences.

How much is enough? Coverage = NL/G, where N = number of reads, L = read length, and G = size of the genome.
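The coverage formula is easy to turn around to ask how many reads a target coverage requires. A minimal sketch follows; the genome size, read length, and 10x target are illustrative assumptions, not figures from any particular project.

```python
import math


def fold_coverage(n_reads, read_len, genome_size):
    """Coverage = N * L / G."""
    return n_reads * read_len / genome_size


def reads_needed(target_coverage, read_len, genome_size):
    """Rearranged: N = coverage * G / L, rounded up to a whole read."""
    return math.ceil(target_coverage * genome_size / read_len)


# Assumed example: a 3,000,000,000-base genome, 500-base reads, 10x coverage.
n = reads_needed(10, 500, 3_000_000_000)
c = fold_coverage(n, 500, 3_000_000_000)
print(n, c)
```

All three quantities must be in the same units (bases here).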

What are some typical genome sizes?

What did they need? High-throughput, automated sequencing; reads from both ends of each clone; quality scores to aid assembly.

From Weber and Myers:

Increasing read length and fold coverage dramatically increases contig lengths.
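One way to see why coverage matters so much is an idealized Lander-Waterman-style model. This is not the simulation from Weber and Myers. It assumes reads land uniformly at random, any overlap at all can be detected, and there are no repeats or sequencing errors. The function names are made up for this sketch.

```python
import math


def fraction_covered(c):
    """Expected fraction of the genome hit by at least one read at coverage c."""
    return 1.0 - math.exp(-c)


def expected_contigs(n_reads, c):
    """Expected number of contigs: reads whose right end is not covered by another read."""
    return n_reads * math.exp(-c)


def mean_contig_length(read_len, c):
    """Covered bases divided by number of contigs: L * (e^c - 1) / c."""
    return read_len * (math.exp(c) - 1.0) / c


# Assumed read length of 500 bases, for illustration only.
for c in (1, 2, 5, 10):
    print(c, round(fraction_covered(c), 4), round(mean_contig_length(500, c)))
```

In this model the mean contig length grows linearly with read length but roughly exponentially with coverage, which is the qualitative point of the statement above.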

Cost estimates (keep in mind you can now get a whole human genome for $1M).

 

In Jan 2007 there were 400 WGS sequencing projects - yesterday there were 3,006.

 

So, once they showed they could do this with bacteria, flies, and then pretty well with humans - what was next?

Go SAILING!!

What exists on earth? What exists on us, in us, and around us? This could probably only be done with whole genome shotgun sequencing.

Yellowstone hot springs, the ocean, armpits, woolly mammoth, Neanderthal man, and acid mine drainage...

 

One reason this works is that organisms are not uniformly distributed in the areas sampled.

Isolate DNA; size-fractionate and clone; sequence. The abundance of sequences usually mirrors the abundance of the organism, so you can determine where genes come from by both phylogeny and relative abundance.
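The relative-abundance idea can be sketched as simple counting, assuming each read has already been assigned to an organism by some other means. The taxon labels and counts below are hypothetical. Note that an organism with a larger genome contributes more reads per cell, so raw read fractions are only a rough proxy for cell abundance.

```python
from collections import Counter


def relative_abundance(read_assignments):
    """Fraction of reads assigned to each taxon."""
    counts = Counter(read_assignments)
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


# Hypothetical per-read assignments, for illustration only.
assignments = ["taxon_A"] * 60 + ["taxon_B"] * 30 + ["taxon_C"] * 10
abundance = relative_abundance(assignments)
print(abundance)
```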

 

Let's take a look: Entrez, Metagenome projects

They aren't finding millions of new organisms. What questions would you ask?

You can do functional screens. Express the clones in E. coli and look for interesting functions.

It is easier if you have the whole genome sequence of a close relative. Why?

In these approaches, individual polymorphism is an issue because you are sequencing DNA from many individuals that haven't been grown in isolation. You run a single-genome assembler and then manually post-process the resulting scaffolds to correct assembly errors. Once you have the scaffolds, you bin them based on phylogeny (examine "signatures" such as dinucleotide frequencies, codon bias, and GC content; these work for 50 kbp scaffolds). If you have known conserved rRNA sequences, you can figure out what you might have with programs like SILVA, or with tetranucleotide analysis using TETRA from Max Planck (this is from the bioinformatics paper listed as optional).
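The composition "signatures" used for binning can be sketched in a few lines. This is a simplification for illustration: it uses raw k-mer frequencies and a Euclidean distance, which is an assumption of this example and not the statistic that TETRA itself computes. It also counts one strand only, whereas a real tool would usually combine a k-mer with its reverse complement.

```python
from collections import Counter
from itertools import product


def gc_content(seq):
    """Fraction of unambiguous bases that are G or C."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def kmer_frequencies(seq, k=4):
    """Frequency vector over all 4**k k-mers (k=2 dinucleotide, k=4 tetranucleotide)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[km] for km in kmers)  # skips k-mers containing N etc.
    return [counts[km] / total if total else 0.0 for km in kmers]


def signature_distance(seq_a, seq_b, k=4):
    """Euclidean distance between two k-mer frequency vectors."""
    fa, fb = kmer_frequencies(seq_a, k), kmer_frequencies(seq_b, k)
    return sum((x - y) ** 2 for x, y in zip(fa, fb)) ** 0.5
```

Scaffolds with small distances and similar GC content are candidates for the same bin. Short sequences give noisy frequency estimates, which is one reason signatures of this kind are said to work for scaffolds of tens of kilobases.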

 

What are the most compelling metagenomics projects to you?

 

There are 2 papers in eReserves - Diego's Magnaporthe paper and a review about the rice genome.

Rice genome - rice is a monocot (flowering plants are divided into monocots and dicots). Arabidopsis is the model dicot.

The genome is 389 Mb (389,000,000 bases). How many 150 kb BAC clones are needed?
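A rough way to work the BAC question is shown below. The first number is the bare minimum if clones tiled the genome perfectly end to end, which random cloning never achieves. The second uses the standard Clarke-Carbon library-size formula, N = ln(1 - P) / ln(1 - f), where f is the insert size as a fraction of the genome. The 99% probability target is an assumption picked for illustration.

```python
import math

GENOME_SIZE = 389_000_000  # bases, from the notes
BAC_INSERT = 150_000       # bases, from the notes


def minimal_tiling(genome_size, insert_size):
    """Clones needed if they tiled the genome with no overlap and no gaps."""
    return math.ceil(genome_size / insert_size)


def clones_for_probability(genome_size, insert_size, p):
    """Clarke-Carbon: random clones needed so a given locus is present with probability p."""
    f = insert_size / genome_size
    return math.ceil(math.log(1.0 - p) / math.log(1.0 - f))


min_clones = minimal_tiling(GENOME_SIZE, BAC_INSERT)
library_99 = clones_for_probability(GENOME_SIZE, BAC_INSERT, 0.99)
print(min_clones, library_99)
```

The gap between the two numbers is the price of randomness: a random library needs several-fold redundancy before any particular region is likely to be represented.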

Current genome sequence assembly (372 Mb) - what do you think it is missing?

TIGR rice annotation 2007 - 56,278 loci, of which 6,498 have 10,432 additional alternative splicing isoforms, so the total number of transcripts is 66,710.

15,323 transposable element gene models are included in this - so, if they are removed, there are 41,478 genes!

A total of 33,882 of these gene models have been empirically validated through methods that characterize RNA transcripts.

The function of only a handful of genes is known. The genome is highly redundant - many copies of genes - which makes it very hard to identify gene function.

Cold Spring Harbor database: Oryza Map Alignment Project