Chapter 2

Whole genome sequencing

Review sequencing

Look for homology (BLAST and other programs)

Gene structure - but this isn't what BLAST looks for

Homologs: Orthologs, paralogs are subsets of homologs - imply evolutionary relationships

Gene annotation - what is known. Process, function, localization, etc.

Phylogenetic trees

 

Sequence Similarity Searching:

Why and How

Why is sequence useful?

If you work with one or a few proteins or genes, it can tell you about their conservation, active sites, structure and regulation in other organisms, etc.

If you do genomics:

  • Tells you about what the organism can do,
  • Tells you about the history of the organism
  • gives you leverage for understanding how the organism works
  • you can find out more about structure, function, and evolution
  • you can learn information that is in the sequence that you might not have guessed
    • e.g. there are highly conserved non-coding regions
    • synteny - what is this?
    • some organisms do not allow repeats, some are full of transposable elements
    • lateral gene transfer appears frequent in bacteria and archaea and less frequent in eukaryotes
    • some genomes have duplicated and then genes diverge to produce new functions
    • some genomes have many spliced mRNAs, others don't
    • what else?
      • long interspersed elements (LINEs) - 1-5 kb long; 10 - 10K copies per genome
      • Pseudogenes
      • minisatellites and microsatellites - repeat units of 14-500 bp in arrays 1-5 kb long (mini); repeat units of up to 13 bp in arrays 100s of kb long (micro)
      • telomeres - short repeat unit (6 bp - TTAGGG in humans); 250-1000 repeats per chromosome end

How is sequencing done?

    Dideoxy (Sanger) sequencing, now usually run as PCR-based cycle sequencing

    • fragment length is an issue
    • algorithms for assembly
    • different approaches to scaffolds
    • challenges of repeat sequences
    • telomeres and centromeres
    • what is "finished" sequence?
    • cost has gone from $10 per base in 1985 to 0.1 cent per base in 2006.

    New types of sequencing being developed all the time - want to get to $1,000 Genome (we'll talk about this next time) - what is the cost per base of this for a human genome?
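A rough worked version of the cost-per-base arithmetic, as a sketch only. The genome size of about 3 billion bases (one haploid copy) is an assumption, not a figure from these notes, and the calculation ignores the extra coverage a real sequencing project needs.

```python
# Assumption: a haploid human genome is roughly 3 billion bases.
HUMAN_GENOME_BASES = 3_000_000_000

def cost_per_base(total_cost_dollars, n_bases):
    """Dollars per base for a genome sequenced at a given total cost."""
    return total_cost_dollars / n_bases

def genome_cost(dollars_per_base, n_bases):
    """Total cost in dollars at a given per-base price."""
    return dollars_per_base * n_bases

# $1,000 genome: price per base at single-fold coverage.
per_base_1000 = cost_per_base(1000, HUMAN_GENOME_BASES)

# 2006 price from the notes: 0.1 cent per base = $0.001 per base.
cost_at_2006_price = genome_cost(0.001, HUMAN_GENOME_BASES)

print(f"$1,000 genome: {per_base_1000:.2e} dollars per base")
print(f"One genome at the 2006 price: {cost_at_2006_price:,.0f} dollars")
```

Under these assumptions the $1,000 genome needs a per-base price thousands of times lower than the 2006 figure.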

What do you do with sequence? If you are just looking at one or a few genes, there are global (Needleman-Wunsch) and local (Smith-Waterman) alignments.
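To make the idea of a local alignment concrete, here is a minimal sketch of Smith-Waterman dynamic programming. The match, mismatch, and gap scores are illustrative assumptions, and a single linear gap penalty is used for brevity; real tools use substitution matrices and separate gap-open and gap-extend penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned a, aligned b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill: each cell is the best score of an alignment ending at (i, j),
    # floored at 0 so an alignment can start anywhere (this makes it local).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

score, top, bottom = smith_waterman("TTTACGTTT", "GGGACGTGGG")
print(score, top, bottom)
```

A global (Needleman-Wunsch) alignment differs mainly in dropping the floor at 0 and tracing back from the corner of the matrix, so both sequences are aligned end to end.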

Global versus local alignments

  • Dot plots (Have been doing this for 30 years!)
  • Used for analysis of gene structure and genome organization, detection of internal sequence repeats, RNA folding, molecular evolution
  • Dots are placed at the intersection of each row and column where the bases or amino acids are identical
  • Sequence similarity searching algorithms have resulted from a dialectic process of iterative improvement and refinement.
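The dot-plot rule described above (a dot wherever the row and column residues are identical) can be sketched in a few lines. The function names and the text rendering are illustrative choices; real dot-plot programs usually add a sliding window and a threshold to suppress noise.

```python
def dot_plot(seq_x, seq_y):
    """One text row per residue of seq_y: '*' where it equals the seq_x residue."""
    return ["".join("*" if y == x else "." for x in seq_x) for y in seq_y]

def render(seq_x, seq_y):
    """Label the columns with seq_x and the rows with seq_y."""
    lines = ["  " + seq_x]
    for y, row in zip(seq_y, dot_plot(seq_x, seq_y)):
        lines.append(y + " " + row)
    return "\n".join(lines)

print(render("GATTACA", "GATCA"))
```

Runs of dots along a diagonal mark shared segments; diagonals off the main one within a self-comparison point to internal repeats.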

BLAST tutorial from Geospiza and Geospiza tutorial site (these are useful to look at)

BLAST - reading frames, what do you start with, why would you have vector contamination? (What do molecular biologists mean by the word "vector"?)

BLAST overview: Basic Local Alignment Search Tool

  Why do BLAST searches?

Program  Description
blastp Compares an amino acid query sequence against a protein sequence database.
blastn Compares a nucleotide query sequence against a nucleotide sequence database.
blastx Compares a nucleotide query sequence translated in all reading frames against a protein sequence database. You could use this option to find potential translation products of an unknown nucleotide sequence.
tblastn Compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
tblastx Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that the tblastx program cannot be used with the nr database on the BLAST Web page because it is computationally intensive.
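The phrase "translated in all reading frames" in the table means three frames on each strand. The sketch below illustrates that idea with the standard genetic code; it is not how BLAST itself is implemented, and the function names are made up for this example.

```python
from itertools import product

# Standard genetic code, codons ordered with T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

def six_frame_translations(dna):
    """Map frame labels (+1..+3, -1..-3) to protein strings ('*' = stop)."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

for frame, protein in six_frame_translations("ATGGCC").items():
    print(frame, protein)
```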

How? (Raw score) S = aI + bX - cO - dG

  • S = the sum of identities and mismatches minus penalties for gaps (and can include the size of gaps, too)

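  • I = number of identities and a is the score for each; X = number of mismatched nucleotides and b is the score for each (usually negative); O = number of gaps and c is the penalty for opening a gap; G = total number of gapped positions and d is the penalty for each of them.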
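A small sketch of the raw-score formula S = aI + bX - cO - dG applied to an alignment that has already been made. The parameter values (a = 1, b = -3, c = 5, d = 2) are illustrative assumptions, and G is counted here as the total number of gapped positions, which is one reading of the definition above.

```python
def raw_score(top, bottom, a=1, b=-3, c=5, d=2):
    """Score two equal-length aligned strings in which '-' marks a gap.

    Returns (S, counts), where counts holds I, X, O and G.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    I = X = O = G = 0
    gap_in = None  # which sequence the current gap run is in, if any
    for t, u in zip(top, bottom):
        if t == "-" or u == "-":
            current = "top" if t == "-" else "bottom"
            if current != gap_in:  # a new gap run opens here
                O += 1
            G += 1
            gap_in = current
        else:
            gap_in = None
            if t == u:
                I += 1
            else:
                X += 1
    return a * I + b * X - c * O - d * G, {"I": I, "X": X, "O": O, "G": G}

score, counts = raw_score("ACGT--ACGT", "ACCTGGAC-T")
print(score, counts)
```

With these numbers the example has 6 identities, 1 mismatch, 2 gaps, and 3 gapped positions, so S = 6 - 3 - 10 - 6 = -13.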

How do we decide the score for substitutions? Scoring matrices - to objectify analysis.

  • NCBI discussion of PAM/BLOSUM for comparison of protein sequences
    • PAM matrices (Dayhoff) "Point Accepted Mutations"; units are accepted substitutions per 100 amino acids; built from a few global alignments. Higher-numbered PAM matrices correspond to more diverged sequences.
      • If changes were purely random, the frequency of each possible substitution would be determined by the frequencies of the different amino acids, called the "background frequency". In related proteins, the frequencies of substitutions (target frequencies) are biased towards those that do not seriously disrupt a protein's function, i.e. "accepted point mutations".
    • BLOSUM matrices (Henikoff & Henikoff) "BLOcks SUbstitution Matrix"; units are threshold percent similarity; built from many local alignments.
    • Higher-numbered BLOSUM matrices relate to more conserved sequences.
    • Example: scoring the ungapped alignment of KALMR with VAKNS using PAM120
        Sequence 1:    K    A    L    M    R
        Sequence 2:    V    A    K    N    S
        PAM120:       -4    3   -4   -3   -1     total = -9

    • Smith & Waterman (1981) (UCSD discussion - very algorithm directed)
    • (Dynamic Programming - NIST)
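The PAM120 example above can be reproduced by looking up each aligned pair in the matrix and summing. The sketch below contains only the five matrix entries that appear in the example (copied from it), not the full PAM120 matrix, and the function names are illustrative.

```python
# Only the five PAM120 entries used in the example, not the whole matrix.
PAM120_SUBSET = {
    ("K", "V"): -4,
    ("A", "A"): 3,
    ("L", "K"): -4,
    ("M", "N"): -3,
    ("R", "S"): -1,
}

def pair_score(x, y, matrix):
    """Look up a substitution score; the matrix is symmetric, so try both orders."""
    if (x, y) in matrix:
        return matrix[(x, y)]
    return matrix[(y, x)]

def alignment_score(seq1, seq2, matrix):
    """Sum the substitution scores over an ungapped, equal-length alignment."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be the same length")
    return sum(pair_score(x, y, matrix) for x, y in zip(seq1, seq2))

print(alignment_score("KALMR", "VAKNS", PAM120_SUBSET))
```

Swapping in a different matrix changes only the lookup table, which is why trying alternative matrices is cheap.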

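E = mn · 2^(-s), where m is the length of the query, n is the effective length of the database, and s is the bit score, a normalized measure of alignment quality (not of percent identity). Each additional bit halves the expected number of chance hits.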
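A minimal numerical sketch of the E-value formula, with E equal to m times n times 2 to the power of minus the bit score. The query length, database length, and bit scores below are made-up values for illustration only.

```python
def expect_value(m, n, bit_score):
    """Expected number of chance hits scoring at least bit_score.

    m: query length; n: effective database length (both in residues).
    """
    return m * n * 2.0 ** (-bit_score)

# Hypothetical search: 100-residue query against 1,000,000 database residues.
for bits in (20, 30, 40):
    print(bits, expect_value(100, 1_000_000, bits))
```

The same bit score gives a larger E value against a larger database, which is one reason E values from different searches are not directly comparable.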

Caveats

  • Remember: Similarity does not imply homology! Homology is not directly observable.
  • Database annotation errors are easily propagated by inferring homology from similarity and assigning function based on FASTA description line of a matching sequence. (Boguski, 1999)
  • Recognize putative versus experimental qualifiers for evidence of function, i.e. how do we know what this protein does? Is it experimental results or sequence similarity?
  • Choose parameters wisely: default settings do not yield de facto superior results.
  • Choose scoring matrices in an informed manner, trying alternatives. BLOSUM/PAM

Whole genome sequencing, assembly, and annotation Powerpoint

Science Breakthrough of 2007 Human Genetic Variation: We will watch this video ~ 15 minutes.

Here's a link to the ENCODE project.

Genomics has been part of many of the Science Breakthroughs in the past 10 years.

2005: Evolution in Action

2000: Genomics

 

HOMEWORK: Go to SGD and look up ADY2, SNZ1, and TOR1.

  • Where in the yeast genome is each of these genes? How do you tell this from the Systematic name?
  • retrieve the nucleotide and amino acid sequence for all 3
  • Go to NCBI, perform BLASTn and tBLASTx on both sequences. For tBLASTx try the BLOSUM80 and BLOSUM45 scoring matrices. Evaluate your results.
  • What do the hits with different BLOSUM scores tell you?
  • Back at SGD, look at the comparison resources and do an ortholog search (P-POD). Did you get the same information? What other information did you find? Look at the BLASTP hits in model organisms. Did you get different information?
  • You are doing this gene by gene. How would you do this for an entire genome of 6000 genes? (Note: it used to be possible to do "all against all" BLAST searches at NCBI, but their servers couldn't handle the load, so everyone who does this now downloads BLAST and runs the searches locally.)
  • For TOR1, look at Comparison Resources -
    • what does the Psi-BLAST tell you?
    • scroll down on the PSI-BLAST page - what do the E values for Drosophila and Anopheles (mosquito) tell you about this gene?
    • Can you find a way to determine if there is a human disease associated with a mutation in this gene? (Hint: find a way to get to OMIM at NCBI)
    • Once you are in OMIM - how easy is it to determine what diseases this might be associated with? Can you find any specific diseases?
    • How might you change the OMIM format or access so you could easily find out information?
  • What you need to realize is that the people developing these databases (some of whom you will meet this semester) are my age or slightly younger. They did not get their PhDs in genomics labs and I think these databases are having a hard time thinking about how to migrate to multigene searches.
  • Check out WormBase and let me know if you think this is any better for TOR.
  • Check out GeneCards. Was it easier to figure out diseases here? Do you feel confident you got the complete information? How would you improve this?

End of homework

Other resources:

© MWW 2008