THE USE AND ANALYSIS OF
MICROARRAY DATA
Atul Butte

Question:
How can we use microarray data to accurately address significant biological questions?

800+ organisms had been sequenced by 2002, so techniques that analyze the whole genome are available

Microarrays: 4,000 – 50,000 spots (currently up to 2.1 million spots/slide)

What might you use microarrays to analyze/explore?

What is an inherent difficulty with using microarray data (~4,000-2.1 million spots per sample)?
            Statistical significance – the usual threshold is 0.05 (= 1/20 chance of item x occurring by random chance), so thousands of spots will yield many chance hits
            Sheer computing power
            Expense – plan your experiments well!
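To make the significance problem concrete, here is a minimal sketch of the arithmetic. It assumes every spot is tested independently at the same threshold; the function names are illustrative, and the Bonferroni correction is one common conservative remedy that these notes do not name.

```python
def expected_false_positives(n_spots, alpha=0.05):
    """Expected number of spots that pass the threshold by chance alone."""
    return n_spots * alpha

def bonferroni_threshold(n_spots, alpha=0.05):
    """Per-spot threshold that keeps the chance of any false hit near alpha."""
    return alpha / n_spots

# Array sizes quoted in the notes above.
for n in (4_000, 50_000, 2_100_000):
    print(n, expected_false_positives(n), bonferroni_threshold(n))
```

At a 0.05 threshold, even the smallest array is expected to produce hundreds of chance hits, which is why replicates and external verification matter.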

Analysis:
            Normalization, adjust laser intensity
Controls:
            Microarray-based:
                        housekeeping genes (ex. RNA polymerase II), spiked controls
                        self-on-self arrays, lots of replicates
                        assumption that few genes will change
                        splines, nonlinear techniques
            Noise analysis: Principal Component Analysis (PCA), Pearson correlation (R²)
            Experimental control: Similar times of collection/age/genetic homogeneity/tissue type
            External verification:
                        Northern blot
                        SAGE
                        qPCR (also known as Real Time PCR)
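The normalization step under "Analysis" can be sketched as follows, using the assumption listed above that few genes will change. This is only one simple approach (median-centering of log ratios on a two-color array); the channel names and intensities are made up for illustration.

```python
import math
from statistics import median

def median_normalize(cy5, cy3):
    """Return log2(Cy5/Cy3) ratios shifted so that their median is zero.

    If few genes change, the median gene should have a log ratio of 0,
    so any offset is treated as a technical bias (e.g. laser intensity).
    """
    log_ratios = [math.log2(r / g) for r, g in zip(cy5, cy3)]
    center = median(log_ratios)
    return [lr - center for lr in log_ratios]

# Made-up intensities: the Cy5 channel reads uniformly 2x too bright,
# and only the last gene is truly induced.
cy3 = [100.0, 200.0, 400.0, 800.0, 100.0]
cy5 = [200.0, 400.0, 800.0, 1600.0, 1600.0]
normalized = median_normalize(cy5, cy3)
print(normalized)
```

After centering, the unchanged genes sit at a log ratio of zero and only the induced gene stands out.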

If you were making a quantitative array, what would you need to know about the density of your spots? What control(s) would you need to add to make sure you were measuring only what you thought you were?

Are microarrays quantitative?
Quantitative vs. non-quantitative microarrays:
            cDNA: ratio-based, non-quantitative
            Oligo: quantitative when using the Affymetrix design
            Striking lack of correlation between the two

Analysis II: What are the two major approaches to extracting interesting data?

            Supervised:
                        Look for significant differences between groups
                        Look for markers that characterize a tissue
                        Differential expression: absolute expression, change between samples, fold change between samples, reproducibility
                        Nearest neighbor, support vector machines

            Unsupervised:
                        Feature determination: what has interesting regulation patterns (no specific pattern)?
                                    PCA
                        Cluster determination:
                                    Nearest neighbor clustering, self-organizing maps, k-means clustering, 2-D hierarchical clustering (all addressed later)
                        Network determination: gene-gene, gene-phenotype interactions
                                    Boolean networks, Bayesian networks, relevance networks
           

Clustering: Uses dissimilarity measures and makes clusters based on them.
            Euclidean distance: Data must be normalized, or genes will cluster by apparent differences in expression level; opposing regulation will be missed (ex. a gene and its inhibitor won't cluster)
            Pearson correlation coefficient: Assumes a normal distribution and a linear relationship
            Mutual information: Unbiased by outliers, but requires categorization/bins (e.g. 'high' and 'low')
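A small sketch of the first two dissimilarity measures, using made-up expression profiles. It illustrates why un-normalized Euclidean distance groups genes by expression level, and why a gene and its inhibitor are only linked if the sign of the correlation is ignored.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson(a, b):
    """Pearson correlation coefficient between two expression profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

gene = [1.0, 2.0, 3.0, 4.0]
scaled = [10.0, 20.0, 30.0, 40.0]   # same pattern, 10x higher expression
inhibitor = [4.0, 3.0, 2.0, 1.0]    # opposite pattern

# Euclidean distance puts the gene closer to its inhibitor than to the
# scaled copy of itself, because it responds to expression level.
print(euclidean(gene, scaled), euclidean(gene, inhibitor))
# Pearson correlation sees the shared pattern (+1) and the opposing one (-1).
print(pearson(gene, scaled), pearson(gene, inhibitor))
```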

How do the different supervised/unsupervised methods work?

Hierarchical clustering: iteratively groups correlated genes – starts with small clusters, then clusters the clusters, etc.
            Displayed with a dendrogram (branch length is important)
            It is quick, but it ignores negative associations and is highly sensitive to the initial starting clusters
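The iterative grouping can be sketched in a few lines. This is a minimal illustration, assuming average linkage and a distance of 1 - r (Pearson); both are choices made here, not stated in the notes, and the gene names and values are made up.

```python
import math

def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def hierarchical_cluster(profiles):
    """Average-linkage agglomerative clustering with distance 1 - r.

    Returns the merge history as (members_a, members_b, distance) tuples;
    the distances are what a dendrogram draws as branch lengths.
    """
    def dist(g, h):
        return 1.0 - pearson(profiles[g], profiles[h])

    clusters = [(name,) for name in profiles]   # start: one gene per cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(g, h) for g in clusters[i] for h in clusters[j]]
                d = sum(dist(g, h) for g, h in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 9.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.0, 7.0, 3.0, 1.0],
}
merges = hierarchical_cluster(profiles)
for left, right, d in merges:
    print(left, right, round(d, 3))
```

The positively correlated pairs merge first; the two negatively associated groups are joined only at the very end, at a large distance, which is the "ignores negative associations" weakness noted above.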

Self-organizing maps:
            Also provide a survey of expression patterns, but cover more dimensions
            Uses a dissimilarity measure: starts with a user-specified number of cluster centroids distributed randomly across the map, and stops when the centroids no longer move.
            It is sensitive to the starting centroids and can be non-reproducible; genes can only belong to one cluster
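A minimal sketch of a self-organizing map, assuming a 1-D map, squared Euclidean distance as the dissimilarity measure, and linearly decaying learning rate and neighborhood radius. All of these are illustrative choices, and the profiles are made up. A fixed number of passes stands in for the "stop when the centroids no longer move" rule.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance, used here as the dissimilarity measure."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_som(data, n_nodes, epochs=200, seed=0):
    """Train a 1-D self-organizing map and return its node vectors."""
    rng = random.Random(seed)
    # Start each node (centroid) at a randomly chosen data point.
    nodes = [list(rng.choice(data)) for _ in range(n_nodes)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs        # decays from 1 toward 0
        rate = 0.5 * frac                  # learning rate
        radius = (n_nodes / 2) * frac      # neighborhood radius on the map
        for x in rng.sample(data, len(data)):
            best = min(range(n_nodes), key=lambda k: sq_dist(nodes[k], x))
            for k in range(n_nodes):
                if abs(k - best) <= radius:   # the winner and its map neighbors
                    nodes[k] = [w + rate * (xi - w) for w, xi in zip(nodes[k], x)]
    return nodes

def assign(data, nodes):
    """Each gene is assigned to exactly one node: the nearest one."""
    return [min(range(len(nodes)), key=lambda k: sq_dist(nodes[k], x)) for x in data]

rising = [[1.0, 2.0, 3.0, 4.0], [1.1, 2.1, 2.9, 4.2], [0.9, 1.8, 3.1, 3.9]]
falling = [[4.0, 3.0, 2.0, 1.0], [4.2, 2.9, 2.1, 1.1], [3.9, 3.1, 1.8, 0.9]]
data = rising + falling
nodes = train_som(data, n_nodes=2)
labels = assign(data, nodes)
print(labels)
```

Changing `seed` changes the starting centroids, which is the source of the non-reproducibility mentioned above; which node ends up holding which pattern is not fixed.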

            Relevance networks:
                        Every gene is compared pairwise with all other genes by plotting each pair on a scatterplot with their expression levels as coordinates. The correlation is determined, and only gene pairs correlated above a threshold are clustered.
                        Genes can be in more than one cluster
                        If the threshold is too low, the network becomes very complex.
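The pairwise comparison and thresholding can be sketched as below. This assumes the threshold is applied to the absolute Pearson correlation, so that strongly negative pairs are also linked; the gene names, values, and thresholds are made up for illustration.

```python
import math
from itertools import combinations

def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def relevance_network(profiles, threshold=0.9):
    """Return the gene pairs whose |correlation| meets the threshold."""
    edges = []
    for g, h in combinations(sorted(profiles), 2):
        r = pearson(profiles[g], profiles[h])
        if abs(r) >= threshold:
            edges.append((g, h, round(r, 3)))
    return edges

profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.0, 4.0, 6.0, 8.0, 11.0],
    "geneC": [5.0, 4.0, 3.0, 2.0, 1.0],
    "geneD": [3.0, 1.0, 4.0, 1.0, 5.0],
}
strict = relevance_network(profiles, threshold=0.9)
loose = relevance_network(profiles, threshold=0.3)
print(len(strict), len(loose))
```

With the strict threshold only the three strongly related genes are linked; with the loose one every pair is linked, which is how a low threshold makes the network very complex.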

            Principal component analysis:
                        A set of vectors that captures the variance seen in a set of samples (ordered from largest to smallest amount of variance explained)
                        Vectors calculated as linear combinations of genes
                        May not tell how to discriminate genes
                        Biological relevance of components is not always intuitive
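A minimal sketch of how the first principal component can be found, assuming rows are samples and columns are genes. Power iteration on the covariance matrix is used here only to keep the example dependency-free; in practice a library SVD routine would normally be used. The data are made up.

```python
import math

def first_principal_component(samples, iterations=200):
    """Return (component, fraction_of_variance) for the top component.

    The component is a unit vector of per-gene weights, i.e. a linear
    combination of genes.
    """
    n, p = len(samples), len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in samples]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    v = [1.0 / math.sqrt(p)] * p            # arbitrary starting direction
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    explained = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p)) for i in range(p))
    total = sum(cov[i][i] for i in range(p))
    return v, explained / total

# Four samples, three genes: genes 1 and 2 rise together, gene 3 barely moves.
samples = [
    [1.0, 2.0, 0.5],
    [2.0, 4.1, 0.4],
    [3.0, 5.9, 0.6],
    [4.0, 8.0, 0.5],
]
component, fraction = first_principal_component(samples)
print([round(x, 3) for x in component], round(fraction, 3))
```

Nearly all the variance lies along one direction that mixes the first two genes, which also illustrates why a component does not necessarily point to a single discriminating gene.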

Supervised:
            Nearest neighbors: Looks for genes matching a pre-specified pattern of expression, e.g. high in sample 1, low in sample 2. Does not find the smallest set of genes that differentiates samples.
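The nearest-neighbors idea can be sketched by ranking genes against an idealized pattern. This assumes Pearson correlation as the similarity measure and a 1/0 pattern for "high in the first group, low in the second"; gene names and values are made up.

```python
import math

def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def nearest_neighbors(profiles, pattern, k=2):
    """Return the k genes whose profiles best match the idealized pattern."""
    ranked = sorted(profiles, key=lambda g: pearson(profiles[g], pattern), reverse=True)
    return ranked[:k]

# Idealized pattern: high in the first three samples, low in the last three.
pattern = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
profiles = {
    "geneA": [9.0, 8.0, 9.0, 2.0, 1.0, 2.0],   # matches the pattern
    "geneB": [2.0, 1.0, 2.0, 9.0, 8.0, 9.0],   # opposite pattern
    "geneC": [5.0, 6.0, 4.0, 5.0, 6.0, 4.0],   # unrelated to the grouping
    "geneD": [7.0, 9.0, 8.0, 3.0, 4.0, 2.0],   # matches, more noisily
}
top = nearest_neighbors(profiles, pattern, k=2)
print(top)
```

The result is a ranked list of matching genes, not a minimal set: nothing here checks whether geneA alone would already separate the two groups.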

            Support vector machines:
                        Looks for sets of genes that distinguish biological samples. It mathematically combines expression values for different genes to further distinguish sample sets.
                        Each gene is a point in multidimensional space with location based on expression level.
                        Kernel functions may not obviously relate to biological functions
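A minimal sketch of a linear support vector machine, trained by subgradient descent on the hinge loss with a small L2 penalty. Several things here are assumptions made for illustration: each row is treated as one sample whose coordinates are the expression values of two genes, only the linear kernel is shown, and the learning rate, penalty, and data are made up. A library implementation would normally be used instead.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_linear_svm(X, y, lam=0.01, rate=0.01, epochs=200):
    """Fit weights w and bias b so that sign(w.x + b) separates the classes."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            margin = label * (dot(w, x) + b)
            # The L2 penalty shrinks the weights slightly on every step.
            w = [wi * (1.0 - rate * lam) for wi in w]
            if margin < 1.0:
                # Point is inside the margin: move the boundary away from it.
                w = [wi + rate * label * xi for wi, xi in zip(w, x)]
                b += rate * label
    return w, b

def predict(w, b, x):
    return 1 if dot(w, x) + b >= 0 else -1

# Columns: expression of gene 1 and gene 2. Labels: +1 / -1 sample classes.
X = [[8.0, 2.0], [9.0, 1.0], [7.0, 3.0], [2.0, 8.0], [1.0, 9.0], [3.0, 7.0]]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
print([round(wi, 3) for wi in w], round(b, 3))
print(predict(w, b, [8.0, 1.0]), predict(w, b, [1.0, 8.0]))
```

The learned weights are the "mathematical combination" of expression values: each gene gets a weight, and the weighted sum decides the class. With a nonlinear kernel the same idea applies, but the combination is harder to read biologically.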

How do we use these techniques to test hypotheses / how do we form hypotheses to take advantage of this data?
            Directed questions usually call for supervised methods, while exploratory questions tend to utilize unsupervised methods

"42": Now that we have an "answer," what does it mean?
            We must integrate this data with the rest of the discovery pipeline

 

A Gene Expression Map for Caenorhabditis elegans
Stuart Kim, et al.

Question:
How can we correlate large amounts of microarray data from various experiments in order to extract useful information? What can we learn from it?
Specifically: What genes cluster with known functional genes (in essential pathways)?

Problem:
Many genes have been identified (~20,000), but few have an identified function (6%), and only half are homologous with genes from other organisms.

Method:
Cy3/Cy5 12,000- or 18,000-spot cDNA microarrays, hybridized and analyzed by Stuart Kim's lab
(http://cmgm.stanford.edu/~kimlab/index_methods.html#%3CDatabase%3E)
            Collaborated with over 30 other labs
            Mutant vs. wildtype under various conditions
                        Heat shock, Ras (kinase) signaling, dauer stage, germline, sex regulation
            553 total microarrays (~1/3 12,000 spots, ~2/3 18,000 spots)
How would you try to visualize the correlated data?

Fig 1
Set up a spreadsheet: 1 gene per row, 1 array/experiment per column.
Compute Pearson correlations between genes and place the genes on a 2-D map (VxInsight, by Sandia); the z-axis/height then denotes the density of genes, giving a mountain-like appearance.
(Genes are positioned by attractive and repulsive forces.)

Mapping results:
44 mountains total
How might you test to see if the correlations are data-based and not a byproduct of the program?
Controls: Random shuffling of values, added noise, started at different points, split into two data sets, known functionally clustered genes.
Inferred physiological function for 30 of 44 mountains
Notable correlations:
Specific tissue enrichment (muscle, neuron, germline).
Cellular function (histones, ribosomal genes)
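The random-shuffling control listed under "Mapping results" can be illustrated on a single gene pair. This is only a sketch of the general idea, not the procedure used for the map itself: the values of one gene are shuffled across experiments many times, and the observed correlation is compared with what shuffling alone produces. The data, the number of shuffles, and the seed are made up.

```python
import math
import random

def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def shuffled_correlations(a, b, n_shuffles=1000, seed=0):
    """|r| values obtained after randomly shuffling b across experiments."""
    rng = random.Random(seed)
    b = list(b)                      # copy, so the caller's data is untouched
    results = []
    for _ in range(n_shuffles):
        rng.shuffle(b)
        results.append(abs(pearson(a, b)))
    return results

# Two made-up genes measured across eight experiments.
gene_1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
gene_2 = [1.2, 1.9, 3.1, 4.0, 5.2, 5.9, 7.1, 8.0]

observed = abs(pearson(gene_1, gene_2))
null = shuffled_correlations(gene_1, gene_2)
# Fraction of shuffles that match or beat the observed correlation.
fraction = sum(r >= observed for r in null) / len(null)
print(round(observed, 3), fraction)
```

If shuffled data reproduced the observed correlations (or, for the map, the same mountains), the structure would be a byproduct of the method rather than of the data.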
Oocyte clusters: Mounts 7, 11, 18
Looked at 88 neuronal genes (PDZ = anchoring domain; anchors transmembrane proteins to the cytoskeleton and holds together signaling proteins)

Mount 4: Large enrichment of protein kinases and phosphatases, potentially used for regulation since protein synthesis and degradation are not common.

They were able to subdivide the germline genes into 4 groups.

Surprising results: Tissue differentiation in the peaks