# Interface '01: Bioinformatics Day

## Saturday, June 16th, Costa Mesa Room

Bioinformatics Day will bring together experts in statistics, computer science, biology, and medicine. The day will consist of four consecutive sessions on topical themes in bioinformatics, with invited talks from leading experts in each area.

#### Registration Information

• Bioinformatics Day is included in the regular registration for Interface '01.
• Single Day Registration for Bioinfo Day only is $120 (with a reduced registration fee of $50 for students).

## Biological Sequence Analysis

Costa Mesa Room
8:30am -- 10:15am

#### Comparative Genomics and the Future of Biological Knowledge

8:30am -- 9:15am
Anthony Kerlavage (Celera)

At Celera Genomics we have set our goal to become the definitive source of genomic and associated biological information that will be used by scientists to develop a better understanding of the biological processes in humans and deliver improved healthcare in the future. Using breakthrough DNA analysis technology applied to sequencing strategies pioneered at TIGR, the company has completed the sequencing of the Drosophila melanogaster, human, and mouse genomes. The whole genome shotgun method we employed enables the identification of a large number of computationally derived single nucleotide polymorphisms (SNPs). Celera's SNP Reference Database is intended to support the discovery and characterization of human genetic variation involved in disease, drug efficacy, and drug toxicity. The genetic variation data are fully integrated with the genome, gene, and protein structure data and linked to specific base pair locations on the genome. The database also features allele frequency, population data, validation status, and trace data for all variations. The sequencing of the human genome along with those of important model organisms enables the application of comparative genomics methodologies to the study of biological function. The identification of synteny between the human and mouse genomes improves the identification of genes in both organisms, enriches the functional annotation of the genes, and allows the wealth of mouse genetics data to be linked to human orthologs. The method also enables the identification of conserved regulatory regions. We have also developed techniques for creating protein families across all known proteins. This method creates Hidden Markov Models of protein families and sub-families. These HMMs can be used to classify novel proteins as they are discovered.

#### The Public Working Draft of the Human Genome

9:20am -- 9:45am
David Haussler (University of California, Santa Cruz), haussler@cse.ucsc.edu

A working group headed by Francis Collins, Eric Lander, Bob Waterston and John Sulston, in association with researchers at NCBI, EBI, UCSC and the major public sequencing centers, has produced and annotated the initial working draft of the human genome and published their findings in the Feb. 15 issue of Nature. Jim Kent at UCSC has made tremendous contributions to this effort. The public working draft sequence is currently assembled and made available at UCSC (http://genome.ucsc.edu) and a browser for the annotation of this sequence is available there, as well as at EBI (http://www.ensembl.org/). We discuss the bioinformatic challenges in assembling, annotating and periodically updating this current public working draft of the human genome.

#### Identification of Post-Translationally Modified and Mutated Proteins via Mass-Spectrometry

9:50am -- 10:15am
Pavel Pevzner (University of California, San Diego), ppevzner@cs.ucsd.edu

Although protein identification by matching tandem mass spectra (MS/MS) against protein databases is a widespread tool in mass spectrometry, the question of the reliability of such searches remains open. Most MS/MS database search algorithms rely on variations of the Shared Peaks Count approach, which scores pairs of spectra by the peaks (masses) they have in common. Although this approach has proved useful, it has a high error rate in the identification of mutated and modified peptides. We describe new MS/MS database search algorithms that implement the spectral convolution and spectral alignment approaches to peptide identification. We further analyze these approaches to the identification of modified peptides and demonstrate their advantages over the Shared Peaks Count.
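The scoring idea in this abstract can be illustrated with a minimal sketch. It is not the speaker's implementation: the peak lists are made-up integer masses, the `tolerance` parameter is an assumption, and `spectral_convolution` is a simplified version that only tallies pairwise mass differences. The example shows why a single modification (here a hypothetical +80 shift on the later peaks) lowers the shared peaks count while still leaving a clear signal at offset 80 in the convolution.

```python
from collections import Counter


def shared_peaks_count(spectrum_a, spectrum_b, tolerance=0.5):
    """Count one-to-one peak matches whose masses differ by at most `tolerance`."""
    a = sorted(spectrum_a)
    b = sorted(spectrum_b)
    i = j = shared = 0
    # Two-pointer sweep over both sorted peak lists.
    while i < len(a) and j < len(b):
        diff = a[i] - b[j]
        if abs(diff) <= tolerance:
            shared += 1
            i += 1
            j += 1
        elif diff < 0:
            i += 1
        else:
            j += 1
    return shared


def spectral_convolution(experimental, theoretical):
    """Map each (rounded) mass offset to the number of peak pairs with that offset."""
    return Counter(round(e - t) for e in experimental for t in theoretical)


# Hypothetical peak masses; the last three peaks of `modified` are shifted by +80.
theoretical = [71, 185, 282, 413, 500]
modified = [71, 185, 362, 493, 580]

print("shared peaks:", shared_peaks_count(theoretical, modified))
offsets = spectral_convolution(modified, theoretical)
print("pairs at offset 0:", offsets[0], "pairs at offset 80:", offsets[80])
```

In this simplified form the value of the convolution at offset 0 plays the role of the shared peaks count, and a second strong offset hints at a mass shift that a shared-peaks score alone would miss.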

## Gene Expression Data Analysis

Costa Mesa Room
10:45am -- 12:30pm

#### Improved Statistical Inference from DNA Microarray Data Using Analysis of Variance and a Bayesian Statistical Framework

10:45am -- 11:30am
G. Wesley Hatfield (University of California, Irvine), gwhatfie@uci.edu

The recent availability of complete genomic sequences and/or large numbers of cDNA clones from model organisms, coupled with technical advances in DNA arraying technology, has made it possible to study genome-wide patterns of gene expression. However, despite these rapid technological developments, the statistical tools required to analyze DNA microarray data are not in place. DNA microarray data often consist of expression measures for thousands of genes, but experimental replication at the level of single genes is often low. This creates problems of statistical inference, since many genes show large changes in gene expression by chance alone. Therefore, to interpret data from DNA microarrays it is necessary to employ statistical methods capable of distinguishing chance occurrences from biologically meaningful data. Commonly used software packages are poorly suited for the statistical analysis of DNA microarray data. However, we have created a program, Cyber-T, which accommodates this approach. Cyber-T is available for online use at the genomics Web site at the University of California at Irvine (http://www.genomics.uci.edu). This program is ideally suited to experimental designs in which replicate control measurements are being compared to replicate experimental measurements. In the study reported here, we use the statistical tools incorporated into Cyber-T to compare and analyze the gene expression profiles obtained from a wild-type strain of *Escherichia coli* and an otherwise isogenic strain lacking the gene for the global regulatory protein, integration host factor (IHF). Several decades of work with this model organism have produced a wealth of information about its operon-specific and global gene regulation patterns. This information makes it possible to evaluate the accuracy of data obtained from DNA microarray experiments, and to identify data analysis methods that optimize the differentiation of genes differentially expressed for biological reasons from false positives (genes that appear to be differentially expressed due to chance occurrences). We apply different statistical methods for identifying genes showing changes in expression to this data set and show that a Bayesian approach identifies a stronger set of genes as being significantly up- or down-regulated based on our biological understanding of IHF regulation. We show that commonly used approaches for identifying genes as being up- or down-regulated (i.e., simple t-test or fold change thresholds) require more replication to approach the same level of reliability as Bayesian statistical approaches applied to data sets with more modest levels of replication. We further show that statistical tests identify a different set of genes than those based on fold change, and argue that a set of genes identified by fold change is more likely to harbor experimental artifacts.
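The contrast between fold-change thresholds and t-statistics drawn in this abstract can be sketched as follows. This is an illustration only and not the Cyber-T code: the expression values are invented, and the optional `prior_var` / `prior_weight` arguments are hypothetical names for a simple weighted-average variance shrinkage that stands in for a Bayesian variance estimate.

```python
import math
from statistics import mean, variance


def log2_fold_change(control, treated):
    """Log2 ratio of mean treated expression to mean control expression."""
    return math.log2(mean(treated) / mean(control))


def t_statistic(control, treated, prior_var=None, prior_weight=0):
    """Unequal-variance two-sample t statistic.

    If `prior_var` is given, each sample variance is replaced by a weighted
    average of `prior_var` and the sample variance (an assumed, simplified
    form of shrinkage toward a background variance).
    """
    def shrunk_var(xs):
        s2 = variance(xs)
        if prior_var is None:
            return s2
        dof = len(xs) - 1
        return (prior_weight * prior_var + dof * s2) / (prior_weight + dof)

    se = math.sqrt(shrunk_var(control) / len(control)
                   + shrunk_var(treated) / len(treated))
    return (mean(treated) - mean(control)) / se


# Made-up replicate measurements for two hypothetical genes.
noisy_ctrl, noisy_trt = [100, 400, 250], [900, 300, 1050]    # 3-fold, high variance
steady_ctrl, steady_trt = [200, 210, 190], [300, 310, 290]   # 1.5-fold, low variance

for name, c, t in [("noisy", noisy_ctrl, noisy_trt),
                   ("steady", steady_ctrl, steady_trt)]:
    print(name, round(log2_fold_change(c, t), 2), round(t_statistic(c, t), 2))
```

Under these assumptions the noisy gene ranks first by fold change while the steady gene ranks first by t statistic, which is the kind of disagreement the abstract refers to. Passing a `prior_var` larger than the observed variance pulls an extreme t statistic toward zero when replication is low.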

#### Statistical Issues, Data Analysis, and Modelling for Gene Expression Profiling

11:35am -- 12:00pm
Mike West (Duke University), mw@stat.duke.edu

The talk will cover aspects of statistical analysis of oligonucleotide microarray data, with a specific focus on expression profiling in cancer. Topics to be discussed and exemplified include issues arising in developing predictive regression models for evaluating and characterising expression patterns associated with clinical or physiological states, i.e., formal statistical approaches to the canonical molecular phenotyping problem. I will discuss the use of singular-value regression ideas within a Bayesian framework that addresses the "large $p$, small $n$" regression problem posed in such applications. Breast cancer studies highlight critical challenges, including data quality, definition, variable (gene) selection, and aspects of model evaluation using cross-validation. Two specific applications in breast cancer will be used to convey and review basic data analysis and modelling issues, as well as to highlight conceptual and practical research challenges in this area. The use of such expression data in developing models of gene interactions that may relate to underlying regulatory networks will be mentioned, as will a range of issues germane to the oligonucleotide technology.

#### Plaid Models for DNA Microarrays

12:05pm -- 12:30pm
Art Owen (Stanford), art@stat.stanford.edu

This talk describes the plaid model, a tool for exploratory analysis of multivariate data. The motivating application is the search for interpretable biological structure in gene expression microarray data. Interpretable structure can mean that a set of genes has a similar expression pattern in the samples under study, or in just a subset of them (such as the cancerous samples). A set of genes behaving similarly in a set of samples defines what we call a "layer". Layers are very much like clusters, except that genes can belong to more than one layer or to none of them, a layer may be defined with respect to only a subset of the samples, and the role of genes and samples is symmetric in our formulation. The plaid model is a superposition of two-way ANOVA models, each defined over subsets of genes and samples. We will present the plaid model, an interior point style algorithm for fitting it, and some examples from yeast DNA arrays and other problems. This is joint work with Laura Lazzeroni.
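The "superposition of two-way ANOVA models" described above can be written out as a formula. The notation below is an assumption introduced here for illustration and does not appear in the abstract: $Y_{ij}$ is the expression of gene $i$ in sample $j$, $K$ is the number of layers, and $\rho_{ik}$ and $\kappa_{jk}$ are 0/1 indicators of whether gene $i$ and sample $j$ belong to layer $k$.

```latex
Y_{ij} \approx \mu_0 + \sum_{k=1}^{K} \left( \mu_k + \alpha_{ik} + \beta_{jk} \right) \rho_{ik} \, \kappa_{jk}
```

Each term in the sum is a two-way ANOVA model (layer mean $\mu_k$, gene effect $\alpha_{ik}$, sample effect $\beta_{jk}$) that is switched on only for the genes and samples in that layer. Because the indicators are not constrained to sum to one, a gene may fall in several layers or in none, as the abstract states.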

## Medical Informatics

Costa Mesa Room
2:00pm -- 3:00pm

#### Integrating Data and Disciplines: Biostatistics and Biomedical Informatics

2:00pm -- 2:25pm
Joyce Niland (City of Hope), JNiland@coh.org

As predicted by Dr. Richard Klausner, Director of the National Cancer Institute (NCI), "Biomedical informatics is the future of integrating science and medical research." In response to the explosion of data from the sequencing of the human genome and the critical objective of correlating genotypic and phenotypic data, informatics tools for collecting, managing, retrieving, and analyzing both clinical and genetic data are needed. At City of Hope we have formed a Division of Information Sciences to create and maintain such systems, linking the highly synergistic and interdependent disciplines of Biostatistics and Biomedical Informatics. These systems support the conduct of clinical research, the mining of genetic data, and the ultimate integration of data stemming from numerous source data systems. Internet-based interfaces and data warehousing principles factor strongly in the systems under development through collaborative work between faculty and staff members in Biostatistics and Biomedical Informatics. Several key data systems will be described, along with the human, information and technological issues that must be considered in their creation and maintenance. Our research data warehouse architecture will be outlined, representing an optimal form of data integration in support of clinical and genetic research.

#### The Trouble with Text: Challenges and Promises of Biomedical Information Retrieval Technology

2:30pm -- 2:55pm
Wanda Pratt (University of California, Irvine), pratt@ics.uci.edu

The ultimate motivation for bioinformatics researchers is to improve health care, but the traditional publication and retrieval of results in textual journal articles has become a bottleneck in the dissemination of scientific advances. The millions of available articles, even within a narrow subfield, easily overwhelm both biomedical researchers and health-care providers, regardless of whether they are pursuing a research hypothesis or deciding how to care for a particular patient. Many characteristics of traditional word-based approaches make efficient and effective retrieval of biomedical texts particularly challenging. In this presentation, I will discuss examples of recent approaches that take advantage of common knowledge of biomedicine and representation standards to address these challenges and provide improved organization and retrieval of biomedical textual information.

## Automated Analysis of Brain Images

Costa Mesa Room
3:30pm -- 5:15pm

#### On Metrics and Variational Equations of Computational Anatomy

3:30pm -- 4:15pm
Michael Miller (Johns Hopkins University), mim@cis.jhu.edu

We review recent advances in the emerging discipline of computational anatomy. We begin by defining anatomy as an orbit under groups of diffeomorphisms. Metrics on the orbits are defined via geodesic distance in the spaces of diffeomorphisms. This induces a metric on the space of anatomical images. We then define estimation problems for recovering, from medical images, the underlying diffeomorphisms that generate the anatomies. The associated variational problems involve finding the infimum-length (energy) diffeomorphism connecting the medical imagery with the underlying anatomical structures. Euler-Lagrange equations will be discussed. Applications will be described to the study of the neocortex of the macaque, tumor generation, and hypothesis testing in neuropsychiatric applications in aging and schizophrenia.

#### Visual Analysis of Variance: A Tool for Quantitative Assessment of fMRI Data Processing and Analysis

4:20pm -- 4:45pm
William F. Eddy (Carnegie Mellon University), bill@stat.cmu.edu
R. L. McNamee (University of Pittsburgh), rlandes@neuronet.pitt.edu

The raw data from an fMRI experiment are pre-processed before statistical analysis in order to produce useful results. This includes an inverse Fourier transform and other steps, each intended to improve the quality of the resulting images. Each pre-processing step has been justified by careful study of some prior experiment. On the other hand, the formal statistical analysis often contains within it some internal assessment of the quality of the experimental data, such as the residual sum of squares from a fitted model. Here, we propose an analog to the analysis of variance, which we call a Visual Analysis of Variance (VANOVA), as a tool for assessing the importance of each pre-processing step. A VANOVA provides both quantitative and visual information which aid in the assignment of causes of variability and the determination of statistical significance. In fact, a VANOVA is the natural extension, to the pre-processing, of an ANOVA assessment of the statistical modeling. Because the VANOVA is not intended to be as formal as an ANOVA, it can include evaluation of pre-processing steps which are neither linear nor orthogonal, both requirements of the partitions in an ANOVA.

#### Positron Emission Tomography: Image Formation and Analysis

4:50pm -- 5:15pm
Richard Leahy (University of Southern California), leahy@sipi.usc.edu

Positron Emission Tomography (PET) is a powerful medical imaging modality for investigating human and animal biochemistry and physiology. Detection of photon pairs produced by positron-electron annihilation produces tomographic projections of the spatial distribution of positron-emitting nuclei. Tomographic reconstruction methods can then be used to form volumetric images. By labelling biochemicals with positron-emitting nuclei, we can produce images of a wide variety of biochemical and physiological processes. PET is now widely used in detecting and staging cancer through imaging of glucose. In the brain, receptor and transmitter ligands have been developed to study the dopamine and other neurochemical systems. The most exciting recent development in PET is the ability to directly image gene expression through the use of PET tracer/reporter gene combinations. In humans this technique can be used, for example, to monitor the efficacy of gene therapy techniques. PET gene expression imaging is also being increasingly used to study gene expression in transgenic animals. Essential to the success of positron tomography in these diverse applications is a combination of instrumentation design optimized for specific applications (e.g. humans vs. small animals) and image processing methods that maximize image quality when forming volumetric images from PET data. After reviewing the instrumentation and principles behind PET, I will describe statistically based approaches to reconstructing PET images. Using a Bayesian formulation, we combine accurate models of the physics underlying PET systems with accurate statistical models for the photon-limited data collected in these systems. I will illustrate the impact of this approach on image quality through examples from clinical and animal studies. I will also describe our current work on evaluating image quality through a combination of theoretical, Monte Carlo and human-observer studies.