Pierre Baldi

School of Information and Computer Sciences (ICS)
Institute for Genomics and Bioinformatics (IGB)
University of California at Irvine (UCI)

Probabilistic Modeling
of Biological Data

ICS 277B
Pierre Baldi

Course - Prerequisites - Textbook - Grading - Schedule - Other

ICS 277B SPECIAL TOPICS IN INFORMATION AND COMPUTER SCIENCE:
PROBABILISTIC MODELING OF BIOLOGICAL DATA

Course Goals and Description

This is a graduate-level course on probabilistic modeling of biological data. The course covers computational approaches to understanding and predicting the structure, function, interactions, and evolution of DNA, RNA, proteins, and related molecules and processes. The emphasis is on providing a unified Bayesian statistical framework for mining the large, noisy data sets that are becoming the hallmark of modern biology. The methods taught focus on developing the structure of the models, on model-fitting algorithms (machine learning), and on the application of the resulting models (data mining). Most applications will revolve around DNA, RNA, and protein sequence data and gene-expression-array data, but other types of data will also be considered depending on participants' interests.

The official catalog description is:

ICS 277B: Probabilistic Modeling of Biological Data. A unified Bayesian probabilistic framework for modeling and mining biological data. Applications range from sequence (DNA, RNA, proteins) to gene expression data. Graphical models, Markov models, stochastic grammars, neural networks, structure prediction, gene finding, evolution, DNA arrays, single- and multiple-gene analysis.

Prerequisites

A basic course in algorithms (ICS 161 or equivalent) and in molecular biology (Bio Sci 99 or equivalent), or ICS 277A (or equivalent), or consent of instructor. The course assumes some background in biology and basic knowledge of probability, statistics, and programming.

Textbooks

Bioinformatics: the Machine Learning Approach
Pierre Baldi and Soren Brunak, Second Edition, 2001, (MIT Press)

DNA Microarrays and Gene Expression: From Experiments to Data Analysis and Modeling
Pierre Baldi and G. Wesley Hatfield, 2002, (Cambridge University Press)

Grading

Students will read articles from the literature. Grading will be based on participation in class discussions, presentations, and possibly a final project requiring a computational analysis of biological data, which will result in a brief (5-10 pages) conference-style written report. Additional assignments may include homework.

Tentative Schedule

N.B.: Schedule may change to follow class interest, schedule outside speakers, etc.

	Week 1: Introduction to Bioinformatics. Probabilistic Modeling: the Bayesian Statistical Framework.
	Week 2: Graphical Models. Simple Markov Models of Biological Sequences (HMMs).
	Week 3: Hidden Markov Models of Biological Sequences.
	Week 4: HMMs, Probabilistic Models of Genes, and Gene Finding Algorithms.
	Week 5: Probabilistic Models of Evolution and Phylogenetic Trees. Stochastic Grammars and Languages.
	Week 6: Stochastic Context Free Grammars and RNA Secondary Structure. Beyond Context Free Grammars.
	Week 7: Probabilistic Modeling and Neural Networks. Machine Learning Approaches for Protein Structure Prediction.
	Week 8: Machine Learning Approaches for Other Problems (Signal Peptides, etc.). DNA Microarray Data and Gene Regulation.
	Week 9: Probabilistic Modeling of DNA MicroArrays: Single-Gene Level. Probabilistic Modeling of DNA MicroArrays: Multiple-Gene Level. Gene and Protein Networks. Systems Biology.
	Week 10: Project Presentations.

Other

Texts on reserve at the UCI Science Library

	Bioinformatics: the Machine Learning Approach by Pierre Baldi and Soren Brunak.
	DNA Microarrays and Gene Expression by Pierre Baldi and G. Wesley Hatfield.
	Biological Sequence Analysis by Richard Durbin et al.
	Introduction to Protein Structure by Carl Branden and John Tooze.
	Introduction to Computational Biology by Michael S. Waterman.
	Artificial Intelligence and Molecular Biology edited by Lawrence Hunter.
	Mathematical Methods for DNA Sequences edited by Michael S. Waterman.

Relation to Other Courses

This course is intended to complement the existing "hands-on" computer-based courses Biological Sciences 123/223 (Computer Applications in Molecular Biology/Computational Molecular Biology), which give a very practical introduction to using computer tools in molecular biology. In contrast, this course emphasizes the development of probabilistic models and machine learning approaches for the analysis of biological data. This course is also intended to closely complement the existing ICS course "Representations and Algorithms for Molecular Biology" (currently ICS 277 and scheduled to become ICS 277A). In contrast to that course, this one emphasizes modeling and analysis of biological data using a probabilistic framework. The probabilistic approach is essential to account for the biological variability brought about by evolutionary tinkering. The course can be viewed as data mining, machine learning, and probabilistic algorithms concentrated on biological data sets, especially sequence data, but also including other data sets, such as gene expression data, depending on student interest.

There is essentially no overlap between this course and ICS 246 or ICS 248. There is a small overlap with ICS 275B and with ICS 283. The overlap with ICS 275B is in the use of graphical models. Not all the graphical models used in ICS 277B, however, are Bayesian networks. Furthermore, the Bayesian networks used in ICS 277B are very specialized and come with their own algorithms (forward-backward, inside-outside, etc.). There is also a small overlap with ICS 273 (machine learning), but the approach in ICS 277B is more probabilistic and, once more, focused exclusively on biological problems. ICS 277B could benefit students who have taken ICS 275B and/or ICS 273 by deepening their understanding of graphical model and machine learning concepts and letting them apply these concepts systematically to problems in biology.

Finally, ICS 277B complements a course such as 223 (Molecular Biology and Biochemistry) by focusing on the application of computational methods to the solution of biological problems.

This course is part of the new ICS concentration: Informatics in Biology and Medicine.
