Darya Chudova

Research projects


What makes unsupervised pattern discovery in sequences a hard problem? What is the effect of the alphabet size / pattern frequency / noise level etc. on the ability of algorithms to recover a particular recurrent noisy pattern from a set of training data?  What are the theoretical limits of performance for any algorithm on this problem? We answer these and related questions by deriving the Bayes error rate for pattern discovery in sequences under a Markov assumption. We also apply these results to the motif discovery problem in computational biology and show examples of problems where additional data sources (besides the primary sequence information) are needed in order to recover the motifs.

Recent advances in biotechnology, and DNA microarray technology specifically, facilitated an enormous amount of quantative information about the activity of the genes. Computational approaches are extremely helpful for uncovering the regulatory mechanisms from the huge collections of experimental data, especially when combined with extensive prior knowledge generated by the domain experts.

We are actively involved in building a probabilistic model of the regulatory network (in collaboration with Professor Eric Mjolsness). We formulate the model of interactions between the variables, both in terms of structure of such interactions and non-linear mappings governing the dynamics of the system. We aim at simultaneous recovery of the network topology  (from a given class of networks) and the dynamics of the expression levels. Preliminary results on the simulated data show the effectiveness and power of modern statistical algorithms such as Gibbs sampling and Markov Chain Monte Carlo in recovering the hidden interactions.

In this project, we studied applicability of SVMs to massive data sets. In particular, we choose the approach of scaling down the data set as opposed to scaling up the training algorithm. We achieved this by combining probabilistic interpretation of SVMs (see Sollich, 1999) with likelihood based squashing algorithms (Madigan et al., 2000). Likelihood based squashing allows to create a pseudo data set that preserves the statistical properties of the original data and contains much smaller number of training examples. We show the results of applying the technique to a few data sets from the UCI KDD archive.

o      Towards scalable support vector machines using squashing
D. Pavlov, D. Chudova and P. Smyth. Proceedings of the Sixths ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, 2000.
[Postscript] [PDF] – slightly modified version of the final paper
 

This is a class project on finding local rules in retail transaction data that I did for the Data Mining class in my first quarter as a graduate student (Fall 1999).

I was involved in developing algorithms for efficient pattern discovery in web navigation data. We ended up with an algorithm too similar to PageGather in Oren Etzioni’s paper on Adaptive Web Sites.

This project was done at HNC Software (now a part of Fair Isaac), in collaboration with David Speights and Joel Brodsky. We developed algorithms that can be used effectively to train neural networks with the censored data. We showed the modified likelihood equations for the corresponding model, as well as equations for parameter updates. The models have been used to predict claim duration for worker’s compensation claims.

o      Using Neural Networks to Predict Claim Duration in the Presence of Right
Censoring and Covariates

D. Speights, J. Brodsky and D. Chudova. Casual Actuarial Society Forum, Winter 1999
[PDF]

This project was done at the AI Laboratory of the Institute of Nuclear Physics, Moscow State University. We developed an effective system for real-time detection of organic pollutants in the sea water, based on the fluorescent spectra of the sample. Another project involved benchmarking of various neural network training algorithms for spoken word recognition and OCR.

This was the topic of my graduate research at the Moscow State University, department of Computational Maths and Cybernetics, advisor – Professor Valery Grigoryevich Sushko. I am very grateful to my advisor for his support and encouragement. At that time, I was still able to prove some new existence and uniqueness theorems for a class of ODEs!

 



Information and Computer Science
University of California, Irvine, CA 92697-3430