ICS COLLOQUIA


 

What is Information?
Pierre Baldi
, ICS, UC Irvine
December 12, 2000
Abstract

Strategies for Boosting and Prediction
Greg Ridgeway
, Associate Statistician, RAND Statistics Group
December 5, 2000
Abstract

Visual Monitoring of Satellite Data with TowerView
Irv Harrison, High Tower Software, Irvine, CA
November 30, 2000
Abstract

Next-Generation Information Systems
Avi Silberschatz
, Information Sciences Research Center, Bell Laboratories
November 29, 2000
Abstract

Statistical Analyses of Genetic (Family Based) Data
Argyrios Ziogas,
Adjunct Prof., Biostatistics & Epidemiology, Department of Medicine, UCI
November 21, 2000
Abstract

On The Impact of Aggregation on The Performance of Traffic Aware Routing
Roch Guerin,
Department of Electrical Engineering, University of Pennsylvania
November 1, 2000
Abstract

Bayesian Learning in System Performance Management
Patrick Schaumont,
Researcher, Digital Broadband Terminal Group, IMEC, Belgium
October 30, 2000
Abstract

From Files to Documents: Integrating Database Concepts
Gabrio Rivera,
ETH Zurich, Switzerland
October 26, 2000
Abstract

Reflections on Ken Thompson's Reflections on Trusting Trust
Wolfgang Goerigk,
Institut fuer Informatik and Praktische Mathematik, Christian-Albrechts-Universitaet zu Kiel
October 25, 2000
Abstract

Tea (Tool for Embedded Applications)
Jorge Nunes,
CTO, Dev. Web, Lisbon, Portugal
October 5, 2000
Abstract

Bayesian Learning in System Performance Management
Irina Rish,
IBM T. J. Watson
July 7, 2000
Abstract

MATCH: A MATlab Compilation Environment For Distributed Heterogeneous Adaptive Systems
Prith Banerjee,
Northwestern University
June 29, 2000

Research and Applications of Distributed Shared Memory at Ulm University
Peter Schulthess,
Distributed Systems Laboratory, Ulm, Germany
June 23, 2000
Abstract

Characterizing Model Errors and Differences
Stephen Bay,
ICS, UC Irvine
June 22, 2000
Abstract

Compiler Optimizations for Embedded VLIW Processors
Rainer Leupers,
University of Dortmund, Germany
June 13, 2000
Abstract

Robust Multi-scale Image and Data Analysis by Maximum Entropy
Professor Joachim M. Buhmann,
Bonn University
June 7, 2000
Abstract

Bistro: a Platform for Building Scalable Wide-Area Upload Applications
Prof. Leana Golubchik,
University of Maryland, College Park
June 7, 2000
Abstract

Indexing Mobile Objects
Dimitrios Gunopulos,
UC Riverside
May 23, 2000
Abstract

To Catch a Thief-The Markov Modulated Poisson Process As a Tool for Detecting External Fraud
Steven Scott,
Assistant Professor of Statistics, Marshall School of Business, University of Southern California
May 19, 2000
Abstract

Planning as Heuristic Search
Blai Bonet,
University of California, Los Angeles
May 18, 2000
Abstract

Extreme Dimensionality Reduction in Text Learning
Gary Boone,
Georgia Institute of Technology
May 11, 2000
Abstract

Event Detection and Storage from Automatic Surveillance Systems
Simone Santini,
Praja Technologies, San Diego, CA
May 4, 2000
Abstract

Fixing the Computer World with bidirectional connections
Theodor Holm Nelson,
Project Xanadu and Keio University
March 31, 2000
Abstract

Estimating the Size of Populations
Stephen Fienberg,
Professor of Statistics and Social Sciences, Acting Director, Center for Automated Learning and Discovery - Carnegie Mellon University
March 30, 2000
Abstract

Presence Technology: The Evolution of Multiple Perspective Interactive Video
Dr. Simone Santini, PRAJA Technologies, San Diego, CA
February 16, 2000
Abstract

Challenges in Human Genetics
Anne Spence, University of California, Irvine
December 15, 2000
Abstract

Research on Adaptive Web Content Delivery
Jin-Lin Chen, Microsoft China
January 26, 2001
Abstract

Feature Subset Selection for Unsupervised Learning Applied to Content-Based Image Retrieval
Jennifer Dy, Department of Electrical Engineering Purdue University
January 9, 2001
Abstract

Dynamically Detecting Likely Program Invariants
Michael Ernst, MIT Lab for Computer Science
February 23, 2001
Abstract

Colloquia Archive



What is Information?
Pierre Baldi

While eminently successful for the transmission of data, Shannon's theory of information is not well suited to handle semantic dimensions and in particular the degree of relevance of the data for a receiver that has limited available knowledge about the source--a very common situation in science and human interactions. To overcome these limitations, we propose an bserver-dependent theory of information, we call the relevance. In the relevance theory, information is what leads an observer to revise his degrees of belief, and in particular to update prior probabilities into posterior probabilities. We show how the relevance can be computed in a number of cases and how it differs and complements Shannon's definition of entropy.



Strategies for Boosting and Prediction
Greg Ridgeway

In recent years statisticians, computer scientists, and engineers have developed more advanced techniques to learn complex non-linear relationships from datasets. Boosting is a technique that has received a lot of attention from all of these communities and shows promise as a handy tool for the data analyst's toolbox. These methods may easily incorporate many desirable properties including prediction accuracy, automation, robustness, variance reduction, model interpretability, and scalability to massive datasets. Drawing from examples in regression, classification, survival analysis, and density estimation, I will demonstrate on both real and simulated datasets that boosting consistently outperforms standard methods in terms of error on validation datasets. www.rand.org/centers/stat/


Visual Monitoring of Satellite Data with TowerView
Irv Harrison

In an effort to keep pace with the worldwide wireless revolution, satellite communications service providers have launched, or are scheduled to launch, large numbers of satellites in potentially mixed constellations. These constellations will provide such diverse services as satellite-based mobile phones, short message services (SMS), broadband Internet in the sky, paging services, and two-way broadband communications. In light of these developments, issues related to satellite monitoring and control are bound to challenge satellite operators as never before. An operations team that monitored 10,000 parameters five years ago may now have to react to more than ten times as much data. Moreover, simply monitoring the health of a constellation is no longer sufficient. In order to ensure performance and customer satisfaction, today's operations team must also monitor in real-time the satellite payload, which may consist of beacons, transponders, and carriers. Current monitoring tools that are based on display hierarchies - a design originally developed for small satellite constellation operations - do not scale sufficiently well to accommodate the hundreds of thousands of data parameters from these new constellations. The appropriate strategy for meeting these challenges is not to add staff or new control centers, but to take advantage of a new 3-D visual monitoring technology that provides instant access to data alarm distributions and severities. This technology employs trend alarms, derived parameters, and conditional alarms to alert the operations staff to problems before they happen. The environment is fully visual and can accommodate over 200,000 parameters in a single display. Visual monitoring gives the operations team a qualitative high-level view of the entire constellation, including payload, the ground station network, and individual gateways. The management and engineering staff can also monitor data remotely, giving them a direct interface to the satellite operations and network centers from any location. In addition to reducing staffing requirements, the remote monitoring capability supports lights-out operations because it allows engineering and operations personnel to receive alarm notification via pager or email. The training requirements for visual monitoring are minimal; in fact, most users can be trained in less than one day.



Next-Generation Information Systems
Avi Silberschatz

Avi Silberschatz is the Vice President of the Information Sciences Research Center at Bell Laboratories, Murray Hill, New Jersey. Prior to joining Bell Labs, he held a chaired professorship in the Department of Computer Sciences at the University of Texas at Austin. His research interests include operating systems, database systems, and distributed systems. His most recent research has focused on the areas of multimedia storage servers, high-performance databases, real-time rating and billing systems, and real-time operating systems.



Statistical Analyses of Genetic (Family Based) Data
Argyrios Ziogas

The field of genetic epidemiology is concerned with finding genes related to complex diseases -traits that have simple mendelian patterns of inheritance and traits that depend on multiple genetic or environmental factors- and once these genes are identified describe the population characteristics, and their effects on penetrance including gene X environment interactions. The development of statistical methods for studying familial diseases, sought to clarify the mode of transmission of major genes through segregation analysis or to clarify the location of disease genes relative to other genes of known chromosomal location through linkage analysis, by collecting data on related individuals. We present statistical methods based on maximum likelihood or Markov chain Monte Carlo in order to determine whether the observed pattern of disease among relatives is compatible with a single major gene, polygene, or simply shared environment. We include models with latent and observed covariates, and we discuss the role of ascertainment (sampling of families) and measurement error on the estimation of the parameters. We present some examples where the MCMC methods can offer very efficient estimating methods in the presence of latent and misclassified covariates.



On The Impact of Aggregation on The Performance of Traffic Aware Routing
Roch Guerin

This paper investigates the impact of traffic aggregation on the performance of routing algorithms that incorporate traffic information. The focus is on two main issues. Firstly, we explore the relationship between the average performance of the network and the level of granularity at which traffic can be assigned to routes. More specifically, we are interested in how average network performance improves as the ability of the routing protocol to split traffic arbitrarily across multiple paths in the network increases. Secondly, we focus on the impact of traffic aggregation on short-term routing behavior. In particular, we explore the effects of traffic aggregation on traffic variability, which directly affects short-term routing performance. Our analysis is based on traffic traces collected from an operational network. The results of this study provide insights into the cost-performance trade-offs associated with deploying routing protocols that incorporate traffic awareness. (Joint work with: A. Sridharan, S. Bhattacharyya, C. Diot, J. Jetcheva, and N. Taft)



From Files to Documents: Integrating Database Concepts within the File System
Gabrio Rivera

The exponential growth in hardware performance has reduced to a minimum the distinction on the software side between basic functionality, provided by the operating system, and extended functionality, provided by applications. Nowadays, a lot of advanced features such as multimedia handling are already integrated into the system itself. Given this trend, it is surprising that the current vision of the file management system still remains so close to the level of the physical storage structure of a file, following the basic equivalence of "file = pathname". In this talk, OMX-FS will be presented, a multi-user distributed object-oriented file system, based on the generic object model OM. OMX-FS considers files as objects and introduces typical concepts of object-oriented database systems, enabling both application developers and users to interact with the file system at a higher-level of logical abstraction.



Reflections on Ken Thompson's Reflections on Trusting Trust
Wolfgang Goerigk

Computer based systems are often heavily annoying their users. Errors, failures and crashes occur, due to bugs in design, implementation or integration. Hardware bugs are sensations, whereas software errors are commonplace. Even intentional errors like viruses and Trojan Horses show up time by time. Although compilation is a major topic in theoretical and practical computer science research and development since more than 40 years, correctness of realistic compilation is still a severe gap in trusted software production. In this talk we will present a security-related motivation for proving compilers correct, and in particular for binary compiler implementation verification. We will adopt the scenario of a well-known attack to Unix operating system programs due to Trojan Horses deeply hidden within a compiler executable. Such a compiler passes nearly every test, even Wirth's strong bootstrap test, state of the art compiler validation, and no amount of source level scrutiny, code inspection or verification does help. It nevertheless might eventually cause a catastrophe. We will show such a program in detail and give a running demonstration. Based on self-reproducing and reflective code, it is surprisingly easy to construct. In that, we share a common experience with Ken Thompson, who documented such kind of attacks in his Turing Award Lecture in 1984. We will also sketch a correct way out and some other recent work from the context of the German joint project Verifix on Correct Compilers.


Tea (Tool for Embedded Applications)
Jorge Nunes

Tea (Tool for Embedded Applications) is a high level scripting language for the Java environment. It has built-in support for all major programming paradigms, namely imperative (object oriented and procedural) and functional. Its major strengths are consistency, simplicity, easy extensibility and easy integration into any Java environment. These advantages are in addition to the intrinsic advantages of being a scripting language, these being rapid development turnaround, fast prototyping and acting as the glueing logic for components. These talk aims at giving a brief overview of the Tea scripting language, present its advantages and describe some real world projects where it has been used.


OCAPI-xl: Uniform Design Refinement of Hardware/Software Platforms
Patrick Schaumont

Good design practice in the area of electronic digital systems requires high-level abstractions for representation and simulation of system parts. At the same time, these abstractions should also have an implementation view, which can vary a lot over different targets (processor, FPGA, ASIC). An object oriented language such as C++ is an ideal tool to reach this goal. Using the object oriented properties of the language, we can replace the classic design environment (that uses a fixed design data model and flow) by a programming-and-refine approach. The programming aspect allows to achieve better system level simulations, while the refinement aspect links abstract representations to more concrete ones that can be implemented. OCAPI-xl is a C++ library that supports this design concept. The design model offers a uniform representation for both hardware and software parts of a system, and at the same time also has an automated path to implementation. The design of a stand-alone FPGA-based Web Camera with OCAPI-xl is taken as an example, and it is shown how typical design problems of embedded, networked devices (e.g. the estimation of protocol stack throughput) can be tackled in OCAPI-xl. Finally, the links between OCAPI-xl and platform design will be discussed, including reconfiguration and architecture abstraction. OCAPI Homepage: http://www.imec.be/ocapi


Bayesian Learning in System Performance Management
Irina Rish

Providing good quality of service (e.g., low response times) in distributed computer systems requires measuring end-user perceptions of performance. Unfortunately, such measures are often expensive or impossible to obtain. We propose a machine-learning approach to recognizing end-user transactions consisting of sequences of remote procedure calls (RPCs) received at a server. This approach combines naive Bayes classifier (with several combinations of feature vectors and probability distributions) for transaction classification with dynamic-programming Viterbi algorithm for segmenting a sequence of RPCs into transaction instances. Good accuracies are obtained for both problems of classification (87%), and for more complex segmentation problem (64%). Our current work focuses on extending naive Bayes to (unrestricted) Bayesian network classifiers for transaction recognition and other problems related to performance management. Traditional approach to constructing Bayesian networks is knowledge engineering involving a domain expert. Recently, increasing volumes of data available in a variety of real-life applications, including web, e-commerce, and bio-informatics, motivated an active research on learning Bayesian networks from data. A natural way of combining prior knowledge with data is one of the major advantages of Bayesian networks. Besides, they provide a possibility for learning causal relationships, for convenient handling incomplete data, and for incremental learning, among other advantages. The second part of this talk will provide a brief survey of the state-of-the-art techniques for learning Bayesian networks.


Research and Applications of Distributed Shared Memory at Ulm University
Peter Schulthess

A native DSM operating system for PC clusters is developed in the lean systems tradition of "Oberon and friends". We anticipate that distributed shared memory provides a convenient base for building networked communities focussing on individual topics, for document sharing and cooperative work and Collaborative Browsing. The DSM mode of operation will improve performance, consistency, administration and ease of programming of distributed application and retrieval scenarios.


Characterizing Model Errors and Differences
Stephen Bay

A critical component of applying machine learning algorithms is evaluating the performance of the models induced and using the evaluation to guide further development. Traditionally the most common evaluation metric is error or loss, however this provides very little information for the designer to use when constructing a system. We argue that an evaluation method should provide detailed feedback on the performance of an algorithm and that this feedback should be in the language of the problem: Our goal is to characterize model errors or the differences between models in the feature space. We provide a framework for this that allows different algorithms to be used as the discovery engine and we consider two approaches: (1) a classification strategy where we use a standard rule learner such as C5; (2) a descriptive paradigm where we use a new discovery algorithm: a contrast set miner. We show that C5 suffers from several problems that make it unsuitable for this task.


Compiler Optimizations for Embedded VLIW Processors
Rainer Leupers

Recent embedded processor families, such as TI C62xx or Philips Trimedia, are VLIW processors with multiple functional units working in parallel. The design of compilers capable of generating high-quality code for such processors is a challenging task. High code quality is extremely important for embedded systems, due to tight performance and code size constraints. This motivates the design of new code optimization techniques beyond classical compiler technology. Such techniques must be capable of exploiting parallelism at the instruction level, as well as special architectural features, such as SIMD instructions. On the other hand, compilation speed is only a secondary issue. In this presentation, we give an overview of embedded compiler techniques recently developed at the computer engineering group of the University of Dortmund. These include optimization of if-statements with conditional instructions, exploitation of SIMD instructions, and function inlining under code size constraints. We give experimental results for real embedded processors indicating the code quality gain that can be achieved. Additionally, compiler infrastructure issues and a hot new topic, low-power code generation, will be briefly covered.


Robust Multi-scale Image and Data Analysis by Maximum Entropy
Joachim M. Buhmann

The challenge in data analysis and in particular in computer vision is to develop robust models and efficient algorithms for inference. Data analysis methods have to provide (i) a quality criterion for structures in data, e.g., clusters, hierarchies or projections, (ii) an optimization method to find satisfactory structures and (iii) a validation principle to estimate the uncertainty of the inferred structures. This scientific program can be paradigmatically carried out in the image segmentation -- one of the essential first steps from subsymbolic pixel processing to symbolic image interpretation. Image segmentation techniques are characterized by three different scales: (i) the resolution scale of an image; (ii) the quality/cost scale of a segmentation solution; (iii) the complexity scale of segmentation models. All three scales influence each other in real world image segmentation problems. Similar scale coupling problems arise in other vision problems like stereo vision or motion analysis. Stochastic approximation based on maximum entropy inference provides a mathematical framework to couple all three scales and to develop optimization algorithms which generalize over the pixel noise in images. Large deviation arguments from statistical learning theory yields a stopping criterion for the approximation accuracy. This criterion which is parameterized by a computational temperature determines an optimal number of segments and, thereby, provides a solution to the model order selection problem.


Bistro: a Platform for Building Scalable Wide-Area Upload Applications
Leana Golubchik

Hot spots are a major obstacle to achieving scalability in the Internet. At the application layer, hot spots are usually caused by either (a) high demand for some data or (b) high demand for a certain service. This high demand for data or services, is typically the result of a real-life event involving availability of new data or approaching deadlines; therefore, relief of these hot spots may improve quality of life. Existing types of communication employed by data transfer applications can be categorized into four types, namely, one-to-one, one-to-many, many-to-many, and many-to-one. To the best of our knowledge, scalability and efficiency of many-to-one data transfer (i.e., upload) applications have been largely ignored. Thus, in this talk, we will introduce Bistro, a platform for building scalable wide-area upload applications, as well as suggest a number of open research problems within the Bistro framework. Examples of upload applications include submission of income tax forms, conference paper submission, proposal submission through the NSF FastLane system, homework and project submissions in distance education, voting in digital democracy applications, voting in interactive television, and many more. Our intent for deploying the Bistro platform is not to rely on adding resources (such as hosts) to the Internet. Rather, we envision that people will want to install Bistro on their hosts on the public Internet and contribute their resources to the overall Bistro infrastructure because it will improve their performance as well. Since anyone will be allowed to install a Bistro server, security becomes a great concern since sensitive information such as income tax forms will be uploaded through the Bistro system. Hence, in this talk, we will also describe a security protocol which provides the needed integrity, privacy, authentication, and non-repudiation for our upload framework over untrusted intermediaries which are an integral part of our Bistro framework. It does this under the constraints of performance and scalability requirements. Leana Golubchik is currently an Assistant Professor in the Department of Computer Science at the University of Maryland at College Park. From Fall of 1995 until Summer of 1997, she was an Assistant Professor in the Department of Computer Science at Columbia University. Her research interests include computer systems modeling & performance evaluation, Internet-based computing, and multimedia storage systems. Leana received her Ph.D. degree from the Computer Science Department at the University of California, Los Angeles in 1995.


Estimating the Size of Populations
Stephen Fienberg

Capture-recapture methods were originally developed for use in fisheries and wildlife biology for estimating the size of a closed (fixed) population. Later, they were adapted for use in connection with human populations. Applications to epidemiology came more recently, especially those involving multiple-record systems. Over the past 25 years, efforts have been made to link this methodology to the more traditional literature on loglinear models for contingency tables. This talk focuses on methods associated with data from multiple sources or lists and illustrates how loglinear and related models (e.g., the Rasch model from educational testing) can allow for estimation in the presence of both dependence among lists and heterogeneity among individuals or units in the population. We apply the methods to a number of examples including (1) the ascertainment of diabetes, (2) adjusting the U.S. decennial census counts, and (3) estimating the size of the WWW. The latter problem requires some special statistical features for scaling up analyses.


Indexing Mobile Objects
Dimitrios Gunopulos

We present external memory data structures to index mobile objects in one and two dimensions. The problem is motivated by real life applications in traffic monitoring, intelligent navigation and cellular communications systems. The specific problems we consider are range queries over the objects location in the future, and nearest neighbor queries on the future locations of the objects. For the 1-dimensional range query problem, we give (i) a dynamic, external memory algorithm with guaranteed worst case performance and linear space and (ii) a practical approximation algorithm which has linear space and expected logarithmic query time. We also give an algorithm with guaranteed logarithmic query time for a restricted version of the problem. We present extensions of our techniques to two dimensions. In addition we give a lower bound on the number of I/O's needed to answer the d-dimensional problem. We consider two versions of nearest neighbor queries depending on whether the temporal predicate is a time instant or a time interval, and we present techniques to answer both kinds of queries. Initial experimental results and comparisons to traditional indexing approaches are also included. Parts of this work appeared in PODS 99. This is joint work with G. Kollios and V. Tsotras.


To Catch a Thief-The Markov Modulated Poisson Process As a Tool for Detecting External Fraud
Steven Scott

The Markov modulated Poisson process (MMPP) arises naturally as a model for irregularly spaced time series data that may be contaminated by an outside source. I will describe the MMPP, present a computationally effective Gibbs sampler capable of simulating draws of MMPP parameters from their posterior distribution, and demonstrate the ability of the MMPP to detect fraud in a simple simulated data set. I will also discuss a data set collected by AT&T Labs in order to detect toll fraud in international phone traffic. The talk will conclude with a discussion of issues involved in incorporating the MMPP in a more complicated model that accounts for features present in the AT&T data set


Planning as Heuristic Search
Blai Bonet

The formulation of planning as heuristic search with heuristics derived from problem representations has turned out to be a fruitful approach for classical planning. In this talk, we present such formulation and its extensions in the context of planning with incomplete information. Planning with incomplete information can be formulated as a problem of search in belief space, where belief states can be either sets of states or more generally probability distribution over states. While this formulation is not particularly novel, the main point of the talk is to make it explicit, to test it over a number of domains, and to extend it to tasks like planning with sensing where the standard search algorithms do not apply. The resulting planner appears to be competitive with the most recent conformant and contingent planners (e.g., \cgp, \sgp, and \cmbp) while at the same time is more general as it can handle probabilistic actions and sensing, different action costs, and epistemic goals.


Extreme Dimensionality Reduction in Text Learning
Gary Boone

An exciting array of information management applications would be possible if computers could effectively understand text-based information. Significant progress has been made by employing statistical techniques that classify documents based on word frequencies and information theoretic weightings for selecting feature words that best predict task utility. A fundamental problem with these vector space approaches is the large dimensionality of the resulting learning space. With typical vocabularies and therefore representational dimensionalities containing 10,000-40,000 words, applicable learning techniques are limited. Feature selection techniques can reduce these values, but performance typically suffers as words are eliminated below a few thousand feature words. In this talk, I will present a new method that generates features with fast and accurate performance with fewer than 100 features and in many cases less than 10. The approach emphasizes learning a new feature space into which training and query data are subsequently recast. In contrast to feature selection approaches, no feature words are discarded. As a result, dimensionality can be reduced while avoiding the problem of "feature invisibility" which occurs when no feature words are present in a document. The method has the advantages that it is faster than comparable techniques, results in equal or more accurate performance of classification tasks, and enables the application of a broad range of learning techniques in the new, low dimensional space. Performance results will be shown for email, newsgroup, and web page classification tasks. Comparisons to other common techniques will be given, including Naive Bayes, standard Information Retrieval clustering and non-clustering techniques, Latent Semantic Analysis, and Support Vector Machines.


Compiler Optimizations for Embedded VLIW Processors
Simone Santini

Automatic video analysis systems are getting better and more diffused, thanks to the availability of significant computing power and good quality cameras at an ever decreasing power. With the advent of web cameras that can be connected to the serial port of a computer and which cost less than $100, the amount of analyzed visual information can be expected to grow by several orders of magnitude. The computer vision community is largely unprepared to deal with such massive amounts of data. A new area of research is therefore being outlined at the intersection between databases and computer vision. The necessity to search video poses new challenges to database research as well because, unlike traditional databases in which every record is meaningful, most frames and features in a video contain no useful data. This talk will introduce the work that is currently going on at Praja on event detection and storage. We identified events, whose definition is domain dependent as a suitable granulatity for video storage and retrieval. We work in a surveillance scenario, and are designing tools for the declarative specification of events of interest, their composition, storage and retrieval.


Fixing the Computer World with bidirectional connections
Theodor Holm Nelson

Note: "Xanadu", "ZigZag" and "Floating World" are software trademarks.

The cosmology and furniture of the existing computer world are all artificial constructs. Computers have no nature until it is given to them. They would not be hierarchical or have "applications" and three-letter extensions, except that somebody decided to make them that way. Word processing, spreadsheet and database are about as natural as General Motors and Peanut Butter Crunch ice cream. It could all be much better, and there is an infinity of other possibilities. But anything else will *also* be an artificial world -- and must fight to exist in the continuous war of business rivalry and standardization politics. The structures I have worked on are all based on bidirectional linking. I will discuss in particular the Xanadu designs (esp. xu88/Udanax Green), ZigZag and Floating World, all of which are or will be Open Source.

 


Presence Technology: The Evolution of Multiple Perspective Interactive Video
Dr. Simone Santini

Presence Technology (PT) is targeted to the needs of people who want to be part of a remote, live environment. Presence systems blend component technologies like computer vision, signal understanding, heterogeneous sensor fusion, live-media delivery, tele-presence, databases, and multimedia information systems into a novel set of functionality that enables the user to perceive, move around, inquire about, and interact with the remote, live environment through her reception and control devices. PT creates the opportunity to perform different tasks: watch an event, tour and explore a location, meet and communicate with others, monitor the environment for a potential situation, perform a query on the perceived objects and events, and recreate past observations. Technically, the framework offers computer-mediated access to multi-sensory information in an environment, integrates the sensory information into a situation model of the environment, and delivers, at the user's request, the relevant part of the assimilated information through a multi-modal interface. PT is an extension of the Multiple Perspective Interactive Video project at the Visual Computing Laboratory, University of California, San Diego.


Entropy minimization for learning and video understanding
Matthew Brand

Consider the scientist's intuition that the best hypothesis should give the simplest and clearest account of the data. This can be neatly formalized in terms of entropy. I'll introduce a data-modeling framework in which one learns by minimizing all entropies associated with a hypothesis. This framework contains several powerful methods as special cases: deterministic annealing; maximum likelihood; a variant of MaxEnt; and a new method I'll call maximum structure, which blends description length minimization and free energy minimization. Maximum structure estimators seeks an optimal separation between the essential and accidental structure of a dataset, typically yielding a highly predictive model whose internal structure can be interpreted as a theory of the data-generating mechanism. The maximum structure case also has interesting connections to Kolmogorov complexity and to the theory of random graphs. When we replace the M step of expectation-maximization with the maximum-structure estimator, we obtain a considerably more efficient and powerful deterministic annealing algorithm that simultaneously estimates the structure and parameters of hidden-variable models--yielding quasi-global optima with no speculative or wasted computation. Using simple probabilistic models trained in this unsupervised learning framework, we are able to tackle applications that are difficult even with supervised learning. I'll demonstrate a variety of audio and video applications in which entropically estimated models support annotation, low bit-rate coding, anomaly detection, and even reconstruction of 3D structure and motion. Time permitting, I'll also show some results in animation.

 


Challenges of Human Genetics
Anne Spence

The Human Genome Project is proving to be a true "good news -- bad news" situation for the human genetics community. The amount of data readily avail-able is fantastic and can be applied directly to questions under investigation as will be illustrated by examples from our research. Good news. On the other hand there are no tool in place that permit rapid and efficient know-ledge of what is in the data bases let along the incorporation of the information in systematic ways. And how do we analyze this vast volume of data with existing methods? Bad news. A brief historical perspective will be presented and the current challenges for the development of new tools discussed.

 

 


Research on Adaptive Web Content Delivery
Jin-Lin

In this talk I will give a brief introduction to my current research on adaptive web content delivery in Microsoft Research China. My system is mainly composed of three modules: website understanding, web user analysis and decision engine for adaptation. I will first present a Function-based Object Model (FOM) towards website understanding, which is followed by individual web user goal detection. I will then give an introduction to my work on user weblog mining and website structure improvement. An Abstract Content Representing Structure (ACRES) model is proposed for the decision engine to make adaptation decision. Demos of Adaptation results will also be presented at the end of the talk.

 

 


Feature Subset Selection for Unsupervised Learning Applied to Content-Based Image Retrieval
Jennifer Dy

In this talk I will give a brief introduction to my current research on adaptive web content delivery in Microsoft Research China. My system is mainly composed of three modules: website understanding, web user analysis and decision engine for adaptation. I will first present a Function-based Object Model (FOM) towards website understanding, which is followed by individual web user goal detection. I will then give an introduction to my work on user weblog mining and website structure improvement. An Abstract Content Representing Structure (ACRES) model is proposed for the decision engine to make adaptation decision. Demos of Adaptation results will also be presented at the end of the talk.

 


Dynamically Detecting Likely Program Invariants
Michael Ernst

Explicitly stated program invariants can help programmers by characterizing certain aspects of program execution and identifying program properties that must be preserved when modifying code. In practice, these invariants are usually absent from code. An alternative to expecting programmers to annotate code with invariants is to automatically infer invariants from the program itself.

This talk describes dynamic techniques for discovering invariants from execution traces; the essential idea is to look for patterns in and relationships among variable values over a set of executions. An implementation has indicated that the approach is both effective -- successfully rediscovering formal specifications -- and useful -- discovering invariants that assisted a software evolution task.

A naive implementation of dynamic invariant inference suffers both from excessive runtime and also from reporting irrelevant invariants and failing to report relevant ones. I will discuss several improvements that together make the system usable: statistical confidence measures, eliminating unused polymorphism, suppressing redundant and coincidentally true properties, and limiting which variables are compared to one another. I will also show how to extend dynamic invariant inference to pointer-directed data structures, which requires traversal of data structures and discovery of conditional rather than universally true invariants.

 

 

Information and Computer Science
University of California, Irvine
Irvine, CA 92697-3425


Submissions to: Front Office