CAREER: Peer-Based Data Integration and Sharing of Heterogeneous Sources

NSF IDM 0238586

Principal Investigator

Chen Li
Information and Computer Science, University of California, Irvine CA 92697

Tel: (949) 824-9470, Fax: (949) 824-4056

chenli@ics.uci.edu, http://www.ics.uci.edu/~chenli/


Keywords:  Peer-based data integration, data sharing, Raccoon Project

Project Summary

This proposal studies research challenges in supporting efficient and effective data integration and sharing. It consists of four research thrusts. (1) How to describe source contents and let users formulate queries easily. We combine two existing approaches to data integration and provide two querying modes. (2) How to manage source metadata globally. So that each peer knows what data other sources have, we consider three architectures for managing source metadata and study their efficiency, scalability, and freshness guarantees. (3) How to deal with data heterogeneity, in particular how to link duplicate records from different sources. We study algorithms that learn similarity measures between records in both the single-attribute and the multi-attribute case. (4) How to optimize distributed queries that involve multiple sources. We study how to model source resources, minimize communication costs, and optimize a query when sources have limited capabilities.
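To make thrust (3) concrete, the following is a minimal sketch of single-attribute record linkage. It is an illustration only, not the project's algorithm: it assumes a fixed, hand-picked similarity function (normalized edit distance) and a hand-picked threshold, whereas the project aims to learn such measures. The function names, the threshold of 0.8, and the sample records are all hypothetical.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical after cleanup."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def link_records(source_a, source_b, attribute, threshold=0.8):
    """Return record pairs whose `attribute` values are similar enough.

    Assumption: a naive nested-loop comparison. An efficient string join
    would avoid comparing every pair.
    """
    return [(r, s) for r, s in product(source_a, source_b)
            if similarity(r[attribute], s[attribute]) >= threshold]


# Hypothetical records held by two different peers.
peer_1 = [{"name": "Jonathan Smith"}, {"name": "Mary Jones"}]
peer_2 = [{"name": "Jonathon Smith"}, {"name": "Alice Brown"}]
matches = link_records(peer_1, peer_2, "name")
```

Here the two spellings of the first name differ by one character, so that pair is the only one linked. The nested loop compares every pair of records, which is why efficient string-join algorithms matter once sources grow large.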

Publications and Products

·          Minimizing Data-Communication Costs by Decomposing Query Results in Client-Server Environments. Jia Li, Rada Chirkova, Chen Li. UCI ICS Technical Report, 2003. Submitted for publication.

·          On Containment of Conjunctive Queries with Arithmetic Comparisons. Foto Afrati, Chen Li, Prasenjit Mitra. UCI ICS Technical Report, 2003. Submitted for publication.

·          RACCOON: A Peer-Based System for Data Integration and Sharing. Chen Li, Jia Li, Qi Zhong. UCI ICS Technical Report, 2003. Submitted for publication.

·          Describing and Utilizing Constraints to Answer Queries in Data-Integration Systems. Chen Li. To appear in IJCAI 2003 workshop on Information Integration on the Web.

·          Materializing Views with Minimal Size to Answer Queries. Rada Chirkova, Chen Li. ACM PODS 2003.

Project Impact

·          Data integration of autonomous data sources is critical to information management in e-commerce, scientific data management, digital government, and homeland security.

·          The research results developed in this project will be applicable to a wide spectrum of data-integration applications, from traditional DBMSs to broader topics.

·          The research challenges provide great opportunities to teach students both existing database-management techniques and open research problems.

·          A system prototype will be helpful for the dissemination of the research results.

·          Research results will be disseminated in the form of publications, publicly available data sets, software, and demos.  The personnel of the project will give presentations at conferences and colloquia.

·          Many research results will be of interest to industry, which can help us transfer our work into real products.

 

In addition, we collaborate closely with UCI Health Sciences to tackle research problems related to this proposal. Successful results will help bioinformatics researchers manage their data more effectively, and will broaden the impact of database techniques across a wide range of applications.

Goals, Objectives and Targeted Activities

Year 1:

·          Design the system architecture to support source-content description, query formulation, and metadata management.  Select an application domain (e.g., real estate) to conduct experiments.

·          Data heterogeneity: Develop efficient string join algorithms for data linkage.

·          Distributed query optimization: minimize communication cost.

·          Continue the collaboration with other units at UC Irvine.

 

Year 2:

·          Data heterogeneity: Learn single-attribute similarity metrics.

·          Metadata management: choose the right metadata specificity, and study different mechanisms to manage and update metadata.

·          Build the system prototype, on which later research and experiments can be conducted.

 

Year 3:

·          Data heterogeneity: Learn multi-attribute merging rules.

·          Study source description and query formulation.

·          Develop plan-generation algorithms in the two querying modes and integrate them with the user interface.

·          Continue building the system prototype.

 

Year 4:

·          Metadata management: study security and privacy of source data.

·          Distributed query optimization: model source resources, and optimize queries in case of source restrictions.

·          Continue building the system prototype.

 

Year 5:

·          System integration and evaluation.

·          Study new open problems that arise during the first four years.

Area Background

Traditional data-integration systems adopt a centralized mediation architecture, in which a user poses a query to a mediator that retrieves data from underlying sources to answer the query. Recent database applications show an emerging need to support data integration in distributed, peer-based environments.  In such an environment, autonomous peers (sources) connected by a network are willing to exchange data and services with each other.  For instance, several labs at UC Irvine Health Sciences are conducting research related to the Human Genome Project, and they are very willing to share their experimental results.  This sharing needs a distributed infrastructure in which each lab provides its own data to other participants and accesses information from other labs.  As another example, recent terrorist attacks show the great need for new intelligence-sharing technologies, which can strengthen the ability to prevent, detect, and respond to existing and emerging homeland-security threats.

Potential Related Projects

·          The BestPeer Project at the National University of Singapore

·          The Hyperion Project at the University of Toronto

·          The Piazza Project at the University of Washington

Project Websites

http://www.ics.uci.edu/~raccoon/

This website gives an overview of the project, including its motivation and goals. It hosts the publications and software releases from the project, and provides an internal link for developers to exchange and archive related documents.