CAREER: Peer-Based Data Integration and Sharing of Heterogeneous Sources
NSF IDM 0238586
Principal Investigator
Chen Li
Information and Computer Science, University of California, Irvine, CA 92697
Tel: (949) 824-9470, Fax: (949) 824-4056
chenli@ics.uci.edu
http://www.ics.uci.edu/~chenli/
Keywords: Peer-based data integration, data sharing, Raccoon Project
Project Summary
This proposal studies research challenges in supporting efficient and effective data integration and sharing. It consists of four research thrusts. (1) How to describe the contents of various sources and allow users to formulate queries easily. We combine two existing approaches to data integration and provide two querying modes for users to pose queries. (2) How to manage source metadata globally. To let each peer know what data other sources have, we consider three architectures for managing source metadata, and we study the efficiency, scalability, and freshness guarantees of these architectures. (3) How to deal with data heterogeneity, in particular how to link duplicate records from different sources. We study algorithms to learn similarity measures between records in both the single-attribute case and the multi-attribute case. (4) How to optimize distributed queries that involve multiple sources. We study how to model source resources, minimize communication costs, and optimize a query when sources have limited capabilities.
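As an illustration of thrust (3), the following is a minimal sketch of linking duplicate records across two sources. It is not the project's algorithm: the similarity function (Python's standard `difflib.SequenceMatcher`), the attribute weights, the threshold, and the sample records are all assumptions chosen for illustration. In the proposal, such measures are to be learned rather than fixed by hand.

```python
from difflib import SequenceMatcher


def attribute_similarity(a: str, b: str) -> float:
    """Single-attribute case: similarity in [0, 1], ignoring case and outer spaces."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def record_similarity(r1: dict, r2: dict, weights: dict) -> float:
    """Multi-attribute case: weighted average of per-attribute similarities.

    The weights are hypothetical; a real system would learn them.
    """
    total = sum(weights.values())
    score = sum(
        w * attribute_similarity(r1.get(attr, ""), r2.get(attr, ""))
        for attr, w in weights.items()
    )
    return score / total


def link_records(source_a, source_b, weights, threshold=0.8):
    """Return index pairs (i, j) whose record similarity meets the threshold."""
    return [
        (i, j)
        for i, r1 in enumerate(source_a)
        for j, r2 in enumerate(source_b)
        if record_similarity(r1, r2, weights) >= threshold
    ]


# Hypothetical sample data from two peers.
source_a = [{"name": "John Smith", "city": "Irvine"}]
source_b = [
    {"name": "Jon Smith", "city": "Irvine"},
    {"name": "Mary Jones", "city": "Toronto"},
]
weights = {"name": 0.7, "city": 0.3}
matches = link_records(source_a, source_b, weights)
```

This sketch compares every pair of records, which costs time proportional to the product of the source sizes. Efficient string-join algorithms, such as those planned for Year 1, aim to avoid that exhaustive comparison.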
Publications and Products
· Minimizing Data-Communication Costs by Decomposing Query Results in Client-Server Environments. Jia Li, Rada Chirkova, Chen Li. UCI ICS Technical Report, 2003. Submitted for publication.
· On Containment of Conjunctive Queries with Arithmetic Comparisons. Foto Afrati, Chen Li, Prasenjit Mitra. UCI ICS Technical Report, 2003. Submitted for publication.
· RACCOON: A Peer-Based System for Data Integration and Sharing. Chen Li, Jia Li, Qi Zhong. UCI ICS Technical Report, 2003. Submitted for publication.
· Describing and Utilizing Constraints to Answer Queries in Data-Integration Systems. Chen Li. To appear in IJCAI 2003 workshop on Information Integration on the Web.
· Materializing Views with Minimal Size to Answer Queries. Rada Chirkova and Chen Li. ACM PODS 2003.
Project Impact
· Data integration of autonomous data sources is critical to information management in e-commerce, scientific data management, digital government, and homeland security.
· The research results developed in this project will be applicable to a wide spectrum of data-integration-related applications, from traditional database systems to broader topics.
· The research challenges provide great opportunities to teach students existing database-management techniques as well as open research problems.
· A system prototype will help disseminate the research results.
· Research results will be disseminated in the form of publications, publicly available data sets, software, and demos. The project personnel will give presentations at conferences and colloquia.
· Many research results will be of interest to industry, which can help us transfer our work into real products.
In addition, we are collaborating closely with UCI Health Sciences to tackle research problems related to this proposal. Successful results will help bioinformatics researchers manage their data more effectively and will broaden the impact of database techniques across a wide range of applications.
Goals, Objectives and Targeted Activities
Year 1:
· Design the system architecture to support source-content description, query formulation, and metadata management. Select an application domain (e.g., real estate) to conduct experiments.
· Data heterogeneity: Develop efficient string join algorithms for data linkage.
· Distributed query optimization: Minimize communication cost.
· Continue the collaboration with other units at UC Irvine.
Year 2:
· Data heterogeneity: Learn single-attribute similarity metrics.
· Metadata management: Choose the right metadata specificity, and study different mechanisms to manage and update metadata.
· Build the system prototype, on which later research can be conducted and evaluated.
Year 3:
· Data heterogeneity: Learn multi-attribute merging rules.
· Study source description and query formulation.
· Develop plan-generation algorithms in the two querying modes and integrate them with the user interface.
· Continue building the system prototype.
Year 4:
· Metadata management: Study security and privacy of source data.
· Distributed query optimization: Model source resources, and optimize queries when sources have restrictions.
· Continue building the system prototype.
Year 5:
· System integration and evaluation.
· Study new open problems that arise during the first four years.
Area Background
Traditional data-integration systems adopt a centralized mediation architecture, in which a user poses a query to a mediator that retrieves data from underlying sources to answer the query. Recent database applications show an emerging need to support data integration in distributed, peer-based environments. In such an environment, autonomous peers (sources) connected by a network are willing to exchange data and services with each other. For instance, several labs at UC Irvine Health Sciences are conducting research related to the Human Genome Project, and they are willing to share their experimental results. This sharing needs a distributed infrastructure in which each lab provides its own data to other participants and accesses information from other labs. As another example, recent terrorist attacks show the great need for new intelligence-sharing technologies, which can strengthen the ability to prevent, detect, and respond to existing and emerging threats to homeland security.
Potential Related Projects
· The BestPeer Project at the National University of Singapore
· The Hyperion Project at the University of Toronto
· The Piazza Project at the University of Washington
Project Websites
http://www.ics.uci.edu/~raccoon/
This website gives an overview of the project, including its motivation and goals. It provides the publications and software releases from the project. It also provides an internal link for developers to exchange and archive related documents.