Graph-based Disambiguation Framework (GDF)
Analyzing the Entity-Relationship Graph for Domain-Independent Entity Resolution & Disambiguation.
Department of Computer Science
University of California, Irvine
http://www.ics.uci.edu/~dvk/GDF/
Overview
The effectiveness of data-driven technologies as decision support,
data exploration, and scientific discovery tools is closely tied to the
quality of the data to which they are applied. It is well recognized
that the outcome of an analysis is only as good as the data on which
it is performed. That is why organizations today spend a tangible
percentage of their budgets on cleaning tasks, such as removing duplicates,
correcting errors, and filling in missing values, to improve data quality
before pushing data through the analysis pipeline.
Given the critical importance of the problem, many efforts, in both industry
and academia, have explored systematic approaches to addressing the cleaning
challenges. Our work focuses specifically on the disambiguation challenge
that arises because objects in the real world are referred to using references
or descriptions that are not always unique identifiers of the objects,
leading to ambiguity.
We have proposed a novel approach to disambiguation that relies on the
observation that many real-world datasets are relational in nature:
they contain not only information about entities but also relationships among them,
knowledge of which can be used to disambiguate among representations more effectively.
Our objective is to develop a principled, domain-independent methodology that
exploits the entity-relationship graph of the dataset, and specifically
its relationships, for disambiguation, and that is self-tuning and requires
minimal intervention by analysts. We apply our methodology to the problems of
Entity Resolution and Web Entity Search.
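The idea of using relationships to disambiguate can be illustrated with a toy sketch. It is not the algorithm from the publications listed on this page. It assumes a small, made-up co-authorship graph and a simple hypothetical scoring rule (sum of 1/length over short simple paths) to decide which candidate entity an ambiguous reference "J. Smith" most likely denotes. All names, edges, and function names are invented for illustration.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def connection_strength(graph, source, target, max_len=3):
    """Sum 1/length over all simple paths source -> target with at most max_len edges.

    This scoring rule is an assumption made for this sketch only.
    """
    total = 0.0
    stack = [(source, (source,))]
    while stack:
        node, path = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt in path:
                continue  # keep paths simple (no repeated nodes)
            length = len(path)  # number of edges once nxt is appended
            if nxt == target:
                total += 1.0 / length
            elif length < max_len:
                stack.append((nxt, path + (nxt,)))
    return total


def disambiguate(graph, context, candidates, max_len=3):
    """Pick the candidate most strongly connected to the reference's context."""
    def score(candidate):
        return sum(connection_strength(graph, c, candidate, max_len) for c in context)
    return max(candidates, key=score)


# Hypothetical data: "paper1" lists "Alice Chen" and an ambiguous "J. Smith".
graph = build_graph([
    ("paper1", "Alice Chen"),
    ("paper2", "Alice Chen"), ("paper2", "John Smith"),  # earlier co-authorship
    ("Alice Chen", "Univ A"), ("John Smith", "Univ A"),  # shared affiliation
    ("paper3", "Jane Smith"), ("paper3", "Bob Lee"),
    ("Jane Smith", "Univ B"),
])

best = disambiguate(graph, context=["Alice Chen"],
                    candidates=["John Smith", "Jane Smith"])
print(best)
```

In this made-up graph, "John Smith" is linked to the known co-author through an earlier paper and a shared affiliation, while "Jane Smith" has no connection to her, so the sketch selects "John Smith". A real system would need a more careful connection model and a way to tune it to the dataset.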
Keywords
Entity Resolution, Entity Search, Disambiguation, Information Quality, Data Cleaning,
Social Network Analysis.
Principal Investigators
Ph.D. Students
News
- The project has received a $50,000 award from Google to work on
"Graph based Disambiguation Framework for Web People Search" (11/17/2009).
Publications
-
Exploiting Context Analysis for Combining Multiple Entity Resolution Systems.
Zhaoqi Stella Chen, Dmitri V. Kalashnikov, and Sharad Mehrotra.
In Proc. of ACM SIGMOD Int'l Conf. on Management of Data (ACM SIGMOD),
June 29-July 2, 2009.
[Download Paper]
-
Exploiting Web querying for Web People Search in WePS2.
Rabia Nuray-Turan, Zhaoqi Chen, Dmitri V. Kalashnikov, and Sharad Mehrotra.
In 2nd Web People Search Evaluation Workshop (WePS 2009), 18th WWW Conference,
April, 2009.
[Download Paper]
-
WEST: Modern Technologies for Web People Search.
Dmitri V. Kalashnikov, Zhaoqi Chen, Rabia Nuray-Turan, Sharad Mehrotra, and Zheng Zhang.
In Proc. of IEEE International Conference on Data Engineering (IEEE ICDE), demo publication,
March 29 - April 4, 2009.
[Download Paper]
-
Web people search via connection analysis.
Dmitri V. Kalashnikov, Zhaoqi Chen, Rabia Nuray-Turan, and Sharad Mehrotra.
In IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE), 20(11), November 2008.
[Download Paper]
-
Towards breaking the quality curse. A web-querying approach to Web People Search.
Dmitri V. Kalashnikov, Rabia Nuray-Turan, and Sharad Mehrotra.
In Annual International ACM SIGIR Conference,
July 20-24, 2008.
[Download Paper]
-
Adaptive Graphical Approach to Entity Resolution.
Stella Chen, Dmitri V. Kalashnikov, Sharad Mehrotra.
In Proc. of ACM IEEE Joint Conference on Digital Libraries (ACM IEEE JCDL),
June 17-23, 2007.
[Download Paper]
-
Self-tuning in Graph-based Reference Disambiguation.
Rabia Nuray-Turan, Dmitri V. Kalashnikov, and Sharad Mehrotra.
In Proc. of Int'l Conf. on Database Systems for Advanced Applications (DASFAA),
April 9-12, 2007.
[Download Paper]
-
Disambiguation Algorithm for People Search on the Web.
Dmitri V. Kalashnikov, Stella Chen, Rabia Nuray, Sharad Mehrotra, and Naveen Ashish.
In Proc. of IEEE International Conference on Data Engineering (IEEE ICDE), short publication,
April 16-20, 2007.
[Download Paper]
-
Domain-independent data cleaning via analysis of entity-relationship graph.
Dmitri V. Kalashnikov and Sharad Mehrotra.
In ACM Transactions on Database Systems (ACM TODS), June 2006.
[Download Paper]
-
Exploiting relationships for object consolidation.
Zhaoqi Chen, Dmitri V. Kalashnikov, and Sharad Mehrotra.
In Proc. of International ACM SIGMOD Workshop on Information Quality in
Information Systems (ACM IQIS),
June 13-17, 2005.
[Download Paper]
-
Exploiting relationships for domain-independent data cleaning.
Dmitri V. Kalashnikov, Sharad Mehrotra, and Zhaoqi Chen.
In Proc. of SIAM International Conference on Data Mining (SIAM Data Mining),
April 21-23, 2005.
[Download Paper]
Datasets
- CiteSeer: collection of research publications
- DBLP: collection of bibliographic entries
- arXiv hep-th: KDD Cup 2003 publication dataset, hep-th portion of arXiv
- Cora: a citation dataset from RIDDLE data repository
- Cora: a citation dataset from Andrew McCallum's data repository
- U.S. Census Names: frequently occurring first names and surnames from the 1990 Census
- IMDb: collection of movie-related entries
- Stanford Movie Dataset: collection of movie-related entries
- Web Disambiguation: collection of labeled webpages used by Bekkerman and McCallum in WWW'05
- WEPS Corpus: collection of labeled webpages used by Artiles, Gonzalo, and Verdejo in SIGIR'05
- SPOKE Challenge: (registration is required) collection of labeled webpages for SPOKE Challenge
© 2009 Dmitri V. Kalashnikov. All Rights Reserved.