Data Cleaning / Object Consolidation / Entity Resolution / Reference Disambiguation Bibliography
This list is partial and evolving. Please let us know if you find any missing publications.
Grouped by Publication Date
2007 | 2006 | 2005 | 2004 | 2003 | 2002 | Older
2007
- P. Kanani and A. McCallum. Efficient Strategies for Improving Partitioning-Based Author Coreference by Incorporating Web Pages as Graph Nodes. IIWEB, AAAI workshop, 2007.
- S. Yan, D. Lee, M-Y. Kan and C. L. Giles. Adaptive sorted neighborhood methods for efficient record linkage. JCDL, 2007. [link]
- S. Chaudhuri, A. Das Sarma, V. Ganti and R. Kaushik. Leveraging Aggregate Constraints For Deduplication. SIGMOD, 2007. [link]
- S. Cucerzan. Large-Scale Named Entity Disambiguation Based on Wikipedia Data. EMNLP-CoNLL, 2007. [link]
- Y. Chen and J. Martin. Towards Robust Unsupervised Personal Name Disambiguation. EMNLP-CoNLL, 2007. [link]
- A. Yates and O. Etzioni. Unsupervised Resolution of Objects and Relations on the Web. NAACL HLT, 2007. [link]
- Z. Chen, D. V. Kalashnikov, and S. Mehrotra. Adaptive Graphical Approach to Entity Resolution. JCDL, 2007. [link]
- X. Yin, J. Han, and P. S. Yu. Object Distinction: Distinguishing Objects with Identical Names. ICDE, 2007. [link]
- D. V. Kalashnikov, Z. Chen, R. Nuray-Turan, S. Mehrotra and N. Ashish. Disambiguation algorithm for people search on the web. In Proceedings of the IEEE ICDE 2007 Conference. April, 2007.
- R. Nuray-Turan, D. V. Kalashnikov and S. Mehrotra. Self-tuning in graph-based reference disambiguation. In Proceedings of DASFAA 2007. April, 2007. [link]
- B. On and D. Lee. Scalable Name Disambiguation using Multi-level Graph Partition. In SIAM SDM. April, 2007. [link]
- B. On, N. Koudas, D. Lee, and D. Srivastava. Group Linkage. In ICDE 2007. April, 2007. [link]
- W. Shen, P. DeRose, L. Vu, A. Doan, and R. Ramakrishnan. Source-aware Entity Matching: A Compositional Approach. In ICDE 2007. April, 2007. [link]
- P. Kanani, A. McCallum, and C. Pal. Improving author coreference by resource-bounded information gathering from the web. In IJCAI, 2007. [link]
2006
- R. Bunescu and M. Pasca. Using Encyclopedic Knowledge for Named Entity Disambiguation. EACL, 2006. [link]
- M. Weis and F. Naumann. Detecting Duplicates in Complex XML Data. ICDE, 2006. [link]
- D. Bollegala, Y. Matsuo, and M. Ishizuka. Disambiguating Personal Names on the Web using Automatically Extracted Key Phrases. ECAI, 2006. [link]
- F. Folino, G. Manco, and L. Pontieri. Effective Incremental Clustering for duplicate detection in Large Databases. IDEAS, 2006. [link]
- D. V. Kalashnikov and S. Mehrotra. Domain-independent data cleaning via analysis of entity-relationship graph. In ACM TODS. June, 2006. [link]
- V. Sehgal, L. Getoor, and P. Viechnicki. Entity resolution in geospatial data integration. In GIS, 2006. [link]
- M. Bilgic, L. Licamele, L. Getoor, and B. Shneiderman. D-Dupe: An interactive tool for entity resolution in social networks. In IEEE VAST, 2006. [link]
- B. On, E. Elmacioglu, D. Lee, J. Kang, and J. Pei. Improving grouped-entity resolution using quasi-cliques. In ICDM 2006. December, 2006. [link]
- I. Bhattacharya and L. Getoor. Query-time entity resolution. In SIGKDD, 2006. [link]
- P. Singla and P. Domingos. Entity resolution with Markov logic. In ICDM 2006. December, 2006. [link]
- M. Bilenko, B. Kamath, and R. J. Mooney. Adaptive Blocking: Learning to Scale Up Record Linkage and Clustering. In ICDM, 2006. [link]
- A. Culotta and A. McCallum. Tractable learning and inference with higher-order representations. In ICML Workshop on Open Problems in Statistical Relational Learning, 2006. [link]
- B. On, E. Elmacioglu, D. Lee, J. Kang, and J. Pei. An effective approach to entity resolution problem using quasi-clique and its application to digital libraries. In JCDL. June, 2006. [link]
- E. Minkov and W. W. Cohen. An Email and Meeting Assistant using Graph Walks. In CEAS-2006. [link]
- E. Minkov, W. W. Cohen, and A. Y. Ng. Contextual Search and Name Disambiguation in Email using Graphs. In SIGIR-2006. [link]
- Y. F. Tan, M-Y. Kan and D. Lee. Search Engine Driven Author Disambiguation. In JCDL. June, 2006. [link]
- I. Mansuri and S. Sarawagi. A system for integrating unstructured data into relational databases. In ICDE, 2006. [link]
- I. Bhattacharya and L. Getoor. A latent Dirichlet model for unsupervised entity resolution. In SIAM SDM, 2006. [link]
- L. Bolelli, S. Ertekin, and C. L. Giles. Clustering Scientific Literature Using Sparse Citation Graph Analysis. 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2006): 30-41, 2006. [link]
- J. Huang, S. Ertekin, and C. L. Giles. Efficient Name Disambiguation for Large-Scale Databases. 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2006): 536-544, 2006. [link]
- D. Menestrina, O. Benjelloun, and H. Garcia-Molina. Generic Entity Resolution with Data Confidences. In First Int'l VLDB Workshop on Clean Databases, 2006. [link]
- O. Benjelloun, H. Garcia-Molina, H. Kawai, T. E. Larson, D. Menestrina, Q. Su, S. Thavisomboon, and J. Widom. Generic Entity Resolution in the SERF Project. IEEE Data Engineering Bulletin, June 2006. [link]
- S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, 2006. [link]
- A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, 2006. [link]
- Y. Zhuang and L. Chen. In-network Outlier Cleaning for Data Collection in Sensor Networks. In CleanDB Workshop. [link]
- J. Hassell, B. Aleman-Meza, and I. B. Arpinar. Ontology-driven automatic entity disambiguation in unstructured text. In 5th International Semantic Web Conference (ISWC 2006), 2006. [link]
2005
- D. V. Kalashnikov, S. Mehrotra, and Z. Chen. Exploiting relationships for domain-independent data cleaning. In SIAM SDM, 2005. [link]
- Z. Chen, D. V. Kalashnikov and S. Mehrotra. Exploiting relationships for object consolidation. In IQIS, 2005. [link]
- L. Jin and C. Li. Selectivity Estimation for Fuzzy String Predicates in Large Datasets. In VLDB, 2005. [link]
- M. Bilenko, S. Basu, and M. Sahami. Adaptive Product Normalization: Using Online Learning for Record Linkage in Comparison Shopping. In ICDM, 2005. [link]
- W. Shen, X. Li and A. Doan. Constraint-based entity matching. In AAAI, 2005.
- A. Culotta and A. McCallum. Joint deduplication of multiple record types in relational data. In CIKM, 2005. [link]
- A. McCallum, K. Bellare and F. Pereira. A conditional random field for discriminatively-trained finite-state string edit distance. In UAI, 2005. [link]
- R. Bekkerman and A. McCallum. Disambiguating web appearances of people in a social network. In WWW. 2005. [link]
- J. Kang, D. Lee and P. Mitra. Identifying value mappings for data integration: an unsupervised approach. In WISE, 2005. [link]
- A. Al-Lawati, D. Lee, and P. McDaniel. Blocking-aware private record linkage. In IQIS, 2005. [link]
- D. Lee, B. On, J. Kang and S. Park. Effective and scalable solutions for mixed and split citation problems in digital libraries. In IQIS, 2005. [link]
- I. Bhattacharya and L. Getoor. Relational clustering for multi-type entity resolution. In MRDM, 2005. [link]
- J. Artiles, J. Gonzalo, and S. Sekine. A testbed for people searching strategies in the WWW. In SIGIR, 2005. [link]
- S. Chaudhuri, K. Ganjam, V. Ganti, R. Kapoor, V. Narasayya, and T. Vassilakis. Data cleaning in Microsoft SQL Server. In SIGMOD, 2005. [link]
- X. Dong, A. Y. Halevy, and J. Madhavan. Reference reconciliation in complex information spaces. In SIGMOD, 2005. [link]
- S. Hill. Social network relational vectors for anonymous identity matching. In IJCAI, 2005. [link]
- R. Holzer, B. Malin and L. Sweeney. Email alias detection using social network analysis. In SIGKDD Workshop, 2005. [link]
- B. Malin. Unsupervised name disambiguation via social network similarity. In Workshop on Link Analysis, Counterterrorism, and Security, 2005. [link]
- H. Han, H. Zha, and C. L. Giles. Name disambiguation in author citations using a K-way spectral clustering method. Joint Conference on Digital Libraries 2005 (JCDL 2005): 334-343, 2005. [link]
- S. Chaudhuri, V. Ganti, and R. Motwani. Robust identification of fuzzy duplicates. In ICDE, 2005. [link]
- B. Milch, B. Marthi, D. Sontag, S. Russell, D. L. Ong, and A. Kolobov. BLOG: Probabilistic models with unknown objects. In IJCAI, 2005. [link]
- X. Li, P. Morie, and D. Roth. Semantic integration in text: From ambiguous names to identifiable entities. AI Magazine, Special Issue on Semantic Integration, 2005. [link]
2004
- A. McCallum and B. Wellner. Conditional models of identity uncertainty with application to noun coreference. In NIPS, 2004. [link]
- I. Bhattacharya and L. Getoor. Iterative record linkage for cleaning and integration. In DMKD, 2004. [link]
- I. Bhattacharya and L. Getoor. Deduplication and group detection using links. In LinkKDD, 2004. [link]
- M. Lee, W. Hsu, and V. Kothari. Cleaning the spurious links in data. IEEE Intelligent Systems, 2004. [link]
- X. Li, P. Morie, and D. Roth. Identification and tracing of ambiguous names: discriminative and generative approaches. In AAAI, 2004. [link]
- P. Singla and P. Domingos. Multi-relational record linkage. In MRDM, 2004. [link]
- P. Ravikumar and W. W. Cohen. A hierarchical graphical model for record linkage. In UAI, 2004. [link]
- E. Agichtein and V. Ganti. Mining reference tables for automatic text segmentation. In SIGKDD, 2004. [link]
- M. Michalowski, S. Thakkar, and C. A. Knoblock. Exploiting Secondary Sources for Unsupervised Record Linkage. In VLDB, 2004. [link]
- R. Al-Kamha and D. W. Embley. Grouping Search-Engine Returned Citations for Person Name Queries. In WIDM, 2004. [link]
2003
- M. Bilenko and R. J. Mooney. Adaptive Duplicate Detection Using Learnable String Similarity Measures. In SIGKDD, 2003. [link]
- M. Bilenko and R. J. Mooney. On Evaluation and Training-Set Construction for Duplicate Detection. In KDD 2003 Workshop, 2003.
- S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani. Robust and efficient fuzzy match for online data cleaning. In SIGMOD, 2003. [link]
- W. W. Cohen, P. Ravikumar, and S. E. Fienberg. A comparison of string distance metrics for name-matching tasks. IIWeb Workshop, 2003. [link]
- V. Verykios, G. V. Moustakides, and M. Elfeky. A Bayesian decision model for cost optimal record matching. VLDB Journal, 2003. [link]
- L. Jin, C. Li, and S. Mehrotra. Efficient Record Linkage in Large Data Sets. In DASFAA, 2003. [link]
2002
- R. Ananthakrishna, S. Chaudhuri, and V. Ganti. Eliminating fuzzy duplicates in data warehouses. In VLDB, 2002. [link]
- G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and S. Sudarshan. Keyword searching and browsing in databases using BANKS. In ICDE, 2002. [link]
- P. Christen, T. Churches, and J. X. Zhu. Probabilistic name and address cleaning and standardization. The Australian Data Mining Workshop, 2002. [link]
- W. W. Cohen and J. Richman. Learning to match and cluster high-dimensional data sets for data integration. In SIGKDD, 2002. [link]
- H. Pasula, B. Marthi, B. Milch, S. Russell, and I. Shpitser. Identity uncertainty and citation matching. In NIPS, 2002. [link]
- S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In SIGKDD, 2002. [link]
- S. Tejada, C. A. Knoblock, and S. Minton. Learning domain-independent string transformation weights for high accuracy object identification. In SIGKDD, 2002. [link]
- W. E. Winkler. Methods for record linkage and Bayesian networks. Technical Report, US Census Bureau, 2002. [link]
Older
- A. E. Monge and C. P. Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records. In SIGMOD, 1997. [link]
- A. E. Monge and C. P. Elkan. The field matching problem: Algorithms and applications. In SIGKDD, 1996. [link]
- A. McCallum, K. Nigam, and L. H. Ungar. Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching. In ACM KDD, Boston, MA, 2000. [link]
- E. Cohen and D. Lewis. Approximating matrix multiplication for pattern recognition tasks. J. Algorithms, 30(2): 211-252. [link]
- E. Ristad and P. Yianilos. Learning string edit distance. IEEE Trans. Pattern Analysis and Machine Intelligence, 1998. [link]
- G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 2001. [link]
- H. Newcombe, J. Kennedy, S. Axford, and A. James. Automatic linkage of vital records. Science, 1959.
- I. Fellegi and A. Sunter. A theory for record linkage. Journal of Amer. Statistical Association, 1969. [link]
- J. Maletic and A. Marcus. Data cleansing: Beyond integrity checking. In Conf. on Information Quality, 2000. [link]
- L. Gravano, P. Ipeirotis, H. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, 2001. [link]
- L. Gravano, P. Ipeirotis, H. Jagadish, N. Koudas, S. Muthukrishnan, L. Pietarinen, and D. Srivastava. Using q-grams in a DBMS for approximate string processing. IEEE Data Engineering Bulletin, 24(4):28–34, 2001. [link]
- M. Hernandez and S. Stolfo. The merge/purge problem for large databases. In SIGMOD, 1995. [link]
- M. Jaro. Probabilistic linkage of large public health data files. Statistics in Medicine, 1995. [link]
- M. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of Amer. Statistical Association, 1989.
- M. Lee, H. Lu, T. Ling, and Y. Ko. Cleansing data for mining and warehouse. In DEXA, 1999. [link]
- S. Tejada, C. A. Knoblock, and S. Minton. Learning object identification rules for information integration. Information Systems Journal, 2001. [link]
- W. Cohen, H. Kautz, and D. McAllester. Hardening soft information sources. In SIGKDD, 2000. [link]
- W. E. Winkler. The state of record linkage and current research problems. Technical Report, US Census Bureau, 1999. [link]
- W. W. Cohen. Integration of heterogeneous databases without common domains using queries based on textual similarity. In SIGMOD, 1998. [link]
Publication Categorizer on Data Cleaning