Data Cleaning / Object Consolidation / Entity Resolution / Reference Disambiguation Bibliography

This list is partial and evolving. Please let us know if you find any missing publications.

Grouped by Publication Date

 

2007 | 2006 | 2005 | 2004 | 2003 | 2002 | Older

 

2007

  1. P. Kanani and A. McCallum. Efficient Strategies for Improving Partitioning-Based Author Coreference by Incorporating Web Pages as Graph Nodes. IIWEB, AAAI workshop, 2007.
  2. S. Yan, D. Lee, M-Y. Kan and C. L. Giles. Adaptive sorted neighborhood methods for efficient record linkage. JCDL 2007. [link]
  3. S. Chaudhuri, A. Das Sarma, V. Ganti and R. Kaushik. Leveraging Aggregate Constraints For Deduplication. SIGMOD, 2007. [link]
  4. S. Cucerzan. Large-Scale Named Entity Disambiguation Based on Wikipedia Data. EMNLP-CoNLL, 2007. [link]
  5. Y. Chen and J. Martin. Towards Robust Unsupervised Personal Name Disambiguation. EMNLP-CoNLL, 2007. [link]
  6. A. Yates and O. Etzioni. Unsupervised Resolution of Objects and Relations on the Web. NAACL HLT, 2007. [link]
  7. Z. Chen, D. V. Kalashnikov, and S. Mehrotra. Adaptive Graphical Approach to Entity Resolution. JCDL, 2007. [link]
  8. X. Yin, J. Han, and P. S. Yu. Object Distinction: Distinguishing Objects with Identical Names. ICDE, 2007. [link]
  9. D. V. Kalashnikov, Z. Chen, R. Nuray-Turan, S. Mehrotra and N. Ashish. Disambiguation algorithm for people search on the web. In the Proceedings of IEEE ICDE 2007. April, 2007.
  10. R. Nuray-Turan, D. V. Kalashnikov and S. Mehrotra. Self-tuning in graph-based reference disambiguation. In the Proceedings of DASFAA 2007. April, 2007. [link]
  11. B. On and D. Lee. Scalable Name Disambiguation using Multi-level Graph Partition. In SIAM SDM, April 2007. [link]
  12. B. On, N. Koudas, D. Lee, and D. Srivastava. Group Linkage. In ICDE 2007. April, 2007. [link]
  13. W. Shen, P. DeRose, L. Vu, A. Doan, and R. Ramakrishnan. Source-aware Entity Matching: A Compositional Approach. In ICDE 2007, April 2007. [link]
  14. P. Kanani, A. McCallum, and C. Pal. Improving author coreference by resource-bounded information gathering from the web. In IJCAI. 2007. [link]

 

2006

  1. R. Bunescu and M. Pasca. Using Encyclopedic Knowledge for Named Entity Disambiguation. EACL, 2006. [link]
  2. M. Weis and F. Naumann. Detecting Duplicates in Complex XML Data. ICDE, 2006. [link]
  3. D. Bollegala, Y. Matsuo, and M. Ishizuka. Disambiguating Personal Names on the Web using Automatically Extracted Key Phrases. ECAI, 2006. [link]
  4. F. Folino, G. Manco, and L. Pontieri. Effective Incremental Clustering for duplicate detection in Large Databases. IDEAS, 2006. [link]
  5. D. V. Kalashnikov and S. Mehrotra. Domain-independent data cleaning via analysis of entity-relationship graph. In ACM TODS. June, 2006. [link]
  6. V. Sehgal, L. Getoor, and P. Viechnicki. Entity resolution in geospatial data integration. In GIS, 2006. [link]
  7. M. Bilgic, L. Licamele, L. Getoor, and B. Shneiderman. D-dupe: An interactive tool for entity resolution in social networks. In IEEE VAST, 2006. [link]
  8. B. On, E. Elmacioglu, D. Lee, J. Kang, and J. Pei. Improving grouped-entity resolution using quasi-cliques. In ICDM 2006. December, 2006 [link]
  9. I. Bhattacharya and L. Getoor. Query-time entity resolution. In SIGKDD. 2006 [link]
  10. P. Singla and P. Domingos. Entity resolution with Markov logic. In ICDM 2006. December, 2006. [link]
  11. M. Bilenko, B. Kamath, and R. J. Mooney. Adaptive Blocking: Learning to Scale Up Record Linkage and Clustering. In ICDM. 2006. [link]
  12. A. Culotta and A. McCallum. Tractable learning and inference with higher-order representations. In ICML Workshop on Open Problems in Statistical Relational Learning. 2006. [link]
  13. B. On, E. Elmacioglu, D. Lee, J. Kang, and J. Pei. An effective approach to entity resolution problem using quasi-clique and its application to digital libraries. In JCDL. June, 2006. [link]
  14. E. Minkov and W. W. Cohen. An Email and Meeting Assistant using Graph Walks. In CEAS-2006. [link]
  15. E. Minkov, W. W. Cohen, and A. Y. Ng. Contextual Search and Name Disambiguation in Email using Graphs. In SIGIR-2006. [link]
  16. Y. F. Tan, M-Y. Kan and D. Lee. Search Engine Driven Author Disambiguation. In JCDL. June, 2006. [link]
  17. I. Mansuri and S. Sarawagi. A system for integrating unstructured data into relational databases. In ICDE, 2006. [link]
  18. I. Bhattacharya and L. Getoor. A latent Dirichlet model for unsupervised entity resolution. In SIAM SDM. 2006. [link]
  19. L. Bolelli, S. Ertekin, C. L. Giles. Clustering Scientific Literature Using Sparse Citation Graph Analysis. 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2006): 30-41, 2006. [link]
  20. J. Huang, S. Ertekin, C. L. Giles. Efficient Name Disambiguation for Large-Scale Databases. 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2006): 536-544, 2006. [link]
  21. D. Menestrina, O. Benjelloun, H. Garcia-Molina. Generic Entity Resolution with Data Confidences. In First Int'l VLDB Workshop on Clean Databases, 2006. [link]
  22. O. Benjelloun, H. Garcia-Molina, H. Kawai, T. E. Larson, D. Menestrina, Q. Su, S. Thavisomboon, J. Widom. Generic Entity Resolution in the SERF Project. IEEE Data Engineering Bulletin, June 2006. [link]
  23. S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, 2006. [link]
  24. A. Arasu, V. Ganti, R. Kaushik. Efficient exact set-similarity joins. In VLDB, 2006. [link]
  25. Y. Zhuang and L. Chen. In-network Outlier Cleaning for Data Collection in Sensor Networks. In CleanDB Workshop. [link]
  26. J. Hassell, B. Aleman-Meza, and I. B. Arpinar. Ontology-driven automatic entity disambiguation in unstructured text. In 5th International Semantic Web Conference (ISWC2006), 2006. [link]

 

2005

  1. D. V. Kalashnikov, S. Mehrotra, and Z. Chen. Exploiting relationships for domain independent data cleaning. In SDM 2005. 2005. [link]
  2. Z. Chen, D. V. Kalashnikov and S. Mehrotra. Exploiting relationships for object consolidation. In IQIS. 2005. [link]
  3. L. Jin and C. Li. Selectivity Estimation for Fuzzy String Predicates in Large Datasets. In VLDB, 2005. [link]
  4. M. Bilenko, S. Basu, and M. Sahami. Adaptive Product Normalization: Using Online Learning for Record Linkage in Comparison Shopping. In ICDM. 2005. [link]
  5. W. Shen, X. Li and A. Doan. Constraint-based entity matching. In AAAI 2005. 2005.
  6. A. Culotta and A. McCallum.  Joint deduplication of multiple record types in relational data.  In CIKM. 2005. [link]
  7. A. McCallum, K. Bellare and F. Pereira. A conditional random field for discriminatively-trained finite-state string edit distance. In UAI. 2005. [link]
  8. R. Bekkerman and A. McCallum. Disambiguating web appearances of people in a social network. In WWW. 2005. [link]
  9. J. Kang, D. Lee and P. Mitra. Identifying value mappings for data integration: an unsupervised approach. In WISE. 2005. [link]
  10. A. Al-Lawati, D. Lee, and P. McDaniel. Blocking-aware private record linkage. In IQIS. 2005. [link]
  11. D. Lee, B. On, J. Kang and S. Park. Effective and scalable solutions for mixed and split citation problems in digital libraries. In IQIS. 2005. [link]
  12. I. Bhattacharya and L. Getoor. Relational clustering for multi-type entity resolution. In MRDM. 2005. [link]
  13. J. Artiles, J. Gonzalo, and S. Sekine. A testbed for people searching strategies in the WWW. In SIGIR. 2005. [link]
  14. S. Chaudhuri, K. Ganjam, V. Ganti, R. Kapoor, V. Narasayya, and T. Vassilakis. Data cleaning in Microsoft SQL Server. In SIGMOD, 2005. [link]
  15. X. Dong, A. Y. Halevy, and J. Madhavan. Reference reconciliation in complex information spaces. In SIGMOD, 2005. [link]
  16. S. Hill. Social network relational vectors for anonymous identity matching. In IJCAI, 2005. [link]
  17. R. Holzer, B. Malin and L. Sweeney. Email alias detection using social network analysis. In SIGKDD Workshop, 2005. [link]
  18. B. Malin. Unsupervised name disambiguation via social network similarity. In Workshop on Link Analysis, Counterterrorism, and Security, 2005. [link]
  19. H. Han, H. Zha, C. L. Giles. Name disambiguation in author citations using a K-way spectral clustering method. Joint Conference on Digital Libraries 2005 (JCDL 2005): 334-343, 2005. [link]
  20. S. Chaudhuri, V. Ganti, R. Motwani. Robust identification of fuzzy duplicates. In ICDE, 2005. [link]
  21. B. Milch, B. Marthi, D. Sontag, S. Russell, D. L. Ong, and A. Kolobov. BLOG: Probabilistic models with unknown objects. In IJCAI, 2005. [link]
  22. X. Li, P. Morie, and D. Roth. Semantic integration in text: From ambiguous names to identifiable entities. AI Magazine. Special issue on semantic integration. 2005. [link]

2004

  1. A. McCallum and B. Wellner. Conditional models of identity uncertainty with application to noun coreference. In NIPS. 2004. [link]
  2. I. Bhattacharya and L. Getoor. Iterative record linkage for cleaning and integration. In DMKD’04. 2004. [link]
  3. I. Bhattacharya and L. Getoor. Deduplication and group detection using links. In LinkKDD-04. 2004. [link]
  4. M. Lee, W. Hsu, and V. Kothari. Cleaning the spurious links in data. IEEE Intelligent Systems. 2004. [link]
  5. X. Li, P. Morie, and D. Roth. Identification and tracing of ambiguous names: discriminative and generative approaches. In AAAI, 2004. [link]
  6. P. Singla and P. Domingos. Multi-relational record linkage. In MRDM, 2004. [link]
  7. P. Ravikumar and W. W. Cohen. A hierarchical graphical model for record linkage. In UAI, 2004. [link]
  8. E. Agichtein and V. Ganti. Mining reference tables for automatic text segmentation. In SIGKDD, 2004. [link]
  9. M. Michalowski, S. Thakkar, C. A. Knoblock. Exploiting Secondary Sources for Unsupervised Record Linkage. In VLDB, 2004. [link]
  10. R. Al-Kamha and D.W. Embley. Grouping Search-Engine Returned Citations for Person Name Queries. In WIDM’04, 2004. [link]

 

2003

  1. M. Bilenko and R. J. Mooney. Adaptive Duplicate Detection Using Learnable String Similarity Measures. In SIGKDD. 2003. [link]
  2. M. Bilenko and R. J. Mooney. On Evaluation and Training-Set Construction for Duplicate Detection. In KDD 2003 Workshop. 2003.
  3. S. Chaudhuri, K. Ganjam, V. Ganti, R. Motwani. Robust and efficient fuzzy match for online data cleaning. In SIGMOD, 2003. [link]
  4. W. W. Cohen, P. Ravikumar, and S. E. Fienberg. A comparison of string distance metrics for name-matching tasks. IIWeb Workshop, 2003. [link]
  5. V. Verykios, G. V. Moustakides, and M. Elfeky. A Bayesian decision model for cost optimal record matching. VLDB Journal, 2003. [link]
  6. L. Jin, C. Li, and S. Mehrotra. Efficient Record Linkage in Large Data Sets. In DASFAA 2003, 2003. [link]

 

2002

  1. R. Ananthakrishna, S. Chaudhuri, and V. Ganti. Eliminating fuzzy duplicates in data warehouses. In VLDB Conference. 2002. [link]
  2. G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and S. Sudarshan. Keyword searching and browsing in databases using BANKS. In ICDE. 2002. [link]
  3. P. Christen, T. Churches, and J. X. Zhu. Probabilistic name and address cleaning and standardization. The Australian Data Mining Workshop, 2002. [link]
  4. W. W. Cohen and J. Richman. Learning to match and cluster high-dimensional data sets for data integration.  In SIGKDD, 2002. [link]
  5. H. Pasula, B. Marthi, B. Milch, S. Russell, and I. Shpitser. Identity uncertainty and citation matching. In NIPS, 2002. [link]
  6. S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In SIGKDD, 2002. [link]
  7. S. Tejada, C. A. Knoblock, and S. Minton. Learning domain independent string transformation weights for high accuracy object identification.  In SIGKDD, 2002. [link]
  8. W. E. Winkler. Methods for record linkage and Bayesian networks. Technical Report, US Census Bureau, 2002. [link]

 

Older

  1. A. E. Monge and C. P. Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records.  In SIGMOD, 1997. [link]
  2. A. E. Monge and C. Elkan. The field matching problem: Algorithms and applications.  In SIGKDD, 1996. [link]
  3. A. McCallum, K. Nigam, and L. H. Ungar. Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching. In ACM KDD, Boston, MA, 2000.  [link]
  4. E. Cohen and D. Lewis. Approximating matrix multiplication for pattern recognition tasks. J. Algorithms. 30(2): 211-252. [link]
  5. E. Ristad and P. Yianilos. Learning string edit distance. IEEE Trans. Pattern Analysis and Machine Intelligence, 1998. [link]
  6. G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 2001. [link]
  7. H. Newcombe, J. Kennedy, S. Axford, and A. James. Automatic linkage of vital records. Science, 1959.
  8. I. Fellegi and A. Sunter. A theory for record linkage. Journal of Amer. Statistical Association, 1969. [link]
  9. J. Maletic and A. Marcus. Data cleansing: Beyond integrity checking. In Conf. on Information Quality, 2000. [link]
  10. L. Gravano, P. Ipeirotis, H. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, 2001. [link]
  11. L. Gravano, P. Ipeirotis, H. Jagadish, N. Koudas, S. Muthukrishnan, L. Pietarinen, and D. Srivastava. Using qgrams in a DBMS for approximate string processing. IEEE Data Engineering Bulletin, 24(4):28–34, 2001. [link]
  12. M. Hernandez and S. Stolfo. The merge/purge problem for large databases. In SIGMOD, 1995.  [link]
  13. M. Jaro. Probabilistic linkage of large public health data files. Statistics in Medicine, 1995. [link]
  14. M. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of Amer. Statistical Association, 1989.
  15. M. Lee, H. Lu, T. Ling, and Y. Ko. Cleansing data for mining and warehousing. In DEXA, 1999. [link]
  16. S. Tejada, C. A. Knoblock, and S. Minton. Learning object identification rules for information integration. Information Systems Journal, 2001. [link]
  17. W. Cohen, H. Kautz, and D. McAllester. Hardening soft information sources. In SIGKDD, 2000. [link]
  18. W. E. Winkler. The state of record linkage and current research problems. Technical Report, US Census Bureau, 1999. [link]
  19. W. W. Cohen. Integration of heterogeneous databases without common domains using queries based on textual similarity. In SIGMOD, 1998. [link]

 
