
What is XAR?
XAR is a general-purpose framework and platform for building information extraction applications and systems. A particular focus is the extraction of relations, or slot-filling, from free text. For each new extraction task, the user typically writes some XAR extraction rules, which are based on Datalog. Consider the task of extracting a relation describing reported corporate acquisitions (details such as the buyer, the company that was bought, the acquisition amount, etc.) from free text such as:
"Microsoft Corp today acquired ABC networks for $150 million in cash."
buyer(X) :- organization(X), anchor(A), immbefore(X,A).
buyer(X) :- anchor(A), who(A,X).
Also shown above are two Datalog-based extraction rules that assign tokens from the text to the 'buyer' slot. The first rule is based on a general pattern: "any token which is an organization and which appears just before an anchor word in the text is likely the buyer". Some such rules have to be authored manually for each task; however, we provide some attractive functionalities for the user:
Automated feature extraction: The predicates in the rule bodies above refer to properties of tokens and entities and to relationships among them: for instance, that something is an organization or an "anchor" (usually an important action such as 'bought' or 'acquired'), or that some token is immediately before another token. A comprehensive set of such predicates is made available automatically to the user.
Exploiting semantic information: The user can further specify semantic information about the data to be extracted. Typically, in slot-filling, we are interested in extracting a relation (of facts or events) from text. We can specify a schema and integrity constraints for this relation, which are in fact exploited to improve the accuracy of extraction and to significantly reduce the user effort spent on writing rules.
Extensibility: The XAR architecture is modular and open. For extracting features of tokens we have integrated two analyzers by default: (i) GATE, which does named entity recognition and other "shallow" analysis; this helps us obtain features such as whether a particular token is a named entity (an organization, location, etc.) or a specific type of word (an anchor, thing, etc.), as well as structural information (what token is before a certain other token, etc.), and (ii) the StanfordParser, a natural language parser with which we can (optionally) do a complete semantic parse of certain sentences and derive information from that analysis. The "who" predicate in the second rule above is actually synthesized from a semantic parse of a sentence. The user, however, is free to plug-and-play other analyzers (for instance, using a framework such as UIMA as an alternative to GATE, or using a combination of the two), to use an alternative NL parser or multiple parsers, or even to integrate different kinds of analyzers, such as a text classifier.
Probabilistic framework: Uncertainty is intrinsically associated with any automated extraction process. No step in the "extraction pipeline", be it the identification of named entities or other tokens, the parsing of sentences, or the association of tokens with slots based on the extraction rules, is error-free. The XAR framework allows for the incorporation of probabilistic confidence values associated with the various predicates and extraction rules.
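To make the rule semantics and the role of confidence values more concrete, here is a minimal Python sketch that evaluates the two 'buyer' rules over a handful of facts. It is not XAR code and does not use any XAR API. The facts are hand-written stand-ins for analyzer output, every confidence number is invented, and the way confidences are combined (a product within one rule, a noisy-or across rules, both assuming independence) is an assumption made for illustration, since the combination XAR actually uses is not described here. In these facts "today" sits between "Microsoft Corp" and the anchor "acquired", so under a literal reading of immbefore only the second rule fires.

```python
# Hypothetical facts standing in for analyzer output on the example sentence.
# Keys are argument tuples; values are invented confidence scores in [0, 1].
FACTS = {
    "organization": {("Microsoft Corp",): 0.9, ("ABC networks",): 0.8},
    "anchor": {("acquired",): 0.95},
    "immbefore": {("today", "acquired"): 1.0, ("acquired", "ABC networks"): 1.0},
    "who": {("acquired", "Microsoft Corp"): 0.7},
}


def conf(pred, *args):
    """Confidence of a fact, or 0.0 if the fact is absent."""
    return FACTS[pred].get(args, 0.0)


def noisy_or(probabilities):
    """Combine independent derivations of the same conclusion (assumed scheme)."""
    none_hold = 1.0
    for p in probabilities:
        none_hold *= 1.0 - p
    return 1.0 - none_hold


def buyer_candidates():
    """Score each token as a 'buyer' using the two rules from the text."""
    derivations = {}
    for (a,), p_anchor in FACTS["anchor"].items():
        # Rule 1: buyer(X) :- organization(X), anchor(A), immbefore(X,A).
        for (x,), p_org in FACTS["organization"].items():
            p = p_org * p_anchor * conf("immbefore", x, a)
            if p > 0.0:
                derivations.setdefault(x, []).append(p)
        # Rule 2: buyer(X) :- anchor(A), who(A,X).
        for (anchor_arg, x), p_who in FACTS["who"].items():
            if anchor_arg == a:
                derivations.setdefault(x, []).append(p_anchor * p_who)
    return {x: noisy_or(ps) for x, ps in derivations.items()}


if __name__ == "__main__":
    for token, score in buyer_candidates().items():
        print(f"buyer({token!r}) with confidence {score:.3f}")
```

With these facts the only candidate is "Microsoft Corp", derived through the "who" predicate. Adding a fact such as immbefore("Microsoft Corp", "acquired") would let the first rule fire as well, and the noisy-or would then raise the combined score above either individual derivation.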
Usage and Effectiveness
You can get a quick assessment of the usage (i.e., what is required of the user in terms of extraction rules, etc.) and of the effectiveness (i.e., extraction accuracy) from the results of a news stories extraction task described here.
Obtaining XAR
Please send an email to ashish who resides at ics dot uci dot edu for a beta version of the software (and source code).
Documentation
1. Installation
4. Detailed description of the extraction specification and language
5. Using XAR
6. Research Issues for Exploration
Developers
The XAR system has been developed at UC-Irvine under the aegis of the NSF funded RESCUE and SAMI projects. The primary architect and developer of the XAR system is Naveen Ashish. Sharad Mehrotra is a collaborator on the design aspects.
XAR is at this point a research prototype and has not (yet) had the benefit of any software engineering help, testing, or support. While I do not promise well-tested, bug-free code, we are making the source code available. Please do email me any queries or requests for clarification, either about system usage or about the system itself, and I will try my best to respond in good time.