Writing your own application: A walk-through example
Let us now lead you through writing your own extraction application. Each application or extraction task requires specifications, and a XAR user must know how to write them. We describe this process with a complete walk-through example.
Take the domain of news stories related to venture funding and venture investments in start-up or small companies. Say we are interested in mentions of events in which a company has received funding or investment from one or more investors or investment funds. The specific details that we are interested in are (i) the name of the company that received the funding or investment, (ii) the date of the investment announcement, (iii) the amount received in funding, and (iv) the investor(s) that provided the investment.
For instance, from a text segment such as:
" February 19, 2007 PALO ALTO. Dash Navigation announced that it has secured $25 million in Series B funding, positioning the company for the national consumer launch of its Internet-connected automotive GPS device this fall. The round was led by Crescendo Ventures with new investors Artis Capital, ZenShin Capital Partners and Gold Hill Capital as well as additional participation from existing investors Kleiner Perkins Caufield & Byers, Sequoia Capital and Skymoon Ventures. "
The details of interest are (i) Dash Navigation (ii) February 19, 2007 (iii) $25 million (iv) Crescendo Ventures, Artis Capital, ZenShin Capital Partners, Gold Hill Capital
A corpus of such news stories is provided in the directory "examples/data/vc". We have already illustrated how you can create a databank of features extracted from the corpus in question. A databank called 'vcdb1' is already provided.
To use XAR for extraction of such information (in any domain), the user needs to provide four key specifications: 1) a schema describing the details of the information to be extracted, 2) a listing of common "patterns" of interest (we elaborate on this shortly), 3) a set of extraction rules, and 4) an (optional) set of constraints that the extracted data must satisfy.
We step through these in the context of the above example and domain.
1) Schema
Let us give your domain a name and call it the "venture" domain. Create a file corresponding to the schema for the domain called "ventureschema.txt". (Use the convention <domainname>schema.txt)
This file should contain the statement:
create event vfunding (company : organization, when : date, amount : money, investors : organization MV)
This is akin to an SQL "create" statement, where we have specified the various "attributes" to be extracted. For each attribute we have provided the type(s). Note also that investors can be multi-valued (MV).
The BNF specification for this "create" statement, permissible types etc. are provided here.
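To make the shape of the extracted record concrete, here is a minimal Python sketch of what one "vfunding" event looks like once filled in. It is not XAR code: the class name VFunding is hypothetical, and the date and money values are kept as raw text spans purely for illustration. The values come from the Dash Navigation example above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VFunding:
    """Illustrative stand-in for one extracted 'vfunding' event (not part of XAR)."""
    company: str                                         # company : organization
    when: str                                            # when : date (raw text span)
    amount: str                                          # amount : money (raw text span)
    investors: List[str] = field(default_factory=list)   # investors : organization MV


# The record we would hope to extract from the Dash Navigation story.
example = VFunding(
    company="Dash Navigation",
    when="February 19, 2007",
    amount="$25 million",
    investors=[
        "Crescendo Ventures",
        "Artis Capital",
        "ZenShin Capital Partners",
        "Gold Hill Capital",
    ],
)
```

Only the multi-valued attribute is a list; every other attribute holds a single value per event.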
2) Patterns
Let us now look at the dataset to see what heuristics we can use to craft some rules for extraction. Many of the reports have sentences of the form "......<X company > has raised <Y amount> in ...", or "...<X company> ... announced ... has secured <Y amount> for ..."
The phrases "has raised", "has secured" etc. seem to be good indicators of a sentence (or set of sentences) reporting an occurrence of a venture investment. We now specify that we want to explicitly highlight any such phrases in the text. Create a new file called "venturepatterns.txt" (again following the convention <domainname>patterns.txt) and specify the following:
"(has raised)" "TOKRAISED"
"(has secured)" "TOKRAISED"
This specifies that the above two phrases are to be flagged with the marker "TOKRAISED" anywhere they occur in the data.
The details of the regular expression format are provided here.
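The effect of the two pattern lines can be pictured with a small Python sketch using the standard re module. This is only an illustration of flagging phrases with a marker, not how XAR implements it; the function name flag_patterns and the tuple layout are assumptions made for the example.

```python
import re

# (regular expression, marker) pairs mirroring the two lines of venturepatterns.txt
PATTERNS = [
    (r"(has raised)", "TOKRAISED"),
    (r"(has secured)", "TOKRAISED"),
]


def flag_patterns(text):
    """Return (marker, start, end, matched_text) for every pattern hit, in text order."""
    hits = []
    for regex, marker in PATTERNS:
        for m in re.finditer(regex, text):
            hits.append((marker, m.start(), m.end(), m.group(1)))
    return sorted(hits, key=lambda h: h[1])


sample = "Dash Navigation announced that it has secured $25 million in Series B funding."
hits = flag_patterns(sample)
for marker, start, end, phrase in hits:
    print(marker, start, end, phrase)
```

Both phrases map to the same marker, so later rules can refer to a single TOKRAISED token regardless of which wording the story used.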
3) Extraction Rules
Let us first distill some heuristics for extracting the required details. After examining some of the documents, a reasonable set of heuristics seems to be:
(i) Any organization before the 'TOKRAISED' phrase in a sentence is likely the company (that has received funding).
(ii) Any (monetary) amount immediately after the 'TOKRAISED' phrase is likely the amount invested.
(iii) The first date mentioned in the document is likely the investment (announcement) date.
(iv) Any organizations mentioned after the 'TOKRAISED' phrase (in the same sentence or the sentence immediately following) are likely the investors.
These heuristics are by no means perfect, but they are reasonable.
Create another file called "venturerules.txt". In it, specify the following rules:
company(X) :- 0.8 tokraised(R), before(X,R), insamesentence(X,R).
amount(X) :- 1.0 tokraised(R), immafter(X,R).
investor(X) :- 0.8 tokraised(R), after(X,R), neighborsentence(X,R,2).
The details of the predicates provided by the system and the logic-based rule language are provided here.
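To see what the three rules in venturerules.txt are meant to do, here is a rough Python sketch that applies the same heuristics to a hand-annotated list of mentions. It is not the XAR rule engine. The Mention type, the apply_heuristics function, and the sentence and position numbers are all hypothetical, and the investor test follows heuristic (iv) as worded above (same sentence or the one immediately following), which is an assumption about what neighborsentence means.

```python
from typing import List, NamedTuple, Tuple


class Mention(NamedTuple):
    kind: str       # "organization", "money", "date", or "TOKRAISED"
    text: str
    sentence: int   # index of the sentence containing the mention
    pos: int        # order of the mention within the document


def apply_heuristics(mentions: List[Mention]) -> List[Tuple[str, str, float]]:
    """Return (attribute, value, confidence) candidates for each TOKRAISED marker."""
    ordered = sorted(mentions, key=lambda m: m.pos)
    results = []
    for i, r in enumerate(ordered):
        if r.kind != "TOKRAISED":
            continue
        for m in ordered:
            if m.kind != "organization":
                continue
            # company: organization before the marker, in the same sentence
            if m.pos < r.pos and m.sentence == r.sentence:
                results.append(("company", m.text, 0.8))
            # investor: organization after the marker, same or next sentence
            if m.pos > r.pos and 0 <= m.sentence - r.sentence <= 1:
                results.append(("investor", m.text, 0.8))
        # amount: a money mention immediately after the marker
        if i + 1 < len(ordered) and ordered[i + 1].kind == "money":
            results.append(("amount", ordered[i + 1].text, 1.0))
    return results


mentions = [
    Mention("date", "February 19, 2007", 0, 0),
    Mention("organization", "Dash Navigation", 1, 1),
    Mention("TOKRAISED", "has secured", 1, 2),
    Mention("money", "$25 million", 1, 3),
    Mention("organization", "Crescendo Ventures", 2, 4),
    Mention("organization", "Artis Capital", 2, 5),
]
results = apply_heuristics(mentions)
for candidate in results:
    print(candidate)
```

The numbers 0.8 and 1.0 are carried over from the rules unchanged; the sketch only attaches them to each candidate and does not model how the system uses them.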
4) Constraints
Finally, let us look at any constraints that we can specify about the data. These help in making the extraction more accurate and in eliminating errors. In this domain there are two such constraints that come to mind, namely (i) it is unlikely that the investment amount is greater than $30 million, and (ii) the total investment is greater than the current investment. Create a file called "ventureconstraints.txt" and specify the following:
Domain
check amount < 30
Tuple
check amount < total
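The intent of the two checks can be sketched in Python as follows. This is an illustration only, under several assumptions: amounts are expressed in millions of dollars (so that 30 stands for $30 million), a candidate record is a plain dictionary, and "total" is a value available alongside the amount even though it is not an attribute of the vfunding schema shown earlier. The function names are hypothetical.

```python
def check_domain(amount_millions, limit=30.0):
    """Domain check: the amount should be below the limit (assumed to be in $ millions)."""
    return amount_millions < limit


def check_tuple(record):
    """Tuple check: within one record, the amount should be below the total.

    Assumption: a record without a 'total' value is not rejected.
    """
    total = record.get("total")
    if total is None:
        return True
    return record["amount"] < total


candidates = [
    {"company": "Dash Navigation", "amount": 25.0, "total": 40.0},
    {"company": "Hypothetical Co", "amount": 45.0, "total": 40.0},
]
accepted = [c for c in candidates if check_domain(c["amount"]) and check_tuple(c)]
print([c["company"] for c in accepted])
```

The first check looks at a single value on its own, while the second compares two values within the same extracted tuple.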
At this point you have four specifications: ventureschema.txt, venturepatterns.txt, venturerules.txt, and ventureconstraints.txt.
You may now run the application to extract data, exactly as we did with the pre-assembled application earlier.