We describe the specification language for the schema, extraction rules, and constraints formally and in more detail. In addition, we provide specifications for some preassembled applications in the directory "preassembledspecs". There you will find a complete schema, set of extraction rules, and constraints for 3 different extraction tasks.

1. Schema

The schema is a list of comma-separated items of the form

buyer : company

To the right of the : we specify the type. We can specify multiple types, for instance

buyer : company university

The basic types the system recognizes are person, organization, location, country, thing, action, money, date, time, source, degree, university, and title.

Arity and NULL-related constraints can also be specified. For instance

buyer : company university not NULL MV

So a complete schema is:

create table acquire( buyer : company university not NULL, amount : money, whom: company not NULL).
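To make the structure of a schema statement concrete, here is a minimal Python sketch that splits the example above into attributes, types, and flags. It is not the system's parser: the function name parse_schema and the dictionary keys (table, attributes, types, not_null, mv) are hypothetical names chosen for illustration, and MV is only recorded as a flag without interpreting it.

```python
import re

def parse_schema(statement):
    """Split 'create table name( attr : types [not NULL] [MV], ... ).' into parts."""
    m = re.match(r"\s*create\s+table\s+(\w+)\s*\((.*)\)\s*\.?\s*$", statement, re.S)
    if m is None:
        raise ValueError("not a create statement")
    table, body = m.group(1), m.group(2)
    attributes = []
    for item in body.split(","):
        name, _, spec = item.partition(":")
        words = spec.split()
        types, not_null, mv = [], False, False
        i = 0
        while i < len(words):
            if words[i].lower() == "not" and i + 1 < len(words) and words[i + 1].upper() == "NULL":
                not_null = True      # "not NULL" marker (two words)
                i += 2
            elif words[i] == "MV":
                mv = True            # recorded as a flag only (assumption)
                i += 1
            else:
                types.append(words[i])  # anything else is treated as a type name
                i += 1
        attributes.append(
            {"name": name.strip(), "types": types, "not_null": not_null, "mv": mv}
        )
    return {"table": table, "attributes": attributes}

schema = parse_schema(
    "create table acquire( buyer : company university not NULL, "
    "amount : money, whom: company not NULL)."
)
```

Under these assumptions, schema["attributes"] holds three entries (buyer, amount, whom), with buyer carrying the two types company and university and the not NULL flag.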

createstatement := <create> <event> X <(> attribute_list <)>

attribute_list :=

2. Rules

(i) Predicates. A number of tokens of various kinds get instantiated after feature extraction. We have a predicate corresponding to each kind, namely person(X), organization(X), thing(X), date(X), location(X), and sentence(X).

There are also some special predicates. The vbanchor predicate is true for all tokens that are anchor verbs (as specified in <>rules.txt). Special predicates are also generated corresponding to any patterns specified in <>patterns.txt. For instance, for vcdeals we have special predicates such as "toksecured" etc.

(ii) Other features and predicates

Let us use the following text segment as an example. "Microsoft today bought ABC networks for $45 million in cash. Microsoft stock rose slightly by 0.5% to close at $25.76 a share at the end of trading today. Neither company provided any further details."

Predicate : Description (an example follows on the next line)

sentence(X,S) : The token X is in sentence S.
    Example: sentence(microsoft,1)
position(X,P) : The position P of the token X.
    Example: position(abc_networks, 24)
before(X,Y) : The (absolute) position of (token) X is before that of (token) Y.
    Example: before(microsoft, abc_networks)*
after(X,Y) : The (absolute) position of (token) X is before that of (token) Y.
    Example: after(bought, cash)
between(X,Y,Z) : Token Z is between token X and token Y.
    Example: between(microsoft,45_million,abc_networks)
immbefore(X,Y) : Token X is immediately before token Y (no tokens in between X and Y).
    Example: immbefore(bought, abc_networks)
immafter(X,Y) : Token X is immediately after token Y (no tokens in between X and Y).
    Example: immafter(microsoft, today)
neighbor(X,Y) : Sentences X and Y are adjacent.
    Example: neighbor(2,3)
neighbor(X,Y,D) : Sentences X and Y are at most D sentences apart.
    Example: neighbor(1,3,2)
insamesentence(X,Y) : X and Y are (any) tokens in the same sentence.
    Example: insamesentence(microsoft, abc_networks)
who(X,Y) : Y is typically the doer or subject of the action X.
    Example: who(bought, microsoft)
what(X,Y) : Y is typically the object of the action X.
    Example: what(bought, abc_networks)
whrwhn(X,Y) : Y is information prepositionally associated with the action X.
    Example: whrwhn(bought, $45million_cash)
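As an illustration of how some of these positional predicates could be evaluated, the following Python sketch works over a small hand-tokenized version of the example text. Everything in it is an assumption made for illustration: the token list, the use of token indices as positions (this does not reproduce the numbering shown for position(X,P)), the omission of repeated tokens so that lookup by name stays simple, and the choice that a sentence is not its own neighbor.

```python
# (token, sentence number) pairs, hand-tokenized; repeated tokens are left out.
TOKENS = [
    ("microsoft", 1), ("today", 1), ("bought", 1), ("abc_networks", 1),
    ("for", 1), ("45_million", 1), ("in", 1), ("cash", 1),
    ("stock", 2), ("rose", 2),
]

POS = {tok: i for i, (tok, _) in enumerate(TOKENS)}   # token index as position
SENT = {tok: s for tok, s in TOKENS}                  # sentence of each token

def before(x, y):
    return POS[x] < POS[y]

def between(x, y, z):
    # Token z lies between token x and token y.
    return POS[x] < POS[z] < POS[y]

def immbefore(x, y):
    # No tokens in between x and y.
    return POS[y] - POS[x] == 1

def insamesentence(x, y):
    return SENT[x] == SENT[y]

def neighbor(s1, s2, d=1):
    # Sentences at most d apart; d=1 means adjacent.
    return 0 < abs(s1 - s2) <= d
```

With these definitions, before("microsoft", "abc_networks"), between("microsoft", "45_million", "abc_networks"), immbefore("bought", "abc_networks"), and neighbor(1, 3, 2) all evaluate to True, matching the examples listed above.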

(iii) The declarative rules.

A rule is basically of the form:

<rulehead> :- <confidence>* <list of predicates>

The confidence value is optional.

The following are the properties of the rules.

1. The head corresponds to a slot

For instance:

R1: buyer1(X) :- organization(X), vbanchor(V), before(X,V).

R2: buyer2(X) :- vbanchor(V), who(V,X).

R3: buyer(X) :- 0.7 buyer1(X).

R4: buyer(X) :- 0.8 buyer2(X).

R1 and R2 are intermediate rules, while R3 and R4 are assignment rules. In every case, though, the head corresponds to a slot, here buyer.
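The interplay of intermediate and assignment rules can be sketched in Python as follows. The facts are hand-built for the sentence "Microsoft today bought ABC networks for $45 million in cash." and are hypothetical. How the system combines confidences when several rules assign the same value is not stated here; taking the maximum is purely an assumption of this sketch.

```python
# Hand-built facts (assumed for illustration).
position = {"microsoft": 0, "bought": 2, "abc_networks": 3}
organization = {"microsoft", "abc_networks"}
vbanchor = {"bought"}
who = {("bought", "microsoft")}   # who(V, X): X is the doer of action V

# R1: buyer1(X) :- organization(X), vbanchor(V), before(X,V).
buyer1 = {x for x in organization for v in vbanchor if position[x] < position[v]}

# R2: buyer2(X) :- vbanchor(V), who(V,X).
buyer2 = {x for (v, x) in who if v in vbanchor}

# R3 and R4: assign buyer with confidence 0.7 and 0.8 respectively.
# Keeping the highest confidence per value is an assumption of this sketch.
buyer = {}
for conf, candidates in ((0.7, buyer1), (0.8, buyer2)):
    for x in candidates:
        buyer[x] = max(buyer.get(x, 0.0), conf)
```

Both intermediate rules select microsoft here, so buyer ends up with a single candidate whose confidence is the higher of the two.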

2. The body may refer to EDB predicates as well as heads of other rules

buyer(X) :- whom(W), organization(X), insamesentence(X,W), before(X,W).

The rule for buyer refers to whom, which is a slot predicate.

3. Multiple rules are permitted for the same slot

4. Negation is permitted but only for predicates in the body of the rule.

R1: buyer1(X) :- vbanchor(V), who(V,X).

R2: buyer2(X) :- organization(X), vbanchor(V), before(X,V), not buyer1(Z).

So R2 fires only if R1 cannot be satisfied. Note also that the rules must be stratified with respect to negation.
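The effect of stratification can be sketched as evaluating the negated predicate completely before any rule that negates it. The Python below is a hedged illustration, not the system's evaluator: the function name evaluate and the facts are hypothetical, and "not buyer1(Z)" is read as "no buyer1 fact exists", following the explanation above.

```python
def evaluate(organization, vbanchor, position, who):
    """Evaluate R1, then R2, in stratum order."""
    # Stratum 1, R1: buyer1(X) :- vbanchor(V), who(V,X).
    buyer1 = {x for (v, x) in who if v in vbanchor}

    # Stratum 2, R2: only considered once buyer1 is fully known.
    buyer2 = set()
    if not buyer1:  # not buyer1(Z): no buyer1 fact was derived
        buyer2 = {
            x for x in organization for v in vbanchor if position[x] < position[v]
        }
    return buyer1, buyer2

position = {"microsoft": 0, "bought": 2, "abc_networks": 3}
organization = {"microsoft", "abc_networks"}
vbanchor = {"bought"}

# With a who fact, R1 succeeds and R2 is blocked.
with_who = evaluate(organization, vbanchor, position, {("bought", "microsoft")})
# Without one, R1 derives nothing and R2 acts as the fallback.
without_who = evaluate(organization, vbanchor, position, set())
```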

5. There are several primitives provided for the user's convenience

(i) firstamongst, lastamongst

Choose the first/last amongst (any) set of tokens; first and last refer to the order of placement in the text segment.

myanchor(X) :- anchor(A), action(X), after(A,X).

firstamongst[myanchor(X)].

"Microsoft bought ABC networks and later announced that it would make further acquisitions this year."

myanchor defines actions that appear after an anchor word. In the above example these are 'announced' and 'make'. firstamongst selects the first of these, i.e., 'announced'.
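A minimal Python sketch of this selection, assuming a hand-tokenized sentence and hand-assigned anchor and action sets (all hypothetical), with token index standing in for placement in the text:

```python
tokens = [
    "microsoft", "bought", "abc_networks", "and", "later", "announced",
    "that", "it", "would", "make", "further", "acquisitions", "this", "year",
]
pos = {tok: i for i, tok in enumerate(tokens)}

anchor = {"bought"}                       # assumed anchor word
action = {"bought", "announced", "make"}  # assumed action tokens

# myanchor(X) :- anchor(A), action(X), after(A,X).
# Read here as: X is an action placed after the anchor A.
myanchor = {x for x in action for a in anchor if pos[a] < pos[x]}

def firstamongst(candidates):
    # Earliest token in text order.
    return min(candidates, key=pos.__getitem__)

def lastamongst(candidates):
    # Latest token in text order.
    return max(candidates, key=pos.__getitem__)

first = firstamongst(myanchor)
last = lastamongst(myanchor)
```

Here myanchor contains 'announced' and 'make', first is 'announced', and last is 'make'.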

(ii) Grounding values

We can refer to ground values of tokens.

... :- .... action(X)[announced]

refers to action tokens with the specific value "announced".

3. Constraints

Constraints are essentially integrity constraints over the relation to be extracted. These are specified in regular syntax. Constraints can be at the attribute, tuple, or relation level. Some examples follow.

1. Attribute level constraint

The acquisition amount has to be at least $15 million

check amount > $15 million

2. Tuple level constraint

The total funding amount (in the VC funding application) must be greater than the current amount secured

check total > current
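The two checks above can be sketched as predicates over extracted tuples. This is an illustration under stated assumptions rather than the system's behavior: tuples are modeled as Python dictionaries with numeric amounts in dollars, the sample tuples are invented, the comparison follows the check expressions as written (strictly greater than), and dropping tuples that fail a check is an assumption about how violations are handled.

```python
def satisfies(tup, checks):
    """True if the tuple passes every check in the list."""
    return all(check(tup) for check in checks)

# Attribute-level: check amount > $15 million
acquire_checks = [lambda t: t["amount"] > 15_000_000]

# Tuple-level: check total > current
funding_checks = [lambda t: t["total"] > t["current"]]

# Invented sample tuples for the acquire relation.
acquisitions = [
    {"buyer": "microsoft", "whom": "abc_networks", "amount": 45_000_000},
    {"buyer": "hypothetical_co", "whom": "another_co", "amount": 5_000_000},
]

# Assumption: tuples violating a constraint are discarded.
kept = [t for t in acquisitions if satisfies(t, acquire_checks)]
```

Only the first tuple survives the attribute-level check; a funding tuple such as {"total": 10, "current": 4} would pass the tuple-level check, while one with equal values would not.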