Course project: spreadsheet to XML converter
XML Data Management (Winter 2009)
offered by Professor Michael Carey
Members:
Shengyue Ji
Mingyan Gao
Download:
ss2xml.py (Command line tool)
xsdgen.py (GUI tool on top of ss2xml)
Requirements:
Python, SQLite(should come with Python)
PyGTK GTK+ for xsdgen.py
Brief Description
Many scientists use spreadsheets as a "poor man's database" for storing scientific data that may need to be shared. Design and implement a facility for automatically inferring "good" XML Schemas and extracting XML data from data stored in such spreadsheets. Make sure that your approach works for a variety of the sorts of spreadsheets that are found in practice, and consider including a "human assistance" capability to allow a domain expert to help the system arrive at an appropriate schema design. Test your approach on a variety of samples, some provided by Google and others provided by crawling the web and/or talking to various data owners in science-land.
Part 1 - Command line tool ss2xml
ss2xml.py reads spreadsheet file from stdin, generates XML file and writes to stdout,
according to predefined conversion rules. Conversion rules are specified using extended XML schema file (XSD).
The idea is to add special attributes on standard XSD to guide the conversion (e.g., how to map values, and
how to wrap up data from flat form to nested form). SQLite is heavily used to avoid repeated efforts such as
expression evaluation. In general each input spreadsheet file is defined in @bbx:define on xsd:annotation.
Data are mapped from spreadsheet to XML using @bbx:value on xsd:element and xsd:attribute. Flat data are wrapped
up using @bbx:foreach to created nested structure. See the following for conversion rule details.
(This is too hard to use, I'm gonna jump to Part 2 to see the GUI tool.)
Example
./ss2xml.py rules.xsd < input.csv > output.xml
Conversion rule XSD
- XSD can be extended in two ways:
- All elements of XSD can take {any attributes with non-schema namespace . . .}
- xsd:appinfo and xsd:documentation can take ({any})* as their children
We use the first one to extend XSD to support conversion rules in the following way:
xmlns:bbx = "http://blueberry.ics.uci.edu/xml"
- Introduce @bbx:define on xsd:annotation
- Introduce @bbx:foreach on xsd:element, xsd:sequence and xsd:choice
- Introduce @bbx:value on xsd:element and xsd:attribute
- Introduce @bbx:test on every element of XSD
- @bbx:define clauses can be used in 2+ ways:
- Load specified part of a CSV file into a SQLite table
bbx:define="table_name=csv:file_name[,starting_row[,ending_row[,starting_column[,ending_column]]]]"
The default values for optional arguments are: 1, last_row, a, last_column.
Example: ... file.csv,2,,,c loads the table starting from row 2 to the end, with columns a, b, and c.
The name of a table column is _ concatenated with the corresponding column name. E.g., _a, _b, ..., _z, _aa, _ab, ... _zz.
- Load specified part of an XLS file into a SQLite table: to be implemented, or not
Tab needs to be optionally specified in addition to rows and columns.
- Create a SQLite view/table using an SQL select statement
bbx:define="table_name=sql:select_statement"
SQL select statements are instantiated at running time.
- @bbx:foreach clauses can be used in 2 ways:
- Read one record each time into the specified variable from the results of an SQL select statement
bbx:foreach="$variable_name=sql:select_statement"
Elements with @bbx:foreach are processed the number of times same as the cardinality of the SQL selection.
- Read one table each time as a view from a list of existing SQLite tables
bbx:foreach="view_name=itr:table_1,table_2, ..., table_n"
Elements with @bbx:foreach are processed the number of times same as the length of the table list.
- @bbx:value clauses can be used in 2 ways:
- Read one value from a variable
bbx:value="var:$variable_name.column_name" or bbx:value="var:$variable_name[column_index]"
column_index starts from zero.
- Read one value from the results of an SQL select statement
bbx:value="sql:select_statement"
The first column of the first record in the results is returned as @bbx:value.
- @bbx:test clauses are used in the same way as @bbx:value.
The result value is tested using if statement of Python: 0 and "" evaluate to false; the rest evaluate to true.
Tables/variables
- Table names shall be defined/referenced without any decorations; record variable names shall be defined/referenced with a $ prefix (even in SQL select statements).
- Table/variable names shall never be defined more than once (either though bbx:define or bbx:foreach clause).
- The scope of a table or a variable is:
- For bbx:define on xsd:annotation:
>From the xsd:annotation element to the ending position of the parent of xsd:annotation;
- For bbx:foreach:
The entire element on which bbx:foreach is used.
Tables/variables shall never be referenced beyond their valid scopes.
- Column values from variables can be used in SQL select statements. E.g., "sql:select * from table1 where $rec._a = 3".
These select statements are instantiated each time they are processed (possibly with different $rec).
@bbx:test
- @bbx:test is used to decide whether the current xsd element shall produce output target or be skipped.
- When specified, the target element is forced to be outputted when the test passes, or skipped when the test fails.
- When unspecified, the output target shall be skipped to avoid producing empty target element if all of the following holds:
- Its children produces empty,
- It doesn't have any attribute,
- It doesn't have @bbx:value.
- @bbx:test on the same XSD element can possibly be evaluated to different values in different iterations.
Misc
- ss2xml.py does not validate the input XSD file.
- XML files produced from ss2xml.py do not necessarily conform to the schema (possible violations: typing, cardinality, key/unique...).
- The following features of XSD are not supported when generating the XML file: include, import, redefine, @substitutionGroup, abstract types, namespace, simpleContent/complexContent and related features.
- Global declaration is supported on simpleType, complexType, group, attributeGroup, element, and attribute.
- Global declaration of multiple elements: Caution! Use @bbx:test to avoid generation of multiple root elements.
Part 2 - GUI tool xsdgen
Additional requirement: PyGTK and GTK+.
"It seems the conversion rule XSD is very interesting,
but who's willing to write the complicated syntax you defined?"
xsdgen does. The purpose of this tool is let user generate the XSD rule file and then
convert spreadsheet to XML by clicking on GUI.
This tool depends on ss2xml.
Step 1: Add table(s).
Step 2: Verify table.
Step 3: Add cursor(s).
Step 4: Create XML tree and map values.
Step 5: Generate the XSD conversion rule.
Step 6: Generate the XML.