Software Data Interchange with GXL: Implementation Issues

Susan Elliott Sim
University of Toronto
simsuz@cs.utoronto.ca
Keywords: standard exchange format, XML, reverse engineering, graph data model

Abstract

This workshop is a sequel to the GXL introduction and tutorial held in the morning.  Attendees will need to be familiar with the material presented in the morning workshop to participate fully in this one.

As GXL becomes widely accepted as a standard exchange format for software data, some common practical issues emerge.  This workshop will cover two of them: implementation of GXL converters and the development of an architectural reference schema.  Most groups will need to create a converter between GXL and their local format.  Two early adopters will present experience reports on how they implemented converters, using their own programs and existing XML components.  After the presentations, there will be a question and answer period.

Subsequently, as groups exchange data with each other, they need to agree on what types of information they will provide to each other.  This agreement is manifest as a reference schema, or object model, consisting of a standard set of entities, attributes, and relations necessary for a set of tasks or analyses.  At an earlier workshop it was decided that at least three reference schemas at different levels would be needed; listed in order of increasingly fine granularity, they are the architectural level, the program entity level, and the abstract syntax tree level.  In the second part of this workshop, we will attempt to develop one of these reference schemas.  This portion of the workshop will be fairly technical and discussion-oriented.
 

1. Background

Over the last few years, there has been a growing consensus in the reverse engineering community that a standard exchange format was necessary to advance the state of the art.  This format would be used to share data about software systems and enable greater interoperability between tools.  GXL (Graph eXchange Language) is an XML sublanguage that is emerging as such a standard exchange format.  The data model in GXL is a typed, attributed graph and it includes facilities for encoding the schema along with instance data.  Its development was influenced by data formats from reverse engineering, such as RSF, TA, and RPA.  As a result, this format can be applied to software easily.
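To make the data model concrete, the following is a minimal sketch of a typed, attributed graph written out as GXL using Python's standard library.  The element and attribute names (gxl, graph, node, edge, type, attr, xlink:href) follow the GXL DTD as later published and should be read as assumptions here, since the definition was still changing at the time of the workshop.  The type names (Function, Call), the node ids, and the helper functions are purely illustrative.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"
ET.register_namespace("xlink", XLINK)


def add_node(graph, node_id, type_name, **attrs):
    """Append a typed node with string-valued attributes to a <graph>."""
    node = ET.SubElement(graph, "node", id=node_id)
    # The type is a link into a schema; "#Function"-style fragments are
    # placeholders (assumption), not references to a real schema document.
    ET.SubElement(node, "type").set(f"{{{XLINK}}}href", f"#{type_name}")
    for name, value in attrs.items():
        attr = ET.SubElement(node, "attr", name=name)
        ET.SubElement(attr, "string").text = str(value)
    return node


def add_edge(graph, source, target, type_name):
    """Append a typed edge from one node id to another."""
    edge = ET.SubElement(graph, "edge", {"from": source, "to": target})
    ET.SubElement(edge, "type").set(f"{{{XLINK}}}href", f"#{type_name}")
    return edge


def build_example():
    """Build a two-node call graph: main calls parse."""
    root = ET.Element("gxl")
    graph = ET.SubElement(root, "graph", id="example", edgemode="directed")
    add_node(graph, "n1", "Function", name="main", file="main.c")
    add_node(graph, "n2", "Function", name="parse", file="parse.c")
    add_edge(graph, "n1", "n2", "Call")
    return root


if __name__ == "__main__":
    print(ET.tostring(build_example(), encoding="unicode"))
```

Every node and edge carries a type link, and any further information is attached as named attributes, which is what allows a schema to be exchanged alongside the instance data.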

While the basic concepts of GXL are well established, work remains to be done to finalise the mechanism for representing schemas and to establish protocols, or reference models,  for its use.  This workshop is part of a sequence of meetings where researchers have met to discuss exchange formats and shared their experiences with them.

2. Goals of workshop

The goal of this workshop is to disseminate information to aid in the adoption and refinement of GXL.  There are two parts to the workshop; both were included because they are problems that need to be resolved to make GXL a widely accepted exchange format.

3. Creating Converters

3.1 Guy St-Denis, Université de Montréal

The SPOOL tool suite analyses object-oriented source code to extract design-level diagrams and metrics.  The project uses the Datrix(TM) parsers from Bell Canada, which emit abstract syntax graphs for C++ and Java using the TA language, while the repository is stored in XMI (XML Metadata Interchange).  Consequently, a converter, called the SPOOL Gateway, was written to convert between the two formats.  The TA language is a graph-based format that is a predecessor of GXL, while XMI is an XML-based language for encoding models and metamodels.  While there were idiosyncrasies in the syntax of the formats, the most significant challenge in writing the converter was mapping between the schema used by Datrix (abstract syntax graphs) and the one used in XML to encode UML diagrams.  This presentation discusses this challenge and others encountered while implementing the SPOOL Gateway.

Guy St-Denis is a Master's student at Université de Montréal.

3.2 Jeff Michaud, University of Victoria

To further extend Rigi's flexibility and compatibility with other tools, a converter was written between RSF (Rigi Standard Form) and GXL.  This talk covers the steps needed to implement this converter, including coping with a changing GXL definition.  Although RSF and GXL both use a graph-based data model, there were a number of open questions about converting between schemas.
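As an illustration of what such a converter involves, here is a minimal sketch that is not the converter described in the talk.  It assumes unstructured RSF in which each line is a whitespace-separated triple "relation source target", with "type" triples assigning a node type; quoted tokens and attributes are ignored.  The GXL element names are likewise assumptions, and the function name rsf_to_gxl and the fallback type "Unknown" are hypothetical.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"
ET.register_namespace("xlink", XLINK)


def rsf_to_gxl(rsf_text, graph_id="rsf"):
    """Convert 'relation source target' triples into a GXL element tree."""
    node_types = {}  # node name -> type name, taken from 'type' triples
    edges = []       # (relation, source, target) in input order
    for line in rsf_text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # this sketch skips blank or malformed lines
        relation, source, target = parts
        if relation == "type":
            node_types[source] = target
        else:
            edges.append((relation, source, target))
            # Nodes that never receive a 'type' triple get a placeholder.
            node_types.setdefault(source, "Unknown")
            node_types.setdefault(target, "Unknown")

    root = ET.Element("gxl")
    graph = ET.SubElement(root, "graph", id=graph_id, edgemode="directed")
    # Note: RSF names are used directly as ids here; a real converter
    # would have to map names that are not valid XML identifiers.
    for name in sorted(node_types):
        node = ET.SubElement(graph, "node", id=name)
        ET.SubElement(node, "type").set(f"{{{XLINK}}}href", "#" + node_types[name])
    for relation, source, target in edges:
        edge = ET.SubElement(graph, "edge", {"from": source, "to": target})
        ET.SubElement(edge, "type").set(f"{{{XLINK}}}href", "#" + relation)
    return root


if __name__ == "__main__":
    sample = "type main Function\ntype parse Function\ncall main parse\n"
    print(ET.tostring(rsf_to_gxl(sample), encoding="unicode"))
```

The syntactic translation is short; the open questions mentioned above concern the choices hidden in it, such as which RSF relations become edge types and which become node types or attributes.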

Jeff Michaud is a Master's student at the University of Victoria and works on the SHriMP software.
 

3.3 Discussion

After the two presentations, Jeff and Guy formed a panel to discuss commonalities in their experiences and to take questions from the audience.

All things considered, the XML sublanguages (GXL and XMI), tools, and libraries were fairly easy to learn and use.  Both Guy and Jeff agreed that the most difficult aspect of writing converters was finding a reasonable mapping between the schemas of the source and target formats.  Some of the features of GXL, such as the user-defined attribute-value pairs, made it easy to defer the interpretation to the end user.  Howard Johnson suggested that some GXL-to-GXL converters should be created to perform transformations between schemas.  Consequently, programmers would only need to write converters that modified syntax.  Jeff added that these converters would be similar to DTD-based converters used by the XML community.
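A GXL-to-GXL converter of the kind suggested could, in its simplest form, rewrite type references from one schema to another.  The sketch below assumes that types are expressed as xlink:href links on type elements; the mapping TYPE_MAP, the type names, and the function retype are hypothetical.  Real schema transformations would generally need structural changes as well, which this does not attempt.

```python
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"
ET.register_namespace("xlink", "http://www.w3.org/1999/xlink")

# Hypothetical mapping from one group's type names to another's.
TYPE_MAP = {"#Function": "#Routine", "#Call": "#Invokes"}

SAMPLE = """<gxl xmlns:xlink="http://www.w3.org/1999/xlink">
  <graph id="g" edgemode="directed">
    <node id="n1"><type xlink:href="#Function"/></node>
    <node id="n2"><type xlink:href="#Macro"/></node>
    <edge from="n1" to="n2"><type xlink:href="#Call"/></edge>
  </graph>
</gxl>"""


def retype(root, type_map):
    """Rewrite type links in place; return the set of links left unmapped."""
    unmapped = set()
    for type_el in root.iter("type"):
        href = type_el.get(XLINK_HREF)
        if href in type_map:
            type_el.set(XLINK_HREF, type_map[href])
        elif href is not None:
            unmapped.add(href)  # reported so the mapping can be extended
    return unmapped


if __name__ == "__main__":
    tree = ET.fromstring(SAMPLE)
    print("unmapped:", retype(tree, TYPE_MAP))
    print(ET.tostring(tree, encoding="unicode"))
```

Returning the unmapped types, rather than failing, reflects the difficulty reported by both presenters: the mapping between schemas is rarely complete on the first attempt.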

Ric Holt pointed out that these difficulties demonstrate the need for another layer of "grammar" that would allow people to specify the "shape" of their data.  He illustrated this idea with the following diagram on a flip chart at the front of the room.
 

Grammar Level    Interpretation
-------------    --------------
schema           semantics of data being encoded
GXL              attributed, typed, optionally directed graph
XML              well-formed document with begin/end tags

A reference schema would be a canonical or standard schema for a particular level of analysis, say language level or architectural level.  Mappings between other schemas could be derived from the reference schema.  Both Jeff and Guy agreed that the existence of reference schemas would have simplified the task of creating mappings.

A common set of test cases would have been helpful.  For example, for a given input, what is the "correct" GXL for a particular level of analysis or schema?
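One way such test cases might be checked is to compare a converter's output with an agreed expected document as graphs rather than as text, so that differences in element order do not matter.  The sketch below is an assumption about how this could be done, not something proposed at the workshop; it uses the same assumed GXL element names, ignores attributes, and collapses duplicate edges because it compares sets.  The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def _type_of(element):
    """Return the type link of a node or edge, or None if it has none."""
    type_el = element.find("type")
    return None if type_el is None else type_el.get(XLINK_HREF)


def graph_signature(gxl_text):
    """Reduce a GXL document to order-independent sets of nodes and edges."""
    root = ET.fromstring(gxl_text)
    nodes = {(n.get("id"), _type_of(n)) for n in root.iter("node")}
    edges = {(e.get("from"), e.get("to"), _type_of(e)) for e in root.iter("edge")}
    return nodes, edges


def same_graph(actual_text, expected_text):
    """True if both documents describe the same typed nodes and edges."""
    return graph_signature(actual_text) == graph_signature(expected_text)
```

A shared suite would then pair each input file with an expected GXL document and call same_graph on the output of each group's converter.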

A participant asked whether there were any missing or redundant elements in GXL.  Jeff replied that he did not find any missing or redundant elements, but that people using GXL for other purposes may find missing elements.  Another participant asked what character sets were supported by GXL, since programming languages employ different character sets.  Howard replied that the GXL character sets default to the XML character set, which is Unicode.
 

4. Language-level Reference Schema for C++

4.1 Sean Perry, IBM

Visual Age C++ is a repository-based incremental compiler and integrated development environment.  This repository, called the CodeStore, allows different tools to access information about the source code.  Furthermore, the CodeStore has an API to allow users to write their own tools to plug-in to the IDE and compiler.  A brief overview of the CodeStore object model and API is presented in this talk.

Sean Perry is a developer at the IBM Toronto Lab with the Visual Age C++ back end team.

4.2 Ric Holt, University of Waterloo

The Datrix(TM) parsers were developed by a group at Bell Canada.  They analyse source code and output an abstract syntax graph (ASG), that is, an AST with some scope and reference resolution.  There are versions for ANSI-K&R C, C++ and Java.  Ahmed Hassan of the Software Architecture Group at University of Waterloo extracted the schema for the ASG by analysing the documentation and the parser.  This schema is the subject of this presentation.

Ric Holt is a Professor at University of Waterloo.  He leads the Software Architecture Group and works with the PBS (Portable Bookshelf) tool suite.

4.3 Discussion

Susan Sim started the discussion by handing out entity-relationship diagrams of language-level models for C and C++ from the Institut für Softwaretechnik (IST) at the University of Koblenz and the Datrix(TM) Group at Bell Canada.  These diagrams differed significantly from each other in approach, complexity, and visual presentation.  Where the Datrix(TM) models were of abstract syntax graphs, the IST models were of the conceptual elements in the respective languages.  The presentations and the diagrams served as starting points for a discussion on how to proceed to develop a language-level reference model for C/C++.

A participant suggested that the discussion begin by identifying the requirements for the reference model, i.e. what do we want to use the data for.  The following incomplete list of applications was generated.

The discussion moved on to the tools that would produce or use the data that conformed to the schema.  Howard Johnson asked whether the complexity should be located in the parser or in the loader (the tool that reads the data).  Complexity here means analyzing and making inferences about the raw data.  For example, scope and reference resolution of the nodes in the abstract syntax tree could be performed during extraction or at some later time by another tool.  It appears that some complexity, and thereby intelligence, needs to be located in both levels.  General complexity, i.e. problems that need to be considered by everyone, could be put into the parser, while some application-specific complexity could be put into the loader.
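To illustrate what deferring reference resolution to the loader might look like, the following sketch resolves names against enclosing scopes after extraction.  The data layout (a table of scopes with parent links, and a list of unresolved uses) and the function name are hypothetical; neither the Datrix(TM) nor the CodeStore model is being described here.

```python
def resolve_references(scopes, references):
    """Resolve each (name, scope) use to the id of the nearest declaration.

    scopes:     scope id -> {"parent": scope id or None,
                             "decls": {name: node id}}
    references: (name, scope id) pairs left unresolved by the parser
    """
    resolved = {}
    for name, scope_id in references:
        current = scope_id
        # Walk outwards from the use until a declaration is found.
        while current is not None:
            decls = scopes[current]["decls"]
            if name in decls:
                resolved[(name, scope_id)] = decls[name]
                break
            current = scopes[current]["parent"]
        else:
            # No enclosing scope declares the name.
            resolved[(name, scope_id)] = None
    return resolved


if __name__ == "__main__":
    scopes = {
        "file": {"parent": None, "decls": {"x": "n1", "f": "n2"}},
        "f": {"parent": "file", "decls": {"x": "n3"}},
    }
    uses = [("x", "f"), ("f", "f"), ("y", "f")]
    print(resolve_references(scopes, uses))
```

If this step lives in the parser, every consumer receives resolved edges; if it lives in the loader, the exchanged data must carry enough scope information for each tool to repeat it.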

Some other requirements for the reference model were also identified:

The discussion concluded with the identification of a number of open questions:

5. Conclusion

The developers of GXL found the workshop very helpful.  They were able to see how GXL is currently being used and the difficulties faced by implementers.  It was also encouraging that their planned next step (the creation of reference schemas) fit with the needs of users.  The discussion about how to develop reference schemas provided them with good feedback.  These lessons learnt will be applied in the future development of GXL.  It is hoped that this workshop encouraged the participants to become GXL users and promoters.

Acknowledgments

Thanks to Sonia Vohra who took notes during the discussions.

References

GXL home page.  http://www.gupro.de/GXL

WoSEF home page.  http://www.ics.uci.edu/~ses/wosef

Visual Age C++ home page.  http://www-4.ibm.com/software/ad/vacpp/

Datrix(TM) home page. http://www.iro.montreal.ca/labs/gelo/datrix