Some issues with the design of a process discovery framework follow.

We have asserted that there are three main dimensions of data in a community web: web structure (physical-directory and logical), content, and usage/update patterns. Usage and update patterns are really two separate dimensions, with update patterns serving as the time dimension, which gives us four dimensions in all.

We start our discovery process by crawling the web site and retrieving the documents on the community web. Some communities use CVS as a content management system, and there are other types of content management systems we might need to be aware of. If CVS is used, we might be able to simply download the entire CVS tree. Binaries need not be downloaded unless they can be indexed; otherwise, the filename will suffice. This brings up the issues of the dark/invisible Web (protected documents) and of disconnected components of the community Web graph (documents that no page links to), though we may not be able to detect these via crawlers anyway. The risk is that documents in these areas may be critical to the process and go undetected. What do we do next? With the dimensions of data given above, we want to gather a set of data from each document, and we need to determine how this data will be represented.
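As a starting point for the representation question, here is a minimal sketch of one possible per-document record covering the four dimensions. The class name DocumentRecord and every field name are assumptions made for illustration; nothing here is a settled schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DocumentRecord:
    """Hypothetical per-document record; all field names are illustrative."""
    url: str                                   # physical structure
    depth: int = 0                             # depth in the physical hierarchy
    logical_url: Optional[str] = None          # position on the site map, if known
    inlinks: List[str] = field(default_factory=list)
    outlinks: List[str] = field(default_factory=list)
    keyword_snippets: List[str] = field(default_factory=list)  # content
    hit_count: Optional[int] = None            # usage; None when unavailable
    last_modified: List[str] = field(default_factory=list)     # update patterns
    indexable: bool = True                     # False for images and other binaries


# A binary document carries structure, usage, and update data only.
logo = DocumentRecord(url="http://example.org/img/logo.png", depth=2, indexable=False)
```

Fields that cannot be collected for a given community (usage data, for instance) simply stay at their defaults.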

Structure:

-Physical Structure

We can derive the physical structure easily; here we want to store the URL of the document. Do we need to do some additional analysis of the URL beyond searching the string for process keywords? Does the depth of the page have significance? Intuitively, I believe it does. Perhaps we should keep a separate field for depth in the physical Web hierarchy.
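A minimal sketch of the URL analysis discussed above: it computes a depth field and searches the URL path for process keywords. The keyword list is a placeholder (the real process vocabulary is not fixed in these notes), and counting path segments is only one possible definition of depth.

```python
from urllib.parse import urlparse

# Placeholder keyword list; an assumption for illustration only.
PROCESS_KEYWORDS = ("release", "build", "bug", "patch")


def url_depth(url):
    """Count non-empty path segments, so the site root has depth 0."""
    path = urlparse(url).path
    return len([segment for segment in path.split("/") if segment])


def url_keywords(url, keywords=PROCESS_KEYWORDS):
    """Return the keywords that occur anywhere in the lower-cased URL path."""
    path = urlparse(url).path.lower()
    return [kw for kw in keywords if kw in path]


def physical_record(url):
    """Bundle the physical-structure fields for one document."""
    return {"url": url, "depth": url_depth(url), "keywords": url_keywords(url)}


record = physical_record("http://example.org/docs/release/notes.html")
```

Note that under this definition the file name itself counts as one level, so a page directly under the root has depth 1.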

-Logical Structure

This is somewhat more difficult to detect. There are several ways we can attack this problem. One way is to find the site map for the community. We can match this against the physical document structure data and store the "logical URL" of the page, as well as its logical depth. Is there some significance to the location of the document on the site map (e.g. top of page, bottom of page)? Intuition says this could be a small factor. Another factor is the link structure between pages. How do we represent this? Perhaps we need fields for the inlinks and outlinks of each document. Is this enough? Do we want to do any analysis correlating documents at degrees of separation > 1? We really should.
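One way the link structure might be represented is sketched below: inlink and outlink sets per document, plus a breadth-first search that reports degrees of separation beyond 1. The function names and the (source, target) edge format are assumptions about what the crawler would emit.

```python
from collections import defaultdict, deque


def build_link_index(edges):
    """edges: iterable of (source_url, target_url) pairs found by the crawler."""
    inlinks = defaultdict(set)
    outlinks = defaultdict(set)
    for src, dst in edges:
        outlinks[src].add(dst)
        inlinks[dst].add(src)
    return inlinks, outlinks


def degrees_of_separation(outlinks, start, max_degree=3):
    """Breadth-first search along outlinks; returns {url: degree from start}."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if dist[page] == max_degree:
            continue  # do not expand past the requested horizon
        for nxt in outlinks.get(page, ()):
            if nxt not in dist:
                dist[nxt] = dist[page] + 1
                queue.append(nxt)
    return dist


edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "c")]
inlinks, outlinks = build_link_index(edges)
separation = degrees_of_separation(outlinks, "a")
```

Storing only the direct inlinks and outlinks keeps the per-document record small; higher degrees can be recomputed from them when needed.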

Content:

Our main activity here is indexing the text of the page against the classification of tools, roles/agents, activities, and resources. Since we cannot index the content of images and other binary types, data from these documents will be limited to structure, usage, and update data. For each occurrence of a keyword, we should grab a couple of lines before and after it and store this context in the database, keyed by the document's URL.
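The context capture described above could look roughly like this. It is a sketch only: matching is a naive case-insensitive substring test, and the window size and the tuple layout are assumptions.

```python
def keyword_contexts(text, keywords, window=2):
    """Return (keyword, line_number, snippet) for each line containing a keyword.

    The snippet holds up to `window` lines before and after the matching line.
    """
    lines = text.splitlines()
    hits = []
    for i, line in enumerate(lines):
        lowered = line.lower()
        for kw in keywords:
            if kw.lower() in lowered:
                start = max(0, i - window)
                end = min(len(lines), i + window + 1)
                hits.append((kw, i + 1, "\n".join(lines[start:end])))
    return hits


page_text = "one\ntwo\nrun the build\nfour\nfive\nsix"
contexts = keyword_contexts(page_text, ["build"], window=1)
```

Each hit would then be stored under the URL of the page it came from.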

Usage Info:

Usage info will be difficult to obtain, as it cannot readily be gathered without access behind the firewall, so to speak. It usually requires Web access logs, cooperation from users who log in via a proxy (as in WebQuilt), or at the very least a hit counter. Much of this depends on what type of usage modeling we are looking for. My thought here is that Web tours would be the most useful, but hit counts are quite useful as well. Tours would give us an idea of the workflow interactions of individual users, and as such a sequence of activities that can be confirmed or refuted by the other data we collect. Hit counts tell us the relative use of each page vs. the others. The theory here is that pages with higher hit counts are likely more important to development than the others. The challenge is that pages higher up in the physical or logical hierarchy will likely have higher interaction rates regardless. A good web designer would also keep the degree of separation between the pages critical to a task small. These are additional factors we must consider. Perhaps some normalization can be done for pages higher up in the hierarchy. The counterargument is that those pages are at that position for a reason, so maybe we have to take the counts at face value. I don't know yet.
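To make the two usage measures concrete, here is a rough sketch. The first function counts page-to-page transitions across tours; the second shows one possible normalization, dividing each page's hits by the mean hit count at the same depth. Whether to normalize at all is still open, so this is illustrative only, and the function names and input shapes are assumptions.

```python
from collections import Counter, defaultdict


def transition_counts(tours):
    """tours: list of page sequences, one per user session."""
    counts = Counter()
    for tour in tours:
        counts.update(zip(tour, tour[1:]))  # consecutive page pairs
    return counts


def normalize_hits_by_depth(pages):
    """pages: {url: (depth, hits)} -> {url: hits / mean hits at that depth}."""
    by_depth = defaultdict(list)
    for depth, hits in pages.values():
        by_depth[depth].append(hits)
    means = {d: sum(h) / len(h) for d, h in by_depth.items()}
    return {
        url: (hits / means[depth] if means[depth] else 0.0)
        for url, (depth, hits) in pages.items()
    }


tours = [["/", "/bugs", "/patch"], ["/", "/bugs"]]
steps = transition_counts(tours)
scores = normalize_hits_by_depth({"/": (0, 1000), "/a": (1, 100), "/b": (1, 300)})
```

With this normalization the root page no longer dominates simply because of its position, at the cost of discarding whatever its position does tell us.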

Update patterns:

Update patterns are likewise difficult to detect. While Web content management systems can facilitate collection of this data, many communities lack such systems. To detect updates without such data, we would need to crawl the site periodically and record the Last-Modified header of each HTTP response or the file modification date. The question remains how to represent this data and whether there are other data we need to be looking at. In a simple sense, pages that are updated frequently are likely to be more critical to development. The main page for the site/module is also likely to have a high update frequency, as it is the page developers and users visit to find news.
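A minimal sketch of turning periodic crawl observations into an update frequency, assuming we have kept the Last-Modified string seen for a page on each crawl. The function names are hypothetical, and updates per day is just one candidate representation.

```python
from email.utils import parsedate_to_datetime


def distinct_updates(last_modified_values):
    """Parse the Last-Modified strings seen across crawls and drop repeats."""
    stamps = {parsedate_to_datetime(v) for v in last_modified_values if v}
    return sorted(stamps)


def updates_per_day(last_modified_values):
    """Observed changes divided by the observed span in days (0.0 if unknown)."""
    stamps = distinct_updates(last_modified_values)
    if len(stamps) < 2:
        return 0.0
    span_days = (stamps[-1] - stamps[0]).total_seconds() / 86400
    return (len(stamps) - 1) / span_days


observed = [
    "Mon, 01 Mar 2004 00:00:00 GMT",
    "Mon, 01 Mar 2004 00:00:00 GMT",  # unchanged between two crawls
    "Wed, 03 Mar 2004 00:00:00 GMT",
    "Fri, 05 Mar 2004 00:00:00 GMT",
]
rate = updates_per_day(observed)
```

Each crawl sees only the most recent modification, so any page that changes more often than we crawl will be undercounted; the crawl interval bounds the resolution of this measure.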