Re: More versioning thoughts

Fabio Vitali (vitali@cis.njit.edu)
Fri, 7 Jun 1996 03:53:06 -0500


Hi.

2:30 in the morning, I guess it's time for a little session of confessions.

Andre van der Hoek writes:
>   * It looks like different users are trying to push different capabilities.
>     Although expected, I think we should agree on a common `goal' for
>     versioning in HTTP. Are we trying to simply put versioning capabilities
>     in HTTP, are we trying to put *a* versioning capability in HTTP, are
>     we trying to seamlessly integrate versioning and non-version-aware
>     clients/servers, etc.? It looks like the group needs an objective very
>     much; this in itself could be an interesting exercise, because I think
>     we will be able to identify multiple layers of functionality that can
>     be built on top of each other.

Today's topic is "what would I like versioning in the WWW for?"

Please note how I avoided the phrase "versioning in HTTP".

Ok, everything starts for me with Ted Nelson and his Xanadu project. The
whole of the world's literature on line, with users adding to it,
commenting, linking, quoting and including everybody else's stuff in their
own documents. A vision of the scale of the WWW, but active, collaborative,
participatory.

Two issues are important. First, existing and legacy data shouldn't require
particular adaptation in order to be put on line. Second, every user is
allowed to add links and comments (maybe just private ones, that only she
and her friends can access) to any document she can read. Or, even simpler,
every user is allowed to create a live quotation from any existing document
into one of her own. By a live quotation, I mean a piece of text that is
taken from another document, that allows the reader to still access the
original home of the piece (the document in which it originated), and that
may automatically update to the evolving content of that home document.

Because of the first issue, some documents may become quite large. This,
and any implementation of live quotations, means that we want references
(in links and in quotations) to point to specific parts of documents
(i.e. point-to-point links).

The second issue implies that it is impossible for these references to be
stored within the host document as named objects similar to <A NAME>
anchors: on a very large scale, a successful document could have thousands
of link end-points. Possible solutions are to request the author of the
document to name all the interesting end-points in her document (which is
impractical, imperfect, incomplete, and basically leaves everybody at her
mercy), or to store the links in an external link base to which every user
may have write access. For this to work, a general addressing mechanism
must be provided, one that does not rely on explicit names, but depends on
the structure of the document itself (for instance, in streams such as
text, the character offset).
External link bases are a really good idea, because they allow private
links to be created, to be shared with others, and even to be published
independently of the data they refer to.
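
To make this concrete, here is a little Python sketch of what a link-base
entry could look like. The record layout and the URLs are my own invention
for this message, not a standard:

    from dataclasses import dataclass

    @dataclass
    class EndPoint:
        document: str   # URL of the document holding the end-point
        offset: int     # character offset into the document's text stream
        length: int     # extent of the end-point, in characters

    @dataclass
    class ExternalLink:
        source: EndPoint
        target: EndPoint
        owner: str      # links are private to their creator, shareable at will

    # A private link base is then just a collection of such records,
    # publishable independently of the documents it refers to.
    link_base = [
        ExternalLink(
            source=EndPoint("http://myhost/mypage.html", 120, 15),
            target=EndPoint("http://www.yahoo.com/", 34, 40),
            owner="fabio"),
    ]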

Unfortunately, for any such addressing mechanism there are editing
operations that would make the references invalid. For instance, if you
store a reference to offset 34, and the author adds 20 more characters at
the beginning of the text, the offset is no longer valid.
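
In code, the failure looks like this (the strings are made up, but the
arithmetic is the one above):

    doc = " " * 34 + "the quoted passage" + "."
    offset = 34                          # the address stored in the link
    assert doc[offset:offset + 18] == "the quoted passage"

    doc2 = "-" * 20 + doc                # 20 characters added at the beginning
    print(repr(doc2[offset:offset + 18]))       # blanks: the address is stale
    print(repr(doc2[offset + 20:offset + 38]))  # 'the quoted passage' again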

Updating the offset therefore becomes an important issue. One solution has
been to require that all existing references to a document are kept outside
of the document, but local to it (the Dexter Hypertext Reference Model
works this way, for instance, as do many real hypertext systems). Thus,
whenever the changes to a document are committed, it is easy to update all
the references to the new, correct positions. Unfortunately, this solution
doesn't scale: presuming that Yahoo keeps a copy of every link that anyone
in the world has created to one of its pages, and updates the address
values regularly, is absolutely laughable.

Another proposed solution is the use of heuristics to determine the current
address of the link: enough context information about the whereabouts of
the link end-point is kept, so that when the link is discovered to be
outdated, the correct position can be retrieved with some approximation.
This method is obviously unreliable and imprecise. The Microcosm hypertext
system works this way, I believe.
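
A sketch of such a heuristic, using plain fuzzy matching (this is my guess
at the general shape of the idea, not Microcosm's actual algorithm):

    import difflib

    def relocate(new_text, context):
        """Guess where the end-point's stored context now lives."""
        sm = difflib.SequenceMatcher(a=new_text, b=context, autojunk=False)
        m = sm.find_longest_match(0, len(new_text), 0, len(context))
        if m.size < len(context) // 2:
            return None                  # context too degraded: give up
        return m.a - m.b                 # where the context now starts

    new = "A freshly edited document where the quoted passage now lives."
    print(relocate(new, "the quoted passage"))   # the passage's new offset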

A different solution is given by versioning. By requiring that every
document we want to link to is versioned, and by using a versioning
mechanism that can keep track of individual changes from version to
version, we can easily deduce the current position (if it still exists) of
a stored address, even if there has been no contact or reciprocal awareness
between the stored link and the document it refers to.
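
A minimal sketch of the deduction, assuming the versioning mechanism
records each change as a (position, deleted, inserted) triple (this data
model is assumed for illustration; it is not necessarily VTML's):

    def update_offset(offset, edits):
        """Replay an old offset through the edit log of a document.

        Each edit is (pos, n_deleted, n_inserted), applied in order.
        Returns None if the addressed text was itself deleted.
        """
        for pos, n_del, n_ins in edits:
            if pos + n_del <= offset:
                offset += n_ins - n_del  # change falls before the target
            elif pos <= offset:
                return None              # the target itself was deleted
            # changes after the target leave the offset untouched
        return offset

    # The earlier example: a link to offset 34, then 20 characters
    # inserted at the beginning of the document.
    print(update_offset(34, [(0, 0, 20)]))   # -> 54
    print(update_offset(34, [(30, 10, 0)]))  # -> None, target deleted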

External links are thus possible and reliable on versioned documents. This
is the main reason VTML exists, the reason it is a fine-grained
change-tracking mechanism, and the reason it is less concerned with other
classical versioning issues such as configuration management.

Another situation VTML has been designed for is supporting asynchronous
collaboration in the writing of shared documents. VTML should allow
independent writing on a single document to branch naturally into different
versions, and should help authors easily distinguish individual
contributions, compare them, and merge them into converging final versions.
Unlike David Durand, I don't believe in automatic merging, and prefer that
authors individually select the changes to be incorporated in the merged
document.
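
For instance, a merge tool along these lines could simply compute each
branch's changes against the common base and present them for the author
to accept or reject (plain difflib here, not VTML itself):

    import difflib

    def changes(base, branch):
        """List one branch's changes against the common base version."""
        sm = difflib.SequenceMatcher(a=base, b=branch)
        return [(op, base[i1:i2], branch[j1:j2])
                for op, i1, i2, j1, j2 in sm.get_opcodes() if op != "equal"]

    base  = ["intro", "body", "ending"]
    alice = ["intro", "body, expanded", "ending"]
    bob   = ["intro", "body", "a new conclusion"]

    for author, branch in (("alice", alice), ("bob", bob)):
        for op, old, new in changes(base, branch):
            # A real tool would let the author accept or reject each change.
            print(f"{author}: {op} {old!r} -> {new!r}")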

This kind of collaboration implies that showing an older version of a
document, or someone else's version, is a fairly common operation. For this
reason, I believe that this should be a client-based operation: the server
should always provide the complete set of available versions of a document
to a version-aware client, and should let the client deduce and display the
specific version that the user selected.

This is why VTML is capable of storing all the individual changes made in
all the existing versions of a document in a single data stream, so that
the client receives, in the most compact form, the complete set of changes
that have been performed on the document. External deltas of specific
versions should be used for efficient check-ins, for efficient updates of
client-stored versioned documents ("who has checked in new versions of this
document since the last time I asked?"), and for efficient storage and
concurrency control.
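
A toy rendition of the single-stream idea (the representation below is
invented for this message; it is not actual VTML syntax): every fragment
carries the set of versions it belongs to, all versions travel together,
and the client extracts whichever one the user selects.

    stream = [
        ("Hello ",     {1, 2, 3}),   # fragment present in versions 1 to 3
        ("cruel ",     {1}),         # deleted in version 2
        ("wonderful ", {2, 3}),      # inserted in version 2
        ("world",      {1, 2, 3}),
        ("!",          {3}),         # inserted in version 3
    ]

    def extract(stream, version):
        """Client-side reconstruction of one version from the full stream."""
        return "".join(text for text, versions in stream if version in versions)

    print(extract(stream, 1))   # Hello cruel world
    print(extract(stream, 3))   # Hello wonderful world!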

Finally, I believe that all this should be COMPLETELY hidden from the
unaware user, who could go about her chores absolutely oblivious that the
editor is version-aware, and ignore versioning issues until she suddenly
needs one of the features, such as recovery of erroneously deleted data,
comparison and merging of independently modified versions, or rollback to a
previous, safe state. This explains why version creation and management in
VTML can be completely automatic and hidden: the user is not even required
to decide about version names and freezing parameters unless she needs to.

All this also explains why VTML doesn't actually need any change to the
HTTP protocol, doesn't require locking mechanisms on the server (which are
STATEFUL!), and basically exploits the SGML notion of putting as much
information as possible in the data rather than in the code.

Can somebody find a use for this nonsense?

Fabio Vitali