CIPScene - April 1998

 

On the Importance of Content

(or why IT needs to know about SGML/XML/HTML)

A version of this paper was published in CIPScene (Canadian Information Processing Society Newsletter, April 1998).

In 1993 A.D. (After Downsizing), after some 30 years working in virtually all arenas of Information Processing from Programming to Data Administration, I suddenly found myself starting a “new career” as a “document engineer”. This article describes what I learned during my first assignment that is relevant to IT professionals.

A key reason for becoming a “document engineer” was the suggestion that IT groups have traditionally failed to concern themselves with the vast majority of information in any enterprise. A Deloitte & Touche study found that anywhere from 70-90% of all information in a typical enterprise is found in what we loosely call documents (in paper or electronic form). Only the remainder (10-30%) can be found in the structured databases that IT groups typically create and manage.

One reason for this is easily understood by examining these documents. For example, using “reveal codes” in WordPerfect may show underlying markup such as:

[Centre] MEMORANDUM [hrt] [hrt]

To: [tab] [tab] [bold]John Smith[bold][hrt][hrt]

From: [tab] [bold]Jane Doe[bold][hrt][hrt]

Subject: [tab][bold][und]SGML may be a solution to our problem[und][bold]

[hrt][hrt]Take a look at this article on SGML technology. …

These embedded codes are instructions on how to render the text on the screen and paper (a key objective for WYSIWYG systems). They say nothing about the document content. It may appear at first glance that such documents could not be subject to programmatic processing the way that an SQL database is. However, consider the following alternative markup.

<memo>

<to>John Smith</to>

<from>Jane Doe</from>

<subject>SGML may be a solution to our problem.</subject>

<body>

<para>Take a look at this article on SGML technology.    …</para>

</body>

</memo>

Here the codes are content-specific (descriptive markup) and have nothing to say about the presentation of the text (procedural markup) in any medium. Astute readers might immediately see that questions like “find all memos from Jane Doe” or “list all memos which deal with the subject of SGML” could be answered programmatically. Answering the same questions from the previous text would require very intelligent processing of the content. In the second example, where data is tagged specifically to identify content, we can also programmatically define how a “to” or a “para” is presented in different media.
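To make the two questions above concrete, here is a minimal sketch in Python. It assumes the memos happen to be well-formed XML as well as SGML, so that the standard library parser can read them. The second memo and the function names memos_from and memos_about are invented for illustration.

```python
import xml.etree.ElementTree as ET

# The first memo is the one shown above; the second is hypothetical.
MEMOS = [
    """<memo>
    <to>John Smith</to>
    <from>Jane Doe</from>
    <subject>SGML may be a solution to our problem.</subject>
    <body><para>Take a look at this article on SGML technology. …</para></body>
    </memo>""",
    """<memo>
    <to>Jane Doe</to>
    <from>John Smith</from>
    <subject>Budget review</subject>
    <body><para>Please send me the latest figures.</para></body>
    </memo>""",
]


def memos_from(memos, author):
    """Return the parsed memos whose <from> element names the given author."""
    parsed = (ET.fromstring(text) for text in memos)
    return [m for m in parsed if m.findtext("from", "").strip() == author]


def memos_about(memos, keyword):
    """Return the subject lines that mention the keyword (case-insensitive)."""
    parsed = (ET.fromstring(text) for text in memos)
    subjects = (m.findtext("subject", "").strip() for m in parsed)
    return [s for s in subjects if keyword.lower() in s.lower()]


print(len(memos_from(MEMOS, "Jane Doe")))
print(memos_about(MEMOS, "sgml"))
```

Neither function needs to guess where the sender or subject is; the tags say so. The same queries against the WordPerfect codes would have to infer that from layout.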

This particular descriptive markup is an example of SGML tagging. SGML stands for Standard Generalized Markup Language (ISO 8879:1986). The standard allows you to design a markup language in support of a specific application domain. For the memo above, I might be dealing with the following markup language (presented graphically as an information model):


This model implies that a memo consists of one or more “to” elements, followed by a “from”, followed by a “subject”, followed by zero or more keywords, followed by the “body” of the memo. The body consists of one or more “para” elements. This model, expressed in formal SGML syntax, is a Document Type Definition or DTD. The DTD defines both the markup language and the allowed structure for a document tagged according to that DTD. It is possible to determine programmatically whether a tagged document conforms to its DTD.
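The conformance check can be illustrated with a short Python sketch. This is not a DTD parser or a real SGML validator. It simply restates the memo model described above as regular expressions over the sequence of child element names. The element name “keyword” is an assumption, since the text only speaks of keywords, and elements with no model of their own are assumed to hold text only.

```python
import re
import xml.etree.ElementTree as ET

# Content models from the description above. Each child element name is
# followed by one space so that a regular expression can match the sequence.
# "keyword" is an assumed element name.
CONTENT_MODELS = {
    "memo": r"(to )+from subject (keyword )*body ",
    "body": r"(para )+",
}


def conforms(element):
    """Check an element, and recursively its children, against the models."""
    children = "".join(child.tag + " " for child in element)
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        # No model listed: assume a text-only element with no children.
        return children == ""
    if re.fullmatch(model, children) is None:
        return False
    return all(conforms(child) for child in element)


good = ET.fromstring(
    "<memo><to>A</to><to>B</to><from>C</from><subject>S</subject>"
    "<keyword>k</keyword><body><para>p</para></body></memo>"
)
bad = ET.fromstring(
    "<memo><from>C</from><to>A</to><subject>S</subject>"
    "<body><para>p</para></body></memo>"
)
print(conforms(good), conforms(bad))
```

The second memo is rejected only because “from” appears before “to”; the rule lives in the model, not in the program logic.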

My first assignment dealt with the application domain of the maintenance life cycle of military equipment from purchase to mothballing. The specific problem I dealt with was the identification and definition of the set of data required to support the engineering and maintenance of the equipment over its entire life cycle. In other words, this was a standard database “data definition” problem from an IT point of view.

I would be using the SGML standard to develop a set of information “tags” (equivalent to fields in a database) which would then be used to tag the information base. For example, consider the highest level of such a model. The basic organization is a recursive assembly structure (each assembly optionally consists of one or more other assemblies). For each assembly at any level, you have basic information and, optionally, descriptive, operational, or servicing information.


The model fragment pictured above is part of the Canadian DND DTD for equipment engineering and maintenance. It is a standard for the exchange of information between equipment suppliers and the Canadian Military.
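For readers who think in data structures, the recursive assembly organization described above might be sketched in Python as follows. The class, its field names, and the sample equipment are all invented for illustration; they are not taken from the DND DTD.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Assembly:
    """One node of the recursive assembly structure (names are assumptions)."""
    basic: str                          # basic information, always present
    descriptive: Optional[str] = None   # the remaining three are optional
    operational: Optional[str] = None
    servicing: Optional[str] = None
    subassemblies: List["Assembly"] = field(default_factory=list)


def walk(assembly, depth=0):
    """Yield (depth, assembly) for this assembly and everything nested in it."""
    yield depth, assembly
    for sub in assembly.subassemblies:
        yield from walk(sub, depth + 1)


# Hypothetical equipment, three levels deep.
pump = Assembly(basic="Fuel pump", servicing="Replace filter")
engine = Assembly(basic="Engine", operational="Start-up procedure",
                  subassemblies=[pump])
vehicle = Assembly(basic="Vehicle", descriptive="General description",
                   subassemblies=[engine])

# Every assembly, at any level, that carries servicing information.
serviceable = [a.basic for _, a in walk(vehicle) if a.servicing is not None]
print(serviceable)
```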

It occurred to me that if this was the “data base” then, by using symmetry arguments with the classical IT “posting problem”, there should be the equivalent of “transactions”, “reports” and “output files”.

These do in fact exist. The “transactions” are the information fragments originated by authors during the engineering of the equipment, such as equipment description, equipment operation, preventative and corrective maintenance procedures, troubleshooting information, parts lists, illustrated parts lists, etc. The “reports” are standard technical manuals such as an “Operations Manual” or a “First Line Maintenance Manual”, usually generated by technical publishing departments. The “output files” are selected sets of information destined for specialized functions such as training material (computer-based training) or Intelligent Electronic Technical Manuals (IETM).

The key idea is to originate, mark up and store information based on its content. The information must be stored independently of how it is to be presented, and independently of vendor applications and hardware/software platforms. This is especially critical for military equipment, where the lifetime of equipment can now exceed 100 years. The chance of any current computer platform or application package being operational 100 years from now is, for all practical purposes, zero.
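Storing content independently of presentation can be shown with the memo from earlier. The Python sketch below renders the same tagged source in two ways, as plain text and as HTML. The two layouts are arbitrary choices made for illustration, and the sketch again assumes the memo is well-formed XML.

```python
import xml.etree.ElementTree as ET
from html import escape

MEMO = """<memo>
<to>John Smith</to>
<from>Jane Doe</from>
<subject>SGML may be a solution to our problem.</subject>
<body>
<para>Take a look at this article on SGML technology. …</para>
</body>
</memo>"""


def render_text(memo):
    """Render a parsed memo as plain text, e.g. for a printer."""
    lines = ["MEMORANDUM", ""]
    lines += ["To:      " + to.text for to in memo.findall("to")]
    lines.append("From:    " + memo.findtext("from"))
    lines.append("Subject: " + memo.findtext("subject"))
    lines.append("")
    lines += [para.text for para in memo.findall("body/para")]
    return "\n".join(lines)


def render_html(memo):
    """Render the same parsed memo as an HTML fragment, e.g. for a browser."""
    parts = ["<h1>Memorandum</h1>"]
    parts += ["<p>To: <b>%s</b></p>" % escape(to.text)
              for to in memo.findall("to")]
    parts.append("<p>From: <b>%s</b></p>" % escape(memo.findtext("from")))
    parts.append("<p>Subject: <b><u>%s</u></b></p>"
                 % escape(memo.findtext("subject")))
    parts += ["<p>%s</p>" % escape(para.text)
              for para in memo.findall("body/para")]
    return "\n".join(parts)


memo = ET.fromstring(MEMO)
print(render_text(memo))
print(render_html(memo))
```

The stored memo never changes. Supporting a new medium means writing one more rendering function, not re-keying the information.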

I expect most readers of this newsletter already use SGML technology in some form. For example, if you use the WWW, the documents exchanged are marked up with HyperText Markup Language (HTML). HTML is an application of SGML. The HTML DTD defines the tag set (e.g. H1, H2, P, UL) for the problem domain (i.e. the exchange of documents via the web). The DTD design allows for extreme flexibility in tagging a document, since the objective was to ensure that any HTML browser could produce a reasonable rendering of any document no matter how badly it was tagged. In fact, browsers need not check for compliance with the DTD. For this reason, most people do not even realize there is a DTD behind HTML.

The key reason for the success of HTML is, at the same time, its greatest weakness. It does not allow for the descriptive tagging of content. Hence specific applications which must make use of content are very difficult, if not impossible, to implement.

Here is where XML enters the picture. If a financial application were to be implemented on the web, it would need some way of knowing that the web transaction is indeed a financial transaction, perhaps of a particular class, as well as the specific content of the transaction. This would be true of any application in any domain.

XML is a subset of the SGML standard designed for easier implementation. It still allows a designer to construct a tag set for an application domain. For example, a tag set could be developed for a domain such as banking transactions. If the financial community can come to some agreement on this particular tag set, it can be used as a standard for originating, exchanging and processing financial transactions in much the same way that the assembly DTD described above can be used for engineering and maintenance information. Although a DTD might be defined for such an application, the XML standard only insists on documents being “well formed” rather than strictly compliant with a DTD. This gives the “flexibility” of HTML, while still achieving content definition. Work is currently underway on many such tag sets to support application domains in e-commerce and software upgrades.
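The “well formed” requirement is easy to test mechanically. The Python sketch below uses the standard library XML parser, which checks well-formedness only and does not validate against a DTD. The transaction element and attribute names are invented for illustration and do not come from any agreed financial tag set.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Hypothetical banking transaction; every name here is an assumption.
GOOD = (
    "<transaction type='transfer'>"
    "<amount currency='CAD'>100.00</amount>"
    "</transaction>"
)
# The <amount> element is never closed, so this is not well formed.
BAD = "<transaction><amount>100.00</transaction>"

print(is_well_formed(GOOD), is_well_formed(BAD))
```

A well-formed document can still break the rules of its DTD. Checking that is a separate, stricter step, which an application may choose to apply or skip.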

Hence I discovered that my “new career” is not new at all, but the leading edge of the next wave of IT development. SGML and XML are enabling technologies that allow IT to start tackling the remaining 70-90% of corporate information in a way that is independent of vendor, processing platform, and application program.