The use of XML as a central strategy for managing scientific documents on the Web is investigated. We look at the various ways of handling XML in the Web context in the framework of a general document repository. Finally, we introduce the TIPS Project, which proposes a new approach to scientific information production and dissemination with XML at the heart of its storage paradigm.
XML [28], a meta-language introduced at the beginning of 1998 for describing structured data on the Web, builds upon experience gained with SGML and HTML during the last decade. XML was developed with the Web in mind and integrates seamlessly with modern programming languages, such as Java, Perl, and Python. XML is based on Unicode [18] (see [10] for an introduction), so that it is well suited for dealing with multi-language documents, especially those containing many non-ASCII characters. Thus, XML [28] is an ideal storage format for a central repository containing data in various source formats, languages, and markup schemes.
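The Unicode point can be illustrated with a minimal sketch using Python's standard `xml.etree.ElementTree` module. The `titles` record, its element names, and the `lang` attribute are hypothetical and serve only as an illustration; they are not part of any vocabulary discussed here.

```python
import xml.etree.ElementTree as ET

# Hypothetical multi-language record; element and attribute names are
# illustrative assumptions, not taken from a real DTD.
SOURCE = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<titles>\n'
    '  <title lang="fr">Théorie des champs</title>\n'
    '  <title lang="ru">Теория поля</title>\n'
    '  <title lang="el">Θεωρία πεδίου</title>\n'
    '</titles>\n'
)


def titles_by_language(xml_bytes):
    """Parse UTF-8 encoded XML and map each lang attribute to its title text."""
    root = ET.fromstring(xml_bytes)
    return {t.get("lang"): t.text for t in root.findall("title")}


# The parser decodes the bytes according to the encoding declaration,
# so Latin, Cyrillic, and Greek text all arrive as ordinary strings.
titles = titles_by_language(SOURCE.encode("utf-8"))
for lang, text in titles.items():
    print(lang, len(text))
```

No script-specific handling is needed: the same few lines serve every language in the record.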
XML is widely supported by all major players in the Internet world, Open Source initiatives, as well as commercial vendors. Several free and commercial tools are available for all conceivable operating systems and purposes. In the near future, most Web browsers and visual editors will support XML natively.
Different ways of using an XML document in a repository are shown in Figure 1. At the top right we represent the XML document with its defining vocabulary (DTD or XML Schema [23], [24], and [25]). This document, which is encoded in Unicode, can be viewed, searched, indexed, edited, and validated without problems by any of a series of XML-aware applications all over the world. The XML document can be typeset using TeX (three methods, labelled A, B, and C, are discussed in [5]) or its 16-bit Unicode-aware variant Omega [12]. It can also be transformed into HTML for viewing with present-day browsers (X2H via XSL). In the (near) future, once browsers are able to handle XML directly, we can probably skip the HTML intermediate format and let CSS [20] (possibly via XSLT) style the XML file directly for display on the Web. Figure 1 also contains arrows going from left (TeX) to right (XML/HTML, browsers). They indicate programs that transform existing LaTeX source documents into XML (using one or more standard DTDs) to store the information for archiving purposes. The vertical ellipse in the centre represents other editing tools, such as Adobe's FrameMaker [1], Microsoft's Word [14], and Corel's WordPerfect [7], that allow or are expected to allow import/export of XML documents. Thus, XML genuinely becomes the central element in a global strategy for managing electronic documents by allowing information to be stored, saved, shared, and used by different applications on all computer platforms.
Figure 1. XML as the central part of a document strategy for the Web
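The XML-to-HTML path (X2H) is described above as going via XSL. As a rough stand-in that needs no XSL processor, the sketch below does the same kind of tag-for-tag renaming with Python's standard `xml.etree.ElementTree`. The mapping in `TAG_MAP` and the shortened paragraph text are assumptions made for illustration; a real style sheet for TEI would cover far more elements.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from a few TEI-like element names to HTML element names.
TAG_MAP = {"div1": "div", "head": "h1", "p": "p"}


def to_html(node):
    """Recursively copy an element tree, renaming tags according to TAG_MAP."""
    out = ET.Element(TAG_MAP.get(node.tag, "span"))
    if "id" in node.attrib:
        # Keep the identifier so that cross-references can still be resolved.
        out.set("id", node.get("id"))
    out.text = node.text
    out.tail = node.tail
    for child in node:
        out.append(to_html(child))
    return out


SOURCE = (
    '<div1 id="vavref"><head>Vavilov theory</head>'
    '<p>Vavilov derived a more accurate straggling distribution.</p></div1>'
)

html = ET.tostring(to_html(ET.fromstring(SOURCE)), encoding="unicode", method="html")
print(html)
```

Elements without an entry in `TAG_MAP` fall back to `span`, so unknown markup is carried along rather than silently dropped.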
It is not sufficient for XML documents to be available on the Web; it must also be possible to output them in a typographically optimal way. LaTeX [8] has been one of the pillars of scientific typesetting for many years. Therefore, it is important that tools be available to transform an electronic document between XML and LaTeX [9]. PassiveTeX [16] and xmltex [4] are two recent developments that allow XML sources to be typeset with TeX, either directly or with the help of XSLT [29] style sheets. As an added bonus, PassiveTeX supports mathematics marked up in MathML [21] directly, so that an XSLT style sheet can pass MathML's <math> element and its children through unchanged. As a result, mathematical material, including complex formulae, is typeset with the full quality of TeX. An alternative using DSSSL [13] and Jade [6] is available.
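To give a feeling for what a translation from MathML to TeX notation involves, here is a minimal Python sketch covering a small subset of presentation MathML. It is not how PassiveTeX or xmltex work internally (both are implemented in TeX itself); the function `to_latex` and the table `SYMBOLS` are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Assumed mapping of a few Unicode symbols to LaTeX commands (far from complete).
SYMBOLS = {"π": r"\pi", "∞": r"\infty"}


def to_latex(node):
    """Translate a small subset of presentation MathML into a LaTeX string."""
    kids = [to_latex(child) for child in node]
    text = (node.text or "").strip()
    if node.tag in ("mi", "mn", "mo"):
        return SYMBOLS.get(text, text)
    if node.tag == "mtext":
        return r"\mathrm{%s}" % text
    if node.tag == "mfrac":
        return r"\frac{%s}{%s}" % tuple(kids)
    if node.tag == "msub":
        return "%s_{%s}" % tuple(kids)
    if node.tag == "msup":
        return "%s^{%s}" % tuple(kids)
    # math, mrow, and anything not recognised: join the translated children.
    return " ".join(kids)


EMAX = (
    "<math><msub><mi>E</mi><mrow><mtext>max</mtext></mrow></msub>"
    "<mo>=</mo><mi>∞</mi></math>"
)
FRACTION = (
    "<math><mfrac><mrow><mn>1</mn></mrow>"
    "<mrow><mn>2</mn><mi>π</mi><mi>i</mi></mrow></mfrac></math>"
)

print(to_latex(ET.fromstring(EMAX)))
print(to_latex(ET.fromstring(FRACTION)))
```

Because MathML nests its operands explicitly, a recursive walk over the tree is enough for this subset; elements such as `mfenced` or `msubsup` would need additional cases.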
In the other direction, one can translate LaTeX to HTML (or its XML version XHTML [27]) with LaTeX2HTML [15] or TeX4ht [11]. One can also choose another target language, such as DocBook [19], for computer documentation, or TEI [3], used in the humanities. For both the DocBook and TEI DTDs, XSLT style sheets exist to transform their source form to HTML or XSL Formatting Objects [30], a generic format that can be translated into PDF by PassiveTeX or FOP [2].
All documents in the XML repository should be accompanied by metadata in the form of RDF [22]. Source files (images, scanned texts or manuscripts, data files) into which it is impossible or impractical to introduce XML markup should be accompanied by an external XML file containing similar RDF data. This ensures that all documents in the database present a uniform XML interface, which is important for indexing, search, and data mining.
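An external metadata file of this kind could be produced along the following lines. This is only a sketch: the use of Dublin Core properties and the resource URI under `example.org` are assumptions made for illustration, since the text does not prescribe a particular RDF vocabulary.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core, assumed vocabulary
ET.register_namespace("rdf", RDF)
ET.register_namespace("dc", DC)


def describe(resource_uri, **properties):
    """Build an rdf:RDF element describing one resource with Dublin Core properties."""
    root = ET.Element("{%s}RDF" % RDF)
    desc = ET.SubElement(
        root, "{%s}Description" % RDF, {"{%s}about" % RDF: resource_uri}
    )
    for name, value in properties.items():
        ET.SubElement(desc, "{%s}%s" % (DC, name)).text = value
    return root


# Hypothetical image file that cannot carry XML markup itself.
record = describe(
    "http://example.org/repository/fig1.png",
    title="XML as the central part of a document strategy for the Web",
    format="image/png",
)
rdf_xml = ET.tostring(record, encoding="unicode")
print(rdf_xml)
```

Since the result is itself XML, an indexer can read it with the same tools it uses for the marked-up documents in the repository.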
Part of a source document marked up using the TEI and MathML XML languages follows.
<div1 id="vavref">
<head>Vavilov theory</head>
<p>Vavilov<ptr type="bib" target="bib-VAVI"/> derived a
more accurate straggling distribution by introducing the kinematic
limit on the maximum transferable energy in a single collision, rather
than using
<inlinemath><math><msub><mi>E</mi><mrow><mtext>max</mtext></mrow></msub>
<mo>=</mo><mi>∞</mi></math></inlinemath>.
Now we can write<ptr type="bib" target="bib-SCH1"/>:
<eqnarray ><subeqn><math><mi>f</mi> <mfenced open='(' close=')'>
<mi>ε</mi><mo>,</mo><mi>δ</mi><mi>s</mi></mfenced>
<mo>=</mo> <mfrac><mrow><mn>1</mn></mrow>
<mrow><mi>ξ</mi></mrow>
</mfrac><msub><mi>φ</mi><mrow><mi>v</mi></mrow></msub>
<mfenced open='(' close=')'>
<msub><mi>λ</mi><mrow><mi>v</mi></mrow></msub><mo>,</mo>
<mi>κ</mi><mo>,</mo><msup><mi>β</mi><mrow><mn>2</mn></mrow>
</msup></mfenced></math></subeqn></eqnarray>
where
<eqnarray><subeqn><math><msub><mi>φ</mi><mrow><mi>v</mi></mrow></msub>
<mfenced open='(' close=')'>
<msub><mi>λ</mi><mrow><mi>v</mi></mrow></msub><mo>,</mo>
<mi>κ</mi><mo>,</mo>
<msup><mi>β</mi><mrow><mn>2</mn></mrow></msup></mfenced>
<mo>=</mo>
<mfrac><mrow><mn>1</mn></mrow>
<mrow><mn>2</mn><mi>π</mi><mi>i</mi></mrow>
</mfrac>
<msubsup><mo>∫</mo>
<mrow><mi>c</mi><mo>-</mo><mi>i</mi><mi>∞</mi></mrow>
<mrow><mi>c</mi><mo>+</mo><mi>i</mi><mi>∞</mi></mrow></msubsup>
<mi>φ</mi><mfenced open='(' close=')'><mi>s</mi></mfenced>
<msup><mi>e</mi><mrow><mi>λ</mi><mi>s</mi></mrow></msup>
<mi>d</mi><mi>s</mi><mspace width='2cm'/><mi>c</mi><mo>≥</mo><mn>0</mn>
</math></subeqn>
<subeqn><math><mi>φ</mi><mfenced open='(' close=')'><mi>s</mi></mfenced>
<mo>=</mo><mo>exp</mo><mfenced open='[' close=']'><mi>κ</mi>
<mrow><mo>(</mo><mn>1</mn><mo>+</mo><msup><mi>β</mi>
<mrow><mn>2</mn></mrow></msup><mi>γ</mi><mo>)</mo></mrow>
</mfenced><mo>exp</mo><mfenced open='[' close=']'><mi>ψ</mi>
<mfenced open='(' close=')'><mi>s</mi></mfenced></mfenced>
<mo>,</mo> </math></subeqn>
</eqnarray></p>
</div1>
The result of typesetting this document with xmltex and PassiveTeX is shown in Figure 2. Although MathML is rather verbose, it is not too difficult to recognize the code for the formulae shown. It thus becomes possible to search all parts of a document, including the mathematical formulae and, once SVG (Scalable Vector Graphics, see [26]) is more widely supported, the graphics as well.
Figure 2. The document formatted by LaTeX
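Searching inside formulae, as described above, could look like the following sketch. It finds every formula in which a given identifier occurs, using Python's standard `xml.etree.ElementTree`. The fragment is a shortened, self-contained adaptation of the sample shown earlier, and the helper `formulae_using` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Shortened adaptation of the TEI/MathML sample, closed so that it parses.
FRAGMENT = """<p>rather than using
<inlinemath><math><msub><mi>E</mi><mrow><mtext>max</mtext></mrow></msub>
<mo>=</mo><mi>∞</mi></math></inlinemath>. Now we can write
<eqnarray><subeqn><math><mfrac><mrow><mn>1</mn></mrow>
<mrow><mi>ξ</mi></mrow></mfrac>
<msub><mi>φ</mi><mrow><mi>v</mi></mrow></msub>
</math></subeqn></eqnarray></p>"""


def formulae_using(root, symbol):
    """Return every <math> element containing an <mi> identifier equal to symbol."""
    return [
        m for m in root.iter("math")
        if any((mi.text or "").strip() == symbol for mi in m.iter("mi"))
    ]


root = ET.fromstring(FRAGMENT)
print(len(formulae_using(root, "φ")))  # number of formulae that mention phi
print(len(formulae_using(root, "E")))  # number of formulae that mention E
```

Matching on `mi` elements rather than on raw text keeps identifiers apart from numbers, operators, and ordinary words in the surrounding prose.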
Tim Bray said that XML is the ASCII for the 21st century. XML allows documents in all major world languages to be viewed and transmitted in a standard, reliable way. Dozens of XML applications and vocabularies exist for tree-structured documents and data.
The TIPS project [17] was initiated to leverage these XML technologies and to benefit from the interoperability and robustness of widely deployed XML solutions on the Web.
TIPS (
The proposed system will support the activities of document writing, reviewing, publishing, searching, disseminating, and reading, as well as communication among members of the research community. This approach is suitable for supporting a more productive research community, in which researchers can work in a more effective, inexpensive, and pleasant way: delays and costs due to paper documents can be considerably reduced, multimedia can be added to electronic documents, and information access can be improved (and information overload decreased) by using advanced information retrieval and filtering techniques. XML technologies will be exploited to the full wherever possible.
An updated version, focusing on PassiveTeX, will be published in the proceedings of the TUG2000 Conference (TUGboat Vol 23, September 2000). A preprint is temporarily available.
Michel Goossens (well known by many users) is a CERN authority on LaTeX, XML and Electronic Document Publishing techniques in general. He has written several articles and books on the subject.