Sustainability of word processing documents

Ian Barnes

Friday, 20 January 2006, 3:59:23 PM

Word processing formats are a major problem for digital repositories. A large fraction of the material we want to preserve is created in these formats, but they are generally not suitable for long-term preservation because:

  • they are proprietary, and the owners can change either the format or the conditions of use at any time; and

  • they are flat (rather than structured), which makes information retrieval, viewing, printing and reuse more difficult.

We need to consider converting documents into a better format for long-term storage. A suitable format should be widely used, a stable and recognised standard, and versatile enough to handle all kinds of documents, from research monographs and articles ready for submission to scholarly journals to lecture notes and the minutes of the finance committee. It should also be easy to process using standard software tools. This last requirement is a strong argument for choosing an XML format. Candidate formats are XHTML, DocBook and TEI. I believe that DocBook and TEI are better than XHTML because they are more structured.

The problem with richly structured formats like DocBook XML and TEI is that word processing documents generally do not contain enough structural information to allow for an automated conversion process. There are a few possibilities.

  • The best scenario is that the document was created using a well-designed word processor template, so that every paragraph has a style name attached to it. These style names can then be used as hooks by an automated conversion process in order to deduce structure. The prototype Digital Scholar’s Workbench described below uses this strategy.

  • For legacy documents or for authors who refuse to use a template, the word processing document will have to be edited by a digital document archivist to get it into a state where it can be converted to DocBook (or TEI). The obvious way to do this is to open it in a word processor, import the template, and then go through the document applying the template styles. With help from keyboard shortcuts and macros (available in both Word and Writer), this might not be too painful, at least for relatively simple documents. Another possibility is to create a specialised digital document archivist’s workbench application for doing this kind of work.

  • For documents that are extremely poorly formatted, or that exist only on paper, a third alternative is to send them out to be rekeyed. This is expensive, but for high-value documents it may be worth it.
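The first scenario above, where style names act as hooks for deducing structure, can be sketched in a few lines of Python. This is only an illustration and not the Digital Scholar’s Workbench implementation: the style names ("Title", "Heading 1", "Heading 2"), the function name and the input format (a list of style name and text pairs already extracted from the word processing file) are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical style names; a real template would define its own.
TITLE_STYLE = "Title"
HEADING_STYLES = {"Heading 1": 1, "Heading 2": 2}


def paragraphs_to_docbook(paragraphs):
    """Build a DocBook-like <article> from (style_name, text) pairs.

    Heading styles open nested <section> elements; any other style
    becomes a plain <para> inside the innermost open section.
    """
    article = ET.Element("article")
    # Stack of (heading level, element); the article itself is level 0.
    stack = [(0, article)]
    for style, text in paragraphs:
        if style == TITLE_STYLE:
            ET.SubElement(article, "title").text = text
        elif style in HEADING_STYLES:
            level = HEADING_STYLES[style]
            # Close every open section at this level or deeper.
            while stack[-1][0] >= level:
                stack.pop()
            section = ET.SubElement(stack[-1][1], "section")
            ET.SubElement(section, "title").text = text
            stack.append((level, section))
        else:
            # Unrecognised styles fall back to an ordinary paragraph.
            ET.SubElement(stack[-1][1], "para").text = text
    return article


if __name__ == "__main__":
    sample = [
        ("Title", "Sustainability of word processing documents"),
        ("Heading 1", "Introduction"),
        ("Body Text", "Word processing formats are a major problem."),
    ]
    print(ET.tostring(paragraphs_to_docbook(sample), encoding="unicode"))
```

The point of the sketch is that the conversion needs no guesswork once every paragraph carries a meaningful style name. A document without such styles would come out as one undifferentiated run of paragraphs, which is why the other two scenarios need human intervention.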

These possibilities suggest a different approach for each category of document.

  1. For new documents, have authors create them in a way that makes preservation easy. This means working with campus IT trainers so that their basic word processing course gives people a suitable template and teaches them how to use it, making all their documents ready for this system.

    For new students and staff this may be enough. For people who already “know” how to use their word processor, we need to give them an incentive to change their ways. This is the reasoning behind providing an integrated system like the Digital Scholar’s Workbench that gives a range of benefits.

  2. For high-value legacy documents, someone will have to put work into them in order to raise them to a standard suitable for archiving. Perhaps this just means opening them in a word processor and applying the template and styles, or perhaps it means using some new custom-built “digital document archivist’s workbench” software: some integrated combination of an XML editor, a document converter and other tools. Another possibility here is rekeying.

  3. For lower-value legacy documents, it won’t be worth putting in all that effort, but we probably still want to preserve them. In this case, applying a simpler conversion process that converts a word processing document to an equivalent XHTML document may be a good option.
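For comparison, the simpler conversion suggested in the third item might look like the sketch below. It assumes the text of each paragraph has already been extracted from the word processing file, and the function name is hypothetical. No structure is deduced: every paragraph simply becomes a p element in a minimal XHTML document.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def paragraphs_to_xhtml(title, paragraphs):
    """Wrap plain paragraph strings in a minimal XHTML document.

    Returns the document as a string. Special characters such as '&'
    are escaped by ElementTree during serialisation.
    """
    html = ET.Element("html", {"xmlns": XHTML_NS})
    head = ET.SubElement(html, "head")
    ET.SubElement(head, "title").text = title
    body = ET.SubElement(html, "body")
    for text in paragraphs:
        ET.SubElement(body, "p").text = text
    return ET.tostring(html, encoding="unicode")


if __name__ == "__main__":
    print(paragraphs_to_xhtml("Committee minutes", ["Budget & audit", "Close"]))
```

The result is flat, just as the source document was, but it is well-formed XML in an open format, so it can still be processed with standard tools.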

