|
DIGITAL TEXT & DOCUMENTS
A word processor (more formally known as a document preparation system)
is a computer application used for the production (including
composition, editing, formatting, and possibly printing) of any sort
of printable material. "Word processor" may also refer to an obsolete
type of stand-alone office machine, popular in the 1970s and 1980s,
combining the keyboard text-entry and printing functions of an
electric typewriter with a dedicated computer for the editing of text.
Although features and design varied between manufacturers and models,
with new features added as technology advanced, word processors for
several years usually featured a monochrome display and the ability to
save documents on memory cards or diskettes. Later models introduced
innovations such as spell-checking programs, increased formatting
options, and dot-matrix printing. As the more versatile combination of
a personal computer and separate printer became commonplace, the word
processor quickly disappeared.
Word processors are descended from early text formatting tools
(sometimes called text justification tools, from their only real
capability). Word processing was one of the earliest applications for
the personal computer in office productivity.
Although early word processors used tag-based markup for document
formatting, most modern word processors take advantage of a graphical
user interface. Most are powerful systems consisting of one or more
programs that can produce any arbitrary combination of images,
graphics and text, the latter handled with type-setting capability.
Microsoft Word is the most widely used computer word processing
system; Microsoft estimates over five hundred million people use the
Office suite, which includes Word. There are also many other
commercial word processing applications, such as WordPerfect, which
dominated the market from the mid-1980s to the early 1990s, particularly
for machines running Microsoft's MS-DOS operating system.
Preservation of word-processed text documents depends on the following
considerations:
1. Preservation vs. access formats
2. Criteria for sustainability
3. Word processing formats
4. PDF
5. RTF
6. XML
Long-term storage of paper manuscripts and documents is a major
problem for digital repositories. These need to be converted into an
archival format for preservation. One has to decide which file
formats are suitable for long-term storage of word-processed text
documents, and how to convert documents into a suitable archival
format.
The majority of text documents created today are in file formats
generated by word processors. Most of the text we’re
interested in archiving is in one of the various Microsoft Word
formats. The smallest and simplest formats are WordStar files, Notepad
plain text, and RTF.
Since word processing formats are not suitable for preservation, many
archives seem to have chosen PDF, but this has serious problems. XML
is a better answer, but it’s not a complete answer. XML is not a file
format, but a meta-format, a framework for creating file formats. We
have to choose a suitable XML file format for storing documents.
There are various methods available for converting word processing
documents into a suitable XML format. There is a lot of published
research on digital preservation, but not much of it deals in any
detail with the preservation of text.
First we need to make an important distinction between preservation
formats and access or viewing formats.
1. Preservation vs. access formats
A preservation format is one suitable for storing a document in an
electronic archive for a long period. An access format is one suitable
for viewing a document or doing something with it.
The vast majority of all text documents created today are created in
Microsoft Word using its native .doc format (in one of its many
variations depending on the version of Word being used). It would be
great if we could just deposit Microsoft Word documents into
repositories and be done with it, but unfortunately that won’t do, for
a few good reasons:
Word format is proprietary. It is owned by Microsoft Corporation. Even
the recent Microsoft Word XML-based formats suffer from this. Why
are proprietary formats a bad thing? The owner could choose
to change the format at any time, possibly forcing repositories to
convert all their documents, or could change the licensing at any time,
perhaps insisting that documents may only be opened using their
software, or that users pay a fee for reading or editing existing
documents.
Except for the recent XML-based versions, Word is a binary format.
There is no obvious way to extract the content from a Word document.
If the document is corrupted even a little, the content can be lost.
Even the most recent version, Microsoft Open XML format, is a
compressed Zip archive of XML files. Compressed files are particularly
prone to major loss if corrupted. Word is not just one format but
many.
Even the new XML-based format has some technical problems.
Microsoft has released their latest XML-based file format, known as
Open XML, publicly, along with assurances that it is and will
always be free.
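The point above about compressed files being prone to major loss can be illustrated with a small experiment. The following is only a sketch, using Python's standard zlib module as a stand-in for the compression inside a Zip-based document; the sample text and the function name are made up for illustration. It flips a single byte in a plain-text copy and in a compressed copy of the same content, then counts how much of the original survives.

```python
import zlib


def intact_bytes_after_flip(original: bytes, compress: bool) -> int:
    """Flip one byte of the stored form and count surviving original bytes."""
    stored = bytearray(zlib.compress(original) if compress else original)
    stored[len(stored) // 2] ^= 0xFF  # simulate minor corruption in the middle
    if compress:
        try:
            recovered = zlib.decompress(bytes(stored))
        except zlib.error:
            return 0  # the decompressor rejects the whole stream
    else:
        recovered = bytes(stored)
    # Count positions where the recovered data still matches the original.
    return sum(a == b for a, b in zip(recovered, original))


# Hypothetical sample content, repeated to give the compressor something to do.
text = ("Word processing formats describe appearance; "
        "archives care about content and structure. ").encode("utf-8") * 20

plain_survivors = intact_bytes_after_flip(text, compress=False)
packed_survivors = intact_bytes_after_flip(text, compress=True)
print(plain_survivors, "of", len(text), "bytes survive in plain text")
print(packed_survivors, "of", len(text), "bytes survive in the compressed copy")
```

In the plain-text copy only the damaged byte is lost, whereas the compressed copy typically fails its integrity check and cannot be read at all without specialist recovery tools.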
Open Document Format grew out of OpenOffice.org’s earlier OpenOffice.org
XML format. It is now an OASIS and ISO standard and a European
Commission recommendation. It is supported by the open source word
processors KOffice and AbiWord, with more to come.
An ODF file is a Zip archive containing several XML files, plus images
and other objects. The Zip archiving and compression tool is freely
available on all major platforms, so there should never be a problem
getting at the content of an ODF document. Using a Zip archive does
mean that the files are prone to catastrophic loss of content with
even minor data corruption, in the same way as the Microsoft Word
formats discussed above.
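As a rough illustration of how accessible the content of an ODF file is, the sketch below uses only Python's standard library to open the Zip archive and pull the paragraph text out of content.xml. The sample document is built in memory and is far simpler than a real ODF file (it omits the manifest, styles and metadata), and the function names are illustrative assumptions rather than part of any ODF toolkit.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace URIs defined by the OpenDocument specification.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"


def odf_paragraphs(odf_bytes: bytes) -> list[str]:
    """Return the text of every <text:p> paragraph found in content.xml."""
    with zipfile.ZipFile(io.BytesIO(odf_bytes)) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    return ["".join(p.itertext()) for p in root.iter(f"{{{TEXT_NS}}}p")]


def make_sample_odf() -> bytes:
    """Build a minimal, ODF-like Zip archive in memory (not a complete ODF file)."""
    content = (
        f'<office:document-content xmlns:office="{OFFICE_NS}" '
        f'xmlns:text="{TEXT_NS}">'
        "<office:body><office:text>"
        "<text:p>First paragraph.</text:p>"
        "<text:p>Second <text:span>paragraph</text:span>.</text:p>"
        "</office:text></office:body></office:document-content>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("content.xml", content)
    return buffer.getvalue()


paragraphs = odf_paragraphs(make_sample_odf())
print(paragraphs)
```

For a real document, the bytes would come from reading the .odt file from disk instead of calling make_sample_odf().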
Word processing formats are at heart about describing the appearance
of the document, not its structure. For serious processing it’s the
structure we want. In 20, 50 or 100 years, most readers will probably
not care about the size of the paper, the margins, the fonts used and
so on. Even today, if we’re going to serve up a document as a web
page, those details are irrelevant. Sometimes these details can even
be a disadvantage, for example if the document insists on fonts that
are unavailable on your computer. On the other hand, the division of
the document into sections will always be relevant, useful and
important, and must be preserved.
There are several other word processing formats, but none of them has
much market share, nor do any of them have any particularly conspicuous
advantages. Probably the best strategy with these is to convert them
into Word or Open Document Format, then treat them in the same way as
the majority of documents.
OpenOffice.org will open many file formats, so it can be used as a
generic first stage in any process of converting documents into useful
formats. Use OpenOffice.org in server mode to open all documents and
save them in Open Document Format, then process them into something
better.
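A batch conversion step of this kind might be scripted roughly as follows. This is a sketch under assumptions: it uses the headless command-line conversion offered by current LibreOffice and recent OpenOffice builds (the soffice binary with --headless and --convert-to) rather than the older server mode mentioned above, and the file and directory names are hypothetical. The code only builds and prints the command; running it requires the office suite to be installed.

```python
import shlex
from pathlib import Path


def conversion_command(source: Path, outdir: Path, target: str = "odt") -> list[str]:
    """Build a headless conversion command for a single document."""
    return [
        "soffice",            # assumed to be on the PATH
        "--headless",         # run without a user interface
        "--convert-to", target,
        "--outdir", str(outdir),
        str(source),
    ]


# Hypothetical input file and output directory.
command = conversion_command(Path("report.doc"), Path("converted"))
print(shlex.join(command))
# To actually run the conversion: subprocess.run(command, check=True)
```

Keeping the command construction separate from its execution makes it easy to log or review exactly what will be run over a large collection before touching any files.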
Many repositories seem to have adopted PDF as their main format for
text documents, both for storage and for access. PDF has some good
points:
It is easy to create, either using Adobe Acrobat software or using the
PDF Export feature available in both Microsoft Word and OpenOffice.org
Writer.
It can be viewed on all platforms using the free Adobe Acrobat Reader
software (with some caveats, see below).
It is extremely effective at preserving the formatting of a document.
For some applications (for example in legal contexts) this may be of
vital importance.
However, there are some serious problems with using PDF as a storage
format:
The format is owned by Adobe. While it is currently open, the company
could decide to keep future versions secret.
There are some compatibility problems between different versions.
Documents may rely on system fonts. There is an option in PDF to embed
all fonts in the document, but not all software uses this, and some
PDF viewing software either cannot locate the correct fonts or doesn’t
know how to substitute suitable alternatives. Failing to embed all
fonts can result in a serious degradation of the on-screen appearance
of a document, or in a complete failure to display the content.
PDF includes extra features like encryption, compression, digital
rights management and embedding of objects from other software
packages. These all present difficulties, particularly the last.
PDF is an excellent access format for printing to paper. Any good
preservation system should be able to generate PDF renditions of
documents for this purpose. PDF is not so good for viewing on screen,
as it ties document content to a fixed page size. This means that for
large page sizes or small screens (e.g. on handheld devices like PDAs
or mobile phones) text will either be too small to read or the user
will have to scroll back and forth along the lines, which is highly
inconvenient. Looking ahead, who knows what viewing formats we will
use? We need to be able to reformat content to fit the viewing device.
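The idea of reformatting content to fit the viewing device can be shown with a toy example. The sketch below uses Python's standard textwrap module to render the same text for two hypothetical display widths; the widths and the function name are arbitrary choices for illustration, and real devices would of course involve fonts and layout, not just character columns.

```python
import textwrap

content = ("PDF ties document content to a fixed page size, "
           "whereas structured text can be reflowed to fit any screen.")


def render_for_width(text: str, columns: int) -> str:
    """Reflow the same content to fit a display that is `columns` characters wide."""
    return textwrap.fill(text, width=columns)


# A wide desktop window and a narrow handheld screen (hypothetical widths).
for columns in (60, 24):
    print(f"--- {columns} columns ---")
    print(render_for_width(content, columns))
```

The content is stored once and laid out at viewing time, which is exactly what a fixed-page format prevents.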
RTF (Rich Text Format)
RTF stands for Rich Text Format. It is a Microsoft specification[17],
but they have published it, so one could argue that it is an open
standard. It is certainly widely interoperable, with most word
processors capable of reading and writing RTF. There are problems with
using RTF as a preservation format:
It is still defined by a corporation, with all the risks that entails.
There seem to be parts of the specification that are not in the
publicly available specification document, and which have changed over
the years.
The specification is not complete and precise, leaving many little
quirks.
The National Library of Australia has chosen RTF as its main
preservation format[5]. I think a well-chosen XML file format has
significant advantages over RTF, but it might well be worth retaining
RTF as an access format, since it has good interoperability.
XML
XML is widely accepted as a desirable format for document
preservation. See for example the assessment of XML on the US Library
of Congress digital formats web site and the related conference
paper by Arms & Fleischauer. The reasons are simple:
XML is a free, open standard.
XML uses standard character encodings, including full support for
Unicode. This makes it capable of describing almost anything in any
language.
XML is based on plain text. This gives it the best possible chance of
being readable far into the future. Even if XML and XSLT are no longer
available, the raw document content and markup will still be
human-readable. (This will be true even if the meaning of the markup
has been lost, although formats designed with preservation in mind
should make the meaning more or less apparent from the carefully
chosen element and attribute names).
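To make these points concrete, here is a small sketch of a structurally marked-up document containing text in more than one language, parsed with Python's standard library. The element names (article, title, section, heading, para) are invented for illustration and do not belong to any particular standard format.

```python
import xml.etree.ElementTree as ET

# A hypothetical structural format: element names describe the role of the
# content, not its appearance, and the text may be in any Unicode script.
document = """<?xml version="1.0" encoding="UTF-8"?>
<article>
  <title>Préservation des documents</title>
  <section>
    <heading>Formats</heading>
    <para>XML は Unicode に対応しています。</para>
  </section>
</article>"""

root = ET.fromstring(document.encode("utf-8"))
title = root.findtext("title")
headings = [h.text for h in root.iter("heading")]
print(title, headings)
```

Even without a parser, the markup above can be read in any text editor, and the element names give a fair idea of what each part of the document is for.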
TEI (Text Encoding Initiative)
TEI stands for the Text Encoding Initiative. Its guidelines are
aimed mostly at the preservation of literary and linguistic texts (so
a very different slant to DocBook). Like DocBook, TEI is huge.
Furthermore, it’s not exactly a format, but a set of guidelines for
building more specialized formats. One such is TEI-Lite, which has
proved very popular, and is used by several serious repositories.
TEI may be better-matched than DocBook to some scholarly work,
particularly in the humanities. It does have some serious shortcomings,
however:
Authors create documents in a word processor (either Word or Writer),
using a generic template. They must use styles, and only the special
styles in the template, not the standard built-in styles. The key to
effective web publishing like this is to have a fast feedback loop.
Instead of authors sending their work to a web publisher and getting
the result back weeks later, they save their document and click
"Refresh" in their browser to see the results. If they have done
something wrong, they see it straight away.
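One part of such a feedback loop could be an automatic check that a document uses only the template's styles. The sketch below assumes the document has been saved in Open Document Format and inspects the style names on paragraphs and headings in content.xml. The allowed style names are hypothetical, and the check is simplified: real ODF files also contain automatic styles (such as P1) that derive from named styles and would need to be resolved first.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# Hypothetical names for the special styles defined in the template.
ALLOWED_STYLES = {"p-normal", "h1-chapter", "h2-section"}


def disallowed_styles(content_xml: str) -> set[str]:
    """Return style names used on paragraphs or headings that are not allowed."""
    root = ET.fromstring(content_xml)
    used = set()
    for element in root.iter():
        if element.tag in (f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h"):
            used.add(element.get(f"{{{TEXT_NS}}}style-name", "(none)"))
    return used - ALLOWED_STYLES


# A simplified fragment standing in for a real content.xml.
sample = (
    f'<body xmlns:text="{TEXT_NS}">'
    '<text:h text:style-name="h1-chapter">Intro</text:h>'
    '<text:p text:style-name="p-normal">Fine.</text:p>'
    '<text:p text:style-name="Standard">Uses a built-in style.</text:p>'
    "</body>"
)
problems = disallowed_styles(sample)
print(problems)
```

Reporting the offending style names back to the author immediately is what keeps the loop fast: the mistake is visible as soon as the document is saved.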
|