This article explores several issues in document processing, a field closely related to language processing. Present-day document processing software is found to be inadequate for handling idioms and phrases in documents, particularly in interlingual translation.
Document processing involves analysis and recognition of the documents
under consideration. The main operations are inserting text,
deleting text, cutting and pasting, and similar modifications
made to a document. In cut and paste, for instance, a piece
of data is moved to a temporary location, from which it is then
pasted elsewhere. A document often involves different font
specifications and changes such as bold, italics and underlining,
as well as changes of font size and typeface. A document also
often contains footnotes and cross-references.
Headers can be specified so that they are inserted automatically
at the top of each page of a document. A spell checker is a
utility that checks the spelling of words and highlights every
word that it does not recognize. It is therefore very important
in document recognition. Today's document processing software
provides features for automatic spelling and grammar checking.
For instance, the AutoCorrect feature of MS-Word corrects
most typographical errors, and the feature can be turned
on and off. The design and implementation of a spell checker
for Assamese is detailed in [1]. A built-in thesaurus allows the
user to search for synonyms without leaving the document
processing environment.
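
The spell-checking behaviour described above, in which every word that is not recognized gets highlighted, can be illustrated with a minimal sketch. The function name `highlight_unrecognized`, the asterisk marker and the tiny word list are assumptions made for illustration only; they do not describe how MS-Word or the Assamese spell checker in [1] is implemented, and a real checker would also handle morphology, suggestions and non-Latin scripts.

```python
import re

# Hypothetical helper: wraps every word that is not in the word list
# with a marker, mimicking the "highlight unrecognized words" behaviour.
def highlight_unrecognized(text, word_list, mark="*"):
    # Compare case-insensitively so "The" matches "the".
    known = {w.lower() for w in word_list}

    def wrap(match):
        word = match.group(0)
        if word.lower() in known:
            return word
        return f"{mark}{word}{mark}"

    # Only plain Latin letters and apostrophes are treated as word
    # characters here (an assumption made to keep the sketch short).
    return re.sub(r"[A-Za-z']+", wrap, text)


# Illustrative word list, not a real dictionary.
words = ["the", "document", "is", "ready"]
result = highlight_unrecognized("The documnt is ready", words)
print(result)
```

Here the lookup is a simple set membership test, so the sketch only flags unknown words; it does not propose corrections the way an auto-correct feature does.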
Mail-merge systems, varying in power and flexibility,
are a feature of most document development systems.
Mail merge uses two files: one stores a set of
information such as a list of names and addresses, and the other
contains the body of the letter. In the second file, special
symbols stand in place of the names and addresses that will
come from the first file. When the merge command is executed,
the appropriate data from the first file replaces the symbols
in the second file. With the WYSIWYG (what you see is what you
get) feature, a document appears on the display screen exactly
as it will look when printed. Document recognition and analysis
play a crucial role in a document development system.
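
The two-file mail-merge mechanism described above can be sketched as follows. This is a minimal illustration under stated assumptions: the data file is taken to be CSV text with `name` and `address` columns, the placeholder symbols use the `$name` syntax of Python's `string.Template`, and the function name `mail_merge` is hypothetical. Actual mail-merge systems use their own file formats and placeholder conventions.

```python
import csv
import io
from string import Template

# Hypothetical helper: produces one merged letter per data record.
def mail_merge(records_csv, letter_template):
    # First "file": a list of records, one per row, with a header row.
    reader = csv.DictReader(io.StringIO(records_csv))
    # Second "file": the letter body containing placeholder symbols.
    template = Template(letter_template)
    # Replace each placeholder with the matching field of each record.
    return [template.substitute(row) for row in reader]


# Illustrative data, invented for this example.
records = "name,address\nAsha,12 Park Road\nRavi,7 Lake View\n"
letter = "Dear $name,\nYour parcel will be sent to $address.\n"

letters = mail_merge(records, letter)
for merged in letters:
    print(merged)
```

Using `substitute` rather than `safe_substitute` means a placeholder with no matching column raises an error instead of silently remaining in the letter, which is usually the safer choice for generated correspondence.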
This article presents some important issues in document analysis
and recognition. It is organized as follows.
The next section presents the various scanners and their importance
in present-day OCR systems. Section 3 explains
various character encoding standards, including
the emerging Unicode. Section 4 describes various issues connected
with glyphs, fonts and the peculiarities of languages.
Section 5 describes the use of idioms and phrases in document
preparation and the lack of any mechanism in existing
document processing software to identify and process them
appropriately, for example in interlingual document translation.
Section 6 discusses phonology, in particular diphthongs and
other sound-processing issues. Section 7 describes
the various memory systems in humans. Section 8
describes how recognition and cognition are closely
intertwined. Section 9 concludes the article.