Some Issues in Document Analysis and Recognition

The IUP Journal of Information Technology

Article Details

Pub. Date	:	March, 2006
Product Name	:	The IUP Journal of INFORMATION TECHNOLOGY
Product Type	:	Article
Product Code	:	IJIT40603
Author Name	:	K Chandra Sekharaiah and Upakaram Gopal
Availability	:	YES
Subject/Domain	:	Science and Technology
Download Format	:	PDF Format
No. of Pages	:	8

Description

This article explores various issues related to document processing. Document processing is closely related to language processing. Present-day document processing software is found to be inadequate for handling idioms and phrases in documents during interlingual translation.

Document processing involves analysis and recognition of the documents under consideration. The main operations are inserting text, deleting text, cutting and pasting, and other such modifications to a document. In cut and paste, for instance, a piece of data is moved to a temporary location before being placed elsewhere. A document often involves different font specifications and changes such as bold, italics and underlining, and font size and typeface may also change. A document often contains footnotes and cross-references, and headers can be specified for automatic insertion at the top of each page.

A spell checker is a utility that checks the spelling of words and highlights every word it does not recognize, which makes it very important in document recognition. Today's document processing software provides automatic spelling and grammar checking. For instance, the AutoCorrect feature of MS-Word corrects most typographical errors and can be turned on and off. The design and implementation of a spell checker for Assamese is detailed in [1]. A built-in thesaurus allows the user to search for synonyms without leaving the document.

Mail-merge systems, varying in power and flexibility, are a feature of most document development systems. Mail merge uses two files: one stores a set of information such as a list of names and addresses, and the other contains the body of the letter. In the second file, special symbols stand in place of the names and addresses that will come from the first file. When the merge command is executed, the appropriate data from the first file replaces the symbols in the second file. With the WYSIWYG (what you see is what you get) feature, a document appears on the display screen exactly as it will look when printed. Document recognition and analysis play a crucial role in a document development system.
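
The mail-merge step described above can be illustrated with a minimal sketch. It is not taken from the article or from any particular word processor: it assumes the first file has already been read into a list of records and the letter body into a string, and it uses Python's standard `string.Template`, whose `$name` and `$address` placeholders play the role of the special symbols. The records, field names and letter text are all hypothetical.

```python
from string import Template

# Hypothetical "first file": one record per recipient (invented names and addresses).
records = [
    {"name": "A. Rao", "address": "12 Lake Road, Hyderabad"},
    {"name": "B. Sen", "address": "4 Hill Street, Kolkata"},
]

# Hypothetical "second file": the letter body, with $name and $address
# as the special symbols that stand in for the recipient's details.
letter = Template(
    "To: $name\n"
    "$address\n"
    "\n"
    "Dear $name,\n"
    "Your copy of the article is enclosed.\n"
)


def mail_merge(template, rows):
    """Return one merged letter per record, with each symbol replaced by that record's field."""
    return [template.substitute(row) for row in rows]


merged = mail_merge(letter, records)
print(merged[0])
```

`Template.substitute` raises `KeyError` if a record lacks a field that the letter refers to, so an incomplete record is reported rather than silently producing a letter with an unfilled symbol.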

This article presents some important issues in document analysis and recognition. It is organized as follows. The next section presents the various scanners and their importance in present-day OCR systems. Section 3 explains various character encoding standards, including the emerging Unicode. Section 4 describes various issues connected with glyphs, fonts and the peculiarities of languages. Section 5 describes the usage of idioms and phrases in document preparation and the lack of any mechanism in existing document processing software to identify and process them appropriately, for example in interlingual document translation. Section 6 discusses details of phonology as regards diphthongs and other sound-processing issues. Section 7 describes the various memory systems in humans. Section 8 describes how recognition and cognition are closely intertwined. Section 9 concludes the article.

Keywords

Some Issues in Document Analysis and Recognition, Document processing, interlingual translation, power and flexibility, document analysis, interlingual document translation, sound-processing issues, Mail-merge systems, document development systems.