Computer Sciences Journal | Experiments on Clustering and Multi-Document Summarization

The IUP Journal of Computer Sciences

Experiments on Clustering and Multi-Document Summarization

Article Details

Pub. Date	:	April, 2009
Product Name	:	The IUP Journal of Computer Sciences
Product Type	:	Article
Product Code	:	IJCS60904
Author Name	:	Maruthamuthu, Maheedharan, Kirubakaran and Shanmugasundaram Hariharan
Availability	:	YES
Subject/Domain	:	Management
Download Format	:	PDF Format
No. of Pages	:	10

Price

For delivery in electronic format: Rs. 50; For delivery through courier (within India): Rs. 50 + Rs. 25 for Shipping & Handling Charges

Abstract

Many commercial newspaper sites produce news reports for users in different forms and styles, while the content remains the same across all of them. An effective summary could therefore save the end user the time needed to read every report. Although all existing automatic multi-document summarizers generate summaries from multiple clusters, identifying such strongly related clusters is quite challenging. This paper proposes a novel framework for domain-independent document clustering. It focuses in depth on the process of clustering documents drawn from the large volumes of data available. The paper also investigates the optimal threshold for clustering the documents effectively, and it proposes a summarization procedure for the clustered documents. The results are promising and have some significance for multi-document cluster generation.

Description

Clustering puts together those words, sentences or documents that denote the same concepts, themes or topics. As sentences contain less information than documents, fewer features can be employed in sentence clustering, and similarity computation among sentences is more challenging than among documents. One approach is to cluster the documents after retrieval and present a synopsis of each cluster so that a user can choose clusters of interest. Several clustering approaches are quite popular, including the spherical k-means algorithm (Dhillon and Modha, 2001) and the clustering of high-dimensional, sparse text data by 'first variation', which moves data points between clusters incrementally (Dhillon et al., 2002). Text categorization is typically formulated as a learning task, where a classifier learns how to distinguish between categories in a given set, using features automatically extracted from a collection of training documents (Rada and Samer, 2005).

Nowadays, various information sources are accessible through the Internet, and news sites are especially useful among them. A number of commercial news service providers, such as NewsIsFree (newsisfree.com), and Internet news services (e.g., AltaVista News or Google News) present clusters of related articles, allowing readers to easily find all stories on a given topic. However, these services do not produce summaries: a reader seeking a quick topic overview must choose between reading one representative article in full and skimming through all the articles. The answer to this problem is to produce automatic generic text summaries, which is precisely the issue we are concerned with in this paper. We have extracted newspaper reports from major newspaper organizations like The Hindu, Indian Express and Deccan Herald and from other news services like Sify, Google and Yahoo. Categorization of news articles is the first step towards enabling this.
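The idea of grouping related news reports with a similarity threshold can be illustrated with a minimal sketch. This is not the procedure used in the paper, which is not described on this page. It assumes a plain term-frequency vector space model, cosine similarity, and a single-pass scheme that compares each document with the first document of every existing cluster. The function names and the default threshold of 0.3 are hypothetical choices made for illustration only.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector (a simple vector space model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_by_threshold(docs, threshold=0.3):
    """Single-pass clustering (an assumed scheme, not the paper's).

    Each document joins the first cluster whose seed document (the first
    member) has cosine similarity >= threshold; otherwise it starts a
    new cluster. Returns clusters as lists of document indices.
    """
    vectors = [term_vector(d) for d in docs]
    clusters = []
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            # No existing cluster was similar enough.
            clusters.append([i])
    return clusters


reports = [
    "The cricket team won the match",
    "Cricket match won by the team today",
    "Stock markets fell sharply on Monday",
]
print(cluster_by_threshold(reports, threshold=0.3))
```

Raising the threshold splits the reports into more, tighter clusters, while lowering it merges loosely related reports. This trade-off is why the choice of threshold matters for the quality of the resulting clusters.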

Summarization of a document is similar to précis-writing, in which the reader prepares his own summary after reading the whole document. A summary is a condensed, shorter version of the information that preserves the essence of the original documents. The main objective of this paper is to create a summarizer whose goal is to produce a shorter version of a source text while still retaining its main semantic content. The job of the summarizer is to retrieve the most informative content from various sources.
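As a rough illustration of producing a shorter version of clustered reports, the following sketch scores each sentence by the average frequency of its words across the cluster and keeps the top-scoring sentences in their original order. It is an assumed, generic frequency-based extractive scheme, not the summarization procedure proposed in the paper. The function name `summarize` and the parameter `max_sentences` are hypothetical, and the sketch does no stop-word filtering or redundancy removal.

```python
import re
from collections import Counter


def summarize(documents, max_sentences=2):
    """Frequency-based extractive summary of a cluster of documents.

    Assumed scheme: split into sentences, score each sentence by the
    mean cluster-wide frequency of its words, keep the best ones.
    """
    # Naive sentence splitting on ., ! or ? followed by whitespace.
    sentences = []
    for doc in documents:
        parts = re.split(r"(?<=[.!?])\s+", doc)
        sentences.extend(p.strip() for p in parts if p.strip())

    # Word frequencies over the whole cluster.
    word_freq = Counter(
        w for s in sentences for w in re.findall(r"[a-z]+", s.lower())
    )

    def score(sentence):
        words = re.findall(r"[a-z]+", sentence.lower())
        return sum(word_freq[w] for w in words) / len(words) if words else 0.0

    # Rank sentence indices by score, then restore original order.
    ranked = sorted(
        range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True
    )
    chosen = sorted(ranked[:max_sentences])
    return [sentences[i] for i in chosen]


cluster = [
    "Rain flooded the city. Schools were closed.",
    "Rain flooded the city roads. Traffic stopped.",
]
print(summarize(cluster, max_sentences=1))
```

Because near-identical sentences from different reports score similarly, a practical summarizer would also need a step that drops sentences too similar to ones already selected.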

Keywords

Computer Sciences Journal, Multi-Document Summarization, Clustering, Research and Development, Recent Advances in Natural Language Processing, RANLP, Knowledge Management, Data Mining, IEEE International Conference, Science and Technology, Frequency-based Approach, Vector Space Model, VSM.