Clustering puts together those words, sentences or documents that denote the same
concepts, themes or topics. As sentences contain less information than documents,
fewer features can be employed in sentence clustering. Similarity computation among
sentences is more challenging than among documents. One approach is to cluster the
documents after retrieval and present a synopsis of each cluster so that a user can
choose clusters of interest. Popular clustering approaches include the spherical k-means
algorithm (Dhillon and Modha, 2001) and, for high-dimensional and sparse text data,
the ‘first-variation’ technique, which incrementally moves data points between
clusters (Dhillon et al., 2002). Text categorization is a problem typically
formulated as a learning task, where a classifier learns how to distinguish between
categories in a given set, using features automatically extracted from a collection of
training documents (Rada and Samer, 2005). Nowadays, various information sources are
accessible through the Internet, and news sites are especially useful among them.
Several commercial news service providers, such as NewsIsFree (newsisfree.com), and
Internet news services (e.g., AltaVistaNews or Google News) present
clusters of related articles, allowing readers to easily find all stories on a given topic.
However, these services do not produce summaries: a reader seeking a quick topic
overview must choose between reading one representative article in full and
skimming through all the articles. One answer to this problem is to produce
generic text summaries automatically. This is precisely the issue we are concerned with in this paper.
We extracted news reports from major newspapers such as The
Hindu, Indian Express and Deccan Herald, and from other news services such as Sify,
Google and Yahoo. Categorizing the news articles is the first step towards producing
such summaries.
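The learning formulation of text categorization described above can be illustrated with a minimal sketch. It is not the classifier used in this paper: it assumes plain term-count features, cosine similarity, and a nearest-centroid decision rule, and the function names (`train`, `categorize`), the category labels and the training sentences are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def normalize(vec):
    """Scale a sparse term-weight vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def cosine(a, b):
    """Dot product of two unit-length sparse vectors, i.e. their cosine."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def train(labelled_docs):
    """Build one unit-length centroid per category from (text, label) pairs."""
    sums = defaultdict(Counter)
    for text, label in labelled_docs:
        for term, weight in normalize(Counter(tokenize(text))).items():
            sums[label][term] += weight
    return {label: normalize(vec) for label, vec in sums.items()}


def categorize(text, centroids):
    """Return the category whose centroid is most similar to the text."""
    doc = normalize(Counter(tokenize(text)))
    return max(centroids, key=lambda label: cosine(doc, centroids[label]))


# Hypothetical training collection, for illustration only.
training = [
    ("The batsman scored a century in the cricket match", "sports"),
    ("The team won the football match after a late goal", "sports"),
    ("The stock market fell as bank shares dropped", "business"),
    ("The bank raised interest rates and the market reacted", "business"),
]
centroids = train(training)
print(categorize("Shares rose on the stock market today", centroids))
```

Normalizing every document to unit length before averaging keeps long articles from dominating a category centroid. A practical system would likely add stop-word removal and a term weighting such as tf-idf.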
Summarization of a document is similar to précis writing, in which the reader
prepares his or her own summary after reading the whole document.
A summary is a condensed, shorter version of the information that
preserves the essential meaning of the original documents. The main objective of this paper is to
create a summarizer that produces a shorter version of a source text while
retaining its main semantic content. The job of the summarizer is to retrieve the
most informative content from various sources.
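To make the goal of a summarizer concrete, the following is a minimal extractive sketch. It is an assumption made for illustration, not the method developed in this paper: sentences are scored by the average corpus frequency of their non-stop-word terms, and the top-scoring ones are returned in their original order. The stop-word list, the function names and the sample article are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real system would use a fuller one.
STOPWORDS = frozenset(
    {"a", "an", "and", "in", "is", "of", "on", "said", "the", "this", "to", "will"}
)


def split_sentences(text):
    """Split text into sentences at ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(text):
    """Lowercased alphabetic tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def summarize(text, max_sentences=2):
    """Return the highest-scoring sentences, kept in source order."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence):
        words = content_words(sentence)
        # Average term frequency, so long sentences are not favoured by length.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(
        range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True
    )
    chosen = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in chosen)


# Hypothetical article text, for illustration only.
article = (
    "The monsoon reached the coast on Monday. "
    "Heavy monsoon rain flooded several coastal roads. "
    "Officials said schools will stay closed. "
    "The monsoon rain is expected to continue this week."
)
print(summarize(article, max_sentences=2))
```

Because the selected sentences are copied verbatim, the output is always shorter than the source whenever `max_sentences` is smaller than the number of sentences, which matches the requirement that a summary be a condensed version of the original.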