Web mining can be broadly defined as the application of data mining techniques to automatically discover and analyze useful information from the many resources available as documents or databases on the World Wide Web (WWW). In other words, it extracts and mines useful information from the Web. The data sources involved can be heterogeneous, dispersed, or even distributed.
Web mining thus starts with resource discovery, followed by information extraction from the appropriate resources identified, generalization (finding general patterns within websites or across web pages), and finally analysis of the extracted information.
Web Content Mining: This involves automatically searching the WWW for content, that is, for data. The data can take any form: unstructured (simple text data), structured (HTML pages generated by databases or using XML), or even semi-structured (HTML files). The web data can be in any format, such as text, image, audio, or video, although in its initial stages web content mining was limited to text documents. Web content mining can be categorized into an agent-based approach and a database approach. The agent-based approach uses agents, either intelligent or personalized, for information identification, collection, and retrieval. The database approach concentrates on transforming the semi-structured data available on the Web into structured collections of resources that can be queried with standard database mechanisms (1). Data mining techniques are then used to analyze the data.
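
The database approach described above can be illustrated with a minimal sketch: semi-structured HTML is parsed into rows, loaded into a relational table, and then queried with ordinary SQL. This is only an illustration under stated assumptions, not a method taken from the text. The class RowParser, the function html_table_to_db, the sample markup, and the table name pages are all hypothetical, and only the Python standard library is used.

```python
from html.parser import HTMLParser
import sqlite3

# Hypothetical sample page: a semi-structured HTML table of documents.
SAMPLE_HTML = """
<table>
<tr><th>Title</th><th>Topic</th></tr>
<tr><td>Intro to HITS</td><td>structure</td></tr>
<tr><td>Text clustering</td><td>content</td></tr>
<tr><td>PageRank notes</td><td>structure</td></tr>
</table>
"""


class RowParser(HTMLParser):
    """Collects the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []        # finished rows, one tuple per <tr>
        self._row = None      # cells of the row currently being read
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            # Header rows contain only <th> cells, so they stay empty
            # and are skipped here.
            if self._row:
                self.rows.append(tuple(c.strip() for c in self._row))
            self._row = None
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._row:
            self._row[-1] += data


def html_table_to_db(html):
    """Load two-column rows from the HTML into an in-memory SQLite table."""
    parser = RowParser()
    parser.feed(html)
    parser.close()
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pages (title TEXT, topic TEXT)")
    conn.executemany(
        "INSERT INTO pages VALUES (?, ?)",
        [row for row in parser.rows if len(row) == 2],
    )
    return conn


if __name__ == "__main__":
    conn = html_table_to_db(SAMPLE_HTML)
    query = "SELECT topic, COUNT(*) FROM pages GROUP BY topic ORDER BY topic"
    for topic, count in conn.execute(query):
        print(topic, count)
```

Once the data is in a structured store, the analysis step can use any standard query or data mining technique; the grouping query here simply counts pages per topic.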
Web Structure Mining: This branch deals with mining the structure of the Web. The first kind of web structure mining analyzes hyperlinks and classifies websites and web pages according to the information derived from them. The second kind mines the structure of the web page itself to analyze inter-page relations, again using the hyperlinks through which the pages are connected. Standard techniques of web structure mining include Google's PageRank, CLEVER, and HITS.
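
As a rough illustration of hyperlink-based ranking, the sketch below implements a simplified PageRank by power iteration over a small link graph. It is a textbook-style approximation, not Google's production algorithm. The function name pagerank, the damping factor of 0.85, the fixed iteration count, the even redistribution of rank from pages without outgoing links, and the three-page example graph are all assumptions made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Returns a dict mapping each page to its rank; ranks sum to about 1.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a base share from the random-jump term.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links spreads its
                # rank evenly over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical graph: A links to B and C, B links to C, C links to A.
    graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    for page, score in sorted(pagerank(graph).items()):
        print(page, round(score, 4))
```

In this example graph, page C collects links from both A and B and so ends up with the highest rank, which reflects the idea that a page is important when other pages point to it.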