The IUP Journal of Soft Skills
Reflecting Design Considerations: An End-to-End Case Study on Preparing Cricket Data Available on Net Analysis Ready

Article Details
Pub. Date : Sep, 2018
Product Name : The IUP Journal of Soft Skills
Product Type : Article
Product Code : IJIT11809
Author Name : Subhasis Ray and Kalyan Sengupta
Availability : YES
Subject/Domain : Management
Download Format : PDF Format
No. of Pages : 34

Abstract

The use of the Internet as a source of secondary data is becoming more popular day by day. Websites are made up of webpages that contain a huge volume of useful information in textual form. However, webpages are coded in text-based mark-up languages (e.g., HTML, XHTML and XML) to facilitate end-user viewing rather than automated use. This has led to a new discipline called web scraping, which fetches webpages and then extracts data from them for future use. Many organizations have seized this business opportunity and developed efficient web scraping tools. The paper shows readers how data can be sourced from the Internet for scientific or commercial purposes. It elaborates on the available design options for fetching, extracting, validating and transforming data, either in the absence of an end-to-end tool or to supplement one. This is followed by a specific case study that deals with reactive analysis of structured data from multiple predetermined sources/pages. The paper concludes that design considerations for web scraping have to be dynamic: none of traditional copy-and-paste, trapping feeds through Application Programming Interfaces (APIs), programming in Java, Python or R, or the available end-to-end tools is uniformly better than the rest.


Description

Over the years, the World Wide Web (WWW) has become a repository of huge volumes of data. As a consequence, businesses, institutions and researchers are trying to utilize these "free" data for their "benefit". Keeping pace with this, technology is advancing to access such data in the smartest possible way. However, these are secondary data, so they must be thoroughly validated to align the objectives of the source with those of the end users. The basic underlying framework for accessing such data is provided in Figure 1.
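As a rough illustration of the extract, validate and transform stages named in the abstract, the following Python sketch uses only the standard library. It is not the authors' implementation: the sample scorecard, the column names and the helper names (`TableExtractor`, `extract`, `validate`, `transform`) are all hypothetical, and the fetching step is replaced by an inline HTML string so that the example runs without network access.

```python
from html.parser import HTMLParser

# Hypothetical sample page; a real pipeline would fetch this over HTTP.
SAMPLE_HTML = """
<table>
  <tr><th>Player</th><th>Runs</th><th>Balls</th></tr>
  <tr><td>Player A</td><td>54</td><td>40</td></tr>
  <tr><td>Player B</td><td>DNB</td><td>-</td></tr>
  <tr><td>Player C</td><td>12</td><td>15</td></tr>
</table>
"""


class TableExtractor(HTMLParser):
    """Collects the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract(html):
    """Extract: turn mark-up into a list of rows (lists of strings)."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


def validate(rows):
    """Validate: keep rows whose numeric columns hold only digits."""
    header, *body = rows
    valid, rejected = [], []
    for row in body:
        if len(row) == len(header) and all(v.isdigit() for v in row[1:]):
            valid.append(row)
        else:
            rejected.append(row)
    return header, valid, rejected


def transform(header, rows):
    """Transform: build typed records keyed by lower-cased column names."""
    keys = [h.lower() for h in header]
    records = []
    for row in rows:
        record = {keys[0]: row[0]}
        for key, value in zip(keys[1:], row[1:]):
            record[key] = int(value)
        records.append(record)
    return records


header, valid, rejected = validate(extract(SAMPLE_HTML))
records = transform(header, valid)
print(records)
print("rejected:", rejected)
```

Keeping the rejected rows rather than silently dropping them reflects the point made above that secondary data need validation before use: a row such as a "did not bat" entry can be inspected and handled deliberately instead of distorting later analysis.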


Keywords

Web scraping, Data preprocessing, Data transformation, Structured Query Language (SQL), Java, Python, R, Game of cricket