The IUP Journal of Information Technology
An Ensemble Learning Approach to Classifying Documents Based on Formal and Informal Writing Styles

Article Details
Pub. Date : Sep, 2022
Product Name : The IUP Journal of Information Technology
Product Type : Article
Product Code : IJIT020922
Author Name : Kuhaneswaran Banujan and Nirubikaa Ravikumar
Availability : YES
Subject/Domain : Engineering
Download Format : PDF Format
No. of Pages : 23

Abstract

With recent advances in technology, many students and scholars have come to rely on the Internet as their main educational resource, since a wide variety of documents can be obtained online. These documents can be classified as either formal or informal in writing style, and the two styles differ in their linguistic features. The paper presents a method to automatically identify the style of a given document. First, a dataset of online documents was compiled and preprocessed. Next, features were extracted using a Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer. Classification models were then built using six classification algorithms. In the first experiment, five Machine Learning algorithms (Random Forest, Decision Tree, Support Vector Machine, Multilayer Perceptron, and Naive Bayes) were used. Of these five, Random Forest performed best, obtaining an accuracy of 87.44%, high values for precision, recall, and F-measure, and the lowest error rate. In the second experiment, an Ensemble Learning method was used, in which a Vote algorithm combined the five algorithms. This method obtained an accuracy of 91.96%.
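
The pipeline described in the abstract (TF-IDF features, five base classifiers, and a voting ensemble) can be illustrated with a minimal sketch. It assumes Python with scikit-learn, which the abstract does not name as the authors' toolkit. The toy corpus, the hyperparameters, and the function name build_vote_pipeline are hypothetical and are not taken from the paper; the sketch shows only the shape of the approach and does not reproduce the reported accuracies.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy corpus; the paper's dataset is not reproduced here.
docs = [
    "We hereby request that the committee review the attached proposal.",
    "The results indicate a statistically significant improvement.",
    "Please find enclosed the documents pertaining to your application.",
    "The study was conducted in accordance with institutional guidelines.",
    "hey, wanna grab lunch later? lol",
    "omg that movie was sooo good, gotta see it again",
    "nah i'm gonna skip class today, too tired",
    "thx for the help, u r the best!",
]
labels = ["formal"] * 4 + ["informal"] * 4


def build_vote_pipeline(seed=0):
    """TF-IDF features feeding a majority vote over five base classifiers."""
    base_models = [
        ("rf", RandomForestClassifier(n_estimators=50, random_state=seed)),
        ("dt", DecisionTreeClassifier(random_state=seed)),
        ("svm", SVC(kernel="linear")),
        ("mlp", MLPClassifier(hidden_layer_sizes=(16,), max_iter=500,
                              random_state=seed)),
        ("nb", MultinomialNB()),
    ]
    # voting="hard" means each model casts one vote and the majority wins.
    ensemble = VotingClassifier(estimators=base_models, voting="hard")
    return make_pipeline(TfidfVectorizer(lowercase=True), ensemble)


model = build_vote_pipeline()
model.fit(docs, labels)
predictions = model.predict(docs)
print(list(predictions))
```

In practice the model would be evaluated on held-out documents (for example with cross-validation) rather than on the training set, as is done here only to keep the sketch short.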


Introduction

Education today is highly competitive worldwide. As a result, many students and scholars pursue their studies using the Internet as their main resource. The Internet also contributes significantly to the evolution of teaching and learning methods. Many educational websites provide a wealth of material relevant to students and scholars, including academic papers, journal papers, educational books and news. The authors of these documents write in either a formal or an informal style.


Keywords

Document classification, Formal style, Informal style, Ensemble Learning, Machine Learning