The IUP Journal of Information Technology
Machine Learning-Based Approach for Classifying the Source Code Using Programming Keywords

Article Details
Pub. Date : Mar, 2022
Product Type : Article
Product Code : IJIT10322
Author Name : Mohamed Ifham, BTGS Kumara and Kuhaneswaran Banujan
Availability : YES
Subject/Domain : Engineering
Download Format : PDF Format
No. of Pages : 19

Abstract

The implementation phase is one of the most critical periods in software development. Developers either write new source code or reuse functionality from existing source code, depending on the requirements of the system. Most developers spend more time searching and navigating existing source code than writing new code, so an efficient method for locating source code functionality quickly is essential. Topic modeling is one approach used to extract topics from source code. Many topic modeling approaches have been implemented using statistical techniques, which have several drawbacks: their results rely on informal code elements such as identifier names and comments. Our novel approach uses machine learning to address these issues, so that the predicted functionality depends only on the algorithm or the syntax of the source code. Three functionalities implemented in Java (prime number, Fibonacci number, and selection sort) were evaluated in this study. A Java parser library is used to derive the source code elements, and an algorithm builds a count matrix of the source code features. The resulting dataset was then fed to three models: Artificial Neural Network (ANN), Random Forest (RF), and an Ensemble Approach. The Ensemble Approach achieved 96.7% accuracy, surpassing both ANN and RF.
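The count-matrix step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it swaps the Java parser library for a simple regular-expression tokenizer, and the keyword list `JAVA_KEYWORDS`, the helper names, and the sample snippet `PRIME` are all assumptions made for illustration. The point it shows is that comments and string literals are discarded before counting, so the feature vector reflects only the keywords the code actually uses.

```python
import re
from collections import Counter

# Hypothetical, abbreviated feature list; the paper's actual keyword set is not given here.
JAVA_KEYWORDS = ["for", "while", "if", "else", "return",
                 "int", "boolean", "static", "void", "new"]
KEYWORD_SET = set(JAVA_KEYWORDS)

# Matches the parts of a Java file that are not code: comments and literals.
NON_CODE_RE = re.compile(
    r"""
    /\*.*?\*/               # block comment
    | //[^\n]*              # line comment
    | "(?:\\.|[^"\\\n])*"   # string literal
    | '(?:\\.|[^'\\\n])*'   # char literal
    """,
    re.DOTALL | re.VERBOSE,
)
TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def keyword_counts(source):
    """Return one row of the count matrix: occurrences of each keyword."""
    code_only = NON_CODE_RE.sub(" ", source)
    counts = Counter(t for t in TOKEN_RE.findall(code_only) if t in KEYWORD_SET)
    return [counts[k] for k in JAVA_KEYWORDS]


def count_matrix(sources):
    """Stack one keyword-count row per source file."""
    return [keyword_counts(s) for s in sources]


# Illustrative snippet (an assumption, not taken from the paper's dataset).
PRIME = """
// check if n is prime
static boolean isPrime(int n) {
    if (n < 2) return false;
    for (int i = 2; i * i <= n; i++) {
        if (n % i == 0) return false;
    }
    return true;
}
"""

if __name__ == "__main__":
    print(JAVA_KEYWORDS)
    print(keyword_counts(PRIME))  # [1, 0, 2, 0, 3, 2, 1, 1, 0, 0]
```

Each row produced this way, paired with a functionality label, would form one training example for a classifier such as the ANN or RF models named in the abstract. Note that the word "if" inside the comment is not counted, which is the property that separates this kind of feature from identifier- or comment-based topic models.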


Introduction

Source code is the core component of a software program. It is easily readable and understandable to a human being. Each programming language has its own syntax, a set of rules that governs how the language's characters, keywords, and algorithms are structured. The semantics of a programming language is hard to comprehend without its syntax. Many


Keywords

Artificial Neural Network (ANN), Random Forest (RF), Ensemble approach, Source code, Classification, Programming keywords