NLP-Based Duplicate Bug Report Detection Using Supervised Machine Learning Algorithms (Test Phase, SRS, Design Phase and Source Code: Final Deliverable)

Project Domain / Category

AI / Machine Learning / Prototype-Based

Abstract / Introduction

A bug report is a technical document that contains all the necessary information about the bug and the conditions under which it can be reproduced. It is a guide for the developers and the team engaged in fixing the bug. Bug reports are the primary means through which developers triage and fix bugs. To achieve this effectively, bug reports need to clearly describe those features that are important for the developers. However, previous studies have found that reporters do not always provide such features.

Our objective in this project is to classify such bug reports using machine learning models on the given dataset. Natural language processing (NLP) is the ability of a computer program to understand human language, referred to as natural language, as it is spoken and written. NLP uses artificial intelligence to take real-world input, process it, and make sense of it in a way a computer can understand. NLP performs data preprocessing (tokenization, stop word removal, lemmatization, etc.), which involves cleaning textual data so that a machine can analyze it. To classify the duplicate bug reports, we use machine learning algorithms such as Naïve Bayes, Support Vector Machine, and Random Forest.
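The preprocessing steps named above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the stop-word list is a tiny hand-picked set, the function names are hypothetical, and `crude_normalize` is a simple suffix stripper standing in for true lemmatization (a real project would more likely rely on a library such as NLTK or spaCy for both).

```python
import re

# Tiny illustrative stop-word list (an assumption); a real project would
# normally take a fuller list from an NLP library.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "it", "in", "on",
              "to", "of", "when", "i", "was"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_normalize(token):
    """Strip one common suffix if at least three characters remain.

    This is a crude stand-in for lemmatization, not a real lemmatizer.
    """
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stop words, then normalize each remaining token."""
    return [crude_normalize(t) for t in remove_stop_words(tokenize(text))]


print(preprocess("Thunderbird crashes when opening the attachments"))
```

The resulting token lists are what the corpus would be built from before the text is handed to a classifier.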

Pre-Requisites:

This project is easy and interesting but requires in-depth study of machine learning and natural language processing techniques. The following links may help you better understand:

Text Classification Tutorial: https://www.youtube.com/watch?v=sm0NoO5aYC0

Dataset: https://github.com/logpai/bugrepo/tree/master/Thunderbird

Functional Requirements:

The following are the functional requirements of the project:

The system must set up the environment, online or offline (if required).
The system must apply different data preprocessing techniques (tokenization, stop word removal, lemmatization, etc.).
The system must build a corpus.
The system must split the given dataset into training and testing sets.
The system must train the specified models.
The user must evaluate the mentioned models in terms of confusion matrix, accuracy, precision, and recall.
The user must discuss the results of the given algorithms (Naïve Bayes, Support Vector Machine, Random Forest).
The user must retrain the model if accuracy is not good (less than 60%) by changing different training parameters (if required).
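The split, train, evaluate, and retrain requirements can be sketched end to end as follows. This is a minimal illustration, not the prescribed implementation: it uses a small multinomial Naïve Bayes written from scratch with Laplace smoothing, whereas a real project would more likely use a library such as scikit-learn, which also provides Support Vector Machine and Random Forest. The toy samples, the labels "duplicate" and "unique", and all function names are assumptions made for the example and do not come from the Thunderbird dataset.

```python
import math
import random
from collections import Counter, defaultdict


def split_dataset(samples, test_ratio=0.25, seed=0):
    """Shuffle (tokens, label) samples and return (train, test) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


def train_naive_bayes(train):
    """Count labels and per-label token frequencies."""
    label_counts = Counter(label for _, label in train)
    token_counts = defaultdict(Counter)
    for tokens, label in train:
        token_counts[label].update(tokens)
    vocab = {tok for counts in token_counts.values() for tok in counts}
    return label_counts, token_counts, vocab


def predict(model, tokens):
    """Return the label with the highest log-posterior (Laplace smoothing)."""
    label_counts, token_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)
        denom = sum(token_counts[label].values()) + len(vocab)
        for tok in tokens:
            if tok in vocab:  # tokens never seen in training are ignored
                score += math.log((token_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def evaluate(y_true, y_pred, positive="duplicate"):
    """Compute the confusion matrix, accuracy, precision, and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return {
        # Rows are actual [positive, negative]; columns are predicted.
        "confusion_matrix": [[tp, fn], [fp, tn]],
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


def needs_retraining(accuracy, threshold=0.60):
    """Flag a model whose accuracy falls below the 60% threshold."""
    return accuracy < threshold


# Hypothetical, already preprocessed toy samples (not real dataset records).
samples = [
    (["crash", "open", "attachment"], "duplicate"),
    (["crash", "attachment", "download"], "duplicate"),
    (["crash", "open", "mail"], "duplicate"),
    (["crash", "mail", "attachment"], "duplicate"),
    (["font", "render", "wrong"], "unique"),
    (["calendar", "sync", "fail"], "unique"),
    (["theme", "color", "wrong"], "unique"),
    (["search", "filter", "slow"], "unique"),
]

train, test = split_dataset(samples)
model = train_naive_bayes(train)
predictions = [predict(model, tokens) for tokens, _ in test]
report = evaluate([label for _, label in test], predictions)
print(report, "retrain:", needs_retraining(report["accuracy"]))
```

With a dataset this small the scores carry no meaning; the point is only the order of the steps. On real data, the same evaluation would be repeated for each of the three algorithms so that their results can be compared and discussed.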

…………

Tools:

Language: Python (Python only)
IDE: Jupyter Notebook, Google Colab, PyCharm, Spyder, etc.

Supervisor:

Name: Sadeem Ahmad Nafees
