Abusive Language Detection using Machine Learning: SRS, Design Phase, Test Phase, and Source Code (Final Deliverable)

Category

Data Science/Machine Learning/Web Programming

Abstract / Introduction

Abusive language is any expression that contains abusive or obscene words. With the rise of social media, millions of comments are posted every day, and the use of offensive language in user comments has increased rapidly. Abusive language in online comments can lead to cyberbullying that targets individuals (such as celebrities and politicians) and groups of people (defined by, for example, country, age, or religion). It is therefore important to detect abusive language in online comments automatically. The admin (student) will develop a system that detects abusive language and measures detection accuracy by applying appropriate machine learning techniques (such as Support Vector Machine, Naïve Bayes, Decision Tree, and Random Forest) to abusive-language comment datasets. The system will also compare the techniques to determine which is best for detecting abusive language and why.

Functional Requirements:

The admin (student) will perform all of the following functional requirements.

Data-Collection
- Collect comments from any social media platform (such as Facebook, Twitter, Instagram, or YouTube) for abusive language detection. The dataset must contain at least 4000 comments. A sample dataset is shared in the link below for reference.
Pre-processing
- Most real-world data is incomplete and contains noisy and missing values, so pre-processing must be applied first. In this step, the admin will normalize the dataset, remove stop words, handle missing values, noise, and outliers, and remove duplicate entries.
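
The pre-processing step could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not a prescribed implementation: the names `STOP_WORDS`, `clean_comment`, and `preprocess` are hypothetical, the stop-word list is deliberately tiny, and the input is assumed to be (comment, label) pairs with label 1 meaning abusive. Outlier handling (for example, dropping extremely long comments) is left out.

```python
import re

# Tiny illustrative stop-word list (assumption); a real project would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "you"}

def clean_comment(text):
    """Normalize one comment: lowercase, strip noise, drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+|@\w+", " ", text)  # remove URLs and @mentions
    text = re.sub(r"[^a-z\s]", " ", text)           # keep letters and whitespace only
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

def preprocess(rows):
    """rows: iterable of (comment, label) pairs; returns cleaned, de-duplicated pairs."""
    seen = set()
    cleaned = []
    for comment, label in rows:
        if comment is None or label is None:        # skip missing values
            continue
        text = clean_comment(comment)
        if not text or text in seen:                # skip empty or duplicate comments
            continue
        seen.add(text)
        cleaned.append((text, label))
    return cleaned

sample = [
    ("You are an IDIOT!!! http://spam.example", 1),
    ("you are an idiot", 1),                        # duplicate after normalization
    (None, 0),                                      # missing comment
    ("Thanks for the great video @creator", 0),
]
print(preprocess(sample))
# [('idiot', 1), ('thanks for great video', 0)]
```

Duplicates are detected after normalization, so comments that differ only in case, punctuation, or links collapse into a single entry.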
Feature Extraction
- Apply a feature extraction method: Term Frequency-Inverse Document Frequency (TF-IDF), unigrams (1-grams), bigrams (2-grams), trigrams (3-grams), or general N-grams.
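
To make the feature extraction options concrete, the sketch below builds N-grams and computes a plain TF-IDF weight for each term. It assumes whitespace-tokenized, already cleaned comments and uses the textbook form tf × log(N / df); library implementations usually apply smoothed variants, so their numbers will differ. The function names `ngrams` and `tfidf` are illustrative.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf(docs, n=1):
    """Return one {term: weight} dict per document.

    tf  = occurrences of the term in the document / terms in the document
    idf = log(number of documents / documents containing the term)
    """
    term_lists = [ngrams(doc.split(), n) for doc in docs]
    df = Counter(term for terms in term_lists for term in set(terms))
    n_docs = len(docs)
    vectors = []
    for terms in term_lists:
        counts = Counter(terms)
        total = len(terms) or 1                     # avoid division by zero
        vectors.append({t: (c / total) * math.log(n_docs / df[t])
                        for t, c in counts.items()})
    return vectors

docs = ["you are stupid", "you are kind", "nice video"]
print(ngrams(docs[0].split(), 2))   # ['you are', 'are stupid']
unigram_vectors = tfidf(docs, n=1)
bigram_vectors = tfidf(docs, n=2)
print(unigram_vectors[0])
```

In the first document, "stupid" receives a higher weight than "you" because it appears in only one of the three comments, which is the property that makes TF-IDF useful for spotting distinctive abusive terms.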
Train & Test Data
- Split data into 70% training and 30% testing data sets.
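
A 70/30 split can be sketched with the standard library alone. The helper name `split_70_30` and the fixed seed are assumptions made for reproducibility; the data here is synthetic and only demonstrates the proportions.

```python
import random

def split_70_30(rows, seed=42):
    """Shuffle rows reproducibly, then return (train, test) in a 70/30 ratio."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = len(rows) * 7 // 10                       # integer arithmetic avoids float rounding
    return rows[:cut], rows[cut:]

# Synthetic placeholder data: (comment, label) pairs.
data = [(f"comment {i}", i % 2) for i in range(10)]
train, test = split_70_30(data)
print(len(train), len(test))   # 7 3
```

With the minimum dataset size of 4000 comments, this yields 2800 training and 1200 testing comments. Shuffling before the split matters when the collected data is ordered, for example by date or by label.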
Machine Learning Techniques
- Apply at least three classifiers/models (e.g. Naïve Bayes, Multinomial Naïve Bayes, SVM with a polynomial kernel, SVM with an RBF kernel, Decision Tree, Random Tree, or Random Forest), drawn from three different machine learning techniques/algorithms.
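
As an illustration of one of the listed classifiers, here is a minimal from-scratch Multinomial Naïve Bayes with add-one (Laplace) smoothing over raw word counts. It is a teaching sketch under simplifying assumptions (whitespace tokens, words outside the training vocabulary ignored); the class name `TinyMultinomialNB` is hypothetical. In practice a library such as scikit-learn would typically supply this model along with SVM and tree-based classifiers.

```python
import math
from collections import Counter, defaultdict

class TinyMultinomialNB:
    """Minimal Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)          # documents per class
        self.word_counts = defaultdict(Counter)      # word frequencies per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict_one(self, text):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(doc_count / n_docs)     # log prior
            for word in text.split():
                if word not in self.vocab:
                    continue                         # ignore words never seen in training
                count = self.word_counts[label][word]
                # Smoothed log likelihood of the word given the class.
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    def predict(self, texts):
        return [self.predict_one(t) for t in texts]

# Toy training data (1 = abusive, 0 = not abusive), for illustration only.
train_texts = ["you idiot", "stupid idiot loser", "great video", "thanks nice work"]
train_labels = [1, 1, 0, 0]
model = TinyMultinomialNB().fit(train_texts, train_labels)
print(model.predict(["what an idiot", "nice video"]))   # [1, 0]
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the smoothing term keeps a single unseen word-class pair from zeroing out a class.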
Confusion Matrix
- Create a confusion matrix table to describe the performance of a classification model.
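
For a binary abusive / not-abusive task, the confusion matrix reduces to four counts. The sketch below assumes labels are encoded as 1 (abusive) and 0 (not abusive); the function names and the example label lists are made up for illustration and are not results from any model.

```python
def confusion_matrix(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    cm = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            cm["tp" if truth == positive else "fp"] += 1
        else:
            cm["fn" if truth == positive else "tn"] += 1
    return cm

def format_matrix(cm):
    """Render the counts as a small text table (rows: actual, columns: predicted)."""
    return (
        "                 pred abusive  pred clean\n"
        f"actual abusive   {cm['tp']:>12}  {cm['fn']:>10}\n"
        f"actual clean     {cm['fp']:>12}  {cm['tn']:>10}"
    )

# Dummy labels, only to exercise the code.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
cm = confusion_matrix(y_true, y_pred)
print(format_matrix(cm))
```

Here the false negatives are abusive comments the model missed and the false positives are clean comments wrongly flagged; the two kinds of error usually carry different costs in a moderation setting.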
Accuracy Evaluation
- Compute the accuracy of each technique and compare the results.
- The project will also identify which machine learning technique is best at detecting abusive language.
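
The comparison step can be sketched as below. Accuracy is the fraction of test comments classified correctly, i.e. (TP + TN) / total. The prediction lists are dummy placeholders under neutral names (`model_a`, `model_b`, `model_c`) purely to exercise the code; they do not represent results of any real classifier.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]

# Placeholder predictions (assumption), one list per trained model.
predictions = {
    "model_a": [1, 0, 1, 0, 0, 0, 1, 0, 1, 1],
    "model_b": [1, 0, 1, 1, 0, 1, 1, 0, 1, 0],
    "model_c": [0, 0, 1, 1, 1, 0, 0, 0, 1, 0],
}

scores = {name: accuracy(y_true, y_pred) for name, y_pred in predictions.items()}
best = max(scores, key=scores.get)

# Print models from most to least accurate.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.2%}")
print("best:", best)
```

All models should be scored on the same 30% test set so that the accuracies are directly comparable. If abusive comments are rare in the dataset, accuracy alone can be misleading, so reading it alongside the confusion matrix helps explain why one technique performs better.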