Offensive Language Detection using Machine Learning: Test Phase, SRS, Design Phase, and Source Code (Final Deliverable)

Category

Data Science/Machine Learning/Web Programming

Abstract / Introduction

Offensive language is language used in a manner that is likely to cause offense to a reasonable person in, near, or within hearing or sight of a public place. It includes behavior intended to hurt a person’s feelings or to cause anger, resentment, or hatred.

We live in an age of technology in which most of us have easy access to the Internet. As Internet use has grown, the use of social media, especially for communication, has increased dramatically in recent years. This growth also opens the door to trolls who poison social media and forums with abusive behavior toward others. Detecting abusive language online is therefore becoming a major concern.

In this project, students will detect offensive language and measure accuracy by applying appropriate machine learning techniques (such as SVM, Decision Tree, and Random Forest) to offensive language comment data sets. Students will also compare the techniques and explain which one is best for offensive language detection and why.

Functional Requirements:

The admin will perform all of the following functional requirement tasks.

Data Collection
- For this project, students will collect data from any social media platform (such as YouTube, Facebook, Twitter, or Instagram) to detect offensive language. The data set must contain at least 2000 comments. A sample data set is shared in the link below for reference.
Pre-Processing
- Most real-world data is incomplete and contains noisy or missing values, so students must pre-process the data. In pre-processing, students will normalize the data set, remove stop words, handle missing values, noise, and outliers, and remove duplicate values.
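The pre-processing steps above can be sketched as follows. This is a minimal illustration rather than a prescribed implementation: the stop-word list, the regular expressions, the assumption that comments are in English, and the names `clean_comment` and `preprocess` are all assumptions made for this example, and the sample rows are made up.

```python
import re

# Tiny illustrative stop-word list (an assumption; use a fuller list in practice).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "you", "i"}

def clean_comment(text):
    """Normalize one comment: lowercase, strip noise, drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove links
    text = re.sub(r"@\w+", " ", text)                    # remove user mentions
    text = re.sub(r"[^a-z\s]", " ", text)                # remove digits, punctuation, emoji
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

def preprocess(rows):
    """rows: iterable of (comment, label) pairs; returns cleaned, de-duplicated pairs."""
    seen = set()
    cleaned = []
    for comment, label in rows:
        if comment is None or label is None:  # skip rows with missing values
            continue
        text = clean_comment(comment)
        if not text or text in seen:          # skip empty results and duplicates
            continue
        seen.add(text)
        cleaned.append((text, label))
    return cleaned

# Made-up sample rows: label 1 = offensive, 0 = not offensive.
raw = [
    ("You are an IDIOT!!! http://spam.example", 1),
    ("you are an idiot", 1),            # duplicate after cleaning
    (None, 0),                          # missing comment
    ("Great video, thanks @creator", 0),
    ("!!!", 0),                         # pure noise, empty after cleaning
]
result = preprocess(raw)
print(result)
```

Outlier handling (for example, dropping extremely long comments) is not shown here and would depend on the collected data.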
Feature Extraction
- After the pre-processing step, students will apply a feature extraction method. Students can use Term Frequency-Inverse Document Frequency (TF-IDF), unigrams (1-grams), bigrams (2-grams), trigrams (3-grams), or general N-grams as the feature extraction method.
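To make the feature extraction options concrete, here is a small pure-Python sketch of N-gram generation and TF-IDF weighting. It assumes one common TF-IDF variant (term frequency as count divided by document length, inverse document frequency as log(N / df)); libraries often use smoothed variants, so their numbers will differ. The function names `ngrams` and `tfidf` and the three sample documents are made up for illustration.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the n-grams of a token list, each joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf(docs, n_values=(1, 2)):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / terms in document, idf = log(N / df).
    """
    term_lists = []
    for doc in docs:
        tokens = doc.split()
        terms = []
        for n in n_values:  # e.g. unigrams and bigrams together
            terms.extend(ngrams(tokens, n))
        term_lists.append(terms)

    n_docs = len(docs)
    df = Counter()  # number of documents containing each term
    for terms in term_lists:
        df.update(set(terms))

    vectors = []
    for terms in term_lists:
        counts = Counter(terms)
        total = len(terms)
        vectors.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vectors

docs = ["you are stupid", "you are kind", "stupid stupid video"]
vectors = tfidf(docs)
for doc, vec in zip(docs, vectors):
    print(doc, "->", {t: round(w, 3) for t, w in vec.items()})
```

Terms that appear in every document receive a weight of zero under this variant, which is why rarer terms such as "kind" score higher than "you".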
Train & Test Data
- Split data into 66% training and 34% testing data sets.
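A 66%/34% split can be sketched with the standard library alone. The helper name `split_66_34`, the fixed seed, and the placeholder rows are assumptions for this example; the split is a plain random shuffle and does not preserve class proportions.

```python
import random

def split_66_34(rows, seed=42):
    """Shuffle rows reproducibly and split them into 66% train, 34% test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed so the split is repeatable
    cut = round(len(rows) * 0.66)
    return rows[:cut], rows[cut:]

# Placeholder data standing in for 2000 collected (comment, label) pairs.
data = [(f"comment {i}", i % 2) for i in range(2000)]
train, test = split_66_34(data)
print(len(train), len(test))  # 1320 680
```

With the minimum of 2000 comments, this gives 1320 training and 680 testing comments.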
Machine Learning Techniques
- Students must use at least three classifiers/models (e.g., Naïve Bayes, Naïve Bayes Multinomial, SVM with a Poly Kernel or RBF Kernel, Decision Tree, Random Tree, and Random Forest) from three different machine learning techniques/algorithms.
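One possible way to train three classifiers from three different families is sketched below. It assumes scikit-learn is installed and uses its `MultinomialNB`, `SVC`, and `RandomForestClassifier` classes with TF-IDF features over unigrams and bigrams; the project does not mandate this library, and the eight training comments are made-up toy data, far smaller than the required data set.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Made-up toy training data: label 1 = offensive, 0 = not offensive.
train_texts = [
    "you are an idiot", "shut up you stupid fool", "what a dumb loser",
    "nobody likes you idiot", "great video thanks", "i love this song",
    "very helpful tutorial", "nice work keep it up",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# One model per family: Bayesian, kernel-based SVM, and tree ensemble.
models = {
    "Naive Bayes Multinomial": MultinomialNB(),
    "SVM (RBF kernel)": SVC(kernel="rbf"),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

fitted = {}
for name, clf in models.items():
    # Each pipeline extracts TF-IDF features, then fits the classifier.
    pipe = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
    pipe.fit(train_texts, train_labels)
    fitted[name] = pipe

for name, pipe in fitted.items():
    print(name, pipe.predict(["you stupid idiot"]))
```

Wrapping the vectorizer and classifier in one pipeline keeps the feature vocabulary tied to the training data, so the test comments are transformed with exactly the same features.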
Confusion Matrix
- Create a confusion matrix table to describe the performance of a classification model.
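For a two-class problem (offensive vs. not offensive), the confusion matrix has four cells. The sketch below counts them by hand; the function name `confusion_matrix`, the label convention (1 = offensive), and the example label lists are assumptions made for illustration only.

```python
def confusion_matrix(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive and predicted == positive:
            tp += 1  # offensive, predicted offensive
        elif actual != positive and predicted == positive:
            fp += 1  # not offensive, predicted offensive
        elif actual == positive:
            fn += 1  # offensive, predicted not offensive
        else:
            tn += 1  # not offensive, predicted not offensive
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

# Made-up labels: 1 = offensive, 0 = not offensive.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
cm = confusion_matrix(y_true, y_pred)
print(cm)  # {'TP': 3, 'FP': 1, 'FN': 1, 'TN': 3}

# Print as a table: rows are actual classes, columns are predicted classes.
print("actual \\ predicted   offensive   not offensive")
print(f"offensive            {cm['TP']:>9}   {cm['FN']:>13}")
print(f"not offensive        {cm['FP']:>9}   {cm['TN']:>13}")
```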
Accuracy Evaluation
- Find the accuracy of every technique and compare the results.
- The project should also identify which machine learning technique is best at detecting offensive language.
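Accuracy is the fraction of test comments classified correctly, and the comparison amounts to ranking the models by that value. The sketch below shows only the mechanics: the prediction lists are hypothetical placeholders, not results from any real model or data set, and the model names are just labels.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for actual, predicted in zip(y_true, y_pred) if actual == predicted)
    return correct / len(y_true)

# Hypothetical test labels and per-model predictions (placeholders, not real results).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
predictions = {
    "Naive Bayes":      [1, 0, 1, 0, 0, 0, 1, 0, 1, 1],
    "SVM (RBF kernel)": [1, 0, 1, 1, 0, 0, 1, 0, 1, 1],
    "Random Forest":    [1, 0, 0, 0, 0, 0, 1, 0, 1, 1],
}

scores = {name: accuracy(y_true, y_pred) for name, y_pred in predictions.items()}
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.0%}")

best = max(scores, key=scores.get)
print("Highest accuracy on this placeholder data:", best)
```

Accuracy alone can be misleading when offensive comments are much rarer than non-offensive ones, so the confusion matrix should be read alongside it when explaining why one technique performs better.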