Offensive Language Detection using Machine Learning: test phase, SRS, design phase, and source code (final deliverable)
Category
Data Science/Machine Learning/Web Programming
Abstract / Introduction
Offensive language is the offense of using language in a manner that is likely to cause offense to a reasonable person in, near, or within hearing or sight of a public place. It consists of behavior intended to hurt a person’s feelings or to cause anger, resentment, or hatred.
We live in an age of technology in which most of us have easy access to the Internet. As Internet use has grown, the use of social media, especially for communication, has increased dramatically in recent years. But this advancement also opens the door to trolls who poison social media and forums with their abusive behavior toward others. Therefore, detecting abusive language online is becoming a major issue.
In this project, students will detect offensive language and measure accuracy by applying appropriate machine learning techniques (such as SVM, Decision Tree, and Random Forest) to offensive language comment data sets. Students will also compare the techniques to determine which is best for offensive language detection and why.
Functional Requirements:
The admin will perform all of the following functional requirement tasks.
- Data-Collection
- For this project, the student will collect data from any social media platform (such as YouTube, Facebook, Twitter, or Instagram) to detect offensive language. The data set must contain at least 2000 comments. A sample data set is shared at the link below for reference.
- Pre-Processing
- Most real-world data is incomplete and contains noisy and missing values, so the student has to pre-process the data. In pre-processing, the student will normalize the data set; handle stop words, missing values, noise, and outliers; and remove duplicate values.
- Feature Extraction
- After the pre-processing step, the student will apply a feature extraction method. The student can use Term Frequency – Inverse Document Frequency (TF-IDF), Uni-Grams (1-Grams), Bi-Grams (2-Grams), Tri-Grams (3-Grams), or general N-Grams.
- Train & Test Data
- Split data into 66% training and 34% testing data sets.
- Machine learning Techniques
- The student must use at least three classifiers/models (e.g. Naïve Bayes, Naïve Bayes Multinomial, Poly Kernel, RBF Kernel, Decision Tree, Random Tree, and Random Forest) from three different machine learning techniques/algorithms.
- Confusion Matrix
- Create a confusion matrix table to describe the performance of a classification model.
- Accuracy Evaluation
- Compute the accuracy of every technique and compare the results.
- The comparison should also show which machine learning technique detects offensive language best.
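The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not a required implementation. It assumes scikit-learn is available (the brief names only Anaconda, Jupyter Notebook, and Python), and the toy comments, the label encoding (1 = offensive, 0 = not offensive), and the helper names `normalize`, `preprocess`, and `evaluate` are all hypothetical. The three classifiers are just one possible choice from the list above, and noise and outlier handling is reduced to stripping URLs and non-letter characters.

```python
import re

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# Hypothetical toy data standing in for the collected comments
# (label 1 = offensive, 0 = not offensive). The real data set needs 2000+ rows.
RAW = [
    ("You are such an idiot!!!", 1),
    ("you are such an idiot", 1),  # duplicate once normalized
    ("Shut up, nobody likes you", 1),
    ("What a stupid, useless video", 1),
    ("Get lost you pathetic loser", 1),
    ("You are a worthless clown", 1),
    ("I hate you, stupid fool", 1),
    ("Go away, you dumb loser", 1),
    ("Great video, thanks for sharing", 0),
    ("I really enjoyed this tutorial", 0),
    ("Thanks, this was very helpful", 0),
    ("Nice explanation, keep it up", 0),
    ("Lovely song, great voice", 0),
    ("Helpful tutorial, thanks a lot", 0),
    ("Really nice video, enjoyed it", 0),
    (None, 0),  # missing value
]


def normalize(text):
    """Lowercase, drop URLs and non-letter characters, collapse whitespace."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    text = re.sub(r"[^a-z\s]", " ", text)
    return " ".join(text.split())


def preprocess(rows):
    """Remove missing and duplicate comments; return cleaned texts and labels."""
    seen, texts, labels = set(), [], []
    for text, label in rows:
        if not text:
            continue
        clean = normalize(text)
        if clean and clean not in seen:
            seen.add(clean)
            texts.append(clean)
            labels.append(label)
    return texts, labels


def evaluate(rows, seed=0):
    """Run the pipeline and return accuracy and confusion matrix per model."""
    texts, labels = preprocess(rows)
    # 66% training / 34% testing, stratified so both classes appear in each part.
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.34, stratify=labels, random_state=seed)
    # TF-IDF over uni-grams and bi-grams, with English stop words removed.
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
    train_vecs = vectorizer.fit_transform(x_train)  # fit on training data only
    test_vecs = vectorizer.transform(x_test)
    models = {
        "Naive Bayes Multinomial": MultinomialNB(),
        "SVM (RBF kernel)": SVC(kernel="rbf"),
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=seed),
    }
    results = {}
    for name, model in models.items():
        predicted = model.fit(train_vecs, y_train).predict(test_vecs)
        results[name] = {
            "accuracy": accuracy_score(y_test, predicted),
            # Rows are true labels, columns are predicted labels, in order [0, 1].
            "confusion_matrix": confusion_matrix(y_test, predicted, labels=[0, 1]),
        }
    return results


if __name__ == "__main__":
    for name, result in evaluate(RAW).items():
        print(f"{name}: accuracy = {result['accuracy']:.2f}")
        print(result["confusion_matrix"])
```

With a data set this small the printed accuracies mean little. The sketch only shows how the steps connect. Fitting the vectorizer on the training portion alone keeps vocabulary and IDF weights from leaking out of the test comments, which would otherwise make the comparison between classifiers look better than it is.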
Tools:
- Anaconda (Python distribution platform)
- Jupyter Notebook (Open source web application)
- Python (programming language)
- Machine Learning (Technique)
Prerequisite:
Artificial Intelligence, Machine Learning, and Natural Language Processing concepts.
Students will take a short course covering these concepts alongside the initial SRS and Design documentation, or consult the links below.
Helping Material
Machine Learning Techniques:
https://towardsdatascience.com/machine-learning-an-introduction-23b84d51e6d0
https://towardsdatascience.com/top-10-algorithms-for-machine-learning-beginners-149374935f3c
https://towardsdatascience.com/10-machine-learning-methods-that-every-data-scientistshould-know-3cc96e0eeee9
https://towardsdatascience.com/machine-learning-classifiers-a5cc4e1b0623
https://www.youtube.com/watch?v=fG4e4TUrJ3E
https://www.youtube.com/watch?v=7eh4d6sabA0
Feature Extraction Method:
https://towardsdatascience.com/feature-extraction-techniques-d619b56e31be
https://www.analyticsvidhya.com/blog/2021/04/guide-for-feature-extraction-techniques/
https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-onreal-world-dataset-796d339a4089
https://www.analyticsvidhya.com/blog/2021/07/feature-extraction-and-embeddings-in-nlpa-beginners-guide-to-understand-natural-language-processing/
http://uc-r.github.io/creating-text-features
Dataset:
https://drive.google.com/file/d/1Jq62ErAQiMpWfEz9_DwSkjmyYdmwWWu6/view?usp=sharing
Supervisor:
Name: Tayyab Waqar