Offensive Language Detection using Machine Learning Test phase, srs, design phase and source code final deliverable

Category

Data Science/Machine Learning/Web Programming

Abstract / Introduction

Offensive language is language used in a manner that is likely to offend a reasonable person in, near, or within hearing or sight of a public place. It includes behavior intended to hurt a person’s feelings or to cause anger, resentment, or hatred.

We live in an age of technology in which most of us have easy access to the Internet. As Internet use has grown, the use of social media, especially for communication, has increased dramatically in recent years. This growth also opens the door to trolls who poison social media and forums with abusive behavior toward others. Detecting abusive language online has therefore become a major issue.

In this project, students will detect offensive language and measure accuracy by applying appropriate machine learning techniques (such as SVM, Decision Tree, and Random Forest) to data sets of offensive-language comments. Students will also compare the techniques and explain which one is best for offensive language detection and why.

Functional Requirements:

The admin will perform all of the following functional requirement tasks.

  1. Data Collection
    • For this project, students will collect data from any social media platform (such as YouTube, Facebook, Twitter, or Instagram) to detect offensive language. The data set must contain at least 2000 comments. A sample data set is shared in the link below as a reference.
  2. Pre-Processing
    • Most real-world data is incomplete and contains noisy and missing values, so students have to pre-process the data. In pre-processing, students will normalize the data set, handle stop words, missing values, noise, and outliers, and remove duplicate values.
  3. Feature Extraction
    • After the pre-processing step, students will apply a feature extraction method. Students can use Term Frequency-Inverse Document Frequency (TF-IDF), Uni-Grams (1-Grams), Bi-Grams (2-Grams), Tri-Grams (3-Grams), or general N-Grams.
  4. Train & Test Data
    • Split data into 66% training and 34% testing data sets.
  5. Machine Learning Techniques
    • Students must use at least three classifiers/models (e.g. Naïve Bayes, Naïve Bayes Multinomial, Poly Kernel SVM, RBF Kernel SVM, Decision Tree, Random Tree, and Random Forest) from three different machine learning techniques/algorithms.
  6. Confusion Matrix
    • Create a confusion matrix table to describe the performance of a classification model.
  7. Accuracy Evaluation
    • Find the accuracy of all techniques and compare their accuracy.
    • The comparison should also show which machine learning technique is better at detecting offensive language.
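
Steps 2 to 7 above can be sketched roughly as follows. This is only a minimal sketch, not a required implementation: it assumes pandas and scikit-learn are installed, that the collected comments are in a table with the hypothetical column names `comment` and `label`, and that Naïve Bayes Multinomial, an RBF-kernel SVM, and Random Forest are the three chosen classifiers. Any other combination allowed by requirement 5 would fit the same structure.

```python
import re

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC


def clean_comment(text: str) -> str:
    """Lowercase, drop URLs and non-letter characters, collapse whitespace."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Step 2: handle missing values, normalize text, remove duplicates.

    The column names "comment" and "label" are assumptions for this sketch.
    """
    df = df.dropna(subset=["comment", "label"]).copy()
    df["comment"] = df["comment"].astype(str).map(clean_comment)
    df = df[df["comment"] != ""]  # drop comments that became empty
    return df.drop_duplicates(subset="comment").reset_index(drop=True)


def evaluate(df: pd.DataFrame, seed: int = 42) -> dict:
    """Steps 3-7: TF-IDF features, 66/34 split, three classifiers, metrics."""
    X_train, X_test, y_train, y_test = train_test_split(
        df["comment"], df["label"],
        test_size=0.34, stratify=df["label"], random_state=seed,
    )
    models = {
        "Naive Bayes Multinomial": MultinomialNB(),
        "SVM (RBF kernel)": SVC(kernel="rbf"),
        "Random Forest": RandomForestClassifier(random_state=seed),
    }
    results = {}
    for name, clf in models.items():
        # TF-IDF over uni-grams and bi-grams; English stop words are removed.
        # The vectorizer is fitted on the training split only.
        pipe = make_pipeline(
            TfidfVectorizer(stop_words="english", ngram_range=(1, 2)), clf
        )
        pipe.fit(X_train, y_train)
        pred = pipe.predict(X_test)
        results[name] = {
            "accuracy": accuracy_score(y_test, pred),
            "confusion_matrix": confusion_matrix(y_test, pred),
        }
    return results
```

With real data, one possible usage is `results = evaluate(preprocess(pd.read_csv("comments.csv")))`, where `comments.csv` is a hypothetical file name. Printing each entry of `results` then gives the confusion matrix and accuracy per classifier for the comparison in step 7.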

Tools:

  • Anaconda (Python distribution platform)
  • Jupyter Notebook (Open source web application)
  • Python (programming language)
  • Machine Learning (Technique)

Prerequisite:

Artificial Intelligence, Machine Learning, and Natural Language Processing concepts.

“Students will cover a short course on the concepts mentioned above, in addition to the initial SRS and design documentation, or consult the links below.”

Helping Material

Machine Learning Techniques:

https://towardsdatascience.com/machine-learning-an-introduction-23b84d51e6d0

https://towardsdatascience.com/top-10-algorithms-for-machine-learning-beginners-149374935f3c

https://towardsdatascience.com/10-machine-learning-methods-that-every-data-scientist-should-know-3cc96e0eeee9

https://towardsdatascience.com/machine-learning-classifiers-a5cc4e1b0623

https://www.youtube.com/watch?v=fG4e4TUrJ3E

https://www.youtube.com/watch?v=7eh4d6sabA0

Feature Extraction Method:

https://towardsdatascience.com/feature-extraction-techniques-d619b56e31be

https://www.analyticsvidhya.com/blog/2021/04/guide-for-feature-extraction-techniques/

https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089

https://www.analyticsvidhya.com/blog/2021/07/feature-extraction-and-embeddings-in-nlp-a-beginners-guide-to-understand-natural-language-processing/

http://uc-r.github.io/creating-text-features

Dataset:

https://drive.google.com/file/d/1Jq62ErAQiMpWfEz9_DwSkjmyYdmwWWu6/view?usp=sharing

Supervisor:

Name: Tayyab Waqar

 
