Reasonant/Binary-Classification-of-movie-reviews-based-on-Sentiment

NLP Project - DLBAIPNLP01

This is the code repository to support the project report for module: DLBAIPNLP01 – Project: NLP at I.U. International University of Applied Sciences.

Overview

In this project, we perform binary sentiment analysis on the Stanford Large Movie Review Dataset. We compare two preprocessing methodologies and train three classifiers: two linear models (LogisticRegression and LinearSVC) and one probabilistic model (MultinomialNB).

Repository Contents (after running the notebooks)

aclImdb

This folder contains the dataset. It has two subfolders, 'train' and 'test', each of which holds the samples split into 'pos' and 'neg' subfolders. The folder is created when you run Preprocessing.ipynb.
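The folder layout above can be read with a few lines of Python. This is a minimal sketch, not the notebook's own loader: the function name `load_split`, the one-review-per-`.txt`-file layout, and the label encoding (0 for 'neg', 1 for 'pos') are assumptions made for illustration.

```python
from pathlib import Path


def load_split(root, split):
    """Read one split ('train' or 'test') into parallel lists of texts and labels."""
    texts, labels = [], []
    # Assumed label encoding: 0 = negative review, 1 = positive review.
    for label_name, label in (("neg", 0), ("pos", 1)):
        folder = Path(root) / split / label_name
        # Sort the files so the order is reproducible between runs.
        for path in sorted(folder.glob("*.txt")):
            texts.append(path.read_text(encoding="utf-8"))
            labels.append(label)
    return texts, labels
```

Once the dataset has been extracted, `load_split("aclImdb", "train")` would return the training reviews and their labels.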

mlruns

This folder contains the MLflow runs recorded while running Train_models.ipynb.

models

The trained models are saved here.

plots

Contains the model calibration plots produced by Checks.ipynb.

processed_datasets

The preprocessed datasets are saved here for faster access.

results_from_checks

The manual error analysis results are saved here.

Checks.ipynb

This notebook is used to perform model calibration checks and manual error analysis.
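To illustrate what a calibration check computes, the sketch below bins predicted probabilities and compares the mean prediction in each bin with the observed fraction of positive labels. It is a plain-Python illustration under assumptions: the function name `reliability_bins` and the equal-width binning are hypothetical, and the README does not say how Checks.ipynb builds its plots. Note that LinearSVC produces decision scores rather than probabilities, so its outputs would first need to be mapped to the range [0, 1].

```python
def reliability_bins(y_true, y_prob, n_bins=5):
    """Return (mean predicted probability, observed positive fraction, count)
    for every non-empty equal-width bin over [0, 1]."""
    bins = [[] for _ in range(n_bins)]
    for label, prob in zip(y_true, y_prob):
        # Clamp the index so that prob == 1.0 lands in the last bin.
        index = min(int(prob * n_bins), n_bins - 1)
        bins[index].append((label, prob))
    rows = []
    for members in bins:
        if not members:
            continue  # skip bins that received no predictions
        mean_prob = sum(p for _, p in members) / len(members)
        frac_pos = sum(l for l, _ in members) / len(members)
        rows.append((mean_prob, frac_pos, len(members)))
    return rows
```

For a well-calibrated model, the first two values in each row are close to each other; plotting one against the other gives a calibration curve.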

Preprocessing.ipynb

This notebook preprocesses the data, saves the results, and shows a rough statistical comparison of the two preprocessing methodologies.

Text_Aspects_Statistics.ipynb

This notebook is used to perform an initial statistical analysis on the data, to decide upon the preprocessing methodology.

Train_models.ipynb

This notebook trains the models. It uses GridSearchCV to find the optimal model parameters and MLflow to record the experiments.
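A grid search over the three classifiers might look like the sketch below. It rests on several assumptions: the TF-IDF vectorizer, the parameter grids, and the function name `search_all` are illustrative and not taken from the notebook, and MLflow logging is left out to keep the example short.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical parameter grids; the notebook's actual grids are not listed here.
CANDIDATES = {
    "LogisticRegression": (LogisticRegression(max_iter=1000), {"clf__C": [0.1, 1.0, 10.0]}),
    "LinearSVC": (LinearSVC(), {"clf__C": [0.1, 1.0, 10.0]}),
    "MultinomialNB": (MultinomialNB(), {"clf__alpha": [0.1, 1.0]}),
}


def search_all(texts, labels, cv=3):
    """Run one cross-validated grid search per classifier; return them by name."""
    results = {}
    for name, (estimator, grid) in CANDIDATES.items():
        # Assumed feature extraction: TF-IDF over the raw review text.
        pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", estimator)])
        search = GridSearchCV(pipeline, grid, cv=cv, scoring="accuracy")
        search.fit(texts, labels)
        results[name] = search
    return results
```

After fitting, each entry exposes `best_params_` and `best_score_`, which are the kind of values one would typically record in an experiment tracker.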

Dataset

The dataset is the Stanford Large Movie Review Dataset which can be found here: https://ai.stanford.edu/~amaas/data/sentiment/

Getting Started

Start by running the Preprocessing.ipynb notebook, because it downloads the dataset. The code in its Load Data section downloads the dataset archive and extracts it, creating the folder aclImdb.
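The download-and-extract step can be sketched as follows. This is not the notebook's code: the function name `fetch_dataset`, the archive file name, and the direct archive URL are assumptions, so check the dataset page for the current link before relying on it.

```python
import tarfile
import urllib.request
from pathlib import Path

# Assumed archive location; verify it on the dataset page.
DATASET_URL = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"


def fetch_dataset(url=DATASET_URL, archive="aclImdb_v1.tar.gz", target_dir="."):
    """Download the archive if it is missing, then extract it unless
    the aclImdb folder already exists. Returns the dataset folder path."""
    archive_path = Path(archive)
    dataset_dir = Path(target_dir) / "aclImdb"
    if dataset_dir.exists():
        return dataset_dir  # already extracted, nothing to do
    if not archive_path.exists():
        urllib.request.urlretrieve(url, str(archive_path))
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(target_dir)
    return dataset_dir
```

Skipping the work when aclImdb already exists means the function can be called again without repeating the download.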

After that you can run the rest of the Preprocessing.ipynb or the Text_Aspects_Statistics.ipynb.

  • Before running Train_models.ipynb, run all the cells in Preprocessing.ipynb, at least until it creates the folder processed_datasets. The preprocessed data saved there are used for model training.

  • Before running Checks.ipynb, train the models by running Train_models.ipynb. This creates the mlruns and models directories, from which the trained models are loaded for the model calibration checks and the manual error analysis.

  • The Checks.ipynb creates the plots and results_from_checks directories.
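The run order above can be checked with a small helper that reports which notebooks have their prerequisite directories in place. This is a hypothetical convenience script, not part of the repository; the `REQUIRES` mapping only encodes the dependencies stated in this README, and a notebook may need more than is listed.

```python
from pathlib import Path

# Directories each notebook needs before it can run, as described in this README.
REQUIRES = {
    "Preprocessing.ipynb": [],
    "Text_Aspects_Statistics.ipynb": ["aclImdb"],
    "Train_models.ipynb": ["processed_datasets"],
    "Checks.ipynb": ["mlruns", "models"],
}


def runnable_notebooks(repo_root="."):
    """List the notebooks whose required directories already exist."""
    root = Path(repo_root)
    return [
        name
        for name, needed in REQUIRES.items()
        if all((root / directory).is_dir() for directory in needed)
    ]
```

In a fresh clone, only Preprocessing.ipynb is reported, which matches the starting point described above.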

Acknowledgments

  • I.U. International University of Applied Sciences
