NLP Project - DLBAIPNLP01

This is the code repository to support the project report for module: DLBAIPNLP01 – Project: NLP at I.U. International University of Applied Sciences.

Overview

In this project, we perform sentiment analysis on the Large Movie Review Dataset. We use two preprocessing methodologies and train two linear classifiers (LogisticRegression and LinearSVC) and one probabilistic classifier (MultinomialNB).

Repository Contents (after running the notebooks)

aclImdb

This folder contains the dataset. It has two subfolders, 'train' and 'test', each of which contains samples separated into 'pos' and 'neg'. The folder is created when you run Preprocessing.ipynb.
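The folder layout described above can be read with a few lines of standard-library Python. This is a minimal sketch, not the code from Preprocessing.ipynb: the function name `load_split`, the 0/1 label encoding, and the assumption that each review is a UTF-8 `.txt` file are illustrative choices.

```python
from pathlib import Path


def load_split(root, split):
    """Read one split ('train' or 'test') of the aclImdb folder.

    Returns (texts, labels), where label 1 means 'pos' and 0 means 'neg'.
    The function name and label encoding are assumptions for illustration.
    """
    texts, labels = [], []
    for label_name, label in (("neg", 0), ("pos", 1)):
        folder = Path(root) / split / label_name
        # Sort so the order is reproducible across runs and platforms.
        for path in sorted(folder.glob("*.txt")):
            texts.append(path.read_text(encoding="utf-8"))
            labels.append(label)
    return texts, labels
```

For example, `texts, labels = load_split("aclImdb", "train")` would return the training reviews with the negative samples first.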

mlruns

This folder contains the MLflow runs recorded when running Train_models.ipynb.

models

Here the trained models will be saved.

plots

Contains the model calibration plots produced by Checks.ipynb.

processed_datasets

The preprocessed datasets are saved here for faster access.

results_from_checks

The manual error analysis results are saved here.

Checks.ipynb

This notebook is used to perform model calibration checks and manual error analysis.
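To make the idea of a calibration check concrete, the sketch below computes the numbers behind a reliability (calibration) curve in plain Python. It is not the code used in Checks.ipynb; the function name `reliability_bins` and the equal-width binning are assumptions, and the inputs are assumed to be 0/1 labels and predicted positive-class probabilities.

```python
def reliability_bins(y_true, y_prob, n_bins=10):
    """Group predictions into equal-width probability bins.

    Returns a list of (mean_predicted_prob, fraction_positive, count)
    for every non-empty bin. For a well-calibrated model the first two
    values are close to each other in each bin.
    """
    sums = [[0.0, 0, 0] for _ in range(n_bins)]  # prob sum, positives, count
    for label, prob in zip(y_true, y_prob):
        # Clamp so that prob == 1.0 falls into the last bin.
        index = min(int(prob * n_bins), n_bins - 1)
        sums[index][0] += prob
        sums[index][1] += label
        sums[index][2] += 1
    return [
        (prob_sum / count, positives / count, count)
        for prob_sum, positives, count in sums
        if count > 0
    ]
```

Plotting the mean predicted probability against the fraction of positives per bin gives the calibration plot; points near the diagonal indicate good calibration.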

Preprocessing.ipynb

This notebook preprocesses the data, saves the results, and shows a rough statistical comparison of the two preprocessing methodologies.

Text_Aspects_Statistics.ipynb

This notebook is used to perform an initial statistical analysis on the data, to decide upon the preprocessing methodology.

Train_models.ipynb

This notebook trains the models using MLflow and GridSearchCV, to record the experiments and find the optimal model parameters.
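As a rough illustration of how GridSearchCV and MLflow can be combined, the sketch below tunes one of the classifiers and logs the best result. It is a hedged example, not the notebook's code: the TF-IDF features, the parameter grid, the function names, and the metric name `best_cv_accuracy` are all assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


def grid_search_logreg(texts, labels, cv=5):
    """Tune a TF-IDF + LogisticRegression pipeline with GridSearchCV."""
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer()),  # assumed feature extraction
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    # Illustrative grid only; the notebook's actual grid may differ.
    param_grid = {"clf__C": [0.1, 1.0, 10.0]}
    search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
    search.fit(texts, labels)
    return search


def log_to_mlflow(search, run_name="logreg_grid_search"):
    """Record the best parameters and CV score as one MLflow run."""
    import mlflow  # imported here so the search itself works without MLflow

    with mlflow.start_run(run_name=run_name):
        mlflow.log_params(search.best_params_)
        mlflow.log_metric("best_cv_accuracy", search.best_score_)
```

With the default local tracking setup, calling `log_to_mlflow(search)` writes the run into an mlruns folder in the working directory.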

Dataset

The dataset is the Stanford Large Movie Review Dataset which can be found here: https://ai.stanford.edu/~amaas/data/sentiment/
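The download-and-extract step can be sketched with the standard library alone. This is an illustration rather than the notebook's Load Data code: the archive URL and file name `aclImdb_v1.tar.gz` are assumptions about what the page above links to, and the function names are hypothetical.

```python
import tarfile
import urllib.request
from pathlib import Path

# Assumed archive location (not stated in this README); adjust if it moves.
DATASET_URL = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"


def extract_archive(archive_path, dest="."):
    """Extract a .tar.gz archive into dest and return the destination path."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest)
    return dest


def download_dataset(url=DATASET_URL, dest="."):
    """Download the archive if needed, then extract it to dest/aclImdb."""
    dest = Path(dest)
    archive = dest / "aclImdb_v1.tar.gz"
    if not (dest / "aclImdb").exists():
        dest.mkdir(parents=True, exist_ok=True)
        # Skip the download when the archive is already on disk.
        if not archive.exists():
            urllib.request.urlretrieve(url, archive)
        extract_archive(archive, dest)
    return dest / "aclImdb"
```

Calling `download_dataset()` from the repository root would leave an aclImdb folder next to the notebooks and skip the work on later calls.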

Getting Started

To start, run the Preprocessing.ipynb notebook, as this downloads the dataset. The code in the Load Data section downloads the dataset file and extracts it, creating the folder aclImdb.

After that, you can run the rest of Preprocessing.ipynb or Text_Aspects_Statistics.ipynb.

To run Train_models.ipynb, first run all the cells in Preprocessing.ipynb up to the point where it creates the processed_datasets folder, which holds the preprocessed data used for model training.
To run Checks.ipynb, first train the models by running Train_models.ipynb. This creates the mlruns and models directories, from which the trained models are loaded for the model calibration checks and manual error analysis.
Checks.ipynb creates the plots and results_from_checks directories.

Acknowledgments

I.U. International University of Applied Sciences
