This is the code repository to support the project report for module: DLBAIPNLP01 – Project: NLP at I.U. International University of Applied Sciences.
In this project, we perform sentiment analysis on the Large Movie Review Dataset. We apply two preprocessing methodologies and train two linear classifiers (LogisticRegression and LinearSVC) and one probabilistic classifier (MultinomialNB).
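As a rough illustration of the setup described above, the following sketch fits the three named classifiers with scikit-learn. The toy reviews, the 0/1 label encoding, and the use of TF-IDF features are assumptions made for this example; the notebooks may vectorize and configure the models differently.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for the movie reviews (hypothetical data; 1 = positive, 0 = negative).
texts = [
    "a wonderful, moving film with great acting",
    "brilliant story and a great cast",
    "I loved every minute of it",
    "boring plot and terrible acting",
    "a dull, awful waste of time",
    "I hated the ending, it was terrible",
]
labels = [1, 1, 1, 0, 0, 0]

# The three classifiers named in the project description.
classifiers = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "LinearSVC": LinearSVC(),
    "MultinomialNB": MultinomialNB(),
}

# TF-IDF features are an assumption; they are non-negative, which MultinomialNB requires.
fitted = {}
for name, clf in classifiers.items():
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(texts, labels)
    fitted[name] = model

# Predict the sentiment of two unseen snippets with every fitted model.
predictions = {
    name: model.predict(["great film", "terrible film"]).tolist()
    for name, model in fitted.items()
}
```

Wrapping the vectorizer and the classifier in one pipeline keeps the vocabulary fitted on the training data only, which matters once cross-validation is involved.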
The aclImdb folder contains the dataset. It has two subfolders, 'train' and 'test', each of which contains samples separated into 'pos' and 'neg'. This folder is created when you run Preprocessing.ipynb.
The mlruns folder will contain the MLflow runs created when running Train_Models.ipynb.
The models folder is where the trained models are saved.
The plots folder contains the model calibration plots produced by Checks.ipynb.
The processed_datasets folder stores the preprocessed datasets for faster access.
The results_from_checks folder stores the manual error analysis results.
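The 'train'/'test' and 'pos'/'neg' layout of the dataset folder can be read with a few lines of standard-library Python. This is a minimal sketch, not the code from the notebooks: the function name `load_split` and the 0/1 label encoding are assumptions, and it expects one review per `.txt` file.

```python
from pathlib import Path


def load_split(root, split):
    """Read one split ('train' or 'test') into parallel lists of texts and labels."""
    texts, labels = [], []
    # Assumed encoding: 0 for negative reviews, 1 for positive reviews.
    for label_name, label in (("neg", 0), ("pos", 1)):
        folder = Path(root) / split / label_name
        # Sort the files so the order is reproducible across runs.
        for path in sorted(folder.glob("*.txt")):
            texts.append(path.read_text(encoding="utf-8"))
            labels.append(label)
    return texts, labels
```

For example, `load_split("aclImdb", "train")` would return the training reviews and their labels once the dataset folder exists.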
Checks.ipynb performs model calibration checks and manual error analysis.
Preprocessing.ipynb preprocesses the data, saves the results, and shows a rough statistical comparison of the two preprocessing methodologies.
Text_Aspects_Statistics.ipynb performs an initial statistical analysis of the data to decide on the preprocessing methodology.
Train_Models.ipynb trains the models, using GridSearchCV to find the optimal model parameters and MLflow to record the experiments.
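To make the idea of a model calibration check concrete, the sketch below computes the numbers behind a reliability plot: predictions are grouped into equal-width probability bins, and the mean predicted probability of each bin is compared with the observed fraction of positives. The function name `reliability_bins` and the binning scheme are assumptions for illustration; the notebook may rely on a library routine instead.

```python
def reliability_bins(y_true, y_prob, n_bins=5):
    """Summarize calibration per equal-width probability bin.

    Returns one (mean_predicted_probability, observed_positive_fraction, count)
    tuple for every non-empty bin, ordered from low to high probability.
    """
    bins = [[] for _ in range(n_bins)]
    for truth, prob in zip(y_true, y_prob):
        # A probability of exactly 1.0 is placed in the last bin.
        index = min(int(prob * n_bins), n_bins - 1)
        bins[index].append((truth, prob))

    summary = []
    for members in bins:
        if members:
            mean_prob = sum(p for _, p in members) / len(members)
            positive_fraction = sum(t for t, _ in members) / len(members)
            summary.append((mean_prob, positive_fraction, len(members)))
    return summary
```

For a well-calibrated model the first two values in each tuple are close to each other; large gaps indicate over- or under-confident predictions in that probability range.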
The dataset is the Stanford Large Movie Review Dataset, which can be found here: https://ai.stanford.edu/~amaas/data/sentiment/
Start by running the Preprocessing.ipynb notebook, which downloads the dataset. The code in the Load Data section downloads the dataset file and extracts it, creating the folder aclImdb.
After that, you can run the rest of Preprocessing.ipynb or Text_Aspects_Statistics.ipynb.
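The download-and-extract step can be sketched with the standard library alone. This is not the notebook's code: the function name `download_and_extract` is hypothetical, and the archive URL is an assumption that should be checked against the dataset page linked above.

```python
import tarfile
import urllib.request
from pathlib import Path

# Assumed archive location; verify it on the dataset page before relying on it.
DATASET_URL = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"


def download_and_extract(url=DATASET_URL, dest="."):
    """Download a .tar.gz archive into dest (if not already there) and extract it."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / url.rsplit("/", 1)[-1]

    # Skip the download when the archive is already present.
    if not archive.exists():
        urllib.request.urlretrieve(url, archive)

    with tarfile.open(archive, "r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Safer extraction on Python versions that support extraction filters.
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    return dest
```

Keeping the archive on disk means that rerunning the step only repeats the extraction, not the download.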
- To run Train_models.ipynb, you need to run all the cells in Preprocessing.ipynb, up until it creates the folder processed_datasets, where the preprocessed data are saved and used in model training.
- To run Checks.ipynb, you first need to train the models by running Train_models.ipynb, which creates the mlruns and models directories, from which you can fetch the trained models to perform model calibration checks and manual error analysis.
- Checks.ipynb creates the plots and results_from_checks directories.
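The run order described above can be expressed as a small prerequisite check that reports which directories are still missing before a notebook is started. The mapping and the function name `missing_prerequisites` are assumptions derived from this description, not part of the repository.

```python
from pathlib import Path

# Directories each notebook expects to exist, as inferred from the run order above.
REQUIRED_DIRS = {
    "Text_Aspects_Statistics.ipynb": ["aclImdb"],
    "Train_models.ipynb": ["processed_datasets"],
    "Checks.ipynb": ["mlruns", "models"],
}


def missing_prerequisites(notebook, root="."):
    """Return the required directories that do not yet exist under root."""
    return [
        name
        for name in REQUIRED_DIRS.get(notebook, [])
        if not (Path(root) / name).is_dir()
    ]
```

An empty list means the notebook's inputs are in place; for instance, a non-empty result for Checks.ipynb suggests that the training notebook has not been run yet.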
- I.U. International University of Applied Sciences