Author: J. Wong
Date: 2026-03-10
This project explores unsupervised and supervised learning techniques on both tabular and image datasets, emphasizing dimensionality reduction, clustering, visualization, and predictive modeling. The main objectives are to uncover hidden patterns, reduce high-dimensional data into interpretable forms, and engineer features that improve model performance.
We use datasets such as the Spotify Tracks Dataset and Sign Language MNIST to demonstrate a variety of techniques, including:
- Dimensionality Reduction: `PCA`, `KernelPCA`, `t-SNE`
- Clustering: `KMeans`, `DBSCAN`, `GaussianMixture`, `MeanShift`, `AgglomerativeClustering`
- Predictive Modeling: `LogisticRegression` combined with cluster-based features
- Evaluation Metrics: ROC-AUC, Accuracy
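The techniques above can be combined into one compact pipeline. The sketch below is a minimal illustration using scikit-learn, with synthetic blobs standing in for the actual Spotify / Sign Language MNIST data; the sample counts and cluster count are illustrative assumptions, not values from the notebook.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Synthetic stand-in data: 300 samples, 10 features, 4 latent groups
X, _ = make_blobs(n_samples=300, n_features=10, centers=4, random_state=42)

# Standardize, then reduce to 2 components for visualization and clustering
X_scaled = StandardScaler().fit_transform(X)
X_2d = PCA(n_components=2).fit_transform(X_scaled)

# Cluster in the reduced space
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X_2d)
print(X_2d.shape, np.unique(labels))
```

Scaling before PCA matters: without it, features with large raw variance (e.g. tempo in BPM) would dominate the principal components.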
Throughout the workflow, we emphasize visual interpretability using scatter plots, 3D embeddings, cluster reconstructions, and feature importance visualizations.
The notebook is divided into the following sections:
- Imports & Preprocessing: Standardizing features, encoding categorical variables, and subsampling data for efficiency.
- KMeans Clustering: Pixel-level clustering of images and feature clustering for tabular data.
- Principal Component Analysis (PCA): Dimensionality reduction and visualization of explained variance.
- DBSCAN: Density-based clustering to identify noise and discover clusters in t-SNE embeddings.
- t-SNE: Non-linear embedding for visualizing high-dimensional structures in 2D and 3D.
- Mean Shift Clustering: Identifying cluster centers and analyzing cluster-level feature statistics.
- Agglomerative Clustering: Hierarchical clustering with dendrogram visualizations.
- Explained Variances & Feature Importance: Evaluating PCA components and their contribution to feature representation.
- Conclusion & Next Steps: Summarizing insights and suggesting potential improvements or extensions.
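The preprocessing step can be sketched as follows. This is a hedged example assuming a pandas DataFrame; the column names (`danceability`, `tempo`, `genre`) are hypothetical stand-ins for the real dataset's fields.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny hypothetical table in the spirit of the Spotify Tracks Dataset
df = pd.DataFrame({
    "danceability": [0.8, 0.3, 0.6, 0.9],
    "tempo": [120.0, 95.0, 140.0, 128.0],
    "genre": ["pop", "jazz", "rock", "pop"],
})

# Encode the categorical variable as integer codes
df["genre_code"] = df["genre"].astype("category").cat.codes

# Standardize the numeric features to zero mean and unit variance
num_cols = ["danceability", "tempo"]
df[num_cols] = StandardScaler().fit_transform(df[num_cols])

# Subsample for efficiency (here 50% of rows, without replacement)
sample = df.sample(frac=0.5, random_state=0)
print(sample.shape)
```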
Key takeaways from the analysis:
- PCA and KernelPCA can compress high-dimensional data while retaining significant variance.
- Clustering techniques reveal meaningful groupings in both tabular and image data.
- t-SNE provides intuitive visualizations for complex, high-dimensional structures.
- Cluster-based features can enhance predictive models like `LogisticRegression`.
- DBSCAN and MeanShift effectively identify noise and natural clusters without specifying `n_clusters`.
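The cluster-based feature idea can be sketched concretely: append each sample's distances to the `KMeans` centers as extra columns before fitting `LogisticRegression`. This is a minimal illustration on synthetic data (the sample size, feature count, and `n_clusters=5` are assumptions), not the notebook's exact setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic binary classification task standing in for the real data
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# KMeans.transform() returns each sample's distance to every cluster center;
# stack these distances onto the original features
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
X_aug = np.hstack([X, km.transform(X)])

X_tr, X_te, y_tr, y_te = train_test_split(X_aug, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(round(auc, 3))
```

Distance-to-center features inject non-linear structure that a linear model like `LogisticRegression` cannot learn on its own.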
Future work could include:
- Experimenting with other kernels in `KernelPCA` or different distance metrics in clustering.
- Scaling up the workflow to full datasets for more robust patterns.
- Combining unsupervised embeddings with deep learning models for richer feature representations.
- Evaluating feature selection methods to further improve model interpretability and performance.
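The first extension can be explored with a few lines: compare `KernelPCA` embeddings under different kernels. The sketch below uses synthetic concentric circles (an assumption chosen because they expose the difference between linear and non-linear kernels), not the project's data.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Concentric circles: linearly inseparable, so a linear kernel cannot unfold them
X, _ = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Project with several kernels; rbf can separate the rings, linear cannot
embeddings = {
    kernel: KernelPCA(n_components=2, kernel=kernel).fit_transform(X)
    for kernel in ("linear", "rbf", "cosine")
}
for kernel, emb in embeddings.items():
    print(kernel, emb.shape)
```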