
CS-401 ADA Project - Stock Market Dataset

Copycat Portfolios – Do political trades pay off?

Adrien Bousquié, Loan Bianchi, Majandra Garcia, Evan Ayoub, Jérôme Courdacy


Note: the data folder is pre-populated with the part of our dataset that the backend needs for its computations.


Quickstart

```shell
# clone project
git clone https://github.com/epfl-ada/ada-2025-project-adaptateurusbversusbcet10gboslotde3conv.git
cd ada-2025-project-adaptateurusbversusbcet10gboslotde3conv

# [OPTIONAL] create conda environment
conda create -n venvADAptateur python=3.11
conda activate venvADAptateur

# install requirements
pip install -r pip_requirements.txt
```

You can now run the notebook results.ipynb.

Project Proposal

Abstract

It is now common knowledge that the stock market and the political world are closely linked. When the market suddenly crashes or jumps, it is very likely the result of an important political decision that has just been made. To curb profit-driven behavior by politicians and promote transparency, members of the US Senate and Congress are required to disclose all their trades of more than $1,000, along with those of their family members, within 45 days of execution. This allows us to use that data to analyze their portfolios: what they look like, how they evolve over time, how well they perform, and many other interesting aspects. In particular, one aim of our data analysis is to understand to what extent a strategy consisting of copying those trades as soon as they are disclosed would be a good trading strategy.


Research Questions

Main question: How well do these politicians' portfolios perform, and whose perform best?

Examples of subquestions: What is their return? What is their Information Ratio (average excess return / volatility)? Whose portfolios perform better, Republicans' or Democrats'? Members of the Senate or of Congress?

Main question: Can we find a good trading strategy by copying a well-chosen subset of these disclosed trades?

Examples of subquestions: How does the delay between a politician's transaction and its disclosure affect the performance of such a strategy? How can we rank those politicians on a podium? Would copying the trading behavior of the best performers be a good idea?
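As a concrete illustration of the Information Ratio used above, here is a minimal sketch with made-up daily return series; the project's actual computation lives in the notebook and the utils scripts, so function and variable names here are illustrative only:

```python
import numpy as np

def information_ratio(portfolio_returns, benchmark_returns):
    """Information Ratio: mean excess return divided by the
    volatility of the excess returns (daily, unannualized)."""
    excess = np.asarray(portfolio_returns) - np.asarray(benchmark_returns)
    return excess.mean() / excess.std(ddof=1)

# toy example with made-up daily returns
port = [0.012, -0.003, 0.007, 0.001]
bench = [0.010, -0.004, 0.002, 0.000]
ir = information_ratio(port, bench)
```

A positive IR means the portfolio beat the benchmark on average, scaled by how noisy that outperformance was.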


Datasets

The main dataset used for this project contains historical daily prices for all tickers currently trading on NASDAQ. The up-to-date list is available from nasdaqtrader.com. The historical data is retrieved from Yahoo Finance via the yfinance Python package.

However, it only contains prices up to April 1, 2020. We therefore extend the original dataset with a more up-to-date version of the Stock Market Dataset, which gives us data up to April 1, 2024. To obtain it, we fork and re-run the data collection script available on Kaggle. This step is performed automatically when the dataset is loaded (i.e., when results.ipynb is run).

  • Script: src/data/stock_dataset.py
  • Output (default): data/Stock_Market_Dataset/etfs, data/Stock_Market_Dataset/stocks and data/Stock_Market_Dataset/symbols_valid_meta.csv

All stocks and ETFs are saved in the folder data/stocks, one file per ticker, named [ticker].csv.
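The per-ticker CSV layout can be sketched as follows. The yfinance call shown in the comment and the exact columns (Open, Close, Volume) are assumptions based on the usual Yahoo Finance export, not taken from the project code, and the snippet fabricates two rows so it runs offline:

```python
import pandas as pd
from pathlib import Path

# In the real pipeline the frame would come from yfinance, e.g.:
#   import yfinance as yf
#   df = yf.download("AAPL", start="2020-04-01", end="2024-04-01")
# Here we fabricate two rows so the snippet runs without network access.
df = pd.DataFrame(
    {"Open": [100.0, 101.5], "Close": [101.0, 102.0], "Volume": [1_000, 1_200]},
    index=pd.to_datetime(["2020-04-01", "2020-04-02"]),
)
df.index.name = "Date"

out_dir = Path("data/Stock_Market_Dataset/stocks")
out_dir.mkdir(parents=True, exist_ok=True)
df.to_csv(out_dir / "AAPL.csv")  # saved as [ticker].csv

back = pd.read_csv(out_dir / "AAPL.csv", index_col="Date", parse_dates=True)
```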

Proposed additional datasets

In addition to the Stock Market Dataset provided, we use:

The Congressional Trading (Inception to March 23) dataset, which can be found on kaggle.com. It contains information about stock trades made by US senators and congressmen: the traded security, the trade date and the disclosure date, the type of trade, the individuals making the trade, their political affiliations, and some financial metrics such as excess returns (relative to benchmark indices). To tailor the dataset to our needs, we perform the following operations:

  • Standardization and filtering of Ticker symbols, TickerType categories, Transaction categories, and Party categories.
  • Reformatting of the date columns Traded and Filed.
  • Extraction of the minimum, maximum, and mean of the ranges in Trade_Size_USD.
  • Remapping of politician names in Name so that a single individual is not listed under several names.
  • Dropping of unnecessary columns.
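The Trade_Size_USD extraction can be sketched as follows. Congressional disclosures report dollar ranges rather than exact amounts; the exact string format used here (e.g. "$1,001 - $15,000") and the helper name are assumptions for illustration:

```python
import re
import pandas as pd

def parse_trade_size(range_str):
    """Extract (min, max, mean) from a disclosed trade-size range.
    Assumes a format like "$1,001 - $15,000"."""
    bounds = [float(x.replace(",", "")) for x in re.findall(r"[\d,]+", range_str)]
    lo, hi = bounds[0], bounds[-1]
    return lo, hi, (lo + hi) / 2

df = pd.DataFrame({"Trade_Size_USD": ["$1,001 - $15,000", "$15,001 - $50,000"]})
df[["size_min", "size_max", "size_mean"]] = df["Trade_Size_USD"].apply(
    lambda s: pd.Series(parse_trade_size(s))
)
```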

Project files description

The directory structure of our project looks like this:

```
├── data                        <- Project data files (ignored by git)
│
├── docs                        <- Folder for the Data Story as a website
│
├── src                         <- Source code
│   ├── data                            <- Data directory / package
│   ├── scripts                         <- Shell scripts or python scripts to run
│   └── utils                           <- Utility directory / package
│
├── results.ipynb               <- A well-structured notebook showing the results
│
├── .gitignore                  <- List of files ignored by git
├── pip_requirements.txt        <- File for installing python dependencies
└── README.md
```

There is a folder data which is created when loading the datasets. It contains:

  • A subfolder stock_dataset which contains:
    • All the Stocks and ETFs, saved as [ticker].csv, in the stocks subfolder.
    • It also has symbols_valid_meta.csv which has some additional metadata for each ticker (such as full name), obtained on Kaggle.
  • duplicate_names.csv was created manually because some politicians appear multiple times in the Congress dataset under slightly different names (e.g. Mr. Bob Gibbs / Bob Gibbs). It is used in the data preprocessing to fix this problem.

The folder src contains scripts:

  • In the subfolder data:

    • The clean_congress_df.py defines all methods to clean the congress dataset.
    • The copycat_return.py generates the copycat portfolio returns.
    • The generate_all_composite.py builds the composite portfolios.
    • The load_congress_df.py loads the raw congress dataset and saves it locally.
    • The personal_returns.py generates personal returns for politicians.
    • The portfolio.py contains a class Transaction and functions that calculate daily returns of each stock, in order to later create portfolios.
    • The SPY_return.py generates the SPY returns.
    • The stock_dataset.py loads all NASDAQ stocks metadata, downloads only necessary time series datasets, and cleans the NASDAQ stock market dataset.
  • Subfolder utils has:

    • copycat_utils.py generates rankings of politicians and a simulated portfolio that copies the top ones.
    • data_utils.py returns the start and end dates of a politician’s term in office.
    • evaluation_utils.py contains functions that are used to get average daily returns and information over a year for a given stock.
    • plot_utils.py contains the functions to make all the plots.
    • statistical_utils.py detects autocorrelation in a return series and performs a Newey–West corrected t-test.
    • transaction_utils.py extracts politicians' transactions, builds portfolios on a given date.
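The Newey–West corrected t-test handled by statistical_utils.py can be sketched with plain numpy. This is a minimal sketch assuming Bartlett weights and a test of whether the mean return is zero; the project's actual implementation may differ:

```python
import numpy as np

def newey_west_tstat(returns, maxlags=5):
    """t-statistic for H0: mean(returns) == 0, using a Newey-West
    (HAC) long-run variance estimate with Bartlett weights, which
    remains valid when the return series is autocorrelated."""
    x = np.asarray(returns, dtype=float)
    n = len(x)
    d = x - x.mean()
    # long-run variance: gamma_0 + 2 * weighted autocovariances
    lrv = d @ d / n
    for lag in range(1, maxlags + 1):
        w = 1.0 - lag / (maxlags + 1)   # Bartlett weight
        lrv += 2.0 * w * (d[lag:] @ d[:-lag]) / n
    se = np.sqrt(lrv / n)               # standard error of the mean
    return x.mean() / se

# synthetic check: returns drawn around zero should give a small t-stat
rng = np.random.default_rng(42)
t_null = newey_west_tstat(rng.normal(0.0, 0.01, size=1000))
t_strong = newey_west_tstat(rng.normal(0.01, 0.01, size=1000))
```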

Text files:

  • pip_requirements.txt lists all the libraries that must be installed before starting the project.

Notebooks:

  • The notebook results.ipynb loads the Stock market dataset and the additional Congress dataset, using the scripts that first clean the datasets. It then contains some exploratory plots of the Stock Market dataset. After that there are the analyses of the two main research questions.

Methods

The idea is to combine the stock dataset (original dataset) from Kaggle with the dataset of politician trades to compute the evolution of congress members' portfolios. We are then able to get:

Their "personal portfolio":
See details on how we build and compute the politicians' portfolios in results.ipynb.

Their "copycat portfolio":
See details on how we build and compute the copycat portfolios in results.ipynb.

Based on the built personal portfolios we can analyze and compare the performances of each politician.

We then also analyze how characteristics of politicians play a role in their performances. To this end, we perform descriptive and comparative analyses by grouping politicians according to their characteristics (such as party affiliation or chamber) and comparing the performance metrics of their portfolios across groups, in order to identify which characteristics are associated with better performance. Rather than focusing on full multivariate regression modeling, we rely on aggregated performance measures and group-level comparisons to highlight potential differences in returns linked to specific political characteristics.
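A group-level comparison of the kind described above could look like the following sketch, using a hypothetical per-politician summary table (the column names and numbers are illustrative, not the project's actual data):

```python
import pandas as pd

# hypothetical per-politician performance summary
perf = pd.DataFrame({
    "party":   ["D", "R", "D", "R", "D"],
    "chamber": ["Senate", "House", "House", "Senate", "Senate"],
    "avg_daily_return": [0.0004, 0.0006, 0.0002, 0.0005, 0.0003],
})

# compare mean performance across groups
by_party = perf.groupby("party")["avg_daily_return"].agg(["mean", "count"])
by_chamber = perf.groupby("chamber")["avg_daily_return"].mean()
```

Group means like these are descriptive; the statistical tests mentioned elsewhere (e.g. Newey–West corrected t-tests) are what back any claim of a real difference.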

We then move to the second research question by analyzing their "copycat portfolios". In the same manner as for the personal portfolios, we compare the average daily returns as well as the Information Ratios of the copycat portfolios.

To explain how the copycat portfolio is linked to a trading strategy, we interpret the copycat portfolio returns as the outcome of a hypothetical daily rebalanced strategy. At the start of each trading day, a fixed amount of capital is allocated according to the composition of the copycat portfolio, and the portfolio is held over the day. This construction implies that the daily returns of the strategy correspond directly to the daily returns computed for the copycat portfolio, allowing us to evaluate its performance using excess returns and the Information Ratio relative to a benchmark.
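The daily-rebalancing interpretation above can be sketched numerically: with the allocation fixed at the start of each day, the day's portfolio return is the weight-weighted sum of asset returns. The weights and returns below are made up for illustration:

```python
import numpy as np

# copycat allocation across three tickers, fixed at the start of each day
weights = np.array([
    [0.5, 0.5, 0.0],   # day 1 allocation
    [0.2, 0.3, 0.5],   # day 2 allocation after rebalancing
])
# per-asset daily returns over the same two days
asset_returns = np.array([
    [0.01, -0.02, 0.00],
    [0.00,  0.01, 0.02],
])

# daily portfolio return = weighted sum of that day's asset returns
daily_port_returns = (weights * asset_returns).sum(axis=1)

# cumulative growth of 1 unit of capital under daily rebalancing
growth = np.prod(1.0 + daily_port_returns)
```

The vector `daily_port_returns` is exactly what gets compared against the benchmark via excess returns and the Information Ratio.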

Finally, we analyze the risk–return profile of this copycat strategy by comparing its average daily excess returns and Information Ratio to those of a traditional benchmark, namely an ETF tracking the S&P 500.

Initially, we intended to run a statistical analysis of whether politicians influence the market. In other words, we wanted to see whether people copy the trading behavior of politicians, which would hypothetically show up as an increase in trading volume right after a trade is disclosed. However, this analysis turned out to be far more complicated than we thought: it would require modeling the behavior of the market while accounting for all confounders, and isolating such effects in financial markets is inherently difficult given their complexity. Therefore, this analysis had to be dropped.


Organization within the team

Adrien : In charge of data cleaning and preprocessing. Supervisor of the code organization of the whole project. In charge of the website creation.

Evan : Assistant for data cleaning and preprocessing. Main assistant for Q1. Main assistant for Q2. Ensuring coherence between Q1 and Q2.

Majandra : Assistant for data cleaning and preprocessing. In charge of exploratory data analysis prior to answering Q1 and Q2. In charge of adaptation of Q1 and Q2 for the website.

Jérôme : Assistant for data cleaning and preprocessing. In charge of Q2. In charge of the website introduction and the writing style consistency.

Loan : In charge of Q1. Supervisor of the mathematical consistency throughout the project. In charge of all statistical statements and tests throughout the project.
