Data story website: A Data Story on Congressional Stock Trading Behavior
Note: the `data` folder contains the part of our dataset that the backend needs for its computations.
```bash
# clone project
git clone https://github.com/epfl-ada/ada-2025-project-adaptateurusbversusbcet10gboslotde3conv.git
cd ada-2025-project-adaptateurusbversusbcet10gboslotde3conv

# [OPTIONAL] create conda environment
conda create -n venvADAptateur python=3.11
conda activate venvADAptateur

# install requirements
pip install -r pip_requirements.txt
```

You can now run the notebook `results.ipynb`.
It is now common knowledge that the stock market and the political world are closely linked. When the market suddenly crashes or jumps, it is often the result of an important political decision that has just been made. To prevent profit-driven behavior from politicians and to promote transparency, members of the US Senate and Congress are required to disclose all their trades of more than $1,000, along with those of their family members, within 45 days of execution. This allows us to use that data to analyze their portfolios: what they look like, how they evolve over time, how well they perform, and many other interesting aspects. In particular, one aim of our data analysis is to understand to what extent copying those trades as soon as they are disclosed would make a good trading strategy.
Main question: How well do the portfolios of these politicians perform, and whose perform the best?
Examples of subquestions: What is their return? What is their Information Ratio (average excess return / volatility)? Whose portfolios perform better, Republicans' or Democrats'? Members of the Senate or members of Congress?
Main question: Can we find a good trading strategy by copying a well-chosen subset of these disclosed trades?
Examples of subquestions: How does the delay between a politician's transaction and the moment of disclosure impact the performance of such a trading strategy? How can we grade those politicians to place them on a podium? Would it be a good idea to copy the trading behavior of the best performers?
The main dataset used for this project contains historical daily prices for all tickers currently trading on NASDAQ. The up-to-date list is available from nasdaqtrader.com. The historical data is retrieved from Yahoo Finance via the yfinance Python package.
However, it only contains prices up to April 1, 2020. Therefore, we use a more up-to-date version of the Stock Market Dataset by extending the original provided dataset, which gives us data up to April 1, 2024. To obtain it, we fork and re-run the data collection script available on Kaggle. This step is done automatically when the dataset is loaded (when `results.ipynb` is run).
- Script: `src/data/stock_dataset.py`
- Output (default): `data/Stock_Market_Dataset/etfs`, `data/Stock_Market_Dataset/stocks`, and `data/Stock_Market_Dataset/symbols_valid_meta.csv`

All stocks and ETFs are saved in the folder `data/stocks`, using their ticker names: `[ticker].csv`.
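For illustration, here is a minimal sketch of how the per-ticker CSV files could be refreshed with yfinance. The actual logic lives in `src/data/stock_dataset.py`; the ticker list, date range, and output path below are assumptions, not the project's real configuration.

```python
# Hedged sketch: refresh per-ticker price histories with yfinance.
# Ticker list, date range, and output folder are illustrative only.
from pathlib import Path

import yfinance as yf

OUT_DIR = Path("data/stocks")          # assumed output folder
OUT_DIR.mkdir(parents=True, exist_ok=True)

tickers = ["AAPL", "MSFT", "SPY"]      # in practice, read from symbols_valid_meta.csv

for ticker in tickers:
    # Daily OHLCV history up to the dataset cutoff (April 1, 2024).
    df = yf.download(ticker, start="2019-01-01", end="2024-04-01", progress=False)
    if not df.empty:
        df.to_csv(OUT_DIR / f"{ticker}.csv")  # saved as [ticker].csv, as in the repo
```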
In addition to the Stock Market Dataset provided, we use:
The Congressional Trading (Inception to March 23) dataset, which can be found on kaggle.com. This dataset contains information about stock trades made by US senators and congressmen. It includes information about the traded security, the day it was traded, the date the trade was disclosed, the type of trade, the individual making the trade, their political affiliation, and some financial metrics such as excess returns (comparison with benchmark indices). To tailor the dataset to our needs, we perform the following operations:
- Standardization and filtering of `Ticker` symbols, `TickerType` categories, `Transaction` categories, and `Party` categories.
- Reformatting of the date columns `Traded` and `Filed`.
- Extraction of the minimum, maximum, and mean of the ranges in `Trade_Size_USD`.
- Remapping of politician names in `Name` to avoid several names for one single individual.
- Dropping of unnecessary columns.
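To make these steps concrete, here is a minimal pandas sketch of the last three operations. The exact value formats (e.g. a `"$1,001 - $15,000"` style for `Trade_Size_USD`), the derived column names, and the structure of `duplicate_names.csv` are assumptions; the real logic lives in `src/data/clean_congress_df.py`.

```python
# Hedged sketch of parts of the Congress-dataset cleaning.
# Value formats, derived columns, and duplicate_names.csv layout are assumed.
import pandas as pd

df = pd.read_csv("data/congress_trading.csv")  # illustrative path

# Reformat the date columns.
for col in ["Traded", "Filed"]:
    df[col] = pd.to_datetime(df[col], errors="coerce")

# Extract min / max / mean from ranges like "$1,001 - $15,000".
bounds = (
    df["Trade_Size_USD"]
    .str.replace(r"[$,]", "", regex=True)
    .str.split("-", expand=True)
    .astype(float)
)
df["Trade_Size_Min"] = bounds[0]
df["Trade_Size_Max"] = bounds[1]
df["Trade_Size_Mean"] = bounds.mean(axis=1)

# Remap duplicate politician names, e.g. "Mr. Bob Gibbs" -> "Bob Gibbs".
aliases = pd.read_csv("data/duplicate_names.csv")  # assumed columns: alias, canonical
df["Name"] = df["Name"].replace(dict(zip(aliases["alias"], aliases["canonical"])))
```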
The directory structure of our project looks like this:
```
├── data                    <- Project data files (ignored by git)
│
├── docs                    <- Folder for the Data Story as a website
│
├── src                     <- Source code
│   ├── data                <- Data directory / package
│   ├── scripts             <- Shell scripts or python scripts to run
│   └── utils               <- Utility directory / package
│
├── results.ipynb           <- A well-structured notebook showing the results
│
├── .gitignore              <- List of files ignored by git
├── pip_requirements.txt    <- File for installing python dependencies
└── README.md
```
There is a folder `data` which is created when loading the datasets. It contains:

- A subfolder `stock_dataset` which contains:
  - All the stocks and ETFs, saved as `[ticker].csv`, in the `stocks` subfolder.
  - `symbols_valid_meta.csv`, which has some additional metadata for each ticker (such as the full name), obtained from Kaggle.
- `duplicate_names.csv`, which was created manually because some politicians have their names appearing multiple times in the Congress dataset with slight variations (e.g. Mr. Bob Gibbs / Bob Gibbs). It is used in the data preprocessing to fix this problem.
The folder `src` contains scripts:

- In the subfolder `data`:
  - `clean_congress_df.py` defines all the methods to clean the Congress dataset.
  - `copycat_return.py` generates the copycat portfolio returns.
  - `generate_all_composite.py` builds the composite portfolios.
  - `load_congress_df.py` loads the raw Congress dataset and saves it locally.
  - `personal_returns.py` generates the personal returns of politicians.
  - `portfolio.py` contains a `Transaction` class and functions that calculate the daily returns of each stock, in order to later create portfolios.
  - `SPY_return.py` generates the SPY returns.
  - `stock_dataset.py` loads all the NASDAQ stock metadata, downloads only the necessary time series, and cleans the NASDAQ stock market dataset.
- The subfolder `utils` has:
  - `copycat_utils.py` generates rankings of politicians and a simulated portfolio that copies the top ones.
  - `data_utils.py` returns the start and end dates of a politician's term in office.
  - `evaluation_utils.py` contains functions used to get the average daily returns and information over a year for a given stock.
  - `plot_utils.py` contains the functions to make all the plots.
  - `statistical_utils.py` detects autocorrelation in a return series and performs a Newey–West corrected t-test (see the sketch after this list).
  - `transaction_utils.py` extracts politicians' transactions and builds portfolios on a given date.
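As a rough illustration of what `statistical_utils.py` does, here is a sketch of a Newey–West corrected t-test on the mean of a (possibly autocorrelated) daily return series, using statsmodels' HAC covariance option. The function shape and lag choice are assumptions, not the project's actual implementation.

```python
# Hedged sketch: test whether the mean daily return differs from zero,
# using HAC (Newey-West) standard errors to account for autocorrelation.
import numpy as np
import statsmodels.api as sm


def newey_west_ttest(returns: np.ndarray, max_lags: int = 5):
    """Regress returns on a constant with HAC-robust errors; the
    t-statistic on the constant tests H0: mean return = 0."""
    X = np.ones((len(returns), 1))  # intercept-only design matrix
    fit = sm.OLS(returns, X).fit(cov_type="HAC", cov_kwds={"maxlags": max_lags})
    return fit.tvalues[0], fit.pvalues[0]


# Toy usage on simulated daily returns (one trading year).
t_stat, p_value = newey_west_ttest(np.random.default_rng(0).normal(0.0005, 0.01, 252))
```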
Text files:

- `pip_requirements.txt` contains all the libraries required to be installed before starting the project.
Notebooks:

- The notebook `results.ipynb` loads the Stock Market dataset and the additional Congress dataset, using the scripts that first clean them. It then presents some exploratory plots of the Stock Market dataset, followed by the analyses of the two main research questions.
The idea is to combine the stock dataset (the original dataset from Kaggle) with the dataset of politician trades in order to compute the evolution of the portfolios of Congress members. We are then able to get:

- Their "personal portfolio": see details on how we build and compute the politicians' portfolios in `results.ipynb`.
- Their "copycat portfolio": see details on how we build and compute the copycat portfolios in `results.ipynb`.
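For intuition, here is a toy sketch of how daily portfolio returns can be derived from holdings and stock prices. The real construction, including how disclosed transactions update the holdings over time, is in `results.ipynb` and `src/data/portfolio.py`; the data layout below is an assumption.

```python
# Hedged toy sketch: daily returns of a portfolio from per-stock prices.
# Real holdings are rebuilt from disclosed transactions; here they are fixed.
import pandas as pd

# Wide price table: one column per ticker, one row per trading day (assumed layout).
prices = pd.DataFrame(
    {"AAPL": [100.0, 102.0, 101.0], "MSFT": [200.0, 202.0, 206.0]},
    index=pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04"]),
)
shares = pd.Series({"AAPL": 10, "MSFT": 5})  # holdings on each day (toy values)

portfolio_value = (prices * shares).sum(axis=1)        # daily mark-to-market value
daily_returns = portfolio_value.pct_change().dropna()  # simple daily returns
```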
Based on the personal portfolios we build, we can analyze and compare the performance of each politician.
We then also analyze how characteristics of politicians play a role in their performances. To this end, we perform descriptive and comparative analyses by grouping politicians according to their characteristics (such as party affiliation or chamber) and comparing the performance metrics of their portfolios across groups, in order to identify which characteristics are associated with better performance. Rather than focusing on full multivariate regression modeling, we rely on aggregated performance measures and group-level comparisons to highlight potential differences in returns linked to specific political characteristics.
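A hedged sketch of this kind of group-level comparison follows; the per-politician summary table and its column names are illustrative assumptions, not the project's actual data structures.

```python
# Hedged sketch: compare portfolio performance across politician groups.
# The summary DataFrame and its column names are illustrative assumptions.
import pandas as pd

summary = pd.DataFrame({
    "Party": ["Democrat", "Republican", "Democrat", "Republican"],
    "Chamber": ["Senate", "House", "House", "Senate"],
    "MeanDailyReturn": [0.0004, 0.0006, 0.0003, 0.0005],
    "InformationRatio": [0.02, 0.05, -0.01, 0.03],
})

# Aggregate performance metrics per group to spot systematic differences.
by_party = summary.groupby("Party")[["MeanDailyReturn", "InformationRatio"]].mean()
by_chamber = summary.groupby("Chamber")[["MeanDailyReturn", "InformationRatio"]].mean()
```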
We then move to the second research question by analyzing their "copycat portfolios". In the same manner as for the personal portfolios, we compare the average daily returns as well as the Information Ratios of the copycat portfolios.
To explain how the copycat portfolio is linked to a trading strategy, we interpret the copycat portfolio returns as the outcome of a hypothetical daily rebalanced strategy. At the start of each trading day, a fixed amount of capital is allocated according to the composition of the copycat portfolio, and the portfolio is held over the day. This construction implies that the daily returns of the strategy correspond directly to the daily returns computed for the copycat portfolio, allowing us to evaluate its performance using excess returns and the Information Ratio relative to a benchmark.
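In code, this daily rebalancing interpretation boils down to a weighted sum of constituent returns, r_strategy(t) = Σ_i w_i(t) · r_i(t), with the weights fixed at the start of each day. A minimal sketch, where the weights and returns tables are assumed layouts:

```python
# Hedged sketch: daily returns of a daily-rebalanced copycat strategy.
# r_strategy(t) = sum_i w_i(t) * r_i(t), weights fixed at the start of day t.
import pandas as pd

dates = pd.to_datetime(["2024-01-03", "2024-01-04"])
weights = pd.DataFrame(        # start-of-day capital allocation per ticker (assumed)
    {"AAPL": [0.6, 0.5], "MSFT": [0.4, 0.5]}, index=dates
)
stock_returns = pd.DataFrame(  # daily returns of each constituent (assumed)
    {"AAPL": [0.020, -0.010], "MSFT": [0.010, 0.020]}, index=dates
)

strategy_returns = (weights * stock_returns).sum(axis=1)
```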
Finally, we analyze the risk–return profile of this copycat strategy by comparing its average daily excess returns and Information Ratio to those of a traditional benchmark, namely an ETF tracking the S&P 500.
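The Information Ratio used here follows the definition given earlier: the average excess return over the benchmark, divided by the volatility of those excess returns. A minimal sketch against a series of SPY daily returns:

```python
# Hedged sketch: Information Ratio of a strategy vs. a benchmark (e.g. SPY).
# IR = mean(daily excess returns) / std(daily excess returns).
import pandas as pd


def information_ratio(strategy: pd.Series, benchmark: pd.Series) -> float:
    excess = strategy - benchmark  # daily excess returns over the benchmark
    return float(excess.mean() / excess.std(ddof=1))
```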
Initially, we intended to perform a statistical analysis of whether politicians influence the market. In other words, we wanted to see whether people copy the trading behavior of politicians, which would hypothetically mean that trading volume increases right after a trade is disclosed. However, this analysis turned out to be far more complicated than we thought, as it would require analyzing the behavior of the market while accounting for all confounders, and isolating such effects in financial markets is inherently difficult due to their high level of complexity. Therefore, this analysis had to be removed.
- Adrien: In charge of data cleaning and preprocessing. Supervisor of the code organization of the whole project. In charge of the website creation.
- Evan: Assistant for data cleaning and preprocessing. Main assistant for Q1 and Q2. Ensuring coherence between Q1 and Q2.
- Majandra: Assistant for data cleaning and preprocessing. In charge of the exploratory data analysis prior to answering Q1 and Q2. In charge of adapting Q1 and Q2 for the website.
- Jérôme: Assistant for data cleaning and preprocessing. In charge of Q2. In charge of the website introduction and of the writing-style consistency.
- Loan: In charge of Q1. Supervisor of the mathematical consistency throughout the project. In charge of all statistical statements and tests throughout the project.