stockXarb
Embedding‑Driven Clustering for Statistical Arbitrage
In this project, I sought to uncover latent arbitrage opportunities by clustering over 9,580 NYSE-traded symbols based on custom feature embeddings. After extracting and log-scaling key time-series features (trade price, volume, moving averages, and cyclically encoded timestamps), I built a neural embedding model: an input embedding layer followed by multiple dense blocks with LeakyReLU activations, dropout, and L2 regularization, chosen to generalize well while staying within memory constraints.
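As a rough illustration of the architecture described above, here is a framework-neutral NumPy sketch of the forward pass: an embedding lookup per symbol, concatenation with the numeric features, and dense blocks with LeakyReLU and dropout, plus the L2 term that would be added to the loss. It is not the project's training code. The layer sizes, number of blocks, dropout rate, LeakyReLU slope, and L2 weight are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
N_SYMBOLS, EMBED_DIM, N_FEATURES, HIDDEN = 9580, 16, 6, 32

def leaky_relu(x, slope=0.01):
    """LeakyReLU: pass positives through, scale negatives by `slope`."""
    return np.where(x > 0, x, slope * x)

def dropout(x, rate, training):
    """Inverted dropout: zero a fraction `rate` of units, rescale the rest."""
    if not training or rate == 0.0:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

# Parameters: one embedding row per symbol, then two dense blocks.
params = {
    "embed": rng.normal(0, 0.1, (N_SYMBOLS, EMBED_DIM)),
    "W1": rng.normal(0, 0.1, (EMBED_DIM + N_FEATURES, HIDDEN)),
    "b1": np.zeros(HIDDEN),
    "W2": rng.normal(0, 0.1, (HIDDEN, HIDDEN)),
    "b2": np.zeros(HIDDEN),
}

def forward(symbol_ids, features, training=False, drop_rate=0.2):
    """Embed symbol ids, concatenate with numeric features, apply dense blocks."""
    h = np.concatenate([params["embed"][symbol_ids], features], axis=1)
    for w_key, b_key in (("W1", "b1"), ("W2", "b2")):
        h = leaky_relu(h @ params[w_key] + params[b_key])
        h = dropout(h, drop_rate, training)
    return h

def l2_penalty(weight=1e-4):
    """L2 regularization term added to the training loss (dense weights only)."""
    return weight * sum(np.sum(params[k] ** 2) for k in ("W1", "W2"))

symbol_ids = np.array([0, 42, 9579])
features = rng.normal(size=(3, N_FEATURES))
hidden = forward(symbol_ids, features)  # one HIDDEN-sized vector per input row
```

After training, the rows of the embedding table (`params["embed"]` in this sketch) are the per-symbol vectors that the later dimensionality-reduction and clustering steps operate on.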
Machine Learning Techniques
- Log-scaling and moving-average smoothing of raw price and volume signals
- Cyclical encoding (sine/cosine) of hourly timestamps
- Embedding layer for symbol representation followed by dense networks
- LeakyReLU, dropout layers, and L2 weight regularization to prevent overfitting
- Dense layers in lieu of LSTM layers, trained with early stopping and ReduceLROnPlateau learning-rate scheduling
- PCA for feature compression and t‑SNE for visual cluster inspection
- Spectral clustering and K‑Means (k=4 via elbow method) evaluated by silhouette and purity scores
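The feature-engineering steps at the top of this list can be sketched in a few lines of NumPy. This is a minimal illustration, not the project's exact preprocessing: the use of `log1p` rather than a plain logarithm, the trailing (rather than centered) moving average, the window length, and the function names are all assumptions.

```python
import numpy as np

def log_scale(x):
    """Compress heavy-tailed, non-negative values with log(1 + x)."""
    return np.log1p(np.asarray(x, dtype=float))

def moving_average(x, window):
    """Trailing simple moving average; the first window-1 entries are NaN."""
    x = np.asarray(x, dtype=float)
    out = np.full(x.shape, np.nan)
    if len(x) >= window:
        kernel = np.ones(window) / window
        out[window - 1:] = np.convolve(x, kernel, mode="valid")
    return out

def encode_hour(hour):
    """Map hour-of-day (0-23) to a point on the unit circle."""
    angle = 2.0 * np.pi * np.asarray(hour, dtype=float) / 24.0
    return np.sin(angle), np.cos(angle)

# Toy inputs (hypothetical values, not the project's data).
volume = [0, 100, 10_000, 1_000_000]
scaled_volume = log_scale(volume)
smoothed = moving_average([1.0, 2.0, 3.0, 4.0], window=2)
hour_sin, hour_cos = encode_hour([23, 0, 1])
```

The point of the sine/cosine pair is that hour 23 and hour 0 end up as neighbors on the circle, the same distance apart as hour 0 and hour 1, whereas a raw integer encoding would place them 23 units apart.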
Results
After training on 41 days of high-frequency data with batch optimization, the embedding-driven MLP produced compact symbol representations that captured market dynamics more effectively than traditional correlation matrices. Applying K‑Means to these embeddings yielded four distinct clusters with a silhouette score of 0.607, while cluster purity against sector labels reached 0.426, indicating partial alignment with known industry groupings and suggesting novel arbitrage groups beyond classical approaches.
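For readers unfamiliar with the purity metric quoted above, the sketch below shows the standard definition: each cluster is credited with its most common reference label, and purity is the fraction of points that match their cluster's majority label. The function name and the toy labels are hypothetical, and the project's own evaluation code may differ in detail.

```python
import numpy as np

def cluster_purity(cluster_ids, class_labels):
    """Fraction of points whose class is the majority class of their cluster."""
    cluster_ids = np.asarray(cluster_ids)
    class_labels = np.asarray(class_labels)
    majority_total = 0
    for c in np.unique(cluster_ids):
        # Count how often each reference label occurs inside cluster c.
        _, counts = np.unique(class_labels[cluster_ids == c], return_counts=True)
        majority_total += counts.max()
    return majority_total / len(cluster_ids)

# Toy example with hypothetical labels (not the project's data).
clusters = [0, 0, 0, 1, 1, 1]
sectors = ["tech", "tech", "energy", "energy", "energy", "bank"]
purity = cluster_purity(clusters, sectors)  # (2 + 2) / 6
```

Purity lies between 0 and 1 and tends to rise as the number of clusters grows, so a value is best read relative to the number of clusters and the number of reference classes.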
The end-to-end pipeline, from feature engineering through embedding learning, dimensionality reduction, and unsupervised clustering, provides a scalable framework for statistical arbitrage. Full implementation details and additional quantitative analyses are available in the accompanying paper and GitHub repository.