stockXarb

Jan 2024 - May 2024 | GitHub | Download PDF

Embedding‑Driven Clustering for Statistical Arbitrage

In this project, I sought to uncover latent arbitrage opportunities by clustering over 9,580 NYSE-traded symbols based on custom feature embeddings. After extracting and log-scaling key time-series features (trade price, volume, moving averages, and cyclically encoded timestamps), I built a neural embedding model: an input embedding layer followed by multiple dense blocks with LeakyReLU activations, dropout, and L2 regularization. The regularization targets generalization, and the architecture was kept small enough to train within memory constraints.
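The feature preparation described above can be illustrated with a small standard-library sketch. It is only an illustration: the function names, the use of log(1 + x) as the log scale, the trailing-window average, and the 24-hour period are all assumptions, not details taken from the project code.

```python
import math

def log_scale(values):
    """Compress heavy-tailed non-negative values with log(1 + x) (assumed form of log-scaling)."""
    return [math.log1p(v) for v in values]

def moving_average(values, window):
    """Trailing mean over at most `window` of the most recent points."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the point that left the window
        out.append(running / min(i + 1, window))
    return out

def encode_hour(hour, period=24):
    """Map an hour onto the unit circle so that 23:00 sits next to 00:00."""
    angle = 2.0 * math.pi * (hour % period) / period
    return math.sin(angle), math.cos(angle)

# Toy inputs (made up for illustration only).
prices = [100.0, 101.0, 103.0, 102.0]
volumes = [1200, 900, 4000, 1500]
hours = [9, 10, 11, 12]

log_prices = log_scale(prices)
features = [
    (lp, lv, ma, *encode_hour(h))
    for lp, lv, ma, h in zip(
        log_prices, log_scale(volumes), moving_average(log_prices, 3), hours
    )
]
```

Each row of `features` holds log price, log volume, smoothed log price, and the sine/cosine pair for the hour. The sine/cosine pair avoids the artificial jump between hour 23 and hour 0 that a raw integer encoding would introduce.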

Machine Learning Techniques

Log-scaling and moving-average smoothing of raw price and volume signals
Cyclical encoding (sine/cosine) of hourly timestamps
Embedding layer for symbol representation followed by dense networks
LeakyReLU, dropout layers, and L2 weight regularization to prevent overfitting
Early stopping and ReduceLROnPlateau learning-rate scheduling, with the dense network used in lieu of LSTM layers
PCA for feature compression and t‑SNE for visual cluster inspection
Spectral clustering and K‑Means (k=4 via elbow method) evaluated by silhouette and purity scores
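The embedding-plus-dense architecture and the training callbacks listed above could be sketched in Keras roughly as follows. This is a hedged sketch, not the project's actual model: the function name `build_embedding_mlp`, the layer sizes, the dropout rate, the L2 strength, the callback settings, and in particular the scalar regression output with a mean-squared-error loss are all assumptions, since the training target is not stated here.

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

def build_embedding_mlp(n_symbols, n_features, embed_dim=16,
                        hidden=(64, 32), l2=1e-4, dropout=0.2):
    """Symbol embedding concatenated with numeric features, then dense blocks."""
    symbol_in = layers.Input(shape=(1,), dtype="int32", name="symbol_id")
    feats_in = layers.Input(shape=(n_features,), name="features")

    # One learned vector per symbol; these vectors are what gets clustered later.
    emb = layers.Embedding(n_symbols, embed_dim, name="symbol_embedding")(symbol_in)
    emb = layers.Flatten()(emb)

    x = layers.Concatenate()([emb, feats_in])
    for units in hidden:
        x = layers.Dense(units, kernel_regularizer=regularizers.l2(l2))(x)
        x = layers.LeakyReLU()(x)
        x = layers.Dropout(dropout)(x)

    # Assumed target: a single regression value (the real objective may differ).
    out = layers.Dense(1)(x)
    model = keras.Model([symbol_in, feats_in], out)
    model.compile(optimizer="adam", loss="mse")
    return model

# Callbacks matching the list above; the patience and factor values are guesses.
callbacks = [
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                  restore_best_weights=True),
    keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=2),
]

model = build_embedding_mlp(n_symbols=50, n_features=6, embed_dim=8)
# After training, the per-symbol vectors can be read straight from the layer.
symbol_vectors = model.get_layer("symbol_embedding").get_weights()[0]
```

In this sketch `symbol_vectors` has one row per symbol, which is the matrix that PCA, t-SNE, and the clustering steps would then operate on.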

Results

After training on 41 days of high-frequency data with batch optimization, the embedding-driven MLP produced compact symbol representations that captured market dynamics more effectively than traditional correlation matrices. Applying K‑Means to these embeddings yielded four distinct clusters with a silhouette score of 0.607. Cluster purity against sector labels reached 0.426, indicating partial alignment with known industry groupings and suggesting candidate arbitrage groups that classical sector-based approaches would not surface.
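To make the two reported metrics concrete, here is a minimal standard-library sketch of how purity and the mean silhouette coefficient are conventionally computed. It assumes the usual textbook definitions with Euclidean distance, and the toy points and sector names are invented; the project's own evaluation code may differ.

```python
from collections import Counter
import math

def purity(cluster_ids, labels):
    """Fraction of points whose label equals the majority label of their cluster."""
    by_cluster = {}
    for c, y in zip(cluster_ids, labels):
        by_cluster.setdefault(c, Counter())[y] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(labels)

def silhouette(points, cluster_ids):
    """Mean silhouette coefficient; O(n^2), needs at least two clusters."""
    clusters = {}
    for idx, c in enumerate(cluster_ids):
        clusters.setdefault(c, []).append(idx)

    def mean_dist(i, members):
        return sum(math.dist(points[i], points[j]) for j in members) / len(members)

    scores = []
    for i, c in enumerate(cluster_ids):
        own = [j for j in clusters[c] if j != i]
        if not own:  # singleton cluster: silhouette is defined as 0
            scores.append(0.0)
            continue
        a = mean_dist(i, own)  # cohesion: mean distance within own cluster
        b = min(mean_dist(i, members)  # separation: nearest other cluster
                for other, members in clusters.items() if other != c)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Toy example: two well-separated clusters, one sector label out of place.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
cluster_ids = [0, 0, 1, 1]
sectors = ["Tech", "Tech", "Energy", "Tech"]

toy_purity = purity(cluster_ids, sectors)        # 0.75
toy_silhouette = silhouette(points, cluster_ids)
```

Under this definition, a purity of 0.426 means that 42.6% of symbols carry the majority sector label of the cluster they were assigned to. Silhouette ranges from -1 to 1, with higher values indicating tighter and better-separated clusters.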

The end-to-end pipeline, from feature engineering through embedding learning, dimensionality reduction, and unsupervised clustering, provides a scalable framework for statistical arbitrage. Full implementation details and additional quantitative analyses are available in the accompanying paper and GitHub repository.

Deniz Qian
