See the code on Github.


From May 2021 to Jan 2023, I competed as an ML/Data Scientist on Numerai, the “hardest data science tournament in the world”.

  • Each week, I generated stock market predictions using several ML models I had developed.
  • I staked those models with my own money (using Numerai’s cryptocurrency, Numeraire).
  • My predictions contributed to the Numerai meta-model which drives their hedge fund positions.
  • I also played around with generating additional trading signals by scraping data from other sources, like prediction markets.

I got pretty good at it:

  • I regularly placed in the top-100 staked models in 2022.
  • My best model earned a 95% annual return in 2022 (even while the global markets were eating shit).

I learned a ton about:

  • algorithmic trading
  • quantitative finance
  • building performant models from challenging tabular datasets

But in late 2022, I started my own AI company and couldn’t devote any more time to improving my models. Numerai had also been expanding its underlying dataset, so keeping my models up to date would have required a lot of extra development time. I decided to stop competing weekly.

So I’m open-sourcing my code as a jumping-off point for others who are interested in learning more about Numerai, data science, and quantitative trading.

What you’ll find on Github

I did everything in Jupyter notebooks that I ran in Google Colab.

This setup was great for experimentation, and it let me trigger my models to run and submit predictions automatically from a web browser anywhere in the world. But in general, it’s not the best setup: Jupyter notebooks are hard to version and maintain, and although Colab is great for getting free/cheap access to decent GPUs, it’s pretty janky.

  • experiments/ contains the notebooks where I analysed the data, tried new feature engineering techniques, and compared my new model ideas against various baselines (including my own best-performing model, V3X).
  • models/ contains the notebooks that I would run each week to submit predictions from my current “stable” models:
    • V3X was my best-performing model over time, based on a meta-ensemble of gradient-boosted trees. It never made a killing, but it was super consistent and robust.
    • GTRUDA was a neural network based on a convolutional autoencoder. It did really well some weeks, but it was a bit unstable in training and had high variance. Developing it, though, inspired many of the ideas behind my later research on TableDiffusion.

V3X, my best model

V3X was my best-performing model over time, based on an ensemble of gradient-boosted trees.

This is just an isolated snapshot of the model for posterity. To see all the data wrangling and feature engineering that makes it work, check out the Jupyter notebook in the repo.

import numpy as np
from xgboost import XGBRegressor
from sklearn.linear_model import LassoCV
from sklearn.decomposition import PCA
from sklearn.base import BaseEstimator

# FEAT_COLS (the list of feature column names) is built during the data
# wrangling in the accompanying notebook.

class EraEnsemble(BaseEstimator):
    def __init__(self,
                 n_subs=10,
                 pca_frac=None,
                 subalg=XGBRegressor,
                 mainalg=LassoCV,
                 ):
        self.n_subs = n_subs
        self.submodels = []
        self.sub_preds = []
        self.pca_frac = pca_frac
        self.transforms = []
        self.subalg = subalg
        self.mainalg = mainalg

    def get_params(self, *args, **kwargs):
        return {
            'n_subs': self.n_subs,
            'pca_frac': self.pca_frac,
            'subalg': self.subalg,
            'mainalg': self.mainalg,
        }

    def fit(self, df, y, validation=None):
        # Reset fitted state so the estimator can be re-fit cleanly
        self.submodels = []
        self.sub_preds = []
        self.transforms = []

        # Partition the "eras" of data into n_subs contiguous ranges
        n_eras = df.era.nunique()
        min_era = df.era.min()
        max_era = df.era.max()
        STEP = n_eras // self.n_subs

        # Loop over era ranges
        for i in range(min_era, max_era, STEP):
            _data = df[df.era.between(i, i+STEP)]
            _target = _data['target']
            _data = _data[FEAT_COLS]
            # Optionally reduce dimensionality with PCA before fitting the sub-model
            if self.pca_frac is not None and self.pca_frac < 1.0:
                _pca = PCA(n_components=self.pca_frac).fit(_data)
                _data = _pca.transform(_data)
                self.transforms.append(_pca)
            if self.subalg == XGBRegressor:
                submodel = self.subalg(verbosity=0).fit(_data, _target)
            else:
                submodel = self.subalg().fit(_data, _target)
            self.submodels.append(submodel)
            if self.pca_frac is not None and self.pca_frac < 1.0:
                _preds = submodel.predict(
                    _pca.transform(df[FEAT_COLS])
                )
            else:
                _preds = submodel.predict(df[FEAT_COLS])
            self.sub_preds.append(_preds)
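        # Stack each sub-model's predictions column-wise as features for the main model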
        _X = np.array(self.sub_preds).T
        self.mainmodel = self.mainalg().fit(_X, y)

        return self

    def predict(self, df):
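        # Collect each sub-model's predictions, reusing its fitted PCA transform if one exists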
        _preds = []
        for i, sm in enumerate(self.submodels):
            if self.pca_frac is not None and self.pca_frac < 1.0:
                _data = self.transforms[i].transform(df[FEAT_COLS])
            else:
                _data = df[FEAT_COLS]
            _preds.append(sm.predict(_data))
        _X = np.array(_preds).T

        return self.mainmodel.predict(_X)

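For context, here’s a rough sketch of how the class above might be used. This isn’t from the repo: df_train, the file path, and the FEAT_COLS definition are placeholders for what the notebook builds during data wrangling, and the era column is assumed to already be numeric.

import pandas as pd

# Hypothetical training dataframe with a numeric `era` column, a `target`
# column, and feature columns named feature_* (as the notebook prepares it)
df_train = pd.read_parquet("train.parquet")
FEAT_COLS = [c for c in df_train.columns if c.startswith("feature")]

# Fit the era-partitioned ensemble, then predict on the same features
model = EraEnsemble(n_subs=10, pca_frac=0.5)
model.fit(df_train, df_train["target"])
preds = model.predict(df_train)
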
Tips for getting started with Numerai

  1. Read the Numerai docs; they’re excellent and will help you get started fast.
  2. Play around and build your own simple models. You can submit weekly predictions from them for free, without needing to stake any money (there’s a rough sketch of the submission flow after this list).
  3. Get really good by analysing and understanding the dataset. Note that the dataset has been updated since the version I was using back in 2022.
  4. Start staking your own models with small sums that you’re happy to lose. Get skin in the game.
  5. Check out my repo to get ideas for other approaches to try and model architecture inspiration.
  6. Check out the Numerai forum and bounce ideas around with others. Most people are doing this for fun, so they’re willing to share information and code.
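
To make tip 2 concrete, here’s a rough sketch of the weekly submission flow using the official numerapi client. The dataset path, model name, and credentials are placeholders, and the file layout has changed across dataset versions, so check the docs for the current names.

import pandas as pd
from numerapi import NumerAPI

# Placeholders: substitute your own API credentials and model name
napi = NumerAPI(public_id="YOUR_PUBLIC_ID", secret_key="YOUR_SECRET_KEY")

# Download the current live data (file paths differ between dataset versions,
# so check the docs for the current one)
napi.download_dataset("v4.1/live.parquet", "live.parquet")
live = pd.read_parquet("live.parquet")

# Dummy predictions just to show the expected format: one value per live id
preds = pd.DataFrame({"prediction": 0.5}, index=live.index)
preds.to_csv("predictions.csv", index_label="id")

# Look up the model slot to submit under, then upload the file
model_id = napi.get_models()["your_model_name"]
napi.upload_predictions("predictions.csv", model_id=model_id)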