Can Machines "Learn" Finance?
- Nicolas Rapanos
- May 10, 2020
- 6 min read
Updated: Mar 10, 2021
Gu, Kelly, and Xiu (2020) (henceforth GKX) have recently published an influential paper in the Review of Financial Studies, comparing the performance of machine learning methods against more traditional statistical techniques in the context of the most widely studied problem in finance: measuring equity risk premia. I have replicated their results and written a short article sharing implementation details and tips for training neural networks with real-world financial data. My findings are generally consistent with the original paper: neural networks improve out-of-sample prediction of individual stock returns, shallow learning outperforms deep learning, and the economic gains from machine learning forecasts are large. The purpose of this blog post is to communicate the key results of my replication work in a language that's accessible to a broader audience.
Let us begin with Asset Pricing 101. When we think of asset returns (in excess of the risk-free rate), we typically have in mind an additive prediction error model:

$$ r_{i,t+1} = E_t\!\left[r_{i,t+1}\right] + \epsilon_{i,t+1}, $$

breaking asset returns down into a conditional expectation term and a residual error term. The field of asset pricing aims to understand the drivers and the dynamics of conditional expected returns. Academic finance traditionally refers to this quantity as the "risk premium" because we believe that, in equilibrium, investors are compensated for bearing one or more sources of systematic, non-diversifiable risk.
Suppose that there are K underlying sources of risk. Under rational expectations and certain assumptions about investors' preferences, we can arrive at the following representation of expected returns:

$$ E_t\!\left[r_{i,t+1}\right] = \beta_{i,t}' \lambda_t, $$

where λ is the vector of risk prices (risk premia) and β measures the exposure of each asset to the K underlying factors. Typically, we express risk factors in terms of traded portfolios; in that case, risk prices are captured by the average returns on those portfolios. Either way, an individual asset's expected return is proportional to its exposure to each risk factor and to that factor's price of risk. This is the backbone of modern asset pricing theory.
Any deviation from this relationship arises either because investors are not acting rationally due to some behavioral cause (e.g., overreaction) or because the model of investors' behavior is wrong. Of course, the underlying model can be wrong in many ways. For example, investors may be constrained by financial frictions (e.g., short-selling constraints), or a risk factor may be missing. These alternatives have planted the seeds of academic research in asset pricing and generated a voluminous literature. Another possibility, often overlooked by the literature, is the presence of nonlinearities and interaction effects. Researchers frequently recognize these limitations but resort to ad-hoc solutions: portfolio sorts are the standard way of dealing with nonlinearities, and product terms are sometimes introduced to capture interaction effects, but machine learning methods are much better suited to both problems.
Thus, we can generalize the standard beta-pricing representation of expected returns as follows:

$$ E_t\!\left[r_{i,t+1}\right] = g^*\!\left(z_{i,t}\right), $$

where z captures asset-specific characteristics and g*(·) is a flexible function of these predictors. The objective is to isolate a representation of conditional expected returns as a function of the predictor variables z that maximizes out-of-sample explanatory power for realized returns r. In particular, I use fully-connected neural networks with up to four hidden layers to learn a flexible functional form for conditional expected returns. Implementation details are beyond the scope of this post, but the interested reader is referred to my article and the original paper.
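To make the mapping from characteristics to expected returns concrete, here is a minimal Keras sketch of such a network. It is not the replication code itself: the hidden-layer widths follow the shrinking pattern used in GKX (32, 16, 8, 4 for NN1-NN4), but the regularization penalty, learning rate, and training settings shown here are placeholder values.

```python
from tensorflow import keras

def build_network(n_features, hidden_units=(32,), l1_penalty=1e-5, learning_rate=1e-3):
    """Fully-connected network approximating g*(z): characteristics in, expected return out.

    hidden_units=(32,) corresponds to the shallow NN1; deeper variants use
    (32, 16), (32, 16, 8), (32, 16, 8, 4). Penalty and learning rate are illustrative.
    """
    inputs = keras.Input(shape=(n_features,))
    x = inputs
    for units in hidden_units:
        x = keras.layers.Dense(
            units,
            activation="relu",
            kernel_regularizer=keras.regularizers.l1(l1_penalty),
        )(x)
    outputs = keras.layers.Dense(1)(x)  # linear output layer: the return forecast
    model = keras.Model(inputs, outputs)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=learning_rate), loss="mse")
    return model

# NN1 on the 94 stock-level characteristics
model = build_network(n_features=94, hidden_units=(32,))
# model.fit(z_train, r_train, validation_data=(z_val, r_val), epochs=100,
#           batch_size=10_000,
#           callbacks=[keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)])
```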
An Empirical Study of U.S. Equities
The study takes place in the U.S. equities space. I obtain monthly total individual equity returns from CRSP for all firms listed on the NYSE, AMEX, and NASDAQ. The sample begins in March 1957 (the start date of the S&P 500) and ends in December 2016, totaling 60 years. The dataset of stock-level characteristics comes directly from Dacheng Xiu's website and includes 94 characteristics (61 updated annually, 13 quarterly, and 20 monthly). GKX also include 74 industry dummy variables and cross-terms from the interaction of the stock-level characteristics with the Welch and Goyal (2008) macroeconomic predictors. For the purposes of this replication exercise, I use only the 94 stock-level characteristics in order to reduce computation time.
We divide the 60 years of data into 18 years of training sample (1957-1974), 12 years of validation sample (1975-1986), and the remaining 30 years (1987-2016) for out-of-sample testing. Because machine learning algorithms are computationally intensive, we avoid recursively refitting models each month. Instead, we refit once every year, as most of our signals are updated once per year. Each time we refit, we increase the training sample by one year. We maintain the same size of the validation sample, but roll it forward to include the most recent 12 months. Thus, we have 30 experiments, each consisting of a training, validation, and test set, as shown in the following figure.

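For concreteness, here is a short Python sketch of the splitting scheme just described, generating the 30 (training, validation, test) year ranges; it assumes the panel is simply indexed by calendar year.

```python
def make_splits(first_year=1957, first_test_year=1987, last_test_year=2016, val_window=12):
    """Expanding training window, rolling 12-year validation window, one test year per refit."""
    splits = []
    for test_year in range(first_test_year, last_test_year + 1):
        val_start = test_year - val_window        # validation covers the 12 years before the test year
        train_years = list(range(first_year, val_start))
        val_years = list(range(val_start, test_year))
        splits.append((train_years, val_years, test_year))
    return splits

splits = make_splits()
print(len(splits))                                      # 30 annual refits
print(splits[0][0][-1], splits[0][1][0], splits[0][2])  # 1974 1975 1987: the first experiment
```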
Predictive performance for individual stock return forecasts is evaluated with the out-of-sample R². The following table presents the performance of neural network architectures with 1 to 4 hidden layers (NN1-NN4). Monthly R² values range from 0.34% to 0.51%, with the best performance obtained by NN1. GKX report out-of-sample R² values between 0.33% and 0.40% for the same models. Our results are therefore quantitatively very close to the figures reported in the original paper, even though our predictions are generated by a single model (rather than an ensemble of models with different random seeds, as in GKX) and we did not include the cross-terms that would take the number of features from 94 to 920.

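For reference, the metric can be computed in a couple of lines. The sketch below pools predictions across stocks and months and, following the definition in GKX, compares against a naive forecast of zero rather than the historical mean (variable names are illustrative).

```python
import numpy as np

def r2_oos(realized, predicted):
    """Pooled out-of-sample R²; the denominator uses raw excess returns (benchmark forecast of zero)."""
    realized = np.asarray(realized, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return 1.0 - np.sum((realized - predicted) ** 2) / np.sum(realized ** 2)

# Example: stack each model's 1987-2016 test-set forecasts and realized returns, then
# r2_oos(r_test, r_hat) * 100 gives the monthly percentage R² shown in the table above.
```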
Which covariates matter? Machine learning models are often treated as black boxes whose predictions are not easily interpretable. While neural networks may be among the least transparent, least interpretable, and most heavily parameterized machine learning tools, there are methods that allow us to take a look under the hood and identify the covariates that have an important influence on model predictions.
The relative importance of individual covariates can be measured using Shapley additive explanation (SHAP) values. The Shapley value (proposed by Lloyd Shapley in 1953) is a classic method in game theory for distributing the total gains of a collaborative game among a coalition of cooperating players. In our case, we formulate a game for each individual prediction: the "total gain" is the prediction value for that instance, and the "players" are the model features of that instance. The collaborative game consists of all the features cooperating to form a prediction, and the feature attributions sum to that prediction value. Attributions can be negative or positive, since a feature can lower or raise the predicted value. Thus, Shapley values can be interpreted as the impact of a given feature on the model's output. The following figure plots the Shapley values for the top 20 most relevant features of NN1.

The covariates that matter the most can be grouped as follows:
- price trends: short-term reversal (mom1m), momentum change (chmom), industry momentum (indmom)
- liquidity variables: log market equity (mvel1), turnover volatility (std_turn), dollar volume (dolvol)
- valuation ratios and fundamental signals: earnings-to-price (ep), sales-to-price (sp), book-to-market (bm)
Shapley values can also be used to study interaction effects between different variables.
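For readers who want to reproduce this kind of analysis, here is a minimal sketch using the shap Python package. It assumes the trained Keras model and characteristic matrices from the earlier sketches (z_train, z_test) and a list feature_names holding the 94 characteristic labels; all of these names are illustrative, and only a subsample of observations is explained to keep the computation manageable.

```python
import numpy as np
import shap

# a background sample approximates the expected model output against which attributions are measured
rng = np.random.default_rng(0)
background = z_train[rng.choice(len(z_train), size=500, replace=False)]
explain_set = z_test[:2000]

# GradientExplainer supports TensorFlow/Keras models such as the network defined above
explainer = shap.GradientExplainer(model, background)
shap_values = explainer.shap_values(explain_set)
if isinstance(shap_values, list):       # some shap versions wrap single-output models in a list
    shap_values = shap_values[0]

# mean |SHAP| per characteristic gives the global feature-importance ranking
shap.summary_plot(shap_values, explain_set, feature_names=feature_names, plot_type="bar")
```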
Machine Learning Portfolios
In order to assess the economic significance of our predictions, we design a set of portfolios to directly exploit machine learning forecasts. At the end of each month, we calculate 1-month ahead out-of-sample stock return predictions for each method. We then sort stocks into deciles based on each model's forecasts. We reconstitute portfolios each month using equal weights. Finally, we construct a zero-net-investment portfolio that buys the highest expected return stocks (decile 10) and sells the lowest (decile 1). The following figure shows the cumulative log-returns of these portfolios for NN1.

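A pandas sketch of this construction is shown below; the DataFrame layout (one row per stock-month with columns 'month', 'forecast', and 'realized_return') and the variable names are illustrative.

```python
import pandas as pd

def decile_portfolios(df):
    """Sort stocks into forecast deciles each month and compute equal-weighted decile returns."""
    df = df.copy()
    # decile 1 holds the lowest forecasts, decile 10 the highest; re-sorted every month
    df["decile"] = df.groupby("month")["forecast"].transform(
        lambda x: pd.qcut(x.rank(method="first"), 10, labels=False) + 1
    )
    monthly = (
        df.groupby(["month", "decile"])["realized_return"]
          .mean()                                   # equal weights within each decile
          .unstack("decile")
    )
    monthly.columns = monthly.columns.astype(int)
    monthly["long_short"] = monthly[10] - monthly[1]  # zero-net-investment 10-1 strategy
    return monthly

# Example: cumulative log-return path of the NN1 long-short portfolio
# monthly = decile_portfolios(nn1_forecasts)
# cum_log_return = np.log1p(monthly["long_short"]).cumsum()   # requires numpy as np
```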
Realized returns generally increase monotonically with the machine learning forecasts from every model. The best 10-1 strategy comes from NN1, which returns on average 3.86% per month. Its monthly volatility is 4.94%, which amounts to an annualized out-of-sample Sharpe ratio of 2.71. These results come, of course, with a huge caveat, since transaction costs may well cancel out these expected trading gains. Nonetheless, it is quite encouraging that machine learning forecasts are able to identify portfolios whose out-of-sample realized performance is in line with those forecasts. Furthermore, GKX report similar results that are robust to using either value or equal weights and to including or excluding microcaps, and they also show that machine learning forecasts can be aggregated to predict the S&P 500 and other portfolios.
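As a quick sanity check, the annualized Sharpe ratio follows directly from the monthly figures:

$$ SR_{\text{annual}} = \frac{0.0386}{0.0494} \times \sqrt{12} \approx 0.78 \times 3.46 \approx 2.71. $$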
Conclusion
Going back to the question in the title of this post, it seems that, indeed, machines can "learn" finance. Despite the low signal-to-noise ratio, structural breaks, and small data sets (compared to the huge data sets used in other fields, such as computer vision), neural networks are able to learn meaningful relationships from the data, and this is reflected in their out-of-sample performance. Finally, machine learning is not necessarily a "black box": tools such as Shapley additive explanation (SHAP) values can help us interpret model outputs and investigate interaction effects.