
Penalized regressions and sparse hedging for minimum variance portfolios

\index{penalized regression}

In this chapter, we introduce the widespread concept of regularization for linear models. There are in fact several possible applications for these models. The first one is straightforward: resort to penalizations to improve the robustness of factor-based predictive regressions. The outcome can then be used to fuel an allocation scheme. For instance, Han et al. (2022) and Rapach & Zhou (2021) use penalized regressions to improve stock return prediction when combining forecasts that emanate from individual characteristics.

Similar ideas can be developed for macroeconomic predictions, as in Uematsu & Tanaka (2019). The second application stems from a lesser-known result which originates in Stevens (1998). It links the weights of optimal mean-variance portfolios to particular cross-sectional regressions. The idea is then different: the purpose is to improve the quality of mean-variance driven portfolio weights. We present the two approaches below after an introduction to regularization techniques for linear models.

Other examples of financial applications of penalization can be found in d'Aspremont (2011), Ban et al. (2016) and Kremer et al. (2019). In any case, the idea is the same as in the seminal paper Tibshirani (1996): standard (unconstrained) optimization programs may lead to noisy estimates, thus adding a structuring constraint helps remove some noise (at the cost of a possible bias). For instance, Kremer et al. (2019) use this concept to build more robust mean-variance (Markowitz (1952)) portfolios and Freyberger et al. (2020) use it to single out the characteristics that really help explain the cross-section of equity returns.

5.1Penalized regressions

\index{penalized regression}

5.1.1Simple regressions

\index{simple regression} \index{regression} The ideas behind linear models are at least two centuries old (Legendre (1805) is an early reference on least squares optimization). Given a matrix of predictors $\textbf{X}$, we seek to decompose the output vector $\textbf{y}$ as a linear function of the columns of $\textbf{X}$ (written $\textbf{X}\boldsymbol{\beta}$) plus an error term $\boldsymbol{\epsilon}$: $\textbf{y}=\textbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}$.

The best choice of $\boldsymbol{\beta}$ is naturally the one that minimizes the error. For analytical tractability, it is the sum of squared errors that is minimized: $L=\boldsymbol{\epsilon}'\boldsymbol{\epsilon}=\sum_{i=1}^I\epsilon_i^2$. The loss $L$ is called the sum of squared residuals (SSR). In order to find the optimal $\boldsymbol{\beta}$, it is imperative to differentiate this loss $L$ with respect to $\boldsymbol{\beta}$ because the first order condition requires that the gradient be equal to zero:

$$\begin{align*} \nabla_{\boldsymbol{\beta}} L&=\frac{\partial}{\partial \boldsymbol{\beta}}(\textbf{y}-\textbf{X}\boldsymbol{\beta})'(\textbf{y}-\textbf{X}\boldsymbol{\beta})=\frac{\partial}{\partial \boldsymbol{\beta}}\left(\boldsymbol{\beta}'\textbf{X}'\textbf{X}\boldsymbol{\beta}-2\textbf{y}'\textbf{X}\boldsymbol{\beta}\right) \\ &=2\textbf{X}'\textbf{X}\boldsymbol{\beta} -2\textbf{X}'\textbf{y}, \end{align*}$$

so that the first order condition $\nabla_{\boldsymbol{\beta}}L=\textbf{0}$ is satisfied if

$$\boldsymbol{\beta}^*=(\textbf{X}'\textbf{X})^{-1}\textbf{X}'\textbf{y},$$

which is known as the standard ordinary least squares (OLS)\index{ordinary least squares} solution of the linear model. If the matrix $\textbf{X}$ has dimensions $I \times K$, then $\textbf{X}'\textbf{X}$ can only be inverted if $\textbf{X}$ has full column rank, which requires at least as many rows $I$ as columns $K$. In some cases, that may not hold: when there are more predictors than instances, there is no unique value of $\boldsymbol{\beta}$ that minimizes the loss. If $\textbf{X}'\textbf{X}$ is nonsingular (hence positive definite), then the second order condition ensures that $\boldsymbol{\beta}^*$ yields a global minimum for the loss $L$ (the second order derivative of $L$ with respect to $\boldsymbol{\beta}$, the Hessian matrix, is exactly $2\,\textbf{X}'\textbf{X}$).
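As a quick sanity check, the closed-form solution $(\textbf{X}'\textbf{X})^{-1}\textbf{X}'\textbf{y}$ can be computed directly with NumPy; the data-generating process below is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # I = 100 observations, K = 3 predictors
beta_true = np.array([0.5, -1.0, 2.0])   # illustrative coefficients
y = X @ beta_true + 0.1 * rng.normal(size=100)

# Closed-form OLS solution: beta* = (X'X)^{-1} X'y
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's dedicated least-squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_ols, beta_lstsq))
```

Note that solving the normal equations with `np.linalg.solve` is preferable to explicitly inverting $\textbf{X}'\textbf{X}$, both for speed and for numerical stability.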

Up to now, we have made no distributional assumption on any of the above quantities. Standard assumptions include errors with zero mean and constant variance that are uncorrelated with each other and with the predictors and, for exact inference, Gaussian.

Under these hypotheses, it is possible to perform statistical tests related to the β^\hat{\boldsymbol{\beta}} coefficients. We refer to chapters 2 to 4 in Greene (2018) for a thorough treatment on linear models as well as to chapter 5 of the same book for details on the corresponding tests.

5.1.2Forms of penalizations

\index{penalized regression} Penalized regressions have been popularized since the seminal work of Tibshirani (1996). The idea is to impose a constraint on the coefficients of the regression, namely that their total magnitude be restrained. In his original paper, Tibshirani (1996) proposes to estimate the following model (LASSO):\index{LASSO}

$$y_i = \sum_{j=1}^J \beta_jx_{i,j} + \epsilon_i, \quad i =1,\dots,I, \quad \text{s.t.} \quad \sum_{j=1}^J |\beta_j| < \delta,$$

for some strictly positive constant $\delta$. Under least squares minimization, this amounts to solving the Lagrangian formulation:

$$\underset{\boldsymbol{\beta}}{\min} \, \left\{ \sum_{i=1}^I\left(y_i - \sum_{j=1}^J \beta_jx_{i,j} \right)^2+\lambda \sum_{j=1}^J |\beta_j| \right\},$$

for some value $\lambda>0$ which naturally depends on $\delta$ (the lower the $\delta$, the higher the $\lambda$: the constraint is more binding). This specification seems close to the ridge regression ($L^2$ regularization), which in fact predates the LASSO:\index{ridge regression}

$$\underset{\boldsymbol{\beta}}{\min} \, \left\{ \sum_{i=1}^I\left(y_i - \sum_{j=1}^J\beta_jx_{i,j} \right)^2+\lambda \sum_{j=1}^J \beta_j^2 \right\},$$

and which is equivalent to estimating the following model

$$y_i = \sum_{j=1}^J \beta_jx_{i,j} + \epsilon_i, \quad i =1,\dots,I, \quad \text{s.t.} \quad \sum_{j=1}^J \beta_j^2 < \delta,$$

but the outcome is in fact quite different, which justifies a separate treatment. Mechanically, as the penalization intensity\index{penalization intensity} $\lambda$ increases (or as $\delta$ in the constrained formulation above decreases), the coefficients of the ridge regression all slowly decrease in magnitude towards zero. In the case of the LASSO, the convergence is somewhat more brutal, as some coefficients shrink to zero very quickly. For $\lambda$ sufficiently large, only one coefficient will remain nonzero, while in the ridge regression, the zero value is only reached asymptotically for all coefficients. We invite the interested reader to have a look at the survey in Hastie (2020) about the many applications of ridge regressions in data science, with links to other topics like cross-validation\index{cross-validation} and dropout regularization\index{dropout}, among others.\index{ridge regression}
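This contrast can be illustrated on simulated data (the design below is purely illustrative): with a moderate penalty, the LASSO sets the coefficients of irrelevant predictors exactly to zero, while ridge estimates are shrunk but remain nonzero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
# Only the first two predictors actually matter
y = 1.5 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

print("LASSO zero coefficients:", int(np.sum(lasso.coef_ == 0.0)))  # many exact zeros
print("Ridge zero coefficients:", int(np.sum(ridge.coef_ == 0.0)))  # typically none
```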

To depict the difference between the Lasso and the ridge regression, let us consider the case of $K=2$ predictors, which is shown in Figure @ref(fig:lassoridge). The optimal unconstrained solution $\boldsymbol{\beta}^*$ is pictured in red in the middle of the space. The problem is naturally that it does not satisfy the imposed conditions. These constraints are shown in light grey: they take the shape of a square $|\beta_1|+|\beta_2| \le \delta$ in the case of the Lasso and a circle $\beta_1^2+\beta_2^2 \le \delta$ for the ridge regression. In order to satisfy these constraints, the optimization needs to look in the vicinity of $\boldsymbol{\beta}^*$ by allowing for larger error levels. These error levels are shown as orange ellipsoids in the figure. When the requirement on the error is loose enough, one ellipsoid touches the acceptable boundary (in grey) and this is where the constrained solution is located.


Figure 5.1:Schematic view of Lasso (left) versus ridge (right) regressions.

Both methods work when the number of exogenous variables surpasses that of observations, i.e., in the case where classical regressions are ill-defined. This is easy to see in the case of the ridge regression for which the OLS solution is simply

$$\hat{\boldsymbol{\beta}}=(\mathbf{X}'\mathbf{X}+\lambda \mathbf{I}_K)^{-1}\mathbf{X}'\mathbf{y}.$$

The additional term $\lambda \mathbf{I}_K$ compared to Equation @ref(eq:regbeta) ensures that the inverse matrix is well-defined whenever $\lambda>0$. As $\lambda$ increases, the magnitudes of the $\hat{\beta}_i$ decrease, which explains why penalizations are sometimes referred to as shrinkage methods (the estimated coefficients see their values shrink). \index{ridge regression}
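A minimal numerical check of this closed-form expression, assuming no intercept and an arbitrary simulated design:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
y = rng.normal(size=50)
lam = 2.0

# Closed-form ridge estimator: (X'X + lambda I)^{-1} X'y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

# sklearn's Ridge (without intercept) minimizes the same objective
sk = Ridge(alpha=lam, fit_intercept=False).fit(X, y)
print(np.allclose(beta_ridge, sk.coef_, atol=1e-6))

# Shrinkage: the norm of the solution decreases as lambda grows
norms = [np.linalg.norm(np.linalg.solve(X.T @ X + l * np.eye(5), X.T @ y))
         for l in (0.1, 1.0, 10.0)]
print(norms[0] > norms[1] > norms[2])
```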

Zou & Hastie (2005) propose to benefit from the best of both worlds by combining the two penalizations in a convex manner (which they call the elasticnet):\index{elasticnet}

$$y_i = \sum_{j=1}^J \beta_jx_{i,j} + \epsilon_i, \quad \text{s.t.} \quad \alpha \sum_{j=1}^J |\beta_j| +(1-\alpha)\sum_{j=1}^J \beta_j^2< \delta, \quad i =1,\dots,I,$$

which is associated with the optimization program

$$\underset{\boldsymbol{\beta}}{\min} \, \left\{ \sum_{i=1}^I\left(y_i - \sum_{j=1}^J\beta_jx_{i,j} \right)^2+\lambda \left(\alpha\sum_{j=1}^J |\beta_j|+ (1-\alpha)\sum_{j=1}^J \beta_j^2\right) \right\}.$$

The main advantage of the LASSO compared to the ridge regression is its selection capability. Indeed, given a very large number of variables (or predictors), the LASSO will progressively rule out those that are the least relevant. The elasticnet preserves this selection ability and Zou & Hastie (2005) argue that in some cases, it is even more effective than the LASSO. The parameter $\alpha \in [0,1]$ tunes the smoothness of convergence (of the coefficients) towards zero. The closer $\alpha$ is to zero, the smoother the convergence.
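A quick sketch of this selection behavior with scikit-learn's ElasticNet on simulated data. Beware of the naming swap: sklearn's `alpha` plays the role of the $\lambda$ of the text, and `l1_ratio` plays the role of $\alpha$.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 20))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=200)

# Increase the weight of the L1 norm and count surviving coefficients
nonzeros = {}
for l1_ratio in (0.05, 0.5, 0.95):
    fit = ElasticNet(alpha=0.2, l1_ratio=l1_ratio).fit(X, y)
    nonzeros[l1_ratio] = int(np.sum(fit.coef_ != 0.0))
print(nonzeros)  # more weight on the L1 norm -> fewer nonzero coefficients
```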

5.1.3Illustrations

We begin with simple illustrations of penalized regressions. We start with the LASSO. First, we estimate the coefficients. By default, the function chooses a large array of penalization values so that the results for different penalization intensities ($\lambda$) can be shown immediately.\index{penalized regression}

from sklearn.linear_model import lasso_path, Ridge, ElasticNet
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from statsmodels.tsa.stattools import acf
import warnings
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("husl")

# Building the data & import functions
from data_build import generate_data
data_ml, features, features_short, returns, stock_ids, stock_ids_short = generate_data()
features_short =["Div_yld", "EPS", "Size12m", "Mom_LT", "Ocf", "PB", "Vol_LT"]
separation_date = "2017-01-15"

data_ml_clean = data_ml.dropna(subset=(features_short+ ['R1M']))
y_penalized = data_ml_clean['R1M'].values
x_penalized = data_ml_clean[features_short].values

# Compute LASSO path (alpha=1 in sklearn corresponds to pure LASSO)
alphas_lasso, coefs_lasso, _ = lasso_path(x_penalized, y_penalized)  # Model: LASSO

Once the coefficients are computed, they require some wrangling before plotting. Also, there are too many of them, so we only plot a subset of them.

n_features_to_plot = min(10, len(features_short)) 

plt.rcParams.update({'font.size': 14})
plt.figure(figsize=(10, 6))
for i in range(n_features_to_plot):
    plt.plot(alphas_lasso, coefs_lasso[i, :], label=features_short[i])

plt.xlabel('Lambda')
plt.ylabel('beta')
plt.xscale('log')
plt.gca().invert_xaxis()  # Higher lambda = more regularization
plt.tight_layout()
plt.legend(fontsize=12, loc='best', facecolor='white')
plt.show()

Figure 5.2:LASSO coefficients. The dependent variable is the 1 month ahead return


The graph plots the evolution of coefficients as the penalization intensity\index{penalization intensity} $\lambda$ increases. For some characteristics, like Asset Turnover (in green), the convergence to zero is rapid. Other variables resist the penalization longer, like Capex Assets, which is the last one to vanish. Essentially, this means that at the first order, this variable is an important driver of future 1-month returns in our sample.

Next, we turn to ridge regressions.

alphas_ridge = np.logspace(-2, 6, 100)  # Range of regularization values
coefs_ridge = []

for alpha in alphas_ridge:
    ridge = Ridge(alpha=alpha, fit_intercept=True)
    ridge.fit(x_penalized, y_penalized)
    coefs_ridge.append(ridge.coef_)

coefs_ridge = np.array(coefs_ridge)  # Shape: (n_alphas, n_features)

# Plot Ridge coefficients
plt.rcParams.update({'font.size': 14})
plt.figure(figsize=(10, 6))
for i in range(n_features_to_plot):
    plt.plot(alphas_ridge, coefs_ridge[:, i], label=features_short[i])

plt.xlabel('Lambda')
plt.ylabel('beta')
plt.legend(fontsize=12, loc='best', facecolor='white')
plt.xscale('log')
plt.tight_layout()
plt.show()

Figure 5.3:Ridge coefficients. The dependent variable is the 1 month ahead return


In Figure @ref(fig:sparseridge), the convergence to zero is much smoother. We underline that the x-axis (penalization intensities) has a log scale, which allows us to see the early patterns (close to zero, to the left) more clearly. As in the previous figure, the Mkt_Cap_3M_Usd predictor clearly dominates, with again large negative coefficients. Nonetheless, as $\lambda$ increases, its domination over the other predictors fades.

By definition, the elasticnet will produce curves that behave like a blend of the two above approaches. Nonetheless, as long as $\alpha >0$, the selective property of the LASSO will be preserved: some features will see their coefficients shrink rapidly to zero. In fact, the strength of the LASSO is such that a balanced mix of the two penalizations is not reached at $\alpha = 1/2$, but rather at a much smaller value (possibly below 0.1).

5.2Sparse hedging for minimum variance portfolios

\index{sparse hedging}

5.2.1Presentation and derivations

The idea of constructing sparse portfolios\index{sparse portfolios} is not new per se (see, e.g., Brodie et al. (2009), Fastrich et al. (2015)) and the link with the selective property of the LASSO is rather straightforward in classical quadratic programs. Note that the choice of the $L^1$ norm is imperative because when enforcing a simple $L^2$ norm, the diversification of the portfolio increases (see Coqueret (2015)).

The idea behind this section stems from Goto & Xu (2015) but the cornerstone result was first published by Stevens (1998) and we present it below. We provide details because the derivations are not commonplace in the literature.

In usual mean-variance allocations, one core ingredient is the inverse covariance matrix of assets, $\boldsymbol{\Sigma}^{-1}$. For instance, the maximum Sharpe ratio (MSR) portfolio is given by

$$(\#eq:MSR) \quad \mathbf{w}^{\text{MSR}}=\frac{\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}}{\mathbf{1}'\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}},$$

where $\boldsymbol{\mu}$ is the vector of expected (excess) returns. Taking $\boldsymbol{\mu}=\mathbf{1}$ yields the minimum variance portfolio,\index{minimum variance portfolio} which is agnostic in terms of the first moment of expected returns (and, as such, usually more robust than most alternatives which try to estimate $\boldsymbol{\mu}$ and often fail).

Usually, the traditional way is to estimate $\boldsymbol{\Sigma}$ and to invert it to get the MSR weights. However, several approaches aim at estimating $\boldsymbol{\Sigma}^{-1}$ \textit{directly} and we present one of them below. We proceed one asset at a time, that is, one line of $\boldsymbol{\Sigma}^{-1}$ at a time. If we decompose the matrix $\boldsymbol{\Sigma}$ into

$$\boldsymbol{\Sigma}= \left[\begin{array}{cc} \sigma^2 & \mathbf{c}' \\ \mathbf{c}& \mathbf{C}\end{array} \right],$$

classical partitioning results (e.g., Schur complements) imply

$$\small \boldsymbol{\Sigma}^{-1}= \left[\begin{array}{cc} (\sigma^2 -\mathbf{c}'\mathbf{C}^{-1}\mathbf{c})^{-1} & -(\sigma^2 -\mathbf{c}'\mathbf{C}^{-1}\mathbf{c})^{-1}\mathbf{c}'\mathbf{C}^{-1} \\ -(\sigma^2 -\mathbf{c}'\mathbf{C}^{-1}\mathbf{c})^{-1}\mathbf{C}^{-1}\mathbf{c} & \mathbf{C}^{-1}+ (\sigma^2 -\mathbf{c}'\mathbf{C}^{-1}\mathbf{c})^{-1}\mathbf{C}^{-1}\mathbf{c}\mathbf{c}'\mathbf{C}^{-1}\end{array} \right].$$

We are interested in the first line, which has two components: the factor $(\sigma^2 -\mathbf{c}'\mathbf{C}^{-1}\mathbf{c})^{-1}$ and the line vector $-\mathbf{c}'\mathbf{C}^{-1}$. $\mathbf{C}$ is the covariance matrix of assets $2$ to $N$ and $\mathbf{c}$ is the vector of covariances between the first asset and all other assets. The first line of $\boldsymbol{\Sigma}^{-1}$ is

$$(\#eq:sparse1) \quad (\sigma^2 -\mathbf{c}'\mathbf{C}^{-1}\mathbf{c})^{-1} \left[1 \quad \underbrace{-\mathbf{c}'\mathbf{C}^{-1}}_{N-1 \text{ terms}} \right].$$
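This partitioned-inverse formula is easy to verify numerically on a random positive-definite matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(6, 6))
Sigma = A @ A.T + 0.5 * np.eye(6)        # a random positive-definite "covariance" matrix

sigma2 = Sigma[0, 0]                     # variance of asset 1
c = Sigma[1:, 0]                         # covariances between asset 1 and the others
C = Sigma[1:, 1:]                        # covariance matrix of assets 2 to N

Cinv_c = np.linalg.solve(C, c)
factor = 1.0 / (sigma2 - c @ Cinv_c)     # (sigma^2 - c'C^{-1}c)^{-1}
first_line = factor * np.concatenate(([1.0], -Cinv_c))

print(np.allclose(first_line, np.linalg.inv(Sigma)[0, :]))
```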

We now consider an alternative setting. We regress the returns of the first asset on those of all other assets:

$$(\#eq:sparseeq) \quad r_{1,t}=a_1+\sum_{n=2}^N\beta_{1|n}r_{n,t}+\epsilon_t, \quad \text{ i.e., } \quad \mathbf{r}_1=a_1\mathbf{1}_T+\mathbf{R}_{-1}\boldsymbol{\beta}_1+\boldsymbol{\epsilon}_1,$$

where $\mathbf{R}_{-1}$ gathers the returns of all assets except the first one. The OLS estimator for $\boldsymbol{\beta}_1$ is

$$(\#eq:sparse2) \quad \hat{\boldsymbol{\beta}}_{1}=\mathbf{C}^{-1}\mathbf{c},$$

and this is the partitioned form (when a constant is included in the regression) stemming from the Frisch-Waugh-Lovell theorem (see chapter 3 in Greene (2018)).

In addition,

$$(\#eq:sparse3) \quad (1-R^2)\sigma_{\mathbf{r}_1}^2=\sigma_{\mathbf{r}_1}^2- \mathbf{c}'\mathbf{C}^{-1}\mathbf{c} =\sigma^2_{\epsilon_1}.$$

The proof of this last fact is given below.

With $\mathbf{X}$ being the concatenation of $\mathbf{1}_T$ with the returns $\mathbf{R}_{-1}$ and with $\mathbf{y}=\mathbf{r}_1$, the classical expression of the $R^2$ is

$$R^2=1-\frac{\boldsymbol{\epsilon}'\boldsymbol{\epsilon}}{T\sigma_Y^2}=1-\frac{\mathbf{y}'\mathbf{y}-\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}}}{T\sigma_Y^2}=1-\frac{\mathbf{y}'\mathbf{y}-\mathbf{y}'\mathbf{X}\hat{\boldsymbol{\beta}}}{T\sigma_Y^2},$$

with fitted values $\mathbf{X}\hat{\boldsymbol{\beta}}= \hat{a}_1\mathbf{1}_T+\mathbf{R}_{-1}\mathbf{C}^{-1}\mathbf{c}$. Hence,

$$\begin{align*} T\sigma_{\mathbf{r}_1}^2R^2&=T\sigma_{\mathbf{r}_1}^2-\mathbf{r}'_1\mathbf{r}_1+\hat{a}_1\mathbf{1}'_T\mathbf{r}_1+\mathbf{r}'_1\mathbf{R}_{-1}\mathbf{C}^{-1}\mathbf{c} \\ T(1-R^2)\sigma_{\mathbf{r}_1}^2&=\mathbf{r}'_1\mathbf{r}_1-\hat{a}_1\mathbf{1}'_T\mathbf{r}_1-\left(\tilde{\mathbf{r}}_1+\frac{\mathbf{1}_T\mathbf{1}'_T}{T}\mathbf{r}_1\right)'\left(\tilde{\mathbf{R}}_{-1}+\frac{\mathbf{1}_T\mathbf{1}'_T}{T}\mathbf{R}_{-1}\right)\mathbf{C}^{-1}\mathbf{c} \\ T(1-R^2)\sigma_{\mathbf{r}_1}^2&=\mathbf{r}'_1\mathbf{r}_1-\hat{a}_1\mathbf{1}'_T\mathbf{r}_1-T\mathbf{c}'\mathbf{C}^{-1}\mathbf{c} -\mathbf{r}'_1\frac{\mathbf{1}_T\mathbf{1}'_T}{T}\mathbf{R}_{-1} \mathbf{C}^{-1}\mathbf{c} \\ T(1-R^2)\sigma_{\mathbf{r}_1}^2&=\mathbf{r}'_1\mathbf{r}_1-\frac{(\mathbf{1}'_T\mathbf{r}_1)^2}{T}- T\mathbf{c}'\mathbf{C}^{-1}\mathbf{c} \\ (1-R^2)\sigma_{\mathbf{r}_1}^2&=\sigma_{\mathbf{r}_1}^2- \mathbf{c}'\mathbf{C}^{-1}\mathbf{c} \end{align*}$$

where in the fourth equality we have plugged $\hat{a}_1=\frac{\mathbf{1}'_T}{T}(\mathbf{r}_1-\mathbf{R}_{-1}\mathbf{C}^{-1}\mathbf{c})$. Note that there is probably a simpler proof, see, e.g., section 3.5 in Greene (2018).

Combining @ref(eq:sparse1), @ref(eq:sparse2) and @ref(eq:sparse3), we get that the first line of $\boldsymbol{\Sigma}^{-1}$ is equal to

$$(\#eq:sparsehedgeeq2) \quad \frac{1}{\sigma^2_{\epsilon_1}}\times \left[ 1 \quad -\hat{\boldsymbol{\beta}}_1'\right].$$

Given the first line of $\boldsymbol{\Sigma}^{-1}$, it suffices to multiply by $\boldsymbol{\mu}$ to get the portfolio weight in the first asset (up to a scaling constant).
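The identity above can also be checked numerically on simulated returns: regressing the demeaned returns of one asset on those of the others and scaling by the residual variance recovers the corresponding line of the inverse sample covariance matrix (note that the sample covariance must use the $1/T$ normalization for the identity to be exact):

```python
import numpy as np

rng = np.random.default_rng(5)
T, N = 300, 5
corr = np.full((N, N), 0.3) + 0.7 * np.eye(N)                    # simple correlation structure
R = rng.normal(size=(T, N)) @ np.linalg.cholesky(corr).T * 0.02  # toy "returns"

# Sample covariance with denominator T (bias=True) so that the identity is exact
Sigma_inv = np.linalg.inv(np.cov(R, rowvar=False, bias=True))

# Regress asset 1 on all other assets (demeaning plays the role of the constant)
Rd = R - R.mean(axis=0)
y, X = Rd[:, 0], Rd[:, 1:]
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
eps = y - X @ beta_hat

first_line = np.concatenate(([1.0], -beta_hat)) / np.var(eps)
print(np.allclose(first_line, Sigma_inv[0, :]))
```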

There is a nice economic intuition behind the above results which justifies the term “sparse hedging”.\index{sparse hedging} We take the case of the minimum variance portfolio,\index{minimum variance portfolio} for which $\boldsymbol{\mu}=\mathbf{1}$. In Equation @ref(eq:sparseeq), we try to explain the return of asset 1 with that of all other assets. In the above equation, up to a scaling constant, the portfolio has a unit position in the first asset and $-\hat{\boldsymbol{\beta}}_1$ positions in all other assets. Hence, the purpose of all other assets is clearly to hedge the return of the first one. In fact, these positions are aimed at minimizing the squared errors of the aggregate portfolio for the first asset (these errors are exactly $\boldsymbol{\epsilon}_1$). Moreover, the scaling factor $\sigma^{-2}_{\epsilon_1}$ is also simple to interpret: the more we trust the regression output (because of a small $\sigma^{2}_{\epsilon_1}$), the more we invest in the hedging portfolio of the asset.

This reasoning is easily generalized to any line of $\boldsymbol{\Sigma}^{-1}$, which can be obtained by regressing the returns of asset $i$ on the returns of all other assets. If the allocation scheme has the form @ref(eq:MSR) for given values of $\boldsymbol{\mu}$, then the pseudo-code for the sparse portfolio strategy is the following.

At each date (which we omit for notational convenience),

  • For all stocks $i$,

  1. estimate the elasticnet regression over the $t=1,\dots,T$ samples to get the $i^{th}$ line of $\hat{\boldsymbol{\Sigma}}^{-1}$:

    $$\small \left[\hat{\boldsymbol{\Sigma}}^{-1}\right]_{i,\cdot}= \underset{\boldsymbol{\beta}_{i|}}{\text{argmin}}\, \left\{\sum_{t=1}^T\left( r_{i,t}-a_i-\sum_{n\neq i}^N\beta_{i|n}r_{n,t}\right)^2+\lambda \alpha || \boldsymbol{\beta}_{i|}||_1+\lambda (1-\alpha)||\boldsymbol{\beta}_{i|}||_2^2\right\}$$
  2. to get the weight of asset $i$, compute the $\boldsymbol{\mu}$-weighted sum: $w_i= \sigma_{\epsilon_i}^{-2}\left(\mu_i- \sum_{j\neq i}\beta_{i|j}\mu_j\right)$,

where we recall that the vectors $\boldsymbol{\beta}_{i|}=[\beta_{i|1},\dots,\beta_{i|i-1},\beta_{i|i+1},\dots,\beta_{i|N}]$ are the coefficients from regressing the returns of asset $i$ against the returns of all other assets.
The introduction of the penalization norms is the new ingredient, compared to the original approach of Stevens (1998). The benefits are twofold: first, introducing constraints yields weights that are more robust and less subject to errors in the estimates of $\boldsymbol{\mu}$; second, because of sparsity, weights are more stable and less leveraged, so that the strategy is less impacted by transaction costs.

Before we turn to numerical applications, we mention a more direct route to the estimation of a robust inverse covariance matrix: the Graphical LASSO. The GLASSO estimates the precision matrix (inverse covariance matrix) via maximum likelihood while imposing penalizations on the coefficients of the matrix. When the penalization is strong enough, this yields a sparse matrix, i.e., a matrix in which some and possibly many coefficients are zero. We refer to the original article Friedman et al. (2008) for more details on this subject.
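A minimal sketch of the GLASSO with scikit-learn's GraphicalLasso on simulated data with no genuine cross-sectional structure (the penalization level below is arbitrary):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(11)
R = rng.normal(size=(400, 8))                 # toy data, independent columns

gl = GraphicalLasso(alpha=0.2).fit(R)         # alpha = penalization intensity
precision = gl.precision_                     # penalized estimate of the inverse covariance

off_diag = precision[~np.eye(8, dtype=bool)]
n_zero = int(np.sum(np.abs(off_diag) < 1e-8)) # entries set (numerically) to zero
print("zero off-diagonal entries:", n_zero, "out of", off_diag.size)
```

With a strong enough penalty, most spurious off-diagonal links are removed from the precision matrix, which is exactly the sparsity property discussed above.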

5.2.2Example

The interest of sparse hedging portfolios\index{sparse hedging} is to propose a robust approach to the estimation of minimum variance policies.\index{minimum variance portfolio} Indeed, since the vector of expected returns $\boldsymbol{\mu}$ is usually very noisy, a simple solution is to adopt an agnostic view by setting $\boldsymbol{\mu}=\mathbf{1}$. In order to test the added value of the sparsity constraint, we must resort to a full backtest.\index{backtest} In doing so, we anticipate the content of Chapter @ref(backtest).

We first prepare the variables. Sparse portfolios are based on returns only; we thus base our analysis on the dedicated variable in matrix/rectangular format (returns) which were created at the end of Chapter @ref(notdata).

Then, we initialize the output variables: portfolio weights and portfolio returns. We want to compare three strategies: an equally weighted (EW)\index{equally weighted portfolio} benchmark of all stocks, the classical global minimum variance portfolio (GMV) and the sparse-hedging approach to minimum variance.

# Out-of-sample dates
t_oos = returns.loc[returns['date'] > separation_date, 'date'].unique()
t_oos = pd.to_datetime(t_oos)                                          # Transform to datetime format

Tt = len(t_oos)                                                        # Nb of dates
nb_port = 3                                                            # Nb of portfolios/strats.
n_assets = returns.shape[1] - 1                                        # Number of assets (excluding date)
portf_weights = np.zeros((Tt, nb_port, n_assets))                      # Initial portf. weights
portf_returns = np.zeros((Tt, nb_port))                                # Initial portf. returns

Next, because it is the purpose of this section, we isolate the computation of the weights of sparse-hedging portfolios. In the case of minimum variance portfolios, when $\boldsymbol{\mu}=\mathbf{1}$, the weight in asset 1 will simply be the sum of all terms in Equation @ref(eq:sparsehedgeeq2) and the other weights have similar forms.


def weights_sparsehedge(returns, alpha, lambda_param):  # The parameters are defined here
    """Compute sparse hedging portfolio weights."""
    n_assets = returns.shape[1]
    w = np.zeros(n_assets)                              # Initiate weights
    
    for i in range(n_assets):                           # Loop on the assets
        y = returns[:, i]                               # Dependent variable
        x = np.delete(returns, i, axis=1)               # Independent variables (all except i)
        
        # ElasticNet in sklearn: alpha = lambda, l1_ratio = alpha (mixing parameter)
        fit = ElasticNet(alpha=lambda_param, l1_ratio=alpha, fit_intercept=True)
        fit.fit(x, y)
        
        err = y - fit.predict(x)                        # Prediction errors
        w[i] = (1 - np.sum(fit.coef_)) / np.var(err)    # Output: weight of asset i
    
    return w / np.sum(w)                                # Normalisation of weights

We define a meta-weighting function that embeds three strategies.

def weights_multi(returns, j, alpha, lambda_param):
    """Compute portfolio weights for different strategies."""
    N = returns.shape[1]
    
    if j == 1:                                          # j = 1 => EW
        return np.ones(N) / N
    
    if j == 2:                                          # j = 2 => Minimum Variance
        sigma = np.cov(returns, rowvar=False) + 0.01 * np.eye(N)  # Covariance matrix + regularizing term
        w = np.linalg.solve(sigma, np.ones(N))          # Inverse & multiply
        return w / np.sum(w)                            # Normalize
    
    if j == 3:                                          # j = 3 => Penalised / elasticnet
        return weights_sparsehedge(returns, alpha, lambda_param)

Finally, we proceed to the backtesting loop.

for t in range(len(t_oos)):                                              # Loop = rebal. dates
    # Data for weights (expanding window)
    temp_data = returns.loc[returns['date'] < t_oos[t]].drop(columns='date').values
    
    # OOS returns
    realised_returns = returns.loc[returns['date'] == t_oos[t]].drop(columns='date').values.flatten()
    
    for j in range(1, nb_port + 1):                                      # Loop over strats
        portf_weights[t, j-1, :] = weights_multi(temp_data, j, 0.1, 0.1) # Hard-coded params!
        portf_returns[t, j-1] = np.sum(portf_weights[t, j-1, :] * realised_returns)  # Portf. returns

# Convert to DataFrame with column names
portf_returns_df = pd.DataFrame(portf_returns, columns=["EW", "MV", "Sparse"])

# Portfolio volatilities (monthly scale)
print(portf_returns_df.std())
EW        0.059407
MV        0.039333
Sparse    0.050662
dtype: float64

The aim of the sparse hedging\index{sparse hedging} restrictions is to provide a better estimate of the covariance structure of assets so that the estimation of minimum variance portfolio weights is more accurate. From the above exercise, we see that the monthly volatility is indeed reduced, relative to the equally weighted benchmark, when building covariance matrices based on sparse hedging relationships. In this particular run, the regularized sample covariance matrix reaches an even lower volatility, so the ranking of the two minimum variance approaches depends on the sample; monthly estimates of correlations between assets remain noisy. Working with daily returns would likely improve the quality of the estimates. But the above backtest shows that the penalized methodology performs well even when the number of observations (dates) is small compared to the number of assets.

5.3Predictive regressions

\index{predictive regression}

5.3.1Literature review and principle

The topic of predictive regressions sits on a collection of very interesting articles. One influential contribution is Stambaugh (1999), where the author shows the perils of regressions in which the independent variables are autocorrelated. In this case, the usual OLS estimate is biased and must therefore be corrected. The results have since then been extended in numerous directions (see Campbell & Yogo (2006) and Hjalmarsson (2011), the survey in Gonzalo & Pitarakis (2018) and, more recently, the study of Xu (2020) on predictability over multiple horizons).

A second important topic pertains to the time-dependence of the coefficients in predictive regressions. One contribution in this direction is Dangl & Halling (2012), where coefficients are estimated via a Bayesian procedure. More recently Kelly et al. (2019) use time-dependent factor loadings to model the cross-section of stock returns. The time-varying nature of coefficients of predictive regressions is further documented by Henkel et al. (2011) for short term returns. Lastly, Farmer et al. (2023) introduce the concept of pockets of predictability: assets or markets experience different phases; in some stages, they are predictable and in some others, they aren’t. Pockets are measured both by the number of days that a t-statistic is above a particular threshold and by the magnitude of the R2R^2 over the considered period. Formal statistical tests are developed by Demetrescu et al. (2021).

The introduction of penalization within predictive regressions goes back at least to Rapach et al. (2013), where they are used to assess lead-lag relationships between US markets and other international stock exchanges.\index{penalized regression} More recently, Chinco et al. (2019) use LASSO regressions to forecast high frequency returns based on past returns (in the cross-section) at various horizons. They report statistically significant gains. Han et al. (2022) and Rapach & Zhou (2021) use LASSO and elasticnet regressions (respectively) to improve forecast combinations and single out the characteristics that matter when explaining stock returns. Recently, Lee et al. (2022) introduce small variations on the LASSO aimed at improving coefficient estimation consistency.

These contributions underline the relevance of the overlap between predictive regressions\index{penalized regression} and penalized regressions. In simple machine-learning based asset pricing, we often seek to build models such as that of Equation @ref(eq:genML). If we stick to a linear relationship and add penalization terms, then the model becomes:

$$
r_{t+1,n} = \alpha_n + \sum_{k=1}^K\beta_n^kf^k_{t,n}+\epsilon_{t+1,n}, \quad \text{s.t.} \quad (1-\alpha)\sum_{j=1}^J |\beta_j| +\alpha\sum_{j=1}^J \beta_j^2< \theta
$$

where we use $f^k_{t,n}$ and $x_{t,n}^k$ interchangeably and $\theta$ is some penalization intensity. Note that the $\alpha$ in the constraint is the elastic net mixing parameter; it should not be confused with the intercepts $\alpha_n$. Again, one of the aims of the regularization is to generate more robust estimates. If the patterns extracted hold out of sample, then

$$
\hat{r}_{t+1,n} = \hat{\alpha}_n + \sum_{k=1}^K\hat{\beta}_n^kf^k_{t,n},
$$

will be a relatively reliable proxy of future performance.
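To see what the penalty in the constraint does in practice, the following minimal sketch fits two elastic nets under weak and strong penalization (the synthetic data and parameter values are purely illustrative, not taken from the chapter's dataset). Note that scikit-learn parameterizes the penalty as `alpha * (l1_ratio * L1 + 0.5 * (1 - l1_ratio) * L2)`, so `alpha` plays the role of the intensity $\theta$ above and `l1_ratio` that of the mixing weight.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Illustrative synthetic data: 200 observations, 10 predictors,
# with only the first two predictors carrying signal
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Weak versus strong penalization, same L1/L2 mix
weak = ElasticNet(alpha=0.01, l1_ratio=0.5).fit(X, y)
strong = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)

print("Non-zero coefficients (weak penalty):  ", np.sum(weak.coef_ != 0))
print("Non-zero coefficients (strong penalty):", np.sum(strong.coef_ != 0))
```

Increasing `alpha` shrinks the coefficient vector and, through the L1 component, sets more coefficients exactly to zero: this is the sparsity effect exploited throughout the chapter.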

5.3.2 Code and results

Given the form of our dataset, implementing penalized predictive regressions is straightforward.

import numpy as np
from sklearn.linear_model import ElasticNet

training_sample = data_ml[data_ml['date'] <= separation_date].dropna(subset=(features_short + ['R1M']))
testing_sample = data_ml[data_ml['date'] > separation_date].dropna(subset=(features_short + ['R1M']))

y_penalized_train = training_sample['R1M'].values           # Dependent variable
x_penalized_train = training_sample[features_short].values  # Predictors

# Model: ElasticNet with alpha (lambda) = 0.1 and l1_ratio (alpha) = 0.1
fit_pen_pred = ElasticNet(alpha=0.1, l1_ratio=0.1, fit_intercept=True)
fit_pen_pred.fit(x_penalized_train, y_penalized_train)

We then report two key performance measures: the mean squared error and the hit ratio, which is the proportion of times that the prediction guesses the sign of the return correctly. A detailed account of metrics is given later in the book (Chapter @ref(backtest)).

x_penalized_test = testing_sample[features_short].values                     # Predictors
y_test = testing_sample['R1M'].values

predictions = fit_pen_pred.predict(x_penalized_test)

# MSE
mse = np.mean((predictions - y_test) ** 2)
print(f"MSE: {mse}")

# Hit ratio
hit_ratio = np.mean(predictions * y_test > 0)
print(f"Hit ratio: {hit_ratio}")
MSE: 0.01020271805963522
Hit ratio: 0.5518135351648892

5.4 Coding exercise

On the test sample, evaluate the impact of the two elastic net parameters on out-of-sample accuracy.
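One possible skeleton for this exercise is a plain grid search over `alpha` and `l1_ratio`. The grid values and the synthetic stand-in data below are illustrative; in practice, reuse `x_penalized_train`, `y_penalized_train` and their test-sample counterparts from the code above.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Stand-in data (replace with the chapter's training/testing samples)
rng = np.random.default_rng(42)
X_train, X_test = rng.normal(size=(500, 8)), rng.normal(size=(200, 8))
beta = np.array([0.2, -0.1, 0, 0, 0, 0, 0, 0])
y_train = X_train @ beta + rng.normal(scale=0.3, size=500)
y_test = X_test @ beta + rng.normal(scale=0.3, size=200)

results = {}
for alpha in [0.001, 0.01, 0.1, 1.0]:       # penalty intensity
    for l1_ratio in [0.1, 0.5, 0.9]:        # L1/L2 mix
        model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio).fit(X_train, y_train)
        pred = model.predict(X_test)
        mse = np.mean((pred - y_test) ** 2)
        hit = np.mean(pred * y_test > 0)
        results[(alpha, l1_ratio)] = (mse, hit)

best = min(results, key=lambda k: results[k][0])
print(f"Best (alpha, l1_ratio) by out-of-sample MSE: {best}")
```

Plotting MSE and hit ratio against `alpha` for each `l1_ratio` then makes the sensitivity of out-of-sample accuracy to the two parameters visible.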

References
  1. Han, Y., He, A., Rapach, D., & Zhou, G. (2022). Firm Characteristics and Expected Stock Returns. Review of Financial Studies, 35(2), 882–921.
  2. Rapach, D., & Zhou, G. (2021). Time-Series and Cross-Sectional Stock Return Forecasting: New Machine Learning Methods. Journal of Financial Economics, 141(2), 399–424.
  3. Uematsu, Y., & Tanaka, S. (2019). High-dimensional macroeconomic forecasting and variable selection via penalized regression. Econometrics Journal, 22(1), 34–56.
  4. Stevens, G. V. (1998). On the inverse of the covariance matrix in portfolio analysis. Journal of Finance, 53(5), 1821–1827.
  5. d’Aspremont, A. (2011). Identifying small mean-reverting portfolios. Quantitative Finance, 11(3), 351–364.
  6. Ban, G.-Y., El Karoui, N., & Lim, A. E. (2016). Machine learning and portfolio optimization. Management Science, 64(3), 1136–1154.
  7. Kremer, P. J., Lee, S., Bogdan, M., & Paterlini, S. (2019). Sparse Portfolio Selection via the sorted l1-Norm. Journal of Banking & Finance, 105687.
  8. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal Of The Royal Statistical Society. Series B (Methodological), 267–288.
  9. Markowitz, H. (1952). Portfolio selection. Journal of Finance, 7(1), 77–91.
  10. Freyberger, J., Neuhierl, A., & Weber, M. (2020). Dissecting Characteristics Nonparametrically. Review of Financial Studies, 33(5), 2326–2377.
  11. Legendre, A. M. (1805). Nouvelles méthodes pour la détermination des orbites des comètes. F. Didot.
  12. Greene, W. H. (2018). Econometric analysis, Eighth Edition. Pearson Education.
  13. Hastie, T. (2020). Ridge regression: an essential concept in data science. arXiv Preprint, 2006.00371.
  14. Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal Of The Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320.
  15. Brodie, J., Daubechies, I., De Mol, C., Giannone, D., & Loris, I. (2009). Sparse and stable Markowitz portfolios. Proceedings of the National Academy of Sciences, 106(30), 12267–12272.