Data and risk model estimates
This example script is available in the repository. See the docstring below for its explanation.
This is a translation of the original IPython notebook, which no longer runs for the reasons explained below.
"""This is the restored version of the IPython notebook
used to download data for the original 2016 examples:
https://github.com/cvxgrp/cvxportfolio/blob/0.0.X/examples/DataEstimatesRiskModel.ipynb
The main reasons why the original can't be used any more:
- the data source we used, "Quandl wiki" historical stock market data,
is defunct. Fortunately we saved the data in this repo, which is the
reason why it's quite big, so we can reproduce the original results.
- Pandas dropped its Panel data structure (after which it is named!); we
used it to store the risk models. We now use multi-indexed DataFrames.
The data files that we use from the notebook above are:
- ``returns.csv.gz``: Historical close-to-close total returns; note that today
:class:`cvxportfolio.DownloadedMarketData` instead computes
historical open-to-open total returns, matching the model
from the paper.
- ``volumes.csv.gz``: historical market volumes in USD, computed as volumes
in number of shares times the adjusted close price.
These are for a selection of stocks from the components of SP500 in 2016.
The data cleaning logic is shown in the notebook. We don't use the
following files from that notebook:
- ``sigmas.csv.gz``: now computed automatically by
:class:`cvxportfolio.TransactionCost`; note that intraday returns were
used in the notebook, while now we use open-to-open returns. This may
cause some noticeable difference in the market impact term of the
transaction cost estimates. In practice, the market impact term is either
negligible for small to medium investors, or needs to be tuned for the
given assets (using historical realized costs) in case of a large investor.
- ``prices.csv.gz``: unused
- ``sigma_estimate.csv.gz``: now computed automatically by
:class:`cvxportfolio.TransactionCost`
- ``volume_estimate.csv.gz``: now computed automatically by
:class:`cvxportfolio.TransactionCost`
"""
from pathlib import Path
import numpy as np
import pandas as pd
import cvxportfolio as cvx
def paper_market_data():
    """Build market data server for the paper's examples.

    The returns dataframe already includes the cash returns column
    (as its last), so we point the ``cash_key`` argument to its name.
    We also use the ``min_history`` argument, which, however, is not
    needed in this case (since we start our back-tests after the default
    minimum history of one year).

    :return: Market data for the paper.
    :rtype: :class:`cvxportfolio.UserProvidedMarketData`
    """
    returns = pd.read_csv(
        Path(__file__).parent / 'returns.csv.gz', index_col=0, parse_dates=[0])
    volumes = pd.read_csv(
        Path(__file__).parent / 'volumes.csv.gz', index_col=0, parse_dates=[0])
    # print(returns)
    return cvx.UserProvidedMarketData(
        returns=returns, volumes=volumes,
        cash_key='USDOLLAR', min_history=pd.Timedelta('0d'))
def paper_risk_model():
    """Build low-rank risk model for the paper's examples.

    This is mostly a copy-paste of the last cell of the original IPython
    notebook. The differences are that we reshape the data into multi-indexed
    dataframes and we explicitly skip the cash column.

    :return: Factor exposures, factor covariances, idiosyncratic risks.
    :rtype: pandas.DataFrame, pandas.DataFrame, pandas.DataFrame
    """
    k = 15  # number of factors kept in the low-rank model
    start_t = "2012-01-01"
    end_t = "2016-12-31"
    returns = pd.read_csv(
        Path(__file__).parent / 'returns.csv.gz',
        index_col=0, parse_dates=[0]).iloc[:, :-1] # skip cash column
    # First day of each month, starting from the last trading day
    # before start_t.
    first_days_month = pd.date_range(
        start=returns.index[next(
            i for (i, el) in enumerate(returns.index >= start_t) if el)-1],
        end=returns.index[-1], freq='MS')
    print('Computing risk model...')
    # One (k x n_assets) block of factor exposures per date.
    factor_exposures = pd.DataFrame(
        index = pd.MultiIndex.from_product([first_days_month, range(k)]),
        columns = returns.columns,
    )
    # One (k x k) factor covariance block per date.
    factor_sigma = pd.DataFrame(
        index = pd.MultiIndex.from_product([first_days_month, range(k)]),
        columns = range(k),
    )
    idyosincratic = pd.DataFrame(
        index = first_days_month, columns = returns.columns)
    for day in first_days_month:
        # Use the 730 days of returns strictly before this date.
        used_returns = returns.loc[
            (returns.index < day) & (returns.index >= day-pd.Timedelta(
                "730 days"))]
        second_moment = (
            used_returns.values.T @ used_returns.values
            / used_returns.values.shape[0])
        # eigh returns eigenvalues in ascending order: the last k are the
        # largest, the remaining ones form the residual.
        eival, eivec = np.linalg.eigh(second_moment)
        factor_sigma.loc[day] = np.diag(eival[-k:])
        factor_exposures.loc[day] = eivec[:, -k:].T
        idyosincratic.loc[day] = np.diag(
            eivec[:, :-k] @ np.diag(eival[:-k]) @ eivec[:, :-k].T)
    return factor_exposures, factor_sigma, idyosincratic
if __name__ == '__main__':
    md = paper_market_data()
    print('Trading calendar:')
    print(md.trading_calendar())
    print('Returns:')
    print(md.returns)
    print('Volumes:')
    print(md.volumes)
    factor_exposures, factor_sigma, idyosincratic = paper_risk_model()
    print("Factor exposures:")
    print(factor_exposures)
    print("Factor covariances:")
    print(factor_sigma)
    print("Idiosyncratic risks:")
    print(idyosincratic)
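The eigenvalue split used in ``paper_risk_model`` can be checked on synthetic data. The following is a minimal sketch and not part of the repository script: the random returns, the tiny sizes, and the names ``exposures``, ``factor_cov``, ``residual`` and ``store`` are assumptions made for illustration. It shows that the low-rank part plus the full residual recovers the second moment, that keeping only the residual's diagonal still preserves each asset's variance, and how a per-date block is written to and read from a multi-indexed dataframe.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_assets, n_obs, k = 6, 200, 2  # tiny sizes, for illustration only

# Synthetic daily returns (assumption); the script reads returns.csv.gz.
returns = pd.DataFrame(
    rng.normal(scale=0.01, size=(n_obs, n_assets)),
    columns=[f"A{i}" for i in range(n_assets)])

# Uncentered second moment, computed as in the script's loop.
second_moment = returns.values.T @ returns.values / returns.shape[0]

# eigh returns eigenvalues in ascending order: the last k are the largest.
eival, eivec = np.linalg.eigh(second_moment)
exposures = eivec[:, -k:].T          # shape (k, n_assets)
factor_cov = np.diag(eival[-k:])     # shape (k, k)
residual = eivec[:, :-k] @ np.diag(eival[:-k]) @ eivec[:, :-k].T
idiosyncratic = np.diag(residual)    # shape (n_assets,)

# Low-rank part plus the full residual recovers the second moment.
low_rank = exposures.T @ factor_cov @ exposures
full_reconstruction_ok = np.allclose(low_rank + residual, second_moment)

# Keeping only the residual's diagonal preserves each asset's variance.
approx = low_rank + np.diag(idiosyncratic)
diagonal_preserved = np.allclose(np.diag(approx), np.diag(second_moment))

# Multi-indexed storage: one (k x n_assets) block per date.
days = pd.date_range("2012-01-01", periods=2, freq="MS")
store = pd.DataFrame(
    index=pd.MultiIndex.from_product([days, range(k)]),
    columns=returns.columns)
store.loc[days[0]] = exposures
block = store.loc[days[0]]  # rows indexed by factor number 0..k-1

print(full_reconstruction_ok, diagonal_preserved, block.shape)
```

Selecting with the first index level alone, as in ``store.loc[days[0]]``, returns the whole block for that date, which is the role the Panel structure played in the original notebook.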