Data cleaning¶
This script is used to show the data cleaning applied to Yahoo Finance data.
The procedure has many steps, with a lot of heuristics. You can see it
by inspecting the code of cvxportfolio.YahooFinance
(and its parent
classes). All the heuristic steps are configurable with class-level constants,
which can be easily overridden by subclassing for your specific usecase. None
of this matters if you use Cvxportfolio with user-provided market data, but
it’s crucial for our example strategies (which use this interface, and rely
on the data cleaning).
This is not really an example, and one method shown here
(cvxportfolio.YahooFinance._get_data_yahoo()
) is not public,
so not covered by the semantic versioning agreeement (it could change
without notice).
from pathlib import Path
from tempfile import TemporaryDirectory
from time import sleep
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import cvxportfolio as cvx
# Here you can put any stocks for which you wish to analyze the cleaning;
# Some names with known issues:
TEST_UNIVERSE = ['SMT.L', 'NVR', 'HUBB', 'NWG.L', 'BA.L']
# Or, pick a larger universe
# from .universes import *
# TEST_UNIVERSE = DOW30
ALL_DROPPED_ROWS_PCT = pd.Series(dtype=float)
ALL_MIN_LR = pd.Series(dtype=float)
ALL_MAX_LR = pd.Series(dtype=float)
PLOT = True
SLEEP = 1
figures = {}
for stock in TEST_UNIVERSE:
sleep(SLEEP)
print(f'\n\t{stock}:')
# This method is not public, it only downloads historical data from
# Yahoo Finance; all the cleaning is applied downstream
# pylint: disable=protected-access
raw_yfinance = cvx.YahooFinance._get_data_yahoo(stock)
print(f'{stock}: YAHOO FINANCE RAW')
print(raw_yfinance)
with TemporaryDirectory() as datadir:
cvx_cleaned = cvx.YahooFinance(
stock, base_location=Path(datadir)).data
print(f'{stock}: CVXPORTFOLIO CLEANED')
print(cvx_cleaned)
yf_log10r = np.log10(raw_yfinance.adjclose).diff().shift(-1)
cvx_log10r = np.log10(1 + cvx_cleaned['return'])
if PLOT:
figures[stock], axes = plt.subplots(
3, figsize=(10/1.62, 10), layout='constrained')
raw_yfinance.iloc[:, :5].plot(ax=axes[0])
axes[0].set_yscale('log')
axes[0].set_title(f'{stock}: RAW YAHOO FINANCE')
cvx_cleaned.iloc[:, :4].plot(ax=axes[1])
axes[1].set_title(f'{stock}: CVXPORTFOLIO CLEANED DATA')
axes[1].set_yscale('log')
(yf_log10r.cumsum() - yf_log10r.sum()).plot(
label='Yahoo Finance total close-to-close', ax=axes[2])
(cvx_log10r.cumsum() - cvx_log10r.sum()).plot(
label='Cvxportfolio total open-to-open', ax=axes[2])
axes[2].set_title(f'{stock}: CUMULATIVE LOG10 RETURNS (SCALED)')
axes[2].legend()
assert cvx_cleaned.index[-1] == raw_yfinance.index[-1]
print()
dropped_rows = len(raw_yfinance) - len(cvx_cleaned)
dropped_rows_pct = dropped_rows / len(raw_yfinance)
ALL_DROPPED_ROWS_PCT.loc[stock] = dropped_rows_pct*100
print(f'Cvxportfolio dropped {int(dropped_rows_pct*100)}% of rows')
ALL_MIN_LR.loc[stock] = np.log(1+cvx_cleaned['return']).min()
ALL_MAX_LR.loc[stock] = np.log(1+cvx_cleaned['return']).max()
print('Max Cvxportfolio logreturn:', ALL_MAX_LR.loc[stock])
print('Min Cvxportfolio logreturn:', ALL_MIN_LR.loc[stock] )
print('How many zero volumes:',
(cvx_cleaned['valuevolume'] == 0.).mean())
print('\nCvxportfolio dropped rows %:')
print(ALL_DROPPED_ROWS_PCT.sort_values().tail())
print('\nCvxportfolio min logreturns:')
print(ALL_MIN_LR.sort_values().head())
print('\nCvxportfolio max logreturns:')
print(ALL_MAX_LR.sort_values().tail())
This is the output printed to screen when executing this script. You can see many diagnostics of the data cleaning. Scroll down to the pictures to see what is actually going on.
SMT.L:
SMT.L: YAHOO FINANCE RAW
open low ... adjclose volume
1968-12-31 07:00:00+00:00 0.048520 0.048520 ... 3.536882e-08 0
1969-01-01 07:00:00+00:00 0.048520 0.048520 ... 3.536882e-08 0
1969-01-02 07:00:00+00:00 0.048520 0.048520 ... 3.536882e-08 0
1969-01-03 07:00:00+00:00 0.048520 0.048520 ... 3.536882e-08 0
1969-01-06 07:00:00+00:00 0.048520 0.048520 ... 3.536882e-08 0
... ... ... ... ... ...
2024-03-28 08:00:00+00:00 886.599976 881.000000 ... 8.940000e+02 14701666
2024-04-02 07:00:00+00:00 895.799988 866.000000 ... 8.660000e+02 3155097
2024-04-03 07:00:00+00:00 866.599976 850.000000 ... 8.740000e+02 3897419
2024-04-04 07:00:00+00:00 870.000000 860.599976 ... 8.840000e+02 4624038
2024-04-05 07:00:00+00:00 864.400024 857.200012 ... 8.576000e+02 2519777
[14267 rows x 6 columns]
SMT.L: CVXPORTFOLIO CLEANED
open low ... return valuevolume
1980-01-31 08:00:00+00:00 4.440000 4.440000 ... 0.000000 0.000000e+00
1980-02-01 08:00:00+00:00 4.440000 4.440000 ... 0.000000 0.000000e+00
1980-02-04 08:00:00+00:00 4.440000 4.440000 ... 0.000000 0.000000e+00
1980-02-05 08:00:00+00:00 4.440000 4.440000 ... 0.000000 0.000000e+00
1980-02-06 08:00:00+00:00 4.440000 4.440000 ... 0.000000 0.000000e+00
... ... ... ... ... ...
2024-03-28 08:00:00+00:00 886.599976 881.000000 ... 0.010377 1.303450e+10
2024-04-02 07:00:00+00:00 895.799988 866.000000 ... -0.032597 2.826336e+09
2024-04-03 07:00:00+00:00 866.599976 850.000000 ... 0.003923 3.377503e+09
2024-04-04 07:00:00+00:00 870.000000 860.599976 ... -0.006437 4.022913e+09
2024-04-05 07:00:00+00:00 864.400024 NaN ... NaN NaN
[11375 rows x 6 columns]
Cvxportfolio dropped 20% of rows
Max Cvxportfolio logreturn: 0.23484603393311956
Min Cvxportfolio logreturn: -0.32681042362385965
How many zero volumes: 0.5425054945054945
NVR:
NVR: YAHOO FINANCE RAW
open low ... adjclose volume
1985-07-22 13:30:00+00:00 30.625000 30.250000 ... 23.394072 1036000
1985-07-23 13:30:00+00:00 30.625000 30.000000 ... 23.201525 41800
1985-07-24 13:30:00+00:00 30.000000 29.250000 ... 22.720165 16900
1985-07-25 13:30:00+00:00 29.250000 29.250000 ... 23.105253 42600
1985-07-26 13:30:00+00:00 29.750000 29.625000 ... 23.008982 27300
... ... ... ... ... ...
2024-04-01 13:30:00+00:00 8097.979980 7978.279785 ... 8026.270020 14400
2024-04-02 13:30:00+00:00 7925.339844 7810.009766 ... 7852.899902 17900
2024-04-03 13:30:00+00:00 7840.000000 7823.450195 ... 7914.299805 18400
2024-04-04 13:30:00+00:00 8000.000000 7780.870117 ... 7790.000000 15300
2024-04-05 13:30:00+00:00 7801.379883 7801.379883 ... 7907.890137 12600
[9755 rows x 6 columns]
NVR: CVXPORTFOLIO CLEANED
open low ... return valuevolume
1993-11-01 14:30:00+00:00 10.000000 10.000000 ... 0.012500 1.120000e+05
1993-11-02 14:30:00+00:00 10.125000 10.000000 ... 0.000000 4.242375e+05
1993-11-03 14:30:00+00:00 10.125000 10.125000 ... 0.000000 1.518750e+04
1993-11-04 14:30:00+00:00 10.125000 10.000000 ... -0.024691 7.188750e+04
1993-11-05 14:30:00+00:00 9.875000 9.875000 ... 0.000000 2.804500e+05
... ... ... ... ... ...
2024-04-01 13:30:00+00:00 8097.979980 7978.279785 ... -0.021319 1.166109e+08
2024-04-02 13:30:00+00:00 7925.339844 7810.009766 ... -0.010768 1.418636e+08
2024-04-03 13:30:00+00:00 7840.000000 7823.450195 ... 0.020408 1.442560e+08
2024-04-04 13:30:00+00:00 8000.000000 7780.870117 ... -0.024828 1.224000e+08
2024-04-05 13:30:00+00:00 7801.379883 NaN ... NaN NaN
[7661 rows x 6 columns]
Cvxportfolio dropped 21% of rows
Max Cvxportfolio logreturn: 0.18744967555078326
Min Cvxportfolio logreturn: -0.18146307446035254
How many zero volumes: 0.0005221250489492234
HUBB:
HUBB: YAHOO FINANCE RAW
open low ... adjclose volume
1972-06-05 13:30:00+00:00 3.746001 3.746001 ... 0.201951 0.0
1972-06-06 13:30:00+00:00 NaN NaN ... NaN NaN
1972-06-07 13:30:00+00:00 NaN NaN ... NaN NaN
1972-06-08 13:30:00+00:00 NaN NaN ... NaN NaN
1972-06-09 13:30:00+00:00 NaN NaN ... NaN NaN
... ... ... ... ... ...
2024-04-01 13:30:00+00:00 415.619995 413.924988 ... 416.799988 263665.0
2024-04-02 13:30:00+00:00 415.089996 409.779999 ... 415.540009 348782.0
2024-04-03 13:30:00+00:00 415.209991 411.329987 ... 424.399994 428768.0
2024-04-04 13:30:00+00:00 427.410004 411.239990 ... 412.559998 369897.0
2024-04-05 13:30:00+00:00 416.440002 416.295013 ... 417.079987 449636.0
[13071 rows x 6 columns]
HUBB: CVXPORTFOLIO CLEANED
open low ... return valuevolume
1994-11-30 14:30:00+00:00 26.190475 26.072001 ... -2.270621e-03 6.270000e+05
1994-12-01 14:30:00+00:00 26.131001 25.535999 ... -1.813415e-02 1.327977e+06
1994-12-02 14:30:00+00:00 25.657143 25.594999 ... 7.052988e-03 3.356724e+06
1994-12-05 14:30:00+00:00 25.838095 25.715000 ... -6.416082e-08 8.139000e+05
1994-12-06 14:30:00+00:00 25.838095 25.594999 ... -7.003362e-03 3.532326e+06
... ... ... ... ... ...
2024-04-01 13:30:00+00:00 415.619995 413.924988 ... -1.275200e-03 1.095844e+08
2024-04-02 13:30:00+00:00 415.089996 409.779999 ... 2.890822e-04 1.447759e+08
2024-04-03 13:30:00+00:00 415.209991 411.329987 ... 2.938275e-02 1.780288e+08
2024-04-04 13:30:00+00:00 427.410004 411.239990 ... -2.566622e-02 1.580977e+08
2024-04-05 13:30:00+00:00 416.440002 NaN ... NaN NaN
[7388 rows x 6 columns]
Cvxportfolio dropped 43% of rows
Max Cvxportfolio logreturn: 0.1468187851208631
Min Cvxportfolio logreturn: -0.13926177333953668
How many zero volumes: 0.0025717379534380075
NWG.L:
NWG.L: YAHOO FINANCE RAW
open low ... adjclose volume
1988-07-01 07:00:00+00:00 453.543701 453.543701 ... 184.401917 0
1988-07-04 07:00:00+00:00 449.846405 449.846405 ... 182.898712 0
1988-07-05 07:00:00+00:00 456.008606 456.008606 ... 185.404160 0
1988-07-06 07:00:00+00:00 449.846405 449.846405 ... 182.898712 0
1988-07-07 07:00:00+00:00 443.684113 443.684113 ... 180.393204 0
... ... ... ... ... ...
2024-03-28 08:00:00+00:00 265.000000 263.700012 ... 265.500000 25164728
2024-04-02 07:00:00+00:00 265.100006 263.700012 ... 267.100006 31032477
2024-04-03 07:00:00+00:00 266.700012 266.399994 ... 272.700012 34901617
2024-04-04 07:00:00+00:00 274.399994 273.500000 ... 280.500000 49149789
2024-04-05 07:00:00+00:00 277.500000 277.500000 ... 278.399994 20411707
[9179 rows x 6 columns]
NWG.L: CVXPORTFOLIO CLEANED
open low ... return valuevolume
1988-07-01 07:00:00+00:00 453.543701 453.543701 ... -0.008152 0.000000e+00
1988-07-04 07:00:00+00:00 449.846405 449.846405 ... 0.013698 0.000000e+00
1988-07-05 07:00:00+00:00 456.008606 456.008606 ... -0.013513 0.000000e+00
1988-07-06 07:00:00+00:00 449.846405 449.846405 ... -0.013699 0.000000e+00
1988-07-07 07:00:00+00:00 443.684113 443.684113 ... 0.000000 0.000000e+00
... ... ... ... ... ...
2024-03-28 08:00:00+00:00 265.000000 263.700012 ... 0.000377 6.668653e+09
2024-04-02 07:00:00+00:00 265.100006 263.700012 ... 0.006035 8.226710e+09
2024-04-03 07:00:00+00:00 266.700012 266.399994 ... 0.028871 9.308262e+09
2024-04-04 07:00:00+00:00 274.399994 273.500000 ... 0.011297 1.348670e+10
2024-04-05 07:00:00+00:00 277.500000 NaN ... NaN NaN
[9179 rows x 6 columns]
Cvxportfolio dropped 0% of rows
Max Cvxportfolio logreturn: 0.3883847969606009
Min Cvxportfolio logreturn: -0.9817912489871412
How many zero volumes: 0.20590478265606274
BA.L:
BA.L: YAHOO FINANCE RAW
open low ... adjclose volume
1988-07-01 07:00:00+00:00 100.918602 100.918602 ... 34.023872 0
1988-07-04 07:00:00+00:00 100.918602 100.918602 ... 34.023872 0
1988-07-05 07:00:00+00:00 102.384804 102.384804 ... 34.518188 0
1988-07-06 07:00:00+00:00 100.674301 100.674301 ... 33.941509 0
1988-07-07 07:00:00+00:00 98.963799 98.963799 ... 33.364830 0
... ... ... ... ... ...
2024-03-28 08:00:00+00:00 1355.000000 1344.000000 ... 1349.500000 6976596
2024-04-02 07:00:00+00:00 1354.500000 1326.500000 ... 1331.000000 7158300
2024-04-03 07:00:00+00:00 1328.500000 1302.000000 ... 1320.500000 9447110
2024-04-04 07:00:00+00:00 1317.500000 1310.250000 ... 1313.000000 5379241
2024-04-05 07:00:00+00:00 1307.000000 1303.500000 ... 1326.000000 8904552
[9179 rows x 6 columns]
BA.L: CVXPORTFOLIO CLEANED
open low ... return valuevolume
1988-07-01 07:00:00+00:00 100.918602 100.918602 ... 0.000000 0.000000e+00
1988-07-04 07:00:00+00:00 100.918602 100.918602 ... 0.014529 0.000000e+00
1988-07-05 07:00:00+00:00 102.384804 102.384804 ... -0.016707 0.000000e+00
1988-07-06 07:00:00+00:00 100.674301 100.674301 ... -0.016990 0.000000e+00
1988-07-07 07:00:00+00:00 98.963799 98.963799 ... 0.101234 0.000000e+00
... ... ... ... ... ...
2024-03-28 08:00:00+00:00 1355.000000 1344.000000 ... -0.000369 9.453288e+09
2024-04-02 07:00:00+00:00 1354.500000 1326.500000 ... -0.019195 9.695917e+09
2024-04-03 07:00:00+00:00 1328.500000 1302.000000 ... -0.008280 1.255049e+10
2024-04-04 07:00:00+00:00 1317.500000 1310.250000 ... -0.007970 7.087150e+09
2024-04-05 07:00:00+00:00 1307.000000 NaN ... NaN NaN
[9179 rows x 6 columns]
Cvxportfolio dropped 0% of rows
Max Cvxportfolio logreturn: 0.2876819961285629
Min Cvxportfolio logreturn: -0.5659161019764781
How many zero volumes: 0.0725569234121364
Cvxportfolio dropped rows %:
NWG.L 0.000000
BA.L 0.000000
SMT.L 20.270554
NVR 21.465915
HUBB 43.477928
dtype: float64
Cvxportfolio min logreturns:
NWG.L -0.981791
BA.L -0.565916
SMT.L -0.326810
NVR -0.181463
HUBB -0.139262
dtype: float64
Cvxportfolio max logreturns:
HUBB 0.146819
NVR 0.187450
SMT.L 0.234846
BA.L 0.287682
NWG.L 0.388385
dtype: float64
And these are the figure that are plotted. For each of the test stocks, we see the original Yahoo Finance data, which has many issues, and the cleaned data that is produced, and used, by Cvxportfolio.
Data cleaning process for stock 'SMT.L'
:
Data cleaning process for stock 'NVR'
:
Data cleaning process for stock 'HUBB'
:
Data cleaning process for stock 'NWG.L'
:
Data cleaning process for stock 'BA.L'
: