Data cleaning

This script is used to show the data cleaning applied to Yahoo Finance data.

The procedure has many steps, with a lot of heuristics. You can see it by inspecting the code of cvxportfolio.YahooFinance (and its parent classes). All the heuristic steps are configurable with class-level constants, which can be easily overridden by subclassing for your specific usecase. None of this matters if you use Cvxportfolio with user-provided market data, but it’s crucial for our example strategies (which use this interface, and rely on the data cleaning).

This is not really an example, and one method shown here (cvxportfolio.YahooFinance._get_data_yahoo()) is not public, so not covered by the semantic versioning agreeement (it could change without notice).

from pathlib import Path
from tempfile import TemporaryDirectory
from time import sleep

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import cvxportfolio as cvx

# Here you can put any stocks for which you wish to analyze the cleaning;
# Some names with known issues:
TEST_UNIVERSE = ['SMT.L', 'NVR', 'HUBB', 'NWG.L', 'BA.L']

# Or, pick a larger universe
# from .universes import *
# TEST_UNIVERSE = DOW30

ALL_DROPPED_ROWS_PCT = pd.Series(dtype=float)
ALL_MIN_LR = pd.Series(dtype=float)
ALL_MAX_LR = pd.Series(dtype=float)

PLOT = True
SLEEP = 1

figures = {}

for stock in TEST_UNIVERSE:
    sleep(SLEEP)
    print(f'\n\t{stock}:')

    # This method is not public, it only downloads historical data from
    # Yahoo Finance; all the cleaning is applied downstream
    # pylint: disable=protected-access
    raw_yfinance = cvx.YahooFinance._get_data_yahoo(stock)
    print(f'{stock}: YAHOO FINANCE RAW')
    print(raw_yfinance)

    with TemporaryDirectory() as datadir:
        cvx_cleaned = cvx.YahooFinance(
            stock, base_location=Path(datadir)).data
    print(f'{stock}: CVXPORTFOLIO CLEANED')
    print(cvx_cleaned)

    yf_log10r = np.log10(raw_yfinance.adjclose).diff().shift(-1)
    cvx_log10r = np.log10(1 + cvx_cleaned['return'])

    if PLOT:
        figures[stock], axes = plt.subplots(
            3, figsize=(10/1.62, 10), layout='constrained')

        raw_yfinance.iloc[:, :5].plot(ax=axes[0])
        axes[0].set_yscale('log')
        axes[0].set_title(f'{stock}: RAW YAHOO FINANCE')

        cvx_cleaned.iloc[:, :4].plot(ax=axes[1])
        axes[1].set_title(f'{stock}: CVXPORTFOLIO CLEANED DATA')
        axes[1].set_yscale('log')

        (yf_log10r.cumsum() - yf_log10r.sum()).plot(
            label='Yahoo Finance total close-to-close', ax=axes[2])
        (cvx_log10r.cumsum() - cvx_log10r.sum()).plot(
            label='Cvxportfolio total open-to-open', ax=axes[2])
        axes[2].set_title(f'{stock}: CUMULATIVE LOG10 RETURNS (SCALED)')
        axes[2].legend()

    assert cvx_cleaned.index[-1] == raw_yfinance.index[-1]

    print()
    dropped_rows = len(raw_yfinance) - len(cvx_cleaned)
    dropped_rows_pct = dropped_rows / len(raw_yfinance)
    ALL_DROPPED_ROWS_PCT.loc[stock] = dropped_rows_pct*100
    print(f'Cvxportfolio dropped {int(dropped_rows_pct*100)}% of rows')

    ALL_MIN_LR.loc[stock] = np.log(1+cvx_cleaned['return']).min()
    ALL_MAX_LR.loc[stock] = np.log(1+cvx_cleaned['return']).max()

    print('Max Cvxportfolio logreturn:', ALL_MAX_LR.loc[stock])
    print('Min Cvxportfolio logreturn:', ALL_MIN_LR.loc[stock] )
    print('How many zero volumes:',
        (cvx_cleaned['valuevolume'] == 0.).mean())

print('\nCvxportfolio dropped rows %:')
print(ALL_DROPPED_ROWS_PCT.sort_values().tail())

print('\nCvxportfolio min logreturns:')
print(ALL_MIN_LR.sort_values().head())

print('\nCvxportfolio max logreturns:')
print(ALL_MAX_LR.sort_values().tail())

This is the output printed to screen when executing this script. You can see many diagnostics of the data cleaning. Scroll down to the pictures to see what is actually going on.

	SMT.L:
SMT.L: YAHOO FINANCE RAW
                                 open         low  ...      adjclose    volume
1968-12-31 07:00:00+00:00    0.048520    0.048520  ...  3.536882e-08         0
1969-01-01 07:00:00+00:00    0.048520    0.048520  ...  3.536882e-08         0
1969-01-02 07:00:00+00:00    0.048520    0.048520  ...  3.536882e-08         0
1969-01-03 07:00:00+00:00    0.048520    0.048520  ...  3.536882e-08         0
1969-01-06 07:00:00+00:00    0.048520    0.048520  ...  3.536882e-08         0
...                               ...         ...  ...           ...       ...
2024-03-28 08:00:00+00:00  886.599976  881.000000  ...  8.940000e+02  14701666
2024-04-02 07:00:00+00:00  895.799988  866.000000  ...  8.660000e+02   3155097
2024-04-03 07:00:00+00:00  866.599976  850.000000  ...  8.740000e+02   3897419
2024-04-04 07:00:00+00:00  870.000000  860.599976  ...  8.840000e+02   4624038
2024-04-05 07:00:00+00:00  864.400024  857.200012  ...  8.576000e+02   2519777

[14267 rows x 6 columns]
SMT.L: CVXPORTFOLIO CLEANED
                                 open         low  ...    return   valuevolume
1980-01-31 08:00:00+00:00    4.440000    4.440000  ...  0.000000  0.000000e+00
1980-02-01 08:00:00+00:00    4.440000    4.440000  ...  0.000000  0.000000e+00
1980-02-04 08:00:00+00:00    4.440000    4.440000  ...  0.000000  0.000000e+00
1980-02-05 08:00:00+00:00    4.440000    4.440000  ...  0.000000  0.000000e+00
1980-02-06 08:00:00+00:00    4.440000    4.440000  ...  0.000000  0.000000e+00
...                               ...         ...  ...       ...           ...
2024-03-28 08:00:00+00:00  886.599976  881.000000  ...  0.010377  1.303450e+10
2024-04-02 07:00:00+00:00  895.799988  866.000000  ... -0.032597  2.826336e+09
2024-04-03 07:00:00+00:00  866.599976  850.000000  ...  0.003923  3.377503e+09
2024-04-04 07:00:00+00:00  870.000000  860.599976  ... -0.006437  4.022913e+09
2024-04-05 07:00:00+00:00  864.400024         NaN  ...       NaN           NaN

[11375 rows x 6 columns]

Cvxportfolio dropped 20% of rows
Max Cvxportfolio logreturn: 0.23484603393311956
Min Cvxportfolio logreturn: -0.32681042362385965
How many zero volumes: 0.5425054945054945

	NVR:
NVR: YAHOO FINANCE RAW
                                  open          low  ...     adjclose   volume
1985-07-22 13:30:00+00:00    30.625000    30.250000  ...    23.394072  1036000
1985-07-23 13:30:00+00:00    30.625000    30.000000  ...    23.201525    41800
1985-07-24 13:30:00+00:00    30.000000    29.250000  ...    22.720165    16900
1985-07-25 13:30:00+00:00    29.250000    29.250000  ...    23.105253    42600
1985-07-26 13:30:00+00:00    29.750000    29.625000  ...    23.008982    27300
...                                ...          ...  ...          ...      ...
2024-04-01 13:30:00+00:00  8097.979980  7978.279785  ...  8026.270020    14400
2024-04-02 13:30:00+00:00  7925.339844  7810.009766  ...  7852.899902    17900
2024-04-03 13:30:00+00:00  7840.000000  7823.450195  ...  7914.299805    18400
2024-04-04 13:30:00+00:00  8000.000000  7780.870117  ...  7790.000000    15300
2024-04-05 13:30:00+00:00  7801.379883  7801.379883  ...  7907.890137    12600

[9755 rows x 6 columns]
NVR: CVXPORTFOLIO CLEANED
                                  open          low  ...    return   valuevolume
1993-11-01 14:30:00+00:00    10.000000    10.000000  ...  0.012500  1.120000e+05
1993-11-02 14:30:00+00:00    10.125000    10.000000  ...  0.000000  4.242375e+05
1993-11-03 14:30:00+00:00    10.125000    10.125000  ...  0.000000  1.518750e+04
1993-11-04 14:30:00+00:00    10.125000    10.000000  ... -0.024691  7.188750e+04
1993-11-05 14:30:00+00:00     9.875000     9.875000  ...  0.000000  2.804500e+05
...                                ...          ...  ...       ...           ...
2024-04-01 13:30:00+00:00  8097.979980  7978.279785  ... -0.021319  1.166109e+08
2024-04-02 13:30:00+00:00  7925.339844  7810.009766  ... -0.010768  1.418636e+08
2024-04-03 13:30:00+00:00  7840.000000  7823.450195  ...  0.020408  1.442560e+08
2024-04-04 13:30:00+00:00  8000.000000  7780.870117  ... -0.024828  1.224000e+08
2024-04-05 13:30:00+00:00  7801.379883          NaN  ...       NaN           NaN

[7661 rows x 6 columns]

Cvxportfolio dropped 21% of rows
Max Cvxportfolio logreturn: 0.18744967555078326
Min Cvxportfolio logreturn: -0.18146307446035254
How many zero volumes: 0.0005221250489492234

	HUBB:
HUBB: YAHOO FINANCE RAW
                                 open         low  ...    adjclose    volume
1972-06-05 13:30:00+00:00    3.746001    3.746001  ...    0.201951       0.0
1972-06-06 13:30:00+00:00         NaN         NaN  ...         NaN       NaN
1972-06-07 13:30:00+00:00         NaN         NaN  ...         NaN       NaN
1972-06-08 13:30:00+00:00         NaN         NaN  ...         NaN       NaN
1972-06-09 13:30:00+00:00         NaN         NaN  ...         NaN       NaN
...                               ...         ...  ...         ...       ...
2024-04-01 13:30:00+00:00  415.619995  413.924988  ...  416.799988  263665.0
2024-04-02 13:30:00+00:00  415.089996  409.779999  ...  415.540009  348782.0
2024-04-03 13:30:00+00:00  415.209991  411.329987  ...  424.399994  428768.0
2024-04-04 13:30:00+00:00  427.410004  411.239990  ...  412.559998  369897.0
2024-04-05 13:30:00+00:00  416.440002  416.295013  ...  417.079987  449636.0

[13071 rows x 6 columns]
HUBB: CVXPORTFOLIO CLEANED
                                 open         low  ...        return   valuevolume
1994-11-30 14:30:00+00:00   26.190475   26.072001  ... -2.270621e-03  6.270000e+05
1994-12-01 14:30:00+00:00   26.131001   25.535999  ... -1.813415e-02  1.327977e+06
1994-12-02 14:30:00+00:00   25.657143   25.594999  ...  7.052988e-03  3.356724e+06
1994-12-05 14:30:00+00:00   25.838095   25.715000  ... -6.416082e-08  8.139000e+05
1994-12-06 14:30:00+00:00   25.838095   25.594999  ... -7.003362e-03  3.532326e+06
...                               ...         ...  ...           ...           ...
2024-04-01 13:30:00+00:00  415.619995  413.924988  ... -1.275200e-03  1.095844e+08
2024-04-02 13:30:00+00:00  415.089996  409.779999  ...  2.890822e-04  1.447759e+08
2024-04-03 13:30:00+00:00  415.209991  411.329987  ...  2.938275e-02  1.780288e+08
2024-04-04 13:30:00+00:00  427.410004  411.239990  ... -2.566622e-02  1.580977e+08
2024-04-05 13:30:00+00:00  416.440002         NaN  ...           NaN           NaN

[7388 rows x 6 columns]

Cvxportfolio dropped 43% of rows
Max Cvxportfolio logreturn: 0.1468187851208631
Min Cvxportfolio logreturn: -0.13926177333953668
How many zero volumes: 0.0025717379534380075

	NWG.L:
NWG.L: YAHOO FINANCE RAW
                                 open         low  ...    adjclose    volume
1988-07-01 07:00:00+00:00  453.543701  453.543701  ...  184.401917         0
1988-07-04 07:00:00+00:00  449.846405  449.846405  ...  182.898712         0
1988-07-05 07:00:00+00:00  456.008606  456.008606  ...  185.404160         0
1988-07-06 07:00:00+00:00  449.846405  449.846405  ...  182.898712         0
1988-07-07 07:00:00+00:00  443.684113  443.684113  ...  180.393204         0
...                               ...         ...  ...         ...       ...
2024-03-28 08:00:00+00:00  265.000000  263.700012  ...  265.500000  25164728
2024-04-02 07:00:00+00:00  265.100006  263.700012  ...  267.100006  31032477
2024-04-03 07:00:00+00:00  266.700012  266.399994  ...  272.700012  34901617
2024-04-04 07:00:00+00:00  274.399994  273.500000  ...  280.500000  49149789
2024-04-05 07:00:00+00:00  277.500000  277.500000  ...  278.399994  20411707

[9179 rows x 6 columns]
NWG.L: CVXPORTFOLIO CLEANED
                                 open         low  ...    return   valuevolume
1988-07-01 07:00:00+00:00  453.543701  453.543701  ... -0.008152  0.000000e+00
1988-07-04 07:00:00+00:00  449.846405  449.846405  ...  0.013698  0.000000e+00
1988-07-05 07:00:00+00:00  456.008606  456.008606  ... -0.013513  0.000000e+00
1988-07-06 07:00:00+00:00  449.846405  449.846405  ... -0.013699  0.000000e+00
1988-07-07 07:00:00+00:00  443.684113  443.684113  ...  0.000000  0.000000e+00
...                               ...         ...  ...       ...           ...
2024-03-28 08:00:00+00:00  265.000000  263.700012  ...  0.000377  6.668653e+09
2024-04-02 07:00:00+00:00  265.100006  263.700012  ...  0.006035  8.226710e+09
2024-04-03 07:00:00+00:00  266.700012  266.399994  ...  0.028871  9.308262e+09
2024-04-04 07:00:00+00:00  274.399994  273.500000  ...  0.011297  1.348670e+10
2024-04-05 07:00:00+00:00  277.500000         NaN  ...       NaN           NaN

[9179 rows x 6 columns]

Cvxportfolio dropped 0% of rows
Max Cvxportfolio logreturn: 0.3883847969606009
Min Cvxportfolio logreturn: -0.9817912489871412
How many zero volumes: 0.20590478265606274

	BA.L:
BA.L: YAHOO FINANCE RAW
                                  open          low  ...     adjclose   volume
1988-07-01 07:00:00+00:00   100.918602   100.918602  ...    34.023872        0
1988-07-04 07:00:00+00:00   100.918602   100.918602  ...    34.023872        0
1988-07-05 07:00:00+00:00   102.384804   102.384804  ...    34.518188        0
1988-07-06 07:00:00+00:00   100.674301   100.674301  ...    33.941509        0
1988-07-07 07:00:00+00:00    98.963799    98.963799  ...    33.364830        0
...                                ...          ...  ...          ...      ...
2024-03-28 08:00:00+00:00  1355.000000  1344.000000  ...  1349.500000  6976596
2024-04-02 07:00:00+00:00  1354.500000  1326.500000  ...  1331.000000  7158300
2024-04-03 07:00:00+00:00  1328.500000  1302.000000  ...  1320.500000  9447110
2024-04-04 07:00:00+00:00  1317.500000  1310.250000  ...  1313.000000  5379241
2024-04-05 07:00:00+00:00  1307.000000  1303.500000  ...  1326.000000  8904552

[9179 rows x 6 columns]
BA.L: CVXPORTFOLIO CLEANED
                                  open          low  ...    return   valuevolume
1988-07-01 07:00:00+00:00   100.918602   100.918602  ...  0.000000  0.000000e+00
1988-07-04 07:00:00+00:00   100.918602   100.918602  ...  0.014529  0.000000e+00
1988-07-05 07:00:00+00:00   102.384804   102.384804  ... -0.016707  0.000000e+00
1988-07-06 07:00:00+00:00   100.674301   100.674301  ... -0.016990  0.000000e+00
1988-07-07 07:00:00+00:00    98.963799    98.963799  ...  0.101234  0.000000e+00
...                                ...          ...  ...       ...           ...
2024-03-28 08:00:00+00:00  1355.000000  1344.000000  ... -0.000369  9.453288e+09
2024-04-02 07:00:00+00:00  1354.500000  1326.500000  ... -0.019195  9.695917e+09
2024-04-03 07:00:00+00:00  1328.500000  1302.000000  ... -0.008280  1.255049e+10
2024-04-04 07:00:00+00:00  1317.500000  1310.250000  ... -0.007970  7.087150e+09
2024-04-05 07:00:00+00:00  1307.000000          NaN  ...       NaN           NaN

[9179 rows x 6 columns]

Cvxportfolio dropped 0% of rows
Max Cvxportfolio logreturn: 0.2876819961285629
Min Cvxportfolio logreturn: -0.5659161019764781
How many zero volumes: 0.0725569234121364

Cvxportfolio dropped rows %:
NWG.L     0.000000
BA.L      0.000000
SMT.L    20.270554
NVR      21.465915
HUBB     43.477928
dtype: float64

Cvxportfolio min logreturns:
NWG.L   -0.981791
BA.L    -0.565916
SMT.L   -0.326810
NVR     -0.181463
HUBB    -0.139262
dtype: float64

Cvxportfolio max logreturns:
HUBB     0.146819
NVR      0.187450
SMT.L    0.234846
BA.L     0.287682
NWG.L    0.388385
dtype: float64

And these are the figure that are plotted. For each of the test stocks, we see the original Yahoo Finance data, which has many issues, and the cleaned data that is produced, and used, by Cvxportfolio.

Data cleaning process for stock 'SMT.L':

Data cleaning process for stock ``'SMT.L'``

Data cleaning process for stock 'NVR':

Data cleaning process for stock 'NVR'

Data cleaning process for stock 'HUBB':

Data cleaning process for stock 'HUBB'

Data cleaning process for stock 'NWG.L':

Data cleaning process for stock 'NWG.L'

Data cleaning process for stock 'BA.L':

Data cleaning process for stock 'BA.L'