Data Interfaces

This module includes classes that download, store, and serve market data.

The two main abstractions are SymbolData and MarketData. Neither is exposed outside this module; their derived classes are. If you want to interface Cvxportfolio with financial data sources other than the ones we provide, you should derive from either of those two classes.

Single-symbol data download and storage

class cvxportfolio.YahooFinance(symbol, storage_backend='pickle', base_location=PosixPath('/home/docs/cvxportfolio_data'), grace_period=Timedelta('1 days 00:00:00'))

Yahoo Finance symbol data.

Added in version 1.2.0: The data cleaning logic has been significantly improved, see the data_cleaning.py example to view what’s done on any given name (or enable 'INFO' logging messages). It is recommended to delete the ~/cvxportfolio_data folder with data files downloaded by previous Cvxportfolio versions.

Parameters:
  • symbol (str) – The symbol to download.

  • storage_backend (str) – The storage backend; implemented ones are 'pickle', 'csv', and 'sqlite'. By default 'pickle'.

  • base_location (pathlib.Path) – The location of the storage. We store in a subdirectory named after the class which derives from this. By default it’s a directory named cvxportfolio_data in your home folder.

  • grace_period (pandas.Timedelta) – If the most recent observation in the data is more recent than this, we do not download new data. By default it’s one day.

Attribute data:

The downloaded, and cleaned, data for the symbol.
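
The grace_period rule can be pictured with a minimal sketch. It assumes the check compares the age of the newest stored observation against the grace period; the helper name needs_download is hypothetical and is not part of Cvxportfolio.

```python
import pandas as pd


def needs_download(last_observation, now, grace_period=pd.Timedelta(days=1)):
    """Return True if the newest stored observation is at least grace_period old.

    Hypothetical helper for illustration, not part of Cvxportfolio.
    """
    return (now - last_observation) >= grace_period


now = pd.Timestamp('2024-01-10 15:00', tz='UTC')
fresh = pd.Timestamp('2024-01-10 09:30', tz='UTC')  # 5.5 hours old
stale = pd.Timestamp('2024-01-08 09:30', tz='UTC')  # more than 2 days old

print(needs_download(fresh, now))  # False: within the grace period
print(needs_download(stale, now))  # True: older than the grace period
```

With the default of one day, repeated runs on the same day would reuse the stored data rather than query the data source again.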

class cvxportfolio.Fred(symbol, storage_backend='pickle', base_location=PosixPath('/home/docs/cvxportfolio_data'), grace_period=Timedelta('1 days 00:00:00'))

Fred single-symbol data.

Parameters:
  • symbol (str) – The symbol to download.

  • storage_backend (str) – The storage backend; implemented ones are 'pickle', 'csv', and 'sqlite'. By default 'pickle'.

  • base_location (pathlib.Path) – The location of the storage. We store in a subdirectory named after the class which derives from this. By default it’s a directory named cvxportfolio_data in your home folder.

  • grace_period (pandas.Timedelta) – If the most recent observation in the data is more recent than this, we do not download new data. By default it’s one day.

Attribute data:

The downloaded data for the symbol.

Market data servers

class cvxportfolio.UserProvidedMarketData(returns, volumes=None, prices=None, copy_dataframes=True, trading_frequency=None, min_history=Timedelta('365 days 05:45:36'), base_location=PosixPath('/home/docs/cvxportfolio_data'), grace_period=Timedelta('1 days 00:00:00'), cash_key='USDOLLAR', online_usage=False, universe_selection_in_time=None)

User-provided market data.

Added in version 1.3.0: The new parameter universe_selection_in_time is used to optionally exclude assets from the trading universe at different points in time.

Parameters:
  • returns (pandas.DataFrame) – Historical open-to-open returns. The return at time \(t\) is \(r_t = p_{t+1}/p_t -1\) where \(p_t\) is the (open) price at time \(t\). Must have datetime index. You can also include cash returns as its last column, and set cash_key below to the last column’s name.

  • volumes (pandas.DataFrame or None) – Historical market volumes, expressed in units of value (e.g., US dollars).

  • prices (pandas.DataFrame or None) – Historical open prices (e.g., used for rounding trades in the MarketSimulator).

  • trading_frequency (str or None) – Instead of using frequency implied by the index of the returns, down-sample all dataframes. We implement 'weekly', 'monthly', 'quarterly' and 'annual'. By default (None) don’t down-sample.

  • min_history (pandas.Timedelta) – Minimum amount of time for which the returns are not np.nan before each asset enters a back-test.

  • base_location (pathlib.Path) – The location of the storage, only used in case it downloads the cash returns. By default it’s a directory named cvxportfolio_data in your home folder.

  • cash_key (str) – Name of the cash account. If it is not the last column of the provided returns, it will be added; in that case you should make sure your provided dataframes have a timezone-aware datetime index. Its returns are the risk-free rate. If not in the original dataframe, choose among USDOLLAR, EURO, JPYEN, GBPOUND (Cvxportfolio downloads the historical central bank rates from FRED), or cash, which sets the cash returns to 0. Default USDOLLAR.

  • online_usage (bool) – Disable removal of assets that have np.nan returns for the given time. Default False.

  • universe_selection_in_time (pd.DataFrame or None) – Boolean dataframe used to specify which assets are included in the trading universe at each point in time. The columns are the full universe (same columns as prices, volumes, or returns without the last column, cash). The index is datetime and, unlike the usual convention, need not be the same as the index of the other dataframes: at each point in time of a back-test the last valid observation (before, or at, the trading time) is selected. (You still need to provide a timezone-aware index if the returns’ index is, otherwise time comparisons can’t be done.) The entries are boolean; True means that the corresponding asset can be invested in at that time, False that it can’t. Note that this is more fundamental than imposing time-varying position limits, for example, in an optimization-based policy. Non-investable assets are removed by the MarketSimulator: their positions are converted to cash, the result.BacktestResult dataframes will have np.nan on those dates for those assets, and the policy is re-compiled without those assets. You shouldn’t use this to make frequent changes to the trading universe; if that’s your use case, impose time-varying position limits via constraints.MinWeights and constraints.MaxWeights instead. Also, note that the filtering implied by min_history is still applied (in addition to this), as well as the requirement of non-nan returns for the period (which can be disabled by online_usage). Default None, which disables this filtering.
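
As an illustration of the open-to-open convention used by the returns parameter, the following sketch builds a returns dataframe from open prices with plain pandas. The prices and the asset names AAA and BBB are made up for the example.

```python
import pandas as pd

# Made-up open prices, for illustration only.
prices = pd.DataFrame(
    {'AAA': [100.0, 110.0, 99.0], 'BBB': [50.0, 50.0, 55.0]},
    index=pd.to_datetime(['2024-01-02', '2024-01-03', '2024-01-04']))

# r_t = p_{t+1} / p_t - 1: the return at time t uses the next open price,
# so we shift the prices back by one period before dividing.
returns = prices.shift(-1) / prices - 1

# The last row has no next open price, so it is np.nan.
print(returns)
```

Note that, with this convention, the return labeled with time t is only known at time t+1.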

serve(t)

Serve data for policy and simulator at time \(t\).

Parameters:

t (pandas.Timestamp) – Time of execution, e.g., stock market open of a given day.

Returns:

(past_returns, current_returns, past_volumes, current_volumes, current_prices)

Return type:

(pandas.DataFrame, pandas.Series, pandas.DataFrame or None, pandas.Series or None, pandas.Series or None)
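
To picture the shape of the first two returned objects, here is a minimal sketch with plain pandas. It assumes that past_returns holds the rows strictly before t and current_returns is the row at t; the helper split_at and the data are hypothetical, and this is not the library's implementation.

```python
import pandas as pd

# Made-up returns with cash as the last column.
returns = pd.DataFrame(
    {'AAA': [0.01, -0.02, 0.03], 'USDOLLAR': [0.0001, 0.0001, 0.0001]},
    index=pd.to_datetime(['2024-01-02', '2024-01-03', '2024-01-04']))


def split_at(returns, t):
    """Hypothetical helper: rows strictly before t, and the row at t."""
    past_returns = returns.loc[returns.index < t]  # DataFrame
    current_returns = returns.loc[t]               # Series indexed by asset
    return past_returns, current_returns


past_returns, current_returns = split_at(returns, pd.Timestamp('2024-01-03'))
print(past_returns)
print(current_returns)
```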

trading_calendar(start_time=None, end_time=None, include_end=True)

Get trading calendar from market data.

Parameters:
  • start_time (pandas.Timestamp) – Initial time of the trading calendar. Always inclusive if present. If None, use the first available time.

  • end_time (pandas.Timestamp) – Final time of the trading calendar. If None, use the last available time.

  • include_end (bool) – Include end time.

Returns:

Trading calendar.

Return type:

pandas.DatetimeIndex

class cvxportfolio.DownloadedMarketData(universe=(), datasource='YahooFinance', cash_key='USDOLLAR', base_location=PosixPath('/home/docs/cvxportfolio_data'), storage_backend='pickle', min_history=Timedelta('365 days 05:45:36'), grace_period=Timedelta('1 days 00:00:00'), trading_frequency=None, online_usage=False, universe_selection_in_time=None)

Market data that is downloaded.

Added in version 1.3.0: The new parameter universe_selection_in_time is used to optionally exclude assets from the trading universe at different points in time.

Parameters:
  • universe (list) – List of names as understood by the data source used, e.g., ['AAPL', 'GOOG'] if using the default Yahoo Finance data source.

  • datasource (str or SymbolData class) – The data source used.

  • cash_key (str) – Name of the cash account; its rates are added as the last column of the returns, as the risk-free rate. Choice of USDOLLAR, EURO, JPYEN, GBPOUND (Cvxportfolio downloads the historical central bank rates from FRED), or cash, which sets the cash returns to 0. Default USDOLLAR.

  • base_location (pathlib.Path) – The location of the storage. By default it’s a directory named cvxportfolio_data in your home folder.

  • storage_backend (str) – The storage backend; implemented ones are 'pickle', 'csv', and 'sqlite'. By default 'pickle'.

  • min_history (pandas.Timedelta) – Minimum amount of time for which the returns are not np.nan before each asset enters a back-test.

  • grace_period (pandas.Timedelta) – If the most recent observation of each symbol’s data is more recent than this, we do not download new data. By default it’s one day.

  • trading_frequency (str or None) – Instead of using frequency implied by the index of the returns, down-sample all dataframes. We implement 'weekly', 'monthly', 'quarterly' and 'annual'. By default (None) don’t down-sample.

  • online_usage (bool) – Disable removal of assets that have np.nan returns for the given time. Default False.

  • universe_selection_in_time (pd.DataFrame or None) – Boolean dataframe used to specify which assets are included in the trading universe at each point in time. The columns are the full universe. The index is datetime and, unlike the usual convention, need not be the same as the index of the other dataframes: at each point in time of a back-test the last valid observation (before, or at, the trading time) is selected. (You still need to provide a timezone-aware index if the returns’ index is, otherwise time comparisons can’t be done.) The entries are boolean; True means that the corresponding asset can be invested in at that time, False that it can’t. Note that this is more fundamental than imposing time-varying position limits, for example, in an optimization-based policy. Non-investable assets are removed by the MarketSimulator: their positions are converted to cash, the result.BacktestResult dataframes will have np.nan on those dates for those assets, and the policy is re-compiled without those assets. You shouldn’t use this to make frequent changes to the trading universe; if that’s your use case, impose time-varying position limits via constraints.MinWeights and constraints.MaxWeights instead. Also, note that the filtering implied by min_history is still applied (in addition to this), as well as the requirement of non-nan returns for the period (which can be disabled by online_usage). Default None, which disables this filtering.
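
The "last valid observation before, or at, the trading time" rule for universe_selection_in_time can be sketched with plain pandas. The helper investable_at, the asset names, and the choice to raise an error when no observation is available are assumptions of this sketch, not Cvxportfolio behavior.

```python
import pandas as pd

# Hypothetical selection table: it only needs a row when the universe changes,
# so its index differs from the index of the returns.
selection = pd.DataFrame(
    {'AAA': [True, True], 'BBB': [True, False]},
    index=pd.to_datetime(['2024-01-01', '2024-03-01']))


def investable_at(selection, t):
    """Hypothetical helper: names marked True in the last row at or before t."""
    past = selection.loc[selection.index <= t]
    if past.empty:
        # Choice made for this sketch only.
        raise ValueError('No selection observation at or before t.')
    last_row = past.iloc[-1]
    return list(last_row[last_row].index)


print(investable_at(selection, pd.Timestamp('2024-02-15')))  # ['AAA', 'BBB']
print(investable_at(selection, pd.Timestamp('2024-03-01')))  # ['AAA']
```

Because the last row at or before the trading time is used, a sparse table with one row per universe change is enough.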

serve(t)

Serve data for policy and simulator at time \(t\).

Parameters:

t (pandas.Timestamp) – Time of execution, e.g., stock market open of a given day.

Returns:

(past_returns, current_returns, past_volumes, current_volumes, current_prices)

Return type:

(pandas.DataFrame, pandas.Series, pandas.DataFrame or None, pandas.Series or None, pandas.Series or None)

trading_calendar(start_time=None, end_time=None, include_end=True)

Get trading calendar from market data.

Parameters:
  • start_time (pandas.Timestamp) – Initial time of the trading calendar. Always inclusive if present. If None, use the first available time.

  • end_time (pandas.Timestamp) – Final time of the trading calendar. If None, use the last available time.

  • include_end (bool) – Include end time.

Returns:

Trading calendar.

Return type:

pandas.DatetimeIndex

Base classes (for using other data sources)

class cvxportfolio.data.SymbolData(symbol, storage_backend='pickle', base_location=PosixPath('/home/docs/cvxportfolio_data'), grace_period=Timedelta('1 days 00:00:00'))

Base class for a single symbol time series data.

The data is either in the form of a Pandas Series or DataFrame and has datetime index.

This class needs to be derived. At a minimum, one should redefine the _download method, which implements the downloading of the symbol’s time series from an external source. The method takes the current (already downloaded and stored) data and is supposed to only append to it. In this way we only store new data and don’t modify already downloaded data.

Additionally, one can redefine the _preload method, which prepares the data to serve to the user, so that the data can be stored in a different format from what the user sees. We found that this separation can be useful.

This class interacts with module-level functions named _loader_BACKEND and _storer_BACKEND, where BACKEND is the name of the storage system used. We define pickle, csv, and sqlite backends. These may have limitations. See their docstrings for more information.
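
The append-only contract described above for _download can be illustrated with plain pandas. This is only a sketch of the idea, under the assumption that the data is a Series with a sorted datetime index; the helper append_only is hypothetical and does not reproduce the library's internals or the exact _download signature.

```python
import pandas as pd


def append_only(current, fresh):
    """Hypothetical helper: keep stored rows, add only strictly newer ones."""
    if current is None or len(current) == 0:
        return fresh
    new_rows = fresh.loc[fresh.index > current.index[-1]]
    return pd.concat([current, new_rows])


stored = pd.Series(
    [1.0, 2.0], index=pd.to_datetime(['2024-01-01', '2024-01-02']))

# The source revised the 2024-01-02 value and published one new observation.
fresh = pd.Series(
    [1.0, 2.5, 3.0],
    index=pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03']))

updated = append_only(stored, fresh)
print(updated)  # stored values are unchanged, only 2024-01-03 is appended
```

In this sketch a revision by the data source does not alter what was already stored, which matches the stated goal of not modifying already downloaded data.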

Parameters:
  • symbol (str) – The symbol to download.

  • storage_backend (str) – The storage backend; implemented ones are 'pickle', 'csv', and 'sqlite'. By default 'pickle'.

  • base_location (pathlib.Path) – The location of the storage. We store in a subdirectory named after the class which derives from this. By default it’s a directory named cvxportfolio_data in your home folder.

  • grace_period (pandas.Timedelta) – If the most recent observation in the data is more recent than this, we do not download new data. By default it’s one day.

Attribute data:

The downloaded data for the symbol.

class cvxportfolio.data.MarketData

Prepare, hold, and serve market data.

Method serve:

Serve data for policy and simulator at time \(t\).

serve(t)

Serve data for policy and simulator at time \(t\).

Parameters:

t (pandas.Timestamp) – Trading time. It must be included in the timestamps returned by trading_calendar().

Returns:

past_returns, current_returns, past_volumes, current_volumes, current_prices

Return type:

(pd.DataFrame, pd.Series, pd.DataFrame, pd.Series, pd.Series)

universe_at_time(t)

Return trading universe at given time.

Base implementation simply returns the index of current_returns returned by serve(). Typically a more efficient implementation can be made available.

Parameters:

t (pandas.Timestamp) – Trading time. It must be included in the timestamps returned by trading_calendar().

Returns:

Trading universe at time t.

Return type:

pd.Index

trading_calendar(start_time=None, end_time=None, include_end=True)

Get trading calendar between times.

Parameters:
  • start_time (pandas.Timestamp) – Initial time of the trading calendar. Always inclusive if present. If None, use the first available time.

  • end_time (pandas.Timestamp) – Final time of the trading calendar. If None, use the last available time.

  • include_end (bool) – Include end time.

Returns:

Trading calendar.

Return type:

pd.DatetimeIndex

property periods_per_year

Average trading periods per year.

Return type:

int
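
One way a derived class might compute such an average is to divide the number of periods by the length of the calendar in years. The sketch below assumes this definition and a year of 365.24 days; the helper average_periods_per_year is hypothetical and may differ from the library's computation.

```python
import pandas as pd


def average_periods_per_year(calendar):
    """Hypothetical helper: number of periods divided by the span in years."""
    span_years = (calendar[-1] - calendar[0]) / pd.Timedelta(days=365.24)
    return int(round(len(calendar) / span_years))


# Business days over four calendar years, as a stand-in trading calendar.
calendar = pd.bdate_range('2020-01-01', '2023-12-29')
result = average_periods_per_year(calendar)
print(result)
```

A calendar of exchange trading days, which also excludes holidays, would give a somewhat lower figure than this weekday-only calendar.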

property full_universe

Full universe, which might not be available for trading.

Returns:

Full universe.

Return type:

pd.Index

partial_universe_signature(partial_universe)

Unique signature of this instance with a partial universe.

A partial universe is a subset of the full universe that is available at some time for trading.

This is used in cvxportfolio.cache to sign back-test caches that are saved on disk. If not redefined it returns None which disables on-disk caching.

Parameters:

partial_universe (pandas.Index) – A subset of the full universe.

Returns:

Signature.

Return type:

str
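
A derived class that wants on-disk caching needs to return a string that identifies both the data and the partial universe. The sketch below shows one possible scheme, hashing the asset names in order together with a label for the data; the function universe_signature, the label MyMarketData, and the use of SHA-256 are all assumptions of this example, not what Cvxportfolio does.

```python
import hashlib


def universe_signature(partial_universe, label='MyMarketData'):
    """Hypothetical signature: a data label plus a hash of the asset names.

    The order of the names is preserved, so reordered universes get
    different signatures.
    """
    joined = ','.join(str(name) for name in partial_universe)
    digest = hashlib.sha256(joined.encode('utf-8')).hexdigest()
    return f'{label}_{digest[:16]}'


full = universe_signature(['AAPL', 'GOOG', 'USDOLLAR'])
same = universe_signature(['AAPL', 'GOOG', 'USDOLLAR'])
partial = universe_signature(['AAPL', 'USDOLLAR'])

print(full == same)     # True: same universe, same signature
print(full == partial)  # False: different universe, different signature
```

In a real derived class the label should also change whenever the underlying data changes, otherwise stale caches could be reused.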