Data Loading¶

Introduction¶

deepdow offers multiple utility functions and classes that turn raw data into tensors used by Layers and Losses.

The scheme below shows the overall data model (read it starting at the top).

We dedicate an entire section to each of the elements.

Raw data¶

Let us assume that our raw data raw_df is stored in a pd.DataFrame. There are n_timesteps rows representing different timesteps with the same time frequency but potentially with gaps (due to non-business days etc.). They are indexed by a pd.DatetimeIndex. The columns are indexed by a pd.MultiIndex where the first level represents the n_assets different assets. The second level then represents the n_channels channels (indicators) like volume or close price. For the rest of this page we will be using the example below.

Asset	MSFT	MSFT	AAPL	AAPL
Channel	Close	Volume	Close	Volume
2016-01-04 00:00:00	54.80	53778000.00	105.35	67649400.00
2016-01-05 00:00:00	55.05	34079700.00	102.71	55791000.00
2016-01-06 00:00:00	54.05	39518900.00	100.70	68457400.00
2016-01-07 00:00:00	52.17	56564900.00	96.45	81094400.00
2016-01-08 00:00:00	52.33	48754000.00	96.96	70798000.00
2016-01-11 00:00:00	52.30	36943800.00	98.53	49739400.00
2016-01-12 00:00:00	52.78	36095500.00	99.96	49154200.00
2016-01-13 00:00:00	51.64	66883600.00	97.39	62439600.00
2016-01-14 00:00:00	53.11	52381900.00	99.52	63170100.00
2016-01-15 00:00:00	50.99	71820700.00	97.13	79833900.00
2016-01-19 00:00:00	50.56	43564500.00	96.66	53087700.00
2016-01-20 00:00:00	50.79	63273000.00	96.79	72334400.00
2016-01-21 00:00:00	50.48	40191200.00	96.30	52161500.00
2016-01-22 00:00:00	52.29	37555800.00	101.42	65800500.00
2016-01-25 00:00:00	51.79	34707700.00	99.44	51794500.00
2016-01-26 00:00:00	52.17	28900800.00	99.99	75077000.00
2016-01-27 00:00:00	51.22	36775200.00	93.42	133369700.00
2016-01-28 00:00:00	52.06	62513800.00	94.09	55678800.00
2016-01-29 00:00:00	55.09	83611700.00	97.34	64416500.00
2016-02-01 00:00:00	54.71	44208500.00	96.43	40943500.00

assert isinstance(raw_df, pd.DataFrame)
assert isinstance(raw_df.index, pd.DatetimeIndex)
assert isinstance(raw_df.columns, pd.MultiIndex)
assert raw_df.shape == (20, 4)
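For readers who want to reproduce the example, the table above can be turned into raw_df roughly as follows. This is only a sketch: the values are copied from the table, and the level names "Asset" and "Channel" are taken from its header rows rather than from anything deepdow requires.

```python
import pandas as pd

# Dates copied from the example table (note that 2016-01-18 is absent).
dates = pd.to_datetime([
    "2016-01-04", "2016-01-05", "2016-01-06", "2016-01-07", "2016-01-08",
    "2016-01-11", "2016-01-12", "2016-01-13", "2016-01-14", "2016-01-15",
    "2016-01-19", "2016-01-20", "2016-01-21", "2016-01-22", "2016-01-25",
    "2016-01-26", "2016-01-27", "2016-01-28", "2016-01-29", "2016-02-01",
])

# One row per timestep: MSFT Close, MSFT Volume, AAPL Close, AAPL Volume.
values = [
    [54.80, 53778000.0, 105.35, 67649400.0],
    [55.05, 34079700.0, 102.71, 55791000.0],
    [54.05, 39518900.0, 100.70, 68457400.0],
    [52.17, 56564900.0, 96.45, 81094400.0],
    [52.33, 48754000.0, 96.96, 70798000.0],
    [52.30, 36943800.0, 98.53, 49739400.0],
    [52.78, 36095500.0, 99.96, 49154200.0],
    [51.64, 66883600.0, 97.39, 62439600.0],
    [53.11, 52381900.0, 99.52, 63170100.0],
    [50.99, 71820700.0, 97.13, 79833900.0],
    [50.56, 43564500.0, 96.66, 53087700.0],
    [50.79, 63273000.0, 96.79, 72334400.0],
    [50.48, 40191200.0, 96.30, 52161500.0],
    [52.29, 37555800.0, 101.42, 65800500.0],
    [51.79, 34707700.0, 99.44, 51794500.0],
    [52.17, 28900800.0, 99.99, 75077000.0],
    [51.22, 36775200.0, 93.42, 133369700.0],
    [52.06, 62513800.0, 94.09, 55678800.0],
    [55.09, 83611700.0, 97.34, 64416500.0],
    [54.71, 44208500.0, 96.43, 40943500.0],
]

# First column level: assets, second column level: channels.
columns = pd.MultiIndex.from_product(
    [["MSFT", "AAPL"], ["Close", "Volume"]], names=["Asset", "Channel"]
)
raw_df = pd.DataFrame(values, index=dates, columns=columns)
```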

raw_to_Xy¶

The quickest way to get going given raw_df is to use the deepdow.utils.raw_to_Xy function. It performs the following steps

exclusion of undesired assets and channels (see included_assets and included_indicators)
adding missing rows - timestamps implied by the specified frequency freq
filling missing values (forward fill followed by backward fill)
computation of returns (if use_log then logarithmic else simple) - the first timestep is automatically deleted
running the rolling window (see Basics) given lookback, gap and horizon
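The row-adding, filling and return steps can be illustrated with plain pandas on a few MSFT close prices from the table. This is a sketch of the idea and not the code raw_to_Xy runs internally. In particular, which timestamp a return is labelled with is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

# MSFT close prices around the missing business day, copied from the table.
close = pd.Series(
    [51.64, 53.11, 50.99, 50.56, 50.79],
    index=pd.to_datetime(
        ["2016-01-13", "2016-01-14", "2016-01-15", "2016-01-19", "2016-01-20"]
    ),
)

# Add the rows implied by the business-day ("B") frequency.
full_index = pd.date_range(close.index[0], close.index[-1], freq="B")

# Forward fill followed by backward fill.
filled = close.reindex(full_index).ffill().bfill()

# Returns: the first timestep has no predecessor and is dropped.
log_returns = np.log(filled).diff().iloc[1:]
simple_returns = (filled / filled.shift(1) - 1).iloc[1:]
```

Here 2016-01-18 is inserted with the forward-filled price of 2016-01-15, so its return is zero in this sketch.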

We get the following outputs

X - numpy array of shape (n_samples, n_channels, lookback, n_assets) representing features
timestamps - list of length n_samples representing timestamp of each sample
y - numpy array of shape (n_samples, n_channels, horizon, n_assets) representing targets
asset_names - list of length n_assets representing asset names
indicators - list of length n_channels representing channel / indicator names

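Note that in our example n_samples = n_timesteps - lookback - horizon - gap + 1. There is a single missing day (2016-01-18) w.r.t. the default B frequency that is going to be added and forward filled. This extra row offsets the first timestep that is deleted when computing returns, so the number of timesteps entering the rolling window is again n_timesteps.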
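The sample count can be checked with a few lines of plain Python. The helper window_slices below is hypothetical and only mimics a rolling window in which features cover lookback steps, followed by a gap and then horizon target steps. The exact indexing convention used by deepdow is an assumption here.

```python
def window_slices(n_timesteps, lookback, gap, horizon):
    """Return one (feature_slice, target_slice) pair per sample."""
    samples = []
    for i in range(lookback, n_timesteps - horizon - gap + 1):
        x_slice = slice(i - lookback, i)             # lookback steps of features
        y_slice = slice(i + gap, i + gap + horizon)  # horizon steps after the gap
        samples.append((x_slice, y_slice))
    return samples


n_raw = 20                    # rows in the example table
n_after_fill = n_raw + 1      # 2016-01-18 is added for the "B" frequency
n_returns = n_after_fill - 1  # computing returns drops the first timestep

samples = window_slices(n_returns, lookback=5, gap=2, horizon=4)
print(len(samples))  # 10
```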

from deepdow.utils import raw_to_Xy


n_timesteps = len(raw_df)  # 20
n_assets = len(raw_df.columns.levels[0])  # 2 (first level: assets)
n_channels = len(raw_df.columns.levels[1])  # 2 (second level: channels)

lookback, gap, horizon = 5, 2, 4

X, timestamps, y, asset_names, indicators = raw_to_Xy(raw_df,
                                                      lookback=lookback,
                                                      gap=gap,
                                                      freq="B",
                                                      horizon=horizon)

n_samples = n_timesteps - lookback - horizon - gap + 1  # 10

assert X.shape == (n_samples, n_channels, lookback, n_assets)
assert timestamps[0] == raw_df.index[lookback]
assert asset_names == ['AAPL', 'MSFT']
assert indicators == ['Close', 'Volume']

InRAMDataset¶

The next step is to start migrating our custom lists and numpy arrays to native PyTorch classes. For more details see the official PyTorch tutorial. First of all, deepdow implements its own subclass of torch.utils.data.Dataset called InRAMDataset. Its goal is to encapsulate the above generated X, y, timestamps and asset_names and define per-sample loading.

import torch
from deepdow.data import InRAMDataset

dataset = InRAMDataset(X, y, timestamps=timestamps, asset_names=asset_names)

X_sample, y_sample, timestamp_sample, asset_names = dataset[0]

assert isinstance(dataset, torch.utils.data.Dataset)
assert len(dataset) == 10

assert torch.is_tensor(X_sample)
assert X_sample.shape == (2, 5, 2)  # (n_channels, lookback, n_assets)

assert torch.is_tensor(y_sample)
assert y_sample.shape == (2, 4, 2)  # (n_channels, horizon, n_assets)

assert timestamp_sample == timestamps[0]

Additionally, one can pass a transformation transform that can serve as preprocessing or data augmentation. Currently implemented transforms under deepdow.data are

Compose - basically a copy of Compose from Torch Vision
Dropout - randomly setting elements to zero
Multiply - multiplying all elements by a constant
Noise - adding Gaussian noise
Scale - centering and scaling (similar to scikit-learn StandardScaler and RobustScaler)

None of the transforms operate in place, i.e. they leave the input tensors unmodified.
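To make the "not in place" behaviour concrete, here is a minimal sketch of a hypothetical transform. The name multiply_sample, the choice to scale only the features and the use of a numpy array as a stand-in for a tensor are all assumptions of this sketch and not deepdow's implementation.

```python
import numpy as np


def multiply_sample(X_sample, y_sample, timestamps_sample, asset_names, c=100):
    """Hypothetical transform: scale the features by c without mutating the input."""
    # c * X_sample allocates a new array, X_sample itself is left untouched.
    return c * X_sample, y_sample, timestamps_sample, asset_names


X_sample = np.full((2, 5, 2), 0.01)  # (n_channels, lookback, n_assets)
X_scaled, *_ = multiply_sample(X_sample, None, None, ["AAPL", "MSFT"])
```

An in-place variant (for example X_sample *= c) would silently change the data stored in the dataset every time a sample is loaded, which is what non-in-place transforms avoid.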

Dataloaders¶

The last ingredient in the data pipeline is the dataloader. Its goal is to stream batches of samples for training and validation. deepdow provides two options

RigidDataLoader - lookback, horizon and assets are constant over different batches
FlexibleDataLoader - lookback, horizon and assets can change over different batches

Both of them subclass torch.utils.data.DataLoader and therefore inherit its functionality. One important example is the batch_size parameter. However, they also add new functionality. Notably, one can use the parameter indices to specify which samples of the original dataset are going to be streamed. The train, validation and test split can be performed via this parameter. Last but not least, each of them has its own specific parameters that we describe in the following subsections.
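A split via indices could be prepared as sketched below. The helper chronological_split and its fractions are hypothetical. The sketch also assumes that the samples in the dataset are ordered in time, so that contiguous blocks of positions correspond to consecutive periods.

```python
def chronological_split(n_samples, train_frac=0.6, val_frac=0.2):
    """Split sample positions 0..n_samples-1 into three contiguous blocks."""
    n_train = int(n_samples * train_frac)
    n_val = int(n_samples * val_frac)
    indices = list(range(n_samples))
    return (
        indices[:n_train],                 # oldest samples for training
        indices[n_train:n_train + n_val],  # next block for validation
        indices[n_train + n_val:],         # most recent samples for testing
    )


indices_train, indices_val, indices_test = chronological_split(10)
print(indices_train, indices_val, indices_test)
```

Each list can then be handed to a dataloader through its indices parameter, for example one dataloader built with indices_train and another one with indices_val.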

RigidDataLoader¶

This dataloader streams batches without making fundamental changes to X_batch or y_batch.

The samples are shuffled
The shapes are
    X_batch.shape = (batch_size, n_channels, lookback, n_assets)
    y_batch.shape = (batch_size, n_channels, horizon, n_assets)
    len(timestamps_batch) = batch_size
    len(asset_names_batch) = n_assets
At construction one can redefine lookback, horizon and asset_ixs to create a new subset

import torch
from deepdow.data import RigidDataLoader

torch.manual_seed(1)
batch_size = 4

dataloader = RigidDataLoader(dataset, batch_size=batch_size)

for X_batch, y_batch, timestamps_batch, asset_names_batch in dataloader:
    print(X_batch.shape)
    print(y_batch.shape)
    print(asset_names_batch)
    print(list(map(str, timestamps_batch)))
    print()

torch.Size([4, 2, 5, 2])
torch.Size([4, 2, 4, 2])
['AAPL', 'MSFT']
['2016-01-15 00:00:00', '2016-01-19 00:00:00', '2016-01-22 00:00:00', '2016-01-13 00:00:00']

torch.Size([4, 2, 5, 2])
torch.Size([4, 2, 4, 2])
['AAPL', 'MSFT']
['2016-01-14 00:00:00', '2016-01-12 00:00:00', '2016-01-11 00:00:00', '2016-01-20 00:00:00']

torch.Size([2, 2, 5, 2])
torch.Size([2, 2, 4, 2])
['AAPL', 'MSFT']
['2016-01-21 00:00:00', '2016-01-18 00:00:00']

The big advantage of RigidDataLoader is that one can easily use it for evaluation purposes since the shape of batches is always the same. For example, we can be sure that the horizon in y_batch is going to be identical and therefore the predicted portfolio will always be held for horizon timesteps.

FlexibleDataLoader¶

The goal of this dataloader is to introduce major structural changes to the streamed batches X_batch and y_batch by randomly creating subtensors of them. Its important features are listed below.

The lookback_range tuple specifies the min and max lookback an X_batch can have. The actual lookback is sampled uniformly for every batch.

The horizon_range tuple specifies the min and max horizon a y_batch can have. It is sampled uniformly as well.

If asset_ixs is not specified then the n_assets_range tuple gives the min and max number of assets in X_batch and y_batch. The actual assets are sampled randomly.
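The kind of subtensor creation described above can be sketched with numpy. The function crop_batch is hypothetical. Treating the upper end of each range as exclusive, keeping the most recent lookback steps and keeping the earliest horizon steps are assumptions of this sketch and not a description of deepdow's internals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in batch with the shapes used on this page.
batch_size, n_channels, lookback, horizon, n_assets = 4, 2, 5, 4, 2
X_batch = rng.normal(size=(batch_size, n_channels, lookback, n_assets))
y_batch = rng.normal(size=(batch_size, n_channels, horizon, n_assets))


def crop_batch(X_batch, y_batch, lookback_range, horizon_range, rng):
    """Randomly shorten lookback and horizon and shuffle the asset order."""
    new_lookback = int(rng.integers(*lookback_range))  # upper bound excluded
    new_horizon = int(rng.integers(*horizon_range))    # upper bound excluded
    asset_ixs = rng.permutation(X_batch.shape[-1])
    # Keep the last new_lookback steps of X and the first new_horizon steps
    # of y, and apply the same asset order to both tensors.
    X_new = X_batch[:, :, -new_lookback:, :][..., asset_ixs]
    y_new = y_batch[:, :, :new_horizon, :][..., asset_ixs]
    return X_new, y_new, asset_ixs


X_new, y_new, asset_ixs = crop_batch(X_batch, y_batch, (2, 6), (2, 5), rng)
print(X_new.shape, y_new.shape, asset_ixs)
```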

import torch
from deepdow.data import FlexibleDataLoader

torch.manual_seed(3)
batch_size = 4

dataloader = FlexibleDataLoader(dataset,
                                batch_size=batch_size,
                                n_assets_range=(2, 3),  # keep n_assets = 2 but shuffle randomly
                                lookback_range=(2, 6),  # sampled uniformly from [2, 6)
                                horizon_range=(2, 5))   # sampled uniformly from [2, 5)

for X_batch, y_batch, timestamps_batch, asset_names_batch in dataloader:
    print(X_batch.shape)
    print(y_batch.shape)
    print(asset_names_batch)
    print(list(map(str, timestamps_batch)))
    print()

torch.Size([4, 2, 5, 2])
torch.Size([4, 2, 2, 2])
['AAPL', 'MSFT']
['2016-01-20 00:00:00', '2016-01-15 00:00:00', '2016-01-13 00:00:00', '2016-01-22 00:00:00']

torch.Size([4, 2, 4, 2])
torch.Size([4, 2, 2, 2])
['MSFT', 'AAPL']
['2016-01-12 00:00:00', '2016-01-18 00:00:00', '2016-01-11 00:00:00', '2016-01-21 00:00:00']

torch.Size([2, 2, 4, 2])
torch.Size([2, 2, 3, 2])
['AAPL', 'MSFT']
['2016-01-19 00:00:00', '2016-01-14 00:00:00']

The main purpose of this dataloader is to use it for training. One can design networks that can perform a forward pass of an input X with variable shapes (e.g. an RNN over the time dimension). This is where FlexibleDataLoader comes in handy because it can stream these variable inputs.

Warning

As an example of when not to use FlexibleDataLoader let us consider a dummy network. This network flattens the input tensor into a 1D vector of length n_channels * lookback * n_assets. Afterwards, it applies a linear layer and finally uses some allocation layer (softmax). In this case, one cannot just stream tensors of different sizes. Additionally, if we randomly shuffle the order of assets (while keeping the overall number equal to n_assets) the linear model will have no way of learning asset-specific features.
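The shape problem can be reproduced with a small numpy stand-in for such a dummy network. The function dummy_forward and the zero weights are purely illustrative and are not part of deepdow.

```python
import numpy as np

n_channels, lookback, n_assets = 2, 5, 2

# Linear layer sized for the flattened (n_channels, lookback, n_assets) input.
weights = np.zeros((n_channels * lookback * n_assets, n_assets))


def dummy_forward(X):
    """Flatten each sample, apply the linear layer, then softmax over assets."""
    flat = X.reshape(len(X), -1)
    scores = flat @ weights
    exp = np.exp(scores - scores.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)


# The lookback the layer was built for works.
ok = dummy_forward(np.ones((4, n_channels, lookback, n_assets)))

# A shorter lookback yields 16 instead of 20 features per sample and fails.
try:
    dummy_forward(np.ones((4, n_channels, lookback - 1, n_assets)))
    failed = False
except ValueError:
    failed = True
```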