Data Loading

Introduction

deepdow offers multiple utility functions and classes that turn raw data into tensors used by Layers and Losses.

See below a schematic of the overall data model (starting at the top)

https://i.imgur.com/Q8Tgnb5.png

We dedicate an entire section to each of the elements.

Raw data

Let us assume that our raw data raw_df is stored in a pd.DataFrame. There are n_timesteps rows representing different timesteps with the same time frequency but potentially with gaps (due to non-business days etc.). They are indexed by a pd.DatetimeIndex. The columns are indexed by a pd.MultiIndex where the first level represents the n_assets different assets and the second level represents the n_channels channels (indicators) like volume or close price. For the rest of this page we will be using the example below

Asset                  MSFT         MSFT    AAPL          AAPL
Channel               Close       Volume   Close        Volume
2016-01-04 00:00:00   54.80  53778000.00  105.35   67649400.00
2016-01-05 00:00:00   55.05  34079700.00  102.71   55791000.00
2016-01-06 00:00:00   54.05  39518900.00  100.70   68457400.00
2016-01-07 00:00:00   52.17  56564900.00   96.45   81094400.00
2016-01-08 00:00:00   52.33  48754000.00   96.96   70798000.00
2016-01-11 00:00:00   52.30  36943800.00   98.53   49739400.00
2016-01-12 00:00:00   52.78  36095500.00   99.96   49154200.00
2016-01-13 00:00:00   51.64  66883600.00   97.39   62439600.00
2016-01-14 00:00:00   53.11  52381900.00   99.52   63170100.00
2016-01-15 00:00:00   50.99  71820700.00   97.13   79833900.00
2016-01-19 00:00:00   50.56  43564500.00   96.66   53087700.00
2016-01-20 00:00:00   50.79  63273000.00   96.79   72334400.00
2016-01-21 00:00:00   50.48  40191200.00   96.30   52161500.00
2016-01-22 00:00:00   52.29  37555800.00  101.42   65800500.00
2016-01-25 00:00:00   51.79  34707700.00   99.44   51794500.00
2016-01-26 00:00:00   52.17  28900800.00   99.99   75077000.00
2016-01-27 00:00:00   51.22  36775200.00   93.42  133369700.00
2016-01-28 00:00:00   52.06  62513800.00   94.09   55678800.00
2016-01-29 00:00:00   55.09  83611700.00   97.34   64416500.00
2016-02-01 00:00:00   54.71  44208500.00   96.43   40943500.00

import pandas as pd

assert isinstance(raw_df, pd.DataFrame)
assert isinstance(raw_df.index, pd.DatetimeIndex)
assert isinstance(raw_df.columns, pd.MultiIndex)
assert raw_df.shape == (20, 4)
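
A DataFrame with this layout can also be constructed from scratch. Below is a hypothetical sketch with random values (unlike the real prices above, and with no holiday gap in the index, since pandas business days do not exclude holidays):

import numpy as np
import pandas as pd

# toy data mirroring the (n_timesteps, n_assets * n_channels) layout above
index = pd.date_range("2016-01-04", periods=20, freq="B")
columns = pd.MultiIndex.from_product([["MSFT", "AAPL"], ["Close", "Volume"]],
                                     names=["Asset", "Channel"])
raw_df_toy = pd.DataFrame(np.random.random((20, 4)), index=index, columns=columns)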

raw_to_Xy

The quickest way to get going given raw_df is to use the deepdow.utils.raw_to_Xy function. It performs the following steps:

  1. exclusion of undesired assets and channels (see included_assets and included_indicators)

  2. adding missing rows - timestamps implied by the specified frequency freq

  3. filling missing values (forward fill followed by backward fill)

  4. computation of returns (if use_log then logarithmic else simple) - the first timestep is automatically deleted

  5. running the rolling window (see Basics) given lookback, gap and horizon
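
For intuition, steps 2-4 roughly correspond to the pandas operations below (a simplified sketch, not the actual implementation of raw_to_Xy):

import numpy as np
import pandas as pd

full_index = pd.date_range(raw_df.index[0], raw_df.index[-1], freq="B")
df = raw_df.reindex(full_index)       # step 2: insert timestamps implied by freq
df = df.ffill().bfill()               # step 3: forward fill, then backward fill
returns = df.pct_change().iloc[1:]    # step 4: simple returns, first timestep dropped
# with use_log=True the returns would instead be np.log(df).diff().iloc[1:]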

We get the following outputs:

  • X - numpy array of shape (n_samples, n_channels, lookback, n_assets) representing features

  • timestamps - list of length n_samples representing the timestamp of each sample

  • y - numpy array of shape (n_samples, n_channels, horizon, n_assets) representing targets

  • asset_names - list of length n_assets representing asset names

  • indicators - list of length n_channels representing channel / indicator names

Note that in our example n_samples = n_timesteps - lookback - horizon - gap + 1. This holds because the single missing day (2016-01-18) w.r.t. the default B frequency is inserted and forward filled, which exactly offsets the first timestep lost when computing returns.

from deepdow.utils import raw_to_Xy


n_timesteps = len(raw_df)  # 20
n_assets = len(raw_df.columns.levels[0])  # 2, first level of the MultiIndex
n_channels = len(raw_df.columns.levels[1])  # 2, second level of the MultiIndex

lookback, gap, horizon = 5, 2, 4

X, timestamps, y, asset_names, indicators = raw_to_Xy(raw_df,
                                                      lookback=lookback,
                                                      gap=gap,
                                                      freq="B",
                                                      horizon=horizon)

n_samples = n_timesteps - lookback - horizon - gap + 1  # 10

assert X.shape == (n_samples, n_channels, lookback, n_assets)
assert timestamps[0] == raw_df.index[lookback]
assert asset_names == ['AAPL', 'MSFT']
assert indicators == ['Close', 'Volume']

InRAMDataset

The next step is to start migrating our custom lists and numpy arrays to native PyTorch classes (for more details see the official PyTorch data tutorial). First of all, deepdow implements its own subclass of torch.utils.data.Dataset called InRAMDataset. Its goal is to encapsulate the above generated X, y, timestamps and asset_names and to define per-sample loading.

import torch

from deepdow.data import InRAMDataset

dataset = InRAMDataset(X, y, timestamps=timestamps, asset_names=asset_names)

X_sample, y_sample, timestamp_sample, asset_names = dataset[0]

assert isinstance(dataset, torch.utils.data.Dataset)
assert len(dataset) == 10

assert torch.is_tensor(X_sample)
assert X_sample.shape == (2, 5, 2)  # (n_channels, lookback, n_assets)

assert torch.is_tensor(y_sample)
assert y_sample.shape == (2, 4, 2)  # (n_channels, horizon, n_assets)

assert timestamp_sample == timestamps[0]

Additionally, one can pass a transformation via the transform parameter that serves as preprocessing or data augmentation (see the example after the list below). The transforms currently implemented under deepdow.data are

  • Compose - an analogue of Compose from TorchVision

  • Dropout - randomly sets elements to zero

  • Multiply - multiplies all elements by a constant

  • Noise - adds Gaussian noise

  • Scale - centering and scaling (similar to scikit-learn's StandardScaler and RobustScaler)

None of the transforms operate in place.
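
For example, several transforms can be composed and passed to InRAMDataset. The snippet below is a minimal sketch; the constructor arguments p and frac are assumptions, so consult the API reference for the exact signatures.

from deepdow.data import Compose, Dropout, Noise

transform = Compose([Noise(frac=0.3), Dropout(p=0.5)])  # argument names assumed

dataset_augmented = InRAMDataset(X, y, timestamps=timestamps,
                                 asset_names=asset_names, transform=transform)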

Dataloaders

The last ingredient in the data pipeline is the dataloader, whose goal is to stream batches of samples for training and validation. deepdow provides two options

  • RigidDataLoader - lookback, horizon and assets are constant over different batches

  • FlexibleDataLoader - lookback, horizon and assets can change over different batches

Both of them subclass torch.utils.data.DataLoader and therefore inherit its functionality, one important example being the batch_size parameter. However, they also add new functionality. Notably, one can use the indices parameter to specify which samples of the original dataset are going to be streamed; the train, validation and test split can be performed via this parameter, as shown in the sketch below. Last but not least, each of them has its own specific parameters that we describe in the following subsections.
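
For instance, a chronological train and validation split via indices could look as follows (a sketch using RigidDataLoader, which is introduced in the next subsection):

from deepdow.data import RigidDataLoader

n_samples = len(dataset)  # 10
split_ix = int(0.8 * n_samples)

# earlier samples for training, later ones for validation to avoid lookahead
train_dataloader = RigidDataLoader(dataset, batch_size=4,
                                   indices=list(range(split_ix)))
val_dataloader = RigidDataLoader(dataset, batch_size=4,
                                 indices=list(range(split_ix, n_samples)))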

RigidDataLoader

This dataloader streams batches without making fundamental changes to X_batch or y_batch.

  • The samples are shuffled

  • The shapes are

    • X_batch.shape = (batch_size, n_channels, lookback, n_assets)

    • y_batch.shape = (batch_size, n_channels, horizon, n_assets)

    • len(timestamps_batch) = batch_size

    • len(asset_names_batch) = n_assets

  • At construction, one can redefine lookback, horizon and asset_ixs to create a new subset of the dataset

from deepdow.data import RigidDataLoader

torch.manual_seed(1)
batch_size = 4

dataloader = RigidDataLoader(dataset, batch_size=batch_size)

for X_batch, y_batch, timestamps_batch, asset_names_batch in dataloader:
    print(X_batch.shape)
    print(y_batch.shape)
    print(asset_names_batch)
    print(list(map(str, timestamps_batch)))
    print()
torch.Size([4, 2, 5, 2])
torch.Size([4, 2, 4, 2])
['AAPL', 'MSFT']
['2016-01-15 00:00:00', '2016-01-19 00:00:00', '2016-01-22 00:00:00', '2016-01-13 00:00:00']

torch.Size([4, 2, 5, 2])
torch.Size([4, 2, 4, 2])
['AAPL', 'MSFT']
['2016-01-14 00:00:00', '2016-01-12 00:00:00', '2016-01-11 00:00:00', '2016-01-20 00:00:00']

torch.Size([2, 2, 5, 2])
torch.Size([2, 2, 4, 2])
['AAPL', 'MSFT']
['2016-01-21 00:00:00', '2016-01-18 00:00:00']

The big advantage of RigidDataLoader is that one can easily use it for evaluation purposes since the shape of the batches is always the same. For example, we can be sure the horizon in y_batch is always identical and therefore the predicted portfolio will always be held for horizon timesteps.

FlexibleDataLoader

The goal of this dataloader is to introduce major structural changes to the streamed batches X_batch and y_batch by randomly creating subtensors of them. Its most important features are listed below.

  • The lookback_range tuple specifies the minimum and maximum (exclusive) lookback an X_batch can have. The actual lookback is sampled uniformly for every batch.

  • The horizon_range tuple specifies the minimum and maximum (exclusive) horizon a y_batch can have. It is also sampled uniformly for every batch.

  • If asset_ixs is not specified, the n_assets_range tuple gives the minimum and maximum (exclusive) number of assets in X_batch and y_batch. The actual assets are sampled randomly.

from deepdow.data import FlexibleDataLoader

torch.manual_seed(3)
batch_size = 4

dataloader = FlexibleDataLoader(dataset,
                                batch_size=batch_size,
                                n_assets_range=(2, 3),  # keep n_assets = 2 but shuffle randomly
                                lookback_range=(2, 6),  # sampled uniformly from [2, 6)
                                horizon_range=(2, 5))   # sampled uniformly from [2, 5)

for X_batch, y_batch, timestamps_batch, asset_names_batch in dataloader:
    print(X_batch.shape)
    print(y_batch.shape)
    print(asset_names_batch)
    print(list(map(str, timestamps_batch)))
    print()
torch.Size([4, 2, 5, 2])
torch.Size([4, 2, 2, 2])
['AAPL', 'MSFT']
['2016-01-20 00:00:00', '2016-01-15 00:00:00', '2016-01-13 00:00:00', '2016-01-22 00:00:00']

torch.Size([4, 2, 4, 2])
torch.Size([4, 2, 2, 2])
['MSFT', 'AAPL']
['2016-01-12 00:00:00', '2016-01-18 00:00:00', '2016-01-11 00:00:00', '2016-01-21 00:00:00']

torch.Size([2, 2, 4, 2])
torch.Size([2, 2, 3, 2])
['AAPL', 'MSFT']
['2016-01-19 00:00:00', '2016-01-14 00:00:00']

The main purpose of this dataloader is training. One can design networks that perform a forward pass on inputs X of variable shapes (e.g. an RNN over the time dimension). This is where FlexibleDataLoader comes in handy because it can stream these variable inputs; see the toy example below.
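
As a toy illustration (not a deepdow network), the module below produces valid portfolio weights for any lookback and any number of assets by pooling over the channel and time dimensions:

import torch


class MeanAllocator(torch.nn.Module):
    """Toy network whose forward pass works for any lookback and n_assets."""

    def forward(self, x):
        # x has shape (batch_size, n_channels, lookback, n_assets)
        per_asset_score = x.mean(dim=(1, 2))  # (batch_size, n_assets)
        return torch.nn.functional.softmax(per_asset_score, dim=1)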

Warning

As an example of when not to use FlexibleDataLoader, consider a dummy network that flattens the input tensor into a 1D vector of length n_channels * lookback * n_assets, applies a linear layer and finally uses some allocation layer (softmax). In this case, one cannot stream tensors of varying sizes. Additionally, if we randomly shuffle the order of the assets (while keeping the overall number equal to n_assets), the linear layer has no way of learning asset-specific features. A sketch of such a network follows.
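
A sketch of such a dummy network (hypothetical, for illustration only):

import torch


class FlattenLinear(torch.nn.Module):
    """Dummy network that only works for one fixed input shape."""

    def __init__(self, n_channels, lookback, n_assets):
        super().__init__()
        self.linear = torch.nn.Linear(n_channels * lookback * n_assets, n_assets)

    def forward(self, x):
        # fails whenever (n_channels, lookback, n_assets) differs from the
        # shape fixed at construction; a shuffled asset order silently breaks
        # any asset-specific weights the linear layer has learned
        flat = x.flatten(start_dim=1)
        return torch.nn.functional.softmax(self.linear(flat), dim=1)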