Data Loading¶
Introduction¶
deepdow offers multiple utility functions and classes that turn raw data into tensors used by Layers
and Losses.
See the scheme of the overall data model below (starting at the top).
We dedicate an entire section to each of the elements.
Raw data¶
Let us assume that our raw data raw_df is stored in a pd.DataFrame. There are n_timesteps rows
representing different timesteps with the same time frequency but potentially with gaps (due to non-business days etc.).
They are indexed by pd.DatetimeIndex. The columns are indexed by pd.MultiIndex where the first level
represents the n_assets different assets. The second level then represents
the n_channels channels (indicators) like volume or close price. For the rest of this
page we will be using the example below.
| Asset | MSFT | MSFT | AAPL | AAPL |
|---|---|---|---|---|
| Channel | Close | Volume | Close | Volume |
| 2016-01-04 00:00:00 | 54.80 | 53778000.00 | 105.35 | 67649400.00 |
| 2016-01-05 00:00:00 | 55.05 | 34079700.00 | 102.71 | 55791000.00 |
| 2016-01-06 00:00:00 | 54.05 | 39518900.00 | 100.70 | 68457400.00 |
| 2016-01-07 00:00:00 | 52.17 | 56564900.00 | 96.45 | 81094400.00 |
| 2016-01-08 00:00:00 | 52.33 | 48754000.00 | 96.96 | 70798000.00 |
| 2016-01-11 00:00:00 | 52.30 | 36943800.00 | 98.53 | 49739400.00 |
| 2016-01-12 00:00:00 | 52.78 | 36095500.00 | 99.96 | 49154200.00 |
| 2016-01-13 00:00:00 | 51.64 | 66883600.00 | 97.39 | 62439600.00 |
| 2016-01-14 00:00:00 | 53.11 | 52381900.00 | 99.52 | 63170100.00 |
| 2016-01-15 00:00:00 | 50.99 | 71820700.00 | 97.13 | 79833900.00 |
| 2016-01-19 00:00:00 | 50.56 | 43564500.00 | 96.66 | 53087700.00 |
| 2016-01-20 00:00:00 | 50.79 | 63273000.00 | 96.79 | 72334400.00 |
| 2016-01-21 00:00:00 | 50.48 | 40191200.00 | 96.30 | 52161500.00 |
| 2016-01-22 00:00:00 | 52.29 | 37555800.00 | 101.42 | 65800500.00 |
| 2016-01-25 00:00:00 | 51.79 | 34707700.00 | 99.44 | 51794500.00 |
| 2016-01-26 00:00:00 | 52.17 | 28900800.00 | 99.99 | 75077000.00 |
| 2016-01-27 00:00:00 | 51.22 | 36775200.00 | 93.42 | 133369700.00 |
| 2016-01-28 00:00:00 | 52.06 | 62513800.00 | 94.09 | 55678800.00 |
| 2016-01-29 00:00:00 | 55.09 | 83611700.00 | 97.34 | 64416500.00 |
| 2016-02-01 00:00:00 | 54.71 | 44208500.00 | 96.43 | 40943500.00 |
import pandas as pd

# raw_df is assumed to already hold the table above
assert isinstance(raw_df, pd.DataFrame)
assert isinstance(raw_df.index, pd.DatetimeIndex)
assert isinstance(raw_df.columns, pd.MultiIndex)
assert raw_df.shape == (20, 4)
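For readers who want to experiment without the full dataset, the sketch below rebuilds the first three rows of the example table with the layout described above. The name toy_df and the level names Asset and Channel are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

# First three rows of the example table.
index = pd.DatetimeIndex(["2016-01-04", "2016-01-05", "2016-01-06"])

# First level: assets, second level: channels (level names are assumed).
columns = pd.MultiIndex.from_product(
    [["MSFT", "AAPL"], ["Close", "Volume"]], names=["Asset", "Channel"]
)

values = np.array([
    [54.80, 53778000.0, 105.35, 67649400.0],
    [55.05, 34079700.0, 102.71, 55791000.0],
    [54.05, 39518900.0, 100.70, 68457400.0],
])
toy_df = pd.DataFrame(values, index=index, columns=columns)

# Select one channel for all assets through the second column level.
close = toy_df.xs("Close", axis=1, level="Channel")
```

Selecting by level in this way is one convenient consequence of the two-level column layout.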
raw_to_Xy¶
The quickest way to get going given raw_df is to use the deepdow.utils.raw_to_Xy function.
It performs the following steps
exclusion of undesired assets and channels (see included_assets and included_indicators)
adding missing rows - timestamps implied by the specified frequency freq
filling missing values (forward fill followed by backward fill)
computation of returns (if use_log then logarithmic else simple) - the first timestep is automatically deleted
running the rolling window (see Basics) given lookback, gap and horizon
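The steps listed above can be approximated with plain pandas. The following is a minimal sketch under stated assumptions, not the actual raw_to_Xy implementation: the helper preprocess_sketch is hypothetical, and both the exclusion of assets and channels and the rolling window are left out.

```python
import numpy as np
import pandas as pd


def preprocess_sketch(raw_df, freq="B", use_log=True):
    """Hypothetical helper approximating the listed steps (not deepdow code)."""
    # Add rows for every timestamp implied by the frequency.
    full_index = pd.date_range(raw_df.index[0], raw_df.index[-1], freq=freq)
    filled = raw_df.reindex(full_index)
    # Forward fill followed by backward fill.
    filled = filled.ffill().bfill()
    # Returns relative to the previous timestep.
    ratio = filled / filled.shift(1)
    returns = np.log(ratio) if use_log else ratio - 1
    # The first timestep has no predecessor, so it is dropped.
    return returns.iloc[1:]


# Thursday, Friday and Tuesday: Monday 2016-01-18 is missing.
toy_index = pd.DatetimeIndex(["2016-01-14", "2016-01-15", "2016-01-19"])
toy_prices = pd.DataFrame({"Close": [100.0, 110.0, 121.0]}, index=toy_index)
toy_returns = preprocess_sketch(toy_prices, use_log=False)
```

In this toy input the missing Monday is forward filled from Friday, so its simple return is zero, and the result has three rows (four after filling, minus the deleted first one).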
We get the following outputs
X - numpy array of shape (n_samples, n_channels, lookback, n_assets) representing features
timestamps - list of length n_samples representing timestamp of each sample
y - numpy array of shape (n_samples, n_channels, horizon, n_assets) representing targets
asset_names - list of length n_assets representing asset names
indicators - list of length n_channels representing channel / indicator names
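Note that in our example n_samples = n_timesteps - lookback - horizon - gap + 1. There is a single
missing day (2016-01-18) w.r.t. the default B frequency. It is added and forward filled, which gives 21 rows,
and the computation of returns then deletes the first row, so exactly n_timesteps = 20 rows enter the rolling window.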
from deepdow.utils import raw_to_Xy
n_timesteps = len(raw_df) # 20
n_assets = len(raw_df.columns.levels[0])  # 2 (first level: assets)
n_channels = len(raw_df.columns.levels[1])  # 2 (second level: channels)
lookback, gap, horizon = 5, 2, 4
X, timestamps, y, asset_names, indicators = raw_to_Xy(raw_df,
                                                      lookback=lookback,
                                                      gap=gap,
                                                      freq="B",
                                                      horizon=horizon)
n_samples = n_timesteps - lookback - horizon - gap + 1 # 10
assert X.shape == (n_samples, n_channels, lookback, n_assets)
assert timestamps[0] == raw_df.index[lookback]
assert asset_names == ['AAPL', 'MSFT']
assert indicators == ['Close', 'Volume']
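The sample count used above can be checked with a small rolling-window sketch in NumPy. This is only an illustration of the windowing idea, assuming consecutive rows and a single channel (the channel axis is omitted for brevity); rolling_window_sketch is a hypothetical helper, not the function deepdow uses.

```python
import numpy as np


def rolling_window_sketch(returns, lookback, gap, horizon):
    """Hypothetical helper; returns has shape (n_timesteps, n_assets)."""
    n_timesteps = returns.shape[0]
    n_samples = n_timesteps - lookback - gap - horizon + 1
    # Features: lookback consecutive rows starting at row i.
    X = np.stack([returns[i:i + lookback] for i in range(n_samples)])
    # Targets: horizon rows starting gap rows after the feature window ends.
    offset = lookback + gap
    y = np.stack([returns[i + offset:i + offset + horizon] for i in range(n_samples)])
    return X, y


# 20 timesteps and 2 assets, as in the running example after preprocessing.
toy_returns = np.arange(40, dtype=float).reshape(20, 2)
X_toy, y_toy = rolling_window_sketch(toy_returns, lookback=5, gap=2, horizon=4)
```

With 20 rows, lookback 5, gap 2 and horizon 4 the last target window ends exactly at the final row, which is why 10 samples are produced.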
InRAMDataset¶
The next step is to start migrating our custom lists and numpy arrays to native PyTorch classes. For more details see
the official PyTorch tutorial. First of all,
deepdow implements its own subclass of torch.utils.data.Dataset called InRAMDataset. Its goal
is to encapsulate the above generated X, y, timestamps and asset_names and define
per-sample loading.
import torch
from deepdow.data import InRAMDataset
dataset = InRAMDataset(X, y, timestamps=timestamps, asset_names=asset_names)
X_sample, y_sample, timestamp_sample, asset_names = dataset[0]
assert isinstance(dataset, torch.utils.data.Dataset)
assert len(dataset) == 10
assert torch.is_tensor(X_sample)
assert X_sample.shape == (2, 5, 2) # (n_channels, lookback, n_assets)
assert torch.is_tensor(y_sample)
assert y_sample.shape == (2, 4, 2) # (n_channels, horizon, n_assets)
assert timestamp_sample == timestamps[0]
Additionally, one can pass a transformation transform that can serve as preprocessing or data augmentation.
Currently implemented transforms under deepdow.data are
Compose - basically a copy of Compose from Torch Vision
Dropout - randomly setting elements to zero
Multiply - multiplying all elements by a constant
Noise - add Gaussian noise
Scale - centering and scaling (similar to scikit-learn StandardScaler and RobustScaler)
None of the transforms operate in place: each returns new tensors and leaves its inputs unmodified.
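To make per-sample loading and a transform that is not in place more concrete, here is a minimal sketch. It uses NumPy arrays as stand-ins for tensors, and both DatasetSketch and MultiplySketch are hypothetical classes written for this page. In particular, the call signature of the transform is an assumption and may differ from the transforms shipped in deepdow.data.

```python
import numpy as np


class MultiplySketch:
    """Hypothetical transform: scales the features and returns a new array."""

    def __init__(self, c=100):
        self.c = c

    def __call__(self, X_sample, y_sample, timestamp_sample, asset_names):
        # Multiplication allocates a new array, so the input stays untouched.
        return self.c * X_sample, y_sample, timestamp_sample, asset_names


class DatasetSketch:
    """Hypothetical stand-in for a dataset with per-sample loading."""

    def __init__(self, X, y, timestamps, asset_names, transform=None):
        self.X, self.y = X, y
        self.timestamps = timestamps
        self.asset_names = asset_names
        self.transform = transform

    def __len__(self):
        return len(self.X)

    def __getitem__(self, ix):
        sample = (self.X[ix], self.y[ix], self.timestamps[ix], self.asset_names)
        if self.transform is not None:
            sample = self.transform(*sample)
        return sample


X_toy = np.ones((10, 2, 5, 2))
y_toy = np.ones((10, 2, 4, 2))
toy_dataset = DatasetSketch(X_toy, y_toy, list(range(10)), ["AAPL", "MSFT"],
                            transform=MultiplySketch(c=100))
X_sample, y_sample, timestamp_sample, names = toy_dataset[0]
```

After indexing, X_sample holds the scaled values while X_toy still contains ones.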
Dataloaders¶
The last ingredient in the data pipeline is the dataloader. Its goal is to stream batches of samples for training and
validation. deepdow provides two options
RigidDataLoader - lookback, horizon and assets are constant over different batches
FlexibleDataLoader - lookback, horizon and assets can change over different batches
Both of them are subclassing torch.utils.data.DataLoader and therefore inherit its functionality. One important
example is the batch_size parameter. However, they also add new functionality. Notably one can use the
parameter indices to specify which samples of the original dataset are going to be streamed. The
train, validation and test split can be performed via this parameter. Last but not least, each of them has its own
specific parameters that we describe in the following subsections.
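A chronological split through the indices parameter could be prepared as follows. This is a sketch only: the 80/20 ratio and the variable names are assumptions chosen for illustration.

```python
n_samples = 10  # len(dataset) in the running example
train_fraction = 0.8  # assumed split ratio, chosen for illustration

# Earlier samples go to training, later ones to testing.
split_ix = int(n_samples * train_fraction)
indices_train = list(range(split_ix))
indices_test = list(range(split_ix, n_samples))

# These lists could then be handed over via the indices parameter, e.g.
# RigidDataLoader(dataset, indices=indices_test, batch_size=4)
```

Keeping the split chronological means that every test sample starts later than every training sample, although neighbouring samples still share rows because their windows overlap.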
RigidDataLoader¶
This dataloader streams batches without making fundamental changes to X_batch or y_batch.
The samples are shuffled
The shapes are
X_batch.shape = (batch_size, n_channels, lookback, n_assets)
y_batch.shape = (batch_size, n_channels, horizon, n_assets)
len(timestamps_batch) = batch_size
len(asset_names_batch) = n_assets
At construction one can redefine lookback, horizon and asset_ixs to create a new subset
import torch
from deepdow.data import RigidDataLoader

torch.manual_seed(1)
batch_size = 4
dataloader = RigidDataLoader(dataset, batch_size=batch_size)

# Each iteration yields one batch of features, targets, timestamps and asset names
for X_batch, y_batch, timestamps_batch, asset_names_batch in dataloader:
    print(X_batch.shape)
    print(y_batch.shape)
    print(asset_names_batch)
    print(list(map(str, timestamps_batch)))
    print()
torch.Size([4, 2, 5, 2])
torch.Size([4, 2, 4, 2])
['AAPL', 'MSFT']
['2016-01-15 00:00:00', '2016-01-19 00:00:00', '2016-01-22 00:00:00', '2016-01-13 00:00:00']
torch.Size([4, 2, 5, 2])
torch.Size([4, 2, 4, 2])
['AAPL', 'MSFT']
['2016-01-14 00:00:00', '2016-01-12 00:00:00', '2016-01-11 00:00:00', '2016-01-20 00:00:00']
torch.Size([2, 2, 5, 2])
torch.Size([2, 2, 4, 2])
['AAPL', 'MSFT']
['2016-01-21 00:00:00', '2016-01-18 00:00:00']
The big advantage of RigidDataLoader is that one can easily use it for evaluation purposes since
lookback, horizon and assets are the same in every batch. For example, we can be sure the horizon in y_batch
is going to be identical and therefore the predicted portfolio will always be held for horizon
timesteps.
FlexibleDataLoader¶
The goal of this dataloader is to introduce major structural changes to the streamed batches X_batch and
y_batch by randomly creating subtensors of them. Its important features are
lookback_range - tuple specifying the min and max (excluded) lookback an X_batch can have. The actual lookback is sampled uniformly for every batch.
horizon_range - tuple specifying the min and max (excluded) horizon a y_batch can have. Sampled uniformly.
n_assets_range - if asset_ixs is not specified then this tuple is the min and max (excluded) number of assets in X_batch and y_batch. The actual assets are sampled randomly.
import torch
from deepdow.data import FlexibleDataLoader

torch.manual_seed(3)
batch_size = 4
dataloader = FlexibleDataLoader(dataset,
                                batch_size=batch_size,
                                n_assets_range=(2, 3),  # keep n_assets = 2 but shuffle randomly
                                lookback_range=(2, 6),  # sampled uniformly from [2, 6)
                                horizon_range=(2, 5))  # sampled uniformly from [2, 5)

# Shapes and asset order can differ from batch to batch
for X_batch, y_batch, timestamps_batch, asset_names_batch in dataloader:
    print(X_batch.shape)
    print(y_batch.shape)
    print(asset_names_batch)
    print(list(map(str, timestamps_batch)))
    print()
torch.Size([4, 2, 5, 2])
torch.Size([4, 2, 2, 2])
['AAPL', 'MSFT']
['2016-01-20 00:00:00', '2016-01-15 00:00:00', '2016-01-13 00:00:00', '2016-01-22 00:00:00']
torch.Size([4, 2, 4, 2])
torch.Size([4, 2, 2, 2])
['MSFT', 'AAPL']
['2016-01-12 00:00:00', '2016-01-18 00:00:00', '2016-01-11 00:00:00', '2016-01-21 00:00:00']
torch.Size([2, 2, 4, 2])
torch.Size([2, 2, 3, 2])
['AAPL', 'MSFT']
['2016-01-19 00:00:00', '2016-01-14 00:00:00']
The main purpose of this dataloader is to use it for training. One can design networks that can perform
a forward pass of an input X with variable shapes (e.g. an RNN over the time dimension). This is where
FlexibleDataLoader comes in handy because it can stream these variable inputs.
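The random subtensor idea can be sketched in NumPy as below. This is an illustration under assumptions rather than the dataloader's actual code: random_subtensors is a hypothetical helper, and it is assumed that the most recent lookback steps of the features and the earliest horizon steps of the targets are kept.

```python
import numpy as np

rng = np.random.default_rng(0)


def random_subtensors(X_batch, y_batch, lookback_range, horizon_range, n_assets_range):
    """Hypothetical helper: crop a batch to a random lookback, horizon and asset subset."""
    # Upper bounds are exclusive, matching the [min, max) convention.
    lookback = int(rng.integers(*lookback_range))
    horizon = int(rng.integers(*horizon_range))
    n_assets = int(rng.integers(*n_assets_range))
    # Random assets in random order, without repetition.
    asset_ixs = rng.choice(X_batch.shape[-1], size=n_assets, replace=False)
    # Assumption: keep the most recent lookback steps and the earliest horizon steps.
    X_sub = X_batch[:, :, -lookback:, :][..., asset_ixs]
    y_sub = y_batch[:, :, :horizon, :][..., asset_ixs]
    return X_sub, y_sub, asset_ixs


X_toy = np.zeros((4, 2, 5, 2))  # (batch_size, n_channels, lookback, n_assets)
y_toy = np.zeros((4, 2, 4, 2))  # (batch_size, n_channels, horizon, n_assets)
X_sub, y_sub, asset_ixs = random_subtensors(X_toy, y_toy, (2, 6), (2, 5), (2, 3))
```

The batch and channel dimensions stay fixed while the time and asset dimensions vary from call to call.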
Warning
As an example of when not to use FlexibleDataLoader let us consider a dummy network. This
network flattens the input tensor into a 1D vector of length n_channels * lookback * n_assets. Afterwards,
it applies a linear layer and finally uses some allocation layer (softmax). In this case, one cannot just
stream tensors of different sizes. Additionally, if we randomly shuffle the order of assets (while keeping the overall
number equal to n_assets) the linear model will have no way of learning asset-specific features.
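The shape problem described in this warning can be reproduced with a few lines of NumPy. The sketch below is a stand-in for the dummy network, not a deepdow model: dummy_forward and the randomly drawn weight matrix W are hypothetical.

```python
import numpy as np

n_channels, lookback, n_assets = 2, 5, 2
rng = np.random.default_rng(0)

# Linear layer weights fixed to the flattened input length.
W = rng.normal(size=(n_channels * lookback * n_assets, n_assets))


def dummy_forward(X_batch):
    """Flatten, apply the linear layer, then a softmax allocation."""
    flat = X_batch.reshape(X_batch.shape[0], -1)
    scores = flat @ W
    exp = np.exp(scores - scores.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)


# The lookback the weights were built for works.
weights = dummy_forward(np.zeros((4, n_channels, lookback, n_assets)))

# A shorter lookback changes the flattened length and the product fails.
try:
    dummy_forward(np.zeros((4, n_channels, 3, n_assets)))
    shape_error = False
except ValueError:
    shape_error = True
```

Each column of W is also tied to a fixed position in the flattened input, which is why shuffling the asset order would scramble what the layer sees.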