Data Loading¶
Introduction¶
deepdow offers multiple utility functions and classes that turn raw data into the tensors used by Layers and Losses.
The scheme below shows the overall data model (starting at the top). We dedicate a section to each of its elements.
Raw data¶
Let us assume that our raw data raw_df is stored in a pd.DataFrame. There are n_timesteps rows
representing different timesteps with the same time frequency but potentially with gaps (due to non-business days etc.).
They are indexed by pd.DatetimeIndex. The columns are indexed by pd.MultiIndex where the first level
represents the n_assets different assets and the second level represents the n_channels channels (indicators)
like volume or close price. For the rest of this page we will be using the example below:
| Date | MSFT Close | MSFT Volume | AAPL Close | AAPL Volume |
|---|---|---|---|---|
| 2016-01-04 | 54.80 | 53778000.00 | 105.35 | 67649400.00 |
| 2016-01-05 | 55.05 | 34079700.00 | 102.71 | 55791000.00 |
| 2016-01-06 | 54.05 | 39518900.00 | 100.70 | 68457400.00 |
| 2016-01-07 | 52.17 | 56564900.00 | 96.45 | 81094400.00 |
| 2016-01-08 | 52.33 | 48754000.00 | 96.96 | 70798000.00 |
| 2016-01-11 | 52.30 | 36943800.00 | 98.53 | 49739400.00 |
| 2016-01-12 | 52.78 | 36095500.00 | 99.96 | 49154200.00 |
| 2016-01-13 | 51.64 | 66883600.00 | 97.39 | 62439600.00 |
| 2016-01-14 | 53.11 | 52381900.00 | 99.52 | 63170100.00 |
| 2016-01-15 | 50.99 | 71820700.00 | 97.13 | 79833900.00 |
| 2016-01-19 | 50.56 | 43564500.00 | 96.66 | 53087700.00 |
| 2016-01-20 | 50.79 | 63273000.00 | 96.79 | 72334400.00 |
| 2016-01-21 | 50.48 | 40191200.00 | 96.30 | 52161500.00 |
| 2016-01-22 | 52.29 | 37555800.00 | 101.42 | 65800500.00 |
| 2016-01-25 | 51.79 | 34707700.00 | 99.44 | 51794500.00 |
| 2016-01-26 | 52.17 | 28900800.00 | 99.99 | 75077000.00 |
| 2016-01-27 | 51.22 | 36775200.00 | 93.42 | 133369700.00 |
| 2016-01-28 | 52.06 | 62513800.00 | 94.09 | 55678800.00 |
| 2016-01-29 | 55.09 | 83611700.00 | 97.34 | 64416500.00 |
| 2016-02-01 | 54.71 | 44208500.00 | 96.43 | 40943500.00 |
assert isinstance(raw_df, pd.DataFrame)
assert isinstance(raw_df.index, pd.DatetimeIndex)
assert isinstance(raw_df.columns, pd.MultiIndex)
assert raw_df.shape == (20, 4)
raw_to_Xy¶
The quickest way to get going given raw_df is to use the deepdow.utils.raw_to_Xy function.
It performs the following steps:

- exclusion of undesired assets and channels (see included_assets and included_indicators)
- adding missing rows - timestamps implied by the specified frequency freq
- filling missing values (forward fill followed by backward fill)
- computation of returns (logarithmic if use_log, else simple) - the first timestep is automatically deleted
- running the rolling window (see Basics) given lookback, gap and horizon
We get the following outputs:

- X - numpy array of shape (n_samples, n_channels, lookback, n_assets) representing features
- timestamps - list of length n_samples representing the timestamp of each sample
- y - numpy array of shape (n_samples, n_channels, horizon, n_assets) representing targets
- asset_names - list of length n_assets representing asset names
- indicators - list of length n_channels representing channel / indicator names
Note that in our example n_samples = n_timesteps - lookback - horizon - gap + 1 = 20 - 5 - 4 - 2 + 1 = 10.
This works out because the single missing day (2016-01-18) w.r.t. the default B frequency is added
and forward filled, which exactly offsets the first timestep that is dropped when computing returns.
from deepdow.utils import raw_to_Xy
n_timesteps = len(raw_df) # 20
n_assets = len(raw_df.columns.levels[0])  # 2, first level of the MultiIndex
n_channels = len(raw_df.columns.levels[1])  # 2, second level of the MultiIndex
lookback, gap, horizon = 5, 2, 4
X, timestamps, y, asset_names, indicators = raw_to_Xy(raw_df,
                                                      lookback=lookback,
                                                      gap=gap,
                                                      freq="B",
                                                      horizon=horizon)
n_samples = n_timesteps - lookback - horizon - gap + 1 # 10
assert X.shape == (n_samples, n_channels, lookback, n_assets)
assert timestamps[0] == raw_df.index[lookback]
assert asset_names == ['AAPL', 'MSFT']
assert indicators == ['Close', 'Volume']
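A quick sanity check of the remaining outputs; their shapes and lengths follow directly from the list above:

assert y.shape == (n_samples, n_channels, horizon, n_assets)
assert len(timestamps) == n_samples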
InRAMDataset¶
The next step is to start migrating our custom lists and numpy arrays to native PyTorch classes. For more details see
the official PyTorch tutorial. First of all, deepdow implements its own subclass of torch.utils.data.Dataset
called InRAMDataset. Its goal is to encapsulate the above generated X, y, timestamps and asset_names and to define
per-sample loading.
import torch

from deepdow.data import InRAMDataset
dataset = InRAMDataset(X, y, timestamps=timestamps, asset_names=asset_names)
X_sample, y_sample, timestamp_sample, asset_names_sample = dataset[0]
assert isinstance(dataset, torch.utils.data.Dataset)
assert len(dataset) == 10
assert torch.is_tensor(X_sample)
assert X_sample.shape == (2, 5, 2) # (n_channels, lookback, n_assets)
assert torch.is_tensor(y_sample)
assert y_sample.shape == (2, 4, 2) # (n_channels, horizon, n_assets)
assert timestamp_sample == timestamps[0]
Additionally, one can pass a transformation transform that can serve as preprocessing or data augmentation.
The transforms currently implemented under deepdow.data are:

- Compose - basically a copy of Compose from Torchvision
- Dropout - randomly sets elements to zero
- Multiply - multiplies all elements by a constant
- Noise - adds Gaussian noise
- Scale - centering and scaling (similar to scikit-learn's StandardScaler and RobustScaler)

None of the transforms operates in place.
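A minimal sketch of plugging a transform into the dataset, assuming the default constructor arguments of Dropout and Noise are sufficient (see deepdow.data for their exact parameters):

from deepdow.data import Compose, Dropout, InRAMDataset, Noise

# zero out random elements, then add Gaussian noise, applied per sample at load time
transform = Compose([Dropout(), Noise()])
dataset_aug = InRAMDataset(X, y, timestamps=timestamps, asset_names=asset_names, transform=transform)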
Dataloaders¶
The last ingredients in the data pipeline are the dataloaders. Their goal is to stream batches of samples for training and
validation. deepdow provides two options:

- RigidDataLoader - lookback, horizon and assets are constant across batches
- FlexibleDataLoader - lookback, horizon and assets can change across batches

Both subclass torch.utils.data.DataLoader and therefore inherit its functionality, one important
example being the batch_size parameter. However, they also add new functionality. Notably, the
parameter indices specifies which samples of the original dataset are going to be streamed; the
train, validation and test split can be performed via this parameter, as sketched below. Last but not least, each
dataloader has specific parameters that we describe in the following subsections.
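For illustration, a minimal sketch of a chronological train / validation split via indices (the 80 / 20 ratio and the variable names are arbitrary choices of ours):

from deepdow.data import RigidDataLoader

split = int(0.8 * len(dataset))  # first 80% of samples for training, the rest for validation
train_dataloader = RigidDataLoader(dataset, indices=list(range(split)), batch_size=4)
val_dataloader = RigidDataLoader(dataset, indices=list(range(split, len(dataset))), batch_size=4)

Since the samples are ordered chronologically, splitting on the index keeps all validation timestamps after the training ones.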
RigidDataLoader¶
This dataloader streams batches without making fundamental changes to X_batch or y_batch:

- the samples are shuffled
- the shapes are fixed:
  - X_batch.shape = (batch_size, n_channels, lookback, n_assets)
  - y_batch.shape = (batch_size, n_channels, horizon, n_assets)
  - len(timestamps_batch) = batch_size
  - len(asset_names_batch) = n_assets
- at construction one can redefine lookback, horizon and asset_ixs to create a new subset (see the sketch at the end of this subsection)
from deepdow.data import RigidDataLoader
torch.manual_seed(1)
batch_size = 4
dataloader = RigidDataLoader(dataset, batch_size=batch_size)
for X_batch, y_batch, timestamps_batch, asset_names_batch in dataloader:
    print(X_batch.shape)
    print(y_batch.shape)
    print(asset_names_batch)
    print(list(map(str, timestamps_batch)))
    print()
torch.Size([4, 2, 5, 2])
torch.Size([4, 2, 4, 2])
['AAPL', 'MSFT']
['2016-01-15 00:00:00', '2016-01-19 00:00:00', '2016-01-22 00:00:00', '2016-01-13 00:00:00']
torch.Size([4, 2, 5, 2])
torch.Size([4, 2, 4, 2])
['AAPL', 'MSFT']
['2016-01-14 00:00:00', '2016-01-12 00:00:00', '2016-01-11 00:00:00', '2016-01-20 00:00:00']
torch.Size([2, 2, 5, 2])
torch.Size([2, 2, 4, 2])
['AAPL', 'MSFT']
['2016-01-21 00:00:00', '2016-01-18 00:00:00']
The big advantage of RigidDataLoader is that one can easily use it for evaluation purposes since
the shape of the batches is always the same. For example, we can be sure the horizon in y_batch
is always identical and therefore the predicted portfolio will always be held for horizon timesteps.
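As a sketch of the subsetting mentioned in the list above (assuming the first batch is full, i.e. contains batch_size samples):

dataloader_small = RigidDataLoader(dataset, lookback=3, horizon=2, asset_ixs=[0], batch_size=4)
X_batch, y_batch, timestamps_batch, asset_names_batch = next(iter(dataloader_small))

assert X_batch.shape == (4, 2, 3, 1)  # (batch_size, n_channels, lookback, n_assets)
assert y_batch.shape == (4, 2, 2, 1)  # (batch_size, n_channels, horizon, n_assets)
assert asset_names_batch == ['AAPL']  # asset_ixs=[0] keeps only the first asset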
FlexibleDataLoader¶
The goal of this dataloader is to introduce major structural changes to the streamed batches X_batch and
y_batch by randomly creating subtensors of them. The important parameters are:

- lookback_range - tuple specifying the min and max lookback an X_batch can have; the actual lookback is sampled uniformly for every batch
- horizon_range - tuple specifying the min and max horizon a y_batch can have; sampled uniformly
- n_assets_range - used if asset_ixs is not specified; tuple specifying the min and max number of assets in X_batch and y_batch; the actual assets are sampled randomly
from deepdow.data import FlexibleDataLoader
torch.manual_seed(3)
batch_size = 4
dataloader = FlexibleDataLoader(dataset,
                                batch_size=batch_size,
                                n_assets_range=(2, 3),  # keep n_assets = 2 but shuffle randomly
                                lookback_range=(2, 6),  # sampled uniformly from [2, 6)
                                horizon_range=(2, 5))  # sampled uniformly from [2, 5)
for X_batch, y_batch, timestamps_batch, asset_names_batch in dataloader:
    print(X_batch.shape)
    print(y_batch.shape)
    print(asset_names_batch)
    print(list(map(str, timestamps_batch)))
    print()
torch.Size([4, 2, 5, 2])
torch.Size([4, 2, 2, 2])
['AAPL', 'MSFT']
['2016-01-20 00:00:00', '2016-01-15 00:00:00', '2016-01-13 00:00:00', '2016-01-22 00:00:00']
torch.Size([4, 2, 4, 2])
torch.Size([4, 2, 2, 2])
['MSFT', 'AAPL']
['2016-01-12 00:00:00', '2016-01-18 00:00:00', '2016-01-11 00:00:00', '2016-01-21 00:00:00']
torch.Size([2, 2, 4, 2])
torch.Size([2, 2, 3, 2])
['AAPL', 'MSFT']
['2016-01-19 00:00:00', '2016-01-14 00:00:00']
The main purpose of this dataloader is training. One can design networks that can perform
a forward pass of an input X with variable shapes (e.g. an RNN over the time dimension). This is where
FlexibleDataLoader comes in handy because it can stream these variable inputs.
Warning
As an example of when not to use FlexibleDataLoader, let us consider a dummy network that
flattens the input tensor into a 1D vector of length n_channels * lookback * n_assets, then
applies a linear layer and finally uses some allocation layer (softmax). In this case, one cannot simply
stream tensors of different sizes. Additionally, if we randomly shuffle the order of assets (while keeping the overall
number equal to n_assets), the linear model has no way of learning asset-specific features.
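To make the pitfall concrete, below is a minimal sketch of such a dummy network (hypothetical, not part of deepdow). Its linear layer hard-codes the flattened input size at construction, so any batch with a different lookback or n_assets makes the forward pass fail:

import torch

class FlattenNet(torch.nn.Module):
    """Dummy network whose linear layer assumes one fixed input shape."""

    def __init__(self, n_channels, lookback, n_assets):
        super().__init__()
        # in_features is frozen here; it must match every incoming batch
        self.linear = torch.nn.Linear(n_channels * lookback * n_assets, n_assets)

    def forward(self, x):
        n_samples = x.shape[0]
        flat = x.reshape(n_samples, -1)  # length changes whenever lookback or n_assets change
        return torch.nn.functional.softmax(self.linear(flat), dim=1)  # portfolio weights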