Layers¶
Introduction¶
As described in Basics, our goal is to construct a network that takes a 3D tensor x of shape
(n_channels, lookback, n_assets)
as input and outputs a 1D tensor w of shape (n_assets,)
. One can achieve
this by creating a pipeline of layers. See below an example of such a pipeline:
- L1 - 1D convolution shared across assets, no change in dimensionality
- L2 - mean over the channels (3D -> 2D)
- L3 - maximum over timesteps (2D -> 1D)
- L4 - covariance matrix of the columns of h2
- L5 - given h3 and h4, solves a convex optimization problem
deepdow groups all custom layers into 4 categories:
- Transform - feature extractors that do not change the dimensionality of the input tensor (L1 in the example)
- Collapse - remove an entire dimension of the input via some aggregation scheme (L2 and L3 in the example)
- Allocate - given input tensors, these layers generate the final portfolio weights (L5 in the example)
- Misc - helper layers (L4 in the example)
Note that all custom layers are simply subclasses of torch.nn.Module
and one can freely use them together
with official PyTorch layers.
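Since they are ordinary modules, custom layers can be chained inside a torch.nn.Module. Below is a minimal sketch of such a network (a hypothetical toy example; it assumes the AverageCollapse and SoftmaxAllocator layers described later in this document, including AverageCollapse's collapse_dim parameter):
import torch

from deepdow.layers import AverageCollapse, SoftmaxAllocator

class ToyNet(torch.nn.Module):
    """Collapse channels, then timesteps, then allocate via softmax."""

    def __init__(self):
        super().__init__()
        self.collapse_channels = AverageCollapse(collapse_dim=1)
        self.collapse_time = AverageCollapse(collapse_dim=1)
        self.allocate = SoftmaxAllocator()

    def forward(self, x):
        # x has shape (n_samples, n_channels, lookback, n_assets)
        h = self.collapse_channels(x)  # (n_samples, lookback, n_assets)
        h = self.collapse_time(h)      # (n_samples, n_assets)
        return self.allocate(h)        # (n_samples, n_assets) weights

network = ToyNet()
x = torch.rand(2, 4, 20, 11)
w = network(x)
assert torch.allclose(w.sum(1), torch.ones(2))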
Warning
Almost all deepdow
layers assume that the input and output tensors have an extra dimension
in the front—the sample dimension. We often omit this dimension on purpose to make the examples
and sketches simpler.
Transform layers¶
Transform layers are supposed to extract useful features from input tensors. For the exact usage see the deepdow.layers.transform module.
Conv¶
This layer supports both 1D and 2D convolution, controlled via the method parameter.
In the forward pass we need to provide tensors of shape (n_samples, n_input_channels, lookback)
for '1D', resp. (n_samples, n_input_channels, lookback, n_assets) for '2D'.
The padding is automatically derived from kernel_size such that the output tensor has the same
size as the input (exactly for odd kernel_size, approximately for even).
import torch

from deepdow.layers import Conv

n_samples, n_input_channels, lookback, n_assets = 2, 4, 20, 11
n_output_channels = 8

x = torch.rand(n_samples, n_input_channels, lookback, n_assets)
layer = Conv(n_input_channels=n_input_channels,
             n_output_channels=n_output_channels,
             kernel_size=3,
             method='1D')

# Apply the same Conv1D layer to all assets
result = torch.stack([layer(x[..., i]) for i in range(n_assets)], dim=-1)

assert result.shape == (n_samples, n_output_channels, lookback, n_assets)
RNN¶
This layer runs the same recurrent network over all assets and then stacks the hidden states back together.
It provides both the standard RNN and LSTM; the choice is controlled via the cell_type parameter.
The user specifies the number of output channels via hidden_size. This number corresponds to the
actual hidden state dimensionality if bidirectional=False; otherwise it is one half of it.
import torch

from deepdow.layers import RNN

n_samples, n_input_channels, lookback, n_assets = 2, 4, 20, 11
hidden_size = 8

x = torch.rand(n_samples, n_input_channels, lookback, n_assets)
layer = RNN(n_channels=n_input_channels,
            hidden_size=hidden_size,
            cell_type='LSTM')
result = layer(x)

assert result.shape == (n_samples, hidden_size, lookback, n_assets)
Warp¶
This layer is inspired by the problem of time series alignment (see [Weber2019]).
It allows the user to specify per-asset 1D transformations with which to warp the input tensor x.
Note that Zoom is a special case. The tform tensor should mostly have values
in (-1, 1), where -1 represents the beginning of the time series and 1 represents the end
(the most recent observations). This layer has two modes based on the shape of the provided tform:
- tform.shape = (n_samples, lookback, n_assets) - each asset is warped differently
- tform.shape = (n_samples, lookback) - all assets are warped the same way
import torch

from deepdow.layers import Warp

n_samples, n_channels, lookback, n_assets = 2, 4, 20, 11

x = torch.rand(n_samples, n_channels, lookback, n_assets)
# Strictly monotonic transformation mapping [0, 1] onto (-1, 1)
single_tform = (torch.linspace(0, 1, steps=lookback) ** 2 - 0.5) * 2
tform = torch.stack(n_samples * [single_tform], dim=0)  # shared across assets

layer = Warp()
result = layer(x, tform)

assert result.shape == (n_samples, n_channels, lookback, n_assets)
Note that to prevent folding one should provide strictly monotonic transformations.
See also
Example Warp layer
Zoom¶
Inspired by the Spatial Transformer Network [Jaderberg2015], this layer allows one to dynamically zoom in
and out along the lookback (time) dimension of the input x. In other words,
it performs dynamic time warping (with a linear transformation). Providing
a scale of 1 makes no changes. If one provides scale < 1, e.g. 0.5, then time is slowed down by a factor of two
and only the lookback/2 most recent timesteps are considered. Conversely, if one provides scale > 1,
e.g. 2, then time is sped up by a factor of two and 2 * lookback timesteps are considered. Since
we only have lookback timesteps available in x, we employ padding (see below).

The method parameter determines which interpolation is used (either 'bilinear' or 'nearest').
The padding_method parameter controls what to do with values that fall outside of the grid
(which happens when scale > 1). The options are 'zeros', 'border' and 'reflection'.
import torch

from deepdow.layers import Zoom

n_samples, n_channels, lookback, n_assets = 2, 4, 20, 11

x = torch.rand(n_samples, n_channels, lookback, n_assets)
scale = torch.rand(n_samples)  # values in (0, 1) represent slowing down

layer = Zoom()
result = layer(x, scale)

assert result.shape == (n_samples, n_channels, lookback, n_assets)
See also
Example Zoom layer
Collapse layers¶
Collapse layers remove an entire dimension of the input via some aggregation scheme. For the exact usage see the deepdow.layers.collapse module.
AttentionCollapse¶
AverageCollapse¶
ElementCollapse¶
ExponentialCollapse¶
MaxCollapse¶
SumCollapse¶
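The names describe the aggregation scheme: attention-weighted average, plain mean, single-element selection, exponentially weighted average, maximum and sum, respectively. Below is a minimal usage sketch with AverageCollapse, assuming (as in the deepdow.layers.collapse module) that the dimension to be removed is selected via the collapse_dim parameter:
import torch

from deepdow.layers import AverageCollapse

n_samples, n_channels, lookback, n_assets = 2, 4, 20, 11
x = torch.rand(n_samples, n_channels, lookback, n_assets)

# Collapse the lookback (time) dimension via a simple mean
layer = AverageCollapse(collapse_dim=2)
result = layer(x)

assert result.shape == (n_samples, n_channels, n_assets)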
Allocation layers¶
For the exact usage see deepdow.layers.allocate module.
AnalyticalMarkowitz¶
The AnalyticalMarkowitz layer has two modes. If the user provides only the covariance matrix
\(\boldsymbol{\Sigma}\), it returns the minimum variance portfolio. However, if one additionally supplies the
expected return vector \(\boldsymbol{\mu}\), then it computes the tangency portfolio (also known as the
maximum Sharpe ratio portfolio). Note that the risk-free rate is assumed to be zero.
Also note that this allocator cannot enforce any additional constraints, e.g. a maximum weight per asset. For more details and derivations see [LectureNotes].
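For reference, the two closed-form solutions are \(\textbf{w} = \frac{\boldsymbol{\Sigma}^{-1}\textbf{1}}{\textbf{1}^{T}\boldsymbol{\Sigma}^{-1}\textbf{1}}\) (minimum variance) and \(\textbf{w} = \frac{\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}}{\textbf{1}^{T}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}}\) (tangency). Below is a minimal usage sketch; it assumes the forward pass takes a batched covariance matrix and an optional batched returns tensor:
import torch

from deepdow.layers import AnalyticalMarkowitz

n_samples, n_assets = 2, 5
covmat = torch.stack([torch.eye(n_assets) for _ in range(n_samples)])
rets = torch.rand(n_samples, n_assets)

layer = AnalyticalMarkowitz()
w_minvar = layer(covmat)  # minimum variance portfolio
w_tangency = layer(covmat, rets)  # tangency (maximum Sharpe ratio) portfolio

assert w_minvar.shape == (n_samples, n_assets)
assert torch.allclose(w_minvar.sum(1), torch.ones(n_samples))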
NCO¶
The NCO
allocator is heavily inspired by Nested Cluster Optimization proposed in [Prado2019]. The main
idea is to group assets into n_clusters
different clusters and use AnalyticalMarkowitz
inside each of
them. In the second step, we compute asset allocation across these n_clusters
new portfolios. Note that
the clustering is currently done via the KMeans
layer (see KMeans).
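A minimal usage sketch, assuming a forward pass analogous to AnalyticalMarkowitz (a batched covariance matrix plus an optional returns tensor):
import torch

from deepdow.layers import NCO

n_samples, n_assets = 2, 6
# Random symmetric positive definite covariance matrices
b = torch.rand(n_samples, n_assets, n_assets)
covmat = b @ b.transpose(1, 2) + torch.eye(n_assets)
rets = torch.rand(n_samples, n_assets)

layer = NCO(n_clusters=2)
w = layer(covmat, rets=rets)

assert w.shape == (n_samples, n_assets)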
NumericalMarkowitz¶
While AnalyticalMarkowitz
gives us the benefit of analytical solutions, it does not allow for any additional
constraints. NumericalMarkowitz
is a generic convex optimization solver built on top of cvxpylayers
(see [Agrawal2019] for more details). The statement of the problem is shown below. It is motivated by [Bodnar2013].
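A reconstruction of this problem from the parameters described in this section (the exact form of the regularization term is an assumption) reads
\[\begin{aligned}\max_{\textbf{w}} \quad & \boldsymbol{\mu}^{T}\textbf{w} - \gamma \textbf{w}^{T}\boldsymbol{\Sigma}\textbf{w} - \alpha {\lVert \textbf{w} \rVert}_{2} \\ \textrm{s.t.} \quad & \sum_{i=1}^{N} w_i = 1 \\ \quad & 0 \le w_i \le w_{\text{max}}, \quad i \in \{1, \dots, N\}\end{aligned}\]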
The user needs to provide n_assets
(\(N\) in the above formulation) and max_weight
(\(w_{\text{max}}\)) when constructing this layer. To perform a forward pass one passes the following
tensors (batched along the sample dimension):
- rets - corresponds to the expected returns vector \(\boldsymbol{\mu}\)
- covmat_sqrt - corresponds to a (matrix) square root of the covariance matrix \(\boldsymbol{\Sigma}\)
- gamma_sqrt - corresponds to a square root of \(\gamma\) and controls risk aversion
- alpha - corresponds to \(\alpha\) and determines the regularization power. Internally, its absolute value is used to prevent sign changes.
Warning
The major downside of using this allocator is a significant decrease in speed.
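A minimal usage sketch, assuming the forward pass takes the four tensors in the order listed above:
import torch

from deepdow.layers import NumericalMarkowitz

n_samples, n_assets = 2, 5
layer = NumericalMarkowitz(n_assets=n_assets, max_weight=0.5)

rets = torch.rand(n_samples, n_assets)
covmat_sqrt = torch.stack([torch.eye(n_assets) for _ in range(n_samples)])
gamma_sqrt = torch.ones(n_samples)
alpha = torch.ones(n_samples)

w = layer(rets, covmat_sqrt, gamma_sqrt, alpha)

assert w.shape == (n_samples, n_assets)
assert torch.allclose(w.sum(1), torch.ones(n_samples), atol=1e-4)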
Resample¶
The Resample layer is inspired by [Michaud2007]. It is a metaallocator that expects an instance
of a base allocator as input. Currently supported base allocators are:
- AnalyticalMarkowitz
- NCO
- NumericalMarkowitz
The premise of this metaallocator is that \(\boldsymbol{\mu}\) and \(\boldsymbol{\Sigma}\) are just noisy
estimates of their population counterparts. Parametric bootstrapping is therefore applied. We sample
n_portfolios * n_draws new vectors from the distribution
\(\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\) and from each group of n_draws samples we create
one pair of estimates, yielding \(\boldsymbol{\mu}_{1}, ...,\boldsymbol{\mu}_{\text{n_portfolios}}\) and
\(\boldsymbol{\Sigma}_{1}, ..., \boldsymbol{\Sigma}_{\text{n_portfolios}}\). We then run the base allocator for each of
the pairs, which results in multiple allocations \(\textbf{w}_{1}, ...,\textbf{w}_{\text{n_portfolios}}\).
The final allocation is simply their average \(\textbf{w} = \frac{1}{\text{n_portfolios}}\sum_{i=1}^{\text{n_portfolios}}\textbf{w}_i\).
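A minimal usage sketch, assuming Resample receives the base allocator at construction (together with n_draws and n_portfolios) and mirrors its forward signature:
import torch

from deepdow.layers import AnalyticalMarkowitz, Resample

n_samples, n_assets = 2, 5
covmat = torch.stack([torch.eye(n_assets) for _ in range(n_samples)])
rets = torch.rand(n_samples, n_assets)

layer = Resample(AnalyticalMarkowitz(), n_draws=10, n_portfolios=5)
w = layer(covmat, rets=rets)

assert w.shape == (n_samples, n_assets)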
SoftmaxAllocator¶
Inspired by portfolio optimization with reinforcement learning (e.g. [Jiang2017]), the SoftmaxAllocator
performs a softmax over the input. Additionally, one can provide a custom temperature: either a single
temperature at construction that is shared across all samples, or a per-sample temperature during the
forward pass. For logits \(\textbf{x}\) and temperature \(T\) the weights are
\[w_i = \frac{e^{x_i / T}}{\sum_{j=1}^{N} e^{x_j / T}}\]
This formulation (formulation='analytical') has a closed form. One can also obtain the same weights
by solving a convex optimization problem (formulation='variational'); see [Agrawal2019] and
[Martins2017] for more details:
\[\begin{aligned}\max_{\textbf{w}} \quad & \textbf{x}^{T}\textbf{w} + T H(\textbf{w}) \\ \textrm{s.t.} \quad & \sum_{i=1}^{N} w_i = 1 \\ \quad & 0 \le w_i \le w_{\text{max}}\end{aligned}\]
where \(H(\textbf{w})=-\sum_{i=1}^{N} w_i \log(w_i)\) is the entropy. Note that if
max_weight is set to 1 then one recovers the unconstrained (analytical) softmax. The benefit of
using the variational formulation is that the user can choose any max_weight from (0, 1].
import torch

from deepdow.layers import SoftmaxAllocator

layer = SoftmaxAllocator(temperature=None)

x = torch.tensor([[1, 2.3], [2, 4.2]])
temperature = torch.tensor([0.2, 1])  # per-sample temperature
w = layer(x, temperature=temperature)

assert w.shape == (2, 2)
assert torch.allclose(w.sum(1), torch.ones(2))
See also
Example Softmax and Sparsemax
SparsemaxAllocator¶
Suggested in [Martins2016], this allocator is similar to softmax but enforces sparsity. It currently uses
cvxpylayers as a backend. Mathematically, for logits \(\textbf{x}\) and temperature \(T\), the weights solve
\[\begin{aligned}\min_{\textbf{w}} \quad & {\lVert \textbf{w} - \textbf{x}/T \rVert}_{2}^{2} \\ \textrm{s.t.} \quad & \sum_{i=1}^{N} w_i = 1 \\ \quad & 0 \le w_i \le w_{\text{max}}\end{aligned}\]
Similarly to SoftmaxAllocator, one can provide the temperature either per sample during the forward
pass or as a single value at construction. Additionally, one can control the maximum weight via the
max_weight parameter.
import torch

from deepdow.layers import SparsemaxAllocator

n_assets = 3
layer = SparsemaxAllocator(n_assets, temperature=1)

x = torch.tensor([[1, 2.3, 2.1], [2, 4.2, -1.1]])
w = layer(x)
w_true = torch.tensor([[-1.2650e-10, 6.0000e-01, 4.0000e-01],
                       [-2.9905e-10, 1.0000e+00, 4.2659e-10]])

assert w.shape == (2, 3)
assert torch.allclose(w.sum(1), torch.ones(2))
assert torch.allclose(w, w_true, atol=1e-5)
See also
Example Softmax and Sparsemax
WeightNorm¶
This allocation layer is supposed to be the simplest layer that could be used as a benchmark.
The goal is to fix the number of assets n_assets and for each of them learn a non-negative
value \(\hat{w}_i\) that represents the unnormalized weight. The final allocation is then simply
computed as
\[w_i = \frac{\hat{w}_i}{\sum_{j=1}^{N} \hat{w}_j}\]
import torch

from deepdow.layers import WeightNorm

n_samples, n_assets = 2, 5
layer = WeightNorm(n_assets)

# Only the shape of the input matters; the learned weights do not depend on its values
x = torch.rand(n_samples, n_assets)
w = layer(x)

assert w.shape == (n_samples, n_assets)
assert torch.allclose(w.sum(1), torch.ones(n_samples))
assert torch.allclose(w[0], w[1])
Misc layers¶
For the exact usage see the deepdow.layers.misc module.
Cov2Corr¶
Conversion of a covariance matrix \(\boldsymbol{\Sigma}\) into a correlation matrix \(C\), i.e. \(C_{ij} = \Sigma_{ij} / \sqrt{\Sigma_{ii}\Sigma_{jj}}\).
import torch

from deepdow.layers import Cov2Corr

layer = Cov2Corr()
covmat = torch.tensor([[[4, 3], [3, 9.0]]])
corrmat = layer(covmat)

assert torch.allclose(corrmat, torch.tensor([[[1.0, 0.5], [0.5, 1.0]]]))
CovarianceMatrix¶
Computes a sample covariance matrix. One can also apply shrinkage, i.e.
\[\hat{\boldsymbol{\Sigma}} = \delta F + (1 - \delta) S\]
where \(F\) is a highly structured matrix, \(S\) is the sample covariance matrix, and the constant
\(\delta\) (shrinkage_coef in the constructor) determines how we weigh the two matrices.
See [Ledoit2004] for additional background. deepdow offers
multiple preset matrices \(F\) that can be selected via the shrinkage_strategy parameter:
- None - no shrinkage applied (can lead to a non-PSD matrix)
- diagonal - diagonal of \(S\) with off-diagonal elements set to zero
- identity - identity matrix
- scaled-identity - diagonal filled with the average variance in \(S\) and off-diagonal elements set to zero
After performing shrinkage, one can also compute the (matrix) square root of the shrunken matrix. This is controlled
by the boolean sqrt parameter.
Note
One can also omit the shrinkage_coef
in the constructor (shrinkage_coef=None
) and
pass it dynamically as a torch.Tensor
during a forward pass.
import torch

from deepdow.layers import CovarianceMatrix

torch.manual_seed(3)
x = torch.rand(1, 10, 3) * 100

layer = CovarianceMatrix(sqrt=False)
layer_sqrt = CovarianceMatrix(sqrt=True)

covmat = layer(x)
covmat_sqrt = layer_sqrt(x)

assert torch.allclose(covmat[0], covmat_sqrt[0] @ covmat_sqrt[0], atol=1e-2)
KMeans¶
A version of the well-known clustering algorithm. The deepdow
interface is very similar to the one of
scikit-learn [sklearnkmeans]. Most importantly, one needs to decide on the n_clusters
.
import torch

from deepdow.layers import KMeans

x = torch.tensor([[0, 0], [0.5, 0], [0.5, 1], [1, 1.0]])
manual_init = torch.tensor([[0, 0], [1, 1]])

kmeans_layer = KMeans(n_clusters=2, init='manual')
cluster_ixs, cluster_centers = kmeans_layer(x, manual_init=manual_init)

assert torch.allclose(cluster_ixs, torch.tensor([0, 0, 1, 1]))
Warning
This layer does not support an additional (sample) dimension. Batching can be implemented by a naive for loop and stacking.
References¶
- LectureNotes
http://faculty.washington.edu/ezivot/econ424/portfolioTheoryMatrix.pdf
- Prado2019
Lopez de Prado, M. (2019). A Robust Estimator of the Efficient Frontier. Available at SSRN 3469961.
- Jiang2017
Jiang, Zhengyao, and Jinjun Liang. “Cryptocurrency portfolio management with deep reinforcement learning.” 2017 Intelligent Systems Conference (IntelliSys). IEEE, 2017.
- Weber2019
Weber, Ron A. Shapira, et al. “Diffeomorphic Temporal Alignment Nets.” Advances in Neural Information Processing Systems. 2019.
- Agrawal2019
Agrawal, Akshay, et al. “Differentiable convex optimization layers.” Advances in Neural Information Processing Systems. 2019.
- Michaud2007
Michaud, Richard O., and Robert Michaud. “Estimation error and portfolio optimization: a resampling solution.” Available at SSRN 2658657 (2007).
- Martins2016
Martins, Andre, and Ramon Astudillo. “From softmax to sparsemax: A sparse model of attention and multi-label classification.” International Conference on Machine Learning. 2016.
- Ledoit2004
Ledoit, Olivier, and Michael Wolf. “Honey, I shrunk the sample covariance matrix.” The Journal of Portfolio Management 30.4 (2004): 110-119.
- sklearnkmeans
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
- Martins2017
Martins, André FT, and Julia Kreutzer. “Learning what’s easy: Fully differentiable neural easy-first taggers.” Proceedings of the 2017 conference on empirical methods in natural language processing. 2017.
- Bodnar2013
Bodnar, Taras, Nestor Parolya, and Wolfgang Schmid. “On the equivalence of quadratic optimization problems commonly used in portfolio theory.” European Journal of Operational Research 229.3 (2013): 637-644.
- Jaderberg2015
Jaderberg, Max, Karen Simonyan, and Andrew Zisserman. “Spatial transformer networks.” Advances in neural information processing systems. 2015.