When labels are created without a fixed horizon (e.g. with the triple-barrier labeling method), each label spans a different period, so samples can overlap with one another to varying degrees. Samples that overlap little with other samples are more unique and therefore more informative for the model. This matters most for machine learning models that bootstrap the training data by randomly sampling from the dataset. By default, samples are drawn according to a uniform distribution, so a sample that overlaps heavily with other samples is just as likely to be drawn as a unique sample that overlaps with none. Ideally, we would instead bootstrap the samples according to their uniqueness to get a more diverse bootstrapped dataset.
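The contrast between the two sampling schemes can be sketched as follows. This is a minimal illustration, assuming a per-sample average uniqueness value is already available (the values below mirror the ones quoted in the example at the end of this section) and assuming a simple draw probability proportional to uniqueness; that proportional weighting is an assumption made for illustration, not necessarily the exact scheme intended here.

```python
import random

# Assumed average uniqueness per sample (illustrative values).
average_uniqueness = [0.5, 0.83, 0.5]
sample_ids = list(range(len(average_uniqueness)))

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Standard bootstrap: every sample is equally likely to be drawn.
uniform_draw = rng.choices(sample_ids, k=len(sample_ids))

# Uniqueness-weighted bootstrap (assumed scheme): the probability of
# drawing a sample is proportional to its average uniqueness.
weighted_draw = rng.choices(
    sample_ids, weights=average_uniqueness, k=len(sample_ids)
)

# Draw probabilities implied by the weighted scheme.
total = sum(average_uniqueness)
probabilities = [u / total for u in average_uniqueness]
```

Under this weighting the second sample is drawn more often than either of the other two, whereas the uniform draw treats all three alike.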
Number of Concurrent Events
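To calculate the average uniqueness, we first have to calculate the number of concurrent events. This can be done with an indicator matrix $\{1_{t,i}\}$, in which the rows represent time periods $t = 1, \dots, T$ and the columns represent the samples $i = 1, \dots, I$. An entry $1_{t,i}$ equals 1 if the label of sample $i$ spans time period $t$, and 0 otherwise.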
In the above example the
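To calculate the number of concurrent events $c_t$ at time period $t$, we sum the indicator matrix entries $1_{t,i}$ over the columns (the samples $i = 1, \dots, I$):
$$c_t = \sum_{i=1}^{I} 1_{t,i}$$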
, in this example the number of concurrent events are
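The uniqueness of sample $i$ at time period $t$ can be calculated by dividing its indicator matrix entry $1_{t,i}$ by the number of concurrent events $c_t$ at that time:
$$u_{t,i} = \frac{1_{t,i}}{c_t}$$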
In this example, the uniqueness matrix is
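To calculate the average uniqueness of sample $i$, we take the average of its uniqueness values $u_{t,i}$ over the time periods in which the sample is active, i.e. where $1_{t,i} = 1$:
$$\bar{u}_i = \frac{\sum_{t=1}^{T} u_{t,i}}{\sum_{t=1}^{T} 1_{t,i}}$$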
In our example, the average uniquenesses of the samples are 0.5, 0.83, and 0.5 respectively.
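The steps above (indicator matrix, number of concurrent events, uniqueness, average uniqueness) can be sketched in a few lines of plain Python. The label spans below are hypothetical and were chosen only so that the result matches the average uniqueness values quoted above; the indicator matrix of the original example may well differ. The variable names (`spans`, `indicator`, `concurrency`, and so on) are likewise illustrative.

```python
# Hypothetical label spans as (first_period, last_period), inclusive.
# These are assumptions for illustration, not the example from the text.
spans = [(0, 0), (0, 5), (5, 5)]
n_periods = 6

# Indicator matrix: rows are time periods t, columns are samples i.
indicator = [
    [1 if start <= t <= end else 0 for (start, end) in spans]
    for t in range(n_periods)
]

# Number of concurrent events per period: sum each row over the samples.
concurrency = [sum(row) for row in indicator]

# Uniqueness of each sample at each period (0 where the sample is inactive).
uniqueness = [
    [cell / c if c > 0 else 0.0 for cell in row]
    for row, c in zip(indicator, concurrency)
]

# Average uniqueness: mean of a sample's uniqueness over its active periods.
average_uniqueness = []
for i in range(len(spans)):
    active_periods = sum(row[i] for row in indicator)
    total = sum(row[i] for row in uniqueness)
    average_uniqueness.append(total / active_periods)

print(concurrency)                               # [2, 1, 1, 1, 1, 2]
print([round(u, 2) for u in average_uniqueness])  # [0.5, 0.83, 0.5]
```

In this hypothetical setup the first and third samples are each active for a single period that they share with the second sample, so their uniqueness is 0.5. The second sample is alone for four of its six periods, which gives (0.5 + 4 + 0.5) / 6 ≈ 0.83.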