*Labels* are ordinary columns of the dataset which play a special role for ML-features.
So what is an ML-feature? It is a feature which is defined in terms of other features and generates one or more new features with new values.
What makes it special is how it is configured. Normal features are configured by static parameters from their definition,
whereas ML-features are configured by parameters stored in their *ML-model*.
What is the difference between static parameters and parameters in an ML-model?
The main difference is that parameters in normal feature definitions are provided by domain experts, possibly based on some ad-hoc analysis.
For example, we might want to use a 20-day moving average, so we write this number as a parameter in the feature definition.
If the results are not good enough, we could try another value such as a 30-day moving average for this feature.
For ML-models, the parameter values are found automatically by a special process called *training*.
Training is a special run mode which signals to ML-features that they must derive their parameters themselves by analyzing
the available data and store the optimal parameters they find in the ML-model (as opposed to reading parameters from the feature definition).
The ML-feature detects this 'train' mode, uses the available data to find optimal parameters, and then uses these parameters
to generate its output. The output of ML-features is called *predictions*.
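
The contrast between the two kinds of parameters can be sketched in plain Python. This is only an illustration and not how ITB implements features: the function names `moving_average` and `linear_ml_feature`, the `model` dictionary, and the choice of a least-squares line as the ML-algorithm are all assumptions made for this example.

```python
from statistics import mean


def moving_average(values, window):
    """Normal feature: `window` is a static parameter taken from the definition."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # not enough history yet
        else:
            out.append(mean(values[i + 1 - window:i + 1]))
    return out


def linear_ml_feature(x, model, train=False, y=None):
    """ML-feature (hypothetical): its parameters live in `model`, not in the definition."""
    if train:
        # Train mode: find the parameters from data (x) and labels (y)
        # by a least-squares fit of y = slope * x + intercept.
        mx, my = mean(x), mean(y)
        slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
        model["slope"] = slope
        model["intercept"] = my - slope * mx
    # In both modes the stored parameters are used to generate predictions.
    return [model["slope"] * a + model["intercept"] for a in x]


ma = moving_average([1.0, 2.0, 3.0, 4.0], window=2)  # window chosen by a human

model = {}
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]  # labels, needed only for training
train_predictions = linear_ml_feature(x, model, train=True, y=y)
new_predictions = linear_ml_feature([5.0], model)  # predict mode: no labels required
```

Changing the moving-average window means editing the definition by hand, whereas the slope and intercept change only when the model is trained again on new data.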
Importantly, training an ML-algorithm normally (but not always) requires training data represented by
features and true data represented by labels. Labels can be viewed as normal columns of the training dataset which
have a special role during training but are neither needed nor used during prediction. In time-series analysis,
labels normally represent future events. For example, a label column can be equal to a future value of some existing feature column like price.
In this case, the label column is generated by shifting the price column so that each row has its current price as a normal feature
and a future value of this same feature as a label. They have to be in the same row in order for the forecasting algorithm
to be able to train its parameters. In more complex cases, a label could describe a more complex event, for example, whether the price
increases by 3% during the next 10 days.
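
Both kinds of labels can be sketched in a few lines of plain Python. The helper names `future_value_label` and `increase_label` are hypothetical, and rows without enough future data are marked with `None` as an assumption of this sketch. In a real pipeline the same logic would typically be applied to data-frame columns.

```python
def future_value_label(prices, shift):
    """Label = value of the same column `shift` rows ahead (None where unknown)."""
    n = len(prices)
    return [prices[i + shift] if i + shift < n else None for i in range(n)]


def increase_label(prices, horizon, pct):
    """True if the price rises by at least `pct` percent within the next `horizon` rows."""
    labels = []
    for i, current in enumerate(prices):
        future = prices[i + 1:i + 1 + horizon]
        if len(future) < horizon:
            labels.append(None)  # not enough future data to decide
        else:
            labels.append(max(future) >= current * (1 + pct / 100))
    return labels


prices = [100.0, 101.0, 104.0, 103.0, 102.0]
next_price = future_value_label(prices, shift=1)        # simple shifted label
rises_3pct = increase_label(prices, horizon=2, pct=3.0)  # event-based label
```

Each row now holds the current price (a feature) next to a value derived from later rows (a label), which is the arrangement a forecasting algorithm needs for training.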
Technically, both features and labels are columns computed from other columns. Their main difference is that features
are computed from past data while labels are computed from future data. In fact, in some cases they could even be implemented as
one generator where the use of past or future data is controlled by a configuration parameter. For example, we could
implement the computation of a moving average column from either past or future data if the window length has a sign:
a positive window means using past data and a negative window means using future data
(and hence the result will be used as a label).
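
A minimal sketch of such a signed window follows. The function name `signed_moving_average` is hypothetical, and so are two conventions chosen here: a positive window includes the current row, while a negative window covers only the rows after it.

```python
from statistics import mean


def signed_moving_average(values, window):
    """window > 0: average of the current and previous rows (usable as a feature).
    window < 0: average of the following rows (usable only as a label)."""
    if window == 0:
        raise ValueError("window must be non-zero")
    size = abs(window)
    out = []
    for i in range(len(values)):
        if window > 0:
            chunk = values[max(0, i + 1 - size):i + 1]  # past data
        else:
            chunk = values[i + 1:i + 1 + size]          # future data
        out.append(mean(chunk) if len(chunk) == size else None)
    return out


values = [1.0, 2.0, 3.0, 4.0, 5.0]
feature = signed_moving_average(values, 2)   # undefined at the start of the data
label = signed_moving_average(values, -2)    # undefined at the end of the data
```

The feature is missing for the first rows and the label for the last rows, which reflects the fact that labels cannot be computed for the most recent data.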
## Label definitions
Labels are defined exactly like features, that is, each label has a definition with the same structure as a feature definition.
However, since labels have a special role and are needed only in *train* mode, they are defined in a separate section:
```jsonc
"label_sets": [
{...}, // First label
{...}, // Second label
{...} // Third label
]
```
In train mode, that is, if `"train": true` is set in the configuration file, all columns corresponding to the label definitions will be generated.
All feature columns and label columns will then be used by ML-features to train their models. These models will then be used
to generate output features as predictions.
Label columns are computed by generators and every label definition has a `generator` attribute.
ITB has its own built-in generators like `topbot2` and `highlow2` described below.
It is also possible to define a custom label generator in exactly the same way as for custom features.
`topbot2` is a label generator which produces binary values by processing future and past values of one input column.
This input column is normally price but it could be any other numeric column.
For example, it could be a technical indicator like moving average.
The goal is to predict some binary event based on the future and past behavior of this input column.
The input column name is specified in the `columns` attribute, for example: `columns: "price"`.
This label generator returns either true or false depending on whether the current value is a maximum or a minimum *relative* to its neighbors. Whether it marks all maxima or all minima (as true values in both cases) is chosen via the `function` attribute, which is equal to either `top` (for finding all maxima) or `bot` (for finding all minima). For example, if `function: "top"`, then the computed label column will be true if the price reaches a *relative* maximum in this row.
A relative maximum (top) means that the value exceeds the minimums to its left and right by at least a certain amount.
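
One possible reading of this definition is sketched below. It is not the algorithm used by `topbot2`: the function name `top_labels` and the parameters `window` (how many neighbors to inspect on each side) and `level` (the required excess as a fraction) are assumptions introduced only for illustration, and the built-in generator may use a different rule and different attributes.

```python
def top_labels(values, window, level):
    """True where values[i] is the highest value among `window` neighbors on each side
    and exceeds the lowest value on each side by at least `level` (0.02 means 2%)."""
    n = len(values)
    labels = [False] * n  # assumption: rows too close to the edges are never tops
    for i in range(window, n - window):
        left = values[i - window:i]
        right = values[i + 1:i + 1 + window]
        is_peak = values[i] >= max(left) and values[i] >= max(right)
        high_enough = (values[i] >= min(left) * (1 + level)
                       and values[i] >= min(right) * (1 + level))
        labels[i] = is_peak and high_enough
    return labels


prices = [100.0, 102.0, 105.0, 103.0, 101.0, 101.5, 101.0]
tops = top_labels(prices, window=2, level=0.02)
```

Only the row with price 105.0 is marked as a top in this data. With a window of 1, the small bump at 101.5 is a local maximum, but it is rejected because it does not exceed its neighbors by 2%.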
`highlow2` is a label generator which returns true or false values by checking whether the price increases (or decreases)
significantly enough during the specified future time horizon.
The generator has the following parameters specified in its configuration:
- `function` is either `high` or `low`:
  - `high` returns true if the price increases significantly and false otherwise
  - `low` returns true if the price decreases significantly and false otherwise
- `columns` is a list of 3 column names whose roles depend on their position in the list:
  - The first column provides the current (reference) price. Frequently (but not necessarily) it is the close price
  - The second column is used to detect price increases, that is, whether the price rose significantly relative to the reference price. Frequently (but not necessarily) it is the high price
  - The third column is used to detect price decreases, that is, whether the price fell significantly relative to the reference price. Frequently (but not necessarily) it is the low price
- `thresholds` is a list of percentage values which determine how much the price has to increase (or decrease) for the generator to return true. One output column is returned for each value
- `tolerance` is a factor applied to the thresholds which limits the price move in the direction opposite to the expected one, measured relative to the reference (close) price. If it is 1.0, the limit is equal to the threshold. If it is 0.5, the limit is 50% of the threshold. If the price moves in the opposite direction by more than this limit before the expected move happens, the generator returns false
- `horizon` is the number of future rows for which the analysis is performed. If the price does not reach the required level during this period, the generator returns false
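
The interplay of these parameters can be sketched in plain Python. This is an illustration of the description above and not the ITB implementation: the function name `highlow_label` is hypothetical, it handles one threshold at a time, and it assumes that the opposite move wins if both levels are crossed in the same row and that rows near the end of the data are simply evaluated on the rows that remain.

```python
def highlow_label(close, high, low, function, threshold, tolerance, horizon):
    """Binary label for a single threshold given in percent."""
    labels = []
    n = len(close)
    for i in range(n):
        ref = close[i]                # reference price (first column)
        move = ref * threshold / 100  # required move in price units
        limit = move * tolerance      # allowed move in the opposite direction
        result = False
        for j in range(i + 1, min(i + 1 + horizon, n)):
            if function == "high":
                opposite = low[j] < ref - limit
                reached = high[j] >= ref + move
            else:  # "low"
                opposite = high[j] > ref + limit
                reached = low[j] <= ref - move
            if opposite:
                break  # the opposite move came first: label stays false
            if reached:
                result = True
                break
        labels.append(result)
    return labels


close = [100.0, 101.0, 103.0, 104.5, 104.0]
high = [100.5, 102.5, 103.5, 105.0, 104.5]
low = [99.5, 100.5, 101.5, 103.0, 103.5]

# One output column per threshold, as with a `thresholds` list of two values
columns = {
    t: highlow_label(close, high, low, "high", t, tolerance=0.5, horizon=3)
    for t in (2.0, 4.0)
}
```

In this data the first row is labeled true for both thresholds, while the second row reaches the 2% level but not the 4% level within 3 rows.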
Here is an example of two label definitions which generate four binary output columns:
The first generator detects sufficiently large increases of the close price during the next 10 time intervals.
More specifically, it returns two columns: the first column detects 2% price increases (relative to the close price and using the high price) and the second column detects stronger increases of 4% (all other parameters are equal).
Importantly, if the price drops by 1% (tolerance 0.5) or by 2% (for the second output) *before* it reaches its
threshold, then the generator returns false.
The pair of `thresholds` and `tolerance` works similarly to take-profit (TP) and stop-loss (SL) levels in trading, which bound price
movements in opposite directions. What matters is which of these levels is reached first: if the threshold (TP) is reached before the tolerance (SL), and this happens within the horizon, then true is returned.
If tolerance is 0.0 then the label will return true only if the price never falls below the current value
before the threshold is reached. This happens rarely and actually we do not care if the price falls slightly
before increasing significantly. Therefore the tolerance parameter should specify some value which we consider a
*small random movement* or fluctuation which should be ignored when searching for significant changes.