# Labels

## Labels as features generated from future data

*Labels* are normal columns in the dataset which have a special use in the case of ML-features. So what is an ML-feature? It is a normal feature which is defined in terms of some other features and generates one or more new features with new values. What makes it specific is how ML-features are configured. Whereas normal features are configured by static parameters from their definition, ML-features are configured from parameters stored in their *ML-model*.

What is the difference between static parameters and parameters in an ML-model? The main difference is that parameters in normal feature definitions are provided by domain experts and possibly found through some ad-hoc analysis. For example, we might want to use a 20-day moving average, so we write this number as a parameter in the feature definition. If the results are not very good, we could try some other value, such as a 30-day moving average. In the case of ML-models, parameter values are found automatically during a special process called *training*. It is a special run mode which signals to ML-features that they need to generate their parameters themselves by analyzing the available data and storing the optimal parameters found in the ML-model (as opposed to having parameters in the feature definition). The ML-feature will detect this 'train' mode, use the available data to find optimal parameters, and use these parameters to generate the output. The output of ML-features is called *predictions*.

Importantly, training an ML-algorithm (normally but not always) requires training data represented by features and true data represented by labels. Labels can be viewed as normal columns of the training dataset which have a special role during training but are not needed and not used during prediction. In the case of time-series analysis, labels normally represent future events.
For example, a label column can be equal to a future value of some existing feature column like price. In this case, the label column is generated by shifting the price column so that each row has its current price as a normal feature and a future value of the same feature as a label. They have to be in the same row for the forecasting algorithm to be able to train its parameters. In more complex cases, a label could be a more complex event, for example, whether the price increases by 3% during the next 10 days.

Technically, both features and labels are columns computed from other columns. Their main difference is that features are computed from past data while labels are computed from future data. In fact, in some cases they could even be implemented as one feature where the use of past or future data is controlled by a configuration parameter. For example, we could implement the computation of a moving average column from either past or future data if the window length has a sign: a positive sign means using past data and a negative sign means using future data (and hence the column will be used as a label).

## Label definitions

Labels are defined exactly like features, that is, each label has a definition with the same structure as feature definitions. However, since labels have a special role and are needed only in *train* mode, they are defined in a special section:

```jsonc
"label_sets": [
  {...}, // First label
  {...}, // Second label
  {...}  // Third label
]
```

In train mode, that is, if `"train": true` in the configuration file, all columns corresponding to the label definitions will be generated. All feature columns and label columns will then be used by ML-features to train their models. These models will then be used to generate output features as predictions.

Label columns are computed by generators, and every label definition has a `generator` attribute. ITB has its own built-in generators like `topbot2` and `highlow2` described below.
However, it is also possible to define a custom label generator in exactly the same way as for custom features.

## `topbot2` label generator

`topbot2` is a label generator which produces binary values by processing future and past values of one input column. This input column is normally price, but it could be any other numeric column. For example, it could be a technical indicator like a moving average. The goal is to predict some binary event based on the future and past behavior of this input column. The input column name is specified in the `columns` attribute, for example: `columns: "price"`.

This label generator returns either true or false depending on whether the current value is a maximum *relative* to its neighbors or a minimum *relative* to its neighbors. The choice of whether it returns all maxima (as true values) or all minima (also as true values) is made via the `function` attribute, which is equal to either `top` (for finding all maxima) or `bot` (for finding all minima). For example, if `function: "top"`, then the computed label column will be true if the price takes a *relative* maximum in this row. A relative maximum (top) means that the value is greater than its left and right minima by a certain amount.

The algorithm needs to find all minima and maxima. However, not all of them are selected. The algorithm selects only those maxima which are surrounded by two minima (on the left and on the right) and for which both differences are big enough. The minimum required difference between two adjacent extrema is specified in the `level` attribute. For example (for finding all tops), if `level: 0.02` and the algorithm labels the current moment as true in the output, then this means that there exists a past *minimum* (bottom) with the price lower by 2% and a future minimum also lower by 2%. Note that the existence of these two minima also means that they both have their own adjacent maxima, which are higher by 2%.
Note also that the minimum level between an adjacent minimum and maximum is specified as a fraction (not as a percentage). This value is always positive.

Once all maximum and minimum values with the required distance between adjacent values are found, they are marked as true. However, sometimes the direct neighbors are almost equal to the maximum or minimum. For example, the maximum price could be 45,678 while the previous or next price is 45,677, so it is only slightly lower. In this case, we might want to also treat such points as a top or bottom in the price development. Therefore, the generator provides the `tolerances` attribute, which is a list of numbers. Each tolerance is interpreted as a fraction of the level (and the level is a fraction of the price). For example, if the level is 0.1 (10% price change) and the tolerance is 0.2 (20% of the level), then the price difference for the tolerance is 0.02 (2%). If the top price found is 100.0, then its direct neighbors with a price between 98.0 and 100.0 are also marked as top (the return value is true). Note that the left and right minima must have a price of 90.0 or lower because the level is 0.1 (10% of the price).

Here is an example which produces two binary label columns:

```jsonc
"label_sets": [
  {
    "generator": "topbot2",
    "column_prefix": "",
    "feature_prefix": "",
    "config": {"columns": "close", "function": "top", "level": 0.02, "tolerances": [0.1], "names": ["top_2"]}
  },
  {
    "generator": "topbot2",
    "column_prefix": "",
    "feature_prefix": "",
    "config": {"columns": "close", "function": "bot", "level": 0.02, "tolerances": [0.1], "names": ["bot_2"]}
  }
]
```

The first label definition (in train mode) will generate a binary column which is true if the close price is a maximum relative to its left and right minima, with a minimum price change of 2% and a tolerance of 0.2% (0.1 of the level). The second label definition will find all bottom values with the same level and tolerance. The desired output column names are specified in the `names` attribute.
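
The tolerance arithmetic described above can be sketched in a few lines of Python. This is only an illustration and not ITB's implementation: the function name `mark_top_with_tolerance` is hypothetical, the position of the top is assumed to be already known, and "direct neighbors" is assumed to mean the contiguous run of rows on either side of the top whose price stays inside the tolerance band.

```python
from typing import List


def mark_top_with_tolerance(
    prices: List[float], top_idx: int, level: float, tolerance: float
) -> List[bool]:
    """Mark an already found top and its near-equal direct neighbors as true.

    Hypothetical helper: the tolerance is a fraction of the level,
    and the level is a fraction of the price.
    """
    top = prices[top_idx]
    # Lowest price still treated as "top", e.g. 100 * (1 - 0.1 * 0.2) = 98
    lower = top * (1.0 - level * tolerance)

    labels = [False] * len(prices)
    labels[top_idx] = True

    # Assumption: extend to the left while prices stay within the band
    i = top_idx - 1
    while i >= 0 and prices[i] >= lower:
        labels[i] = True
        i -= 1

    # Assumption: extend to the right in the same way
    i = top_idx + 1
    while i < len(prices) and prices[i] >= lower:
        labels[i] = True
        i += 1

    return labels


prices = [90.0, 97.0, 99.0, 100.0, 98.5, 95.0, 89.0]
print(mark_top_with_tolerance(prices, top_idx=3, level=0.1, tolerance=0.2))
# [False, False, True, True, True, False, False]
```

With level 0.1 and tolerance 0.2, the band below the top of 100.0 is 2% wide, so 99.0 and 98.5 are marked together with the top, while 97.0 and 95.0 are not.
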
These labels can be used to train a classification ML-algorithm. The algorithm (in predict mode) will predict the probability that the current price is a local maximum (top) and hence will drop significantly in the near future or, for the second label, that it is a local minimum (bottom) and will increase in the near future.

## `highlow2` label generator

This label generator returns true or false values by checking whether the price increases (or decreases) significantly enough during the specified future time horizon. The generator has the following parameters specified in its configuration:

- `function` is either `high` or `low`:
  - `high` will return true if the price increases significantly and false otherwise
  - `low` will return true if the price decreases significantly and false otherwise
- `columns` is a list of 3 column names whose role depends on their index (position in the list):
  - The first column is used to determine the current (reference) price. Frequently (but not necessarily) it is the close price
  - The second column is used to determine the price increase and whether the (reference) price has risen significantly. Frequently (but not necessarily) it is the high price
  - The third column is used to determine the price decrease and whether the (reference) price has fallen significantly. Frequently (but not necessarily) it is the low price
- `thresholds` is a list of percentage values which determine how much the price has to increase (or decrease) in order for the generator to return true. For each value, one output column is returned
- `tolerance` is a factor relative to the thresholds which is then used to determine the price change relative to the reference (close) price. For example, if it is 1.0, then it is equal to the threshold. If it is 0.5, then it is 50% of the threshold. It determines the allowed price move in the direction opposite to the expected one.
  If the price moves in this (opposite) direction by more than the specified tolerance (earlier than the expected move), then the generator returns false.
- `horizon` is the number of future rows for which the analysis is performed. If the price does not reach the required level during this period, then the generator returns false.

Here is an example of two label definitions which generate four binary output columns:

```jsonc
"label_sets": [
  {
    "generator": "highlow2",
    "column_prefix": "",
    "feature_prefix": "",
    "config": {
      "columns": ["close", "high", "low"],
      "function": "high",
      "thresholds": [2.0, 4.0],
      "tolerance": 0.5,
      "horizon": 10,
      "names": ["high_2", "high_3"]
    }
  },
  {
    "column_prefix": "",
    "generator": "highlow2",
    "feature_prefix": "",
    "config": {
      "columns": ["close", "high", "low"],
      "function": "low",
      "thresholds": [2.0, 4.0],
      "tolerance": 0.5,
      "horizon": 10,
      "names": ["low_2", "low_3"]
    }
  }
]
```

The first generator detects sufficiently large increases of the close price during the next 10 time intervals. More specifically, it returns two columns: the first column detects 2% price increases (relative to the close price and using the high price) and the second column detects stronger increases of 4% (all other parameters are equal). Importantly, if the price drops by 1% (tolerance 0.5), or by 2% for the second output, *before* it reaches its threshold, then the generator returns false.

The pair of `thresholds` and `tolerance` works similarly to TP and SL levels in trading, which define opposite price movements. What matters is which of these levels is reached first. If the threshold (TP) is reached before the tolerance (SL), and this happens within the horizon, then true is returned. If the tolerance is 0.0, then the label will be true only if the price never falls below the current value before the threshold is reached. This happens rarely, and in practice we do not care if the price falls slightly before increasing significantly.
Therefore, the tolerance parameter should be set to a value which we consider a *small random movement* or fluctuation, one that should be ignored when searching for significant changes.
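
The TP/SL-style logic described above can be illustrated with a minimal Python sketch for `function: "high"` and a single threshold. It is not ITB's implementation: the function name `high_label` is hypothetical, rows near the end of the data are assumed to simply see fewer future rows, and when both levels are crossed within the same row the adverse move is assumed to come first.

```python
from typing import List


def high_label(
    close: List[float],
    high: List[float],
    low: List[float],
    threshold_pct: float,
    tolerance: float,
    horizon: int,
) -> List[bool]:
    """For each row, true if the threshold is reached before the tolerance is broken.

    Hypothetical sketch of the 'high' case for one threshold value.
    """
    n = len(close)
    labels = []
    for i in range(n):
        ref = close[i]  # reference price of the current row
        up_level = ref * (1.0 + threshold_pct / 100.0)  # "TP"
        down_level = ref * (1.0 - threshold_pct * tolerance / 100.0)  # "SL"

        result = False
        # Look at most `horizon` rows into the future
        for j in range(i + 1, min(i + 1 + horizon, n)):
            # Assumption: the adverse move is checked first within a row
            if low[j] < down_level:
                break
            if high[j] >= up_level:
                result = True
                break
        labels.append(result)
    return labels


close = [100.0, 100.0, 100.0, 100.0]
high = [100.0, 101.0, 102.5, 100.0]
low = [100.0, 99.5, 100.0, 100.0]
print(high_label(close, high, low, threshold_pct=2.0, tolerance=0.5, horizon=10))
# [True, True, False, False]
```

In the first row the price dips to 99.5, which stays within the 1% tolerance, and then reaches 102.5, which is above the 2% threshold, so the label is true. If the dip went to 98.5 instead, the first label would become false because the tolerance level would be broken before the threshold is reached.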