# Trainable features
## Why to train?
A conventional feature produces an output column from an input column using manually specified parameters. For example, a feature computing the moving average of the close price has one parameter, which is the window size. This parameter is specified in the feature configuration and does not depend on anything else. Its value has to be known in advance and is typically chosen from experience. If we want to produce a feature equal to the deviation of the current price from the average price, then we need to know this average value. We could compute the average price manually by loading historic prices and then set it in the feature configuration. This configuration value would then be used during evaluation to compute the deviation.

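As an illustration, a conventional feature with a manually chosen parameter might look like this (a minimal sketch, not the project's actual implementation): the window size is fixed in advance and never learned from data.

```python
def moving_average(values, window_size):
    """Conventional feature: window_size is a fixed, manually chosen parameter."""
    out = []
    for i in range(len(values)):
        start = max(0, i - window_size + 1)  # Truncated window at the beginning
        window = values[start:i + 1]
        out.append(sum(window) / len(window))
    return out

# moving_average([1, 2, 3, 4], 2) -> [1.0, 1.5, 2.5, 3.5]
```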
In contrast to such conventional features, trainable features learn their parameters from the historic data. A set of all such parameters is referred to as a *model*. A model could be as simple as one number or as complex as a deep neural network. What matters is only that it is found by analyzing historic data, and this process is referred to as *training*. Therefore, features which use trained models will be referred to as trainable features or ML-features.

One example of a trainable feature is one whose goal is to find the average price from historic data. This model stores only one number: the average price value. Feature evaluation is based on this model and computes the deviation from the average value. In contrast to the previous example, where the average value has to be known in advance, now finding it is part of the whole procedure: the feature itself knows how to find its model parameters from the historic data before it can use these parameters for generating its output.

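The average-price example can be sketched as two functions (hypothetical code, not the project's API): training learns the model, a single number, from historic data; prediction applies it to compute deviations.

```python
def train(history):
    """Learn the model from historic data: here the model is a single number."""
    return {"mean": sum(history) / len(history)}

def predict(model, prices):
    """Use the trained model to generate the feature output: deviation from the mean."""
    return [p - model["mean"] for p in prices]

# model = train([10, 20, 30])      -> {"mean": 20.0}
# predict(model, [25, 15])         -> [5.0, -5.0]
```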
Trainable features work in two modes by implementing two types of logic:

- How to find feature parameters by analyzing historic data and generating a model
- How to use feature parameters (the model) to generate output values (predictions)

Advantages of trainable features:

- They allow for finding the best parameters rather than relying on intuition or an (educated) guess
- They can be regularly re-computed, so their values correspond to the latest history by following the drift
- They can learn much more complex dependencies than manually defined features

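Following the drift by regular re-training can be illustrated like this (a hypothetical sketch): the model is re-computed from only the most recent records, so its parameter tracks the latest history.

```python
def retrain_on_latest(history, train_length):
    """Re-train using only the latest train_length records so the model follows drift."""
    recent = history[-train_length:]
    return {"mean": sum(recent) / len(recent)}

# With history [1, 2, 3, 100, 200, 300] and train_length=3,
# only the recent regime [100, 200, 300] determines the model.
```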
Drawbacks of trainable features:

- They need a significant amount of historic data, which is either not available or not enough to train complex models (without overfitting)
- Manually defined features frequently represent the view of traders and domain experts. They are already compressed knowledge with a very informative representation. Such features can be rather difficult to learn automatically

Our approach combines both worlds: it provides a rich mechanism for defining manually parameterized features, and it makes it possible to define features which know how to automatically learn their parameters by applying statistical or machine learning algorithms.

## Defining trainable features

Trainable features are represented separately from other features, as a list where each item is a dictionary with one feature definition:

```jsonc
"train_feature_sets": [
  {...}, // First trainable feature
  {...}, // Second trainable feature
  {...}  // Third trainable feature
]
```

One feature definition is a dictionary with the following attributes:

```jsonc
{
  "generator": "train_features", // This generator is aware of the trainable character of the feature
  "columns": [],                 // Columns used to train and predict; if empty then the 'train_features' list is used
  "labels": ["bot_2", "top_2"],  // Columns used as labels for training
  "functions": [
    {
      "name": "mysvc", "algo": "svc",            // Arbitrary name and predefined algorithm type
      "params": {"is_scale": true, "length": 0}, // Preprocessing parameters
      "train": {"C": 1.0, "gamma": 0.005}        // Algorithm arguments
    }
  ]
}
```

Here is the purpose of the attributes:

- `generator`: a built-in generator function which knows how to evaluate trainable features
- `columns`: a list of column names which will be selected and used for training (without labels). If the list is empty, the columns from the top-level `train_features` attribute are used
- `labels`: column names which will be used as true values during training. If the list is empty, the columns from the top-level `labels` attribute are used
- `functions`: a list of algorithm descriptions, which are dictionaries with these attributes:
    - `name`: an arbitrary unique name of this algorithm entry. It will be used as a suffix in the predicted columns (along with the label name), so we know that a certain output column was generated by this algorithm (and some label)
    - `algo`: an algorithm type which resolves to a certain Python function. Currently there are these algorithm types:
        - `svc` - Support Vector Machines
        - `nn` - Neural Network
        - `lc` - Linear Classifier
        - `gb` - Gradient Boosting
    - `params`: parameters for the generator which are used to prepare the train data set:
        - `is_scale`: if true then all columns will be normalized
        - `length`: the number of records to use for training. If 0 then the top-level `train_length` is used; if that is also empty then all available data will be used for training
        - `every_nth_row`: allows us to select a smaller subset for training
        - `is_regression`: if true then the label is supposed to be a numeric value and a regression model is trained
    - `train`: a dictionary of parameters passed to the algorithm

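The `length` and `every_nth_row` parameters can be understood as selecting a training subset roughly like this (a sketch of the semantics described above, not the generator's actual code):

```python
def select_train_subset(rows, length=0, every_nth_row=1, train_length=0):
    """Select training records: the latest `length` rows (falling back to
    `train_length`, or all rows if both are 0), then keep every n-th row."""
    n = length or train_length
    subset = rows[-n:] if n else rows   # Latest n records, or everything
    return subset[::every_nth_row]      # Thin out the subset for faster training
```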
For each entry in `train_feature_sets`, the number of generated features is equal to the number of labels. Each output feature is named according to the schema `label_name`, where `label` is a value from the list of labels and `name` is the name of the algorithm entry in its `name` attribute. In the above example, there will be two features and two output columns: `bot_2_mysvc` and `top_2_mysvc`. The same names are used for the trained models, which are stored as files.

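This naming schema can be expressed as a one-liner: one output column per (label, algorithm-entry) pair.

```python
def output_column_names(labels, function_names):
    """Each output column is named '{label}_{name}', where name comes from the
    'name' attribute of an entry in the 'functions' list."""
    return [f"{label}_{name}" for name in function_names for label in labels]

# output_column_names(["bot_2", "top_2"], ["mysvc"]) -> ["bot_2_mysvc", "top_2_mysvc"]
```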
## Custom trainable features

A new way to define trainable features is compatible with how normal [features](features.md) are defined. Such feature definitions are placed in the `feature_sets` or `train_features` sections of the configuration, but not in the `train_feature_sets` section.

It is necessary to provide a custom generator function, but this function has to be aware of two modes: train and predict. It has to check whether it runs in train mode via the global binary `train` attribute. If it is true, then it has to train its model using the available data and then perform feature evaluation using this newly trained model. If it runs in predict mode (the `train` attribute is false), then it re-uses the previously trained model.

Here is an example of how such a trainable feature is defined if we want to train a model which finds the average value, which will then be used for finding the deviation of current values from it:

```jsonc
{
  "generator": "myextensions.stats:deviation_feature",
  "config": {               // This is passed to the generator function
    "columns": "close",     // For which column to find the average
    "function": "mean",     // Other functions could be supported, like median
    "names": "deviation",   // Predicted column name
    "parameters": {}        // Whatever other parameters we might need
  }
}
```

|
This generator function can be implemented as follows:

```python
def deviation_feature(df, config: dict, global_config: dict, model_store: ModelStore):
    column_name = config.get('columns')
    function = config.get('function')
    names = config.get('names')  # Output feature name
    if not names:
        names = f"{column_name}_{function}"

    # Load the model (None if it has not been trained yet)
    model_name = config.get('model_name', f"{names}")
    model = model_store.get_model(model_name)

    # Determine if training is needed before prediction
    is_train = global_config.get('train')

    # In train mode, find the model parameters from the data and store them
    # in the model (to be used below for prediction)
    if is_train:
        mean = df[column_name].mean()             # Find the mean value
        model = dict(mean=mean)                   # Create the model as a dict
        model_store.put_model(model_name, model)  # Submit and store the model persistently

    # Now we have a model: either loaded from the store or just trained. Do normal evaluation
    if function == 'mean':
        out_column = df[column_name] - model.get("mean")

    df[names] = out_column

    return df, [names]
```

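To see both modes in action, the generator can be driven with an in-memory stand-in for the model store (hypothetical; the real `ModelStore` persists models to files). The generator is repeated here in condensed form so the example is self-contained, and pandas is assumed to be available:

```python
import pandas as pd

class DictModelStore:
    """In-memory stand-in for the project's ModelStore (which persists models to files)."""
    def __init__(self):
        self._models = {}
    def get_model(self, name):
        return self._models.get(name)
    def put_model(self, name, model):
        self._models[name] = model

# Condensed copy of the generator above, so this example runs on its own
def deviation_feature(df, config, global_config, model_store):
    column, names = config["columns"], config["names"]
    model_name = config.get("model_name", names)
    model = model_store.get_model(model_name)
    if global_config.get("train"):
        model = {"mean": df[column].mean()}
        model_store.put_model(model_name, model)
    df[names] = df[column] - model["mean"]
    return df, [names]

store = DictModelStore()
config = {"columns": "close", "function": "mean", "names": "deviation"}

# Train mode: the mean (20.0) is learned from the data and stored in the model store
df, _ = deviation_feature(pd.DataFrame({"close": [10.0, 20.0, 30.0]}), config, {"train": True}, store)

# Predict mode: the previously trained model is re-used on new data
df2, _ = deviation_feature(pd.DataFrame({"close": [25.0]}), config, {"train": False}, store)
```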
What makes this code specific is the train section, which is executed if train mode is detected. In train mode, the data in the specified column is used to find the mean value, which is the only parameter of the model. After that, the column is transformed by subtracting this mean value, and the transformed column is returned as a new generated feature.

If the feature is executed in predict mode, then it works just like other normal features, but its model (the mean value) will be loaded from the model store.