A conventional feature produces an output column from an input column using manually specified parameters.
For example, a feature computing the moving average of the close price has one parameter: the window size.
This parameter is specified in the feature configuration and does not depend on anything else.
Its value has to be known in advance and is typically chosen from experience.
If we want to produce a feature equal to the deviation of the current price from the average price,
then we need to know this average value. We can compute the average price manually by loading historic prices
and then set it in the feature configuration. This configuration value will then be used during evaluation
to compute the deviation.
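A minimal sketch of such conventional features using pandas (the function names and the hard-coded average are illustrative, not part of the actual framework):

```python
import pandas as pd

def moving_average_feature(df: pd.DataFrame, column: str, window: int) -> pd.Series:
    # Conventional feature: the window size is a manually specified parameter
    return df[column].rolling(window).mean()

def deviation_feature(df: pd.DataFrame, column: str, average: float) -> pd.Series:
    # The average must be computed in advance and set in the configuration
    return df[column] - average

prices = pd.DataFrame({"close": [10.0, 11.0, 12.0, 11.0, 10.0]})
ma = moving_average_feature(prices, "close", window=3)
dev = deviation_feature(prices, "close", average=10.8)  # 10.8 chosen manually
```

Note that nothing in the code learns anything: both parameters (`window`, `average`) come from the configuration.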
In contrast to such conventional features, trainable features learn their parameters from historic data.
The set of all such parameters is referred to as a *model*. A model can be as simple as one number or as complex as a deep neural network. All that matters is that it is found by analyzing historic data, and this process is referred to as *training*.
Features which use trained models will therefore be referred to as trainable features or ML-features.
One example of a trainable feature is one whose goal is to find the average price from historic data. Its model stores
only one number: the average price. Feature evaluation is based on this model and computes the deviation from the average.
In contrast to the previous example, where the average value has to be known in advance, the average value is now part
of the whole procedure: the feature itself knows how to find its model parameters from the historic data before it uses
these parameters to generate its output.
Trainable features work in two modes by implementing two types of logic:
- How to find feature parameters by analyzing historic data and generating a model
- How to use feature parameters (model) to generate its output values (predictions)
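The two modes can be illustrated with the average-price example above (the class and method names are a sketch, not the framework's actual API):

```python
import pandas as pd

class MeanDeviationFeature:
    # Illustrative trainable feature: its model is a single number (the mean price)

    def train(self, history: pd.Series) -> None:
        # Mode 1: find the feature parameter (the model) by analyzing historic data
        self.mean_ = history.mean()

    def predict(self, prices: pd.Series) -> pd.Series:
        # Mode 2: use the stored model to generate output values (predictions)
        return prices - self.mean_

feature = MeanDeviationFeature()
feature.train(pd.Series([10.0, 11.0, 12.0]))   # learned model: mean = 11.0
deviations = feature.predict(pd.Series([11.0, 14.0]))
```

The same object implements both kinds of logic, so no parameter has to be specified manually in the configuration.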
Advantages of trainable features:
- They allow for finding the best parameters rather than relying on intuition or an (educated) guess
- They can be regularly re-trained, so their values correspond to the latest history and follow the drift
- They can learn much more complex dependencies than manually defined features
Drawbacks of trainable features:
- They need a significant amount of historic data, which is either not available or not enough to train complex models (without overfitting)
- Manually defined features frequently represent the view of traders and domain experts. They are already compressed knowledge with a very informative representation. Such features can be rather difficult to learn automatically
Our approach combines both worlds: it provides a rich mechanism for defining manually parameterized features, and it makes it possible to define features which know how to automatically learn their parameters by applying statistical or machine learning algorithms.
## Defining trainable features
Trainable features are defined separately from other features, as a list where each item is a dictionary with one feature definition:
```jsonc
"train_feature_sets": [
{...}, // First trainable feature
{...}, // Second trainable feature
{...} // Third trainable feature
]
```
One feature definition is a dictionary with the following attributes:
```jsonc
{
    "generator": "train_features", // This generator is aware of the trainable character of the feature
    "columns": [], // Columns used to train and predict. If empty then the 'train_features' list is used
    "labels": ["bot_2", "top_2"], // Columns used as labels for training
    "functions": [
        {
            "name": "mysvc", "algo": "svc", // Arbitrary name and predefined algorithm type
            "params": {}, // Parameters for preparing the train data set
            "train": {} // Parameters passed to the training algorithm
        }
    ]
}
```
- `generator`: a built-in generator function which knows how to evaluate trainable features
- `columns`: a list of column names which will be selected and used for training (it lists all input columns, without labels). If the list is empty, the columns from the `train_features` list are used
- `labels`: a list of column names which will be used as true values during training. If the list is empty, the columns from the `labels` list are used
- `functions`: a list of algorithm descriptions, each of which is a dictionary with these attributes:
    - `name`: an arbitrary unique name of this algorithm entry. It is used as a suffix in the predicted column names (along with the label name), so we know which algorithm (and which label) generated a certain output column
    - `algo`: an algorithm type which resolves to a certain Python function. Currently these algorithm types exist:
        - `svc` - Support Vector Machines
        - `nn` - Neural Network
        - `lc` - Linear Classifier
        - `gb` - Gradient Boosting
    - `params`: parameters for the generator which are used to prepare the train data set:
        - `is_scale`: if true, all columns will be normalized
        - `length`: the number of records to use for training. If 0, then `train_length` is used; if that is empty, all available data is used for training
        - `every_nth_row`: allows us to select a smaller subset of rows for training
        - `is_regression`: if true, the label is supposed to be a numeric value and a regression model is trained
    - `train`: a dictionary of parameters passed to the algorithm
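How the `params` attributes could be applied when preparing the train data set can be sketched as follows (this is an illustrative interpretation, not the framework's actual implementation):

```python
import pandas as pd

def prepare_train_data(df: pd.DataFrame, length: int = 0,
                       every_nth_row: int = 1, is_scale: bool = False) -> pd.DataFrame:
    # 'length': use only the last N records (0 means use all available data)
    data = df if length == 0 else df.tail(length)
    # 'every_nth_row': select a smaller subset of rows for training
    data = data.iloc[::every_nth_row]
    # 'is_scale': normalize all columns
    if is_scale:
        data = (data - data.mean()) / data.std()
    return data

df = pd.DataFrame({"close": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})
subset = prepare_train_data(df, length=4, every_nth_row=2)  # rows 3.0 and 5.0
```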
For each entry in `train_feature_sets`, the number of generated features is equal to the number of labels (for each algorithm entry).
Each output feature is named according to the schema `{label}_{name}`, where `label` is a value from the list of labels and `name` is the `name` attribute of the algorithm entry.
In the above example, there will be two features and two output columns: `bot_2_mysvc` and `top_2_mysvc`.
The same names are used for the trained models, which are stored as files.
If the feature is executed in predict mode, then it works just like other (conventional) features, except that its model (e.g., the mean value in the earlier example) is loaded from the model store.
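The naming schema can be sketched in one line (values taken from the example above):

```python
labels = ["bot_2", "top_2"]
functions = [{"name": "mysvc", "algo": "svc"}]

# Output column (and model file) names follow the '{label}_{name}' schema
output_columns = [f"{label}_{fn['name']}" for fn in functions for label in labels]
```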