# Trainable features
## Why to train?
A conventional feature produces an output column from an input column using manually specified parameters. For example, a feature computing the moving average of the close price has one parameter, which is the window size. This parameter is specified in the feature configuration and does not depend on anything else. Its value has to be known in advance and is typically chosen from experience. If we want to produce a feature equal to the deviation of the current price from the average price, then we need to know this average value. We could compute the average price manually by loading historic prices and then set it in the feature configuration. This configuration value would then be used during evaluation to compute the deviation.

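As an illustration, a conventional feature with a manually chosen parameter might look like this (a minimal sketch, not the project's actual implementation): the window size is fixed in advance and never learned from data.

```python
def moving_average(values, window_size):
    """Conventional feature: window_size is a fixed, manually chosen parameter."""
    out = []
    for i in range(len(values)):
        start = max(0, i - window_size + 1)  # Truncated window at the beginning
        window = values[start:i + 1]
        out.append(sum(window) / len(window))
    return out

# moving_average([1, 2, 3, 4], 2) -> [1.0, 1.5, 2.5, 3.5]
```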
In contrast to such conventional features, trainable features learn their parameters from the historic data. A set of all such parameters is referred to as a *model*. A model could be as simple as one number or as complex as a deep neural network. What matters is only that it is found by analyzing historic data, and this process is referred to as *training*. Therefore, features which use trained models will be referred to as trainable features or ML-features.

One example of a trainable feature is one whose goal is to find the average price from historic data. This model stores only one number: the average price value. Feature evaluation is based on this model and computes the deviation from the average value. In contrast to the previous example, where the average value has to be known in advance, now finding it is part of the whole procedure: the feature itself knows how to find its model parameters from the historic data before it can use these parameters for generating its output.

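The average-price example can be sketched as two functions (hypothetical code, not the project's API): training learns the model, a single number, from historic data; prediction applies it to compute deviations.

```python
def train(history):
    """Learn the model from historic data: here the model is a single number."""
    return {"mean": sum(history) / len(history)}

def predict(model, prices):
    """Use the trained model to generate the feature output: deviation from the mean."""
    return [p - model["mean"] for p in prices]

# model = train([10, 20, 30])      -> {"mean": 20.0}
# predict(model, [25, 15])         -> [5.0, -5.0]
```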
Trainable features work in two modes by implementing two types of logic:

- How to find feature parameters by analyzing historic data and generating a model
- How to use feature parameters (the model) to generate output values (predictions)

Advantages of trainable features:

- They allow for finding the best parameters rather than relying on intuition or an (educated) guess
- They can be regularly re-computed, so their values correspond to the latest history by following the drift
- They can learn much more complex dependencies than manually defined features

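Following the drift by regular re-training can be illustrated like this (a hypothetical sketch): the model is re-computed from only the most recent records, so its parameter tracks the latest history.

```python
def retrain_on_latest(history, train_length):
    """Re-train using only the latest train_length records so the model follows drift."""
    recent = history[-train_length:]
    return {"mean": sum(recent) / len(recent)}

# With history [1, 2, 3, 100, 200, 300] and train_length=3,
# only the recent regime [100, 200, 300] determines the model.
```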
Drawbacks of trainable features:

- They need a significant amount of historic data, which is either not available or not enough to train complex models (without overfitting)
- Manually defined features frequently represent the view of traders and domain experts. They are already compressed knowledge with a very informative representation. Such features can be rather difficult to learn automatically

Our approach combines both worlds: it provides a rich mechanism for defining manually parameterized features, and it makes it possible to define features which know how to automatically learn their parameters by applying statistical or machine learning algorithms.

## Defining trainable features

Trainable features are represented separately from other features, as a list where each item is a dictionary with one feature definition:

```jsonc
"train_feature_sets": [
  {...}, // First trainable feature
  {...}, // Second trainable feature
  {...}  // Third trainable feature
]
```

One feature definition is a dictionary with the following attributes:

```jsonc
{
  "generator": "train_features", // This generator is aware of the trainable character of the feature
  "columns": [],                 // Columns used to train and predict; if empty then the 'train_features' list is used
  "labels": ["bot_2", "top_2"],  // Columns used as labels for training
  "functions": [
    {
      "name": "mysvc", "algo": "svc",            // Arbitrary name and predefined algorithm type
      "params": {"is_scale": true, "length": 0}, // Preprocessing parameters
      "train": {"C": 1.0, "gamma": 0.005}        // Algorithm arguments
    }
  ]
}
```

Here is the purpose of the attributes:

- `generator`: a built-in generator function which knows how to evaluate trainable features
- `columns`: a list of column names which will be selected and used for training (without labels). If the list is empty, the columns from the top-level `train_features` attribute are used
- `labels`: column names which will be used as true values during training. If the list is empty, the columns from the top-level `labels` attribute are used
- `functions`: a list of algorithm descriptions, which are dictionaries with these attributes:
    - `name`: an arbitrary unique name of this algorithm entry. It will be used as a suffix in the predicted columns (along with the label name), so we know that a certain output column was generated by this algorithm (and some label)
    - `algo`: an algorithm type which resolves to a certain Python function. Currently there are these algorithm types:
        - `svc` - Support Vector Machines
        - `nn` - Neural Network
        - `lc` - Linear Classifier
        - `gb` - Gradient Boosting
    - `params`: parameters for the generator which are used to prepare the train data set:
        - `is_scale`: if true then all columns will be normalized
        - `length`: the number of records to use for training. If 0 then the top-level `train_length` is used; if that is also empty then all available data will be used for training
        - `every_nth_row`: allows us to select a smaller subset for training
        - `is_regression`: if true then the label is supposed to be a numeric value and a regression model is trained
    - `train`: a dictionary of parameters passed to the algorithm

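The `length` and `every_nth_row` parameters can be understood as selecting a training subset roughly like this (a sketch of the semantics described above, not the generator's actual code):

```python
def select_train_subset(rows, length=0, every_nth_row=1, train_length=0):
    """Select training records: the latest `length` rows (falling back to
    `train_length`, or all rows if both are 0), then keep every n-th row."""
    n = length or train_length
    subset = rows[-n:] if n else rows   # Latest n records, or everything
    return subset[::every_nth_row]      # Thin out the subset for faster training
```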
For each entry in `train_feature_sets`, the number of generated features is equal to the number of labels. Each output feature is named according to the schema `label_name`, where `label` is a value from the list of labels and `name` is the name of the algorithm entry in its `name` attribute. In the above example, there will be two features and two output columns: `bot_2_mysvc` and `top_2_mysvc`. The same names are used for the trained models, which are stored as files.

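This naming schema can be expressed as a one-liner: one output column per (label, algorithm-entry) pair.

```python
def output_column_names(labels, function_names):
    """Each output column is named '{label}_{name}', where name comes from the
    'name' attribute of an entry in the 'functions' list."""
    return [f"{label}_{name}" for name in function_names for label in labels]

# output_column_names(["bot_2", "top_2"], ["mysvc"]) -> ["bot_2_mysvc", "top_2_mysvc"]
```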
## Custom trainable features

A new way to define trainable features is compatible with how normal [features](features.md) are defined. Such feature definitions are placed in the `feature_sets` or `train_features` sections of the configuration, but not in the `train_feature_sets` section.

It is necessary to provide a custom generator function, but this function has to be aware of two modes: train and predict. It has to check whether it runs in train mode via the global binary `train` attribute. If it is true, then it has to train its model using the available data and then perform feature evaluation using this newly trained model. If it runs in predict mode (the `train` attribute is false), then it re-uses the previously trained model.

Here is an example of how such a trainable feature is defined if we want to train a model which finds the average value, which will then be used for finding the deviation of current values from it:

```jsonc
{
  "generator": "myextensions.stats:deviation_feature",
  "config": {               // This is passed to the generator function
    "columns": "close",     // For which column to find the average
    "function": "mean",     // Other functions could be supported, like median
    "names": "deviation",   // Predicted column name
    "parameters": {}        // Whatever other parameters we might need
  }
}
```

|
This generator function can be implemented as follows:

```python
def deviation_feature(df, config: dict, global_config: dict, model_store: ModelStore):
    column_name = config.get('columns')
    function = config.get('function')
    names = config.get('names')  # Output feature name
    if not names:
        names = f"{column_name}_{function}"

    # Load the model (None if it has not been trained yet)
    model_name = config.get('model_name', f"{names}")
    model = model_store.get_model(model_name)

    # Determine if training is needed before prediction
    is_train = global_config.get('train')

    # In train mode, find the model parameters from the data and store them
    # in the model (to be used below for prediction)
    if is_train:
        mean = df[column_name].mean()             # Find the mean value
        model = dict(mean=mean)                   # Create the model as a dict
        model_store.put_model(model_name, model)  # Submit and store the model persistently

    # Now we have a model: either loaded from the store or just trained. Do normal evaluation
    if function == 'mean':
        out_column = df[column_name] - model.get("mean")

    df[names] = out_column

    return df, [names]
```

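To see both modes in action, the generator can be driven with an in-memory stand-in for the model store (hypothetical; the real `ModelStore` persists models to files). The generator is repeated here in condensed form so the example is self-contained, and pandas is assumed to be available:

```python
import pandas as pd

class DictModelStore:
    """In-memory stand-in for the project's ModelStore (which persists models to files)."""
    def __init__(self):
        self._models = {}
    def get_model(self, name):
        return self._models.get(name)
    def put_model(self, name, model):
        self._models[name] = model

# Condensed copy of the generator above, so this example runs on its own
def deviation_feature(df, config, global_config, model_store):
    column, names = config["columns"], config["names"]
    model_name = config.get("model_name", names)
    model = model_store.get_model(model_name)
    if global_config.get("train"):
        model = {"mean": df[column].mean()}
        model_store.put_model(model_name, model)
    df[names] = df[column] - model["mean"]
    return df, [names]

store = DictModelStore()
config = {"columns": "close", "function": "mean", "names": "deviation"}

# Train mode: the mean (20.0) is learned from the data and stored in the model store
df, _ = deviation_feature(pd.DataFrame({"close": [10.0, 20.0, 30.0]}), config, {"train": True}, store)

# Predict mode: the previously trained model is re-used on new data
df2, _ = deviation_feature(pd.DataFrame({"close": [25.0]}), config, {"train": False}, store)
```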
What makes this code specific is the train section, which is executed if train mode is detected. In train mode, the data in the specified column is used to find the mean value, which is the only parameter of the model. After that, the column is transformed by subtracting this mean value, and the transformed column is returned as a new generated feature.

If the feature is executed in predict mode, then it works just like other normal features, but its model (the mean value) will be loaded from the model store.