Trainable features
Why to train?
A conventional feature produces an output column from input columns using manually specified parameters. For example, a feature computing the moving average of the close price has one parameter, which is the window size. This parameter is specified in the feature configuration and does not depend on anything else: its value has to be known in advance and is typically taken from experience. If we want to produce a feature equal to the deviation of the current price from the average price, then we need to know this average value. We can compute the average price manually by loading historic prices and then set it in the feature configuration. This configuration value will then be used during the evaluation process to compute the deviation.
In contrast to such conventional features, trainable features learn their parameters from historic data. The set of all such parameters is referred to as a model. A model can be as simple as one number or as complex as a deep neural network. What matters is only that it is found by analyzing historic data, and this process is referred to as training. Therefore, features which use trained models will be referred to as trainable features or ML-features.
One example of a trainable feature is one whose goal is to find the average price from historic data. Its model stores only one number, the average price value. Feature evaluation is based on this model and computes the deviation from the average value. In contrast to the previous example, where the average value had to be known in advance, now finding the average value is part of the whole procedure: the feature itself knows how to find its model parameters from historic data before it can use these parameters for generating its output.
Trainable features work in two modes by implementing two types of logic:
- How to find feature parameters by analyzing historic data and generating a model
- How to use feature parameters (model) to generate its output values (predictions)
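The two modes above can be sketched as a minimal pure-Python example (the class and method names are hypothetical, chosen only for illustration), where the model is a single number:

```python
class MeanDeviationFeature:
    """Minimal sketch of a trainable feature with two modes.

    The 'model' is a single number: the mean of the historic values.
    """

    def __init__(self):
        self.model = None  # Trained parameters (not yet known)

    def train(self, history):
        # Mode 1: find the feature parameters by analyzing historic data
        self.model = sum(history) / len(history)

    def predict(self, values):
        # Mode 2: use the model to generate output values (predictions)
        if self.model is None:
            raise RuntimeError("The model must be trained first")
        return [v - self.model for v in values]
```

For example, after training on the history `[10, 20, 30]`, the model is `20.0`, and predicting on `[25]` yields the deviation `[5.0]`.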
Advantages of trainable features:
- They allow for finding the best parameters rather than relying on intuition or (educated) guesses
- They can be regularly re-computed, so their values correspond to the latest history by following the drift
- They can learn much more complex dependencies than manually defined features
Drawbacks of trainable features:
- They need a significant amount of historic data, which is either not available or not large enough to train complex models (without overfitting)
- Manually defined features frequently represent the view of traders and domain experts. They are already compressed knowledge with a very informative representation. Such features can be rather difficult to learn automatically
Our approach combines both worlds: it provides a rich mechanism for defining manually parameterized features, and it makes it possible to define features which know how to learn their parameters automatically by applying statistical or machine learning algorithms.
Defining trainable features
Trainable features are represented separately from other features as a list where each item is a dictionary with one feature definition:
"train_feature_sets": [
{...}, // First trainable feature
{...}, // Second trainable feature
{...} // Third trainable feature
]
One feature definition is a dictionary with the following attributes:
{
  "generator": "train_features", // This generator is aware of the trainable character of the feature
  "columns": [], // Columns used to train and predict. If empty then the 'train_features' list is used
  "labels": ["bot_2", "top_2"], // Columns used as labels for training
  "functions": [
    {
      "name": "mysvc", "algo": "svc", // Arbitrary name and predefined algorithm type
      "params": {"is_scale": true, "length": 0}, // Preprocessing parameters
      "train": {"C": 1.0, "gamma": 0.005} // Algorithm arguments
    }
  ]
}
Here is the purpose of the attributes:
- generator: a built-in generator function which knows how to evaluate trainable features
- columns: a list of column names which will be selected and used for training. If the list is empty, the columns from train_features will be used. This attribute lists all input columns (without labels)
- labels: these column names will be used as true values during training. If the list is empty, the columns from labels will be used. This attribute lists all columns used as labels
- functions: a list of algorithm descriptions, each a dictionary with these attributes:
  - name: an arbitrary unique name of this algorithm entry. It will be used as a suffix in the predicted columns (along with the label name), so we know that a certain output column was generated by this algorithm (and some label)
  - algo: an algorithm type which resolves to a certain Python function. Currently these algorithm types exist:
    - svc - Support Vector Machine
    - nn - Neural Network
    - lc - Linear Classifier
    - gb - Gradient Boosting
  - params: parameters for the generator which are used to prepare the train data set:
    - is_scale: if true, all columns will be normalized
    - length: the number of records to use for training. If 0, then train_length is used. If that is empty, all available data will be used for training
    - every_nth_row: allows us to select a smaller subset for training
    - is_regression: if true, the label is supposed to be a numeric value and a regression model is trained
  - train: a dictionary of parameters passed to the algorithm
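The params attributes described above can be combined in a single functions entry. The following is a hypothetical fragment with illustrative values only (the field names come from the list above):

```json
{
  "name": "mygb", "algo": "gb", // Arbitrary entry name and gradient boosting algorithm
  "params": {
    "is_scale": false,      // Do not normalize the input columns
    "length": 10000,        // Use the last 10000 records for training
    "every_nth_row": 2,     // Train on every second record only
    "is_regression": false  // Labels are classes, not numeric values
  },
  "train": {}               // Arguments passed to the algorithm itself
}
```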
For each entry in train_feature_sets, the number of generated features is equal to the number of labels (one per label for each algorithm entry in functions).
Each output feature is named according to this schema: label_name, where label is a value from the list of labels and name is the value of the name attribute of the algorithm entry.
In the above example, there will be two features and two output columns: bot_2_mysvc and top_2_mysvc.
The same names are used for the trained models which are stored as files.
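The naming schema can be illustrated with a small snippet using the values from the example configuration above:

```python
labels = ["bot_2", "top_2"]
functions = [{"name": "mysvc", "algo": "svc"}]

# One output column per (algorithm entry, label) pair, named "<label>_<name>"
out_columns = [f"{label}_{fn['name']}" for fn in functions for label in labels]
print(out_columns)  # ['bot_2_mysvc', 'top_2_mysvc']
```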
Custom trainable features
A newer way to define trainable features is compatible with how normal features are defined.
Such feature definitions are placed in the feature_sets or train_features sections,
but not in the train_feature_sets section of the configuration.
It is necessary to provide a custom generator function, but this function has to be aware of two modes: train and predict.
It has to check whether it runs in train mode via the global binary train attribute. If it is true, then
it has to train its model using the available data and then perform feature evaluation using this newly trained model.
If it runs in predict mode (the train attribute is false), then it re-uses the previously trained model.
Here is an example of how such a trainable feature is defined if we want to train a model which finds an average value, which is then used for finding the deviation of current values from it.
{
  "generator": "myextensions.stats:deviation_feature",
  "config": { // This is passed to the generator function
    "columns": "close", // The column for which the average is found
    "function": "mean", // Other functions could be supported, like median
    "names": "deviation", // Predicted column name
    "parameters": {} // Whatever other parameters we might need
  }
}
This generator function can be implemented as follows:
def deviation_feature(df, config: dict, global_config: dict, model_store: ModelStore):
    column_name = config.get('columns')
    function = config.get('function')
    names = config.get('names')  # Output feature name
    if not names:
        names = f"{column_name}_{function}"

    # Load a previously trained model (None if no model has been stored yet)
    model_name = config.get('model_name', names)
    model = model_store.get_model(model_name)

    # Determine if training is needed before prediction
    is_train = global_config.get('train')

    # In train mode, find the parameters from the data and store them
    # in the model (to be used below for prediction)
    if is_train:
        mean = df[column_name].mean()  # Find the mean value
        model = dict(mean=mean)  # Create the model as a dict
        model_store.put_model(model_name, model)  # Store the model persistently

    # Now we have a model: either loaded from the store or newly trained.
    # Do normal feature evaluation
    if function == 'mean':
        out_column = df[column_name] - model.get("mean")
        df[names] = out_column

    return df, [names]
What makes this code specific is the training branch, which is executed if train mode is detected. In train mode, the data in the specified column is used to find the mean value, which is the only parameter of the model. After that, the column is transformed by subtracting this mean value, and the transformed column is returned as a new generated feature.
If the feature is executed in predict mode, it works just like any other normal feature, but its model (the mean value) is loaded from the model store.
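To make the train/predict cycle concrete, here is a hypothetical end-to-end run. DictModelStore is a minimal in-memory stand-in for the real ModelStore (which persists models as files), and the generator is a condensed version of the one shown above; all names other than get_model/put_model are illustrative assumptions.

```python
import pandas as pd


class DictModelStore:
    """Hypothetical in-memory stand-in for the real ModelStore."""

    def __init__(self):
        self._models = {}

    def get_model(self, name):
        return self._models.get(name)

    def put_model(self, name, model):
        self._models[name] = model


def deviation_feature(df, config, global_config, model_store):
    # Condensed version of the generator described in the text
    column_name = config.get('columns')
    names = config.get('names') or f"{column_name}_mean"
    model_name = config.get('model_name', names)
    if global_config.get('train'):
        # Train mode: compute the mean and store it as the model
        model = {'mean': df[column_name].mean()}
        model_store.put_model(model_name, model)
    else:
        # Predict mode: re-use the previously stored model
        model = model_store.get_model(model_name)
    df[names] = df[column_name] - model['mean']
    return df, [names]


store = DictModelStore()
config = {'columns': 'close', 'names': 'deviation'}

# Train mode: the mean (20.0) is computed, stored, and used immediately
train_df = pd.DataFrame({'close': [10.0, 20.0, 30.0]})
deviation_feature(train_df, config, {'train': True}, store)

# Predict mode: the stored mean is re-used on new data
new_df = pd.DataFrame({'close': [25.0]})
out_df, cols = deviation_feature(new_df, config, {'train': False}, store)
print(out_df['deviation'].tolist())  # [5.0]
```

Note how the same function serves both modes: only the global train flag decides whether the model is recomputed or loaded.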