Scripts
Batch scripts are intended for analyzing data in files as opposed to incremental analysis of data in-memory. One script has one purpose. It typically loads data from one input file, processes the data, and stores the result in an output file. Having multiple scripts simplifies the development of complex data processing pipelines consisting of several steps.
Scripts serve the following major purposes:
- Data retrieval (data collectors) and merging of different data sources
- Evaluation of features and producing new derived columns
- Training machine learning models
- Executing output adapters
- Simulation and backtesting
File names can be specified in these attributes of the configuration:
merge_file_name: "data.csv",
feature_file_name: "features.csv",
matrix_file_name: "matrix.csv",
predict_file_name: "predictions.csv",
signal_file_name: "signals.csv",
signal_models_file_name: "signal_models",
The file extension is used to determine the format: CSV or Parquet.
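The extension rule can be sketched as follows. This is only an illustration: `detect_format` is a hypothetical helper, not a function of the project. The source does not say how names without an extension are handled, so the sketch simply rejects them.

```python
from pathlib import Path


def detect_format(file_name: str) -> str:
    """Return "csv" or "parquet" depending on the file extension.

    Hypothetical helper for illustration only; the real scripts may
    resolve the format differently.
    """
    suffix = Path(file_name).suffix.lower()
    if suffix == ".csv":
        return "csv"
    if suffix == ".parquet":
        return "parquet"
    # Assumption: anything else (including no extension) is an error here
    raise ValueError(f"Cannot determine format from extension: {suffix!r}")


print(detect_format("data.csv"))          # csv
print(detect_format("features.parquet"))  # parquet
```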
Download and merge
The download script retrieves data from the data sources listed in the data_sources section:
python -m scripts.download -c config.json
The quote name (also used as the subfolder) and the file name ("klines" by default) are specified in the entries of the data source. If a source file already exists, then the script will retrieve only the latest missing data and append these rows to the file. Otherwise, all available data will be retrieved and stored in a new file.
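The append-only update described above might look roughly like this. It is a minimal sketch and not the project's download code: `update_rows` and the `fetch(since)` callback are hypothetical names, and rows are assumed to be dictionaries with a sortable "timestamp" key.

```python
from typing import Callable, Optional

Row = dict  # one record with at least a "timestamp" key (assumption)


def update_rows(existing: list, fetch: Callable[[Optional[int]], list]) -> list:
    """Return the existing rows plus any rows newer than the last stored one.

    `fetch(since)` is a hypothetical data-source callback. It returns rows
    starting at `since`, or all available rows when `since` is None.
    """
    last = existing[-1]["timestamp"] if existing else None
    new_rows = fetch(last)
    if last is not None:
        # Drop rows that are already stored, e.g. a repeated boundary row
        new_rows = [row for row in new_rows if row["timestamp"] > last]
    return existing + new_rows


# Toy data source with five rows
source = [{"timestamp": t, "close": 100 + t} for t in range(5)]


def fake_fetch(since):
    return [row for row in source if since is None or row["timestamp"] >= since]


full = update_rows([], fake_fetch)               # no file yet: everything is fetched
appended = update_rows(source[:3], fake_fetch)   # file exists: only rows 3 and 4 are added
```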
The merge script is used to combine all source files specified in the data_sources section and store the result in one output file:
python -m scripts.merge -c config.json
The output file name and format are specified in the merge_file_name attribute.
The file is stored in the symbol subfolder.
Columns will get a prefix (if specified in the data source entry) and all rows will be aligned according to the data frequency.
Accordingly, gaps can appear in case of missing data in some data source.
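To illustrate how prefixes and alignment lead to gaps, here is a small standard-library sketch. It is not the merge script itself: `merge_sources` and its data layout are assumptions made for the example, and the real script may represent missing values differently.

```python
from datetime import datetime, timedelta


def merge_sources(sources, start, end, freq):
    """Align several sources on a regular time grid (illustrative only).

    `sources` maps a column prefix to {timestamp: {column: value}}.
    A timestamp missing in a source yields None in its columns,
    which is how gaps appear in the merged table.
    """
    # Collect the prefixed column names in a stable order
    columns = []
    for prefix, rows in sources.items():
        names = sorted({name for row in rows.values() for name in row})
        columns += [(prefix, name) for name in names]

    merged = []
    ts = start
    while ts <= end:
        record = {"timestamp": ts}
        for prefix, name in columns:
            record[f"{prefix}{name}"] = sources[prefix].get(ts, {}).get(name)
        merged.append(record)
        ts += freq
    return merged


day = timedelta(days=1)
t0 = datetime(2024, 1, 1)
sources = {
    "btc_": {t0: {"close": 1.0}, t0 + day: {"close": 2.0}, t0 + 2 * day: {"close": 3.0}},
    "eth_": {t0: {"close": 10.0}, t0 + 2 * day: {"close": 30.0}},  # second day missing
}
table = merge_sources(sources, t0, t0 + 2 * day, day)
```

In this toy input the second source has no row for the second day, so `eth_close` is empty in that row while `btc_close` is filled.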
Feature engineering
These scripts are intended for applying feature definitions and generating new columns. Currently, features can be defined in several ways, and hence there are several groups of feature definitions. There is one script for each such group of features.
Features
Features defined in the feature_sets section are evaluated using the features script:
python -m scripts.features -c config.json
It reads data from its input file specified in the merge_file_name parameter ("data" by default),
generates features defined in the feature_sets section, and stores the result in the output file
specified in the feature_file_name parameter ("features" by default).
Labels
Features defined in the label_sets section are evaluated using the labels script:
python -m scripts.labels -c config.json
It reads its input data from the output of the features script.
The result table with appended label columns is stored in the matrix_file_name file ("matrix" by default).
Predict
The predict script evaluates features defined in the train_feature_sets section:
python -m scripts.predict -c config.json
It evaluates these trainable features in predict mode by using existing models (previously trained by the train script).
The data is loaded from the output of the labels script.
The result is stored in the predict_file_name file ("predictions" by default).
Signals
This script evaluates features defined in the signal_sets section:
python -m scripts.signals -c config.json
Input data is loaded from the predict_file_name file (output of the predict script).
The result is stored in the signal_file_name file.
Train machine learning models
The train script runs feature definitions from the train_feature_sets section in train mode:
python -m scripts.train -c config.json
Running in train mode means that these features use the data to find optimal parameters of the features (models) but do not apply these parameters.
The results of the script are model files stored in the MODELS subfolder.
These model files will be used by the predict script to generate feature output columns.
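The split between train mode and predict mode can be pictured with a toy example. Everything here is an assumption made for illustration: the least-squares line stands in for a real model, and the JSON model file and function names are hypothetical, not the project's storage format or API.

```python
import json
from pathlib import Path


def train(values, labels):
    """Train mode: find the parameters only; no output column is produced.

    Toy model (assumption): a least-squares line y = a * x + b.
    """
    n = len(values)
    mean_x = sum(values) / n
    mean_y = sum(labels) / n
    var_x = sum((x - mean_x) ** 2 for x in values)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(values, labels))
    a = cov / var_x
    return {"a": a, "b": mean_y - a * mean_x}


def save_model(model, folder, name):
    """Store the parameters as a model file (JSON is an assumption)."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{name}.json"
    path.write_text(json.dumps(model))
    return path


def predict(model_path, values):
    """Predict mode: load the stored parameters and apply them to the data."""
    model = json.loads(Path(model_path).read_text())
    return [model["a"] * x + model["b"] for x in values]
```

A train run would call `train` and `save_model`; a later predict run would only call `predict` with the stored file.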
Outputs
The output script runs the definitions from the output_sets section:
python -m scripts.output -c config.json
These definitions do not produce new features but rather use the existing data to send notifications, produce visualizations, execute trades, etc.
Other scripts
Generate rolling predictions
In real trading, we use machine learning models trained on previous data in order to predict future behavior.
The predict_rolling script is intended for repeating this sequence:
- train models using available historic data
- apply these models to predict future behavior
- append a (small) new data chunk and repeat the train-predict analysis
The result is a table with predictions generated in small chunks using models trained on the previous data (which also gradually grows). Thus we simulate the train-predict steps by moving in time as if we were doing it in a real context. For example, for daily data, we could use monthly data chunks in order to keep our models up-to-date. The script will start from some initial date, say, 1 year ago, and then do 12 iterations by training and predicting for each new month. The result has 1 year of predictions, but these predictions are made in a "fair" manner, close to a real environment.
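The train-predict sequence above can be sketched as a simple walk-forward loop. This is not the predict_rolling implementation: the function name is hypothetical, and the "model" is just the mean of the training rows so that the example stays self-contained.

```python
def rolling_predict(values, prediction_start, prediction_size, prediction_steps):
    """Walk forward in time: train on all rows before a chunk, then predict the chunk.

    Toy model (assumption): the prediction is the mean of the training rows.
    `prediction_start` must be at least 1 so that the first training set is not empty.
    """
    predictions = []
    for step in range(prediction_steps):
        chunk_start = prediction_start + step * prediction_size
        chunk_end = min(chunk_start + prediction_size, len(values))
        if chunk_start >= chunk_end:
            break  # no more data to predict
        train_rows = values[:chunk_start]          # grows with every iteration
        model = sum(train_rows) / len(train_rows)  # "train"
        predictions += [model] * (chunk_end - chunk_start)  # "predict"
    return predictions


result = rolling_predict([1, 2, 3, 4, 5, 6], prediction_start=2, prediction_size=2, prediction_steps=2)
print(result)  # [1.5, 1.5, 2.5, 2.5]
```

Each chunk is predicted only with a model that has seen the rows before it, which is what makes the resulting predictions comparable to a real deployment.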
The script is executed like other scripts:
python -m scripts.predict_rolling -c config.json
It loads data from the matrix_file_name file ("matrix" by default).
The result is stored in the predict_file_name file ("predictions" by default).
The parameters for iterations are specified in the rolling_predict section:
- data_start: (default 0) Truncate input data from start
- data_end: (default null) Truncate input data from end
- prediction_start: First row for starting iterations (end of the first training data), for example, "2020-01-02"
- prediction_size: New data chunk size in rows and the size of each prediction
- prediction_steps: How many iterations to perform, for example, 12 if we want to do 12 train-predict iterations. If None or 0, then iterate from prediction start till the data end
- use_multiprocessing: Whether to use multiprocessing (true or false)
- max_workers: How many workers to use in case of multiprocessing
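One possible reading of how these parameters turn into iteration boundaries is sketched below. The function `plan_iterations` is hypothetical, and several details are assumptions rather than documented behavior: data_start and data_end are treated like Python slice bounds, timestamps are sorted ISO date strings, and the returned indexes refer to the truncated data.

```python
import math
from bisect import bisect_left


def plan_iterations(timestamps, config):
    """Return (train_end, predict_end) row-index pairs, one per iteration.

    Illustrative only. `timestamps` is a sorted list of ISO date strings.
    """
    # Assumption: data_start/data_end behave like slice bounds (None = no limit)
    rows = timestamps[config.get("data_start", 0):config.get("data_end")]
    # The first row at or after prediction_start ends the first training set
    first = bisect_left(rows, config["prediction_start"])
    size = config["prediction_size"]
    steps = config.get("prediction_steps")
    if not steps:  # None or 0: iterate from prediction start till the data end
        steps = math.ceil((len(rows) - first) / size)
    plan = []
    for i in range(steps):
        train_end = first + i * size
        if train_end >= len(rows):
            break
        plan.append((train_end, min(train_end + size, len(rows))))
    return plan


dates = ["2020-01-%02d" % d for d in range(1, 11)]  # ten daily rows
config = {"prediction_start": "2020-01-05", "prediction_size": 3, "prediction_steps": None}
plan = plan_iterations(dates, config)
print(plan)  # [(4, 7), (7, 10)]
```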
Simulation and backtesting
Data analysis in ITB is based on generating new features, including features based on machine learning algorithms. This allows us to know more about future behavior, such as the probability that the price will be 2% higher during the next week. However, this knowledge by itself cannot be used directly for trading and says almost nothing about trade performance. The main reason is that there can be many possible trade strategies based on the knowledge of future behavior. These trade strategies produce buy and sell signals depending on the current situation on the market as well as the current state of the available assets. The simulation script is intended for filling this gap by evaluating trade performance based on the chosen trade strategy and its parameters.
The simulation (backtesting) parameters are specified in the simulate_model section of the configuration file. The script is executed as follows:
python -m scripts.simulate -c config.json
The script consumes predictions from the predict_file_name file.
This file includes all derived features and should be produced by the predict_rolling script.
Note that the script does not measure performance for a single set of parameters. Rather, it evaluates a (large) set of parameters with the goal of finding the best-performing trade model.
The simulate_model section describes a set of trade models and has the following attributes:
- data_start: Start simulation from this date, for example, "1998-01-05"
- data_end: End simulation with this date. If empty, then all data will be used
- direction: "long" or "short"
- topn_to_store: How many best-performing trade models (their parameters) to store in the output file
- signal_generator: Name of the generator which produces the column used to make buy/sell decisions
- buy_sell_equal: true or false. Whether the threshold parameters for buy and sell signals are symmetric
- grid: Dictionary with lists. Each combination of values from these lists is considered one model. Attributes are lists or are evaluated to lists if given as strings
- buy_signal_threshold: A list of values interpreted as buy thresholds to test, like [0.1, 0.2, 0.3]. If it is a string, then it is evaluated as Python code and the result is expected to be a list of values, for example, "np.arange(0.0, 0.5, 0.1).tolist()"
- sell_signal_threshold: Same as the previous parameter but used to find sell signals
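The expansion of a grid into individual trade models can be sketched as follows, assuming the threshold lists live inside grid. The helper names are hypothetical and this is not the simulate script's code. String values are evaluated with `eval`, which runs arbitrary code, so this only makes sense for trusted configuration files. To evaluate an expression such as the np.arange example, a namespace like `{"np": numpy}` would have to be passed in; the example below uses a plain Python expression to stay dependency-free.

```python
from itertools import product


def as_list(value, namespace=None):
    """Return `value` as a list; strings are evaluated as Python expressions."""
    if isinstance(value, str):
        # eval() executes arbitrary code: use only with trusted configuration
        value = eval(value, dict(namespace or {}))
    return list(value)


def expand_grid(grid, namespace=None):
    """Return one parameter dictionary per combination of the grid lists."""
    names = list(grid)
    lists = [as_list(grid[name], namespace) for name in names]
    return [dict(zip(names, combo)) for combo in product(*lists)]


grid = {
    "buy_signal_threshold": [0.1, 0.2, 0.3],
    "sell_signal_threshold": "[-x / 10 for x in range(1, 3)]",  # string evaluated to a list
}
models = expand_grid(grid)
print(len(models))  # 6
```

Each dictionary in `models` is one candidate trade model that a simulation would then score, keeping the best ones.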