Most scripts rely on the `App` class and its configuration parameters. They also load a configuration file with credentials and other parameters, which override those in `App.config`. More specific parameters, as well as parameters used for debugging, are defined in the code itself.
Purpose: Download raw data in the format used for both training and prediction. Data is downloaded in kline format, and two types are supported: spot and futures.
* Edit `main` in `binance-data.py` by setting the necessary symbol and function arguments
* The script can be started from a local folder and will store its result in this folder
* The script checks whether a previous file exists in the current folder and appends new data to it. If no file is found, a new file is created and all data is retrieved
* Currently we use 2 separate sources stored in 2 separate files:
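The append-if-file-exists logic can be sketched as below. This is an illustration of the idea only: `append_new_klines` is a hypothetical helper, and rows are simplified to lists whose first element is the kline open-time timestamp, assumed sorted.

```python
def append_new_klines(existing, fetched):
    """Append only the klines newer than the last stored row.

    If no previous data exists, all fetched rows are kept, mirroring
    the 'create new file and retrieve all data' branch."""
    if not existing:
        return list(fetched)
    last_ts = existing[-1][0]          # open time of the newest stored kline
    return existing + [row for row in fetched if row[0] > last_ts]
```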
* Merge historic data into one dataset. The data is analysed using one common time raster, with different column names for each source. We also fix problems with gaps by producing a uniform time raster. Note that columns from different source files have different history lengths, so shorter sources produce None values in the merged file
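The merge onto a uniform raster can be sketched like this. The sources are simplified to `{timestamp: value}` dicts and `merge_sources` is a hypothetical helper; the real merge works on full multi-column files.

```python
def merge_sources(klines, futures, start, end, step):
    """Merge two {timestamp: value} sources onto one uniform time raster.

    Timestamps missing from a source become None, so a source with a
    shorter history simply yields leading None values in the merged rows.
    Gaps inside a source are filled with None the same way."""
    raster = range(start, end + step, step)
    return [(t, klines.get(t), futures.get(t)) for t in raster]
```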
Here we compute derived features (also using window functions) and produce a feature matrix. We also compute target features (labels). Note that we compute many possible features and labels, but not all of them have to be used. In parameters, we define the past history length (windows) and the future horizon for labels. Currently, we generate 3 kinds of features independently: klines features (source 1), futures features (source 2), and label features (our possible prediction targets).
* The goal is to load source (kline) data, generate derived features and labels, and store the result in an output file. The output is supposed to be used by other procedures such as training prediction models.
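The window/horizon idea can be illustrated with one past-window feature and one future label. This is a toy sketch, not the project's feature set: `add_features_and_labels` is a hypothetical helper computing a moving average (window into the past) and a binary up/down label (horizon into the future).

```python
def add_features_and_labels(closes, window, horizon):
    """For each row, compute one derived feature (moving average over the
    past `window` rows) and one label (did the price rise within `horizon`
    rows?). Rows without enough history or future are left as None."""
    n = len(closes)
    features, labels = [None] * n, [None] * n
    for i in range(n):
        if i + 1 >= window:
            features[i] = sum(closes[i + 1 - window : i + 1]) / window
        if i + horizon < n:
            labels[i] = 1 if closes[i + horizon] > closes[i] else 0
    return features, labels
```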
Purpose: Train models using the latest data (feature matrix) so that these models can be used for prediction in the service
Notes:
* There can be many predicted features and models, for example, for spot and futures markets, based on different prediction algorithms, or with different historic horizons
* The procedure consumes the feature matrix, hence the preceding steps have to be up to date: download source data, merge files, generate features (no need to generate rolling features)
* The generated models have to be copied to the folder where the signal/trade server expects to find them
Purpose: Simulate the train-predict step by moving in time as if we were doing it in a real service, that is, add a new data batch, use the available data to train a new model, use this model for predictions, save these predictions, add a new data batch, and so on
Generate rolling predictions. Here we train a model on previous data relatively infrequently, say, once per day or week, but use much more history than in typical window-based features. We then apply this one constant model to predict values for future times until it is re-trained on the newest data. (If the re-train frequency equals the sample rate, that is, we re-train for each new row, then we get a normal window-based derived feature with a large window size.) Each feature is based on some algorithm with some hyper-parameters and some history length. This procedure does not choose the best hyper-parameters; for that purpose we need another procedure, which compares the predicted values with the real ones. Normally, the target values of these features are directly related to what we really want to predict, that is, to some label. The output of this procedure is the same file (feature matrix) with additional predicted features (scores). This file will, however, be much shorter, because some features need quite a long history (say, 1 year). Note that to apply rolling predictions, we have to know the hyper-parameters, which can be found by a simpler procedure.
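The walk-forward loop described above can be sketched generically. `rolling_predict` and its callbacks are hypothetical names; setting `retrain_every=1` recovers the degenerate case mentioned above where the feature becomes an ordinary window-based one.

```python
def rolling_predict(series, train_fn, predict_fn, min_history, retrain_every):
    """Walk forward in time: re-fit the model every `retrain_every` steps
    on all data seen so far, then apply the constant model until the next
    re-fit. Returns {row index: prediction} for rows after `min_history`,
    which is why the output history is shorter than the input."""
    model, preds = None, {}
    for i in range(min_history, len(series)):
        if model is None or (i - min_history) % retrain_every == 0:
            model = train_fn(series[:i])          # infrequent re-training
        preds[i] = predict_fn(model, series[i])   # constant model in between
    return preds
```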
The input is a feature matrix with all scores (predicted features). Our goal is to define a feature whose output will be directly used for buy/sell decisions. We need to search for the best hyper-parameters, starting from a simple score threshold and ending with some data mining algorithm.
* We assume that rolling prediction produces many highly informative features
* The grid search (brute force) of this step has to test our trading strategy using backtesting as the (direct) metric. In other words, trading performance on historic data is our metric for brute force search or simple ML
* Normally the result is some thresholds or a simple ML model
* Important: The results of this step are consumed by the production service to generate signals
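The simplest variant, a score-threshold search with backtesting as the direct metric, can be sketched as below. The one-bar long-only backtest and the helper names are illustrative assumptions, not the project's strategy.

```python
def backtest(scores, returns, buy_threshold):
    """Toy backtest: be long in a bar whenever its score exceeds the
    threshold; performance is the sum of captured returns."""
    return sum(r for s, r in zip(scores, returns) if s >= buy_threshold)

def best_threshold(scores, returns, candidates):
    """Brute-force search over candidate thresholds, using trading
    performance on historic data as the direct metric."""
    return max(candidates, key=lambda t: backtest(scores, returns, t))
```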
## (Grid) search for best hyper-parameters and/or best prediction models
Execute: `python -m scripts.grid_search`
Purpose: This is a conventional procedure for hyper-parameter optimization, searching the space of hyper-parameters. It could be replaced by any other optimization procedure. The best hyper-parameters found are then copied to the scripts where they are used, such as model training or rolling predictions.
Notes:
* The results are consumed by the rolling prediction step
* There can be many algorithms, many historic horizons, and many input feature sets
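The exhaustive search over a parameter grid can be sketched with the standard library. This is a generic illustration of what `scripts.grid_search` does conceptually; the `grid_search` helper and the `evaluate` callback are hypothetical names.

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Exhaustive search over the cartesian product of hyper-parameter
    values; `evaluate` maps a parameter dict to a score (higher is better).
    Returns the best parameter combination and its score."""
    keys = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```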