intelligent-trading-bot/docs/scripts.md
2026-04-20 18:59:14 +02:00

8.7 KiB

Scripts

Batch scripts are intended for analyzing data in files as opposed to incremental analysis of data in-memory. One script has one purpose. It typically loads data from one input file, processes the data, and stores the result in an output file. Having multiple scripts simplifies the development of complex data processing pipelines consisting of several steps.

There are the following major purposes of scripts:

  • Data retrieval (data collectors) and merge of different data sources
  • Evaluation of features and producing new derived columns
  • Training machine learning models
  • Executing output adapters
  • Simulation and backtesting

File names can be specified in these attributes of the configuration:

  • merge_file_name: "data.csv",
  • feature_file_name: "features.csv",
  • matrix_file_name: "matrix.csv",
  • predict_file_name: "predictions.csv",
  • signal_file_name: "signals.csv",
  • signal_models_file_name: "signal_models", The file extension is used to determine the format: CSV or parquet.

Download and merge

The download script retrievs data the data sources listed in the data_sources section:

python -m scripts.download -c config.json

Quote name (also subfolder) and file name ("klines" by default) are specified in the entries of the data source. If a source file already exists, then the script will retrieve only latest missing data and append these rows to the file. Otherwise, all available data will be retrieved and stored in a new file.

The merge script is used to combine all source files specified in the data_sources section and store the result in one output file:

python -m scripts.merge -c config.json

The output file name and format is specified in the merge_file_name attribute. The file is stored in the symbol subfolder. Columns will get a prefix (if specified in the data source entry) and all rows will be aligned according to the data frequency. Accordintly, gaps can appear in case of missing data in some data source.

Feature engineering

These scripts are intended for applying feature definitions and generating new columns. Currently features can be defined in several ways and hence there are several groups of feature definitions. There is one script for each such group of features.

Features

Features defined in the feature_sets section are evaluated using the features script:

python -m scripts.features -c config.json

It reads data from its input file specified in the feature_sets parameter ("data" by default), generates features defined in the feature_sets section, and stores the result in the output file specified in the feature_file_name parameter ("features" by default).

Labels

Features defined in the label_sets section are evaluated using the labels script:

python -m scripts.labels -c config.json

It reads its input data from the output of the features script. The result table with appended label columns is stored in the matrix_file_name file ("matrix" by default).

Predict

The predict script evaluates features defined in the train_feature_sets section:

python -m scripts.predict -c config.json

It evaluates these trainable features in predict mode by using existing models (previously trained by the train script). The data is loaded from the output of the labels script. The result is stored in the predict_file_name file ("predictions" by default).

Signals

This script evaluates features defined in the signal_sets section:

python -m scripts.signals -c config.json

Input data is loaded from the predict_file_name file (output of the predict script). The result is stored in the signal_file_name file.

Train machine learning models

The train script runs feature definitions from train_feature_sets section in train mode

python -m scripts.train -c config.json

Running in train mode means that these features use the data to find optimal parameters of the features (models) but do not apply these parameters. The results of the script are model files stored in the MODELS subfoler. These model files will be used by the predict script to generate feature output columns.

Outputs

The output script runs the definitions from the output_sets section:

python -m scripts.output -c config.json

They do not produce new features but rather use the existing data to send notifications, visualizations, do trding etc.

Other scripts

Generate rolling predictions

In real trading, we use machine learning models trained on previous data in order to predict future behavior. The predict_rolling is intended for repeating this sequence:

  • train models using available historic data
  • apply these models to predict future behavior
  • append (small) new data chunk and repeat the train-predict analysis The result is a table with predictions which are generated in small chunks using the models trained on the previous data (which also gradueally grows). Thus we simulate the train-predict steps by moving in time as if we ware doing it in real context. For example, for daily data, we could use monthly data chunks in order to keep our models up-to-date. The script will start from some initial date, say, 1 year ago, and then do 12 iterations by training and predicting for each new month. The result has 1 year of predictions but these predictions are done in a "fair" manner closely to real environment.

The script is executed like other scripts:

python -m scripts.predict_rolling -c config.json

It loads data from the matrix_file_name file ("matrix" by default). The result is stored in the predict_file_name file ("predictions" by default).

The parameters for iterations are specified in the rolling_predict section:

  • data_start (default 0) Trancate input data from start
  • data_end (default null) Trancate input data from end
  • prediction_start First row for starting iterations (end of the first training data), for example, "2020-01-02"
  • prediction_size New data chunk size in rows and the size of each prediction
  • prediction_steps How many iterations to perform, for example, 12 if we want to do 12 train-predict iterations. If None or 0, then from prediction start till the data end
  • use_multiprocessing Whether to use multiprocessing (true or false)
  • max_workers How many workers in case of multiprocessing

Simulation and backtesting

Data analysis in ITB is based on generating new features including features based on machine learning algorithms. This allows us to know more about the future behavior like probablity that the price will be 2% higher during the next week. However, this knowledge by itself cannot be used directly for trading and says almost nothing about trade performance. The main reason is that there can be many possible trade strategies based on the knowledge of future behavior. These trade strategies produce buy and sell signals depending on the current situation on the market as well as the current state of the available assets. The simulation script is intended for filling this gap by evaluating trade performance based on the chosen trade strategy and its parameters.

The simulation (backtesting) parameters are specified in the simulate_model section of the configuration file:

python -m scripts.simulate -c config.json

The script consumes predictions from the predict_file_name file. This file includes all derived features and should be produced by the predict_rolling script.

Note that the script does not find performance for one set of parameters. Rather, it evaluates a (large) set of parameters with the goal to find best performing trade model.

The simulate_model section describes a set of trade models and has the following attributes:

  • data_start Start simulation from this date, for example, "1998-01-05"
  • data_end End simulation with this date. If empty then all data will be used
  • direction "long" or "short"
  • topn_to_store How many best performing trade models (their parameters) to store in the output file
  • signal_generator: Name of the generator which produces the column used to make buy/sell decisions
  • buy_sell_equal true or false. Whether the threshold parameters for buy and sell signals are symmetric
  • grid dictionary with lists. Each combination of values from these lists is considered one model // Attributes are ists or evaluated to lists if string
    • buy_signal_threshold It is a list of values interpreted as buy thresholds to test like [0.1, 0.2, 0.3] If it is a string then it is evaluated as Python code and the result is expected to be a list of values, for example, "np.arange(0.0, 0.5, 0.1).tolist()"
    • sell_signal_threshold Same as the previous parameter but used to find sell signals