15 KiB
Observability Guide — DualEA System
Telemetry, monitoring, and debugging reference
Table of Contents
- Telemetry System
- Event Types Reference
- SystemMonitor
- EventBus
- Logging Levels
- Debugging Techniques
- Performance Monitoring
Telemetry System
Overview
DualEA uses TelemetryStandard wrapper for unified event logging across both PaperEA and LiveEA.
Output Formats:
- MT5 Journal (real-time)
- CSV files (Common/Files/DualEA/telemetry_YYYYMMDD.csv)
- JSON logs (Common/Files/DualEA/logs/trades_YYYYMMDD.json)
Configuration
input bool TelemetryEnable = true; // Enable telemetry
input bool TelemetryVerbose = false; // Verbose logging
input bool TelemetryStandard = true; // Use TelemetryStandard wrapper
input int TelemetryFlushSeconds = 300; // Flush every 5 minutes
Event Structure
CSV Format:
timestamp,event_type,symbol,timeframe,strategy,gate,result,reason,latency_ms,metadata
2025-01-15T10:30:00Z,gate_decision,EURUSD,60,ADXStrategy,signal_rinse,pass,,5,confidence=0.85
2025-01-15T10:30:01Z,gate_decision,EURUSD,60,ADXStrategy,insights,block,below_winrate,2,winrate=0.42
JSON Format (UnifiedTradeLogger):
{
"timestamp": "2025-01-15T10:30:00Z",
"event_type": "trade_execution",
"symbol": "EURUSD",
"timeframe": 60,
"strategy": "ADXStrategy",
"direction": 1,
"entry": 1.0850,
"sl": 1.0800,
"tp": 1.0950,
"lots": 0.10,
"gates": ["signal_rinse:pass", "market_soap:pass", ...],
"policy": {"confidence": 0.75, "sl_mult": 1.0},
"profit": 15.50,
"r_multiple": 1.2
}
Event Types Reference
Core Events
gate_decision
- When: Each gate pass/block decision
- Fields:
gate,result(pass/block),reason,latency_ms - Example:
gate_decision,EURUSD,60,ADXStrategy,market_soap,block,high_correlation,8,corr=0.82
trade_execution
- When: Order placed (or paper position created)
- Fields:
direction,entry,lots,sl,tp,retcode(LiveEA only) - Example:
trade_execution,EURUSD,60,ADXStrategy,BUY,1.0850,0.10,1.0800,1.0950,retcode=10009
trade_closed
- When: Position closed (SL/TP hit or manual close)
- Fields:
entry,exit,profit,r_multiple,duration_sec,reason - Example:
trade_closed,EURUSD,60,ADXStrategy,1.0850,1.0950,15.50,1.2,3600,tp_hit
policy_load
- When: policy.json loaded or reloaded
- Fields:
slices,min_confidence,file_size,load_time_ms - Example:
policy_load,,,,,45,0.55,12345,25
insights_rebuild
- When: insights.json regenerated
- Fields:
slices,source_trades,rebuild_time_sec - Example:
insights_rebuild,,,,,123,4567,45
threshold_adjust
- When: GateLearningSystem adjusts gate thresholds
- Fields:
gate,old_threshold,new_threshold,success_rate - Example:
threshold_adjust,,,,,signal_rinse,0.60,0.58,0.72
Risk Events
risk4_spread
- When: Spread check (Gate 7 or early phase)
- Fields:
spread_pips,max_allowed,result - Example:
risk4_spread,EURUSD,60,,,,block,spread_too_wide,spread=4.2 max=3.0
risk4_margin
- When: Margin level check
- Fields:
margin_level,min_required,result - Example:
risk4_margin,EURUSD,60,,,,block,margin_low,level=180% min=200%
risk4_drawdown
- When: Drawdown check
- Fields:
current_dd_pct,max_allowed,result - Example:
risk4_drawdown,,,,,12.5,10.0,block
risk4_daily_loss
- When: Daily loss circuit breaker
- Fields:
daily_pnl_pct,max_loss_pct,result - Example:
risk4_daily_loss,,,,,−5.2,−5.0,block
risk4_consecutive
- When: Consecutive loss check
- Fields:
consecutive_losses,max_allowed,result - Example:
risk4_consecutive,,,,,5,5,block
Position Manager Events
pm_corr_sizing
- When: Correlation-adjusted sizing applied
- Fields:
avg_corr,base_lots,adjusted_lots,dampening - Example:
pm_corr_sizing,GBPUSD,60,,0.68,0.10,0.066,0.66
pm_corr_floor
- When: Correlation floor applied
- Fields:
floor_type(mult/lots),floor_value,final_lots - Example:
pm_corr_floor,GBPUSD,60,,mult,0.25,0.025
pm_corr_scale_in
- When: Scale-in with correlation adjustment
- Fields: Same as
pm_corr_sizing
pm_adaptive_sizing
- When: Adaptive sizing based on performance
- Fields:
perf_factor,base_lots,adjusted_lots
Selector Events (Phase 5)
p5_refresh
- When: Strategy selector refreshed
- Fields:
strategies_count,refresh_time_ms
p5_rescore
- When: Strategies rescored
- Fields:
strategy,old_score,new_score,reason
p5_auto_tune
- When: Selector thresholds auto-tuned
- Fields:
threshold_type,old_value,new_value
p5_selector_cfg
- When: Selector configuration changed
- Fields:
param,old_value,new_value
Exploration Events
explore_allow
- When: Exploration trade allowed
- Fields:
day_count,day_max,week_count,week_max - Example:
explore_allow,EURUSD,60,ADXStrategy,1,2,2,3
explore_block_day
- When: Daily exploration cap reached
- Fields: Same as
explore_allow
explore_block_week
- When: Weekly exploration cap reached
- Fields: Same as
explore_allow
Policy Fallback Events
policy_fallback
- When: Fallback mode triggered
- Fields:
fallback_type(no_policy/slice_miss),demo,allowed - Example:
policy_fallback,EURUSD,60,ADXStrategy,no_policy,true,true
SystemMonitor
Purpose
Real-time health monitoring and performance tracking for the entire system.
Initialization
input bool MonitorEnable = true;
input int MonitorUpdateSeconds = 60;
input double MonitorHealthThreshold = 70.0;
input bool MonitorAlertOnDegradation = true;
CSystemMonitor* monitor = CSystemMonitor::GetInstance();
Metrics Tracked
Gate Success Rates:
double GetGateSuccessRate(int gateNum); // 0.0-1.0
- Tracks pass/fail ratio per gate over rolling window
- Target: configurable per gate (default 0.7-0.8)
Processing Times:
double GetAvgGateLatency(int gateNum); // milliseconds
double GetAvgPipelineLatency(); // milliseconds
System Health Score (0-100):
double GetSystemHealthScore();
Calculated from:
- Gate success rates vs targets
- Processing latencies vs thresholds
- Memory usage
- Error rates
- Event queue depth
Degradation Detection:
- Triggers alert if health score drops below
MonitorHealthThreshold - Logs degradation reason (which metric failed)
Usage Example
void OnTimer() {
if(MonitorEnable) {
double health = monitor.GetSystemHealthScore();
if(health < MonitorHealthThreshold) {
Print(StringFormat("[ALERT] System health degraded: %.1f < %.1f",
health, MonitorHealthThreshold));
telemetry.Event("system_alert", "health_degraded", health);
}
// Log metrics
for(int i = 1; i <= 8; i++) {
double successRate = monitor.GetGateSuccessRate(i);
double latency = monitor.GetAvgGateLatency(i);
Print(StringFormat("[MONITOR] Gate %d: SR=%.2f Latency=%.1fms",
i, successRate, latency));
}
}
}
EventBus
Purpose
Cross-component communication with priority-based event logging.
Configuration
input bool EventBusEnable = true;
input int EventBusPriority = 2; // 0=ALL, 1=LOW, 2=MED, 3=HIGH, 4=CRITICAL
input int EventBusMaxQueueSize = 1000;
Priority Levels
- 0 (ALL): Debug/verbose events
- 1 (LOW): Informational events (gate passes, normal operations)
- 2 (MEDIUM): Warnings (gate blocks, threshold adjustments)
- 3 (HIGH): Errors (file I/O failures, policy load failures)
- 4 (CRITICAL): System failures (circuit breakers, broker errors)
Event Emission
CEventBus* eventBus = CEventBus::GetInstance();
// Emit event
eventBus.Emit(
EVENT_PRIORITY_MEDIUM, // Priority
"gate_decision", // Event type
symbol, // Symbol
timeframe, // Timeframe
"ADXStrategy", // Strategy
"insights", // Gate/component
"block", // Result
"below_winrate", // Reason
telemetry // Telemetry sink
);
Event Queue
- Events queued if
EventBusEnable=true - Filtered by
EventBusPriority(only events at or above priority level are logged) - Flushed periodically or on demand
- Max queue size:
EventBusMaxQueueSize(oldest events dropped if exceeded)
Logging Levels
Log Level Configuration
input int LogLevel = 2; // 0=None, 1=Error, 2=Warning, 3=Info, 4=Debug
input bool LogToFile = true;
input bool LogToJournal = true;
Level Definitions
0 - None: No logging (not recommended)
1 - Error: Only critical errors
- File I/O failures
- Broker retcode errors
- Parse errors (policy.json, insights.json)
2 - Warning (Default): Errors + warnings
- Gate blocks
- Policy fallbacks
- Circuit breaker trips
- Threshold adjustments
3 - Info: Errors + warnings + informational
- Trade executions
- Gate passes
- Policy loads
- Insights rebuilds
4 - Debug: All events including verbose details
- Every gate decision with full context
- Signal generation details
- File I/O operations
- Learning system updates
Log Output
MT5 Journal:
2025.01.15 10:30:00.123 DualEA (EURUSD,H1) [GATE] signal_rinse: pass - confidence=0.85
2025.01.15 10:30:00.156 DualEA (EURUSD,H1) [GATE] insights: block - below_winrate (WR=0.42 < 0.50)
File Logs (if LogToFile=true):
- Path:
Common/Files/DualEA/logs/ea_YYYYMMDD.log - Rotation: Daily, keeps last 30 days
- Format: Same as Journal with timestamp prefix
Debugging Techniques
Debug Flags
Enable specific debug output:
input bool DebugTrailing = false; // Trailing stop operations
input bool DebugGates = false; // Gate decisions
input bool DebugStrategies = false; // Strategy signals
input bool DebugPolicy = false; // Policy loading/application
input bool DebugInsights = false; // Insights loading
input bool DebugKB = false; // KB writes
Shadow Mode (NoConstraintsMode)
Test gate changes without blocking trades:
input bool NoConstraintsMode = true; // Enable shadow mode
input bool TelemetryVerbose = true; // Capture all shadow decisions
Shadow Logging:
[SHADOW] gate=insights result=block reason=below_winrate (but allowing due to NoConstraintsMode)
All gate decisions logged via LogGatingShadow() but trades proceed regardless.
Telemetry Analysis
Query gate block reasons:
Import-Csv telemetry_20250115.csv |
Where-Object {$_.event_type -eq 'gate_decision' -and $_.result -eq 'block'} |
Group-Object reason |
Select-Object Count, Name |
Sort-Object Count -Descending
Analyze latencies:
Import-Csv telemetry_20250115.csv |
Where-Object {$_.event_type -eq 'gate_decision'} |
Measure-Object latency_ms -Average -Maximum
Common Issues
High gate block rate:
- Check
SystemMonitor.GetGateSuccessRate(gateNum) - Review telemetry for most common
reason - Adjust threshold or disable gate temporarily
- Use shadow mode to observe impact
Policy fallback always triggering:
- Check if policy.json exists:
Common/Files/DualEA/policy.json - Verify
policyData.slices > 0 - Check slice coverage for your symbols/timeframes/strategies
- Review
DebugPolicy=truelogs
Exploration caps reached quickly:
- Check counters:
explore_counts.csv,explore_counts_day.csv - Increase caps or reset counters (delete files)
- Verify
NoConstraintsMode=false(otherwise caps bypassed)
KB writes failing:
- Enable
DebugKB=true - Check Common Files path exists:
FolderCreate("DualEA", FILE_COMMON) - Verify file permissions
- Check disk space
Performance Monitoring
Latency Targets
Per-Component:
- Signal generation: <5ms
- Single gate: <10ms
- 8-stage gates total: <50ms
- Policy/insights lookup: <2ms (cached)
- KB write: <10ms
- Telemetry emit: <1ms
- Total pipeline: <100ms
Monitoring:
double pipelineStart = GetTickCount();
// ... pipeline execution ...
double pipelineEnd = GetTickCount();
double latency = pipelineEnd - pipelineStart;
if(latency > 100) {
Print(StringFormat("[PERF] Pipeline slow: %.1fms", latency));
telemetry.Event("performance_alert", "pipeline_slow", latency);
}
Memory Usage
Tracking:
long memUsed = TerminalInfoInteger(TERMINAL_MEMORY_USED); // KB
int handles = IndicatorsTotal();
if(memUsed > 50000) { // 50MB
Print(StringFormat("[PERF] High memory: %d KB", memUsed));
}
Expected Footprint:
- Single EA instance: ~1.5MB
- With all strategies loaded: ~2MB
- High telemetry volume: ~5MB
File I/O Monitoring
Batch Writes:
- KB writes: Buffer every 10 trades
- Features export: Buffer every 50 trades
- Telemetry flush: Every 5 minutes or 1000 events
Monitoring:
int writeCount = 0;
double totalWriteTime = 0;
void WriteToKB(...) {
double start = GetTickCount();
// ... write operation ...
double end = GetTickCount();
writeCount++;
totalWriteTime += (end - start);
if(writeCount % 100 == 0) {
double avgTime = totalWriteTime / writeCount;
Print(StringFormat("[PERF] KB writes: %d total, %.2fms avg",
writeCount, avgTime));
}
}
Telemetry Best Practices
- Use appropriate log levels: Debug only during development, Info/Warning for production
- Enable TelemetryStandard: Provides consistent formatting across EAs
- Flush regularly: Don't let event queue grow unbounded
- Rotate logs: Keep file sizes manageable (<100MB)
- Monitor health score: Set up alerts for degradation
- Use shadow mode: Test gate changes safely
- Archive old telemetry: Compress and move to cold storage after 30 days
See Also:
- Configuration-Reference.md - Telemetry input parameters
- Execution-Pipeline.md - When events are emitted
- Policy-Exploration-Guide.md - Policy/exploration events