mql5/Experts/Advisors/DualEA/docs/Observability-Guide.md

532 lines
15 KiB
Markdown
Raw Permalink Normal View History

2025-10-03 01:38:19 -04:00
# Observability Guide — DualEA System
**Telemetry, monitoring, and debugging reference**
---
## Table of Contents
1. [Telemetry System](#telemetry-system)
2. [Event Types Reference](#event-types-reference)
3. [SystemMonitor](#systemmonitor)
4. [EventBus](#eventbus)
5. [Logging Levels](#logging-levels)
6. [Debugging Techniques](#debugging-techniques)
7. [Performance Monitoring](#performance-monitoring)
---
## Telemetry System
### Overview
DualEA uses **TelemetryStandard** wrapper for unified event logging across both PaperEA and LiveEA.
**Output Formats**:
- MT5 Journal (real-time)
- CSV files (Common/Files/DualEA/telemetry_YYYYMMDD.csv)
- JSON logs (Common/Files/DualEA/logs/trades_YYYYMMDD.json)
### Configuration
```cpp
input bool TelemetryEnable = true; // Enable telemetry
input bool TelemetryVerbose = false; // Verbose logging
input bool TelemetryStandard = true; // Use TelemetryStandard wrapper
input int TelemetryFlushSeconds = 300; // Flush every 5 minutes
```
### Event Structure
**CSV Format**:
```csv
timestamp,event_type,symbol,timeframe,strategy,gate,result,reason,latency_ms,metadata
2025-01-15T10:30:00Z,gate_decision,EURUSD,60,ADXStrategy,signal_rinse,pass,,5,confidence=0.85
2025-01-15T10:30:01Z,gate_decision,EURUSD,60,ADXStrategy,insights,block,below_winrate,2,winrate=0.42
```
**JSON Format** (UnifiedTradeLogger):
```json
{
"timestamp": "2025-01-15T10:30:00Z",
"event_type": "trade_execution",
"symbol": "EURUSD",
"timeframe": 60,
"strategy": "ADXStrategy",
"direction": 1,
"entry": 1.0850,
"sl": 1.0800,
"tp": 1.0950,
"lots": 0.10,
"gates": ["signal_rinse:pass", "market_soap:pass", ...],
"policy": {"confidence": 0.75, "sl_mult": 1.0},
"profit": 15.50,
"r_multiple": 1.2
}
```
---
## Event Types Reference
### Core Events
**gate_decision**
- **When**: Each gate pass/block decision
- **Fields**: `gate`, `result` (pass/block), `reason`, `latency_ms`
- **Example**: `gate_decision,EURUSD,60,ADXStrategy,market_soap,block,high_correlation,8,corr=0.82`
**trade_execution**
- **When**: Order placed (or paper position created)
- **Fields**: `direction`, `entry`, `lots`, `sl`, `tp`, `retcode` (LiveEA only)
- **Example**: `trade_execution,EURUSD,60,ADXStrategy,BUY,1.0850,0.10,1.0800,1.0950,retcode=10009`
**trade_closed**
- **When**: Position closed (SL/TP hit or manual close)
- **Fields**: `entry`, `exit`, `profit`, `r_multiple`, `duration_sec`, `reason`
- **Example**: `trade_closed,EURUSD,60,ADXStrategy,1.0850,1.0950,15.50,1.2,3600,tp_hit`
**policy_load**
- **When**: policy.json loaded or reloaded
- **Fields**: `slices`, `min_confidence`, `file_size`, `load_time_ms`
- **Example**: `policy_load,,,,,45,0.55,12345,25`
**insights_rebuild**
- **When**: insights.json regenerated
- **Fields**: `slices`, `source_trades`, `rebuild_time_sec`
- **Example**: `insights_rebuild,,,,,123,4567,45`
**threshold_adjust**
- **When**: GateLearningSystem adjusts gate thresholds
- **Fields**: `gate`, `old_threshold`, `new_threshold`, `success_rate`
- **Example**: `threshold_adjust,,,,,signal_rinse,0.60,0.58,0.72`
### Risk Events
**risk4_spread**
- **When**: Spread check (Gate 7 or early phase)
- **Fields**: `spread_pips`, `max_allowed`, `result`
- **Example**: `risk4_spread,EURUSD,60,,,,block,spread_too_wide,spread=4.2 max=3.0`
**risk4_margin**
- **When**: Margin level check
- **Fields**: `margin_level`, `min_required`, `result`
- **Example**: `risk4_margin,EURUSD,60,,,,block,margin_low,level=180% min=200%`
**risk4_drawdown**
- **When**: Drawdown check
- **Fields**: `current_dd_pct`, `max_allowed`, `result`
- **Example**: `risk4_drawdown,,,,,12.5,10.0,block`
**risk4_daily_loss**
- **When**: Daily loss circuit breaker
- **Fields**: `daily_pnl_pct`, `max_loss_pct`, `result`
- **Example**: `risk4_daily_loss,,,,,−5.2,−5.0,block`
**risk4_consecutive**
- **When**: Consecutive loss check
- **Fields**: `consecutive_losses`, `max_allowed`, `result`
- **Example**: `risk4_consecutive,,,,,5,5,block`
### Position Manager Events
**pm_corr_sizing**
- **When**: Correlation-adjusted sizing applied
- **Fields**: `avg_corr`, `base_lots`, `adjusted_lots`, `dampening`
- **Example**: `pm_corr_sizing,GBPUSD,60,,0.68,0.10,0.066,0.66`
**pm_corr_floor**
- **When**: Correlation floor applied
- **Fields**: `floor_type` (mult/lots), `floor_value`, `final_lots`
- **Example**: `pm_corr_floor,GBPUSD,60,,mult,0.25,0.025`
**pm_corr_scale_in**
- **When**: Scale-in with correlation adjustment
- **Fields**: Same as `pm_corr_sizing`
**pm_adaptive_sizing**
- **When**: Adaptive sizing based on performance
- **Fields**: `perf_factor`, `base_lots`, `adjusted_lots`
### Selector Events (Phase 5)
**p5_refresh**
- **When**: Strategy selector refreshed
- **Fields**: `strategies_count`, `refresh_time_ms`
**p5_rescore**
- **When**: Strategies rescored
- **Fields**: `strategy`, `old_score`, `new_score`, `reason`
**p5_auto_tune**
- **When**: Selector thresholds auto-tuned
- **Fields**: `threshold_type`, `old_value`, `new_value`
**p5_selector_cfg**
- **When**: Selector configuration changed
- **Fields**: `param`, `old_value`, `new_value`
### Exploration Events
**explore_allow**
- **When**: Exploration trade allowed
- **Fields**: `day_count`, `day_max`, `week_count`, `week_max`
- **Example**: `explore_allow,EURUSD,60,ADXStrategy,1,2,2,3`
**explore_block_day**
- **When**: Daily exploration cap reached
- **Fields**: Same as `explore_allow`
**explore_block_week**
- **When**: Weekly exploration cap reached
- **Fields**: Same as `explore_allow`
### Policy Fallback Events
**policy_fallback**
- **When**: Fallback mode triggered
- **Fields**: `fallback_type` (no_policy/slice_miss), `demo`, `allowed`
- **Example**: `policy_fallback,EURUSD,60,ADXStrategy,no_policy,true,true`
---
## SystemMonitor
### Purpose
Real-time health monitoring and performance tracking for the entire system.
### Initialization
```cpp
input bool MonitorEnable = true;
input int MonitorUpdateSeconds = 60;
input double MonitorHealthThreshold = 70.0;
input bool MonitorAlertOnDegradation = true;
CSystemMonitor* monitor = CSystemMonitor::GetInstance();
```
### Metrics Tracked
**Gate Success Rates**:
```cpp
double GetGateSuccessRate(int gateNum); // 0.0-1.0
```
- Tracks pass/fail ratio per gate over rolling window
- Target: configurable per gate (default 0.7-0.8)
**Processing Times**:
```cpp
double GetAvgGateLatency(int gateNum); // milliseconds
double GetAvgPipelineLatency(); // milliseconds
```
**System Health Score** (0-100):
```cpp
double GetSystemHealthScore();
```
Calculated from:
- Gate success rates vs targets
- Processing latencies vs thresholds
- Memory usage
- Error rates
- Event queue depth
**Degradation Detection**:
- Triggers alert if health score drops below `MonitorHealthThreshold`
- Logs degradation reason (which metric failed)
### Usage Example
```cpp
void OnTimer() {
if(MonitorEnable) {
double health = monitor.GetSystemHealthScore();
if(health < MonitorHealthThreshold) {
Print(StringFormat("[ALERT] System health degraded: %.1f < %.1f",
health, MonitorHealthThreshold));
telemetry.Event("system_alert", "health_degraded", health);
}
// Log metrics
for(int i = 1; i <= 8; i++) {
double successRate = monitor.GetGateSuccessRate(i);
double latency = monitor.GetAvgGateLatency(i);
Print(StringFormat("[MONITOR] Gate %d: SR=%.2f Latency=%.1fms",
i, successRate, latency));
}
}
}
```
---
## EventBus
### Purpose
Cross-component communication with priority-based event logging.
### Configuration
```cpp
input bool EventBusEnable = true;
input int EventBusPriority = 2; // 0=ALL, 1=LOW, 2=MED, 3=HIGH, 4=CRITICAL
input int EventBusMaxQueueSize = 1000;
```
### Priority Levels
- **0 (ALL)**: Debug/verbose events
- **1 (LOW)**: Informational events (gate passes, normal operations)
- **2 (MEDIUM)**: Warnings (gate blocks, threshold adjustments)
- **3 (HIGH)**: Errors (file I/O failures, policy load failures)
- **4 (CRITICAL)**: System failures (circuit breakers, broker errors)
### Event Emission
```cpp
CEventBus* eventBus = CEventBus::GetInstance();
// Emit event
eventBus.Emit(
EVENT_PRIORITY_MEDIUM, // Priority
"gate_decision", // Event type
symbol, // Symbol
timeframe, // Timeframe
"ADXStrategy", // Strategy
"insights", // Gate/component
"block", // Result
"below_winrate", // Reason
telemetry // Telemetry sink
);
```
### Event Queue
- Events queued if `EventBusEnable=true`
- Filtered by `EventBusPriority` (only events at or above priority level are logged)
- Flushed periodically or on demand
- Max queue size: `EventBusMaxQueueSize` (oldest events dropped if exceeded)
---
## Logging Levels
### Log Level Configuration
```cpp
input int LogLevel = 2; // 0=None, 1=Error, 2=Warning, 3=Info, 4=Debug
input bool LogToFile = true;
input bool LogToJournal = true;
```
### Level Definitions
**0 - None**: No logging (not recommended)
**1 - Error**: Only critical errors
- File I/O failures
- Broker retcode errors
- Parse errors (policy.json, insights.json)
**2 - Warning** (Default): Errors + warnings
- Gate blocks
- Policy fallbacks
- Circuit breaker trips
- Threshold adjustments
**3 - Info**: Errors + warnings + informational
- Trade executions
- Gate passes
- Policy loads
- Insights rebuilds
**4 - Debug**: All events including verbose details
- Every gate decision with full context
- Signal generation details
- File I/O operations
- Learning system updates
### Log Output
**MT5 Journal**:
```
2025.01.15 10:30:00.123 DualEA (EURUSD,H1) [GATE] signal_rinse: pass - confidence=0.85
2025.01.15 10:30:00.156 DualEA (EURUSD,H1) [GATE] insights: block - below_winrate (WR=0.42 < 0.50)
```
**File Logs** (if `LogToFile=true`):
- Path: `Common/Files/DualEA/logs/ea_YYYYMMDD.log`
- Rotation: Daily, keeps last 30 days
- Format: Same as Journal with timestamp prefix
---
## Debugging Techniques
### Debug Flags
Enable specific debug output:
```cpp
input bool DebugTrailing = false; // Trailing stop operations
input bool DebugGates = false; // Gate decisions
input bool DebugStrategies = false; // Strategy signals
input bool DebugPolicy = false; // Policy loading/application
input bool DebugInsights = false; // Insights loading
input bool DebugKB = false; // KB writes
```
### Shadow Mode (NoConstraintsMode)
Test gate changes without blocking trades:
```cpp
input bool NoConstraintsMode = true; // Enable shadow mode
input bool TelemetryVerbose = true; // Capture all shadow decisions
```
**Shadow Logging**:
```
[SHADOW] gate=insights result=block reason=below_winrate (but allowing due to NoConstraintsMode)
```
All gate decisions logged via `LogGatingShadow()` but trades proceed regardless.
### Telemetry Analysis
**Query gate block reasons**:
```powershell
Import-Csv telemetry_20250115.csv |
Where-Object {$_.event_type -eq 'gate_decision' -and $_.result -eq 'block'} |
Group-Object reason |
Select-Object Count, Name |
Sort-Object Count -Descending
```
**Analyze latencies**:
```powershell
Import-Csv telemetry_20250115.csv |
Where-Object {$_.event_type -eq 'gate_decision'} |
Measure-Object latency_ms -Average -Maximum
```
### Common Issues
**High gate block rate**:
1. Check `SystemMonitor.GetGateSuccessRate(gateNum)`
2. Review telemetry for most common `reason`
3. Adjust threshold or disable gate temporarily
4. Use shadow mode to observe impact
**Policy fallback always triggering**:
1. Check if policy.json exists: `Common/Files/DualEA/policy.json`
2. Verify `policyData.slices > 0`
3. Check slice coverage for your symbols/timeframes/strategies
4. Review `DebugPolicy=true` logs
**Exploration caps reached quickly**:
1. Check counters: `explore_counts.csv`, `explore_counts_day.csv`
2. Increase caps or reset counters (delete files)
3. Verify `NoConstraintsMode=false` (otherwise caps bypassed)
**KB writes failing**:
1. Enable `DebugKB=true`
2. Check Common Files path exists: `FolderCreate("DualEA", FILE_COMMON)`
3. Verify file permissions
4. Check disk space
---
## Performance Monitoring
### Latency Targets
**Per-Component**:
- Signal generation: <5ms
- Single gate: <10ms
- 8-stage gates total: <50ms
- Policy/insights lookup: <2ms (cached)
- KB write: <10ms
- Telemetry emit: <1ms
- **Total pipeline**: <100ms
**Monitoring**:
```cpp
double pipelineStart = GetTickCount();
// ... pipeline execution ...
double pipelineEnd = GetTickCount();
double latency = pipelineEnd - pipelineStart;
if(latency > 100) {
Print(StringFormat("[PERF] Pipeline slow: %.1fms", latency));
telemetry.Event("performance_alert", "pipeline_slow", latency);
}
```
### Memory Usage
**Tracking**:
```cpp
long memUsed = TerminalInfoInteger(TERMINAL_MEMORY_USED); // KB
int handles = IndicatorsTotal();
if(memUsed > 50000) { // 50MB
Print(StringFormat("[PERF] High memory: %d KB", memUsed));
}
```
**Expected Footprint**:
- Single EA instance: ~1.5MB
- With all strategies loaded: ~2MB
- High telemetry volume: ~5MB
### File I/O Monitoring
**Batch Writes**:
- KB writes: Buffer every 10 trades
- Features export: Buffer every 50 trades
- Telemetry flush: Every 5 minutes or 1000 events
**Monitoring**:
```cpp
int writeCount = 0;
double totalWriteTime = 0;
void WriteToKB(...) {
double start = GetTickCount();
// ... write operation ...
double end = GetTickCount();
writeCount++;
totalWriteTime += (end - start);
if(writeCount % 100 == 0) {
double avgTime = totalWriteTime / writeCount;
Print(StringFormat("[PERF] KB writes: %d total, %.2fms avg",
writeCount, avgTime));
}
}
```
---
## Telemetry Best Practices
1. **Use appropriate log levels**: Debug only during development, Info/Warning for production
2. **Enable TelemetryStandard**: Provides consistent formatting across EAs
3. **Flush regularly**: Don't let event queue grow unbounded
4. **Rotate logs**: Keep file sizes manageable (<100MB)
5. **Monitor health score**: Set up alerts for degradation
6. **Use shadow mode**: Test gate changes safely
7. **Archive old telemetry**: Compress and move to cold storage after 30 days
---
**See Also:**
- [Configuration-Reference.md](Configuration-Reference.md) - Telemetry input parameters
- [Execution-Pipeline.md](Execution-Pipeline.md) - When events are emitted
- [Policy-Exploration-Guide.md](Policy-Exploration-Guide.md) - Policy/exploration events