A scientific instrument for measuring inter-provider disagreement on candlestick chart interpretation. Version 1.0 — 2026-06-12.
Retail traders routinely look at the "same" market across several brokers, charting platforms, and binary-option vendors and assume the candles are interchangeable. They are not. Different feeds aggregate ticks on slightly different boundaries, apply different tick-rounding, and sometimes use entirely different reference markets (e.g. PocketOption's synthetic forex). The resulting candle wicks, bodies, and pattern shapes diverge — often subtly, sometimes dramatically.
This lab answers one narrow, falsifiable question: given an LLM trained to read candlestick charts, how often do two different providers of the same underlying yield different verdicts?
For a tick with n providers we form all C(n, 2) unordered pairs and compute three independent metrics in [0, 1]:
We deliberately do not collapse these into a single number. Direction-disagreement is a fundamentally different signal from confidence-disagreement. One provider can read strongly LONG while another reads only weakly LONG: that gives d_dir = 0 but a large d_conf, which means the LLM's confidence calibration is provider-dependent, a genuinely interesting finding that averaging would erase.
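As a rough illustration of the pairwise step, the sketch below forms all C(n, 2) unordered provider pairs for one tick and keeps the metrics separate. It covers only the two metrics named in the text (d_dir and d_conf); their exact definitions are not given here, so the 0/1 direction mismatch and the absolute confidence gap are assumptions, as are the function name `pairwise_divergence` and the provider labels.

```python
from itertools import combinations


def pairwise_divergence(verdicts):
    """Return one row per unordered provider pair for a single tick.

    verdicts maps provider name -> (direction, confidence in [0, 1]).
    ASSUMPTION: d_dir is a 0/1 mismatch and d_conf is |conf_a - conf_b|;
    the lab's real definitions may differ.
    """
    rows = []
    for a, b in combinations(sorted(verdicts), 2):
        dir_a, conf_a = verdicts[a]
        dir_b, conf_b = verdicts[b]
        rows.append({
            "pair": (a, b),
            "d_dir": 0.0 if dir_a == dir_b else 1.0,
            "d_conf": abs(conf_a - conf_b),
        })
    return rows


# Hypothetical tick with three providers, so C(3, 2) = 3 pairs.
tick = {
    "provider_a": ("long", 0.95),
    "provider_b": ("long", 0.55),
    "provider_c": ("short", 0.60),
}
rows = pairwise_divergence(tick)
```

The first pair here agrees on direction yet differs in confidence, which is exactly the case a single averaged score would hide.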
Raw agreement percent is dangerous. If both providers default to
"neutral" 70% of the time, they look 70% agreed even on pure noise. The
standard correction is Cohen's κ (Cohen 1960) for two raters and Fleiss'
κ (Fleiss 1971) for ≥ 3, both implemented in divergence_lab_service.py
without external dependencies.
Cohen's κ is defined as κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the chance agreement, assuming each rater's label is drawn independently from its own empirical distribution.
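A minimal dependency-free sketch of Cohen's κ for two providers follows. It is not the code in divergence_lab_service.py, which is not shown here; the function name `cohen_kappa` and the convention of returning 1.0 when chance agreement is already 1 are assumptions made for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same sequence of ticks."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of ticks where both raters gave the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's own empirical label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        # ASSUMPTION: both raters always give the same single label;
        # kappa is undefined, and this sketch reports 1.0 by convention.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical verdicts from two providers over ten ticks.
a = ["long", "long", "neutral", "neutral", "short",
     "neutral", "long", "neutral", "neutral", "short"]
b = ["long", "neutral", "neutral", "neutral", "short",
     "neutral", "long", "neutral", "long", "short"]
kappa_ab = cohen_kappa(a, b)
```

In this example raw agreement is 0.80, but because both providers say "neutral" half the time, chance agreement is 0.38 and κ comes out noticeably lower than the raw figure.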
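Fleiss' κ generalizes Cohen's κ to n raters per subject. For each tick i we count how many of the n raters chose each direction j, written n_ij. The per-tick agreement is P_i = (Σ_j n_ij² − n) / (n(n − 1)), the overall share of direction j across N ticks is p_j = Σ_i n_ij / (N n), and κ = (P̄ − P̄_e) / (1 − P̄_e), where P̄ is the mean of the P_i and P̄_e = Σ_j p_j².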
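The counting step can be sketched as follows, using the standard Fleiss formulation on a matrix of per-tick counts. The function name `fleiss_kappa`, the input layout (one row per tick, one column per direction), and the handling of the degenerate case are assumptions for illustration, not the lab's actual implementation.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa from counts[i][j] = raters choosing direction j on tick i.

    Every row must sum to the same number of raters n (n >= 2).
    """
    if not counts:
        raise ValueError("need at least one tick")
    n = sum(counts[0])
    if n < 2 or any(sum(row) != n for row in counts):
        raise ValueError("every tick needs the same number (>= 2) of raters")
    num_ticks = len(counts)
    num_dirs = len(counts[0])
    # Per-tick agreement P_i: share of rater pairs that agree on tick i.
    p_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    p_bar = sum(p_i) / num_ticks
    # Overall share of each direction across all ticks and raters.
    p_j = [sum(row[j] for row in counts) / (num_ticks * n)
           for j in range(num_dirs)]
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        # ASSUMPTION: every rater chose the same direction on every tick;
        # kappa is undefined, and this sketch reports 1.0 by convention.
        return 1.0
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical data: 3 providers, 4 ticks, columns = (long, short, neutral).
counts = [
    [3, 0, 0],
    [0, 3, 0],
    [2, 1, 0],
    [1, 1, 1],
]
kappa_all = fleiss_kappa(counts)
```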
| κ range | Interpretation |
|---|---|
| < 0.00 | poor (worse than chance) |
| 0.00 – 0.20 | slight |
| 0.21 – 0.40 | fair |
| 0.41 – 0.60 | moderate |
| 0.61 – 0.80 | substantial |
| 0.81 – 1.00 | almost perfect |
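For reporting, the table can be turned into a small lookup. This is only a sketch: the helper name `interpret_kappa` is hypothetical, and treating each band's upper bound as inclusive (so that values such as 0.205 fall into the next band) is an assumption, since the table leaves those gaps unspecified.

```python
def interpret_kappa(kappa):
    """Map a kappa value to the verbal label from the table above."""
    if kappa < 0.0:
        return "poor (worse than chance)"
    # ASSUMPTION: upper bounds are inclusive; the table does not say.
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")


label = interpret_kappa(0.68)
```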
A measurement of disagreement is only meaningful if the providers were looking at the same moment in time. We enforce this with three hard rules:

1. One `<video>` element per source, bound to its `getDisplayMedia` stream for the whole session.
2. Each source tracks `lastFrameTs` via `requestVideoFrameCallback`. A frame older than 250 ms is excluded from a synchronized tick.
3. The server does not use the client's `tick_ts_ms` for research aggregation, using its own wall-clock instead. The client stamp is retained as `client_tick_ts_ms` for forensic audit.

To validate that the pipeline is wired correctly, point two browser tabs at the same chart on the same provider and capture both. After ~30 ticks we should observe:

- `d_direction ≈ 0`
- `d_confidence ≈ 0` (modulo tiny LLM stochasticity)

If we instead measure significant divergence, the LLM call is non-deterministic enough to be the dominant source of "disagreement", and any across-provider result would be untrustworthy until the LLM is pinned (temperature 0, fixed seed, etc.).
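The self-test above could be automated along these lines. The text does not define what counts as "significant" divergence, so the 0.05 tolerances, the function name `self_test_passes`, and the row layout are all assumptions chosen for illustration.

```python
from statistics import mean


def self_test_passes(rows, min_ticks=30, max_d_direction=0.05,
                     max_d_confidence=0.05):
    """Check a same-provider, same-chart capture for near-zero divergence.

    rows is a list of dicts with keys "d_direction" and "d_confidence",
    one per synchronized tick. ASSUMPTION: the tolerances are placeholders,
    not thresholds taken from the lab.
    """
    if len(rows) < min_ticks:
        raise ValueError(f"need at least {min_ticks} ticks, got {len(rows)}")
    mean_dir = mean(r["d_direction"] for r in rows)
    mean_conf = mean(r["d_confidence"] for r in rows)
    return mean_dir <= max_d_direction and mean_conf <= max_d_confidence


# Hypothetical captures: a healthy run and one dominated by LLM noise.
healthy = [{"d_direction": 0.0, "d_confidence": 0.01} for _ in range(30)]
noisy = ([{"d_direction": 1.0, "d_confidence": 0.30} for _ in range(10)]
         + [{"d_direction": 0.0, "d_confidence": 0.02} for _ in range(20)])
healthy_ok = self_test_passes(healthy)
noisy_ok = self_test_passes(noisy)
```

A failing run is a signal to pin the LLM before trusting any across-provider numbers, as described above.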