Provider Divergence Lab

A scientific instrument for measuring inter-provider disagreement on candlestick chart interpretation. Version 1.0 — 2026-06-12.

1. The Question

Retail traders routinely look at the "same" market across several brokers, charting platforms, and binary-option vendors and assume the candles are interchangeable. They are not. Different feeds aggregate ticks on slightly different boundaries, apply different tick-rounding, and sometimes use entirely different reference markets (e.g. PocketOption's synthetic forex). The resulting candle wicks, bodies, and pattern shapes diverge — often subtly, sometimes dramatically.

This lab answers one narrow, falsifiable question: given an LLM trained to read candlestick charts, how often do two different providers of the same underlying yield different verdicts?

2. Scope & Non-Goals

In scope: measuring direction, confidence, and pattern-label disagreement across N ≥ 2 simultaneous provider screenshots of the same symbol.
Out of scope: placing trades, recommending directions, predicting outcomes, calibrating confidence. The lab makes zero bets. That isolation is intentional — the binary exhaustion lab handles wager logic, and conflating the two would contaminate both.

3. Per-Tick Discrepancy Index

For a tick with n providers we form all C(n, 2) unordered pairs (i, j), i < j, and compute three separate metrics, each in [0, 1] when defined:

d_direction = (# pairs with different direction labels) / C(n, 2)

d_confidence = (1 / C(n, 2)) · Σ_{i<j} |conf_i − conf_j|   (confidences taken as normalized to [0, 1])

d_pattern = None if any provider lacks a pattern label; otherwise
            0 if all providers report the identical pattern label,
            1 if at least two labels differ

We deliberately do not collapse these into a single number. Direction disagreement is a fundamentally different signal from confidence disagreement. One provider can be very-strong-LONG while another is only barely-LONG: that gives d_direction = 0 with a large d_confidence, which means the LLM's confidence calibration is provider-dependent. That is a genuinely interesting finding, and averaging the metrics would erase it.
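
The three per-tick metrics can be sketched as follows. This is a minimal illustration rather than the code in divergence_lab_service.py: the Verdict record, its field names, and the function name discrepancy_index are hypothetical, and confidences are assumed to be normalized to [0, 1].

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Optional


@dataclass(frozen=True)
class Verdict:
    """One provider's reading for a single tick (hypothetical shape)."""
    direction: str            # e.g. "long", "short", "neutral"
    confidence: float         # assumed to lie in [0, 1]
    pattern: Optional[str]    # None when the provider reports no pattern label


def discrepancy_index(verdicts: list[Verdict]) -> dict[str, Optional[float]]:
    """Return d_direction, d_confidence and d_pattern for one tick."""
    if len(verdicts) < 2:
        raise ValueError("need at least two providers to form a pair")

    pairs = list(combinations(verdicts, 2))  # all C(n, 2) unordered pairs

    # Fraction of pairs whose direction labels differ.
    d_direction = sum(a.direction != b.direction for a, b in pairs) / len(pairs)

    # Mean absolute confidence gap over all pairs.
    d_confidence = sum(abs(a.confidence - b.confidence) for a, b in pairs) / len(pairs)

    # Pattern metric is undefined (None) as soon as one label is missing.
    labels = [v.pattern for v in verdicts]
    if any(label is None for label in labels):
        d_pattern = None
    else:
        d_pattern = 0.0 if len(set(labels)) == 1 else 1.0

    return {
        "d_direction": d_direction,
        "d_confidence": d_confidence,
        "d_pattern": d_pattern,
    }
```

Returning the three values side by side, instead of a weighted sum, keeps the case "same direction, very different confidence" visible in the output.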

4. Session-Cumulative κ

Raw agreement percent is dangerous. If both providers default to "neutral" 70% of the time, they already agree on at least 49% of ticks (0.7 × 0.7) by chance alone, even on pure noise. The standard correction is Cohen's κ (Cohen 1960) for two raters and Fleiss' κ (Fleiss 1971) for ≥ 3, both implemented in divergence_lab_service.py without external dependencies.

4.1 Cohen's κ

κ = (p_o − p_e) / (1 − p_e)

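where p_o is the observed agreement (the fraction of ticks on which both raters give the same label) and p_e is the chance agreement, assuming each rater's label is drawn independently from its own empirical distribution: p_e = Σ_k p_1k · p_2k, with p_1k and p_2k the share of label k for rater 1 and rater 2.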
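
A dependency-free sketch of this formula is shown below. The function name cohen_kappa and the choice to return None when p_e = 1 (both raters always emit the same single label, so κ is undefined) are assumptions made for illustration, not a description of the actual service code.

```python
from collections import Counter
from typing import Optional, Sequence


def cohen_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> Optional[float]:
    """Cohen's kappa for two raters labelling the same sequence of ticks."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters need the same, non-zero number of ticks")

    n = len(rater_a)

    # p_o: fraction of ticks where the two labels coincide.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # p_e: chance agreement from each rater's own empirical label distribution.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is 0/0.
        return None
    return (p_o - p_e) / (1 - p_e)
```

For example, with labels ("long", "long", "short", "neutral") against ("long", "short", "short", "neutral"), p_o = 3/4 and p_e = 5/16, so κ = 7/11 ≈ 0.64.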

4.2 Fleiss' κ

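An extension of chance-corrected agreement to n raters per subject. (Strictly, Fleiss' κ generalizes Scott's π rather than Cohen's κ, so the two statistics need not coincide at n = 2.) For each tick i we count n_ij, the number of the n raters that chose direction j:

P_i = (1 / (n(n−1))) · (Σ_j n_ij² − n)

P̄ = mean of P_i across all N ticks
p_j = (1 / (N·n)) · Σ_i n_ij   (overall share of direction j)
P̄_e = Σ_j p_j²
κ = (P̄ − P̄_e) / (1 − P̄_e)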
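
The computation above can be sketched without external dependencies as follows. The function name fleiss_kappa and the input layout (one list of n direction labels per tick) are hypothetical, and returning None when the chance term equals 1 is an assumption about how the degenerate case might be handled.

```python
from collections import Counter
from typing import Optional, Sequence


def fleiss_kappa(ticks: Sequence[Sequence[str]]) -> Optional[float]:
    """Fleiss' kappa; ticks[i] holds the n direction labels given for tick i."""
    if not ticks:
        raise ValueError("need at least one tick")
    n = len(ticks[0])
    if n < 2 or any(len(labels) != n for labels in ticks):
        raise ValueError("every tick needs the same number (>= 2) of raters")

    num_ticks = len(ticks)
    totals: Counter = Counter()   # sum over ticks of n_ij, per label j
    p_sum = 0.0                   # running sum of P_i

    for labels in ticks:
        counts = Counter(labels)  # n_ij for this tick
        totals.update(counts)
        p_sum += (sum(c * c for c in counts.values()) - n) / (n * (n - 1))

    p_bar = p_sum / num_ticks
    # p_j is the overall share of label j; the chance term is the sum of p_j^2.
    p_e = sum((c / (num_ticks * n)) ** 2 for c in totals.values())

    if p_e == 1.0:
        # Only one label was ever used; kappa is 0/0.
        return None
    return (p_bar - p_e) / (1 - p_e)
```

As a hand-checkable case, four ticks rated by three providers as (long, long, long), (long, short, long), (short, short, short), (neutral, neutral, long) give P̄ = 2/3 and P̄_e = 7/18, so κ = 5/11 ≈ 0.45.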

4.3 Interpretation (Landis & Koch 1977)

κ range	Interpretation
< 0.00	poor (worse than chance)
0.00 – 0.20	slight
0.21 – 0.40	fair
0.41 – 0.60	moderate
0.61 – 0.80	substantial
0.81 – 1.00	almost perfect
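
The table can be turned into a small lookup, sketched below. The function name landis_koch_label is hypothetical, and treating each upper bound as inclusive (so a value such as 0.205 falls into "fair") is an assumption, since the published bands leave gaps between 0.20 and 0.21 and so on.

```python
def landis_koch_label(kappa: float) -> str:
    """Map a kappa value to its Landis & Koch (1977) interpretation band."""
    if kappa != kappa or kappa > 1.0:   # NaN or out of range
        raise ValueError("kappa must be a number no greater than 1")
    if kappa < 0.0:
        return "poor"
    # Upper bounds are treated as inclusive (assumption, see lead-in).
    bands = (
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
    )
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"
```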

5. Synchronization Discipline

A measurement of disagreement is only meaningful if the providers were looking at the same moment in time. We enforce this with three hard rules:

One persistent <video> element per source, bound to its getDisplayMedia stream for the whole session.
Every frame arrival stamps lastFrameTs via requestVideoFrameCallback. A frame older than 250 ms is excluded from a synchronized tick.
The server ignores the client-supplied tick_ts_ms for research aggregation, using its own wall-clock instead. The client stamp is retained as client_tick_ts_ms for forensic audit.
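
The second and third rules can be modelled in a few lines. This is only a sketch of the logic: in the lab the freshness check runs in the browser against lastFrameTs, whereas here it is written in Python for consistency with the other examples. The function names fresh_sources and stamp_tick and the field server_tick_ts_ms are hypothetical; tick_ts_ms and client_tick_ts_ms are the names used above.

```python
import time
from typing import Optional

MAX_FRAME_AGE_MS = 250  # frames older than this are excluded from a tick


def fresh_sources(last_frame_ts: dict[str, float], now_ms: float,
                  max_age_ms: float = MAX_FRAME_AGE_MS) -> list[str]:
    """Return ids of sources whose latest frame is at most max_age_ms old."""
    return sorted(
        source for source, ts in last_frame_ts.items()
        if now_ms - ts <= max_age_ms
    )


def stamp_tick(payload: dict, server_now_ms: Optional[int] = None) -> dict:
    """Stamp a tick with the server wall clock, keeping the client stamp for audit."""
    stamped = dict(payload)  # do not mutate the caller's payload
    # Retain the client-supplied stamp under its forensic name.
    stamped["client_tick_ts_ms"] = stamped.pop("tick_ts_ms", None)
    # Research aggregation uses the server's own clock (hypothetical field name).
    stamped["server_tick_ts_ms"] = (
        int(time.time() * 1000) if server_now_ms is None else server_now_ms
    )
    return stamped
```

Passing server_now_ms explicitly is only there to make the function testable; in normal use it would be left unset.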

Known limitation: sources still cannot be guaranteed sub-frame-accurate without WebCodecs, and provider feed latency is itself provider-dependent. The Discrepancy Index therefore conflates "different candle painting" with "different feed latency". For this version we treat both as legitimate inputs to disagreement, since both affect what a real trader would have seen at that instant.

6. Sanity Test

To validate that the pipeline is wired correctly, point two browser tabs at the same chart on the same provider and capture both. After ~30 ticks we should observe:

d_direction ≈ 0
d_confidence ≈ 0 (modulo tiny LLM stochasticity)
Cohen's κ ≥ 0.9

If we instead measure significant divergence, the LLM call is non-deterministic enough to be the dominant source of "disagreement", and any across-provider result would be untrustworthy until the LLM is pinned (temperature 0, fixed seed, etc.).
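
The pass criteria above could be checked mechanically along these lines. The function name sanity_report is hypothetical, and the tolerance of 0.05 standing in for "≈ 0" is an assumed placeholder, since no numeric tolerance is specified here; only the κ ≥ 0.9 threshold comes from the list above. An undefined κ (None, for instance when both tabs emit one identical label throughout) is reported as failing so that a human inspects the run.

```python
from statistics import mean
from typing import Optional, Sequence


def sanity_report(d_direction: Sequence[float],
                  d_confidence: Sequence[float],
                  kappa: Optional[float],
                  tol: float = 0.05,          # assumed tolerance for "≈ 0"
                  kappa_min: float = 0.9) -> dict[str, bool]:
    """Compare session means of the per-tick metrics against the sanity criteria."""
    if not d_direction or not d_confidence:
        raise ValueError("need at least one tick")
    return {
        "direction_ok": mean(d_direction) <= tol,
        "confidence_ok": mean(d_confidence) <= tol,
        "kappa_ok": kappa is not None and kappa >= kappa_min,
    }
```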

7. References

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5), 378–382.
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174.