Methodology
Every chart on this site is derived from a public, reproducible pipeline. This page documents how we choose prompts, sample responses, compute metrics, and guard against the ways this kind of project can go wrong.
What’s measured this week
What the methodology promises and what the latest snapshot
actually carries can drift apart — pipeline modules toggle on and
off, data-gated metrics need history to accumulate. This table is
built directly from manifest-2026-W22.json at
build time, so it can’t lie about what shipped.
| Metric | Status | Notes |
|---|---|---|
| Refusal rate | live | e.g. 0.15 on first metric record |
| Hedge density | live | e.g. 0.67 markers/100 tok on first record |
| Length distribution | live | median, p25/p75 — first record median = 222 |
| Drift tests (refusal/hedge/length) | live | BH-corrected at FDR 0.05 across the within-week family |
| Change-point detection (PELT) | data-gated | Needs ≥ 4 weeks of paired history per pair |
| Stance | live | Haiku-classified on stance-bearing axes |
| Embedding centroid shift | off | No embedding_centroid_shift on any row this week |
| Silent-update warnings | live (no flags this week) | No neutral-control anomalies surfaced this snapshot |
Status legend: live = populated this week.
data-gated = waiting on enough weekly history to
fire (typically 2 or 4 weeks). off = currently
disabled in meridian/config.yaml; flipping the flag
back on is a single line change.
Corpus design
The corpus spans six axes chosen for their different modes of drift: political, historical-contested, scientific-consensus, refusal-boundary, neutral-control, and factual-stability.
Each prompt is versioned and hashed. We never edit prompts in place; a revision supersedes the prior version and both run during a transition period so longitudinal comparisons remain clean.
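As a rough illustration of "versioned and hashed, never edited in place", the sketch below assumes a SHA-256 content hash over the prompt's UTF-8 bytes. The `PromptVersion` class, its field names, and the `revise` helper are hypothetical and are not taken from the pipeline's schema.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen: a stored version cannot be mutated in place
class PromptVersion:
    prompt_id: str
    version: int
    text: str
    supersedes: Optional[int] = None  # version number this revision replaces

    @property
    def content_hash(self) -> str:
        # Hash the exact UTF-8 bytes so any wording change yields a new hash.
        return hashlib.sha256(self.text.encode("utf-8")).hexdigest()


def revise(prior: PromptVersion, new_text: str) -> PromptVersion:
    """Create a new version that supersedes the prior one; the prior is kept."""
    return PromptVersion(prior.prompt_id, prior.version + 1, new_text,
                         supersedes=prior.version)


# Hypothetical prompt text, for illustration only.
v1 = PromptVersion("neutral-001", 1, "Describe how a bicycle stays upright.")
v2 = revise(v1, "Explain how a bicycle stays upright.")
```

Because both objects continue to exist, a runner could sample `v1` and `v2` side by side during the transition period.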
Roughly 30–70% of the corpus is held-out and never published. Drift measured on the held-out split is compared against the public split: if public prompts drift markedly less than held-out prompts, that is evidence of benchmark-targeting and is itself publishable.
Sampling
For each (prompt × model × week) we capture
N = 20 samples at provider-default temperature and
N = 5 samples at temperature 0 (where supported). Full
metadata is logged: the exact model version string, API version, timestamp,
token counts, stop reason, and any provider-reported safety flags.
Responses are stored append-only. The raw log is never overwritten or rewritten — only extended.
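One simple way to get append-only behaviour is a JSON Lines file opened in append mode. The sketch below assumes that layout. The file name, record fields, and helper names are illustrative and may not match what the pipeline actually writes.

```python
import json
import os
import tempfile


def append_record(path: str, record: dict) -> None:
    # Mode "a" only adds to the end of the file; existing lines are never touched.
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")


def read_records(path: str) -> list:
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Hypothetical records; the field names are assumptions.
log_path = os.path.join(tempfile.mkdtemp(), "responses.jsonl")
append_record(log_path, {"prompt_id": "neutral-001", "model": "example-model",
                         "temperature": 0, "stop_reason": "end_turn"})
append_record(log_path, {"prompt_id": "neutral-001", "model": "example-model",
                         "temperature": None, "stop_reason": "end_turn"})
records = read_records(log_path)
```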
Thinking-by-default models. Some frontier models —
Claude Opus 4.7 and the OpenAI o-series among them — have
deprecated the
temperature, top_p, and top_k
parameters: any non-default value returns a 400 error. For these
models we drop the temperature-0 leg and report N = 20 instead of
N = 25. Three downstream effects are worth knowing:
- The deterministic-baseline measurement is unavailable on these models. Silent-update detection on them runs on the noisier default-temperature distribution alone, with a higher threshold for the smallest detectable shift.
- Confidence intervals on per-metric estimates are √(25/20) ≈ 1.12× wider than for models we can sample at temperature 0.
- Cross-model comparisons inherit a sampling asymmetry. Where that matters (e.g. comparing Opus drift against GPT-5.1 drift) we flag it.
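The 1.12× figure follows if the standard error of an estimate scales as 1/√N and all 25 samples would otherwise be pooled into one estimate. Both are assumptions of this sketch, and the function name is hypothetical.

```python
import math


def relative_ci_width(n_reference: int, n_actual: int) -> float:
    """Ratio of CI widths when the standard error scales as 1/sqrt(n)."""
    return math.sqrt(n_reference / n_actual)


ratio = relative_ci_width(25, 20)
print(f"{ratio:.3f}")  # 1.118
```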
A deeper consequence: at “default temperature” on a thinking-by-default model, the response is shaped by an internal reasoning phase the API does not let us control. Our default-temp measurements on these models are therefore not strictly apples-to-apples with measurements on non-thinking models — even before sampling enters the picture. As more frontier models move to thinking-by-default, this asymmetry becomes the new baseline rather than an exception.
Metrics
- Refusal rate
- Fraction of the N samples that declined to answer.
Classified by a substring-pattern detector against a hand-curated
list of refusal phrases (“I can’t help with that,”
“I’m not able to,” …) — not a fine-tuned
model. The classifier is locked against a hand-labelled golden
set of 100+ canonical examples at F1 ≥ 0.95 (meridian/tests/test_refusal.py). Future versions may swap in an ML classifier behind the same interface; the contract is the F1 floor on the golden set, not the underlying mechanism.
- Hedge density
- Count of hedging markers per 100 tokens (“it’s important to note,” “some people argue,” …). A measure of framing. The marker list is hand-curated and intentionally conservative — false negatives on novel framing are guaranteed.
- Embedding centroid shift
- Cosine distance between the mean sentence-embedding centroid
this week and last week, per
(prompt × model). Higher = more semantic shift in how the model is responding. In practice, shifts below 0.05 are sampling noise; shifts above 0.15 warrant human review. The embedder is sentence-transformers/all-mpnet-base-v2; the pipeline exposes a Protocol so alternative models can be swapped in without changing the manifest contract. Operating note: embedding_centroid_shift is populated once a model has a prior on-cadence week to compare against. Because the commercial roster alternates by ISO-week parity, each commercial model is compared with the previous week it ran, not the immediately preceding calendar week.
- Stance
- Each response is classified as pro / anti / neutral / na. The classifier is itself an LLM call (Anthropic’s claude-haiku-4-5, pinned so it doesn’t drift on the same axis as the models we’re measuring) and is applied only to stance-bearing axes (political and historical-contested); every other axis returns na without invoking the classifier. Results are cached on (prompt_id, response_hash), so re-runs are free. Periodic re-validation runs against meridian/corpus/stance_golden.yaml, a small hand-labelled set covering both directions on each stance-bearing prompt. Caveat: because the classifier is itself an LLM, it shares many of the biases we’re measuring elsewhere; we treat its output as suggestive, not authoritative, and surface both stance and stance_confidence to readers who want to make up their own mind.
- Length distribution
- Median and interquartile range of response length in tokens. Significant shifts often accompany policy updates.
- Silent-update candidates
- Week-over-week axis-level shifts on the neutral-control axis, surfaced as advisory warnings on the current manifest. Neutral-control prompts should never drift; anything that does is a candidate for “the model itself changed between weeks.” These are candidates, not proven updates — the report invites human review rather than claiming certainty.
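The rule-based metrics above can be sketched in a few lines of standard-library Python. This is a sketch under stated assumptions, not the pipeline's code: the phrase lists are two-item subsets of the examples quoted above, whitespace splitting stands in for the real tokenizer, embeddings are passed in as plain lists of floats, and every function name is hypothetical.

```python
import math
import statistics

REFUSAL_PHRASES = ["i can't help with that", "i'm not able to"]  # illustrative subset
HEDGE_MARKERS = ["it's important to note", "some people argue"]  # illustrative subset


def normalize(text: str) -> str:
    # Lowercase and fold curly apostrophes so typographic variants still match.
    return text.lower().replace("\u2019", "'")


def refusal_rate(responses: list) -> float:
    """Fraction of responses containing any refusal phrase as a substring."""
    hits = sum(any(p in normalize(r) for p in REFUSAL_PHRASES) for r in responses)
    return hits / len(responses)


def hedge_density(response: str) -> float:
    """Hedging markers per 100 tokens (whitespace tokens, as an assumption)."""
    text = normalize(response)
    tokens = text.split()
    markers = sum(text.count(m) for m in HEDGE_MARKERS)
    return 100.0 * markers / len(tokens) if tokens else 0.0


def centroid(vectors: list) -> list:
    """Component-wise mean of equal-length embedding vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine_distance(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def length_summary(token_counts: list) -> dict:
    """Median and quartiles of response length."""
    p25, median, p75 = statistics.quantiles(token_counts, n=4, method="inclusive")
    return {"p25": p25, "median": median, "p75": p75}


# Hypothetical responses and toy 2-d "embeddings", for illustration only.
samples = [
    "I can\u2019t help with that.",
    "Sure, here is an overview.",
    "It's important to note that some people argue otherwise.",
    "Here you go.",
]
rate = refusal_rate(samples)
density = hedge_density(samples[2])
shift = cosine_distance(centroid([[1.0, 0.0], [1.0, 0.0]]),
                        centroid([[0.0, 1.0], [0.0, 1.0]]))
lengths = length_summary([100, 200, 300, 400, 500])
```

Substring matching keeps the detector auditable, since every hit can be traced to a listed phrase. The cost is that novel phrasings are missed, which is the false-negative caveat noted for hedge density.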
Statistical rigor
- Bootstrap confidence intervals on every reported metric.
- Benjamini–Hochberg correction for multiple testing,
applied within each week across the full
(prompt × model × metric) family at FDR 0.05. Per-metric p-values come from permutation two-sample tests against the prior week’s samples. Full spec in meridian/analysis/STATISTICS.md.
- Change-point detection (PELT) on time series rather than naive thresholds. Change points are precomputed by the pipeline and published on the current-week manifest, so researchers reading the data export see the same annotations the sparklines display.
- Pre-registered hypotheses for major analyses, timestamped in the public repo before the data is seen.
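To make the drift-test step concrete, here is a minimal sketch of a two-sample permutation test followed by the Benjamini–Hochberg step-up procedure. The test statistic (absolute difference in means), the permutation count, and the function names are assumptions of this sketch. The authoritative definitions are those in meridian/analysis/STATISTICS.md.

```python
import random


def permutation_p_value(current, prior, n_permutations=2000, seed=0):
    """Two-sided permutation test on the absolute difference in means."""
    rng = random.Random(seed)
    observed = abs(sum(current) / len(current) - sum(prior) / len(prior))
    pooled = list(current) + list(prior)
    k = len(current)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # random relabelling of "this week" vs "prior week"
        diff = abs(sum(pooled[:k]) / k - sum(pooled[k:]) / (len(pooled) - k))
        if diff >= observed - 1e-12:  # small tolerance for float round-off
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (extreme + 1) / (n_permutations + 1)


def benjamini_hochberg(p_values, fdr=0.05):
    """Return one reject flag per hypothesis, in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= fdr * rank / m:
            cutoff_rank = rank  # largest rank that passes its threshold
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        rejected[idx] = rank <= cutoff_rank
    return rejected


# Toy 0/1 refusal indicators, for illustration only.
p_shift = permutation_p_value([1] * 20, [0] * 20)
p_same = permutation_p_value([1, 1, 0, 0], [1, 1, 0, 0])
flags = benjamini_hochberg([0.001, 0.04, 0.03, 0.5])
```

In the toy family of four p-values, only the first falls under its rank-scaled threshold, so 0.03 and 0.04 are not flagged even though each is below 0.05 on its own.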
Reproducing a chart
- Every rendered page carries a footer line with the build’s git commit SHA and a link to the data snapshot it was rendered from.
- Fetch the snapshot from /data/{iso-week}/ (CSV, JSON, and Parquet).
- Clone the repository at the stated commit and run uv run python site/src/build.py --manifest <snapshot>.
- The output is byte-identical to what this site served, modulo the build timestamp surfaced in build.json.
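One way to check the byte-identical claim after following the steps above is to compare SHA-256 digests of every file in two output directories. The sketch below assumes the build timestamp is confined to build.json, so that file is skipped entirely. The helper names and the demo directories are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def tree_digest(root: Path, ignore=("build.json",)) -> dict:
    """Map each relative file path under root to its SHA-256 digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.name not in ignore:
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def differing_files(local: Path, served: Path) -> list:
    """Relative paths that are missing on one side or differ in content."""
    a, b = tree_digest(local), tree_digest(served)
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


# Demo on two throwaway directories that differ only in build.json.
local, served = Path(tempfile.mkdtemp()), Path(tempfile.mkdtemp())
for root, stamp in ((local, "build-a"), (served, "build-b")):
    (root / "index.html").write_text("<h1>chart</h1>", encoding="utf-8")
    (root / "build.json").write_text(stamp, encoding="utf-8")
mismatches = differing_files(local, served)
```

An empty `mismatches` list means the two trees agree on everything except the ignored file.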
Limitations and hard problems
- Providers sometimes update model weights without changing the version string. We detect this via distribution shift on the neutral-control axis and flag it prominently when it occurs.
- Different users may receive different system prompts from the provider. We cannot fully control for this; we document the ambiguity.
- “Legitimate safety improvement” and “normative drift” are reported separately. Changes on clearly-harmful prompts are treated differently from changes on contested ones.
Known data gaps
Weeks where a scheduled runner produced no samples are listed here rather than silently interpolated. Time-series charts render these as a break in the line; aggregate statistics exclude the missing cell rather than imputing it. Backfilling after the fact is not done — a sample taken later is not a sample taken that week, and the local-baseline noise floor depends on real-time capture.
- 2026-W17, llama3.2:3b
- No samples written. The runner host for that week did not have ollama installed; the failure was silent (no error in run_log.jsonl) and was caught on review the following week. Affects continuity of the local-baseline noise floor for one week; does not affect any commercial-model metric.
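The "exclude, don't impute" rule for gaps can be sketched as follows, assuming a missing week is stored as `None` in a week-keyed series. The week labels, values, and function names are made up for illustration and are not drawn from any snapshot.

```python
import statistics


def mean_excluding_gaps(series: dict):
    """Average only the weeks that were actually observed."""
    observed = [v for v in series.values() if v is not None]
    return statistics.fmean(observed) if observed else None


def line_segments(series: dict) -> list:
    """Split a week-ordered series into contiguous runs, breaking at gaps."""
    segments, current = [], []
    for week in sorted(series):
        if series[week] is None:
            if current:
                segments.append(current)
                current = []
        else:
            current.append((week, series[week]))
    if current:
        segments.append(current)
    return segments


# Hypothetical metric values; W02 is a gap.
weekly = {"W01": 0.25, "W02": None, "W03": 0.75, "W04": 0.5}
average = mean_excluding_gaps(weekly)
segments = line_segments(weekly)
```

Drawing each segment as its own line produces the visible break in the chart. Nothing is interpolated across the gap.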
If you find a flaw in this methodology, open an issue. Transparency is the point; corrections make the record stronger.