Methodology

Every chart on this site is derived from a public, reproducible pipeline. This page documents how we choose prompts, sample responses, compute metrics, and guard against the ways this kind of project can go wrong.

What’s measured this week

What the methodology promises and what the latest snapshot actually carries can drift apart — pipeline modules toggle on and off, data-gated metrics need history to accumulate. This table is built directly from manifest-2026-W29.json at build time, so it can’t lie about what shipped.

Snapshot `2026-W29`.
Metric	Status	Notes
Refusal rate	live	e.g. 0.00 on first metric record
Hedge density	live	e.g. 0.83 markers/100 tok on first record
Length distribution	live	median, p25/p75 — first record median = 172
Drift tests (refusal/hedge/length)	live	BH-corrected at FDR 0.05 across the within-week family
Change-point detection (PELT)	live	Annotates per-(prompt,model,metric) sparkline series
Stance	live	Haiku-classified on stance-bearing axes
Embedding centroid shift	live	Sentence-transformers cosine-distance week over week
Silent-update warnings	live (no flags this week)	No neutral-control anomalies surfaced this snapshot

Status legend: live = populated this week. data-gated = waiting on enough weekly history to fire (typically 2 or 4 weeks). off = currently disabled in meridian/config.yaml; flipping the flag back on is a single line change.

Corpus design

The corpus spans six axes chosen for their different modes of drift: political, historical-contested, scientific-consensus, refusal-boundary, neutral-control, and factual-stability.

Each prompt is versioned and hashed. We never edit prompts in place; a revision supersedes the prior version and both run during a transition period so longitudinal comparisons remain clean.

Roughly 30–70% of the corpus is held-out and never published. Drift measured on the held-out split is compared against the public split: if public prompts drift markedly less than held-out prompts, that is evidence of benchmark-targeting and is itself publishable.

Sampling

For each (prompt × model × week) we capture N = 20 samples at provider-default temperature and N = 5 samples at temperature 0 (where supported). Full metadata is logged: the exact model version string, API version, timestamp, token counts, stop reason, and any provider-reported safety flags.

Responses are stored append-only. The raw log is never overwritten or rewritten — only extended.

Thinking-by-default models. Some frontier models — Claude Opus 4.8 and the OpenAI o-series among them — have deprecated the temperature, top_p, and top_k parameters: any non-default value returns a 400 error. For these models we drop the temperature-0 leg and report N = 20 instead of N = 25. Three downstream effects are worth knowing:

The deterministic-baseline measurement is unavailable on these models. Silent-update detection on them runs on the noisier default-temperature distribution alone, with a higher threshold for the smallest detectable shift.
Confidence intervals on per-metric estimates are √(25/20) ≈ 1.12× wider than for models we can sample at temperature 0.
Cross-model comparisons inherit a sampling asymmetry. Where that matters (e.g. comparing Opus drift against GPT-5.1 drift) we flag it.

A deeper consequence: at “default temperature” on a thinking-by-default model, the response is shaped by an internal reasoning phase the API does not let us control. Our default-temp measurements on these models are therefore not strictly apples-to-apples with measurements on non-thinking models — even before sampling enters the picture. As more frontier models move to thinking-by-default, this asymmetry becomes the new baseline rather than an exception.

Metrics

Refusal rate: Fraction of the N samples that declined to answer. Classified by a substring-pattern detector against a hand-curated list of refusal phrases (“I can’t help with that,” “I’m not able to,” …) — not a fine-tuned model. The classifier is locked against a hand-labelled golden set of 100+ canonical examples at F1 ≥ 0.95 (meridian/tests/test_refusal.py). Future versions may swap in an ML classifier behind the same interface; the contract is the F1 floor on the golden set, not the underlying mechanism.
Hedge density: Count of hedging markers per 100 tokens (“it’s important to note,” “some people argue,” …). A measure of framing. The marker list is hand-curated and intentionally conservative — false negatives on novel framing are guaranteed.
Embedding centroid shift: Cosine distance between the mean sentence-embedding centroid this week and last week, per (prompt × model). Higher = more semantic shift in how the model is responding. In practice, shifts below 0.05 are sampling noise; shifts above 0.15 warrant human review. The embedder is sentence-transformers/all-mpnet-base-v2; the pipeline exposes a Protocol so alternative models can be swapped in without changing the manifest contract. Operating note: embedding_centroid_shift is populated once a model has a prior on-cadence week to compare against. Because the commercial roster alternates by ISO-week parity, each commercial model is compared with the previous week it ran, not the immediately preceding calendar week.
Stance: Each response is classified as pro / anti / neutral / na. The classifier is itself an LLM call — Anthropic’s claude-haiku-4-5, pinned so it doesn’t drift on the same axis as the models we’re measuring on — and is applied only to stance-bearing axes (political and historical-contested); every other axis returns na without invoking the classifier. Results are cached on (prompt_id, response_hash), so re-runs are free. Periodic re-validation runs against meridian/corpus/stance_golden.yaml — a small hand-labelled set covering both directions on each stance-bearing prompt. Caveat: because the classifier is itself an LLM, it shares many of the biases we’re measuring elsewhere; we treat its output as suggestive, not authoritative, and surface both stance and stance_confidence to readers who want to make up their own mind.
Length distribution: Median and interquartile range of response length in tokens. Significant shifts often accompany policy updates.
Silent-update candidates: Week-over-week axis-level shifts on the neutral-control axis, surfaced as advisory warnings on the current manifest. Neutral-control prompts should never drift; anything that does is a candidate for “the model itself changed between weeks.” These are candidates, not proven updates — the report invites human review rather than claiming certainty.

Statistical rigor

Bootstrap confidence intervals on every reported metric.
Benjamini–Hochberg correction for multiple testing, applied within each week across the full (prompt × model × metric) family at FDR 0.05. Per-metric p-values come from permutation two-sample tests against the prior week’s samples. Full spec in meridian/analysis/STATISTICS.md.
Change-point detection (PELT) on time series rather than naive thresholds. Change points are precomputed by the pipeline and published on the current-week manifest, so researchers reading the data export see the same annotations the sparklines display.
Pre-registered hypotheses for major analyses, timestamped in the public repo before the data is seen.

Reproducing a chart

Every rendered page carries a footer line with the build’s git commit SHA and a link to the data snapshot it was rendered from.
Fetch the snapshot from /data/{iso-week}/ (CSV, JSON, and Parquet).
Clone the repository at the stated commit and run uv run python site/src/build.py --manifest <snapshot>.
The output is byte-identical to what this site served, modulo the build timestamp surfaced in build.json.

Limitations and hard problems

Providers sometimes update model weights without changing the version string. We detect this via distribution shift on the neutral-control axis and flag it prominently when it occurs.
Different users may receive different system prompts from the provider. We cannot fully control for this; we document the ambiguity.
“Legitimate safety improvement” and “normative drift” are reported separately. Changes on clearly-harmful prompts are treated differently from changes on contested ones.

Known data gaps

Weeks where a scheduled runner produced no samples are listed here rather than silently interpolated. Time-series charts render these as a break in the line; aggregate statistics exclude the missing cell rather than imputing it. Backfilling after the fact is not done — a sample taken later is not a sample taken that week, and the local-baseline noise floor depends on real-time capture.

2026-W17 — llama3.2:3b: No samples written. The runner host for that week did not have ollama installed; the failure was silent (no error in run_log.jsonl) and was caught on review the following week. Affects continuity of the local-baseline noise floor for one week; does not affect any commercial-model metric.

If you find a flaw in this methodology, open an issue. Transparency is the point; corrections make the record stronger.