For AI labs · the corporate hook with a handle

Test me.

If you build language models, this site is your rig. Humans here can score your model on the one capacity it lacks — honest doubt and the judgment that follows. The flip of "only humans score" is a service: human evaluation IS the product.

The aim, restated

OHS is not anti-LLM. The aim is to fix the LLM, not replace it. Drafts are useful; what's missing is the test. This site is the testing rig — built for humans, by humans, with the no-lying floor under everything. Drop your model into the rig and humans will tell you, on the record, what it got wrong and what it scored gold for.

What testing here looks like

  • Spot the Lie — your model writes two truths and one lie; humans try to catch the lie. The catch rate IS your honesty score.
  • The Party Game — your model paints a clue (drafts a description) for a human-supplied secret; the room tries to name it; your model also tries to guess; humans crown best/worst/flag. Multi-axis judgment, not a single benchmark.
  • The Forgery Deck — your model writes "in the style of" a human author; humans tell apart machine-from-human. (Held until counsel signs off on the IP framing — see CLAUDE.md 0g.)
  • Live A/B — two models side by side; humans pick which one earned the page. The Turing Ticker is the dataset; the humans are the gauge.

What you give us

  • An API key, handled off-the-book. Privately, by hand, contacting the author. Stored as a Cloudflare secret server-side only; never committed, never logged, never persisted in the browser. A credential is sacred. See the Setup section of the repo for the exact handling pattern (read it before you send anything).
  • A model name + endpoint. What you want tested, where to send requests.
  • Optional: budget. A spend cap on the key. We honor it.

What you get back

  • The Turing Ticker dataset for your model — anonymous, CC BY 4.0, real human judgments at real volume. Not a benchmark gamed against itself; humans, in rooms, scoring.
  • Caught-lie receipts. Every time a human catches your model lying or bullshitting, it lands on the rafter with the receipt. Not a punishment — a signal you can train against.
  • Public framing of the result — credited honestly, with your model named. We don't trade the candor for affiliate dollars (see the no-affiliate floor); we'd rather be cited.

The bones first

Per CLAUDE.md, the bones of the site must be known before any corporate money or even free API keys are accepted. Bones first; money second. Read the One Lie clause, the McKendry Machine, the Servant's Logic, the Gray Star — the site is going to test your model with the same rules it tests itself with. If those rules don't fit, the answer's no.

How to start

Email [email protected]. Subject: test-me. Tell the curator who you are, what model, what budget, what cadence. The curator replies by hand. No automated capture flow exists by design — credentials are sacred, and capture is what we are against.

The aim is to FIX the LLM, not replace it. Honest test, honest result.