
# hypnex-bench

Public eval suite + leaderboard for the Morpheus AI inference network. Reference implementation for MRC 76 (agent benchmarking).

```bash
pip install hypnex-bench
```

## CLI

```bash
# List active models on Morpheus (no API key needed for read)
hypnex-bench models

# Run the eval suite against a specific model
HYPNEX_API_KEY=mor_... hypnex-bench run --model mistral-31-24b --limit 20

# Compare multiple models head-to-head
HYPNEX_API_KEY=mor_... hypnex-bench compare --models mistral-31-24b,glm-5,qwen3-235b
```

## What it measures

- Accuracy on a curated probe set across reasoning, code, math, and knowledge
- Latency (P50 / P95 / P99, first-token and last-token)
- Cost-per-call in MOR (read from Morpheus's pricing)
- Tool-use compliance (does the model emit valid OpenAI tool-call shapes?)
- Streaming consistency (does the accumulated stream match a non-streaming call?)
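The streaming-consistency check can be sketched roughly as follows. This is illustrative, not hypnex-bench internals: the chunk shape mimics OpenAI-style streaming deltas, and `accumulate_stream` / `streams_match` are hypothetical helpers.

```python
# Sketch: verify that concatenating streamed content deltas
# reproduces the text of an equivalent non-streaming completion.
# Chunk shape mimics OpenAI-style streaming responses (assumption).

def accumulate_stream(chunks):
    """Concatenate the content deltas from a stream of chunks."""
    parts = []
    for chunk in chunks:
        delta = chunk["choices"][0]["delta"]
        parts.append(delta.get("content", ""))
    return "".join(parts)

def streams_match(chunks, non_streaming_text):
    """True when the accumulated stream equals the non-streaming output."""
    return accumulate_stream(chunks) == non_streaming_text

chunks = [
    {"choices": [{"delta": {"content": "Hel"}}]},
    {"choices": [{"delta": {"content": "lo"}}]},
    {"choices": [{"delta": {}}]},  # final chunk often carries an empty delta
]
print(streams_match(chunks, "Hello"))  # True
```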

## Reproducible probe sets

Hypnex ships bundled probe sets in python-bench/data/. They're versioned and committed — every published leaderboard score corresponds to a specific probe set hash, so results stay reproducible across time.

```python
from hypnex_bench import probes

p = probes.load("v1.0/reasoning")  # or "v1.0/code", "v1.0/math", ...
print(p.questions[0])
```
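A probe-set hash of the kind a leaderboard score would pin can be computed by hashing a canonical serialization of the set. A minimal sketch, assuming the questions are JSON-serializable; `probe_set_hash` is a hypothetical helper, not hypnex-bench API:

```python
import hashlib
import json

def probe_set_hash(questions):
    """Deterministic probe-set hash: canonical JSON -> SHA-256 hex digest.
    Illustrative only -- hypnex-bench may serialize differently."""
    canonical = json.dumps(questions, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

qs = [{"id": 1, "q": "What is 2 + 2?", "answer": "4"}]
print(probe_set_hash(qs)[:12])  # short prefix for display
```

Sorting keys and fixing separators makes the digest independent of dict ordering, so the same probe set always yields the same hash.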

## Programmatic API

```python
from hypnex_bench import BenchClient

c = BenchClient(api_key="mor_...")
result = c.run(
    model="mistral-31-24b",
    probe_set="v1.0/reasoning",
    limit=10,
)
print(result.accuracy, result.latency_p50_ms, result.cost_per_call_mor)
```
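A head-to-head comparison like the CLI's `compare` can be assembled from per-model run results. A sketch using plain dicts whose fields mirror the attributes printed above; the ranking rule (accuracy descending, P50 latency as tie-breaker) is illustrative, not necessarily what hypnex-bench uses:

```python
def rank_models(results):
    """Sort results by accuracy (desc), tie-breaking on P50 latency (asc).
    Each result is a dict with model, accuracy, latency_p50_ms fields."""
    return sorted(results, key=lambda r: (-r["accuracy"], r["latency_p50_ms"]))

results = [
    {"model": "mistral-31-24b", "accuracy": 0.82, "latency_p50_ms": 310},
    {"model": "glm-5", "accuracy": 0.87, "latency_p50_ms": 420},
    {"model": "qwen3-235b", "accuracy": 0.87, "latency_p50_ms": 380},
]
for row in rank_models(results):
    print(row["model"], row["accuracy"], row["latency_p50_ms"])
```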

## Source

`python-bench/`

Released under the MIT License.