# hypnex-bench
Public eval suite and leaderboard for the Morpheus AI inference network. Reference implementation for MRC 76 (agent benchmarking).
## Installation

```bash
pip install hypnex-bench
```

## CLI

```bash
# List active models on Morpheus (no API key needed for reads)
hypnex-bench models

# Run the eval suite against a specific model
HYPNEX_API_KEY=mor_... hypnex-bench run --model mistral-31-24b --limit 20

# Compare multiple models head-to-head
HYPNEX_API_KEY=mor_... hypnex-bench compare --models mistral-31-24b,glm-5,qwen3-235b
```

## What it measures
- Accuracy on curated probe sets across reasoning, code, math, and knowledge
- Latency (P50 / P95 / P99 first-token + last-token)
- Cost-per-call in MOR (read from Morpheus's pricing)
- Tool-use compliance (does the model emit valid OpenAI tool-call shapes?)
- Streaming consistency (does the streamed accumulation match a non-streaming call?)
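The tool-use compliance check above can be sketched roughly as follows. This is a simplified illustration of validating the OpenAI function tool-call shape, not hypnex-bench's actual validator; the `is_valid_tool_call` helper and the example payloads are hypothetical.

```python
import json

def is_valid_tool_call(tc):
    """Check a dict against the OpenAI function tool-call shape:
    {"id": str, "type": "function",
     "function": {"name": str, "arguments": "<JSON object string>"}}.
    Simplified sketch of what a compliance probe might assert."""
    try:
        return (
            isinstance(tc["id"], str)
            and tc["type"] == "function"
            and isinstance(tc["function"]["name"], str)
            and isinstance(json.loads(tc["function"]["arguments"]), dict)
        )
    except (KeyError, TypeError, ValueError):
        return False

good = {"id": "call_1", "type": "function",
        "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'}}
bad = {"type": "function",  # missing "id", arguments not a JSON object
       "function": {"name": "get_weather", "arguments": "Paris"}}
print(is_valid_tool_call(good), is_valid_tool_call(bad))  # True False
```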
## Reproducible probe sets
Hypnex ships bundled probe sets in python-bench/data/. They're versioned and committed: every published leaderboard score corresponds to a specific probe-set hash, so results stay reproducible over time.
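Pinning scores to a probe-set hash could work along these lines: hash a canonical serialization of the probes, so any edit to the set changes the hash. This is a minimal sketch; the actual hashing scheme used by hypnex-bench is not specified here, and `probe_set_hash` is a hypothetical helper.

```python
import hashlib
import json

def probe_set_hash(questions):
    """SHA-256 over a canonical JSON encoding of a probe set.
    sort_keys + fixed separators make the digest independent of
    dict key order, so the same probes always hash the same."""
    canonical = json.dumps(questions, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

probes = [{"id": 1, "q": "2+2?"}, {"id": 2, "q": "Capital of France?"}]
print(probe_set_hash(probes)[:12])  # short prefix, e.g. for a leaderboard label
```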
```python
from hypnex_bench import probes

p = probes.load("v1.0/reasoning")  # or "v1.0/code", "v1.0/math", ...
print(p.questions[0])
```

## Programmatic API
```python
from hypnex_bench import BenchClient

c = BenchClient(api_key="mor_...")
result = c.run(
    model="mistral-31-24b",
    probe_set="v1.0/reasoning",
    limit=10,
)
print(result.accuracy, result.latency_p50_ms, result.cost_per_call_mor)
```
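From per-model results like the one above, a head-to-head ranking can be assembled by sorting on accuracy and breaking ties on latency. The dicts below are illustrative stand-ins for `BenchClient.run` results, and this ordering is one plausible choice, not necessarily what `compare` does.

```python
# Hypothetical per-model results (highest accuracy first, then lowest P50).
results = {
    "mistral-31-24b": {"accuracy": 0.82, "latency_p50_ms": 310},
    "glm-5": {"accuracy": 0.86, "latency_p50_ms": 420},
    "qwen3-235b": {"accuracy": 0.86, "latency_p50_ms": 390},
}

ranked = sorted(
    results.items(),
    key=lambda kv: (-kv[1]["accuracy"], kv[1]["latency_p50_ms"]),
)
for name, r in ranked:
    print(f"{name:16s} acc={r['accuracy']:.2f} p50={r['latency_p50_ms']}ms")
```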