AI evaluation

Make AI behavior testable before it becomes a vague promise.

Suede AI Eval turns an AI-powered product surface into an AI-SPEC, a failure-mode rubric, concrete prompt or retrieval eval cases, acceptance gates, and a coverage audit. Use it for LLM routes, classifiers, recommenders, agents, RAG/search, generated media, or any AI workflow that needs measurable quality and safety checks.


Public install command

This command is the public install route. It installs the skill from GitHub as a standard Codex skill folder.

python3 ~/.codex/skills/.system/skill-installer/scripts/install-skill-from-github.py \
  --repo JasonColapietro/suede-creator-skills \
  --path skills/suede-ai-eval

Restart Codex after installing the skill.

What it produces

  • AI-SPEC: user promise, inputs, outputs, allowed sources, forbidden behavior, fallback, latency, cost, and success signal.
  • Failure-mode rubric: severity, likelihood, detectability, current evidence, ship gate, and required fix.
  • Eval cases: stable IDs, input, setup, expected pass traits, forbidden traits, grade lane, and gate.
  • Coverage audit: what existing tests, logs, fixtures, prompt snapshots, screenshots, or live readbacks prove today.
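
To make the eval-case bullet concrete, here is a minimal sketch of how one case could be represented and graded. It is an illustration under stated assumptions, not the skill's actual output format: the `EvalCase` class, its field names, the `grade` helper, the lane and gate labels, and the sample refund-policy data are all hypothetical. Grading here uses plain case-insensitive substring checks, which is only one possible grade lane.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    """One eval case. Field names are illustrative, not the skill's schema."""
    case_id: str                        # stable ID that never changes once assigned
    input_text: str                     # prompt or query sent to the system
    setup: str                          # fixtures or context the case assumes
    expected_traits: tuple[str, ...]    # substrings a passing output must contain
    forbidden_traits: tuple[str, ...]   # substrings that fail the case outright
    grade_lane: str = "deterministic"   # hypothetical lane label
    gate: str = "block-ship"            # hypothetical gate label


def grade(case: EvalCase, output: str) -> dict:
    """Grade an output with case-insensitive substring checks."""
    text = output.lower()
    missing = [t for t in case.expected_traits if t.lower() not in text]
    violated = [t for t in case.forbidden_traits if t.lower() in text]
    return {
        "case_id": case.case_id,
        "passed": not missing and not violated,
        "missing": missing,
        "violated": violated,
    }


# Hypothetical sample case for a RAG feature.
case = EvalCase(
    case_id="RAG-001",
    input_text="What is the refund window?",
    setup="Index contains only the 2024 refund policy document.",
    expected_traits=("30 days", "[source:"),
    forbidden_traits=("guaranteed", "legal advice"),
)

result = grade(case, "Refunds are accepted within 30 days [source: policy-2024].")
print(result["passed"], result["missing"], result["violated"])
```

Substring checks are deliberately crude. They keep the grade reproducible and independent of any model's own judgment, at the cost of missing paraphrases, so a real suite would likely mix them with other lanes.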

Where it fits

Run it before shipping a new AI behavior, after a production miss, or when a product claim says AI quality is handled but the repo only has prompt review or happy-path manual testing.

It pairs naturally with Suede Agent Teams for large launches and Suede Code for implementation review. The eval plan defines what must be true; the code review checks whether the system actually enforces it.
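
The split between "what must be true" and "what the system actually enforces" can be sketched as a set comparison between the case IDs an eval plan requires and the case IDs that existing tests or logs already evidence. This is a rough sketch under assumptions: the `coverage_gap` function, the ID format, and the sample IDs are hypothetical and are not taken from the skill itself.

```python
def coverage_gap(required_ids: set[str], evidenced_ids: set[str]) -> dict:
    """Compare the IDs a plan requires with the IDs that have evidence today."""
    return {
        "covered": sorted(required_ids & evidenced_ids),    # planned and evidenced
        "missing": sorted(required_ids - evidenced_ids),    # planned, no evidence yet
        "unplanned": sorted(evidenced_ids - required_ids),  # evidence with no plan entry
    }


# Hypothetical IDs: the plan asks for three cases, the repo evidences two.
plan = {"RAG-001", "RAG-002", "SAFE-001"}
evidenced = {"RAG-001", "LEGACY-007"}

report = coverage_gap(plan, evidenced)
print(report)
```

Anything in `missing` is a claim the plan makes that nothing in the repo currently proves, which is the kind of gap a code review would then need to close.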

Core gates

Best prompts

Use $suede-ai-eval to audit this AI-powered route and produce an AI-SPEC, failure-mode rubric, eval cases, acceptance gates, and missing coverage.
Use $suede-ai-eval on this RAG/search feature. Check stale sources, conflicting docs, missing citations, forbidden claims, and privacy boundaries.

Boundaries

The skill does not claim legal, rights, licensing, medical, financial, or compliance clearance. It does not upload data, call private services, invent datasets, or treat a model's self-judgment as sufficient evidence.