Methodology

How the score is built.

Rubric is a strict grader. Most users land near 50. This page explains exactly why, what we measure, how we weight it, and what the curve does. If you disagree with the scoring, this page should give you enough information to argue specifically.

What Rubric reads (and what it doesn’t)

Rubric only ever reads your turns in a session. The text of the AI's replies never leaves your browser. The scoring model sees your prompts plus three primitives per assistant reply (its length, whether it contained a code block, and whether it contained an error) but never the reply's text.

This is a deliberate constraint, not just a privacy claim. It means Rubric is grading how you prompt, not how well a model happened to answer you. A mediocre prompt that produced a great answer scores lower than a strong prompt that got an unlucky reply. The asymmetry is the point.
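As a rough illustration of that boundary, the sketch below reduces a reply to the three primitives and discards its text. The class and function names are hypothetical, measuring length in characters is an assumption, and how an error is detected is left to the caller because this page does not specify it.

```python
from dataclasses import dataclass

FENCE = "`" * 3  # code-fence marker, built this way to keep the example readable


@dataclass(frozen=True)
class ReplyPrimitives:
    """The only facts about an assistant reply that reach the scorer."""
    length: int           # assumption: character count
    has_code_block: bool
    has_error: bool


def reply_primitives(reply_text: str, has_error: bool) -> ReplyPrimitives:
    # The reply text is read here and then dropped; only the summary is returned.
    return ReplyPrimitives(
        length=len(reply_text),
        has_code_block=FENCE in reply_text,
        has_error=has_error,
    )
```

The returned object has no field that could carry the reply's wording, which is the property the paragraph above describes.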

What we redact from your turns before scoring

Even though the scorer only reads your prompts (not the AI's replies), your prompts themselves can contain secrets: a customer email pasted into a debugging session, an API key in an error message, or an IP in a stack trace. Before the LLM sees a turn, the turn passes through a redaction step that replaces high-risk patterns with opaque placeholders.

  • Email addresses → [EMAIL]
  • Phone numbers → [PHONE]
  • API keys (OpenAI sk-…, Stripe pk_live_…, Google AIza…, AWS AKIA…, GitHub ghp_…) → [KEY]
  • IPv4 addresses → [IP]
  • Credit-card-shaped digit groups → [CARD]

We intentionally don't redact names, addresses, company names, or identifiers like variable names and file paths. Detecting those reliably is full of false positives, and the cost of accidentally redacting half a prompt is worse than the cost of leaving an arguably-public name in. The redaction layer is high-precision and low-recall by design.

Your original, un-redacted prompts are still stored in the database so you can read your own session history. Only what we send to the LLM is redacted. The implementation and full pattern list live in apps/backend/app/analysis/redact.py and are covered by 12 tests.
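A minimal sketch of what such a redaction pass could look like follows. It is not the contents of redact.py: every regular expression here is a simplified, hypothetical stand-in, and only the placeholders and key prefixes come from the list above.

```python
import re

# Hypothetical, deliberately narrow patterns (high precision, low recall).
# Order matters: card-shaped groups are replaced before phone numbers so a
# card number is never partially matched as a phone number.
PATTERNS = [
    ("[EMAIL]", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("[KEY]", re.compile(r"\b(?:sk-|pk_live_|AIza|AKIA|ghp_)[A-Za-z0-9_-]{8,}")),
    ("[CARD]", re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b")),
    ("[IP]", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
    ("[PHONE]", re.compile(
        r"(?<!\d)(?:\+?\d{1,2}[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}(?!\d)"
    )),
]


def redact(turn_text: str) -> str:
    """Return a copy of the turn with high-risk patterns replaced."""
    for placeholder, pattern in PATTERNS:
        turn_text = pattern.sub(placeholder, turn_text)
    return turn_text


print(redact("Mail jane@example.com, key sk-abc123DEF456ghi, host 10.0.0.12"))
# Mail [EMAIL], key [KEY], host [IP]
```

Names and file paths pass through untouched in this sketch, matching the decision described above to leave them alone.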

The six dimensions

Every session is scored 0–100 on each of six axes, then weight-averaged.

Specificity

weight 20%

Are your asks concrete and measurable, or vague? Does the prompt name a desired outcome, success criteria, format, length? A prompt like “make this better” scores low here; “refactor this for readability, limit to ≤ 30 lines per function, and add type hints” scores high.

Context provision

weight 15%

Did you supply enough background for the model to act usefully? Constraints, examples, prior attempts, target audience. Lack of context shifts the burden onto the model to guess, and guesses score worse than they should.

Structure

weight 10%

Are your prompts organized, with sections, lists, and the ask separated from background? Heavy-context prompts especially benefit from structure. Five paragraphs of stream-of-consciousness scores low even if it contains everything the model needed.

Iteration quality

weight 20%

When something didn’t work, did you diagnose and refine, or repeat the same ask louder? Did you explain what was wrong with the previous answer? Did you change the constraint, the example, the format? Repeating verbatim is the lowest-signal iteration possible.

Scope discipline

weight 15%

Is this one coherent thread, or are you bouncing between unrelated asks? Sessions that drift across topics without anchoring score lower. (Caveat: long debugging sessions can be unfairly penalized here; session-type tuning is on the roadmap.)

Meta-prompting

weight 20%

Did you instruct on role, format, length, reasoning style, or output shape? “Reply as a senior PM in three numbered options with one risk per option” is meta-prompting; “what do you think?” isn’t. Meta-prompting shares the heaviest weight (20%) with specificity and iteration quality because it’s the lever with the largest delta on quality.
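To make the weighting concrete, here is a small sketch of the weighted average using the six weights listed above. The dictionary keys and the function name are illustrative, not identifiers from the Rubric codebase.

```python
# Weights in percent, taken from the six dimension headings above.
WEIGHTS = {
    "specificity": 20,
    "context_provision": 15,
    "structure": 10,
    "iteration_quality": 20,
    "scope_discipline": 15,
    "meta_prompting": 20,
}
assert sum(WEIGHTS.values()) == 100


def weighted_raw_score(scores: dict) -> float:
    """Combine six 0-100 dimension scores into one raw 0-100 score."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("expected exactly the six dimensions")
    for name, value in scores.items():
        if not 0 <= value <= 100:
            raise ValueError(f"{name} must be between 0 and 100")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


example = {
    "specificity": 40,
    "context_provision": 60,
    "structure": 50,
    "iteration_quality": 30,
    "scope_discipline": 70,
    "meta_prompting": 20,
}
print(weighted_raw_score(example))  # 42.5
```

In this made-up session, the three 20% dimensions contribute 18 of the 42.5 points, so weak specificity, iteration, or meta-prompting drags the raw score down faster than weak structure does.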

Evidence

For every dimension, the scoring model must cite at least one specific user-turn index as evidence. If you see a low specificity score, you can look at exactly which of your turns drove that score. This is also what powers the Premium before/after rewrites feature: the cited weak turns are what we offer to rewrite.
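One way to picture the evidence requirement is as a check on each dimension result before it is accepted. The sketch below is an assumption-heavy illustration: the field names, the zero-based turn indexing, and the choice to raise ValueError are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DimensionResult:
    name: str
    score: int                        # 0-100
    evidence_turns: Tuple[int, ...]   # indices into the user's turns (assumed zero-based)


def validate_evidence(result: DimensionResult, user_turn_count: int) -> None:
    """Reject a dimension score that cites no turn or a turn that does not exist."""
    if not result.evidence_turns:
        raise ValueError(f"{result.name}: at least one evidence turn is required")
    for index in result.evidence_turns:
        if not 0 <= index < user_turn_count:
            raise ValueError(f"{result.name}: turn {index} is out of range")
```

A result such as DimensionResult("specificity", 31, (2, 5)) passes for an eight-turn session, while one with an empty evidence tuple is rejected.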

The strictness curve

The weighted raw score is then run through a strictness transform before being shown to you. The transform leaves 0 and 100 unchanged and lowers every score in between, which pulls the median toward ~50. Concretely:

adjusted = (raw ** 1.15) / (100 ** 0.15)
displayed = clamp(round(adjusted), 0, 100)
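The two lines above translate almost directly into Python. In this sketch the function name is illustrative and clamp is written out with min and max; the formula is algebraically the same as raw * (raw / 100) ** 0.15, so the multiplier is below 1 for any raw score under 100.

```python
def displayed_score(raw: float) -> int:
    """Apply the strictness curve (exponent 1.15) to a raw 0-100 score."""
    adjusted = (raw ** 1.15) / (100 ** 0.15)
    return max(0, min(100, round(adjusted)))


for raw in (0, 50, 55, 73, 85, 100):
    print(raw, "->", displayed_score(raw))
# 0 -> 0
# 50 -> 45
# 55 -> 50
# 73 -> 70
# 85 -> 83
# 100 -> 100
```

With the 1.15 exponent, a raw score of about 55 is what displays as 50, and the largest absolute drop in this table is the 5 points lost at a raw score of 50 or 55.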

Why? Because a 0–100 number that feels uncalibrated is useless. A median user scoring 85 produces no signal, while a median user scoring 50 produces information. The strictness is also what makes the share card honest. A 73 from Rubric means something. A 73 on a soft scale means nothing.

The curve constants are tuned against a calibration set of hand-graded sessions and updated periodically. When they change, we publish a dated changelog at the bottom of this page so old scores remain interpretable against the version of the rubric that produced them.

Archetype selection

The archetype is a short label (eight available, e.g. The Sculptor, The Architect, The Wanderer) that summarizes the session's personality. It's chosen in two steps:

  1. Cheap heuristics on session shape (turn count, avg turn length, structural markers) produce a candidate set of 3–5 plausible archetypes.
  2. The LLM picks one from the candidate set with reasoning, citing turns. It can defend choosing The Architect over The Sculptor by pointing at specific structural markers in your prompts.

Long sessions

If your session is too large to fit in a single LLM call, Rubric keeps the first 5 turns, the last 10, and 10 turns evenly spaced from the middle. The report is flagged was_truncated when this happens. Truncation rarely changes the archetype but can compress scores toward the middle, since the model sees fewer extreme examples.
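The sampling rule can be sketched as below. The function name and the exact meaning of "evenly spaced" (here, the midpoint of ten equal-width slices of the middle) are assumptions, and the size check that decides whether a session needs truncating is left out because this page does not describe it.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def sample_turns(
    turns: Sequence[T], head: int = 5, tail: int = 10, middle: int = 10
) -> Tuple[List[T], bool]:
    """Keep the first `head`, last `tail`, and `middle` evenly spaced turns.

    Returns the kept turns in original order plus a was_truncated flag.
    """
    n = len(turns)
    if n <= head + tail + middle:
        return list(turns), False  # nothing to drop

    span = n - head - tail  # number of turns strictly between head and tail
    # Midpoint of each of `middle` equal slices; distinct because span > middle.
    picks = [head + (2 * i + 1) * span // (2 * middle) for i in range(middle)]
    kept = list(range(head)) + picks + list(range(n - tail, n))
    return [turns[i] for i in kept], True


kept, was_truncated = sample_turns(list(range(100)))
print(len(kept), was_truncated)  # 25 True
```

Under these defaults a truncated session always contributes exactly 25 turns, whatever its original length.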

If you disagree

A 39 stings. It's supposed to. But if you think the score is wrong, not just harsh, the rubric is meant to be argued with. Look at the evidence turns cited for the dimensions that hurt you. If the citation doesn't hold up, the score is broken. We want to hear about it.

Email hello@rubric.chat with the session ID and which dimension you think is misjudged. Most disagreements we see are either calibration issues (the curve is wrong for a session type; coding agents are the common one) or evidence misattribution (the model cited the wrong turn). Both are fixable. Neither means we soften the score.

Changelog

  • v0.1.0 · 2026-05 · Initial public methodology. Strictness curve exponent 1.15. Dimension weights as above. Archetype heuristics expanded to always include a baseline spread.