IdeaGrains tests AI models on real product work, then publishes evidence-backed reports showing which model made the strongest V2.
IdeaGrains is a benchmark catalogue built around real product cases. Each case starts as a V1 product or project, and different AI models are then given the same improvement brief. The outputs are compared, scored, and published as a report linked back to the source case.
01
Choose a real product case
Use an owned V1 product or project, not a toy prompt.
02
Capture the product state, screenshots, constraints, and known issues.
03
Each model receives the same improvement task and product context.
04
Compare what changed in UX, code, judgement, and reliability.
05
Log the fixes, rewrites, debugging, and decisions needed from a human.
06
Show the evidence, caveats, source case, and practical verdict.
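The six steps above can be sketched as a small data model. This is only an illustration, not IdeaGrains' actual tooling: the class names, field names, and report shape are all assumptions made for this example. It shows the two properties the steps rely on: every model receives the identical brief and context, and the report points back to its source case.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ProductCase:
    """Steps 01-02: an owned V1 product plus its captured baseline (hypothetical shape)."""
    name: str
    baseline_notes: str                 # product state, constraints, known issues
    screenshots: tuple[str, ...] = ()   # paths or labels for captured screenshots


@dataclass
class ModelRun:
    """Steps 03-05: one model's attempt at the shared brief."""
    model: str
    brief: str
    output_summary: str = ""
    rescue_log: list[str] = field(default_factory=list)  # human fixes that were needed


def start_runs(case: ProductCase, brief: str, models: list[str]) -> list[ModelRun]:
    """Give every model the same improvement task and product context."""
    shared = f"{brief}\n\nContext: {case.baseline_notes}"
    return [ModelRun(model=m, brief=shared) for m in models]


def build_report(case: ProductCase, runs: list[ModelRun]) -> dict:
    """Step 06: assemble a report that links back to its source case."""
    return {
        "source_case": case.name,
        "runs": [
            {
                "model": r.model,
                "summary": r.output_summary,
                "rescue_count": len(r.rescue_log),
            }
            for r in runs
        ],
    }
```

Building the shared brief once and passing the same string to every run is what keeps the comparison fair; the rescue log is kept per run so the human effort behind each result stays visible in the report.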
Real products include messy context: existing UX, technical constraints, deployment concerns, unclear tradeoffs, and user-facing consequences. That is where model differences become useful.
Every report links back to the product case it came from, so readers can trace the baseline, model attempt, evidence, and final verdict.
Human rescue is the manual correction needed before an output becomes usable: fixing bugs, tightening copy, rejecting bad product judgement, or reducing deployment risk.
Did the model create a meaningfully better V2 for the product and user?
Was the implementation clean, deployable, secure enough, and low-risk?
How much correction, judgement, debugging, or rewriting was needed?
Did the report include enough prompts, outputs, before/after notes, and caveats?
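The four questions above could be recorded as a simple rubric. The sketch below is hypothetical: the 0-5 scale, the field names, and the unweighted average are assumptions for illustration, since the page does not state how IdeaGrains combines these dimensions into a score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunScore:
    """Hypothetical 0-5 ratings, one per question above."""
    product_improvement: int   # meaningfully better V2 for product and user?
    implementation: int        # clean, deployable, secure enough, low-risk?
    rescue_effort: int         # 0 = no human correction, 5 = heavy rewriting
    evidence: int              # prompts, outputs, before/after notes, caveats

    def __post_init__(self) -> None:
        # Reject ratings outside the assumed 0-5 scale.
        for name, value in vars(self).items():
            if not 0 <= value <= 5:
                raise ValueError(f"{name} must be between 0 and 5, got {value}")

    def overall(self) -> float:
        """Unweighted mean; rescue effort is inverted so less rescue scores higher."""
        parts = [
            self.product_improvement,
            self.implementation,
            5 - self.rescue_effort,
            self.evidence,
        ]
        return sum(parts) / len(parts)
```

Keeping the four ratings separate, rather than storing only the combined number, lets a reader see whether a strong result came from the model itself or from heavy human rescue.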
IdeaGrains is not an academic benchmark. It is a practical, builder-focused benchmark based on real product work. Scores reflect human judgement, documented evidence, and visible caveats, and they are corrected when assumptions change.