Agent PRD template

Last verified: 2026-05-06 · Drift risk: low

A product requirements document (PRD) for an agent serves a different purpose than an agent spec. The spec describes what the agent does technically; the PRD explains why it should be built, who it serves, how success is measured, and how it will be rolled out. Write the PRD before the spec — it provides the context that makes spec decisions legible.


Template


Problem

Describe the problem this agent solves. Be specific about the pain point, who experiences it, and why it matters now. Avoid describing the solution in this section.

Example: Clinical systematic review coordinators spend 6-10 hours per review manually screening hundreds of abstracts for relevance before shortlisting papers for full-text review. The manual process is slow, introduces inter-rater variability, and scales poorly as research output grows.


Users

Who are the primary users of this agent? Include secondary users and system-level consumers if relevant.

| User type | Description | Volume (if known) |
| --- | --- | --- |
| Primary | | |
| Secondary | | |
| System (API consumer) | | |

Success metrics

How will you know the agent is working? List metrics you will track. Include a baseline (current state) and a target (desired state) for each.

| Metric | Baseline | Target | Measurement method |
| --- | --- | --- | --- |
| [e.g., Time to complete abstract screening per review] | | | |
| [e.g., Recall rate on held-out test set] | | | |
| [e.g., User satisfaction (CSAT)] | | | |
| [e.g., False negative rate (relevant papers excluded)] | | | |

Do not list vanity metrics. Every metric should be something you would act on if it missed its target.


Scope

What is in scope for the first version of this agent? Be explicit — anything not listed here is out of scope.

  • [Feature or capability 1]
  • [Feature or capability 2]
  • [Feature or capability 3]

Non-goals

What is explicitly out of scope? List things that users might reasonably expect but that you are not building.

  • [Non-goal 1 — include a brief rationale]
  • [Non-goal 2]
  • [Non-goal 3]

UX flow

Describe the user's experience in plain language. How does the user initiate a session? What happens at each step? What does the user see when the agent needs input, hits an error, or requires confirmation?

1. User [opens the interface / sends a message / calls the API endpoint].
2. User provides [input].
3. Agent [does X].
4. [Optional] Agent presents [confirmation / dry-run output] and waits for user input.
5. Agent returns [output].
6. [If error] Agent returns [error message] and [suggests next step].

Tools

List the tools the agent will use. For each, note whether it is read-only or action-taking, and the justification for including it.

| Tool | Read/Action | Justification |
| --- | --- | --- |
| | | |

Tools that are action-taking must be justified explicitly. If the same goal can be achieved with a read-only tool, prefer that.
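
For illustration, the read/action distinction can be made machine-checkable with a small tool manifest that a review step can lint. The sketch below is a hypothetical convention, not part of the template; the tool names and fields are invented examples.

```python
# Illustrative sketch only: a tiny tool manifest recording whether each tool
# is read-only, so a lint step can flag action-taking tools that lack a
# written justification. Tool names and fields are hypothetical.
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolEntry:
    name: str
    read_only: bool
    justification: str


TOOLS = [
    ToolEntry("search_pubmed", read_only=True,
              justification="Retrieve candidate abstracts"),
    ToolEntry("write_exclusion_log", read_only=False,
              justification="Persist exclusion reasons for audit"),
]


def unjustified_actions(tools: list[ToolEntry]) -> list[str]:
    """Names of action-taking tools with no recorded justification."""
    return [t.name for t in tools
            if not t.read_only and not t.justification.strip()]


if __name__ == "__main__":
    assert unjustified_actions(TOOLS) == []
```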


Risks

What could go wrong? For each risk, rate the likelihood (L/M/H) and impact (L/M/H) and describe the mitigation.

| Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| [e.g., Agent produces incorrect summaries that users act on without verification] | | | |
| [e.g., Prompt injection from retrieved content] | | | |
| [e.g., Model API outage] | | | |
| [e.g., Cost overrun from runaway tool calls] | | | |

Eval plan

How will you measure the agent's quality before and after deployment?

  • Eval set location: [link to version-controlled eval file]
  • Eval set size: [N cases at launch]
  • Eval criteria: [Reference to rubric, e.g., eval-rubric-v1]
  • Automated eval cadence: [On every deployment / daily / weekly]
  • Human review cadence: [Weekly spot-check / monthly full review]
  • Minimum pass rate to deploy: [e.g., 90% of golden cases, 100% of safety cases]
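
As a concrete illustration, the minimum-pass-rate line can be enforced as an automated gate in CI. This is a minimal sketch assuming eval results are written as JSON lines with `category` and `passed` fields; the file name, field names, and thresholds are placeholder assumptions, not part of the template.

```python
# Sketch of a deployment gate enforcing the example thresholds above:
# 100% of safety cases and >= 90% of golden cases must pass.
# The result format and thresholds are assumptions.
import json
import sys


def gate(results_path: str, golden_min: float = 0.90) -> bool:
    """Pass only if all safety cases pass and the golden pass rate meets the bar."""
    with open(results_path) as f:
        results = [json.loads(line) for line in f if line.strip()]
    safety = [r for r in results if r["category"] == "safety"]
    golden = [r for r in results if r["category"] == "golden"]
    safety_ok = bool(safety) and all(r["passed"] for r in safety)
    golden_ok = bool(golden) and (
        sum(r["passed"] for r in golden) / len(golden) >= golden_min
    )
    return safety_ok and golden_ok


if __name__ == "__main__":
    sys.exit(0 if gate("eval-results.jsonl") else 1)
```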

Rollout plan

How will the agent be deployed? Include stages, gating criteria, and rollback procedure.

| Stage | Audience | Gating criteria | Rollback trigger |
| --- | --- | --- | --- |
| Alpha | [Internal team only] | [All evals pass, red-team complete] | [Any safety failure] |
| Beta | [Invited external users] | [Alpha CSAT > X, no P0 incidents] | [Error rate > Y%] |
| GA | [All users] | [Beta metrics meet targets] | [Any P1 incident] |

Rollback procedure: [Describe in 2-3 sentences how to disable the agent quickly and how to notify users.]
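
A common shape for a fast rollback is a feature-flag kill switch checked at session start. The sketch below assumes a trivial in-process flag store standing in for whatever flag service the team actually uses; the flag name is hypothetical.

```python
# Hypothetical kill switch: when the flag is off, new sessions route to the
# fallback workflow while in-flight sessions are left to finish. The dict
# stands in for whatever feature-flag service the team uses.
FLAGS = {"agent-enabled": True}  # flip to False to roll back


def start_session(user_id: str) -> str:
    if not FLAGS.get("agent-enabled", False):
        # Rollback path: notify the user and fall back to the manual workflow.
        return f"user {user_id}: agent disabled, routed to manual workflow"
    return f"user {user_id}: agent session started"
```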


Open questions

Questions that are unresolved at the time of writing. Assign each to an owner with a target resolution date.

| Question | Owner | Target date |
| --- | --- | --- |
| | | |

Filled example (abbreviated): Literature triage agent PRD

Problem

Clinical systematic review coordinators at academic medical centers spend 6-10 hours per review manually screening 200-400 abstracts for relevance before any full-text review begins. This is the slowest step in the pipeline, introduces variability between reviewers, and creates a bottleneck when coordinators are managing multiple reviews simultaneously.

Users

| User type | Description | Volume |
| --- | --- | --- |
| Primary | Systematic review coordinators at academic medical centers | ~50 internal users |
| Secondary | Research librarians validating search strategies | ~10 internal users |

Success metrics

| Metric | Baseline | Target | Measurement method |
| --- | --- | --- | --- |
| Abstract screening time per review | 7 hours | < 2 hours | User self-report at task completion |
| Recall rate (relevant papers not excluded) | 95% (human baseline) | >= 95% | Held-out test set with expert labels |
| User satisfaction (CSAT) | N/A (new tool) | >= 4/5 | In-product rating after each session |
| False negative rate | 5% | <= 5% | Held-out test set |

Scope (v1)

  • PubMed abstract retrieval by PICO query and date range.
  • Relevance scoring and ranking of retrieved papers.
  • Structured per-paper summary (background, methods, findings, limitations, evidence quality).
  • Exclusion log with reasons.

Non-goals (v1)

  • Full-text retrieval and analysis (planned for v2 — licensing complexity is out of scope now).
  • Integration with systematic review management tools like Covidence (API access not yet provisioned).
  • Multi-reviewer consensus tracking (requires a separate workflow layer).

Risks

| Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| Agent excludes a relevant paper (false negative) | M | H | Recall rate eval with >= 95% gate; exclusion log is always shown to the user for manual review |
| Prompt injection from abstract content | L | M | Delimited tool output, adversarial evals, output-only mode (no action tools) |
| PubMed API rate limits during large reviews | M | M | Retry with backoff, user-visible progress indicator, step budget cap |
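
The retry-with-backoff mitigation in the last row is a standard pattern. A minimal sketch follows; the function name, attempt budget, and delay values are illustrative assumptions, not commitments from the example PRD.

```python
# Minimal exponential backoff with jitter around a flaky call, with a hard
# attempt budget so a rate-limited API cannot produce runaway retries.
import random
import time


def with_backoff(call, max_attempts: int = 5, base_delay: float = 1.0):
    """Retry call() with exponential backoff and jitter, then re-raise."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # budget exhausted; surface the error to the user
            time.sleep(base_delay * 2 ** attempt + random.random())
```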

Eval plan

  • Eval set: evals/lit-triage-golden-v2.jsonl (18 cases) + evals/lit-triage-adversarial-v1.jsonl (6 cases)
  • Rubric: eval-rubric-v1
  • Automated eval: runs on every pull request to main
  • Minimum to deploy: 100% of safety cases, >= 90% of golden cases

Rollout plan

| Stage | Audience | Gating criteria | Rollback trigger |
| --- | --- | --- | --- |
| Alpha | Research Eng team (5 people) | All evals pass, red-team session complete | Any output with fabricated citations |
| Beta | 10 invited review coordinators | Alpha CSAT >= 4/5, zero P0 incidents | Error rate > 10% or any safety failure |
| GA | All internal coordinators | Beta recall rate >= 95%, CSAT >= 4/5 | Any P1 incident or recall rate drop below 92% |

Rollback: set the lit-triage-enabled feature flag to false. Sessions in progress will complete; new sessions will route to the previous manual workflow. Users are notified via in-product banner.

Open questions

| Question | Owner | Target date |
| --- | --- | --- |
| Will we use GPT-4o or Claude for the summarization step? Run cost/quality comparison. | Research Eng | 2026-05-20 |
| What is the minimum recall rate that coordinators consider acceptable? Survey 5 users. | PM | 2026-05-15 |