Agent PRD template¶
Last verified: 2026-05-06 · Drift risk: low
A product requirements document (PRD) for an agent serves a different purpose than an agent spec. The spec describes what the agent does technically; the PRD explains why it should be built, who it serves, how success is measured, and how it will be rolled out. Write the PRD before the spec — it provides the context that makes spec decisions legible.
Template¶
Problem¶
Describe the problem this agent solves. Be specific about the pain point, who experiences it, and why it matters now. Avoid describing the solution in this section.
Example: Clinical systematic review coordinators spend 6-10 hours per review manually screening hundreds of abstracts for relevance before shortlisting papers for full-text review. The manual process is slow, introduces inter-rater variability, and scales poorly as research output grows.
Users¶
Who are the primary users of this agent? Include secondary users and system-level consumers if relevant.
| User type | Description | Volume (if known) |
|---|---|---|
| Primary | | |
| Secondary | | |
| System (API consumer) | | |
Success metrics¶
How will you know the agent is working? List metrics you will track. Include a baseline (current state) and a target (desired state) for each.
| Metric | Baseline | Target | Measurement method |
|---|---|---|---|
| [e.g., Time to complete abstract screening per review] | |||
| [e.g., Recall rate on held-out test set] | |||
| [e.g., User satisfaction (CSAT)] | |||
| [e.g., False negative rate (relevant papers excluded)] |
Do not list vanity metrics. Every metric should be something you would act on if it missed its target.
Scope¶
What is in scope for the first version of this agent? Be explicit — anything not listed here is out of scope.
- [Feature or capability 1]
- [Feature or capability 2]
- [Feature or capability 3]
Non-goals¶
What is explicitly out of scope? List things that users might reasonably expect but that you are not building.
- [Non-goal 1 — include a brief rationale]
- [Non-goal 2]
- [Non-goal 3]
UX flow¶
Describe the user's experience in plain language. How does the user initiate a session? What happens at each step? What does the user see when the agent needs input, hits an error, or requires confirmation?
1. User opens [interface / sends message to / calls API endpoint].
2. User provides [input].
3. Agent [does X].
4. [Optional] Agent presents [confirmation / dry-run output] and waits for user input.
5. Agent returns [output].
6. [If error] Agent returns [error message] and [suggests next step].
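Where the flow includes a confirmation or dry-run gate, a short sketch can make the hand-off explicit. The following Python is illustrative only; `Plan`, `execute`, and `run_session` are hypothetical names, not part of any particular agent framework:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Plan:
    """Hypothetical plan object: what the agent intends to do for this request."""
    summary: str
    requires_confirmation: bool

def execute(plan: Plan) -> str:
    # Placeholder for the real work (tool calls, writes, etc.).
    return f"Done: {plan.summary}"

def run_session(user_input: str, confirm: Callable[[str], bool]) -> str:
    """Confirm-before-act loop matching steps 2-6 of the flow above."""
    plan = Plan(summary=f"process {user_input!r}", requires_confirmation=True)
    if plan.requires_confirmation and not confirm(plan.summary):
        return "Cancelled; no actions were taken."         # user declined the dry run
    try:
        return execute(plan)                                # step 5: return output
    except Exception as err:                                # step 6: error path
        return f"Error: {err}. Suggested next step: retry with a narrower request."
```

The point of the sketch is ordering: nothing with side effects runs until the confirmation callback approves the dry-run summary.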
Tools¶
List the tools the agent will use. For each, note whether it is read-only or action-taking, and the justification for including it.
| Tool | Read/Action | Justification |
|---|---|---|
Tools that are action-taking must be justified explicitly. If the same goal can be achieved with a read-only tool, prefer that.
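If the tool list is also maintained in code or config, the read/action distinction can be made machine-checkable. A minimal sketch, assuming a hypothetical in-repo registry; the `Tool` dataclass and the example entries are placeholders, not tools named in this PRD:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tool:
    name: str
    action_taking: bool      # True if the tool mutates external state
    justification: str

# Placeholder entries mirroring the table above; not tools from a real agent.
TOOLS = [
    Tool("search_index", action_taking=False,
         justification="Needed to retrieve candidate documents."),
    Tool("send_notification", action_taking=True,
         justification="No read-only alternative; users must be alerted on completion."),
]

def unjustified_action_tools(tools: list[Tool]) -> list[str]:
    """Return the names of action-taking tools with no written justification."""
    return [t.name for t in tools if t.action_taking and not t.justification.strip()]
```

A check like `unjustified_action_tools` can run in CI so that an action-taking tool without a written justification fails the build.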
Risks¶
What could go wrong? For each risk, rate the likelihood (L/M/H) and impact (L/M/H), and describe the mitigation.
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| [e.g., Agent produces incorrect summaries that users act on without verification] | | | |
| [e.g., Prompt injection from retrieved content] | | | |
| [e.g., Model API outage] | | | |
| [e.g., Cost overrun from runaway tool calls] | | | |
Eval plan¶
How will you measure the agent's quality before and after deployment?
- Eval set location: [link to version-controlled eval file]
- Eval set size: [N cases at launch]
- Eval criteria: [Reference to rubric, e.g., eval-rubric-v1]
- Automated eval cadence: [On every deployment / daily / weekly]
- Human review cadence: [Weekly spot-check / monthly full review]
- Minimum pass rate to deploy: [e.g., 90% of golden cases, 100% of safety cases]
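One way to make the minimum pass rate enforceable rather than aspirational is a small gate script in CI. A sketch under assumed structure (each eval result is a dict with `passed` and `safety` keys; a real eval harness may store results differently):

```python
def deployment_gate(results: list[dict], golden_threshold: float = 0.90) -> bool:
    """Pass only if 100% of safety cases and >= golden_threshold of golden cases pass.

    Assumes each result looks like {"case_id": "...", "passed": bool, "safety": bool}.
    """
    safety = [r for r in results if r["safety"]]
    golden = [r for r in results if not r["safety"]]
    safety_ok = all(r["passed"] for r in safety)
    golden_rate = sum(r["passed"] for r in golden) / max(len(golden), 1)
    return safety_ok and golden_rate >= golden_threshold
```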
Rollout plan¶
How will the agent be deployed? Include stages, gating criteria, and rollback procedure.
| Stage | Audience | Gating criteria | Rollback trigger |
|---|---|---|---|
| Alpha | [Internal team only] | [All evals pass, red-team complete] | [Any safety failure] |
| Beta | [Invited external users] | [Alpha CSAT > X, no P0 incidents] | [Error rate > Y%] |
| GA | [All users] | [Beta metrics meet targets] | [Any P1 incident] |
Rollback procedure: [Describe in 2-3 sentences how to disable the agent quickly and how to notify users.]
Open questions¶
Questions that are unresolved at the time of writing. Assign each to an owner with a target resolution date.
| Question | Owner | Target date |
|---|---|---|
Filled example (abbreviated): Literature triage agent PRD¶
Problem¶
Clinical systematic review coordinators at academic medical centers spend 6-10 hours per review manually screening 200-400 abstracts for relevance before any full-text review begins. This is the slowest step in the pipeline, introduces variability between reviewers, and creates a bottleneck when coordinators are managing multiple reviews simultaneously.
Users¶
| User type | Description | Volume |
|---|---|---|
| Primary | Systematic review coordinators at academic medical centers | ~50 internal users |
| Secondary | Research librarians validating search strategies | ~10 internal users |
Success metrics¶
| Metric | Baseline | Target | Measurement method |
|---|---|---|---|
| Abstract screening time per review | 7 hours | < 2 hours | User self-report at task completion |
| Recall rate (relevant papers not excluded) | 95% (human baseline) | >= 95% | Held-out test set with expert labels |
| User satisfaction (CSAT) | N/A (new tool) | >= 4/5 | In-product rating after each session |
| False negative rate | 5% | <= 5% | Held-out test set |
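For the two test-set metrics, recall and false negative rate are complements over the expert-labeled held-out set, so one computation covers both. A minimal sketch (set names and paper IDs are illustrative):

```python
def recall_and_false_negative_rate(kept: set[str], relevant: set[str]) -> tuple[float, float]:
    """Compute recall and FNR against an expert-labeled held-out set.

    `relevant` holds the paper IDs experts marked relevant; `kept` holds the IDs
    the agent did not exclude. FNR is simply 1 - recall.
    """
    recall = len(kept & relevant) / len(relevant)
    return recall, 1.0 - recall

# Example: 95 of 100 relevant papers kept -> recall 0.95, FNR 0.05 (the 5% baseline above).
```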
Scope (v1)¶
- PubMed abstract retrieval by PICO query and date range.
- Relevance scoring and ranking of retrieved papers.
- Structured per-paper summary (background, methods, findings, limitations, evidence quality).
- Exclusion log with reasons.
Non-goals (v1)¶
- Full-text retrieval and analysis (planned for v2 — licensing complexity is out of scope now).
- Integration with systematic review management tools like Covidence (API access not yet provisioned).
- Multi-reviewer consensus tracking (requires a separate workflow layer).
Risks¶
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Agent excludes a relevant paper (false negative) | M | H | Recall rate eval with >= 95% gate; exclusion log is always shown to user for manual review |
| Prompt injection from abstract content | L | M | Delimited tool output, adversarial evals, output-only mode (no action tools) |
| PubMed API rate limits during large reviews | M | M | Retry with backoff, user-visible progress indicator, step budget cap |
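The rate-limit mitigation (retry with backoff plus a hard cap on attempts) is simple enough to sketch. Illustrative only; `fetch` stands in for whatever PubMed client call the agent actually makes:

```python
import time

def fetch_with_backoff(fetch, max_attempts: int = 5, base_delay: float = 1.0):
    """Retry a rate-limited call with exponential backoff and a hard attempt cap.

    `fetch` is any zero-argument callable (here, a stand-in for the PubMed request);
    the attempt cap doubles as a simple step budget for this tool.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise                                  # budget exhausted: surface the error
            time.sleep(base_delay * 2 ** attempt)      # wait 1s, 2s, 4s, ...
```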
Eval plan¶
- Eval set: `evals/lit-triage-golden-v2.jsonl` (18 cases) + `evals/lit-triage-adversarial-v1.jsonl` (6 cases)
- Rubric: `eval-rubric-v1`
- Automated eval: runs on every pull request to main
- Minimum to deploy: 100% of safety cases, >= 90% of golden cases
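For reference, a golden case in the eval file might look like the record below. The schema is an assumption for illustration; the actual fields in `evals/lit-triage-golden-v2.jsonl` may differ:

```python
# Hypothetical shape of one line in evals/lit-triage-golden-v2.jsonl.
golden_case = {
    "case_id": "golden-001",
    "pico_query": "adults with type 2 diabetes; SGLT2 inhibitors; placebo; HbA1c change",
    "date_range": "2020-01-01..2024-12-31",
    "expected_included_ids": ["PMID-A", "PMID-B"],    # placeholders, not real PMIDs
    "expected_excluded_ids": ["PMID-C"],
    "safety": False,                                  # True for adversarial/safety cases
}
```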
Rollout plan¶
| Stage | Audience | Gating criteria | Rollback trigger |
|---|---|---|---|
| Alpha | Research Eng team (5 people) | All evals pass, red-team session complete | Any output with fabricated citations |
| Beta | 10 invited review coordinators | Alpha CSAT >= 4/5, zero P0 incidents | Error rate > 10% or any safety failure |
| GA | All internal coordinators | Beta recall rate >= 95%, CSAT >= 4/5 | Any P1 incident or recall rate drop below 92% |
Rollback: set the `lit-triage-enabled` feature flag to false. Sessions in progress will complete; new sessions will route to the previous manual workflow. Users are notified via an in-product banner.
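A sketch of how the flag check might route new sessions while leaving in-progress ones untouched (function names and the banner mechanism are hypothetical):

```python
def notify_banner(user_id: str, message: str) -> None:
    # Stand-in for the in-product banner mechanism.
    print(f"[banner for {user_id}] {message}")

def start_session(user_id: str, flag_enabled: bool) -> str:
    """Route new sessions by the lit-triage-enabled flag; in-progress sessions are untouched."""
    if not flag_enabled:
        notify_banner(user_id, "Literature triage is temporarily unavailable; "
                               "please use the manual screening workflow.")
        return "manual-workflow"
    return "agent-workflow"
```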
Open questions¶
| Question | Owner | Target date |
|---|---|---|
| Will we use GPT-4o or Claude for the summarization step? Run cost/quality comparison. | Research Eng | 2026-05-20 |
| What is the minimum recall rate that coordinators consider acceptable? Survey 5 users. | PM | 2026-05-15 |