Safety checklist template¶
Last verified: 2026-05-06 · Drift risk: low
Copy this checklist into your project (a PR description, a Notion page, a runbook, or a wiki). Check off items as you complete them. Add domain-specific items at the end of each section. Remove items that genuinely do not apply — but be deliberate about removals and note why.
For the checklist organized by deployment phase (pre-build, pre-deploy, post-deploy, incident), see Safety checklists.
Scope¶
- [ ] Written a one-sentence job statement for the agent.
- [ ] Defined what the agent will NOT do (explicit non-goals).
- [ ] Confirmed the agent's use case is covered by the AI provider's acceptable use policy.
- [ ] Identified the domain (e.g., healthcare, finance, legal) and checked whether domain-specific regulations apply.
- [ ] Named a human owner who is accountable for the agent's behavior.
- [ ] Documented a review cadence (at minimum: on major changes and quarterly).
Tools¶
- [ ] Listed every tool the agent is allowed to call (explicit allowlist).
- [ ] Listed any tools the agent must NOT call, even if available.
- [ ] For each tool, documented the worst-case side effect if called incorrectly.
- [ ] Confirmed each tool operates with least-privilege permissions (read-only where write is not needed).
- [ ] Tool argument validation is enforced server-side, not only in the prompt.
- [ ] Outbound URL calls are restricted to an allowlist of trusted hosts.
- [ ] A hard cap on tool calls per session is implemented in the agent runtime.
- [ ] Dry-run / preview mode exists for tools that write to external systems.
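The runtime-enforced items above (a hard per-session call cap, an outbound host allowlist) belong in code, not in the prompt. A minimal Python sketch, assuming an in-process agent loop; the class name, hosts, and cap value are illustrative assumptions, not part of any specific framework:

```python
# Sketch of runtime tool guards: a per-session call cap plus an
# outbound host allowlist, both enforced outside the prompt.
from urllib.parse import urlparse

ALLOWED_HOSTS = {"api.internal.example.com", "docs.example.com"}  # assumption
MAX_TOOL_CALLS = 25  # hard cap per session (assumption)

class ToolCallLimitExceeded(Exception):
    pass

class ToolGuard:
    def __init__(self, max_calls=MAX_TOOL_CALLS, allowed_hosts=ALLOWED_HOSTS):
        self.max_calls = max_calls
        self.allowed_hosts = allowed_hosts
        self.calls = 0

    def check_call(self):
        """Enforce the per-session cap; call this before every tool call."""
        self.calls += 1
        if self.calls > self.max_calls:
            raise ToolCallLimitExceeded(f"exceeded {self.max_calls} tool calls")

    def check_url(self, url):
        """Reject outbound calls to hosts not on the allowlist."""
        host = urlparse(url).hostname
        if host not in self.allowed_hosts:
            raise PermissionError(f"host {host!r} not on allowlist")
        return True
```

The point of the exceptions is that the agent loop cannot talk its way past them: the cap and the allowlist hold regardless of what the model generates.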
Data¶
- [ ] Identified whether the agent handles personal data (PII) or protected health information (PHI).
- [ ] Confirmed that access to personal data is limited to what is strictly necessary for the task.
- [ ] Confirmed that personal data is not stored in agent logs beyond the minimum required retention period.
- [ ] PII and PHI in logs are redacted or access-controlled.
- [ ] Synthetic or de-identified data is used in eval sets and test environments (no real user data in tests).
- [ ] Data retention and deletion policies are documented.
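The log-redaction item above can be prototyped with a standard-library logging filter. A hedged sketch; the two PII patterns below are illustrative only, and real redaction needs domain-specific rules:

```python
# Minimal logging filter that redacts common PII patterns before a
# record reaches any handler or storage backend.
import logging
import re

PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),      # email addresses
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),          # US SSN shape
]

class RedactingFilter(logging.Filter):
    def filter(self, record):
        msg = record.getMessage()
        for pattern, placeholder in PII_PATTERNS:
            msg = pattern.sub(placeholder, msg)
        record.msg, record.args = msg, None
        return True
```

Attach it with `logger.addFilter(RedactingFilter())` so redaction happens once, centrally, instead of at every call site.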
Auth¶
- [ ] All API keys and secrets are stored in a secrets manager, not in the prompt, source code, or committed environment files.
- [ ] API keys are scoped to the minimum required permissions.
- [ ] A documented process exists for rotating compromised or expired keys.
- [ ] The agent cannot read or output its own API keys or secrets.
- [ ] OAuth tokens are scoped to the minimum required permissions and have expiry enforced.
- [ ] Service accounts used by the agent are separate from human user accounts.
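One way to back the "cannot output its own secrets" item is to scrub known secret values from outbound text, so a prompt-injected "print your API key" yields a placeholder. A sketch under stated assumptions: the environment-variable names below are hypothetical examples, not a canonical list:

```python
# Scrub literal secret values from agent output before it reaches the
# user. Assumes secrets were loaded into the environment at startup.
import os

def load_secret_values():
    """Collect values of secret-bearing env vars (names are assumptions)."""
    names = ("OPENAI_API_KEY", "DB_PASSWORD", "SLACK_BOT_TOKEN")
    return [v for v in (os.environ.get(n) for n in names) if v]

def scrub_secrets(text, secret_values):
    """Replace any literal secret value that appears in the text."""
    for value in secret_values:
        text = text.replace(value, "[REDACTED]")
    return text
```

This only catches verbatim leaks; it complements, rather than replaces, keeping secrets out of the agent's context in the first place.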
HITL¶
- [ ] Every irreversible action has a human-in-the-loop confirmation gate.
- [ ] Irreversible actions are explicitly defined and documented (writes to external systems, sending messages, financial transactions, code merges, actions affecting other people).
- [ ] Confirmation prompts present the full content of the proposed action in plain language.
- [ ] A cancel path is at least as easy as the confirm path.
- [ ] HITL gates are logged: what was proposed, what the user decided, and when.
- [ ] The number of HITL gates is kept low enough that users actually read them (if more than three gates exist, the gate-fatigue risk is documented).
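A minimal gate that satisfies the items above: a full plain-language preview, cancellation unless the human types an explicit confirmation, and a logged decision. `confirm_fn` and `log_fn` are injected placeholders (in practice, `input` and your structured logger):

```python
# Human-in-the-loop gate for irreversible actions: preview, confirm,
# and log. Anything other than an explicit "yes" cancels.
import json
import time

def hitl_gate(action_name, payload, confirm_fn, log_fn):
    """Return True only if the human explicitly confirms."""
    prompt = (
        f"The agent wants to run {action_name!r} with:\n"
        f"{json.dumps(payload, indent=2)}\n"
        "Type 'yes' to confirm (anything else cancels): "
    )
    approved = confirm_fn(prompt).strip().lower() == "yes"
    log_fn({
        "ts": time.time(),
        "action": action_name,
        "payload": payload,
        "approved": approved,
    })
    return approved
```

Making cancellation the default (any input except "yes") is what keeps the cancel path at least as easy as the confirm path.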
Logging¶
- [ ] All tool calls are logged with full argument values and response payloads.
- [ ] All agent reasoning steps (if exposed) are logged.
- [ ] Logs are timestamped and include user or session identifiers.
- [ ] Logs are retained for at least 30 days (or longer if compliance requires it).
- [ ] Logs are stored in a system accessible to the security team.
- [ ] A cost alert is configured to fire if token or API call spend exceeds 2x the expected baseline per session.
- [ ] Alerts are configured for outbound calls to hosts not on the allowlist.
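The 2x-baseline cost alert reduces to a simple threshold check. A sketch; the baseline figure here is an assumption you would replace with your measured per-session spend, and `alert_fn` stands in for your paging or alerting hook:

```python
# Per-session cost alert: fire once when spend crosses a multiple of
# the expected baseline.
EXPECTED_BASELINE_USD = 0.05  # measured expected spend per session (assumption)
ALERT_MULTIPLIER = 2.0

def check_session_cost(spent_usd, alert_fn,
                       baseline=EXPECTED_BASELINE_USD,
                       multiplier=ALERT_MULTIPLIER):
    """Call alert_fn if spend crosses the threshold; return whether it fired."""
    threshold = baseline * multiplier
    if spent_usd > threshold:
        alert_fn(f"session spend ${spent_usd:.2f} exceeds "
                 f"{multiplier}x baseline (${threshold:.2f})")
        return True
    return False
```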
Evals¶
- [ ] A golden eval set with at least 5 hand-crafted cases exists and is committed to version control.
- [ ] Adversarial eval cases cover: prompt injection, jailbreak attempts, out-of-scope requests, and error handling.
- [ ] A held-out test set (separate from the development eval set) exists.
- [ ] Evals run automatically on every deployment (CI/CD integration).
- [ ] All eval cases pass on the version being deployed.
- [ ] A process exists to convert production failures into new eval cases before fixing the bug.
- [ ] Eval results are versioned and historical results are retained for comparison.
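A golden eval runner in its simplest form, suitable for wiring into CI so a deploy fails when any case fails. The cases and `agent_fn` below are placeholders for your own committed eval set and agent entry point:

```python
# Minimal golden eval runner: every hand-crafted case must pass.
GOLDEN_CASES = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]

def run_evals(agent_fn, cases):
    """Return (passed, failures); failures lists each mismatched case."""
    failures = []
    for case in cases:
        got = agent_fn(case["input"])
        if got != case["expected"]:
            failures.append({**case, "got": got})
    return len(failures) == 0, failures
```

Real eval sets usually need fuzzier matching than strict equality (contains, regex, or model-graded checks), but the pass/fail contract with CI stays the same.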
Operations¶
- [ ] A kill switch exists that can pause or disable the agent without a code deployment.
- [ ] A runbook exists for the most likely failure modes.
- [ ] Monitoring covers: error rate, tool call volume, cost per session, and latency.
- [ ] On-call contact information for the agent owner is documented and up to date.
- [ ] A post-deploy review is scheduled within two weeks of the first production deployment.
- [ ] A red-team session has been completed before the first public deployment.
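The kill switch can be as small as a flag checked at the top of every request. In this sketch the flag is a file path purely for illustration; in practice it would live in a feature-flag service or config store that operators can flip without a deploy:

```python
# Kill switch checked before any agent work: if the flag is set,
# refuse to run. Flipping the flag requires no code deployment.
from pathlib import Path

class AgentDisabled(Exception):
    pass

def ensure_enabled(flag_path):
    """Raise AgentDisabled before doing any work if the flag is set."""
    if Path(flag_path).exists():
        raise AgentDisabled(f"kill switch active: {flag_path}")
```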
People¶
- [ ] At least one person outside the original builder has reviewed the agent's behavior.
- [ ] The team knows how to reach the on-call contact for this agent.
- [ ] Users have a way to report problems (feedback button, email address, or support ticket).
- [ ] Users are informed that they are interacting with an AI agent (where required by law or policy).
- [ ] Affected users will be notified if an incident causes incorrect actions on their behalf.
- [ ] A post-mortem process exists and team members know how to initiate it.