Skip to content

Safety checklists

Last verified: 2026-05-06 · Drift risk: low

Use these checklists at the stages where they appear. They are designed to be actionable: each item is a concrete thing you can verify or do, not a vague principle. Adapt them to your context — add items specific to your domain, remove items that do not apply.

For a copy-paste version suitable for embedding in a PR template or runbook, see the safety checklist template.


Pre-build checklist

Complete this before writing a line of agent code. Its purpose is to surface scope and risk issues while they are still cheap to resolve.

  • Written a one-sentence job statement for the agent (what it does, for whom, in what context).
  • Defined the agent's output types: read-only, advisory, or action-taking?
  • Listed every tool the agent will be allowed to call.
  • For each tool, documented the worst-case side effect if it is called incorrectly.
  • Identified which operations are irreversible (sends messages, modifies records, triggers payments, merges code).
  • Determined whether the agent needs access to personal data (PII, PHI, financial data) and whether that access is necessary for the task.
  • Identified the data retention requirements for agent logs and conversation history.
  • Named an owner who is accountable for the agent's behavior.
  • Confirmed that a human-in-the-loop gate is required for each irreversible operation.
  • Checked whether your use case is covered by your AI provider's acceptable use policy.

Pre-deploy checklist

Complete this before moving any version of the agent into a production or user-facing environment.

Scope and prompt

  • System prompt includes an explicit description of what the agent will and will not do.
  • System prompt includes explicit refusal behavior: what the agent says when asked to do something outside scope.
  • System prompt names the tools the agent may use (explicit allowlist, not just "use tools as needed").
  • System prompt names any tools the agent may NOT use.
  • Tested that the agent refuses requests outside its defined scope.

Tool configuration

  • Every tool is configured with least-privilege access (read-only where write is not needed, scoped API keys, no admin credentials).
  • Tool argument validation is in place server-side, not just in the prompt.
  • Outbound URL calls are restricted to an allowlist of trusted hosts.
  • There is a hard cap on the number of tool calls per agent session.

Evals

  • A golden eval set exists with at least 5 cases covering core tasks.
  • Adversarial eval cases cover prompt injection, jailbreak attempts, and out-of-scope requests.
  • All eval cases pass on the version being deployed.
  • Evals are committed to version control and tagged to match the deployment version.

Human-in-the-loop

  • Every irreversible action has a HITL confirmation gate in place.
  • The confirmation gate presents the user with a plain-language description of what will happen.
  • A dry-run mode exists for operations that write to external systems.

Auth and secrets

  • All API keys and secrets are stored in a secrets manager, not in the prompt, code, or environment files committed to version control.
  • API keys are scoped to the minimum required permissions.
  • There is a documented process for rotating keys.
  • The agent cannot read or output its own API keys.

Logging

  • All tool calls are logged with full argument values and response payloads.
  • Logs are retained for at least 30 days (or longer if your compliance requirements demand it).
  • Logs are stored in a system accessible to the security team, not only to the agent operator.
  • PII in logs is redacted or access-controlled.

Post-deploy checklist

Complete this within the first two weeks of a production deployment, and revisit on a regular cadence (monthly is reasonable for most agents).

  • Evals are running automatically on every deployment (CI/CD integration).
  • Production logs are being reviewed at least weekly for anomalies.
  • A cost alert is configured to fire if token or API call spend exceeds 2x the expected baseline.
  • A user feedback channel exists and is monitored.
  • At least one person (not the original builder) has reviewed the agent's behavior in production.
  • A red-team session has been completed since the last major change to the prompt, tools, or model version.
  • The agent spec is up to date and matches the deployed configuration.
  • Owner contact information is documented and reachable.

Incident checklist

Use this if you suspect or confirm that the agent has behaved in an unintended way. See also Incident response for a fuller workflow.

  • Agent has been paused or rate-limited to prevent further harm.
  • All affected tool integrations have been temporarily disabled.
  • Logs covering the incident time window have been preserved (do not rotate or delete).
  • Timeline of events has been reconstructed from logs.
  • Root cause has been identified: prompt failure, tool misconfiguration, eval gap, or external attack?
  • Affected users or systems have been identified.
  • Users affected by erroneous actions have been notified if required.
  • A new eval case has been written to cover the failure mode.
  • A fix has been deployed and validated against the new eval case.
  • A post-mortem document has been written and shared with stakeholders.
  • The safety checklist has been updated to reflect any new lessons.