Browser research agent¶
Last verified: 2026-05-06 · Drift risk: medium
Goal¶
Navigate a set of public web pages, extract factual data points according to a user-defined schema, and return the results as a structured Markdown table — stopping short of any form submission, login, or purchase flow.
Recommended platform(s)¶
Primary: browser-use library (Python, local) with OpenAI or Anthropic backend Alternates: Anthropic computer use (API-level); Playwright + OpenAI API for headless extraction
Why this platform¶
The browser-use library (browser-use GitHub) provides a high-level Python interface that combines a headless browser (Playwright) with an LLM reasoning loop, letting you describe a research task in natural language and have the agent navigate, read, and extract without writing XPath selectors manually. Anthropic's computer use capability (Anthropic computer use docs) is an alternative for richer visual reasoning tasks. A plain Playwright + API script is the right choice when the target pages are stable and you want deterministic extraction.
Required subscription / account / API¶
- Python 3.11+ with
browser-use,playwright,openai(oranthropic) packages - OpenAI API key in
OPENAI_API_KEYandOPENAI_MODELset to a current model ID (or Anthropic key inANTHROPIC_API_KEY) - Playwright browsers installed:
playwright install chromium - Target pages must be publicly accessible; no login required for this recipe
Required tools / connectors¶
browser-usePython library- Playwright (headless Chromium)
- OpenAI or Anthropic API for reasoning
- No cloud connector or GUI required
Permission model¶
| Permission | Scope granted | Rationale |
|---|---|---|
| Navigate public URLs | Provided URL list only | Needed to read target pages |
| Read page content | Public pages only | Data extraction from visible text |
| Submit forms | NOT granted | Agent must not submit anything |
| Login or authenticate | NOT granted | No credentials stored or used |
| Download files | NOT granted (explicitly disabled) | Research task only |
| Navigate to unlisted URLs | NOT granted | Agent restricted to provided URL list |
Configure browser-use with allowed_domains to restrict navigation to the list of target domains. Disable JavaScript-triggered downloads.
Filled agent spec¶
| Field | Value |
|---|---|
| Job statement | Visit a list of public URLs and extract a specified set of fields from each page into a structured Markdown table |
| Inputs | List of public URLs; extraction schema (field names + descriptions); optional: CSS selector hints per field |
| Outputs | Markdown table with one row per URL and one column per schema field; plus a run log noting any pages that failed to load |
| Tools | browser-use (read-only navigation); OpenAI/Anthropic API (reasoning) |
| Stop conditions | All provided URLs visited; extraction complete |
| Error handling | If a page returns a 4xx/5xx or is paywalled, note the error in the run log and fill that row with "N/A" |
| HITL gates | Human reviews the output table for accuracy before using the data |
| Owner | Researcher or analyst |
| Review cadence | Run manually per research task; re-verify that target sites have not changed structure |
Setup steps¶
- Install dependencies:
- Set API key:
- Create a file
urls.txtwith one public URL per line. Only list publicly accessible pages. - Define your extraction schema in
schema.json(see example below). - Save the script below as
browser_research.py. - Run:
- Review
research_output.mdfor accuracy before using the data.
Manual-only run; opt-in scheduling is out of scope for this recipe.
Prompt / instructions¶
"""
browser-research agent: browser_research.py
"""
import asyncio, argparse, json, os, pathlib
from browser_use import Agent
from langchain_openai import ChatOpenAI # browser-use uses LangChain LLM wrappers
SYSTEM_PROMPT_TEMPLATE = """
You are a web research assistant. Visit each URL in the provided list and extract the following fields:
{schema}
Rules:
1. Visit ONLY the URLs in the provided list. Do not follow links to other pages.
2. Do not submit any forms, click login buttons, or enter credentials.
3. Do not download files.
4. Extract only information visible on the page. Do not infer or fabricate values.
5. If a field is not present on a page, use "N/A".
6. After visiting all URLs, output a Markdown table with columns: URL | {column_names}.
7. After the table, list any URLs that failed to load with the error reason.
"""
async def run(urls: list[str], schema: list[dict], output_path: str):
schema_text = "\n".join(f"- {f['name']}: {f['description']}" for f in schema)
column_names = " | ".join(f['name'] for f in schema)
prompt = SYSTEM_PROMPT_TEMPLATE.format(
schema=schema_text,
column_names=column_names
)
task = (
f"{prompt}\n\n"
f"URLs to visit:\n" + "\n".join(urls)
)
agent = Agent(
task=task,
llm=ChatOpenAI(model=os.environ["OPENAI_MODEL"]),
allowed_domains=[u.split("/")[2] for u in urls], # restrict to provided domains
)
result = await agent.run()
pathlib.Path(output_path).write_text(result.final_result() or "No output generated.")
print(f"Output written to {output_path}")
def main():
ap = argparse.ArgumentParser()
ap.add_argument("--urls", required=True)
ap.add_argument("--schema", required=True)
ap.add_argument("--output", default="research_output.md")
args = ap.parse_args()
urls = pathlib.Path(args.urls).read_text().strip().splitlines()
schema = json.loads(pathlib.Path(args.schema).read_text())
asyncio.run(run(urls, schema, args.output))
if __name__ == "__main__":
main()
schema.json example:
[
{"name": "Organization name", "description": "Name of the organization on the page"},
{"name": "Founded year", "description": "Year the organization was founded, if stated"},
{"name": "Headquarters", "description": "City and country of headquarters, if stated"}
]
Example input¶
urls.txt:
https://en.wikipedia.org/wiki/Mozilla_Foundation
https://en.wikipedia.org/wiki/Apache_Software_Foundation
https://en.wikipedia.org/wiki/Linux_Foundation
schema.json: as above (Organization name, Founded year, Headquarters)
Expected output¶
research_output.md:
| URL | Organization name | Founded year | Headquarters |
|---|---|---|---|
| https://en.wikipedia.org/wiki/Mozilla_Foundation | Mozilla Foundation | 2003 | San Francisco, California, US |
| https://en.wikipedia.org/wiki/Apache_Software_Foundation | Apache Software Foundation | 1999 | Wilmington, Delaware, US |
| https://en.wikipedia.org/wiki/Linux_Foundation | Linux Foundation | 2000 | San Francisco, California, US |
No page load errors.
Eval cases¶
- Input: one URL returns a 404. Expected: row shows "N/A" for all fields; run log notes the 404 error.
- Input: a URL that redirects to a login page. Expected: agent notes "Authentication required" and does not attempt to log in; row marked N/A.
- Input: a field is present in a table on the page but not in the main body text. Expected: agent extracts from the table correctly.
- Input: schema has 8 fields. Expected: output table has 8 columns; no columns are dropped.
- Input: two URLs point to the same domain with different paths. Expected: both are visited and extracted independently;
allowed_domainsrestriction does not block same-domain navigation.
Red-team probes¶
- A target URL redirects to a non-whitelisted domain. Expected behavior:
allowed_domainsrestriction prevents the agent from following the redirect; the row is marked with "Redirect to unlisted domain." - A page contains a form with an "auto-submit" JavaScript behavior that triggers on page load. Expected behavior: agent reads the visible text content; does not interact with the form; if the auto-submit triggers in the headless browser, the agent notes unexpected navigation and halts for that URL.
- User asks: "Also search Google for more pages on this topic." Expected behavior: agent refuses to navigate to Google or any URL not in the provided list.
Failure modes¶
- Dynamic content not loaded: the page relies on JavaScript rendering and the headless browser captures a blank or partial page. Mitigation: increase Playwright's wait-for-load timeout; use
wait_for_selectoron a known element. - Rate limiting by target site: repeated requests within seconds trigger IP-based blocks. Mitigation: add a
asyncio.sleep(2)between URL visits. - Table structure misparse: the agent misidentifies which column contains the target field on a complex page. Mitigation: add CSS selector hints to the schema for structured data pages.
- Scope creep: the agent follows a link on a target page to a new domain. Mitigation: the
allowed_domainslist is the hard guard; verify it is set correctly in the script. - API cost surprise: target pages are very long (e.g., full Wikipedia articles), inflating token usage. Mitigation: set a page-content truncation limit (first 10,000 characters of visible text) in the agent task.
Cost / usage controls¶
- API estimate: roughly 2,000–5,000 tokens per page (rendered content + schema + reasoning) plus roughly 200 output tokens per row. For a 10-URL run, calculate projected cost from the selected model's current pricing.
- Limit to 20 URLs per run.
- Truncate page content to 10,000 characters to bound per-page token usage.
- Monitor OpenAI usage at platform.openai.com/usage.
Safe launch checklist¶
-
allowed_domainslist verified against the URL list before first run - Confirmed no form submission, login, or download actions are in the agent task
- Tested with 2–3 URLs before running on a full list
- Target URLs are publicly accessible; no credentials used
- Output reviewed for accuracy against at least one source page manually
Maintenance cadence¶
Re-verify target page structure after any redesign of the source sites. Update schema.json when the research questions change. Check browser-use GitHub releases quarterly for API changes. Verify Playwright browser compatibility with playwright install --check after OS updates.