Research data dictionary agent¶
Last verified: 2026-05-06 · Drift risk: medium
Goal¶
Given a folder of CSV files, this agent inspects each file, infers column data types, computes cardinality (unique-value count), measures missingness (null/empty-string rate), and produces a single Markdown data dictionary that a data analyst or data engineer can immediately attach to a project wiki. The agent reads files locally, sends column-level summaries to the language model, and writes one output file — it does not modify source data.
Recommended platform(s)¶
Primary: OpenAI Agents SDK (Python, local execution).
Alternates: Gemini API with function calling; Claude via the Anthropic Python SDK with tool use.
Why this platform¶
The OpenAI Agents SDK provides a clean tool-calling loop that maps naturally to the "read a file, compute stats, accumulate results" pattern. The structured-outputs feature (OpenAI structured outputs) lets you enforce a typed ColumnEntry schema so the agent cannot silently omit required fields. The SDK's built-in tracing makes it easy to audit which files were processed and in what order without standing up extra infrastructure.
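A minimal sketch of what such a typed `ColumnEntry` schema could look like, assuming pydantic for the model definitions (the field names mirror the tool output in the script below; the class names themselves are illustrative, not part of the recipe's code):

```python
from pydantic import BaseModel, Field


class ColumnEntry(BaseModel):
    # One documented column; every field is required, so the model
    # cannot silently omit a statistic when structured outputs are enforced.
    name: str
    dtype: str
    null_pct: float = Field(ge=0.0, le=100.0)
    unique_count: int = Field(ge=0)
    sample_values: list[str] = Field(max_length=5)


class FileEntry(BaseModel):
    file: str
    row_count: int
    columns: list[ColumnEntry]
```

Passing a schema like this as the agent's output type turns "the model forgot a field" into a validation error instead of a silent gap.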
Required subscription / account / API¶
- OpenAI API key and `OPENAI_MODEL` set to a current model ID.
- No third-party connectors required.
Required tools / connectors¶
- `list_csv_files(folder_path: str) -> list[str]` — returns relative paths of all `.csv` files under the given folder.
- `read_csv_sample(file_path: str, max_rows: int) -> dict` — reads up to `max_rows` rows and returns column names, dtypes, null counts, unique counts, and five sample values per column.
- No network access needed; all I/O is local.
Both tools are implemented as plain Python functions decorated with `@function_tool` from the Agents SDK.
Permission model¶
| Permission | Scope | Rationale |
|---|---|---|
| File read | Input folder only | Agent must read CSVs; no other directory access needed. |
| File write | Single output path | Writes the finished Markdown dictionary; no other write access. |
| Network | None | All processing is local; no external calls required. |
| Env vars | OPENAI_API_KEY only | Scoped to the agent process via .env; never logged. |
Run the agent as a non-root user. Pass the input folder as an argument, not as an environment variable, so it appears in the process invocation log.
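One way to clamp the input folder to a known working directory is to resolve the path before any file access, sketched here with `Path.resolve` and `is_relative_to` (Python 3.9+); the `ALLOWED_ROOT` name is an assumption for illustration:

```python
from pathlib import Path

ALLOWED_ROOT = Path.cwd()  # assumed allowlist: the process working directory


def validate_input_folder(folder: str) -> Path:
    # Resolve symlinks and ".." segments before checking containment,
    # so "../../etc" cannot escape the allowed root.
    resolved = Path(folder).resolve()
    if not resolved.is_relative_to(ALLOWED_ROOT.resolve()):
        raise PermissionError(f"{folder!r} is outside the allowed root")
    return resolved
```

Calling this once at startup, before the agent loop runs, also covers the path-traversal probe in the red-team section.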
Filled agent spec¶
| Field | Value |
|---|---|
| Job statement | Read all CSVs in a given folder, produce a Markdown data dictionary with type, cardinality, missingness, and five sample values for every column. |
| Inputs | Folder path (string); optional max_rows cap (default 500). |
| Outputs | data_dictionary.md written to a caller-specified output path. |
| Tools | list_csv_files, read_csv_sample |
| Stop conditions | All CSV files processed; output file written; no unresolved tool errors. |
| Error handling | If a file cannot be parsed (encoding errors, binary content), log a warning row in the dictionary and continue. |
| HITL gates | Human reviews the finished Markdown before committing it to the project wiki. |
| Owner | Data engineer or analyst who triggered the run. |
| Review cadence | Re-run whenever the source data schema changes; compare output against previous version. |
Setup steps¶
- Create a virtual environment and install dependencies (package names assumed from the script's imports): `pip install openai-agents pandas python-dotenv`.
- Create a `.env` file with `OPENAI_API_KEY=<your-key>` and `OPENAI_MODEL=REPLACE_WITH_CURRENT_MODEL`. Add `.env` to `.gitignore`.
- Create `data_dict_agent.py` (see the Prompt / instructions section below).
- Place your CSV files in a folder, for example `./sample_data/`.
- Run: `python data_dict_agent.py --input ./sample_data --output data_dictionary.md`.
- Open `data_dictionary.md` in your editor or wiki and review.
Prompt / instructions¶
```python
# data_dict_agent.py
import argparse
import os
from pathlib import Path

import pandas as pd
from dotenv import load_dotenv

from agents import Agent, Runner, function_tool

load_dotenv()


@function_tool
def list_csv_files(folder_path: str) -> list[str]:
    """Return all .csv file paths inside folder_path."""
    p = Path(folder_path)
    return [str(f) for f in p.rglob("*.csv")]


@function_tool
def read_csv_sample(file_path: str, max_rows: int = 500) -> dict:
    """Read up to max_rows rows of a CSV and return column statistics."""
    try:
        df = pd.read_csv(file_path, nrows=max_rows, low_memory=False)
    except Exception as e:
        return {"error": str(e), "file": file_path}
    stats = {}
    for col in df.columns:
        series = df[col]
        stats[col] = {
            "dtype": str(series.dtype),
            "null_count": int(series.isna().sum()),
            "null_pct": round(series.isna().mean() * 100, 2),
            "unique_count": int(series.nunique()),
            "sample_values": series.dropna().astype(str).unique()[:5].tolist(),
        }
    return {"file": file_path, "row_count": len(df), "columns": stats}


SYSTEM_PROMPT = """
You are a data-documentation agent. Your job is to produce a complete Markdown data
dictionary from a folder of CSV files.

Steps:
1. Call list_csv_files to get all CSV paths.
2. For each path, call read_csv_sample.
3. For each file, emit a level-2 heading with the filename.
4. Under each heading, emit a Markdown table with columns:
   Column | Type | Null % | Cardinality | Sample values
5. After all files, emit a brief "Coverage notes" section noting any files that
   could not be parsed.
6. Do not invent data. Only report what the tools return.
7. Respond with the complete Markdown text and nothing else.
"""


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", default="data_dictionary.md")
    args = parser.parse_args()
    agent = Agent(
        name="DataDictAgent",
        model=os.environ["OPENAI_MODEL"],
        instructions=SYSTEM_PROMPT,
        tools=[list_csv_files, read_csv_sample],
    )
    result = Runner.run_sync(agent, f"Process all CSVs in: {args.input}")
    Path(args.output).write_text(result.final_output)
    print(f"Written to {args.output}")


if __name__ == "__main__":
    main()
```
Example input¶
```text
sample_data/
  customers.csv   # 1 200 rows: id, name, email, signup_date, plan
  orders.csv      # 8 400 rows: order_id, customer_id, amount, status, created_at
  products.csv    # 340 rows: sku, name, category, price, stock_qty
```
Expected output¶
A data_dictionary.md file containing three level-2 sections. Each section holds a table listing every column, its inferred dtype (int64, object, float64, etc.), null percentage, unique-value count, and up to five sample values. A "Coverage notes" section at the end confirms all three files were processed with no errors.
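As a shape-only illustration of one such section (all cell values here are invented placeholders, not real output), a file's table could render like:

```markdown
## customers.csv

| Column | Type | Null % | Cardinality | Sample values |
|---|---|---|---|---|
| id | int64 | 0.0 | 1200 | 1, 2, 3, 4, 5 |
| plan | object | 2.5 | 3 | free, pro, team |
```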
Eval cases¶
- Single-column CSV with no nulls — expects a table with one row and `null_pct = 0.0`.
- CSV with all-null column — expects `null_pct = 100.0` and `unique_count = 0`.
- Folder with zero CSV files — agent should return an empty dictionary with a note "No CSV files found."
- CSV with mixed-type column (integers and strings) — agent must report `dtype = object`, not crash.
- CSV with 50 000 rows — `max_rows = 500` cap must apply; agent must note "sampled 500 of 50 000 rows."
- CSV with non-UTF-8 encoding — agent logs a warning row in "Coverage notes" instead of crashing.
- Nested folder with CSVs in subdirectories — `list_csv_files` uses `rglob`; all files must appear.
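The per-column arithmetic behind the all-null and mixed-type eval cases can be checked standalone. The helper below re-implements the stats logic from `read_csv_sample` for testing purposes (it is a mirror of that logic, not the tool itself):

```python
import io

import pandas as pd


def column_stats(series: pd.Series) -> dict:
    # Mirrors the per-column logic inside read_csv_sample.
    return {
        "dtype": str(series.dtype),
        "null_pct": round(series.isna().mean() * 100, 2),
        "unique_count": int(series.nunique()),
    }


# All-null column: pandas parses empty cells as NaN.
df = pd.read_csv(io.StringIO("a,b\n1,\n2,\n"))
print(column_stats(df["b"]))  # null_pct 100.0, unique_count 0

# Mixed-type column: integers and strings together infer as object.
df2 = pd.read_csv(io.StringIO("c\n1\nx\n"))
print(column_stats(df2["c"])["dtype"])
```

Running checks like these against the helper before each release catches regressions in pandas dtype inference without invoking the model at all.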
Red-team probes¶
- Path traversal: pass `--input ../../etc/` — the `list_csv_files` tool must validate that the resolved path stays inside the intended working directory; if it does not, the recipe implementation should add an explicit allowlist check.
- Oversized file: provide a 500 MB CSV — the `max_rows` cap should prevent memory exhaustion; confirm the tool returns within 60 seconds.
- Malformed tool response injection: manually return a dict with a `columns` key containing SQL injection strings — the agent must render them as plain text in the Markdown output, not execute them.
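For the injection probe, escaping sample values before they enter a Markdown table cell is a cheap defense-in-depth measure on top of the prompt instruction; `md_cell` is a hypothetical helper name, not part of the recipe's script:

```python
def md_cell(value) -> str:
    # Neutralize characters that would break or restructure a Markdown
    # table row: backslashes first, then pipes, then newlines.
    return (
        str(value)
        .replace("\\", "\\\\")
        .replace("|", "\\|")
        .replace("\n", " ")
    )


print(md_cell("a|b\nDROP TABLE users;"))
```

Applying this to every sample value in post-processing means even a hostile tool response can only produce odd-looking text, never a malformed or executable artifact.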
Failure modes¶
- Schema-mismatch hallucination: the model invents column names not returned by the tool. Mitigation: prompt says "Only report what the tools return"; add a post-processing check that every column name in the output exists in the tool response.
- Missing file warning omitted: if a file errors and the model silently skips it, the dictionary looks complete but has gaps. Mitigation: assert that the output contains one section per file returned by `list_csv_files`.
- Token-limit overflow: a CSV with hundreds of columns may produce a tool response that exceeds the context window. Mitigation: set a `max_columns` cap (e.g., 100) in `read_csv_sample` and note "truncated" in the output.
- Dtype mis-inference: pandas infers a date column as `object` if formats are mixed. Mitigation: include a note in the Coverage section when `unique_count / row_count < 0.05` for an object column (likely a category).
- Silent encoding fallback: `pd.read_csv` may silently replace bad bytes. Mitigation: use `encoding_errors="strict"` and catch the exception explicitly.
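The one-section-per-file assertion can be a short post-processing check; the function name and heading convention (filename as the level-2 heading text) are assumptions consistent with the system prompt:

```python
import re


def missing_sections(markdown: str, csv_paths: list[str]) -> list[str]:
    # Collect every level-2 heading the agent emitted.
    headings = set(re.findall(r"^## (.+)$", markdown, flags=re.MULTILINE))
    # Each file must appear as a section, keyed by its base filename.
    return [p for p in csv_paths if p.rsplit("/", 1)[-1] not in headings]


doc = "## customers.csv\n...\n## orders.csv\n..."
print(missing_sections(doc, ["data/customers.csv", "data/orders.csv", "data/products.csv"]))
```

A non-empty return value means the run should be rejected before the dictionary reaches the wiki.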
Cost / usage controls¶
- Use a cheaper current-generation small model for routine runs; reserve a stronger model for dictionaries with more than 50 tables.
- Set `max_tokens=4096` on the completion request to cap per-run spend.
- A 10-file, 50-column dataset is usually a moderate request; calculate dollar cost from token count and the selected model's current pricing.
- Log token usage from `result.usage` after each run.
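Converting logged token counts to dollars is simple arithmetic; the rates below are placeholders, not real prices — substitute your selected model's current per-million-token pricing:

```python
def run_cost_usd(input_tokens: int, output_tokens: int,
                 in_price_per_m: float, out_price_per_m: float) -> float:
    # Prices are quoted per million tokens, separately for input and output.
    return ((input_tokens / 1_000_000) * in_price_per_m
            + (output_tokens / 1_000_000) * out_price_per_m)


# Example: 40k input tokens, 6k output tokens at placeholder rates.
print(round(run_cost_usd(40_000, 6_000, in_price_per_m=0.50, out_price_per_m=1.50), 4))
```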
Safe launch checklist¶
- `.env` is in `.gitignore`; no key appears in source code.
- Input folder path is validated against an allowlist or clamped to a known working directory.
- `read_csv_sample` has a `max_rows` cap and a `max_columns` cap.
- Output path is reviewed by a human before the file is committed to a wiki or shared repo.
- Agent tracing is enabled; trace logs are stored for 30 days.
- A test run on synthetic data produces correct output before running on real data.
Maintenance cadence¶
Re-verify this recipe when: the OpenAI Agents SDK releases a new major version (check the SDK changelog); pandas releases a breaking dtype-inference change; or the source data schema changes significantly. At each check, run the seven eval cases above and confirm all pass. Confirm that your selected `OPENAI_MODEL` is still available and appropriate.