How do I build tool-using agents with OpenAI?

Building tool-using agents with OpenAI means giving models structured ways to act in the world: calling APIs, querying databases, sending emails, orchestrating workflows, and more. Instead of just answering in natural language, your agent can reason about when and how to use tools, then execute them safely and reliably.

This guide walks through the core concepts, architecture patterns, and practical steps to create robust, tool-using agents with OpenAI models.

What is a tool‑using agent?

A tool‑using agent is an AI system that:

Receives a user request (prompt, message, or event).
Decides what needs to be done to satisfy that request.
Selects and calls one or more tools (APIs, databases, services).
Interprets the tool results and produces a final response or action.

In OpenAI’s ecosystem, tools are typically exposed to the model via:

Structured function calls (Tools/“functions” in the Chat Completions or Responses API).
GPT Actions (for GPTs in ChatGPT / custom GPTs).
External orchestration code (your backend, workflow engines, etc.).

The model infers, from your tool definitions and system instructions, when to call which tools and how to pass parameters. No retraining is involved: the definitions are read as part of each request.

Key building blocks for tool‑using agents

To build agents that can reliably use tools with OpenAI, you’ll generally combine four elements:

Model – e.g., gpt-4.1, gpt-4.1-mini, or a specialized model.
Tools / functions – JSON schemas that describe the operations your agent can perform.
Orchestrator – code that:
- Collects user input and context.
- Calls the OpenAI API.
- Executes tools when the model requests them.
- Feeds tool results back to the model.
Guidance and constraints – system prompts, tool descriptions, and policies that keep the agent aligned with your goals and safety requirements.

Choosing the right OpenAI APIs and models

Models for reasoning and tool use

For most tool-using agents, you’ll want a model that is:

Strong at reasoning – to decide when and how to use tools.
Reliable with function calling – to generate valid JSON arguments.

Recommended general-purpose choices:

gpt-4.1 – Best for robust reasoning, complex workflows, and higher accuracy in tool usage.
gpt-4.1-mini – Lower-cost, faster option that still supports tools and strong reasoning for many use cases.

You can also mix models, for example:

A smaller model (gpt-4.1-mini) for quick classification or routing.
A stronger model (gpt-4.1) for complex decision-making or long tool chains.
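A minimal sketch of this kind of routing follows. It is only an illustration: in practice the smaller model itself could classify the request, whereas here a keyword heuristic stands in so the example runs offline. The `pick_model` function, the hint phrases, and the thresholds are all assumptions, not OpenAI recommendations.

```python
SMALL_MODEL = "gpt-4.1-mini"
LARGE_MODEL = "gpt-4.1"

# Hypothetical phrases that suggest a multi-step request.
MULTI_STEP_HINTS = ("and then", "after that", "for each", "every")


def pick_model(user_message: str, expected_tool_calls: int = 1) -> str:
    """Return the model name to use for this request."""
    text = user_message.lower()
    looks_multi_step = any(hint in text for hint in MULTI_STEP_HINTS)
    # Escalate to the larger model for long, multi-step, or tool-heavy requests.
    if expected_tool_calls > 2 or looks_multi_step or len(text) > 500:
        return LARGE_MODEL
    return SMALL_MODEL


print(pick_model("List my open tasks"))
print(pick_model("Find overdue tasks and then email a summary to my manager"))
```

The first request stays on the smaller model; the second contains "and then" and is routed to the larger one.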

Core APIs to know

Chat Completions / Responses API with tools – Define tools as functions and let the model decide when to call them.
GPT Actions (for ChatGPT GPTs) – Use Actions to connect custom GPTs directly to your APIs (great for internal tools or no-backend prototyping).
Data retrieval with GPT Actions – For agents that need to search or query knowledge bases, you can expose retrieval operations as tools.

Designing tools for your agent

How you design tools heavily impacts your agent’s effectiveness. Tools should be:

Well-scoped – each tool does one clear thing.
Predictable – consistent behavior for the same inputs.
Well-described – clear natural-language descriptions and parameter docs.

Example: Simple tools for a task assistant

Imagine a task management agent. You might define tools like:

create_task – add a new task to the system.
list_tasks – retrieve tasks by filter.
update_task_status – mark a task as complete, etc.

Each tool gets a JSON Schema definition (for the API) or an equivalent OpenAPI operation (for GPT Actions):

{
  "type": "function",
  "name": "create_task",
  "description": "Create a new task for the user with an optional due date.",
  "parameters": {
    "type": "object",
    "properties": {
      "title": {
        "type": "string",
        "description": "Short description of the task."
      },
      "due_date": {
        "type": "string",
        "format": "date-time",
        "description": "ISO 8601 due date, optional."
      },
      "priority": {
        "type": "string",
        "enum": ["low", "medium", "high"],
        "description": "Priority of the task."
      }
    },
    "required": ["title"]
  }
}

Good descriptions make it easier for the model to:

Decide when to use the tool.
Fill arguments correctly.
Avoid misuse or ambiguous calls.
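The two APIs expect slightly different wrappers around a definition like create_task above: the Responses API takes the function fields at the top level next to "type", while Chat Completions nests them under a "function" key. The sketch below shows both shapes; the helper names are hypothetical and the schema is trimmed for brevity.

```python
# Trimmed version of the create_task definition shown above.
create_task = {
    "name": "create_task",
    "description": "Create a new task for the user with an optional due date.",
    "parameters": {
        "type": "object",
        "properties": {
            "title": {"type": "string", "description": "Short description of the task."},
        },
        "required": ["title"],
    },
}


def _function_fields(definition: dict) -> dict:
    """Drop any top-level "type" so the definition can be re-wrapped cleanly."""
    return {key: value for key, value in definition.items() if key != "type"}


def to_responses_tool(definition: dict) -> dict:
    """Responses API shape: function fields sit next to "type"."""
    return {"type": "function", **_function_fields(definition)}


def to_chat_completions_tool(definition: dict) -> dict:
    """Chat Completions shape: function fields are nested under "function"."""
    return {"type": "function", "function": _function_fields(definition)}


print(to_responses_tool(create_task)["name"])
print(to_chat_completions_tool(create_task)["function"]["name"])
```

Keeping one flat definition and wrapping it per API avoids maintaining two copies of every description.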

Core architecture pattern: tool loop

Most tool-using agents follow a common loop:

Collect input
- User sends a message.
- Or an external event triggers the agent.
Call OpenAI with tools enabled
- Provide:
  - System instructions (role, policies, capabilities).
  - User or event message.
  - Tool definitions.
Inspect model response
- If it returns a tool call, execute the tool in your backend.
- If it returns a final answer, send that to the user or system.
Handle tool results
- Send tool outputs back to the model as a tool message (Chat Completions) or a function_call_output item (Responses API).
- Ask the model to synthesize, interpret, or decide the next action.
Repeat (if needed)
- The model may call more tools or generate the final response.

Pseudocode example

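import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Conversation state; the Responses API takes this as `input`, not `messages`
input_items = [
    {"role": "system", "content": "You are a task assistant that manages user tasks using tools."},
    {"role": "user", "content": "Remind me to pay rent on the first of every month."}
]

# Placeholders for your function tool definitions (see the create_task schema above)
tools = [create_task_schema, list_tasks_schema, update_task_status_schema]

while True:
    response = client.responses.create(
        model="gpt-4.1",
        input=input_items,
        tools=tools
    )

    # Keep the model's output items (including function calls) in the history
    input_items += response.output

    tool_calls = [item for item in response.output if item.type == "function_call"]
    if not tool_calls:
        # No tool requested: this is the final answer to the user or system
        final_message = response.output_text
        break

    for call in tool_calls:
        # Arguments arrive as a JSON string; execute_tool is your backend dispatcher
        tool_result = execute_tool(call.name, json.loads(call.arguments))

        # Return the result, linked to the originating call by call_id
        input_items.append({
            "type": "function_call_output",
            "call_id": call.call_id,
            "output": json.dumps(tool_result)
        })
    # Loop again so the model can interpret the tool results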

Implementation details depend on the exact SDK and API version you use, but the pattern is the same: let the model decide calls; you execute them and return results.
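The loop relies on an execute_tool function that the pseudocode leaves undefined. One possible shape is a registry that maps tool names to Python callables and converts failures into structured results instead of raising. This is a sketch under assumptions: the in-memory TASKS list, the handler bodies, and the error codes are all illustrative.

```python
import json
from typing import Any, Callable, Optional, Union

# In-memory stand-in for a real task store (assumption for this sketch).
TASKS: list = []


def create_task(title: str, due_date: Optional[str] = None, priority: str = "medium") -> dict:
    task = {"id": len(TASKS) + 1, "title": title, "due_date": due_date,
            "priority": priority, "status": "open"}
    TASKS.append(task)
    return {"success": True, "task": task}


def list_tasks(status: Optional[str] = None) -> dict:
    tasks = [t for t in TASKS if status is None or t["status"] == status]
    return {"success": True, "tasks": tasks}


# Only tools listed here can ever be executed, whatever the model asks for.
TOOL_REGISTRY: dict = {
    "create_task": create_task,
    "list_tasks": list_tasks,
}


def _error(code: str, message: str) -> dict:
    return {"success": False, "error": {"code": code, "message": message}}


def execute_tool(name: str, arguments: Union[str, dict]) -> dict:
    """Dispatch a model-requested call without raising back into the loop."""
    handler: Optional[Callable[..., Any]] = TOOL_REGISTRY.get(name)
    if handler is None:
        return _error("UNKNOWN_TOOL", f"No tool named {name!r}.")
    try:
        # Accept either the raw JSON string from the model or a parsed dict.
        args = json.loads(arguments) if isinstance(arguments, str) else arguments
        return handler(**args)
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON, or arguments that do not match the handler signature.
        return _error("INVALID_ARGUMENTS", str(exc))
```

For example, execute_tool("create_task", '{"title": "Pay rent"}') returns a success object containing the new task, while an unknown tool name or malformed arguments produce an error object the model can read and explain.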

Building agents with GPT Actions

If you’re creating a custom GPT inside ChatGPT, GPT Actions provide a no- or low-backend way to give your agent tools.

Typical steps:

Define your GPT
- In ChatGPT, create a new GPT.
- Add instructions describing the agent’s role, persona, and what it can do with tools.
Add Actions
- In the GPT configuration, define Actions that point to your APIs.
- Each Action includes:
  - Endpoint URL
  - HTTP method
  - Authentication (if any)
  - OpenAPI schema describing parameters and responses
  - A natural-language description
Give usage guidance
- In the GPT’s instructions, tell it when to use each Action.
- Example: “When you need to look up a customer order, use the getOrder action.”
Test and refine
- Converse with the GPT in ChatGPT.
- Check that it calls Actions correctly and interprets results well.
- Refine descriptions and instructions as needed.
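Actions are described to the GPT with an OpenAPI schema. As a rough illustration, the snippet below builds a minimal schema for the getOrder action mentioned above and prints it as JSON that could be pasted into the Action editor. The server URL, path, and response fields are placeholders, not a real API.

```python
import json

# Hypothetical API: URL, path, and fields are placeholders for illustration.
openapi_spec = {
    "openapi": "3.1.0",
    "info": {"title": "Order Lookup API", "version": "1.0.0"},
    "servers": [{"url": "https://api.example.com"}],
    "paths": {
        "/orders/{orderId}": {
            "get": {
                # operationId is the name the GPT uses to refer to the Action.
                "operationId": "getOrder",
                "summary": "Look up a customer order by its ID.",
                "parameters": [
                    {
                        "name": "orderId",
                        "in": "path",
                        "required": True,
                        "description": "Identifier of the order to fetch.",
                        "schema": {"type": "string"},
                    }
                ],
                "responses": {
                    "200": {
                        "description": "The order record.",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "type": "object",
                                    "properties": {
                                        "orderId": {"type": "string"},
                                        "status": {"type": "string"},
                                    },
                                }
                            }
                        },
                    }
                },
            }
        }
    },
}

print(json.dumps(openapi_spec, indent=2))
```

The summary and description fields play the same role as tool descriptions in the API: they are what the GPT reads when deciding whether to call the Action.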

GPT Actions are especially useful for:

Internal tools (CRM, dashboards, internal APIs).
Rapid prototyping of tool-using agents.
Agents that primarily live inside the ChatGPT UI.

For data retrieval specifically, you can:

Expose search or query endpoints as Actions.
Return structured documents or results.
Instruct the GPT to always use these Actions for factual data rather than guessing.

Designing safe and robust tool behaviors

Beyond basic functionality, a production agent must handle:

Input validation
Errors and edge cases
Security and permissions
Misuse or overuse of tools

1. Guide the model with system instructions

Use your system message to set clear expectations:

Describe tools briefly and how they should be used.
Set constraints (e.g., “Never modify user data without explicit confirmation”; “Ask for clarification if required parameters are missing.”).
Define safety behaviors (“If you lack permission, explain the limitation rather than guessing.”).

Example snippet:

You can use tools to manage tasks. Before creating or updating tasks, confirm the user’s intent clearly. If a tool call fails or returns an error, explain the issue and ask the user how to proceed. Never invent task IDs or statuses.

2. Make tools defensive by design

Your tools should:

Validate arguments server-side (don’t fully trust the model).
Enforce authorization and rate limits.
Return structured error objects when something goes wrong.

For example:

{
  "success": false,
  "error": {
    "code": "UNAUTHORIZED",
    "message": "You do not have permission to modify this task."
  }
}

Then instruct the model to:

Interpret success: false as a failure and explain to the user.
Avoid retry loops for certain error codes unless the user changes input.
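The points above can be sketched on the backend side as follows. The tool validates its arguments, checks ownership, and returns the same error shape as the example; a small helper tells the orchestrator which failures are worth retrying. The update_task_priority tool, the TASK_OWNERS table, and the set of non-retryable codes are assumptions made for this example.

```python
from typing import Any

ALLOWED_PRIORITIES = {"low", "medium", "high"}

# Error codes the orchestrator should not retry automatically (assumed set).
NON_RETRYABLE = {"UNAUTHORIZED", "INVALID_ARGUMENTS"}

# Hypothetical ownership table: task id -> owning user id.
TASK_OWNERS = {101: "alice", 102: "bob"}


def error(code: str, message: str) -> dict:
    """Build the structured error object used throughout this section."""
    return {"success": False, "error": {"code": code, "message": message}}


def update_task_priority(user_id: str, args: dict) -> dict:
    """Validate model-supplied arguments, then check authorization."""
    task_id: Any = args.get("task_id")
    priority: Any = args.get("priority")
    if not isinstance(task_id, int) or task_id not in TASK_OWNERS:
        return error("INVALID_ARGUMENTS", "task_id must refer to an existing task.")
    if not isinstance(priority, str) or priority not in ALLOWED_PRIORITIES:
        return error("INVALID_ARGUMENTS", "priority must be low, medium, or high.")
    if TASK_OWNERS[task_id] != user_id:
        return error("UNAUTHORIZED", "You do not have permission to modify this task.")
    return {"success": True, "task_id": task_id, "priority": priority}


def should_retry(result: dict) -> bool:
    """Retry only failures that could succeed without new user input."""
    if result.get("success"):
        return False
    return result["error"]["code"] not in NON_RETRYABLE
```

Note that user_id comes from your authenticated session, never from the model's arguments, so the model cannot grant itself access by inventing an identity.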

3. Constrain scope and capabilities

Avoid giving one agent overly broad powers. Instead:

Use specialized agents for:
- Payments
- DevOps
- Customer support actions
Control which tools are available to each agent.
Add explicit safety checks in code for sensitive operations (e.g., require multi-step confirmation or human approval for high-risk actions).
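One way to enforce these limits in code, rather than relying on prompts alone, is a per-agent allowlist plus an approval gate for high-risk tools. The sketch below is illustrative: the agent names, the tool sets, and the authorize_call helper are assumptions.

```python
# Hypothetical mapping of agents to the only tools each may call.
AGENT_TOOLS = {
    "support": {"get_order_details", "update_shipping_address"},
    "payments": {"get_order_details", "refund_order"},
}

# Tools that need explicit human or user approval before execution.
REQUIRES_APPROVAL = {"refund_order"}


def authorize_call(agent: str, tool_name: str, approved: bool = False) -> dict:
    """Decide in code whether a model-requested call may run."""
    if tool_name not in AGENT_TOOLS.get(agent, set()):
        return {"allowed": False, "reason": "TOOL_NOT_AVAILABLE"}
    if tool_name in REQUIRES_APPROVAL and not approved:
        return {"allowed": False, "reason": "APPROVAL_REQUIRED"}
    return {"allowed": True, "reason": None}


print(authorize_call("support", "refund_order"))
print(authorize_call("payments", "refund_order"))
print(authorize_call("payments", "refund_order", approved=True))
```

The orchestrator would run this check before executing any tool, and would also pass each agent only the tool definitions in its allowlist so the model never sees tools it cannot use.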

Handling multi-step workflows and planning

Many real-world tasks require multiple tools and steps. You can handle this in different ways.

Approach A: Let the model plan

Let one agent both plan and execute:

It decides which tools to call and in what sequence.
It may call tools, inspect results, and call more tools until done.

Pros:

Simple architecture.
Useful for flexible, open-ended tasks.

Cons:

Harder to guarantee optimal or safe sequences for complex workflows.

You can improve performance by:

Giving examples of multi-step tasks in the system message.
Providing a tool like submit_final_answer and instructing the model not to call it until all steps are done.

Approach B: Use a dedicated planner

Split responsibilities:

Planner agent:
- Takes the user goal.
- Produces an explicit plan or sequence of steps/tools.
Executor agent:
- Follows the plan step by step.
- Calls tools and reports results.

This approach:

Offers more control over which tools can be used at which stage.
Makes it easier to log, inspect, and debug workflows.
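A stripped-down sketch of the planner/executor split is shown below. The planner is a stub that returns a fixed plan where a real system would call a model, and the plan hard-codes the customer ID rather than threading results between steps. The function names, the plan format, and the demo tools are assumptions for illustration.

```python
def stub_planner(goal: str) -> list:
    """Stand-in for a planner model call; returns a fixed plan for the demo."""
    return [
        {"tool": "get_customer_by_email", "args": {"email": "pat@example.com"}},
        {"tool": "list_orders_for_customer", "args": {"customer_id": "c_1"}},
    ]


# Demo tool implementations with canned results.
DEMO_TOOLS = {
    "get_customer_by_email": lambda email: {"customer_id": "c_1"},
    "list_orders_for_customer": lambda customer_id: {"orders": []},
}


def run_plan(plan: list, tools: dict, allowed: set) -> list:
    """Execute steps in order; stop at the first step that is not permitted."""
    log = []
    for index, step in enumerate(plan):
        name = step["tool"]
        if name not in allowed or name not in tools:
            log.append({"step": index, "tool": name, "status": "rejected"})
            break
        result = tools[name](**step["args"])
        log.append({"step": index, "tool": name, "status": "ok", "result": result})
    return log


plan = stub_planner("Show Pat's recent orders")
for entry in run_plan(plan, DEMO_TOOLS, allowed={"get_customer_by_email"}):
    print(entry)
```

Because the plan is plain data, it can be validated, logged, or shown to a human before any step runs, and the allowed set can differ per stage.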

Approach C: Orchestrate in code

Use the model primarily for:

Decision points.
Interpretation of results.
Generation of user-facing content.

But orchestrate the exact tool sequence in your application code:

Good when workflows are mostly fixed, with a few “intelligent” decisions in the middle.
Easier to test and version.
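As an illustration of this approach, the workflow below fixes the tool sequence in code and delegates a single judgment to a decide callback, which in a real system might wrap a model call. All names here (handle_refund_request, the stub functions, the order fields) are hypothetical.

```python
from typing import Callable


def handle_refund_request(
    order_id: str,
    get_order: Callable[[str], dict],
    decide: Callable[[dict], str],
    refund: Callable[[str], dict],
) -> dict:
    """Fixed workflow with one model-backed decision in the middle."""
    order = get_order(order_id)          # fixed step 1: always fetch the order
    decision = decide(order)             # decision point: "refund" or "decline"
    if decision == "refund":
        return refund(order_id)          # fixed step 2a
    return {"success": False, "reason": "not_eligible"}  # fixed step 2b


# Stubs so the sketch runs offline; decide() stands in for a model call.
def fake_get_order(order_id: str) -> dict:
    return {"order_id": order_id, "days_since_delivery": 5}


def fake_decide(order: dict) -> str:
    return "refund" if order["days_since_delivery"] <= 30 else "decline"


def fake_refund(order_id: str) -> dict:
    return {"success": True, "refunded_order": order_id}


print(handle_refund_request("o_42", fake_get_order, fake_decide, fake_refund))
```

Since the sequence lives in ordinary code, each branch can be unit-tested by swapping in different stubs, without calling a model at all.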

Integrating data retrieval tools

Many agents need to fetch or search data before they can answer.

With GPT Actions and data retrieval:

Define Actions that:
- Search a database or knowledge base.
- Filter by user, date, topic, or other criteria.
- Return structured results (e.g., documents, records, snippets).
In system instructions, enforce retrieval before answering:
- “For any question involving company policies, always use the searchPolicies action first. Never rely on prior conversation alone.”

Your loop then becomes:

Model calls retrieval tool.
Backend queries your store and returns results.
Model synthesizes those results into a coherent answer.
Model optionally calls other tools (e.g., to create a record or send a notification) based on the retrieved information.
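The backend half of this loop can be sketched as a small search function that returns structured results. The example uses naive keyword counting over an in-memory list purely so it runs without dependencies; a real system would more likely query a database or search index. The policy documents and the search_policies signature are invented for illustration.

```python
# Hypothetical policy store; the contents are sample data for this sketch.
POLICIES = [
    {"id": "p1", "title": "Refund policy",
     "text": "Refunds are available within 30 days of purchase."},
    {"id": "p2", "title": "Shipping policy",
     "text": "Orders ship within 2 business days."},
    {"id": "p3", "title": "Privacy policy",
     "text": "Customer data is never sold to third parties."},
]


def search_policies(query: str, limit: int = 3) -> dict:
    """Rank documents by how often the query terms occur in them."""
    terms = query.lower().split()
    scored = []
    for doc in POLICIES:
        haystack = (doc["title"] + " " + doc["text"]).lower()
        score = sum(haystack.count(term) for term in terms)
        if score:
            scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    results = [
        {"id": doc["id"], "title": doc["title"], "snippet": doc["text"][:120]}
        for _, doc in scored[:limit]
    ]
    # An empty result list is still a valid, structured answer.
    return {"query": query, "results": results}


print(search_policies("refund"))
```

Returning an explicit empty results list, rather than an error, lets the model say that nothing was found instead of guessing.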

This pattern is crucial for:

Customer support agents.
Internal knowledge assistants.
Agents that must stay consistent with evolving policies or product data.

Testing, evaluation, and monitoring

Tool-using agents can appear to work in simple tests but fail in edge cases. Add robust testing and monitoring.

1. Scenario-based tests

Create a set of realistic scenarios:

Simple tasks (sanity checks).
Complex, multi-step tasks.
Edge cases and ambiguous instructions.
Safety-critical situations (e.g., actions involving money, security, or compliance).

For each scenario:

Define expected tool calls and resulting behavior.
Run automated tests that:
- Feed the prompt.
- Capture model outputs and tool calls.
- Compare to expected patterns.
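The comparison step might look something like the sketch below, which checks recorded tool calls against an expected sequence and reports mismatches. It assumes calls are captured as dicts with "tool" and "args" keys; expected arguments are treated as a subset, so the model may add optional fields without failing the test. The check_scenario name and record format are assumptions.

```python
def check_scenario(expected_calls: list, actual_calls: list) -> list:
    """Return a list of human-readable mismatches; empty means the run passed."""
    problems = []
    if len(actual_calls) != len(expected_calls):
        problems.append(
            f"expected {len(expected_calls)} tool calls, got {len(actual_calls)}"
        )
    for index, (expected, actual) in enumerate(zip(expected_calls, actual_calls)):
        if expected["tool"] != actual["tool"]:
            problems.append(
                f"call {index}: expected {expected['tool']}, got {actual['tool']}"
            )
            continue
        # Only the arguments named in the expectation are compared.
        for key, value in expected.get("args", {}).items():
            if actual.get("args", {}).get(key) != value:
                problems.append(f"call {index}: argument {key!r} mismatch")
    return problems


expected = [{"tool": "create_task", "args": {"title": "Pay rent"}}]
actual = [{"tool": "create_task", "args": {"title": "Pay rent", "priority": "high"}}]
print(check_scenario(expected, actual))
```

Because model output can vary between runs, checks like this tend to work best when they assert on tool names and key arguments rather than on exact wording.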

2. Log tool usage

In production, log:

The user’s request.
Model tool calls (which tool, arguments).
Tool execution results.
Final outputs.

Analyze logs for:

Incorrect or unnecessary tool calls.
Frequent tool errors.
Hallucinated arguments (arguments that don’t match user intent).
Performance and latency bottlenecks.
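A minimal sketch of such logging and analysis is shown below: each tool call becomes one JSON line, and a summary function counts calls, errors, and slow calls per tool. The record fields, function names, and the 1000 ms threshold for "slow" are assumptions for this example.

```python
import json
from collections import Counter

SLOW_THRESHOLD_MS = 1000  # assumed cutoff for flagging a slow tool call


def make_log_record(request: str, tool: str, arguments: dict,
                    result: dict, latency_ms: float) -> str:
    """Serialize one tool call as a single JSON line."""
    return json.dumps({
        "request": request,
        "tool": tool,
        "arguments": arguments,
        "success": bool(result.get("success")),
        "error_code": (result.get("error") or {}).get("code"),
        "latency_ms": latency_ms,
    })


def summarize(log_lines: list) -> dict:
    """Count calls, errors, and slow calls per tool from JSON log lines."""
    calls, errors, slow = Counter(), Counter(), Counter()
    for line in log_lines:
        record = json.loads(line)
        calls[record["tool"]] += 1
        if not record["success"]:
            errors[(record["tool"], record["error_code"])] += 1
        if record["latency_ms"] > SLOW_THRESHOLD_MS:
            slow[record["tool"]] += 1
    return {"calls": dict(calls), "errors": dict(errors), "slow": dict(slow)}
```

Logging the original request next to the arguments is what makes hallucinated arguments detectable later: a reviewer or an automated check can compare what the user asked for with what the model passed.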

3. Iteratively refine tools and prompts

Use the logs to refine:

Tool descriptions and parameter docs.
System instructions about when tools should be used.
Guardrails and safety logic.

Iterative refinement is often the largest driver of quality improvement.

GEO considerations: making your agents discoverable and understandable

If you care about Generative Engine Optimization (GEO)—visibility and clarity for AI search and AI-powered browsing—design your tool-using agents and their documentation so models can easily understand what they do.

Key steps:

Clear descriptions
- Use concise, descriptive names for tools and endpoints.
- Provide high-quality natural-language descriptions in your tool schemas and GPT Actions.
Structured, predictable responses
- Return consistent JSON structures from tools.
- Include short, human-readable fields summarizing the result.
Document capabilities
- Publish documentation describing:
  - What your agent can and cannot do.
  - The tools/endpoints it uses.
  - Example queries and responses.

Agents that are easy for models to “read” and interpret will generally integrate better into AI-powered search and aggregation experiences.

Putting it all together: a simple blueprint

Here’s a concrete blueprint to build a tool-using agent with OpenAI:

Define the mission
- Example: “An assistant that manages customer orders for a small e‑commerce store.”
List the essential tools
- get_customer_by_email
- list_orders_for_customer
- get_order_details
- refund_order
- update_shipping_address
Implement backend APIs for these operations
- Secure, authenticated endpoints.
- Validated inputs and structured outputs.
- Clear error codes.
Expose tools to the model
- Via the OpenAI API (tools/functions) or GPT Actions.
- Provide clear descriptions and JSON schemas.
Write strong system instructions
- Describe:
  - The assistant’s role.
  - When to use each tool.
  - Confirmation requirements for sensitive actions (e.g., refunds).
  - Error-handling behavior.
Implement the tool loop
- Accept user requests.
- Call OpenAI with tools enabled.
- Execute requested tools and send results back.
- Return the final answer or action outcome to the user.
Test and refine
- Use scripted scenarios and real-world logs.
- Adjust tool definitions, instructions, and backend error handling.
Monitor and scale
- Add observability around tool calls and model outputs.
- Optimize for latency and cost (model choice, caching, batching).
- Continuously improve safety and robustness.

By combining well-designed tools with strong reasoning models and thoughtful orchestration, you can build powerful tool-using agents with OpenAI that not only respond in natural language but also take meaningful, reliable action across your systems and data.