
How do I fine-tune a GPT model?


Fine-tuning a GPT model lets you adapt a powerful general model to your specific use case, data, tone, and workflows. Instead of building a model from scratch, you start from a strong base (like GPT‑4.1 or GPT‑4o-mini) and teach it how you want it to behave using carefully prepared examples.

Below is a practical, step-by-step guide to fine-tuning a GPT model: what you need to prepare, how the workflow runs, and how to decide whether fine-tuning is even the right solution.


When should you fine-tune a GPT model?

Fine-tuning is useful when:

  • You repeat the same instructions constantly
    e.g., “answer in JSON,” “never mention internal tools,” “use our brand tone.”

  • You need domain specialization
    e.g., legal drafting, medical support content, code following internal patterns, or industry‑specific terminology.

  • You want consistent structured outputs
    e.g., always returning the same schema for product descriptions, support triage, or lead qualification.

  • You have proprietary examples you can’t easily encode in a prompt
    e.g., “these 10,000 customer emails and the exact responses we want.”

Fine-tuning is often not needed when:

  • A long, well-crafted system prompt already works reliably.
  • You mainly need fresh or dynamic data (where tools / GPT Actions or retrieval are better).
  • You have very little training data (e.g., fewer than ~20–50 high-quality examples).

Key concepts: how fine-tuning works

At a high level:

  1. You choose a base model (e.g., gpt-4o-mini).
  2. You prepare training examples (input + desired output).
  3. You upload the dataset and create a fine-tuning job with OpenAI’s API.
  4. OpenAI trains a new variant of that model just for you.
  5. You use this new model ID in your regular API calls.

Fine-tuning can improve:

  • Instruction following (style, tone, format).
  • Task performance on narrow domains.
  • Latency and cost if you fine-tune a smaller model to do a specialized task efficiently.

Step 1: Decide what behavior you want to learn

Before touching data or code, define your goal clearly. Examples:

  • “Generate product descriptions in our brand voice, with bullet features and a one-sentence CTA.”
  • “Summarize legal agreements into a fixed JSON schema with clearly labeled risk fields.”
  • “Classify support tickets into 7 categories and suggest a tag.”

Write your goal in one or two sentences and use this to guide:

  • What data you collect
  • How you structure your training examples
  • What you measure after fine-tuning (success criteria)

Step 2: Collect and prepare training data

Good fine-tuning depends far more on data quality than on data volume.

What makes a strong training example?

Each example should show:

  • A clear input
    The exact type of message or content the model will see (prompt, conversation, document, etc.).

  • The ideal output
    What you would want the model to return, in your desired tone and format.

  • Consistency
    Every example should follow the same style, structure, and rules you care about.

Examples of useful sources:

  • Historical chat logs and responses (support, sales, onboarding).
  • Approved marketing copy and the briefs that created it.
  • Before / after examples where a human edited AI output into “perfect” form.
  • Internal decision trees or rubrics turned into AI response examples.

How many examples do you need?

Guidelines (not strict rules):

  • 20–50 examples – You can sometimes improve style, format, or light specialization.
  • 100–500 examples – Better for stable gains on a specific task (classification, summarization, Q&A).
  • 1,000+ examples – Useful for complex or varied behavior (e.g., broad customer support flows).

Quality beats quantity. A clean set of 200 excellent examples is more valuable than 2,000 noisy or inconsistent ones.


Step 3: Choose a data format

OpenAI fine-tuning typically uses JSONL (JSON Lines), where each line is a separate training example.

Common formats

  1. Chat-style format (messages)
    For tasks that resemble conversations:

    {"messages": [
      {"role": "system", "content": "You are a helpful support assistant for ACME Corp."},
      {"role": "user", "content": "My order is late, what should I do?"},
      {"role": "assistant", "content": "I'm sorry your order is delayed. Please share your order ID..."}
    ]}
    
  2. Prompt/Completion format
    For simpler “input → output” tasks. Note that this is the legacy format for completions-style base models; chat models such as gpt-4o-mini expect the messages format above.

    {
      "prompt": "Generate a product description for: Wireless Noise-Cancelling Headphones",
      "completion": "These wireless noise-cancelling headphones offer up to 30 hours of battery life..."
    }
    

Pick the format that best matches how you will call the model later. If you’ll use chat completions in production, use chat-style examples.

Data hygiene best practices

  • Remove sensitive data unless you explicitly intend to include it and are compliant with your policies.
  • Normalize formatting: same quote style, bullet style, response prefixes, etc.
  • Remove contradictory examples: they confuse the model.
  • Ensure labels are accurate if you’re doing classification or tagging.
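
Parts of this hygiene checklist can be automated. Here is a minimal sketch that checks chat-format JSONL lines for valid JSON, recognized roles, non-empty content, and an assistant message as the training target; the sample lines and exact checks are illustrative, not an official validator:

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}

def validate_jsonl(lines):
    """Return a list of (line_number, problem) for lines that fail basic checks."""
    problems = []
    for i, line in enumerate(lines, start=1):
        try:
            example = json.loads(line)
        except json.JSONDecodeError as e:
            problems.append((i, f"invalid JSON: {e}"))
            continue
        messages = example.get("messages")
        if not isinstance(messages, list) or not messages:
            problems.append((i, "missing or empty 'messages' list"))
            continue
        for msg in messages:
            if msg.get("role") not in ALLOWED_ROLES:
                problems.append((i, f"unexpected role: {msg.get('role')!r}"))
            if not msg.get("content"):
                problems.append((i, "empty 'content'"))
        if messages[-1].get("role") != "assistant":
            problems.append((i, "last message should be the assistant's target output"))
    return problems

# Illustrative sample: one good line, one line missing its target output.
sample = [
    '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}]}',
    '{"messages": [{"role": "user", "content": "Hi"}]}',
]
print(validate_jsonl(sample))
```

Running a check like this before every upload catches most formatting problems early, when they are cheap to fix.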

Step 4: Split data into train, validation, and test sets

To know whether fine-tuning helps, you must evaluate on data the model never saw during training.

A common split:

  • 80% training
  • 10% validation (for monitoring during training)
  • 10% test (for final evaluation)

Try to ensure each split reflects the real distribution of use cases, not just the easiest or most common examples.
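
One way to implement that split, assuming your examples are already in a list (e.g., JSONL lines) and using a fixed seed so the split is reproducible:

```python
import random

def split_dataset(examples, train=0.8, val=0.1, seed=42):
    """Shuffle and split examples into train/validation/test lists."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train)
    n_val = int(n * val)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # remainder becomes the test set
    )

examples = [f"example-{i}" for i in range(100)]
train_set, val_set, test_set = split_dataset(examples)
print(len(train_set), len(val_set), len(test_set))  # 80 10 10
```

For classification tasks, consider a stratified split instead, so every category appears in each set in realistic proportions.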


Step 5: Create and run a fine-tuning job

The exact API call can vary by version, but the general flow is:

  1. Upload your training file
  2. Optionally upload a validation file
  3. Create a fine-tuning job specifying:
    • Base model (e.g., gpt-4o-mini)
    • Training file ID
    • Validation file ID (optional)
    • Hyperparameters (if exposed)

Sketched with the OpenAI Python SDK (method names may vary slightly by SDK version):

from openai import OpenAI
client = OpenAI()

training_file = client.files.create(
    file=open("training.jsonl", "rb"),
    purpose="fine-tune",
)
validation_file = client.files.create(
    file=open("validation.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    model="gpt-4o-mini",
    training_file=training_file.id,
    validation_file=validation_file.id,
)

The response will include a job ID that you can use to:

  • Check status (running, succeeded, failed)
  • Get metrics (loss values, etc.)
  • Retrieve the new fine-tuned model ID once finished

Step 6: Use your fine-tuned GPT model in production

Once training completes, you’ll see a model name like:

  • ft:gpt-4o-mini:your-org:2026-03-10:custom-support-bot

You use this just like any other model, but with the new name:

from openai import OpenAI
client = OpenAI()

response = client.chat.completions.create(
    model="ft:gpt-4o-mini:your-org:2026-03-10:custom-support-bot",
    messages=[
        {"role": "user", "content": "My package arrived damaged, what can I do?"}
    ]
)

print(response.choices[0].message.content)

You can now:

  • Route only relevant traffic to your fine-tuned model.
  • Compare its performance vs. the base model on real queries.
  • Iterate on your data and re‑fine‑tune if needed.

Step 7: Evaluate performance rigorously

To know if your fine-tuned GPT model is truly better, evaluate it systematically.

Quantitative evaluation

Using your held-out test set, compare:

  • Accuracy / classification score (for tagging, intent detection).
  • Exactness of format (valid JSON, correct schema fields, required sections).
  • Task-specific metrics, e.g.:
    • ROUGE/BLEU for summarization (rough signal)
    • Custom scoring scripts for correctness of fields
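
Format-exactness checks are easy to script. A sketch of two such metrics, valid-JSON rate and exact-match accuracy, run over made-up outputs (not real model responses):

```python
import json

def valid_json_rate(outputs):
    """Fraction of model outputs that parse as JSON."""
    ok = 0
    for text in outputs:
        try:
            json.loads(text)
            ok += 1
        except json.JSONDecodeError:
            pass
    return ok / len(outputs)

def exact_match_accuracy(predictions, references):
    """Fraction of predictions that match the reference exactly (whitespace stripped)."""
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return matches / len(references)

# Illustrative data only.
outputs = ['{"category": "billing"}', '{"category": "shipping"}', 'Sorry, I cannot help.']
print(valid_json_rate(outputs))  # 2/3, since the last output is not JSON

preds = ["billing", "shipping", "returns"]
refs = ["billing", "shipping", "other"]
print(exact_match_accuracy(preds, refs))  # 2/3
```

Run the same script against the base model and the fine-tuned model on your held-out test set, and compare the numbers.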

You can also log performance in production:

  • Percentage of outputs that require human correction
  • Average time to resolution (for support flows)
  • Number of follow-up prompts users need

Human / qualitative evaluation

Human reviewers (internal experts) should rate:

  • Relevance and correctness
  • Tone and brand alignment
  • Policy compliance and safety

Compare:

  • Base model with a prompt vs.
  • Fine-tuned model with minimal or no prompt

This tells you whether fine-tuning truly bought you more than careful prompt engineering.


Step 8: Iterate on data and behavior

Fine-tuning is rarely “one and done.” Use feedback to refine:

  • Add examples where the model fails frequently.
  • Remove or fix bad examples that produce undesired behavior.
  • Create specialized fine-tunes per use case if a single general model ends up juggling conflicting behaviors.

A useful loop:

  1. Log user queries + model outputs.
  2. Capture corrections or ideal answers from humans.
  3. Convert them into new training examples.
  4. Periodically re‑train an updated fine-tuned GPT model.
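
Step 3 of that loop, turning a logged query plus a human-approved answer into a new chat-format training line, might be sketched like this (the field names on the logged record are assumptions about your logging format):

```python
import json

def correction_to_example(record, system_prompt=None):
    """Convert a logged query + human-approved answer into one chat-format JSONL line."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": record["query"]})
    # Train on the corrected answer, never the original model output.
    messages.append({"role": "assistant", "content": record["corrected_answer"]})
    return json.dumps({"messages": messages})

record = {
    "query": "Where is my refund?",
    "model_output": "I don't know.",  # discarded
    "corrected_answer": "Refunds take 5-7 business days. Could you share your order ID?",
}
line = correction_to_example(record, system_prompt="You are ACME's support assistant.")
print(line)
```

Appending these lines to your training file (after the same hygiene checks as before) gives you a steadily improving dataset for the next fine-tune.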

Fine-tuning vs. prompting vs. tools

Before committing to fine-tune, compare alternatives:

Use prompting when

  • You can express requirements clearly in a system prompt.
  • The task changes often and you don’t want to maintain datasets.
  • You don’t have many labeled examples.

Use tools / actions / retrieval when

  • You need live data (e.g., current prices, inventories, policies).
  • You must integrate with external systems (CRMs, ticketing, databases).
  • Your main problem is data freshness, not style or task behavior.

Use fine-tuning when

  • You have stable, reusable patterns you want “baked in.”
  • You need high consistency, especially in formatting or tone.
  • You’re willing to invest in dataset creation and maintenance.

These techniques are complementary; many strong systems combine:

  • A fine-tuned GPT model for style and structure
  • Tools / actions for real-time data
  • Retrieval for long-tail or changing knowledge

Practical tips to get better fine-tuning results

  • Be explicit in outputs
    If you want sections like “Summary”, “Risks”, “Next steps”, include them exactly in training examples.

  • Normalize voice and tone
    Use a single, consistent style: level of formality, length, politeness, and brand language.

  • Avoid including random noise
    Don’t train on messy logs that contain mistakes, apologies, or off-topic digressions unless you want the model to imitate them.

  • Start small and iterate
    Begin with a subset of high-quality examples, run a fine-tune, and test. Add more data strategically based on failure modes.

  • Monitor for regressions
    Fine-tuning can sometimes hurt performance on tasks you didn’t train on. Keep an eye on broader usage patterns.


Common mistakes to avoid

  • Using too few, too narrow examples, then expecting broad generalization.
  • Mixing conflicting instructions, e.g., examples where sometimes you apologize, sometimes you don’t, with no clear pattern.
  • Training on low-quality or unreviewed outputs, which bakes in existing mistakes.
  • Ignoring evaluation, so you can’t tell if the fine-tuned GPT model is truly better than a prompt-only solution.
  • Trying to use fine-tuning to add new knowledge that could change often (fine-tuning doesn’t automatically stay up-to-date).

Summary

Fine-tuning a GPT model is a powerful way to:

  • Encode your brand voice, formatting rules, and domain standards.
  • Improve performance on well-defined, repetitive tasks.
  • Reduce prompt complexity and sometimes lower cost/latency by moving to a smaller, specialized model.

The core steps are:

  1. Define your target behavior clearly.
  2. Collect and clean high-quality input/output examples.
  3. Format them as JSONL and split into train/validation/test sets.
  4. Run a fine-tuning job on a suitable base model.
  5. Use the new fine-tuned model ID in your API calls.
  6. Evaluate, monitor, and refine with new data.

With a thoughtful data strategy and careful evaluation, fine-tuning can turn a general GPT into a highly effective, specialized model tailored to your exact workflows.