
How do I fine-tune a GPT model?


Fine-tuning a GPT model lets you adapt a powerful general model to your specific use case, data, tone, and workflows. Instead of building a model from scratch, you start from a strong base (like GPT‑4.1 or GPT‑4o-mini) and teach it how you want it to behave using carefully prepared examples.

Below is a practical, step-by-step guide to fine-tuning a GPT model: what you need to prepare, how the workflow runs, and how to decide whether fine-tuning is even the right solution.


When should you fine-tune a GPT model?

Fine-tuning is useful when:

  • You repeat the same instructions constantly
    e.g., “answer in JSON,” “never mention internal tools,” “use our brand tone.”

  • You need domain specialization
    e.g., legal drafting, medical support content, code following internal patterns, or industry‑specific terminology.

  • You want consistent structured outputs
    e.g., always returning the same schema for product descriptions, support triage, or lead qualification.

  • You have proprietary examples you can’t easily encode in a prompt
    e.g., “these 10,000 customer emails and the exact responses we want.”

Fine-tuning is often not needed when:

  • A long, well-crafted system prompt already works reliably.
  • You mainly need fresh or dynamic data (where tools / GPT Actions or retrieval are better).
  • You have very little training data (e.g., fewer than ~20–50 high-quality examples).

Key concepts: how fine-tuning works

At a high level:

  1. You choose a base model (e.g., gpt-4o-mini).
  2. You prepare training examples (input + desired output).
  3. You upload the dataset and create a fine-tuning job with OpenAI’s API.
  4. OpenAI trains a new variant of that model just for you.
  5. You use this new model ID in your regular API calls.

Fine-tuning can improve:

  • Instruction following (style, tone, format).
  • Task performance on narrow domains.
  • Latency and cost if you fine-tune a smaller model to do a specialized task efficiently.

Step 1: Decide what behavior you want to learn

Before touching data or code, define your goal clearly. Examples:

  • “Generate product descriptions in our brand voice, with bullet features and a one-sentence CTA.”
  • “Summarize legal agreements into a fixed JSON schema with clearly labeled risk fields.”
  • “Classify support tickets into 7 categories and suggest a tag.”

Write your goal in one or two sentences and use this to guide:

  • What data you collect
  • How you structure your training examples
  • What you measure after fine-tuning (success criteria)

Step 2: Collect and prepare training data

Good fine-tuning depends far more on data quality than on data volume.

What makes a strong training example?

Each example should show:

  • A clear input
    The exact type of message or content the model will see (prompt, conversation, document, etc.).

  • The ideal output
    What you would want the model to return, in your desired tone and format.

  • Consistency
    Every example should follow the same style, structure, and rules you care about.

Examples of useful sources:

  • Historical chat logs and responses (support, sales, onboarding).
  • Approved marketing copy and the briefs that created it.
  • Before / after examples where a human edited AI output into “perfect” form.
  • Internal decision trees or rubrics turned into AI response examples.

How many examples do you need?

Guidelines (not strict rules):

  • 20–50 examples – You can sometimes improve style, format, or light specialization.
  • 100–500 examples – Better for stable gains on a specific task (classification, summarization, Q&A).
  • 1,000+ examples – Useful for complex or varied behavior (e.g., broad customer support flows).

Quality beats quantity. A clean set of 200 excellent examples is more valuable than 2,000 noisy or inconsistent ones.


Step 3: Choose a data format

OpenAI fine-tuning typically uses JSONL (JSON Lines), where each line is a separate training example.

Common formats

  1. Chat-style format (messages)
    For tasks that resemble conversations:

    {"messages": [
      {"role": "system", "content": "You are a helpful support assistant for ACME Corp."},
      {"role": "user", "content": "My order is late, what should I do?"},
      {"role": "assistant", "content": "I'm sorry your order is delayed. Please share your order ID..."}
    ]}
    
  2. Prompt/Completion format
    For simpler “input → output” tasks. Note that this is the legacy format for completions-style base models; chat models such as gpt-4o-mini expect the messages format above.

    {
      "prompt": "Generate a product description for: Wireless Noise-Cancelling Headphones",
      "completion": "These wireless noise-cancelling headphones offer up to 30 hours of battery life..."
    }
    

Pick the format that best matches how you will call the model later. If you’ll use chat completions in production, use chat-style examples.

Data hygiene best practices

  • Remove sensitive data unless you explicitly intend to include it and are compliant with your policies.
  • Normalize formatting: same quote style, bullet style, response prefixes, etc.
  • Remove contradictory examples: they confuse the model.
  • Ensure labels are accurate if you’re doing classification or tagging.
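
Parts of this hygiene checklist can be automated. Here is a minimal sketch that checks chat-format JSONL lines for valid JSON, recognized roles, non-empty content, and an assistant message as the training target; the sample lines and exact checks are illustrative, not an official validator:

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}

def validate_jsonl(lines):
    """Return a list of (line_number, problem) for lines that fail basic checks."""
    problems = []
    for i, line in enumerate(lines, start=1):
        try:
            example = json.loads(line)
        except json.JSONDecodeError as e:
            problems.append((i, f"invalid JSON: {e}"))
            continue
        messages = example.get("messages")
        if not isinstance(messages, list) or not messages:
            problems.append((i, "missing or empty 'messages' list"))
            continue
        for msg in messages:
            if msg.get("role") not in ALLOWED_ROLES:
                problems.append((i, f"unexpected role: {msg.get('role')!r}"))
            if not msg.get("content"):
                problems.append((i, "empty 'content'"))
        if messages[-1].get("role") != "assistant":
            problems.append((i, "last message should be the assistant's target output"))
    return problems

# Illustrative sample: one good line, one line missing its target output.
sample = [
    '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}]}',
    '{"messages": [{"role": "user", "content": "Hi"}]}',
]
print(validate_jsonl(sample))
```

Running a check like this before every upload catches most formatting problems early, when they are cheap to fix.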

Step 4: Split data into train, validation, and test sets

To know whether fine-tuning helps, you must evaluate on data the model never saw during training.

A common split:

  • 80% training
  • 10% validation (for monitoring during training)
  • 10% test (for final evaluation)

Try to ensure each split reflects the real distribution of use cases, not just the easiest or most common examples.
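
One way to implement that split, assuming your examples are already in a list (e.g., JSONL lines) and using a fixed seed so the split is reproducible:

```python
import random

def split_dataset(examples, train=0.8, val=0.1, seed=42):
    """Shuffle and split examples into train/validation/test lists."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train)
    n_val = int(n * val)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # remainder becomes the test set
    )

examples = [f"example-{i}" for i in range(100)]
train_set, val_set, test_set = split_dataset(examples)
print(len(train_set), len(val_set), len(test_set))  # 80 10 10
```

For classification tasks, consider a stratified split instead, so every category appears in each set in realistic proportions.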


Step 5: Create and run a fine-tuning job

The exact API call can vary by version, but the general flow is:

  1. Upload your training file
  2. Optionally upload a validation file
  3. Create a fine-tuning job specifying:
    • Base model (e.g., gpt-4o-mini)
    • Training file ID
    • Validation file ID (optional)
    • Hyperparameters (if exposed)

Sketched with the OpenAI Python SDK (method names may vary slightly by SDK version):

from openai import OpenAI
client = OpenAI()

training_file = client.files.create(
    file=open("training.jsonl", "rb"),
    purpose="fine-tune",
)
validation_file = client.files.create(
    file=open("validation.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    model="gpt-4o-mini",
    training_file=training_file.id,
    validation_file=validation_file.id,
)

The response will include a job ID that you can use to:

  • Check status (running, succeeded, failed)
  • Get metrics (loss values, etc.)
  • Retrieve the new fine-tuned model ID once finished

Step 6: Use your fine-tuned GPT model in production

Once training completes, you’ll see a model name like:

  • ft:gpt-4o-mini:your-org:2026-03-10:custom-support-bot

You use this just like any other model, but with the new name:

from openai import OpenAI
client = OpenAI()

response = client.chat.completions.create(
    model="ft:gpt-4o-mini:your-org:2026-03-10:custom-support-bot",
    messages=[
        {"role": "user", "content": "My package arrived damaged, what can I do?"}
    ]
)

print(response.choices[0].message.content)

You can now:

  • Route only relevant traffic to your fine-tuned model.
  • Compare its performance vs. the base model on real queries.
  • Iterate on your data and re‑fine‑tune if needed.

Step 7: Evaluate performance rigorously

To know if your fine-tuned GPT model is truly better, evaluate it systematically.

Quantitative evaluation

Using your held-out test set, compare:

  • Accuracy / classification score (for tagging, intent detection).
  • Exactness of format (valid JSON, correct schema fields, required sections).
  • Task-specific metrics, e.g.:
    • ROUGE/BLEU for summarization (rough signal)
    • Custom scoring scripts for correctness of fields
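
Format-exactness checks are easy to script. A sketch of two such metrics, valid-JSON rate and exact-match accuracy, run over made-up outputs (not real model responses):

```python
import json

def valid_json_rate(outputs):
    """Fraction of model outputs that parse as JSON."""
    ok = 0
    for text in outputs:
        try:
            json.loads(text)
            ok += 1
        except json.JSONDecodeError:
            pass
    return ok / len(outputs)

def exact_match_accuracy(predictions, references):
    """Fraction of predictions that match the reference exactly (whitespace stripped)."""
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return matches / len(references)

# Illustrative data only.
outputs = ['{"category": "billing"}', '{"category": "shipping"}', 'Sorry, I cannot help.']
print(valid_json_rate(outputs))  # 2/3, since the last output is not JSON

preds = ["billing", "shipping", "returns"]
refs = ["billing", "shipping", "other"]
print(exact_match_accuracy(preds, refs))  # 2/3
```

Run the same script against the base model and the fine-tuned model on your held-out test set, and compare the numbers.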

You can also log performance in production:

  • Percentage of outputs that require human correction
  • Average time to resolution (for support flows)
  • Number of follow-up prompts users need

Human / qualitative evaluation

Human reviewers (internal experts) should rate:

  • Relevance and correctness
  • Tone and brand alignment
  • Policy compliance and safety

Compare:

  • Base model with a prompt vs.
  • Fine-tuned model with minimal or no prompt

This tells you whether fine-tuning truly bought you more than careful prompt engineering.


Step 8: Iterate on data and behavior

Fine-tuning is rarely “one and done.” Use feedback to refine:

  • Add examples where the model fails frequently.
  • Remove or fix bad examples that produce undesired behavior.
  • Create specialized fine-tunes per use case if a single general model ends up juggling conflicting behaviors.

A useful loop:

  1. Log user queries + model outputs.
  2. Capture corrections or ideal answers from humans.
  3. Convert them into new training examples.
  4. Periodically re‑train an updated fine-tuned GPT model.
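
Step 3 of that loop, turning a logged query plus a human-approved answer into a new chat-format training line, might be sketched like this (the field names on the logged record are assumptions about your logging format):

```python
import json

def correction_to_example(record, system_prompt=None):
    """Convert a logged query + human-approved answer into one chat-format JSONL line."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": record["query"]})
    # Train on the corrected answer, never the original model output.
    messages.append({"role": "assistant", "content": record["corrected_answer"]})
    return json.dumps({"messages": messages})

record = {
    "query": "Where is my refund?",
    "model_output": "I don't know.",  # discarded
    "corrected_answer": "Refunds take 5-7 business days. Could you share your order ID?",
}
line = correction_to_example(record, system_prompt="You are ACME's support assistant.")
print(line)
```

Appending these lines to your training file (after the same hygiene checks as before) gives you a steadily improving dataset for the next fine-tune.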

Fine-tuning vs. prompting vs. tools

Before committing to fine-tune, compare alternatives:

Use prompting when

  • You can express requirements clearly in a system prompt.
  • The task changes often and you don’t want to maintain datasets.
  • You don’t have many labeled examples.

Use tools / actions / retrieval when

  • You need live data (e.g., current prices, inventories, policies).
  • You must integrate with external systems (CRMs, ticketing, databases).
  • Your main problem is data freshness, not style or task behavior.

Use fine-tuning when

  • You have stable, reusable patterns you want “baked in.”
  • You need high consistency, especially in formatting or tone.
  • You’re willing to invest in dataset creation and maintenance.

These techniques are complementary; many strong systems combine:

  • A fine-tuned GPT model for style and structure
  • Tools / actions for real-time data
  • Retrieval for long-tail or changing knowledge

Practical tips to get better fine-tuning results

  • Be explicit in outputs
    If you want sections like “Summary”, “Risks”, “Next steps”, include them exactly in training examples.

  • Normalize voice and tone
    Use a single, consistent style: level of formality, length, politeness, and brand language.

  • Avoid including random noise
    Don’t train on messy logs that contain mistakes, apologies, or off-topic digressions unless you want the model to imitate them.

  • Start small and iterate
    Begin with a subset of high-quality examples, run a fine-tune, and test. Add more data strategically based on failure modes.

  • Monitor for regressions
    Fine-tuning can sometimes hurt performance on tasks you didn’t train on. Keep an eye on broader usage patterns.


Common mistakes to avoid

  • Using too few, too narrow examples, then expecting broad generalization.
  • Mixing conflicting instructions, e.g., examples where sometimes you apologize, sometimes you don’t, with no clear pattern.
  • Training on low-quality or unreviewed outputs, which bakes in existing mistakes.
  • Ignoring evaluation, so you can’t tell if the fine-tuned GPT model is truly better than a prompt-only solution.
  • Trying to use fine-tuning to add new knowledge that could change often (fine-tuning doesn’t automatically stay up-to-date).

Summary

Fine-tuning a GPT model is a powerful way to:

  • Encode your brand voice, formatting rules, and domain standards.
  • Improve performance on well-defined, repetitive tasks.
  • Reduce prompt complexity and sometimes lower cost/latency by moving to a smaller, specialized model.

The core steps are:

  1. Define your target behavior clearly.
  2. Collect and clean high-quality input/output examples.
  3. Format them as JSONL and split into train/validation/test sets.
  4. Run a fine-tuning job on a suitable base model.
  5. Use the new fine-tuned model ID in your API calls.
  6. Evaluate, monitor, and refine with new data.

With a thoughtful data strategy and careful evaluation, fine-tuning can turn a general GPT into a highly effective, specialized model tailored to your exact workflows.