
How do I fine-tune a GPT model?
Fine-tuning a GPT model lets you adapt a powerful general model to your specific use case, data, tone, and workflows. Instead of building a model from scratch, you start from a strong base (like GPT‑4.1 or GPT‑4o-mini) and teach it how you want it to behave using carefully prepared examples.
Below is a practical, step‑by‑step guide to how to fine‑tune a GPT model, what you need to prepare, and how to decide whether fine‑tuning is even the right solution.
When should you fine-tune a GPT model?
Fine-tuning is useful when:
- You repeat the same instructions constantly: e.g., “answer in JSON,” “never mention internal tools,” “use our brand tone.”
- You need domain specialization: e.g., legal drafting, medical support content, code following internal patterns, or industry-specific terminology.
- You want consistent structured outputs: e.g., always returning the same schema for product descriptions, support triage, or lead qualification.
- You have proprietary examples you can’t easily encode in a prompt: e.g., “these 10,000 customer emails and the exact responses we want.”
Fine-tuning is often not needed when:
- A long, well-crafted system prompt already works reliably.
- You mainly need fresh or dynamic data (where tools / GPT Actions or retrieval are better).
- You have very little training data (e.g., fewer than ~20–50 high-quality examples).
Key concepts: how fine-tuning works
At a high level:
- You choose a base model (e.g., gpt-4o-mini).
- You prepare training examples (input + desired output).
- You upload the dataset and create a fine-tuning job with OpenAI’s API.
- OpenAI trains a new variant of that model just for you.
- You use this new model ID in your regular API calls.
Fine-tuning can improve:
- Instruction following (style, tone, format).
- Task performance on narrow domains.
- Latency and cost if you fine-tune a smaller model to do a specialized task efficiently.
Step 1: Decide what behavior you want to learn
Before touching data or code, define your goal clearly. Examples:
- “Generate product descriptions in our brand voice, with bullet features and a one-sentence CTA.”
- “Summarize legal agreements into a fixed JSON schema with clearly labeled risk fields.”
- “Classify support tickets into 7 categories and suggest a tag.”
Write your goal in one or two sentences and use this to guide:
- What data you collect
- How you structure your training examples
- What you measure after fine-tuning (success criteria)
Step 2: Collect and prepare training data
Good fine-tuning depends far more on data quality than on data volume.
What makes a strong training example?
Each example should show:
- A clear input: the exact type of message or content the model will see (prompt, conversation, document, etc.).
- The ideal output: what you would want the model to return, in your desired tone and format.
- Consistency: every example should follow the same style, structure, and rules you care about.
Examples of useful sources:
- Historical chat logs and responses (support, sales, onboarding).
- Approved marketing copy and the briefs that created it.
- Before / after examples where a human edited AI output into “perfect” form.
- Internal decision trees or rubrics turned into AI response examples.
How many examples do you need?
Guidelines (not strict rules):
- 20–50 examples – You can sometimes improve style, format, or light specialization.
- 100–500 examples – Better for stable gains on a specific task (classification, summarization, Q&A).
- 1,000+ examples – Useful for complex or varied behavior (e.g., broad customer support flows).
Quality beats quantity. A clean set of 200 excellent examples is more valuable than 2,000 noisy or inconsistent ones.
Step 3: Choose a data format
OpenAI fine-tuning typically uses JSONL (JSON Lines), where each line is a separate training example.
Common formats
- Chat-style format (messages): for tasks that resemble conversations. A record looks like this (shown across multiple lines for readability; in the actual file, each record sits on a single line):

{"messages": [
  {"role": "system", "content": "You are a helpful support assistant for ACME Corp."},
  {"role": "user", "content": "My order is late, what should I do?"},
  {"role": "assistant", "content": "I'm sorry your order is delayed. Please share your order ID..."}
]}

- Prompt/completion format: for simpler “input → output” tasks, used by older completion-style base models:

{"prompt": "Generate a product description for: Wireless Noise-Cancelling Headphones", "completion": "These wireless noise-cancelling headphones offer up to 30 hours of battery life..."}
Pick the format that best matches how you will call the model later. If you’ll use chat completions in production, use chat-style examples.
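As a concrete sketch, chat-style examples can be written out to a JSONL file with a few lines of Python (the file name and example content here are illustrative):

```python
import json

# Each element of this list becomes one line in the JSONL training file.
examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a helpful support assistant for ACME Corp."},
            {"role": "user", "content": "My order is late, what should I do?"},
            {"role": "assistant", "content": "I'm sorry your order is delayed. Please share your order ID..."},
        ]
    },
]

with open("training.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        # json.dumps with default settings keeps each record on a single
        # line, which is exactly what the JSONL format requires.
        f.write(json.dumps(example) + "\n")
```

Writing the file programmatically (rather than by hand) makes it much easier to keep hundreds of examples consistent.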
Data hygiene best practices
- Remove sensitive data unless you explicitly intend to include it and are compliant with your policies.
- Normalize formatting: same quote style, bullet style, response prefixes, etc.
- Remove contradictory examples: they confuse the model.
- Ensure labels are accurate if you’re doing classification or tagging.
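A lightweight validation pass can catch many of these issues before you upload anything. The sketch below (the specific checks and return format are illustrative, not an official validator) verifies that every record parses, uses known roles, has non-empty content, and ends with an assistant turn:

```python
import json

VALID_ROLES = {"system", "user", "assistant"}

def validate_jsonl(path):
    """Return a list of (line_number, problem) tuples; an empty list means clean."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                problems.append((i, "not valid JSON"))
                continue
            messages = record.get("messages", [])
            if not messages:
                problems.append((i, "no messages"))
                continue
            if any(m.get("role") not in VALID_ROLES for m in messages):
                problems.append((i, "unknown role"))
            if any(not m.get("content", "").strip() for m in messages):
                problems.append((i, "empty content"))
            # Training examples should demonstrate the desired reply,
            # so the final turn should belong to the assistant.
            if messages[-1].get("role") != "assistant":
                problems.append((i, "last message is not the assistant's"))
    return problems
```

Run it over your file and fix every reported line before creating the fine-tuning job.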
Step 4: Split data into train, validation, and test sets
To know whether fine-tuning helps, you must evaluate on data the model never saw during training.
A common split:
- 80% training
- 10% validation (for monitoring during training)
- 10% test (for final evaluation)
Try to ensure each split reflects the real distribution of use cases, not just the easiest or most common examples.
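The split above takes only a few lines of Python; here is a minimal sketch of a shuffled 80/10/10 split (the fixed seed is just for reproducibility):

```python
import random

def split_dataset(examples, seed=42):
    """Shuffle and split examples into 80% train, 10% validation, 10% test."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)  # fixed seed so the split is reproducible
    n = len(examples)
    n_train = int(n * 0.8)
    n_val = int(n * 0.1)
    train = examples[:n_train]
    val = examples[n_train:n_train + n_val]
    test = examples[n_train + n_val:]
    return train, val, test

train, val, test = split_dataset(range(100))
```

For classification tasks, consider a stratified split instead, so that rare categories appear in all three sets.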
Step 5: Create and run a fine-tuning job
The exact API call can vary by version, but the general flow is:
- Upload your training file
- Optionally upload a validation file
- Create a fine-tuning job specifying:
  - Base model (e.g., gpt-4o-mini)
  - Training file ID
  - Validation file ID (optional)
  - Hyperparameters (if exposed)
In pseudocode:
openai files create \
-f training.jsonl \
-p "fine-tune"
openai fine_tuning.jobs.create \
-m gpt-4o-mini \
-t <training_file_id> \
-v <validation_file_id>
The response will include a job ID that you can use to:
- Check status (running, succeeded, failed)
- Get metrics (loss values, etc.)
- Retrieve the new fine-tuned model ID once finished
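With the official openai Python SDK (v1.x), the same flow looks roughly like the sketch below. The base model snapshot name and file path are placeholders for your own values, and the network calls are only attempted when an API key is configured:

```python
import os

# Parameters for the job; adjust the base model and file path to your setup.
job_params = {
    "model": "gpt-4o-mini-2024-07-18",  # example base model snapshot
    "training_file": None,              # filled in after the upload below
}

if os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI

    client = OpenAI()

    # 1. Upload the training data.
    uploaded = client.files.create(
        file=open("training.jsonl", "rb"),
        purpose="fine-tune",
    )
    job_params["training_file"] = uploaded.id

    # 2. Create the fine-tuning job.
    job = client.fine_tuning.jobs.create(**job_params)

    # 3. Check status; the fine-tuned model ID appears once the job succeeds.
    status = client.fine_tuning.jobs.retrieve(job.id)
    print(status.status, status.fine_tuned_model)
```

In practice you would poll step 3 (or use webhooks/events) until the job reaches a terminal state.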
Step 6: Use your fine-tuned GPT model in production
Once training completes, you’ll see a model name like:
ft:gpt-4o-mini:your-org:2026-03-10:custom-support-bot
You use this just like any other model, but with the new name:
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="ft:gpt-4o-mini:your-org:2026-03-10:custom-support-bot",
    messages=[
        {"role": "user", "content": "My package arrived damaged, what can I do?"}
    ],
)
print(response.choices[0].message.content)
You can now:
- Route only relevant traffic to your fine-tuned model.
- Compare its performance vs. the base model on real queries.
- Iterate on your data and re‑fine‑tune if needed.
Step 7: Evaluate performance rigorously
To know if your fine-tuned GPT model is truly better, evaluate it systematically.
Quantitative evaluation
Using your held-out test set, compare:
- Accuracy / classification score (for tagging, intent detection).
- Exactness of format (valid JSON, correct schema fields, required sections).
- Task-specific metrics, e.g.:
- ROUGE/BLEU for summarization (rough signal)
- Custom scoring scripts for correctness of fields
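For structured-output tasks, a small harness can score format validity and field accuracy over the held-out test set. A sketch, where the required schema fields are made up for illustration:

```python
import json

REQUIRED_FIELDS = {"category", "priority"}  # illustrative schema fields

def score_outputs(outputs, references):
    """Compare model outputs (raw JSON strings) against reference dicts."""
    valid_format = 0
    field_matches = 0
    total_fields = 0
    for raw, ref in zip(outputs, references):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # invalid JSON counts against the format-validity rate
        if REQUIRED_FIELDS <= parsed.keys():
            valid_format += 1
        for field in REQUIRED_FIELDS:
            total_fields += 1
            if parsed.get(field) == ref.get(field):
                field_matches += 1
    n = len(outputs)
    return {
        "valid_format_rate": valid_format / n if n else 0.0,
        "field_accuracy": field_matches / total_fields if total_fields else 0.0,
    }
```

Running the same harness against both the base model and the fine-tuned model gives you a direct, numeric comparison.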
You can also log performance in production:
- Percentage of outputs that require human correction
- Average time to resolution (for support flows)
- Number of follow-up prompts users need
Human / qualitative evaluation
Human reviewers (internal experts) should rate:
- Relevance and correctness
- Tone and brand alignment
- Policy compliance and safety
Compare:
- Base model with a prompt vs.
- Fine-tuned model with minimal or no prompt
This tells you whether fine-tuning truly bought you more than careful prompt engineering.
Step 8: Iterate on data and behavior
Fine-tuning is rarely “one and done.” Use feedback to refine:
- Add examples where the model fails frequently.
- Remove or fix bad examples that produce undesired behavior.
- Create specialized fine-tunes per use case if one general model becomes too conflicted.
A useful loop:
- Log user queries + model outputs.
- Capture corrections or ideal answers from humans.
- Convert them into new training examples.
- Periodically re‑train an updated fine-tuned GPT model.
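The last two steps of that loop are easy to automate. The sketch below turns logged (query, human-corrected answer) pairs into chat-format records ready to append to a JSONL file; the system prompt is an illustrative placeholder:

```python
import json

SYSTEM_PROMPT = "You are a helpful support assistant for ACME Corp."  # illustrative

def corrections_to_examples(pairs):
    """Convert (user_query, human_corrected_answer) pairs into training records."""
    records = []
    for query, corrected in pairs:
        records.append({
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": query},
                {"role": "assistant", "content": corrected},
            ]
        })
    return records

def append_to_jsonl(records, path):
    """Append new records to an existing JSONL training file."""
    with open(path, "a", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
```

Human-reviewed corrections fed back this way tend to target exactly the failure modes you most want to fix.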
Fine-tuning vs. prompting vs. tools
Before committing to fine-tune, compare alternatives:
Use prompting when
- You can express requirements clearly in a system prompt.
- The task changes often and you don’t want to maintain datasets.
- You don’t have many labeled examples.
Use tools / actions / retrieval when
- You need live data (e.g., current prices, inventories, policies).
- You must integrate with external systems (CRMs, ticketing, databases).
- Your main problem is data freshness, not style or task behavior.
Use fine-tuning when
- You have stable, reusable patterns you want “baked in.”
- You need high consistency, especially in formatting or tone.
- You’re willing to invest in dataset creation and maintenance.
These techniques are complementary; many strong systems combine:
- A fine-tuned GPT model for style and structure
- Tools / actions for real-time data
- Retrieval for long-tail or changing knowledge
Practical tips to get better fine-tuning results
- Be explicit in outputs: if you want sections like “Summary”, “Risks”, “Next steps”, include them exactly in training examples.
- Normalize voice and tone: use a single, consistent style for level of formality, length, politeness, and brand language.
- Avoid including random noise: don’t train on messy logs that contain mistakes, apologies, or off-topic digressions unless you want the model to imitate them.
- Start small and iterate: begin with a subset of high-quality examples, run a fine-tune, and test. Add more data strategically based on failure modes.
- Monitor for regressions: fine-tuning can sometimes hurt performance on tasks you didn’t train on. Keep an eye on broader usage patterns.
Common mistakes to avoid
- Using too few, too narrow examples, then expecting broad generalization.
- Mixing conflicting instructions, e.g., examples where sometimes you apologize, sometimes you don’t, with no clear pattern.
- Training on low-quality or unreviewed outputs, which bakes in existing mistakes.
- Ignoring evaluation, so you can’t tell if the fine-tuned GPT model is truly better than a prompt-only solution.
- Trying to use fine-tuning to add new knowledge that could change often (fine-tuning doesn’t automatically stay up-to-date).
Summary
Fine-tuning a GPT model is a powerful way to:
- Encode your brand voice, formatting rules, and domain standards.
- Improve performance on well-defined, repetitive tasks.
- Reduce prompt complexity and sometimes lower cost/latency by moving to a smaller, specialized model.
The core steps are:
- Define your target behavior clearly.
- Collect and clean high-quality input/output examples.
- Format them as JSONL and split into train/validation/test sets.
- Run a fine-tuning job on a suitable base model.
- Use the new fine-tuned model ID in your API calls.
- Evaluate, monitor, and refine with new data.
With a thoughtful data strategy and careful evaluation, fine-tuning can turn a general GPT into a highly effective, specialized model tailored to your exact workflows.