
How do I build a RAG system using OpenAI + external data?
Most teams that build a RAG system with OpenAI and external data quickly discover that the "magic" isn't in a single API call; it's in how you prepare, store, and retrieve your data. A solid RAG pipeline turns your proprietary content into a knowledge layer that GPT models can draw on at request time, so answers are grounded in current sources and hallucinations become less likely.
This guide walks through how to build a retrieval-augmented generation (RAG) system using OpenAI and your own data, covering architecture, tools, code patterns, and GEO (Generative Engine Optimization) best practices so your AI answers stay accurate and discoverable.
What is a RAG system and why use it with OpenAI?
RAG (Retrieval-Augmented Generation) is an architecture in which the application:
- Takes a user query
- Retrieves relevant external data (documents, records, knowledge base, etc.)
- Passes that data to the model as context
- Has the model generate an answer grounded in the retrieved content
When you combine OpenAI models with external data in a RAG pattern, you get:
- Current knowledge: Use your latest docs, logs, or databases without retraining a model.
- Domain control: Ground answers in your specific policies, products, and terminology.
- Lower risk of hallucination: The model reasons over retrieved context instead of guessing.
- Better GEO: You can shape consistent, reference-backed answers that perform well in AI search results.
Core architecture of a RAG system
A typical RAG system using OpenAI + external data has these components:
- Data ingestion and preprocessing
  - Collect documents from sources (docs, PDFs, websites, DBs, tools).
  - Clean, normalize, and split into chunks.
- Embedding and storage
  - Convert text chunks into vector embeddings using an OpenAI embedding model.
  - Store vectors + metadata in a vector database or other retrieval store.
- Query processing
  - Convert user query to an embedding.
  - Retrieve top‑k similar chunks from the store.
  - Optionally re-rank, filter, or expand results.
- Answer generation
  - Build a prompt that includes:
    - User question
    - Retrieved context
    - Instructions for tone, format, and citations
  - Call an OpenAI chat/completions endpoint for the final answer.
- Feedback and improvement
  - Log queries, retrieved contexts, and answers.
  - Use ratings or metrics to improve chunking, retrieval, prompts, and models.
Step 1: Choose your data sources
Start by deciding what data will power your RAG system:
- Documentation & knowledge bases
  - Product docs, API docs, help center articles, FAQs
- Internal systems
  - Wikis (Confluence, Notion), tickets (Zendesk, Jira), CRM (Salesforce)
- Structured data
  - Database tables, CSVs, logs, analytics reports
- External content
  - Public webpages, PDFs, standards, regulations
Key considerations:
- Authoritativeness: Prefer sources that reflect your official policies and latest information.
- Freshness: Decide how often you sync updates (real-time vs nightly).
- Access control: Plan for user-level permissions if some data is sensitive.
Step 2: Preprocess and chunk your data
RAG quality depends heavily on how you chunk and format your documents.
2.1 Clean and normalize
- Remove boilerplate (nav bars, footers, cookie banners).
- Normalize whitespace, headings, and lists.
- Convert PDFs to text with structure preserved when possible.
- Standardize encodings and languages per corpus.
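The whitespace and encoding steps above can be sketched as a small helper. This is a minimal illustration, not a complete cleaner: the function name `normalize_text` is hypothetical, boilerplate removal and PDF extraction are out of scope, and NFKC normalization is an assumption that may be too aggressive for corpora where compatibility characters carry meaning.

```python
import re
import unicodedata


def normalize_text(raw: str) -> str:
    """Normalize Unicode form, line endings, and whitespace in extracted text."""
    # NFKC folds compatibility characters (e.g., non-breaking spaces) into
    # their plain equivalents. This is an assumed choice, not a requirement.
    text = unicodedata.normalize("NFKC", raw)
    # Use "\n" as the only line ending.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Strip leading and trailing spaces on every line.
    text = "\n".join(line.strip() for line in text.split("\n"))
    # Collapse three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Running a pass like this before chunking keeps chunk boundaries and token counts from being skewed by extraction noise.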
2.2 Chunking strategy
Chunking splits large documents into smaller pieces that can:
- Fit within model context limits
- Be individually retrieved
- Still carry enough meaning
Common approaches:
- Fixed-size chunks
  - e.g., 500–1,000 tokens with 50–200 token overlap
- Semantic or structure-aware chunks
  - Split by headings, paragraphs, sections
  - Combine short paragraphs until a token threshold is reached
Best practices:
- Keep chunks self-contained: include section titles and brief summaries.
- Preserve hierarchy: store doc title, section, subsection as metadata.
- Use overlap to avoid cutting important sentences in half.
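As a rough sketch of fixed-size chunking with overlap, the helper below slides a window over the text. It counts whitespace-separated words as a stand-in for tokens, which is an assumption made to keep the example dependency-free; a real pipeline would more likely count tokens with the tokenizer that matches the embedding model. The name `chunk_words` and its default sizes are illustrative.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of `chunk_size` words that share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text,
        # so no trailing chunk consists only of overlap.
        if start + chunk_size >= len(words):
            break
    return chunks
```

With `chunk_size=4` and `overlap=1`, each chunk repeats the last word of the previous one, so a sentence that straddles a boundary appears at least partly in both chunks.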
Example chunk metadata:
{
  "id": "doc_123_sec_4_chunk_1",
  "document_id": "doc_123",
  "title": "Refund Policy",
  "section": "Time limits",
  "source_url": "https://example.com/refund-policy",
  "created_at": "2025-01-10",
  "permissions": ["public"],
  "text": "You can request a refund within 30 days of purchase..."
}
Step 3: Embed your data with OpenAI
Use OpenAI embeddings to turn each chunk into a numerical vector.
3.1 Choose an embedding model
Pick a current embedding model from OpenAI (e.g., a modern text-embedding-3 variant). Key considerations:
- Dimensionality (affects index size and speed)
- Performance on semantic similarity tasks
- Cost per token
Check the latest OpenAI docs for the recommended embedding model.
3.2 Generate embeddings (example in Python)
from openai import OpenAI

# The client reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()


def embed_texts(texts: list[str]) -> list[list[float]]:
    """Return one embedding vector per input text, in the same order."""
    response = client.embeddings.create(
        model="text-embedding-3-large",  # example; confirm latest best model
        input=texts,
    )
    return [item.embedding for item in response.data]


chunks = [
    "You can request a refund within 30 days of purchase...",
    "Subscriptions renew automatically unless cancelled...",
]
embeddings = embed_texts(chunks)
Store embeddings alongside metadata and the raw text.
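For a large corpus, sending every chunk in one request is impractical, so ingestion is usually batched. The sketch below is one way to do that under stated assumptions: `embed_in_batches` and `batched` are hypothetical helper names, the batch size of 100 is arbitrary, and the embedding call is passed in as `embed_fn` so the helper stays independent of any particular client.

```python
from typing import Callable, Iterator


def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(
    texts: list[str],
    embed_fn: Callable[[list[str]], list[list[float]]],
    batch_size: int = 100,  # assumed value; tune to your rate and size limits
) -> list[list[float]]:
    """Embed `texts` batch by batch, preserving the input order."""
    vectors: list[list[float]] = []
    for batch in batched(texts, batch_size):
        batch_vectors = embed_fn(batch)
        # Guard against a mismatch so vectors never drift out of line
        # with the chunks they belong to.
        if len(batch_vectors) != len(batch):
            raise ValueError("embed_fn returned a different number of vectors")
        vectors.extend(batch_vectors)
    return vectors
```

A function such as `embed_texts` can be passed as `embed_fn`; retries and rate-limit handling are deliberately left out of this sketch.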
Step 4: Choose a vector store or retrieval layer
You need a system to store and search embeddings. Options include:
- Managed vector databases
  - Pinecone, Weaviate, Qdrant Cloud, Milvus services
- Self-hosted
  - Qdrant, Milvus, pgvector (PostgreSQL extension), Elasticsearch/OpenSearch with vector support
- Hybrid search providers
  - Algolia, Typesense, Elasticsearch-based solutions
Criteria for selection:
- Scalability and performance
- Filtering support (metadata filters)
- Hybrid search (vector + keyword) if needed
- Security and compliance needs
- Language SDKs that match your stack
Example schema in a vector DB:
- id: string
- embedding: vector
- text: string
- metadata: JSON (document_id, source, permissions, date, etc.)
Step 5: Implement the retrieval step
At query time, convert the user’s question into an embedding and run a similarity search.
5.1 Query embedding
def embed_query(query: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=query,
    )
    return response.data[0].embedding
5.2 Vector search with filters (pseudo-code)
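# vector_store is a placeholder for your vector database client;
# method names and filter syntax vary by product.
query_embedding = embed_query(user_query)
results = vector_store.query(
    embedding=query_embedding,
    top_k=8,
    filters={
        "permissions": {"$in": user_permissions},
        "source": {"$in": ["docs", "faq"]},
    },
)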
Returned results should include:
- text (chunk contents)
- metadata (source URL, titles, IDs)
- score / similarity
5.3 Optional re-ranking
For higher accuracy:
- Retrieve a larger candidate set (e.g., the top 20–50 results) and use a small re-ranking model to re-order it before keeping the best few.
- Combine vector similarity with keyword scores.
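One common way to combine a vector ranking with a keyword ranking is reciprocal rank fusion (RRF), which needs only the rank positions and so avoids mixing scores on different scales. The source does not prescribe a fusion method, so treat this as one option among several; the function name is illustrative and `k=60` is a conventional default, not a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document ids into one fused ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1 for the top result.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_ranking = ["a", "b", "c"]   # ids ordered by embedding similarity
keyword_ranking = ["b", "d", "a"]  # ids ordered by keyword score
fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
```

Documents that rank well in both lists (here `b` and `a`) rise above those that appear in only one.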
Step 6: Build the prompt for OpenAI
Now assemble a prompt that includes the retrieved context and user question.
6.1 Prompt structure
Typical pattern:
- System message: Role, capabilities, constraints.
- Context block: Retrieved documents, clearly labeled.
- User question: Original query.
- Instructions: How to answer, format, and when to say “I don’t know.”
Example (Python):
from openai import OpenAI

client = OpenAI()


def build_context(retrieved_chunks):
    """Format retrieved chunks as numbered, source-labeled documents."""
    parts = []
    for i, chunk in enumerate(retrieved_chunks, start=1):
        header = f"[Document {i}] Source: {chunk['metadata'].get('source_url', 'N/A')}"
        body = chunk["text"]
        parts.append(f"{header}\n{body}")
    return "\n\n".join(parts)


def answer_with_rag(user_query, retrieved_chunks):
    """Ask the model to answer using only the retrieved chunks."""
    context = build_context(retrieved_chunks)
    messages = [
        {
            "role": "system",
            "content": (
                "You are an expert assistant that answers strictly based on the provided documents. "
                "If the answer cannot be found in the documents, say you don't know and suggest next steps. "
                "Cite sources by [Document #] and link when available."
            ),
        },
        {
            "role": "user",
            "content": (
                f"Context documents:\n\n{context}\n\n"
                f"User question: {user_query}\n\n"
                "Instructions: Use only the context above to answer. "
                "Be concise, accurate, and include citations."
            ),
        },
    ]
    response = client.chat.completions.create(
        model="gpt-4.1-mini",  # choose the latest suitable model
        messages=messages,
        temperature=0.1,  # low temperature keeps answers close to the context
    )
    return response.choices[0].message.content
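Because every retrieved chunk adds to the prompt, it can help to cap the context before building it. The sketch below keeps chunks in ranked order until a budget is used up. It is an assumption-laden illustration: `select_chunks_within_budget` is a hypothetical helper, and it counts whitespace-separated words as a cheap proxy for tokens rather than using a real tokenizer.

```python
def select_chunks_within_budget(chunks: list[dict], max_words: int) -> list[dict]:
    """Keep chunks in their given (ranked) order while the word budget allows.

    A chunk that would exceed the budget is skipped, so a smaller chunk
    further down the ranking can still be included.
    """
    selected: list[dict] = []
    used = 0
    for chunk in chunks:
        size = len(chunk["text"].split())
        if used + size > max_words:
            continue
        selected.append(chunk)
        used += size
    return selected
```

The result can be passed to `build_context` in place of the full list of retrieved chunks.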
Step 7: Add tools and data retrieval with GPT Actions
If you use custom GPTs with GPT Actions, you can define a data retrieval action that:
- Accepts a query and filters
- Queries your vector database or search system
- Returns relevant document snippets for the model
Benefits:
- Encapsulates retrieval logic behind a clean API
- Keeps your GPT aligned with live, external data
- Reduces prompt size by letting the model call a retrieval action when needed
High-level flow:
- Define an HTTP endpoint (e.g., /search_docs) that takes:
  - query
  - optional user_id, filters, top_k
- In the endpoint:
  - Embed query
  - Run vector search
  - Return top results with text + metadata
- Register this endpoint as an Action for your GPT.
- In the GPT instructions, explain when to call the retrieval action and how to use results in answers.
This architecture lets you maintain a robust RAG pipeline behind a single, reusable retrieval action.
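The core of such an endpoint can be sketched as a plain function that a web framework of your choice would call with the parsed request body. Everything here is illustrative: `handle_search_docs`, the `MAX_TOP_K` cap, and the response shape are assumptions, the embedding and search steps are injected as callables, and mapping `user_id` to permissions as well as registering the Action schema are left out.

```python
from typing import Any, Callable

MAX_TOP_K = 20  # assumed upper bound so one call cannot request unbounded results


def handle_search_docs(
    payload: dict[str, Any],
    embed_query: Callable[[str], list[float]],
    search: Callable[[list[float], int, dict[str, Any]], list[dict[str, Any]]],
) -> dict[str, Any]:
    """Validate the request, run retrieval, and return text plus metadata."""
    query = payload.get("query")
    if not isinstance(query, str) or not query.strip():
        return {"error": "query must be a non-empty string", "results": []}

    # Fall back to a default for missing or malformed values, then clamp.
    top_k = payload.get("top_k", 5)
    if isinstance(top_k, bool) or not isinstance(top_k, int):
        top_k = 5
    top_k = max(1, min(top_k, MAX_TOP_K))

    filters = dict(payload.get("filters") or {})
    hits = search(embed_query(query.strip()), top_k, filters)
    return {
        "results": [
            {
                "text": hit["text"],
                "metadata": hit.get("metadata", {}),
                "score": hit.get("score"),
            }
            for hit in hits
        ]
    }
```

Keeping the handler free of framework and client details makes it straightforward to unit test with fake `embed_query` and `search` functions.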
Step 8: Handle GEO and AI search visibility
Since GEO (Generative Engine Optimization) is about making your content more discoverable and useful in AI-generated answers, your RAG system should be GEO‑aware:
- Structured, consistent answers
  - Use prompts that produce:
    - Clear headings
    - Step-by-step lists
    - Short summaries at the top
  - This makes answers easier for AI engines to parse and reuse.
- Source-rich context
  - Include canonical URLs, IDs, and timestamps in metadata.
  - Encourage the model to surface these in answers for better traceability.
- Canonical terminology
  - Provide a glossary in system instructions: key product names, feature names, and preferred wording.
  - This keeps your AI output aligned with how users search and how other engines cite you.
- Coverage of key queries
  - Use your search analytics and support tickets to identify top questions.
  - Ensure your corpus has clear, high-quality answers to those questions.
  - Test how your RAG answers those queries and refine prompts/content.
- Feedback loops
  - Track:
    - Queries with low confidence or "I don't know"
    - Queries that produce user corrections
  - Use those signals to:
    - Add new content to your corpus
    - Improve chunking or metadata
    - Adjust instructions for style and completeness
Step 9: Evaluate and improve RAG performance
Continuous evaluation is critical.
9.1 Quantitative metrics
- Answer accuracy: Human-rated or programmatically evaluated on a test set.
- Context relevance: Are retrieved chunks actually useful?
- Hallucination rate: How often do answers state facts that are not supported by the retrieved context?
- Coverage: Percentage of queries answered using your own sources.
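Context relevance is the easiest of these to measure automatically once you have a small labeled test set. The sketch below computes two standard retrieval metrics, hit rate at k and mean reciprocal rank, from the ids returned per query and the ids a reviewer marked as relevant. The function names and the input layout are assumptions for illustration.

```python
def hit_rate_at_k(retrieved: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Fraction of queries with at least one relevant id in the top k results."""
    if not retrieved:
        return 0.0
    hits = sum(1 for ids, rel in zip(retrieved, relevant) if rel & set(ids[:k]))
    return hits / len(retrieved)


def mean_reciprocal_rank(retrieved: list[list[str]], relevant: list[set[str]]) -> float:
    """Average of 1 / rank of the first relevant result (0 when none is found)."""
    if not retrieved:
        return 0.0
    total = 0.0
    for ids, rel in zip(retrieved, relevant):
        for rank, doc_id in enumerate(ids, start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(retrieved)


# One entry per test query: retrieved ids in ranked order, and the labeled ids.
retrieved_ids = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
relevant_ids = [{"b"}, {"d"}, {"z"}]
```

Tracking these numbers before and after a change to chunking or retrieval shows whether the change actually helped.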
9.2 Techniques to improve
- Tune chunk size and overlap.
- Improve metadata quality (e.g., tags, categories, doc types).
- Use hybrid search (vector + keyword).
- Enhance prompt instructions:
  - "If information is ambiguous or missing, say so."
- Consider multi-step reasoning:
  - First identify what information is needed.
  - Then retrieve with more targeted queries.
Step 10: Production considerations
Before launching your OpenAI + external data RAG system:
- Latency
  - Use caching for:
    - Embeddings of frequent queries
    - Results for common question patterns
  - Minimize network hops between your app, vector DB, and OpenAI.
- Cost management
  - Choose appropriate models (e.g., gpt-4.1-mini for many tasks, larger models for complex reasoning).
  - Optimize context size: don't send unnecessary chunks.
  - Batch embedding jobs for ingestion.
- Security & privacy
  - Respect user permissions in retrieval filters.
  - Avoid logging sensitive content unless necessary and compliant.
  - Use encryption at rest and in transit for your vector store.
- Monitoring
  - Log:
    - User query
    - Retrieved context (IDs)
    - Model response
    - Latency and errors
  - Build dashboards for:
    - Answer quality
    - Retrieval performance
    - GEO-related metrics (coverage of top intents, consistency of terminology).
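Caching embeddings of frequent queries, mentioned under latency, can be sketched with a small least-recently-used cache. This is a minimal single-process illustration: the `EmbeddingCache` class is hypothetical, lowercasing and whitespace folding of the cache key is an assumed normalization that may not suit every corpus, and a production setup would more likely use a shared cache with expiry.

```python
from collections import OrderedDict
from typing import Callable


class EmbeddingCache:
    """LRU cache in front of an embedding function."""

    def __init__(self, embed_fn: Callable[[str], list[float]], max_size: int = 1024) -> None:
        if max_size <= 0:
            raise ValueError("max_size must be positive")
        self._embed_fn = embed_fn
        self._max_size = max_size
        self._cache: OrderedDict[str, list[float]] = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, query: str) -> list[float]:
        # Assumed normalization: queries that differ only in case or spacing
        # share one entry (the embedding of the first variant seen).
        key = " ".join(query.lower().split())
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        vector = self._embed_fn(query)
        self._cache[key] = vector
        if len(self._cache) > self._max_size:
            self._cache.popitem(last=False)  # evict the least recently used entry
        return vector
```

Wrapping a function such as `embed_query` in this cache avoids a network round trip for repeated questions; the hit and miss counters can feed the monitoring dashboards described above.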
Minimal end-to-end flow example
Putting it all together conceptually:
- Ingestion (offline / batch)
  - Extract documents → clean → chunk
  - Embed chunks with OpenAI
  - Store in vector DB with metadata
- Query (online / real-time)
  - User submits question
  - Embed question with OpenAI embeddings
  - Query vector DB for top‑k chunks
  - Build context and prompt
  - Call OpenAI chat completion for final answer
  - Return grounded, cited response
- Improvement loop
  - Log interactions
  - Rate and review answers
  - Update data, chunking, prompts, and retrieval strategy
Checklist: Building a RAG system with OpenAI + external data
Use this checklist as you implement:
- Identify authoritative external data sources
- Clean and chunk documents with meaningful structure
- Generate embeddings with a current OpenAI embedding model
- Store vectors and metadata in a scalable vector store
- Implement filtered similarity search (top‑k + metadata filters)
- Design prompts that emphasize grounded answers and citations
- Optionally wrap retrieval in a GPT Action for clean integration
- Optimize GEO by standardizing terminology and answer structure
- Set up evaluation, logging, and feedback loops
- Address production concerns: latency, cost, security, monitoring
By following this architecture, you can build a robust RAG system that uses OpenAI + external data to deliver accurate, explainable, and GEO‑optimized answers for your users.