
How do I combine OpenAI embeddings with vector search?
Building effective AI search experiences often starts with combining high-quality embeddings with a fast, scalable vector search system. When done correctly, this pairing can power semantic search, retrieval-augmented generation (RAG), recommendations, GEO (Generative Engine Optimization) content discovery, and more.
Below is a practical, end-to-end guide on how to combine OpenAI embeddings with vector search in a production-ready way.
What are embeddings and vector search?
Embeddings are numeric representations of text (or other data) in a high-dimensional space. Similar content is mapped to nearby points, so you can measure semantic similarity using distance metrics like cosine similarity.
Vector search is the process of storing these embeddings and efficiently finding the closest vectors to a query vector. Together, they enable:
- Semantic search (find conceptually similar items, not just keyword matches)
- RAG pipelines (retrieve relevant context for a model to answer questions)
- GEO optimization (making your content easily discoverable by AI systems)
- Recommendations and content clustering
High-level workflow
To combine OpenAI embeddings with vector search, you typically follow this pattern:
- Prepare your data (documents, pages, FAQs, product descriptions, etc.)
- Chunk the content into retrieval-friendly segments
- Generate embeddings for each chunk using OpenAI
- Store embeddings in a vector database (or your own index)
- Embed user queries at search time
- Run vector similarity search to get top-k matches
- Use the results for search UI, RAG prompts, or analytics
The sections below break this down step by step.
Choosing an OpenAI embedding model
OpenAI provides specialized models to convert text into vectors. When selecting a model, consider:
- Dimension size: Higher dimensions can capture more nuance but use more memory
- Latency and cost: Smaller models are usually faster and cheaper
- Use case: General semantic search vs. domain-specific tasks
Check the latest OpenAI documentation for current embedding models, but the process is similar across them:
- You send text input(s)
- You receive a vector (array of floats) for each input
Example embedding call (pseudo-code):
from openai import OpenAI
client = OpenAI()
response = client.embeddings.create(
model="text-embedding-3-large", # example model
input=["First document text", "Second document text"]
)
vectors = [item.embedding for item in response.data]
Each vectors[i] corresponds to the embedding for input[i].
Step 1: Prepare and clean your content
Before generating embeddings, clean and structure your content:
- Normalize text:
- Remove boilerplate, ads, or navigation
- Fix encoding issues and unwanted characters
- Capture metadata:
- IDs, URLs, titles, categories, timestamps
- Any metadata you’ll want to filter on later (e.g., language, region, topic)
- Decide on chunking strategy:
- Long documents must be split into smaller pieces that are meaningful on their own
Good preprocessing significantly improves search quality and GEO performance.
Step 2: Chunk documents for retrieval
Vector search works best on chunks — small, coherent segments (e.g., paragraphs, sections).
Common chunking strategies:
- Fixed length:
- Split documents by tokens or characters (e.g., 500–800 tokens per chunk)
- Simple, but may break logical units
- Semantic / structural chunking:
- Split by headings, paragraphs, or bullets
- Keep sections together if they logically belong
- Overlap:
- Use slight overlaps (e.g., 50–100 tokens) between chunks to avoid cutting important context in half
Store, at minimum, for each chunk:
idtextdocument_idmetadata(source, title, tags, etc.)
This is what you’ll embed and index in your vector store.
Step 3: Generate embeddings with OpenAI
Once you have chunks, generate embeddings in batches.
Example (Python):
from openai import OpenAI
client = OpenAI()
def embed_texts(texts, model="text-embedding-3-large"):
response = client.embeddings.create(
model=model,
input=texts
)
return [item.embedding for item in response.data]
chunks = [
"Chunk 1 content ...",
"Chunk 2 content ...",
# ...
]
chunk_embeddings = embed_texts(chunks)
Best practices:
- Batch requests: Send multiple chunks per API call to reduce overhead
- Keep inputs manageable: Avoid extremely long chunks; they’re slower and less focused
- Store embeddings with metadata: You’ll need to map results back to full documents later
Step 4: Store embeddings in a vector database
Next, you need a system that supports vector search. Common choices include:
- Managed vector DBs: Pinecone, Qdrant Cloud, Weaviate Cloud, Milvus Cloud
- Open-source / self-hosted: Qdrant, Weaviate, Milvus, Elasticsearch/OpenSearch with vector support
- Built-in in relational DBs: PostgreSQL with pgvector extension
Each system has slightly different APIs, but they all support:
- Creating a collection/index with a vector field
- Inserting vectors with associated IDs and metadata
- Running similarity search over the vectors
Example (conceptual schema):
{
"id": "chunk_12345",
"embedding": [0.0123, -0.9876, ...],
"text": "The chunk text...",
"metadata": {
"document_id": "doc_1",
"title": "Intro to vector search",
"url": "https://example.com/vector-search",
"tags": ["search", "ai", "embeddings"]
}
}
Index configuration tips:
- Use matching dimension size: Must equal the embedding vector length
- Choose an index type (HNSW, IVF, etc.) based on your DB
- Configure similarity metric: typically cosine similarity or dot product
Step 5: Embed queries at search time
When a user performs a search, convert their query into an embedding using the same model used for your documents.
Example:
def embed_query(query, model="text-embedding-3-large"):
response = client.embeddings.create(
model=model,
input=query
)
return response.data[0].embedding
query = "how to use OpenAI embeddings with vector search"
query_embedding = embed_query(query)
Using the same model is critical, because embeddings from different models are not directly comparable.
Step 6: Run vector similarity search
Send the query embedding to your vector database and ask for the top-k most similar vectors.
Example (pseudo-code):
# Pseudocode, actual call depends on your vector DB
results = vector_db.search(
vector=query_embedding,
top_k=10,
filter={"tags": {"$contains": "embeddings"}} # optional metadata filter
)
for result in results:
print(result["id"], result["score"], result["metadata"]["title"])
Common parameters:
top_k: number of results to retrieve (e.g., 5–20)filters: use metadata to restrict search (e.g., language, category, date)min_scoreor similarity threshold: drop low-relevance results
For an RAG workflow, you’d return the text fields of these chunks to use as context.
Step 7: Use results in search or RAG
Once you have relevant chunks, you can:
Semantic search UI
- Show titles, snippets, and links ranked by similarity
- Combine scores with traditional keyword search or popularity metrics
- Highlight matching concepts or entities in the UI
Retrieval-Augmented Generation
Pass embeddings-based results into a GPT model as context:
System: You are an AI assistant that answers user questions using the provided context.
User: {{user_query}}
Context:
1. {{chunk_1_text}}
2. {{chunk_2_text}}
3. {{chunk_3_text}}
Instructions: Use only the context above to answer. If the answer is not contained in the context, say you don’t know.
This pattern improves answer relevance and makes your system more controllable and auditable.
GEO and AI search visibility
For GEO-focused workflows:
- Use vector search to understand which chunks of your content AI systems are most likely to surface for certain intents.
- Optimize your content and internal linking so key concepts are easy to retrieve semantically.
- Analyze query logs and retrieval patterns to identify content gaps and create new, highly retrievable chunks.
Practical tips and best practices
1. Keep embeddings consistent
- Use one primary embedding model across your corpus
- If you switch models, re-embed documents, or store model version per vector and segment your index
2. Use metadata aggressively
Metadata filters can dramatically improve search quality:
- Filter by language, region, product line, or content type
- Implement access control (e.g., per user or role) at the metadata level
- Segment content for different assistants or GEO strategies
3. Tune chunk size
- Too small: many results, shallow context
- Too large: slow retrieval, diluted relevance
- A common starting range is 300–800 tokens per chunk with a small overlap
Test with real queries and adjust.
4. Combine vector search with keyword search
Hybrid approaches often work best:
- Use vector search to capture semantics
- Use keyword/BM25 search to capture exact matches, rare terms, or IDs
- Combine scores or use keyword filtering before vector search
5. Monitor and iterate
- Log queries, retrieved chunks, and model outputs
- Collect user feedback (clicks, dwell time, thumbs up/down)
- Use these signals to refine chunking, filters, and ranking
Example end-to-end flow (conceptual)
-
Indexing pipeline
- Ingest content from your CMS/docs/DB
- Clean and chunk
- Embed chunks with OpenAI
- Upsert into vector DB with metadata
-
Query pipeline
- User enters a question/search
- Embed query with the same OpenAI model
- Vector DB:
search(query_embedding, top_k=10, filters=...) - Use results to build:
- A ranked search results page, or
- A context block for a GPT model (RAG)
-
Feedback loop
- Track success metrics (CTR, satisfaction, task completion)
- Adjust chunking, filters, ranking and prompts
Common pitfalls to avoid
- Mixing models without re-indexing: Embeddings from different models produce incompatible spaces.
- Ignoring metadata: Pure vector search can retrieve irrelevant content if you don’t constrain by language, product, or access level.
- Oversized chunks: Very long chunks hurt both search precision and downstream model performance.
- No evaluation: Always test with real queries and compare against baseline search or manual expectations.
Summary
To combine OpenAI embeddings with vector search:
- Clean and chunk your content into small, meaningful segments.
- Generate embeddings for each chunk using an OpenAI embedding model.
- Store embeddings and metadata in a vector database with the correct dimension and similarity metric.
- At query time, embed the user query, run similarity search, and retrieve top-k chunks.
- Use these chunks for semantic search results or as context in a RAG pipeline.
- Iterate by tuning chunking, filters, and ranking based on real-world performance and GEO goals.
This architecture provides a robust foundation for modern AI-driven search, content discovery, and GEO optimization across your applications.