
RAG Patterns for Enterprise: What I’ve Learned Building Real Solutions


I’m writing this from my hotel room after day four of Microsoft Ignite 2024 in Chicago, and my head is absolutely stuffed with announcements. Azure AI Foundry just got revealed, there are new model capabilities everywhere, and every other session mentioned RAG. But here’s what got me thinking: most of those sessions made RAG sound clean and simple. Build an index, connect an LLM, done. And I know from the past year of actually building this stuff that it’s… not that.

I’ve been working with RAG patterns since early 2023, when we first started integrating Azure OpenAI into CloudLabs (I wrote about that, and the AI features we built, in earlier posts). We’ve gone through multiple iterations, thrown away entire approaches, and learned a bunch of lessons the hard way. So I wanted to write down what I actually know, not the conference demo version.

What Is RAG, in Plain English?

If you’ve been anywhere near enterprise AI in 2024, you’ve heard the term. But I still run into people who are fuzzy on what it actually means, so let me explain it the way I explain it to non-technical stakeholders at Spektra.

Think about GPT-4o. It knows a ton of stuff from its training data, but it doesn’t know anything about YOUR company’s internal documents, YOUR product manuals, or YOUR support tickets. RAG is a pattern where you fetch relevant information from your own data and stuff it into the prompt alongside the user’s question. The LLM then generates an answer using that context.

That’s it. Retrieval Augmented Generation. You retrieve some context, augment the prompt with it, and then generate a response.
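The "augment" step is literally just string assembly. Here's a minimal sketch in Python; the function name and prompt wording are my own illustration, not from any SDK:

```python
def build_rag_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a grounded prompt: retrieved chunks plus the user's question."""
    context = "\n\n---\n\n".join(chunks)
    return (
        "Answer the question using ONLY the context below. "
        "If the context doesn't contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

In production this string becomes the message you send to GPT-4o via Azure OpenAI; the instruction to refuse when the context is missing is what keeps answers grounded.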

The alternative is fine-tuning, where you actually retrain the model on your data. Fine-tuning has its place, but it’s expensive, slow, and your data gets stale between training runs. RAG lets you keep your data current because the retrieval happens at query time. For most enterprise use cases I’ve seen, RAG is the right starting point.

The Basic Architecture

Here’s the typical Azure-centric RAG setup. This is what we run:

  1. Your source documents live in Azure Blob Storage (or SharePoint, or a database, whatever)
  2. An indexing pipeline chunks those documents, generates embeddings using an embedding model from Azure OpenAI, and stores everything in Azure AI Search (formerly Azure Cognitive Search; it was renamed earlier this year)
  3. When a user asks a question, the question gets embedded using the same model
  4. Azure AI Search finds the most relevant chunks using vector similarity
  5. Those chunks get injected into the prompt sent to GPT-4o via Azure OpenAI
  6. GPT-4o generates an answer grounded in the retrieved context

On paper, that’s six steps and it sounds straightforward. In practice, steps 2 and 4 are where all the complexity hides.
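The query-time half of those steps (3 through 6) wires together as a small pipeline. This sketch injects the embed/search/generate calls as plain functions so you can see the shape without any Azure SDK specifics; all the names here are hypothetical:

```python
from typing import Callable


def answer_query(
    question: str,
    embed: Callable[[str], list[float]],              # step 3: same model as indexing
    search: Callable[[list[float], str], list[str]],  # step 4: similarity search
    generate: Callable[[str], str],                   # steps 5-6: the GPT-4o call
    top_k: int = 5,
) -> str:
    """Embed the question, retrieve the top chunks, and generate a grounded answer."""
    query_vector = embed(question)
    chunks = search(query_vector, question)[:top_k]
    context = "\n\n".join(chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return generate(prompt)
```

In a real system `embed` and `generate` would be Azure OpenAI calls and `search` an Azure AI Search query; keeping them injectable also makes the pipeline trivially testable with stubs.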

Chunking: Where Most People Get It Wrong

I cannot overstate how much chunking strategy matters. This is the thing that makes the biggest difference between a RAG system that works and one that gives garbage answers.

When I say "chunking," I mean how you break up your source documents before indexing them. You can’t just throw a 50-page PDF into a vector database as one blob. The embedding model has a token limit, and even if it didn’t, the LLM context window would fill up immediately. So you split documents into smaller pieces.

The question is: how?

Fixed-size chunks

The lazy approach. Split every 500 tokens with 100 tokens of overlap. This is what most tutorials show you. It works okay for homogeneous text like blog posts or wiki pages. It falls apart on anything structured, like technical documentation with tables, code blocks, or nested headers.

We started here. Lasted about two weeks before the answer quality drove us crazy.
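For reference, here's what that lazy approach looks like. I'm approximating tokens with whitespace-split words to keep the sketch dependency-free; in practice you'd count with a real tokenizer:

```python
def fixed_size_chunks(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Split into fixed-size chunks with overlap. 'Tokens' here are
    whitespace words for simplicity; swap in a real tokenizer in practice."""
    tokens = text.split()
    if not tokens:
        return []
    step = size - overlap  # each chunk starts 'step' tokens after the last
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last chunk already reached the end
    return chunks
```

Note how the overlap means consecutive chunks share 100 tokens, so a sentence straddling a boundary survives in at least one chunk. That's the whole trick, and also why it falls apart on structured content: the boundaries land wherever they land.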

Semantic chunking

Split at natural boundaries: paragraphs, sections, heading changes. This respects the structure of the document and keeps related information together. Way better for technical content.

Azure AI Search has built-in text splitting skills that can do this, but honestly we ended up writing custom chunking logic because every document type has different structure. Lab guides are structured differently from API docs, which are structured differently from support tickets.
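Our custom logic is document-type specific, but the general shape of semantic chunking is: split at headings first, then pack paragraphs into chunks without ever merging across a section boundary. A simplified sketch (the heading regex and size limit are illustrative, not our production values):

```python
import re


def semantic_chunks(markdown: str, max_words: int = 400) -> list[str]:
    """Split at heading boundaries, then pack paragraphs up to max_words.
    Chunks never span two sections, so related content stays together."""
    # Break the document before every markdown heading.
    sections = re.split(r"\n(?=#{1,6} )", markdown)
    chunks = []
    for section in sections:
        current, count = [], 0
        for para in filter(None, (p.strip() for p in section.split("\n\n"))):
            words = len(para.split())
            if current and count + words > max_words:
                chunks.append("\n\n".join(current))  # flush before overflowing
                current, count = [], 0
            current.append(para)
            count += words
        if current:
            chunks.append("\n\n".join(current))  # flush the section's remainder
    return chunks
```

The key property: a chunk can end early (when a section is short) but never bleeds into the next section, which is exactly what fixed-size splitting can't guarantee.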

What I’d recommend

Start with 800-1000 token chunks with 200 tokens of overlap, split on paragraph boundaries. That’s a solid default. Then look at your worst-performing queries and figure out if chunking is the problem. Nine times out of ten, when answers are bad, it’s either a chunking issue or a retrieval issue. Not the LLM.

One specific gotcha: tables. If you have tabular data in your documents and your chunker splits a table across two chunks, the LLM gets half a table and gives nonsense answers. We had to add special handling to keep tables intact as single chunks even if they exceed the target size.
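The table handling boils down to treating a contiguous run of table rows as one atomic unit that the packer can't split. A sketch of that idea for markdown-style tables (the line-based detection is my simplification; our real pipeline works on parsed document structure):

```python
def chunk_lines_keep_tables(lines: list[str], max_words: int = 300) -> list[str]:
    """Greedy line packer that treats a contiguous run of markdown table
    rows (lines starting with '|') as one atomic unit, so a table is never
    split across chunks even when it exceeds max_words."""
    # First pass: group lines into units; each table is a single unit.
    units, i = [], 0
    while i < len(lines):
        if lines[i].lstrip().startswith("|"):
            j = i
            while j < len(lines) and lines[j].lstrip().startswith("|"):
                j += 1
            units.append(lines[i:j])  # the whole table, kept intact
            i = j
        else:
            units.append([lines[i]])
            i += 1
    # Second pass: pack units greedily; an oversized table gets its own chunk.
    chunks, current, count = [], [], 0
    for unit in units:
        words = sum(len(line.split()) for line in unit)
        if current and count + words > max_words:
            chunks.append("\n".join(current))
            current, count = [], 0
        current.extend(unit)
        count += words
    if current:
        chunks.append("\n".join(current))
    return chunks
```

An oversized table blows past `max_words` on purpose; a too-big chunk that the LLM can read beats two half-tables it can't.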

Choosing Your Embedding Model

As of November 2024, you have a few options on Azure OpenAI:

  • text-embedding-ada-002 (the original, 1536 dimensions)
  • text-embedding-3-small (released January 2024, 1536 dimensions, cheaper)
  • text-embedding-3-large (released January 2024, 3072 dimensions, best quality)

We started with ada-002 back in 2023 because it was the only option. When the v3 models came out, we tested all three on our actual data. For our use case (technical documentation retrieval), text-embedding-3-large gave about 8-12% better retrieval accuracy than ada-002 on our benchmark queries. text-embedding-3-small was within 2-3% of ada-002 but costs less.

My recommendation: use text-embedding-3-large if retrieval quality is your priority and you can afford the extra storage for the higher-dimensional vectors. Use text-embedding-3-small if you’re cost-sensitive. Skip ada-002 for new projects.

One thing people forget: once you pick an embedding model, you’re committed. You can’t mix embeddings from different models in the same index. Switching models means re-embedding your entire corpus. We learned this the annoying way when we wanted to upgrade from ada-002.
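A cheap way to avoid learning this the annoying way: tag every document (or the index itself) with the embedding model that produced its vector, and refuse uploads from any other model. The field and exception names here are my own illustration:

```python
class EmbeddingModelMismatch(Exception):
    """Raised when a document's embedding model doesn't match the index."""


def guard_upload(doc: dict, index_model: str) -> dict:
    """Refuse to index a document embedded with a different model.
    Vectors from, say, ada-002 and text-embedding-3-large live in different
    spaces, so one index must hold exactly one model's embeddings."""
    if doc.get("embedding_model") != index_model:
        raise EmbeddingModelMismatch(
            f"index expects {index_model!r}, got {doc.get('embedding_model')!r}"
        )
    return doc
```

The same tag also tells you, when you do decide to upgrade models, exactly which documents still need re-embedding.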

Hybrid Search: The Thing That Actually Made Our Results Good

Pure vector search sounds cool and works well in demos. In production, with real enterprise documents, it has blind spots. Vector search is great at finding semantically similar content but it can miss exact matches. If someone searches for "error code AKS-0429" and that exact string is in your docs, vector search might not surface it because the embedding doesn’t capture the specificity of that code.

That’s where hybrid search comes in. Azure AI Search supports combining vector search with traditional keyword (BM25) search in a single query. It runs both, then fuses the results using Reciprocal Rank Fusion (RRF).

This was the single biggest improvement we made to our CloudLabs AI assistant. When we switched from pure vector to hybrid search, our retrieval precision on technical queries improved by roughly 20%. Especially for queries containing error messages, specific Azure resource names, or CLI commands.

Turn on hybrid search. Just do it. The latency difference is negligible and the quality improvement is real.

Azure AI Search also has semantic ranking, which re-ranks results using a separate Microsoft-hosted model. We tested it and saw another 5-8% improvement on top of hybrid search for ambiguous natural language queries. It does add latency (about 200-400ms) and costs extra, so evaluate whether it’s worth it for your use case. For us it was.
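Azure AI Search does the RRF fusion for you, but the algorithm itself is simple enough to sketch: each document's fused score is the sum of 1/(k + rank) over every ranked list it appears in, with k conventionally set to 60. A pure-Python version:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1/(k + rank_d),
    with 1-based ranks. A document that ranks high in EITHER the keyword
    list or the vector list floats to the top of the fused result."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

This is why hybrid search catches "error code AKS-0429": even if vector search whiffs on the exact string, the keyword ranking alone is enough to pull the right chunk up.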

Some of the Gotchas

I’ve built enough RAG systems now to have a decent list of things that bite you.

Chunk size vs. context window tradeoffs. If your chunks are too big, you fit fewer of them in the prompt, which means less diversity of information. If they’re too small, you lose context within each chunk. There’s no magic number. We settled on 800-1000 tokens per chunk and retrieve the top 5-8 chunks per query, which fills about 25-30% of GPT-4o’s context window and leaves room for the system prompt and conversation history.
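The arithmetic behind that is worth making explicit. Note the percentages only make sense against an effective prompt budget well below GPT-4o's full 128K window; the 32K cap in the example below is my assumption for illustration (latency and cost usually force a cap like this), not a number from our setup:

```python
def context_budget(chunk_tokens: int, top_k: int,
                   prompt_budget: int, reserved: int) -> float:
    """Fraction of the prompt budget consumed by retrieved chunks, after
    reserving space for the system prompt and conversation history."""
    used = chunk_tokens * top_k
    available = prompt_budget - reserved
    if used > available:
        raise ValueError(f"retrieval needs {used} tokens, only {available} free")
    return used / prompt_budget


# e.g. 8 chunks of 1000 tokens against a 32K effective budget,
# with 4K reserved for system prompt + history -> 25% of the budget
```

Running this check at design time (and again whenever you change chunk size or top-k) is cheaper than discovering truncated prompts in production.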

Stale indexes are sneaky. If your source documents change and you don’t re-index, your RAG system confidently gives outdated answers. We set up incremental indexing on a schedule. For frequently updated docs, the pipeline runs every 4 hours. For stable content, weekly.

Token costs add up fast. Embedding your corpus is a one-time cost (per model version). But every query hits both the embedding model and GPT-4o. At scale, the GPT-4o calls dominate the cost. We implemented a response cache keyed on the combination of query embedding similarity and retrieved chunk IDs. If we’ve seen a very similar question with the same context before, we return the cached response. This cut our Azure OpenAI spend by about 35%.

Citations are harder than you think. Users want to know WHERE the answer came from. You need to track which chunks contributed to each response and map them back to source documents, page numbers, section headers. We store chunk metadata (source URL, page number, section title) alongside the embeddings in Azure AI Search and pass that metadata through to the response. It took more engineering effort than the actual RAG pipeline.
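One common pattern (not necessarily exactly what we ship) is to number the chunks in the prompt, ask the model to cite them as [1], [2], and then map the markers back to the stored metadata. The field names below are hypothetical:

```python
import re


def resolve_citations(answer: str, retrieved: list[dict]) -> list[dict]:
    """Map citation markers like [1], [2] in the generated answer back to
    the metadata of the retrieved chunks (1-based, in retrieval order)."""
    cited = sorted({int(m) for m in re.findall(r"\[(\d+)\]", answer)})
    return [
        {"marker": n,
         "source": retrieved[n - 1]["source_url"],
         "page": retrieved[n - 1]["page"]}
        for n in cited
        if 1 <= n <= len(retrieved)  # ignore markers the model hallucinated
    ]
```

The out-of-range filter matters: models occasionally cite chunk [7] when you only gave them five, and silently dropping that beats a KeyError in production.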

A Practical Getting Started Framework

If you’re starting from scratch, here’s how I’d approach it:

Phase 1: Prove it works. Pick one document collection (maybe 50-100 documents). Use Azure AI Search with the default text splitting, text-embedding-3-large for embeddings, and GPT-4o for generation. Build the simplest possible pipeline. Don’t worry about production concerns yet. Just get answers flowing and see if they’re useful. This should take 1-2 weeks.

Phase 2: Make it good. Write 30-50 test questions and expected answers. Measure retrieval quality (are the right chunks coming back?) and answer quality (is the generated answer correct?). This is where you’ll discover your chunking needs to be smarter, you need hybrid search, and your prompts need work. Budget 2-4 weeks.

Phase 3: Make it production-ready. Add authentication, monitoring, cost management, incremental indexing, response caching, citation tracking, and error handling. This is the longest phase and the one most teams underestimate. At least 4-6 weeks, and you’ll keep iterating.

Don’t skip phase 2. Seriously. I’ve seen teams go from "look it works!" directly to production and then wonder why users complain the answers are wrong. Measurement is everything.

What I’m Watching

Microsoft just announced Azure AI Foundry here at Ignite, which is supposed to be the unified platform for building AI apps. I haven’t gotten my hands on it yet, but from the sessions I attended, the "prompt flow" feature for building and testing RAG pipelines looked promising. Could simplify a lot of what I described above.

I’m also prototyping with GraphRAG, which Microsoft Research released a few months ago. It builds a knowledge graph from your documents instead of relying on plain vector similarity. For complex multi-hop questions where basic RAG struggles, the early results are interesting.

And OpenAI o1 (released in September) is worth watching for the reasoning step after retrieval. A model that can actually reason through contradictory chunks instead of just summarizing them? That could fix some of the harder failure modes we see.

Over to You

If you’ve been building RAG systems, I’d love to hear what patterns are working for you. What chunking strategy are you using? Have you tried hybrid search? Drop me a comment or reach out.

Happy building, folks!

Amit

Assisted by AI during writing
