If you’re one of the cool kids building AI-based products, you’ve probably heard of — or are already doing — RAG. If you’re not, let me tell you what RAG is before telling you one weird fact about it.
“Retrieval-Augmented Generation” is nothing more than a fancy way of saying “including helpful information in your LLM prompt.” Still, there are many ways to do it and many questions to answer when building a RAG pipeline. I mentioned some of these things in my recent article about how to build reliable systems using unreliable agents.
If you’ve looked into it, you might’ve realized that setting up a RAG pipeline introduces a bunch of complexity to your already-complex AI system. The one weird trick and the good news I’m bringing you is that your data warehouse provider already has a bunch of recently-added goodies to help you, which could significantly simplify your system.
Besides the fact that most databases offer built-in full-text-search, there are now also utilities to compute embeddings and do vector search — and that’s pretty much all you need for your RAG pipeline. Some even offer multimodal embeddings, so you can search for semantically-similar items across text and images.
In this post I’ll walk through the high-level steps of building a RAG pipeline, illustrate it by describing how we built ours, and then show how we improved it by relying more on built-in BigQuery features. (For context, we’re building an AI agent system to write and update automated software test scripts.)
I’ll focus on BigQuery, since that’s what we use and have the most experience with. But the same general techniques are applicable to other products, too.
A word of caution: Don’t do RAG too early
If your reaction to the above is “Well, great, but I don’t have a data warehouse,” then I’d suggest re-evaluating whether investing in RAG is a good idea at this point.
Any experienced data scientist will tell you that before your org can do ML/AI effectively, it needs to have a certain level of data infrastructure built up: logging, ETLs, analytics, dashboards, etc. While having a data warehouse isn’t a requirement, it is a signal that your organization is far enough along on the data-maturity journey to be able to benefit from RAG.
Not having a data warehouse like BigQuery, Snowflake, or similar might mean you’re not yet at that point. This isn’t a hard rule, obviously, just a point of caution.
Building a RAG pipeline 101
Building a helpful RAG pipeline can be hard because it’s not always obvious what success looks like. Since it has a non-zero cost, you should only use RAG in your LLM calls if it improves your results. So the first thing to think about is how to measure how good your results are.
It might seem obvious, but it’s actually a difficult problem most of the time. You can likely tell when looking at a given result whether it’s good or bad, but that method’s difficult to scale. Without an automatic way to measure success, you’ll need to spend a bunch of time evaluating results after every change. There’s no single answer, but Hamel’s post on evals is a great starting point.
Test your hypothesis
Whether you have an evaluation framework or you’re running on vibes (which is sometimes a reasonable choice!), you need to have an idea about how RAG will help you. With your hypothesis made explicit, you can run experiments to check whether your hypothesis is true. The process might look like this:
1. Establish a baseline
Get your baseline by trying a few prompts — ideally from real users — and see how your system handles them. You’ll probably find some work well and some don’t — save these and treat them as your eval set. Run them each several times to get an idea about reliability/flakiness.
2. Experiment with prompts
Without building anything, think about how you could possibly make your results better if you added the ideal context specific to each case. For any given prompt, what would be the most helpful information you could plausibly retrieve? Add that to the context manually and see if it makes the results better. It will if you have good intuition about the problem.
If you can’t make the results better in this idealized scenario, you just saved yourself a lot of work — no point building anything more until you pass this gate. Try to get an improvement for each example in your eval set before proceeding.
3. Build and test
Now that you have something concrete to aim for, build your first RAG pipeline and see whether it really improves your responses. At this point, you can also start measuring retrieval success in isolation. Use synthetic data to establish baseline recall and precision metrics, as discussed by Jason here.
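To make that concrete, here’s a toy sketch of measuring recall@k against synthetic query→document pairs. Everything here is illustrative: `retrieve` stands in for whatever search function you end up building, and the pairs are made up.

```python
# Toy recall@k harness over synthetic (query, expected document ID) pairs.
# `retrieve(query, top_k)` is a stand-in for your actual search function and
# is assumed to return objects with an `id` attribute.
synthetic_pairs = [
    ("how do I log in?", "doc_login"),
    ("reset my password", "doc_password_reset"),
]

def recall_at_k(retrieve, k: int = 3) -> float:
    hits = sum(
        1
        for query, expected_id in synthetic_pairs
        if expected_id in [doc.id for doc in retrieve(query, top_k=k)]
    )
    return hits / len(synthetic_pairs)
```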
Use text embeddings to search your data
Note: We opted to go with text embeddings instead of keyword search — I explore the tradeoffs in the ‘Embeddings or keyword search?’ section later in the piece.
To use text embeddings to search through your data — which we’ll call “documents,” because that’s the name NLP researchers use — you need to take a number of steps (a rough code sketch follows the list):
- Collect all your documents somewhere, like on disk or in some kind of database for easy access
- Chunk the documents into embeddable pieces (deciding how to chunk depends on your use case — see below)
- Compute an embedding vector for each chunk
- Put all your chunks along with their embedding vectors into some kind of database
- Build/refresh the index you’ll use to find nearest-neighbor-vectors
- At query time, compute the embedding of your query and retrieve top-k closest matches from your index
- Include these matches in your LLM prompt
- Figure out how to refresh your index regularly as you get new data
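Here’s a minimal sketch of that whole loop, assuming an OpenAI embedding model and skipping the index entirely by computing cosine similarity in memory — fine for a toy example, not for millions of chunks. The document texts and model choice are just placeholders.

```python
# Minimal sketch of the steps above: embed chunks, embed the query, take the
# top-k most similar chunks, and paste them into the prompt. Brute-force
# cosine similarity stands in for a real vector index here.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

# Pretend these are your chunked documents.
chunks = [
    "To reset your password, open Settings and click 'Forgot password'.",
    "New accounts are activated via the link in the welcome email.",
]
chunk_vectors = embed(chunks)

def retrieve(query: str, top_k: int = 3) -> list[str]:
    q = embed([query])[0]
    scores = chunk_vectors @ q / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q)
    )
    return [chunks[i] for i in np.argsort(scores)[::-1][:top_k]]

context = "\n".join(retrieve("how do I reset my password?"))
prompt = f"Use the following context to answer.\n{context}\n\nQuestion: ..."
```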
A note on chunking
We lucked out in that the data we wanted to retrieve was already naturally chunked for our use case. Our tests are segmented into steps (where each step is usually a self-contained “task,” similar to the login example below), so we can embed each step separately.
This is a little unusual. A more common scenario is having multiple long documents, like pages from a knowledge base, and trying to pull relevant information out of them. In that case, you’ll need to decide whether to do fixed-length, semantic, or recursive chunking (see, e.g., this post for more specific pointers) or whether you want to get fancy and try the multi-vector approach advocated by Vespa.
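For illustration, the simplest fixed-length approach is just a sliding window over words with some overlap; something along these lines (the sizes are arbitrary) is often a fine starting point before investing in anything fancier:

```python
# Naive fixed-length chunking: a sliding window over words with overlap so
# that sentences straddling a boundary appear intact in at least one chunk.
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    words = text.split()
    step = chunk_size - overlap
    return [
        " ".join(words[i : i + chunk_size])
        for i in range(0, max(len(words) - overlap, 1), step)
    ]
```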
Let’s walk through our initial implementation now before explaining how we improved it using built-in functionality of our data warehouse.
The problem: How to find and retrieve relevant records?
For the last 12 years, we’ve been recording how human testers complete various testing tasks given by our customers. The retrieval problem we faced was finding relevant work records from the data we collected. E.g., given a task like Log in using 'user01@example.com' and 'password01', we can see that a typical tester would perform the following inputs:
- click
- paste
- click
- paste
- click
This information is really valuable when we automatically generate actions based on user prompts — provided we surface the correct data at the right time. We have hundreds of millions of interactions like this, so finding the currently-relevant ones is critical.
What we started with was whatever we could ship the fastest. Since we were already using the OpenAI API and I knew they had an endpoint to compute embeddings, the only remaining question was how to store and search through them. After a bit of research, it seemed like Pinecone was the easiest to get started with — creating an index, adding data to it, and searching for K nearest neighbors was all straightforwardly done using their Python client without the need to set up any additional infrastructure.
I did also consider pgvector because I really like the idea of keeping things simple. And one database is definitely simpler than two! However, there were a few reasons why using our main database didn’t seem like a good idea:
- We had to do a non-trivial amount of ETL to get the data into the right shape initially. This involved fairly intensive, spiky workloads reading data from S3, transforming it using Dataflow/Beam, and saving it in the final destination. Doing this with our main database as the destination would be risky performance-wise and generally a big operation on our main application.
- The code using RAG is in a separate worker process that isn’t directly connected to our main database. Which means we’d still need some kind of API interface for embedding search.
- It wasn’t clear to me what the load impact would be of storing and regularly searching through millions of vectors. Our RAG pipeline isn’t super performance-sensitive, but if it started affecting other features in our app, we’d have a bad time. Isolating it to some other system we didn’t have to worry about scaling seemed safer. At the same time, setting up another Postgres instance seemed unreasonable.
In the first version, we also didn’t worry too much about updating the index over time. This is the kind of thing that needs to be figured out eventually, but I was confident we’d find some solution when it became necessary. Starting with a simple script that gets run offline was good enough.
Luckily, we already had all the data we needed in one place — because we’ve gone through the process of building a data warehouse, we had a scalable way of getting all historical data. This is one of the reasons I warned against thinking about RAG too early. If your org isn’t ready, you’ll face many more obstacles.
Our first, multi-system implementation
Pre-processing the data for aggregation and security
We did some pre-processing of this data, both for quality-of-information and privacy reasons. Look again at the task mentioned earlier: Log in with 'user01@example.com' and 'password01'. Each individual tester doing this task only gives us a noisy signal about what’s necessary to complete it, but we’ve got records of many testers completing the task — we can aggregate them together to establish the minimum necessary sequence of inputs.
Of course, if someone writes an equivalent task using different login credentials, we’d theoretically treat these two tasks separately. Luckily, we already have a way to recognize “test data” like credentials, so we replace them all with a token like <variable>. As a result, we get better aggregation (i.e., more example records per prompt) and we avoid sending user-specific data anywhere.
How exactly we do this aggregation is beyond the scope of this article, but in short we look for the minimum reasonable set of interactions that seem to be necessary to complete the task. (It’s basically a longest common subsequence problem.)
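As a rough illustration of the idea (our real pipeline does more filtering and sanity-checking than this), you can reduce many recordings of the same task to a common subsequence of actions:

```python
# A rough sketch of the aggregation idea: reduce many testers' recordings to a
# common subsequence of actions. (Folding pairwise LCS over the list is a
# heuristic, not an exact multi-sequence LCS, but it illustrates the point.)
from functools import reduce

def lcs(a: list[str], b: list[str]) -> list[str]:
    # Classic dynamic-programming longest common subsequence.
    dp = [[[] for _ in range(len(b) + 1)] for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if x == y:
                dp[i + 1][j + 1] = dp[i][j] + [x]
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j], key=len)
    return dp[-1][-1]

recordings = [
    ["click", "paste", "click", "paste", "scroll", "click"],
    ["click", "paste", "click", "paste", "click"],
    ["click", "click", "paste", "click", "paste", "click"],
]
print(reduce(lcs, recordings))  # ['click', 'paste', 'click', 'paste', 'click']
```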
At this point, we have a table of useful information that looks something like this:
prompt | input sequence |
---|---|
Login with <variable> and <variable> | click paste click paste click |
Add this item to cart | click |
Go to wikipedia.org and search for “rainforest” | navigate paste click |
Except assume that each input sequence is a distillation of many human testers completing this task, and that the table has millions of rows. This is the data we have available in our BigQuery warehouse. The only remaining question: given a user prompt like sign in using <variable>/<variable>, which three rows are the most relevant to the task at hand and should be added to the context?
Computing text embeddings
First, we run through the whole table and use the OpenAI embedding endpoint mentioned above to compute an embedding vector for each text, effectively turning the table above into the table below. For reasons I’ll explain shortly, we also add an ID column, which is just a SHA256 hash of the prompt.
prompt_hash | prompt | prompt_embedding | input sequence |
---|---|---|---|
b36ce6ef... | Login with <variable> and <variable> | -0.0362, 0.02814, -0.0096, -0.0247, -0.0004, ... | click paste click paste click |
8b149cc2... | add this item to cart | -0.0074, 0.0085, -0.0516, 0.0149, -0.021, ... | click |
214ab350... | go to wikipedia.org and search for “rainforest” | -0.00274, 0.0042, -0.0355, -0.0143, -0.0021, ... | navigate paste click |
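In code, producing a row of that table looks roughly like this (simplified; in practice you’d batch the API calls and write the results back in bulk). The function name is mine, and the model is the one discussed just below.

```python
# Compute the ID (SHA-256 of the prompt) and the embedding for a single prompt.
# Simplified: the real job batches prompts and writes results back to a table.
import hashlib
from openai import OpenAI

client = OpenAI()

def embed_prompt(prompt: str) -> tuple[str, list[float]]:
    prompt_hash = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=prompt,
    ).data[0].embedding
    return prompt_hash, embedding

prompt_hash, embedding = embed_prompt("Login with <variable> and <variable>")
```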
There are a couple of embedding models to choose from, and we started with text-embedding-3-small because our steps are reasonably short and it doesn’t feel like we need very high “resolution” to tell similar prompts apart. It’s good to experiment with both the larger model and with cutting down the embedding size for efficiency, but these are both optimizations you can worry about later.
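For reference, the same endpoint lets you request shorter vectors via the dimensions parameter, which is one way to trade a little accuracy for a smaller index later:

```python
# text-embedding-3 models accept a `dimensions` argument for shorter vectors.
from openai import OpenAI

client = OpenAI()
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Login with <variable> and <variable>",
    dimensions=256,  # the model's default is 1536
)
print(len(response.data[0].embedding))  # 256
```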
Retrieving related test steps
Once we have the embeddings for each prompt, we need to make them searchable. This is where Pinecone comes in. It gives us an easy way to upload a bunch of embedding vectors together with some metadata and then search through them. To upload, we create an “index” and then call the following to populate it:
from pinecone.grpc import PineconeGRPC as Pinecone
pc = Pinecone(api_key='YOUR_API_KEY')
index = pc.Index("test_steps")
index.upsert(namespace="ns1", vectors=[
    {"id": "b36ce6ef...", "values": [-0.0362, 0.02814, -0.0096, ...]},
    {"id": "8b149cc2...", "values": [-0.0074, 0.0085, -0.0516, ...]},
    {"id": "214ab350...", "values": [-0.00274, 0.0042, -0.0355, ...]},
])
Now, to guide our generation agent, we can take a prompt that a user has written, find the three closest matches using the Pinecone index, and map those back to the actions that human testers performed. Specifically, we do something like the following:
from openai import OpenAI
from pinecone.grpc import PineconeGRPC as Pinecone

pc = Pinecone(api_key='YOUR_API_KEY')
index = pc.Index("test_steps")

def get_embedding(text):
    # remove unwanted values (the pre-processing described below)
    preprocessed_text = preprocess(text)
    # embed the resulting text using the OpenAI API
    # https://platform.openai.com/docs/api-reference/embeddings/create
    return OpenAI().embeddings.create(
        input=preprocessed_text,
        model="text-embedding-3-small"
    ).data[0].embedding

# `text` is the task prompt written by the user
text = "sign in using <variable>/<variable>"
query_result = index.query(
    vector=get_embedding(text),
    namespace="ns1",  # the namespace we upserted into above
    top_k=3,
    include_values=False,
    include_metadata=True,
)
prompt_hashes = [match.id for match in query_result.matches]
At this point, all we need to do is map back from the hashes to the prompts and their corresponding actions. We can include those actions in the LLM context. And we’re done!
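That last mapping step is a plain BigQuery lookup by hash, using the prompt_hashes list from the query above. A sketch with illustrative table and column names:

```python
# Map the prompt hashes returned by Pinecone back to the aggregated input
# sequences stored in BigQuery. Table and column names are illustrative.
from google.cloud import bigquery

bq = bigquery.Client()
query = """
    SELECT prompt, input_sequence
    FROM `project.dataset.aggregated_steps`
    WHERE prompt_hash IN UNNEST(@hashes)
"""
job = bq.query(
    query,
    job_config=bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ArrayQueryParameter("hashes", "STRING", prompt_hashes)
        ]
    ),
)
examples = [f"{row.prompt}: {row.input_sequence}" for row in job.result()]
```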
The above means we have to deal with three different systems when retrieving the context:
- OpenAI to get embeddings for the user’s task
- Pinecone to get the closest neighbors
- BigQuery to get the final interactions
That has a few downsides: we need to work with three different providers, we end up sending data to multiple places, and the latency cost is non-negligible. (There’s also the upside of being flexible and able to swap components easily, but this isn’t something we’re optimizing for at this point.)
The new, BigQuery-only implementation
It turns out that all major data warehouse providers have recently been hard at work implementing LLM-related features to capitalize on the hype around AI-adjacent topics. I’m going to discuss BigQuery since that’s what we use at Rainforest, but I’ll also include links to other providers’ docs at the bottom of the post.
Specifically, BigQuery now allows you to compute text embeddings (and even multimodal text + image embeddings!), create vector indexes, and do nearest-neighbor vector search. Which is everything we need to rely solely on BigQuery and eliminate Pinecone and the OpenAI embedding API from our pipeline. This makes the system simpler.
One note of caution: as I’m writing this, these BigQuery features are still in preview and some more advanced aspects — like stored columns in vector indexes and pre-filtering rows before vector search — are even gated by access lists. Plan accordingly!
Pre-processing, revisited
In the previous iteration, we did some pre-processing on the task text before embedding it. Because both the build-time and the query-time code were in Python, we could use a single Python function and call it from both places. Think of these steps as just a bunch of regex operations, e.g.:
import re

def sanitize(text: str) -> str:
    return re.sub(r"\{\{.*?\}\}", "<variable>", text)
Now, things are a little different. At query time, we still have the unprocessed text in Python and can process it the same way, but at index build time, we’re inside BigQuery. The most obvious thing to do would be to duplicate the pre-processing logic and have an equivalent function in BigQuery. If the logic is simple, this is workable but still not great — if anything changes, we need to remember to update both the Python and the SQL code or things get out of sync. We could do the transformation only in SQL, but there are two separate queries involved, so we’d still need to duplicate the logic.
Thankfully, there’s a neat way to solve this: user-defined functions (UDFs). And dbt happens to support them well. Which means we can define the pre-processing logic in a single place and then simply call it whenever needed. The way you define a UDF in dbt is:
CREATE OR REPLACE FUNCTION schema.sanitize(action STRING) RETURNS STRING AS (
REGEXP_REPLACE(action, r'\\{\\{.*?\\}\\}', '<variable>')
);
Computing text embeddings
If you want to get a feel for how this works, I’d recommend following this tutorial. In our case, after the initial setup in GCP (creating a model, adding a Vertex AI connection to our BigQuery data warehouse, etc.), we also had to create a continuously updating pipeline.
We use dbt for our data pipelines, so this was just a matter of defining a new dbt model that reflects the table you’ve seen above — except now it’s an actual table wholly contained inside BigQuery instead of split between there and Pinecone. The table schema looks like this:
prompt | prompt_embedding | input sequence |
---|---|---|
Login with <variable> and <variable> | -0.0362, 0.02814, -0.0096, -0.0247, -0.0004, ... | click paste click paste click |
add this item to cart | -0.0074, 0.0085, -0.0516, 0.0149, -0.021, ... | click |
go to wikipedia.org and search for “rainforest” | -0.00274, 0.0042, -0.0355, -0.0143, -0.0021, ... | navigate paste click |
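The dbt model producing this table is, roughly, a single SELECT over ML.GENERATE_EMBEDDING. Here’s a hypothetical sketch — the upstream model name aggregated_test_steps and the column aliases are illustrative, while the model reference and the sanitize UDF match the ones used elsewhere in this post:

```sql
-- Hypothetical sketch of the dbt model that builds the embeddings table.
-- `aggregated_test_steps` is an illustrative upstream model name.
SELECT
  content AS prompt,
  ml_generate_embedding_result AS prompt_embedding,
  input_sequence
FROM ML.GENERATE_EMBEDDING(
  MODEL `project.embeddings.text_embedding_model`,
  (
    SELECT
      schema.sanitize(prompt) AS content,
      input_sequence
    FROM {{ ref('aggregated_test_steps') }}
  )
)
```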
Retrieving related test steps
Because the prompt_embedding column has a vector index on it, we can pretty quickly find the K nearest neighbors even among millions of rows using VECTOR_SEARCH. One nice thing about this is that BigQuery will automatically manage the index for you, so you don’t have to worry about keeping it fresh, re-building it, etc. In fact, if you do try to rebuild the index, it’ll become unavailable for however long that takes and all the queries trying to use it will hang until it’s rebuilt. (Ask me how I know!)
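Creating the index itself is a one-off DDL statement along these lines (table name and options illustrative); after that, BigQuery keeps it updated as rows change:

```sql
CREATE VECTOR INDEX prompt_embedding_index
ON `project.dataset.table`(prompt_embedding)
OPTIONS (
  index_type = 'IVF',
  distance_type = 'COSINE'
);
```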
Specifically, we can run the following query to get the closest examples and their input sequences. (Note the sanitize UDF call I mentioned above.)
SELECT
  base.prompt AS related_prompt,
  base.input_sequence
FROM VECTOR_SEARCH(
  TABLE `project.dataset.table`,
  'prompt_embedding',
  (
    SELECT ml_generate_embedding_result, content AS prompt
    FROM ML.GENERATE_EMBEDDING(
      MODEL `project.embeddings.text_embedding_model`,
      (SELECT schema.sanitize('sign in using <variable>/<variable>') AS content)
    )
  ),
  query_column_to_search => 'ml_generate_embedding_result',
  top_k => 3
)
One thing you have to be careful about is quotas, so be sure to check the relevant numbers. Specifically: the quotas for concurrent requests are low, but each request can calculate many embeddings, so if you need to calculate many embeddings concurrently, it’s better to batch them up into fewer large requests.
Comparing latency and cost
Surprisingly, we found that running the whole pipeline in BigQuery takes basically the same amount of time as the serial OpenAI → Pinecone → BigQuery Frankenstein system. In both cases, it’s about four seconds, end-to-end. Latency isn’t a big deal for us, since our application isn’t super latency-sensitive and it’s easy to add caching in both cases, making the impact negligible. Still, I didn’t expect this to be the case. I thought having everything in BigQuery would be significantly faster! Props to both OpenAI and Pinecone for their speedy APIs for embedding and search, respectively.
As for cost, both solutions are roughly equivalent, ending up at somewhere around $0.01 per retrieval, embedding and search included.
So if we didn’t gain anything in terms of latency and cost, is the new pipeline still worth it? In our case, dealing with a single system — including billing, rate limits and auth considerations — simplifies the development work enough that it’s still worth doing. We also get continuous updates of the relevant datasets for free (by hooking into our existing data refresh process), whereas before we’d need to set up some kind of cron job.
Embeddings or keyword search?
I’ve spent the whole post talking about semantic search using text embeddings, but that’s not the only possible solution. In fact, jumping straight to embeddings and vector search is considered a common mistake, as mentioned, e.g., in this post.
The classic solution to retrieving relevant documents from some collection given a query is keyword search. Despite some obvious problems (like not dealing with synonyms), keyword search is simple and well-explored — you don’t need to figure out how to compute embeddings, how to store them, or how to do nearest-neighbor search.
Any data warehouse will have some full-text-search capabilities. For example, BigQuery has the SEARCH function, which is a decent starting point. Keep in mind that it’ll only find examples that contain all the terms in your query, so it’s going to miss some potentially relevant documents. Using BM25 to rank documents by relevance is better, though it’s not available everywhere.
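As a starting point, a keyword-search version of our retrieval query could be as simple as this (table and column names illustrative):

```sql
-- Keyword search over the same table: returns rows whose prompt contains all
-- of the tokens in the query. No embeddings involved.
SELECT prompt, input_sequence
FROM `project.dataset.table`
WHERE SEARCH(prompt, 'sign in')
LIMIT 10;
```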
There’s also “hybrid search,” where you combine results from keyword and embedding searches, possibly paired with choosing the most relevant chunks using something like Cohere Reranking.
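One common way to do the combining is reciprocal rank fusion, which only needs the two ranked lists of document IDs. A toy sketch (the result lists are made up):

```python
# Reciprocal rank fusion: score each document by summing 1 / (k + rank) over
# every result list it appears in, then sort by the fused score.
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_results = ["doc_a", "doc_b", "doc_c"]  # IDs from full-text search
vector_results = ["doc_b", "doc_d", "doc_a"]   # IDs from vector search
print(reciprocal_rank_fusion([keyword_results, vector_results]))
# ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```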
Check against your synthetic benchmarks: are you retrieving the correct information for a given query? Even imperfect retrieval might be better than none, especially if your precision is OK. As long as you’re not retrieving wrong information and confusing your LLM, you might prefer to do the simplest possible thing and not increase complexity unnecessarily.
Takeaways and summary
Overall, it’s still early in the vectors-in-data-warehouses story and things are changing rapidly. If you already have a RAG pipeline, it’s probably worth revisiting how it’s built and whether it could be simplified. If you’re just starting out, do whatever is simplest — going from nothing to anything in the RAG department is the biggest bang for the buck, and you can worry about optimizations down the line. I suspect using an existing data warehouse might be the simplest solution for some organizations, but that definitely won’t be true for everyone.
We’re happy we built the first version of our RAG pipeline quickly using off-the-shelf, flexible components, but it was also good to simplify and optimize afterwards. We now have a pipeline that’s tightly integrated with our regular data stack, and we largely don’t have to worry about refreshing indexes or keeping it running.
Finally, I also promised to highlight how to do similar things with other data warehouses, so let’s go through those:
- Snowflake has a way to create embeddings and Cortex looks like it has both keyword and semantic search.
- AWS has a way to create text embeddings using Bedrock and OpenSearch can do vector search.
- Microsoft’s Azure has a way to get OpenAI embeddings and do vector search using Synapse.
Let me know what your experience has been! I’m @maciejgryka on Twitter.