{"id":2360,"date":"2024-07-02T21:04:00","date_gmt":"2024-07-02T21:04:00","guid":{"rendered":"https:\/\/www.rainforestqa.com\/blog\/?p=2360"},"modified":"2024-07-09T19:41:07","modified_gmt":"2024-07-09T19:41:07","slug":"data-warehouse-can-rag","status":"publish","type":"post","link":"https:\/\/www.rainforestqa.com\/blog\/data-warehouse-can-rag","title":{"rendered":"Surprise, your data warehouse can RAG"},"content":{"rendered":"\n<p>If you\u2019re one of the cool kids building AI-based products you\u2019ve probably heard of \u2014 or are already doing \u2014 RAG. If you\u2019re not, let me tell you what RAG is before telling you one weird fact about it.<\/p>\n\n\n\n<p>\u201cRetrieval-Augmented Generation\u201d is nothing more than a fancy way of saying \u201cincluding helpful information in your LLM prompt.\u201d Still, there are many ways to do it and many questions to answer when building a RAG pipeline. I mentioned some of these things in my recent article about <a href=\"https:\/\/www.rainforestqa.com\/blog\/building-reliable-systems-out-of-unreliable-agents#Invest_in_RAG\" target=\"_blank\" rel=\"noreferrer noopener\">how to build reliable systems using unreliable agents<\/a>.<\/p>\n\n\n\n<p>If you\u2019ve looked into it, you might\u2019ve realized that setting up a RAG pipeline introduces a bunch of complexity to your already-complex AI system. The one weird trick and the good news I\u2019m bringing you is that your data warehouse provider already has a bunch of recently-added goodies to help you, which could significantly simplify your system.<\/p>\n\n\n\n<p>Besides the fact that most databases offer built-in full-text-search, there are now also utilities to compute embeddings and do vector search \u2014 and that\u2019s pretty much all you need for your RAG pipeline. 
Some even offer multimodal embeddings, so you can search for semantically-similar items across text and images.<\/p>\n\n\n\n<p>In this post I\u2019ll walk through the high-level steps of building a RAG pipeline, illustrate it by describing how we built ours, and then show how we improved it by relying more on built-in BigQuery features. (For context, we\u2019re building an <a href=\"https:\/\/www.rainforestqa.com\/ai-accelerated-testing\" target=\"_blank\" rel=\"noreferrer noopener\">AI agent system to write and update automated software test scripts<\/a>.)<\/p>\n\n\n\n<p>I\u2019ll focus on BigQuery, since that\u2019s what we use and have the most experience with. But the same general techniques are applicable to other products, too.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"A_word_of_caution_Dont_do_RAG_too_early\"><\/span>A word of caution: Don\u2019t do RAG too early<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>If your reaction to the above is \u201cWell, great, but I don\u2019t have a data warehouse,\u201d then I\u2019d suggest re-evaluating whether investing in RAG is a good idea at this point.<\/p>\n\n\n\n<p>Any experienced data scientist will tell you that before your org can do ML\/AI effectively, it needs to have a certain level of data infrastructure built up: logging, ETLs, analytics, dashboards, etc. While having a data warehouse isn\u2019t a requirement, it is a signal that your organization is far enough along on the data-maturity journey to be able to benefit from RAG.<\/p>\n\n\n\n<p>Not having a data warehouse like BigQuery, Snowflake, or similar might mean you\u2019re not yet at that point. This isn\u2019t a hard rule, obviously, just a point of caution.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Building_a_RAG_pipeline_101\"><\/span>Building a RAG pipeline 101<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>Building a helpful RAG pipeline can be hard because it\u2019s not always obvious what success looks like. Since it has a non-zero cost, you should only use RAG in your LLM calls if it improves your results. So the first thing to think about is how to measure how good your results are.<\/p>\n\n\n\n<p>It might seem obvious, but it\u2019s actually a difficult problem most of the time. You can likely tell when looking at a given result whether it\u2019s good or bad, but that method\u2019s difficult to scale. 
Without an automatic way to measure success, you\u2019ll need to spend a bunch of time evaluating results after every change. There\u2019s no single answer, but <a href=\"https:\/\/hamel.dev\/blog\/posts\/evals\/\" target=\"_blank\" rel=\"noreferrer noopener\">Hamel\u2019s post on evals<\/a> is a great starting point.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Test your hypothesis<\/h3>\n\n\n\n<p>Whether you have an evaluation framework or you\u2019re running on vibes (which is sometimes a reasonable choice!), you need to have an idea about how RAG will help you. With your hypothesis made explicit, you can run experiments to check whether your hypothesis is true. The process might look like this:<\/p>\n\n\n\n<p><strong>1. Establish a baseline<\/strong><\/p>\n\n\n\n<p>Get your baseline by trying a few prompts \u2014 ideally from real users \u2014 and see how your system handles them. You\u2019ll probably find some work well and some don\u2019t \u2014 save these and treat them as your eval set. Run them each several times to get an idea about reliability\/flakiness.<\/p>\n\n\n\n<p><strong>2. Experiment with prompts<\/strong><\/p>\n\n\n\n<p>Without building anything, think about how you could possibly make your results better if you added the ideal context specific to each case. For any given prompt, what would be the most helpful information you could plausibly retrieve? Add that to the context manually and see if it makes the results better. It will if you have good intuition about the problem.<\/p>\n\n\n\n<p>If you can\u2019t make the results better in this idealized scenario, you just saved yourself a lot of work \u2014 no point building anything more until you pass this gate. Try to get an improvement for each example in your eval set before proceeding.<\/p>\n\n\n\n<p><strong>3. Build and test<\/strong><\/p>\n\n\n\n<p>Now that you have something concrete to aim for, build your first RAG pipeline and see whether it really improves your responses. 
At this point, you can also start measuring retrieval success in isolation. Use synthetic data to establish baseline recall and precision metrics, as discussed by Jason <a href=\"https:\/\/jxnl.co\/writing\/2024\/05\/11\/low-hanging-fruit-for-rag-search\/#1-synthetic-data-for-baseline-metrics\" target=\"_blank\" rel=\"noreferrer noopener\">here<\/a>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Use text embeddings to search your data<\/h3>\n\n\n\n<p><em>Note: We opted to go with text embeddings instead of keyword search \u2014 I explore the tradeoffs in the \u2018Embeddings or keyword search?\u2019 section later in the piece.<\/em><\/p>\n\n\n\n<p>To use text embeddings to search through your data \u2014 which we\u2019ll call \u201cdocuments,\u201d because that\u2019s the name the NLP researchers use \u2014 you need to take a number of steps:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Collect all your documents somewhere, like on disk or in some kind of database for easy access<\/li>\n\n\n\n<li>Chunk the documents into embeddable pieces (deciding how to chunk depends on your use case \u2014 see below)<\/li>\n\n\n\n<li>Compute an embedding vector for each chunk<\/li>\n\n\n\n<li>Put all your chunks along with their embedding vectors into some kind of database<\/li>\n\n\n\n<li>Build\/refresh the index you\u2019ll use to find nearest-neighbor vectors<\/li>\n\n\n\n<li>At query time, compute the embedding of your query and retrieve top-k closest matches from your index<\/li>\n\n\n\n<li>Include these matches in your LLM prompt<\/li>\n\n\n\n<li>Figure out how to refresh your index regularly as you get new data<\/li>\n<\/ol>\n\n\n\n<p><strong>A note on chunking<\/strong><\/p>\n\n\n\n<p>We lucked out in that the data we wanted to retrieve was already naturally chunked for our use case. 
Our tests are segmented into steps (where each step is usually a self-contained \u201ctask\u201d similar to the login example below), so we can embed each step separately.<\/p>\n\n\n\n<p>This is a little unusual. A more common scenario is having multiple long documents, like pages from a knowledge base, and trying to pull relevant information out of them. In that case, you\u2019ll need to decide whether you do fixed-length, semantic, or recursive chunking (see, e.g., <a href=\"https:\/\/www.pinecone.io\/learn\/chunking-strategies\/\" target=\"_blank\" rel=\"noreferrer noopener\">this post<\/a> for more specific pointers) or if you want to be fancy and try the <a href=\"https:\/\/blog.vespa.ai\/rag-perspectives\/#:~:text=To%20chunk%20or%20not%20to%20chunk\" target=\"_blank\" rel=\"noreferrer noopener\">multi-vector approach advocated by Vespa<\/a>.<\/p>\n\n\n\n<p>Let\u2019s walk through our initial implementation now before explaining how we improved it using built-in functionality of our data warehouse.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"The_problem_How_to_find_and_retrieve_relevant_records\"><\/span>The problem: How to find and retrieve relevant records?<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>For the last 12 years, we\u2019ve been recording how human testers complete various testing tasks given by our customers. The retrieval problem we faced was finding relevant work records from the data we collected. 
E.g., given a task like <code>Log in using 'user01@example.com' and 'password01'<\/code>, we can see that a typical tester would perform the following inputs:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>click<\/li>\n\n\n\n<li>paste<\/li>\n\n\n\n<li>click<\/li>\n\n\n\n<li>paste<\/li>\n\n\n\n<li>click<\/li>\n<\/ul>\n\n\n\n<p>This information is really valuable when we <a href=\"https:\/\/www.rainforestqa.com\/blog\/generative-ai-test-automation\" target=\"_blank\" rel=\"noreferrer noopener\">automatically generate actions based on user prompts<\/a> \u2014 provided we surface the correct data at the right time. We have hundreds of millions of interactions like this, so finding the currently-relevant ones is critical.<\/p>\n\n\n\n<p>What we started with was whatever we could ship the fastest. Since we were already using the OpenAI API and I knew they had <a href=\"https:\/\/platform.openai.com\/docs\/guides\/embeddings\/embeddings\" target=\"_blank\" rel=\"noreferrer noopener\">an endpoint to compute embeddings<\/a>, the only remaining question was how to store and search through them. After a bit of research, it seemed like <a href=\"https:\/\/docs.pinecone.io\/guides\/get-started\/quickstart\" target=\"_blank\" rel=\"noreferrer noopener\">Pinecone<\/a> was the easiest to get started with \u2014 creating an index, adding data to it, and searching for K nearest neighbors was all straightforwardly done using their Python client without the need to set up any additional infrastructure.<\/p>\n\n\n\n<p>I did also consider <a href=\"https:\/\/github.com\/pgvector\/pgvector\/\" target=\"_blank\" rel=\"noreferrer noopener\">pgvector<\/a> because I really like the idea of keeping things simple. And one database is definitely simpler than two! However, there were a few reasons why using our main database didn\u2019t seem like a good idea:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>We had to do a non-trivial amount of ETL to get the data into the right shape initially. 
This involved fairly intensive, spiky workloads reading data from S3, transforming it using <a href=\"https:\/\/cloud.google.com\/dataflow?hl=en\" target=\"_blank\" rel=\"noopener\">Dataflow<\/a>\/<a href=\"https:\/\/beam.apache.org\/\" target=\"_blank\" rel=\"noopener\">Beam<\/a>, and saving it in the final destination. Doing this with our main database as the destination would be risky performance-wise and generally a big operation on our main application.<\/li>\n\n\n\n<li>The code using RAG is in a separate worker process that isn\u2019t directly connected to our main database. Which means we\u2019d still need some kind of API interface for embedding search.<\/li>\n\n\n\n<li>It wasn\u2019t clear to me what the load impact would be of storing and regularly searching through millions of vectors. Our RAG pipeline isn\u2019t super performance-sensitive, but if it started affecting other features in our app, we\u2019d have a bad time. Isolating it to some other system we didn\u2019t have to worry about scaling seemed safer. At the same time, setting up another Postgres instance seemed unreasonable.<\/li>\n<\/ul>\n\n\n\n<p>In the first version, we also didn\u2019t worry too much about updating the index over time. This is the kind of thing that needs to be figured out eventually, but I was confident we\u2019d find <em>some<\/em> solution when it became necessary. Starting with a simple script that gets run offline was good enough.<\/p>\n\n\n\n<p>Luckily, we already had all the data we needed in one place \u2014 because we\u2019ve gone through the process of building a data warehouse, we had a scalable way of getting all historical data. This is one of the reasons I warned against thinking about RAG too early. 
If your org isn\u2019t ready, you\u2019ll face many more obstacles.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Our_first_multi-system_implementation\"><\/span>Our first, multi-system implementation<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Pre-processing the data for aggregation and security<\/h3>\n\n\n\n<p>We did some pre-processing of this data, both for quality-of-information and privacy reasons. Look again at the task mentioned earlier: <code>Log in using 'user01@example.com' and 'password01'<\/code>. Each individual tester doing this task only gives us noisy signal about what\u2019s necessary to complete the task, but we\u2019ve got records of many testers completing the task \u2014 we can aggregate them together to establish the minimum sequence of inputs required.<\/p>\n\n\n\n<p>Of course, if someone writes an equivalent task using different login credentials, we\u2019d theoretically treat these two tasks separately. Luckily, we already have a way to recognize \u201ctest data\u201d like credentials, so we replace them all with a token like <code>&lt;variable&gt;<\/code>. As a result, we get better aggregation (i.e., more example records per prompt) and we avoid sending user-specific data anywhere.<\/p>\n\n\n\n<p>How exactly we do this aggregation is beyond the scope of this article, but in short we look for the minimum reasonable set of interactions that seem to be necessary to complete the task. 
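As a toy illustration of that distillation (the data and helper names here are mine, not our production code), you can fold a handful of testers' recordings of the same task down to the inputs they all share:

```python
from functools import reduce

def lcs(a, b):
    # classic dynamic-programming longest common subsequence of two action lists
    dp = [[[] for _ in range(len(b) + 1)] for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if x == y:
                dp[i + 1][j + 1] = dp[i][j] + [x]
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j], key=len)
    return dp[len(a)][len(b)]

def distill(recordings):
    # fold pairwise across all recordings to keep only the shared core of inputs
    return reduce(lcs, recordings)

recordings = [
    ["click", "paste", "click", "paste", "scroll", "click"],
    ["click", "paste", "click", "paste", "click"],
    ["click", "hover", "paste", "click", "paste", "click"],
]
# distill(recordings) keeps the click/paste sequence common to all three testers
```

Real recordings are messier than this, but the shape of the computation is the same.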
(It\u2019s basically a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Longest_common_subsequence\" target=\"_blank\" rel=\"noreferrer noopener\">longest common subsequence problem<\/a>.)<\/p>\n\n\n\n<p>At this point, we have a table of useful information that looks something like this:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><thead><tr><th>prompt<\/th><th>input sequence<\/th><\/tr><\/thead><tbody><tr><td>Login with &lt;variable&gt; and &lt;variable&gt;<\/td><td>click<br>paste<br>click<br>paste<br>click<\/td><\/tr><tr><td>Add this item to cart<\/td><td>click<\/td><\/tr><tr><td>Go to wikipedia.org and search for \u201crainforest\u201d<\/td><td>navigate<br>paste<br>click<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>Except assume that each input sequence is a distillation of many human testers completing this task, and that the table has millions of rows. How do we find the rows most similar to the user prompt we just got, like <code>sign in using &lt;variable&gt;\/&lt;variable&gt;<\/code>? This is the data we have available in our BigQuery warehouse. The only remaining question is: which three rows are the most relevant to the task at hand and should be added to the context?<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Computing text embeddings<\/h3>\n\n\n\n<p>First, we run through the whole table and use the OpenAI embedding endpoint mentioned above to compute an embedding vector for each text, effectively turning the table above into the table below. 
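A sketch of that backfill step, assuming the official `openai` Python client (the helper names and batch size are mine, not our production pipeline):

```python
def batched(items, size):
    # split the rows into fixed-size batches; the embeddings endpoint
    # accepts many inputs per call, which keeps us under rate limits
    return [items[i:i + size] for i in range(0, len(items), size)]

def embed_prompts(prompts, model="text-embedding-3-small", batch_size=512):
    # requires the `openai` package and an OPENAI_API_KEY in the environment;
    # results come back in the same order as the inputs
    from openai import OpenAI
    client = OpenAI()
    vectors = []
    for batch in batched(prompts, batch_size):
        response = client.embeddings.create(model=model, input=batch)
        vectors.extend(item.embedding for item in response.data)
    return vectors
```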
For reasons I\u2019ll explain shortly, we also add an ID column, which is just a SHA256 hash of the prompt.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><thead><tr><th>prompt_hash<\/th><th>prompt<\/th><th>prompt_embedding<\/th><th>input sequence<\/th><\/tr><\/thead><tbody><tr><td><code>b36ce6ef...<\/code><\/td><td>Login with &lt;variable&gt; and &lt;variable&gt;<\/td><td><code>-0.0362, 0.02814, -0.0096, -0.0247, -0.0004, ...<\/code><\/td><td>click<br>paste<br>click<br>paste<br>click<\/td><\/tr><tr><td><code>8b149cc2...<\/code><\/td><td>add this item to cart<\/td><td><code>-0.0074, 0.0085, -0.0516, 0.0149, -0.021, ...<\/code><\/td><td>click<\/td><\/tr><tr><td><code>214ab350...<\/code><\/td><td>go to wikipedia.org and search for \u201crainforest\u201d<\/td><td><code>-0.00274, 0.0042, -0.0355, -0.0143, -0.0021, ...<\/code><\/td><td>navigate<br>paste<br>click<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>There are a couple of embedding models to choose from, and we started with <code>text-embedding-3-small<\/code> because our steps are reasonably short and it doesn\u2019t feel like we need very high \u201cresolution\u201d to tell similar prompts apart. It\u2019s good to experiment with both the larger model and cutting down the embedding size for efficiency, but these are both optimizations you can worry about later.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Retrieving related test steps<\/h3>\n\n\n\n<p>Once we have the embeddings for each prompt, we need to make them searchable. This is where Pinecone comes in. It gives us an easy way to upload a bunch of embedding vectors together with some metadata and then search through them. 
To upload, we create an \u201cindex\u201d and then call the following to populate it:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from pinecone.grpc import PineconeGRPC as Pinecone\n\npc = Pinecone(api_key='YOUR_API_KEY')\n\nindex = pc.Index(\"test_steps\")\nindex.upsert(namespace=\"ns1\", vectors=&#91;\n  {\"id\": \"b36ce6ef...\", \"values\": &#91;-0.0362, 0.02814, -0.0096, ...]},\n  {\"id\": \"8b149cc2...\", \"values\": &#91;-0.0074, 0.0085, -0.0516, ...]},\n  {\"id\": \"214ab350...\", \"values\": &#91;-0.00274, 0.0042, -0.0355, ...]},\n])<\/code><\/pre>\n\n\n\n<p>Now, to guide our generation agent, we can take a prompt that a user has written, find the three closest matches using the Pinecone index, and map that back to the actions that human testers performed. Specifically, we do something like the following:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from openai import OpenAI\nfrom pinecone.grpc import PineconeGRPC as Pinecone\n\npc = Pinecone(api_key='YOUR_API_KEY')\nindex = pc.Index(\"test_steps\")\n\ndef get_embedding(text):\n    # remove unwanted values\n    preprocessed_text = preprocess(text)\n    # embed the resulting text using the OpenAI API\n    # https:\/\/platform.openai.com\/docs\/api-reference\/embeddings\/create\n    return OpenAI().embeddings.create(\n        input=preprocessed_text,\n        model=\"text-embedding-3-small\"\n    ).data&#91;0].embedding\n\nquery_result = index.query(\n    vector=get_embedding(\"sign in using &lt;variable&gt;\/&lt;variable&gt;\"),\n    top_k=3,\n    include_values=False,\n    include_metadata=True,\n)\nprompt_hashes = &#91;match.id for match in query_result.matches]<\/code><\/pre>\n\n\n\n<p>At this point, all we need to do is map back from the hashes to the prompts and their corresponding actions. We can include those actions in the LLM context. 
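That last lookup can be sketched like this (the table and column names are hypothetical, and the hashing mirrors the ID column described earlier):

```python
import hashlib

def prompt_hash(prompt: str) -> str:
    # the vector IDs are SHA256 hashes of the (sanitized) prompt text
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

def fetch_examples(prompt_hashes):
    # look the matched hashes back up in BigQuery;
    # requires google-cloud-bigquery and GCP credentials
    from google.cloud import bigquery
    client = bigquery.Client()
    job = client.query(
        "SELECT prompt, input_sequence "
        "FROM `project.dataset.distilled_steps` "
        "WHERE prompt_hash IN UNNEST(@hashes)",
        job_config=bigquery.QueryJobConfig(
            query_parameters=[
                bigquery.ArrayQueryParameter("hashes", "STRING", prompt_hashes)
            ]
        ),
    )
    return [dict(row) for row in job]
```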
And we\u2019re done!<\/p>\n\n\n\n<p>The above means we have to deal with three different systems when retrieving the context:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>OpenAI to get embeddings for the user\u2019s task<\/li>\n\n\n\n<li>Pinecone to get the closest neighbors<\/li>\n\n\n\n<li>BigQuery to get the final interactions<\/li>\n<\/ul>\n\n\n\n<p>That has a few downsides: we need to work with three different providers, we end up sending data to multiple places, and the latency cost is non-negligible. (There\u2019s also the upside of being flexible and able to swap components easily, but this isn\u2019t something we\u2019re optimizing for at this point.)<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"The_new_BigQuery-only_implementation\"><\/span>The new, BigQuery-only implementation<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>It turns out that all major data warehouse providers have recently been hard at work implementing LLM-related features to address the <s>hype<\/s> popularity of AI-adjacent topics. I\u2019m going to discuss BigQuery since that\u2019s what we use at Rainforest, but I\u2019ll also include links to other providers\u2019 docs at the bottom of the post.<\/p>\n\n\n\n<p>Specifically, BigQuery now allows you to <a href=\"https:\/\/cloud.google.com\/bigquery\/docs\/reference\/standard-sql\/bigqueryml-syntax-generate-embedding#textembedding-gecko_1\" target=\"_blank\" rel=\"noreferrer noopener\">compute text embeddings<\/a> (and even multimodal text + image embeddings!), <a href=\"https:\/\/cloud.google.com\/bigquery\/docs\/vector-index#create_a_vector_index\" target=\"_blank\" rel=\"noreferrer noopener\">create vector indexes<\/a>, and do <a href=\"https:\/\/cloud.google.com\/bigquery\/docs\/vector-search-intro\" target=\"_blank\" rel=\"noreferrer noopener\">nearest-neighbor vector search<\/a>. 
Which is everything we need to rely solely on BigQuery and eliminate Pinecone and the OpenAI embedding API from our pipeline. This makes the system simpler.<\/p>\n\n\n\n<p>One note of caution: as I\u2019m writing this, these BigQuery features are still in preview and some more advanced aspects \u2014 like stored columns in vector indexes and pre-filtering rows before vector search \u2014 are even <a href=\"https:\/\/cloud.google.com\/bigquery\/docs\/vector-index#stored-columns\" target=\"_blank\" rel=\"noreferrer noopener\">gated by access lists<\/a>. Plan accordingly!<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Pre-processing, revisited<\/h3>\n\n\n\n<p>In the previous iteration, we did some pre-processing on the task text before embedding it. Because we used the same Python code at build- and query-time, we could use a single Python function and call it from two places. Think of these steps as just a bunch of regex operations, e.g.:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import re\n\ndef sanitize(text: str) -&gt; str:\n    return re.sub(r\"\\{\\{.*?\\}\\}\", \"&lt;variable&gt;\", text)<\/code><\/pre>\n\n\n\n<p>Now, things are a little different. At query time, we still have the unprocessed text in Python and can process it the same way, but at index build time, we\u2019re inside BigQuery. The most obvious thing to do would be to duplicate the pre-processing logic and have the equivalent function in BigQuery. If the logic is simple, this is workable but still not great \u2014 if anything changes, we need to remember to update both the Python and the SQL code or things get out of sync. We could do the transformation only in SQL, but there are two separate queries, so we\u2019d still need to duplicate the logic.<\/p>\n\n\n\n<p>Thankfully, there\u2019s a neat way to solve this: <a href=\"https:\/\/cloud.google.com\/bigquery\/docs\/user-defined-functions\" target=\"_blank\" rel=\"noreferrer noopener\">user-defined functions<\/a> (UDFs). 
And <a href=\"https:\/\/discourse.getdbt.com\/t\/using-dbt-to-manage-user-defined-functions\/18\" target=\"_blank\" rel=\"noreferrer noopener\">dbt happens to support them<\/a> well. Which means we can define the pre-processing logic in a single place and then simply call it whenever needed. The way you define a UDF in dbt is:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>CREATE OR REPLACE FUNCTION schema.sanitize(action STRING) RETURNS STRING AS (\n  REGEXP_REPLACE(action, r'\\{\\{.*?\\}\\}', '&lt;variable&gt;')\n);<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Computing text embeddings<\/h3>\n\n\n\n<p>If you want to get a feel for how this works, I\u2019d recommend following <a href=\"https:\/\/cloud.google.com\/bigquery\/docs\/vector-index-text-search-tutorial\" target=\"_blank\" rel=\"noreferrer noopener\">this tutorial<\/a>. In our case, after the initial setup in GCP (creating a model, adding a Vertex connection to our BigQuery data warehouse, etc.), we also had to create a continuously-updating pipeline.<\/p>\n\n\n\n<p>We use <a href=\"https:\/\/www.getdbt.com\/\" target=\"_blank\" rel=\"noopener\">dbt<\/a> for our data pipelines, so this was just a matter of defining a new dbt model that reflects the table you\u2019ve seen above \u2014 except now it\u2019s an actual table wholly contained inside BigQuery instead of split between there and Pinecone. 
The table schema looks like this:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><thead><tr><th>prompt<\/th><th>prompt_embedding<\/th><th>input sequence<\/th><\/tr><\/thead><tbody><tr><td>Login with &lt;variable&gt; and &lt;variable&gt;<\/td><td><code>-0.0362, 0.02814, -0.0096, -0.0247, -0.0004, ...<\/code><\/td><td>click<br>paste<br>click<br>paste<br>click<\/td><\/tr><tr><td>add this item to cart<\/td><td><code>-0.0074, 0.0085, -0.0516, 0.0149, -0.021, ...<\/code><\/td><td>click<\/td><\/tr><tr><td>go to wikipedia.org and search for \u201crainforest\u201d<\/td><td><code>-0.00274, 0.0042, -0.0355, -0.0143, -0.0021, ...<\/code><\/td><td>navigate<br>paste<br>click<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Retrieving related test steps<\/h3>\n\n\n\n<p>Because the prompt embedding column has a vector index on it, we can pretty quickly find K nearest neighbors even among millions of rows using <code>VECTOR_SEARCH<\/code>. One nice thing about this is that BigQuery will automatically manage the index for you, so you don\u2019t have to worry about keeping it fresh, re-building, etc. In fact, if you do try to rebuild the index, it\u2019ll become unavailable for however long that takes and all the queries trying to use it will hang until it\u2019s rebuilt. (Ask me how I know!)<\/p>\n\n\n\n<p>Specifically, we can run the following query to get the closest examples and their input sequences. 
(Note the <code>sanitize<\/code> UDF call I mentioned above.)<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>SELECT\n\tbase.prompt AS related_prompt,\n\tbase.input_sequence\nFROM VECTOR_SEARCH(\n\tTABLE `project.dataset.table`,\n\t'prompt_embedding',\n\t(\n\t  SELECT ml_generate_embedding_result, content AS prompt\n\t  FROM ML.GENERATE_EMBEDDING(\n\t\t  MODEL `project.embeddings.text_embedding_model`,\n\t\t  (SELECT schema.sanitize('sign in using &lt;variable&gt;\/&lt;variable&gt;') AS content)\n\t\t)\n\t),\n\ttop_k =&gt; 3\n)<\/code><\/pre>\n\n\n\n<p>One thing you have to be careful about is quotas, so be sure to <a href=\"https:\/\/cloud.google.com\/bigquery\/quotas\" target=\"_blank\" rel=\"noreferrer noopener\">check the relevant numbers<\/a>. Specifically: the quotas for concurrent requests are low, but each request can calculate many embeddings, so if you need many embeddings at once, it\u2019s better to batch them up into fewer, larger requests.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Comparing_latency_and_cost\"><\/span>Comparing latency and cost<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>Surprisingly, we found that running the whole pipeline in BigQuery takes basically the same amount of time as the serial OpenAI \u2192 Pinecone \u2192 BigQuery Frankenstein system. In both cases, it\u2019s about four seconds, end-to-end. Latency isn\u2019t a big deal for us, since our application isn\u2019t super latency-sensitive and it\u2019s easy to add caching in both cases, making the impact negligible. Still, I didn\u2019t expect this to be the case. I thought having everything in BigQuery would be significantly faster! 
Props to both OpenAI and Pinecone for their speedy APIs for embedding and search, respectively.<\/p>\n\n\n\n<p>As for cost, both solutions are roughly equivalent, coming out to around $0.01 per retrieval, taking both embedding and search into account.<\/p>\n\n\n\n<p>So if we didn\u2019t gain anything in terms of latency and cost, is the new pipeline still worth it? In our case, dealing with a single system \u2014 including billing, rate limits, and auth considerations \u2014 simplifies the development work enough that it\u2019s still worth doing. We also get continuous updates of the relevant datasets for free (by hooking into our existing data refresh process), whereas before we\u2019d need to set up some kind of cron job.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Embeddings_or_keyword_search\"><\/span>Embeddings or keyword search?<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>I\u2019ve spent the whole post talking about semantic search using text embeddings, but that\u2019s not the only possible solution. In fact, jumping straight to embeddings and vector search is considered a common mistake, as mentioned, e.g., in <a href=\"https:\/\/www.oreilly.com\/radar\/what-we-learned-from-a-year-of-building-with-llms-part-i\/#:~:text=don%E2%80%99t%20forget%20keyword%20search%3B%20use%20it%20as%20a%20baseline%20and%20in%20hybrid%20search\" target=\"_blank\" rel=\"noreferrer noopener\">this post<\/a>.<\/p>\n\n\n\n<p>The <em>classic<\/em> solution to retrieving relevant documents from some collection given a query is keyword search. Despite some obvious problems (like not dealing with synonyms), keyword search is simple and well-explored \u2014 you don\u2019t need to figure out how to compute the embeddings, how to store them, and how to do nearest-neighbor search.<\/p>\n\n\n\n<p>Any data warehouse will have some full-text search capabilities. 
For example, BigQuery has <a href=\"https:\/\/cloud.google.com\/bigquery\/docs\/reference\/standard-sql\/search_functions#search\" target=\"_blank\" rel=\"noreferrer noopener\">the <code>SEARCH<\/code> function<\/a>, which is a decent starting point. Keep in mind that it\u2019ll only find examples that contain <em>all<\/em> the terms in your query, so it\u2019s going to miss some potentially relevant documents. Using <a href=\"https:\/\/en.wikipedia.org\/wiki\/Okapi_BM25\" target=\"_blank\" rel=\"noreferrer noopener\">BM25<\/a> to rank documents by relevance is better, though it\u2019s not available everywhere.<\/p>\n\n\n\n<p>There\u2019s also \u201chybrid search,\u201d where you combine results from keyword and embedding searches, possibly paired with choosing the most relevant chunks using something like <a href=\"https:\/\/docs.cohere.com\/docs\/rerank-2\" target=\"_blank\" rel=\"noreferrer noopener\">Cohere Reranking<\/a>.<\/p>\n\n\n\n<p>Whichever approach you pick, run your benchmarks, even synthetic ones: are you retrieving the correct information for a given query? Even imperfect retrieval might be better than none, especially if your precision is OK. As long as you\u2019re not retrieving <em>wrong<\/em> info and confusing your LLM, you might prefer to do the simplest possible thing and not increase complexity unnecessarily.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Takeaways_and_summary\"><\/span>Takeaways and summary<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>Overall, it\u2019s still early in the vectors-in-data-warehouses story and things are changing rapidly. If you already have a RAG pipeline, it\u2019s probably worth revisiting how it\u2019s built and whether it could be simplified. If you\u2019re just starting out, do whatever is simplest \u2014 going from nothing to anything in the RAG department is the biggest bang for the buck, and you can worry about optimizations down the line. 
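If you do go the hybrid route mentioned above, one common way to merge keyword and vector results is reciprocal rank fusion (RRF), which combines ranked lists without needing their scores to be comparable. A minimal Python sketch (the function name is mine, not from any warehouse API; k=60 is the conventional constant):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs (best first).

    A document's fused score is the sum over lists of 1 / (k + rank),
    so items ranked highly in multiple lists rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Pass it, say, `[keyword_ids, vector_ids]` and truncate the merged ranking to your top-k before building the prompt.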
I suspect using an existing data warehouse might be the simplest solution for some organizations, but that definitely won\u2019t be true for everyone.<\/p>\n\n\n\n<p>We\u2019re happy we built the first version of our RAG pipeline quickly using flexible, off-the-shelf components, but it was also good to simplify and optimize. We now have a pipeline that\u2019s tightly integrated with our regular data stack, and we no longer have to worry much about refreshing indexes or keeping it running.<\/p>\n\n\n\n<p>Finally, I also promised to highlight how to do similar things with other data warehouses, so let\u2019s go through those:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Snowflake has a way to <a href=\"https:\/\/www.snowflake.com\/blog\/introducing-snowflake-arctic-embed-snowflakes-state-of-the-art-text-embedding-family-of-models\/\" target=\"_blank\" rel=\"noreferrer noopener\">create embeddings<\/a> and Cortex looks like it has both keyword and semantic search.<\/li>\n\n\n\n<li>AWS has a way to <a href=\"https:\/\/aws.amazon.com\/blogs\/machine-learning\/getting-started-with-amazon-titan-text-embeddings\/\" target=\"_blank\" rel=\"noreferrer noopener\">create text embeddings<\/a> using Bedrock, and OpenSearch can do <a href=\"https:\/\/aws.amazon.com\/opensearch-service\/serverless-vector-engine\/\" target=\"_blank\" rel=\"noreferrer noopener\">vector search<\/a>.<\/li>\n\n\n\n<li>Microsoft\u2019s Azure has a way to get OpenAI embeddings and do vector search using <a href=\"https:\/\/microsoft.github.io\/SynapseML\/docs\/Explore%20Algorithms\/OpenAI\/Quickstart%20-%20OpenAI%20Embedding\/\" target=\"_blank\" rel=\"noreferrer noopener\">Synapse<\/a>.<\/li>\n<\/ul>\n\n\n\n<p>Let me know what your experience has been! 
I\u2019m <a href=\"https:\/\/twitter.com\/maciejgryka\" target=\"_blank\" rel=\"noreferrer noopener\">@maciejgryka on Twitter<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>How to use your data warehouse&#8217;s built-in features to simplify and potentially improve your RAG pipeline.<\/p>\n","protected":false},"author":13,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"content-type":"","inline_featured_image":false,"footnotes":""},"categories":[6],"tags":[15],"class_list":["post-2360","post","type-post","status-publish","format-standard","hentry","category-engineering","tag-ai"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.rainforestqa.com\/blog\/wp-json\/wp\/v2\/posts\/2360","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.rainforestqa.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.rainforestqa.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.rainforestqa.com\/blog\/wp-json\/wp\/v2\/users\/13"}],"replies":[{"embeddable":true,"href":"https:\/\/www.rainforestqa.com\/blog\/wp-json\/wp\/v2\/comments?post=2360"}],"version-history":[{"count":12,"href":"https:\/\/www.rainforestqa.com\/blog\/wp-json\/wp\/v2\/posts\/2360\/revisions"}],"predecessor-version":[{"id":2397,"href":"https:\/\/www.rainforestqa.com\/blog\/wp-json\/wp\/v2\/posts\/2360\/revisions\/2397"}],"wp:attachment":[{"href":"https:\/\/www.rainforestqa.com\/blog\/wp-json\/wp\/v2\/media?parent=2360"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.rainforestqa.com\/blog\/wp-json\/wp\/v2\/categories?post=2360"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.rainforestqa.com\/blog\/wp-json\/wp\/v2\/tags?post=2360"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}