{"id":1986,"date":"2024-04-03T20:34:00","date_gmt":"2024-04-03T20:34:00","guid":{"rendered":"https:\/\/www.rainforestqa.com\/blog\/?p=1986"},"modified":"2024-06-10T22:56:50","modified_gmt":"2024-06-10T22:56:50","slug":"building-reliable-systems-out-of-unreliable-agents","status":"publish","type":"post","link":"https:\/\/www.rainforestqa.com\/blog\/building-reliable-systems-out-of-unreliable-agents","title":{"rendered":"Building reliable systems out of unreliable agents"},"content":{"rendered":"\n<p>If you\u2019ve tried building real-world features with AI, chances are that you\u2019ve experienced reliability issues. It\u2019s common knowledge that AI makes for great demos, but\u2026&nbsp;<em>questionable<\/em>&nbsp;products. After getting uncannily correct answers at first, you get burned on reliability with some wild output and decide you can\u2019t make anything useful out of that.<\/p>\n\n\n\n<p>Well, I\u2019m here to tell you that there\u2019s hope. Even though AI agents are&nbsp;not&nbsp;reliable, it&nbsp;is&nbsp;possible to build reliable systems out of them.<\/p>\n\n\n\n<p>These learnings come from a years-long process of creating&nbsp;<a href=\"https:\/\/www.rainforestqa.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">an AI QA tool<\/a>. While building, we found a process that worked pretty well for us and we\u2019re sharing it here. 
As a summary, it consists of these high-level steps:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Write simple prompts to solve your problem<\/li>\n\n\n\n<li>Use that experience to build an eval system to do prompt engineering and improve performance in a principled way<\/li>\n\n\n\n<li>Deploy your AI system with good observability, and use that signal to keep gathering examples and improving your evals<\/li>\n\n\n\n<li>Invest in Retrieval Augmented Generation (RAG)<\/li>\n\n\n\n<li>Fine-tune your model using the data you gathered from earlier steps<\/li>\n<\/ul>\n\n\n\n<p>Having worked on this problem for a while, I think these are the best practices every team should adopt. But there\u2019s an additional approach we came up with that gave us a breakthrough in reliability, and it might be a good fit for your product, too:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use complementary agents<\/li>\n<\/ul>\n\n\n\n<p>The principle behind this step is simple: it\u2019s possible to build&nbsp;systems&nbsp;of complementary agents that work much more reliably than a single agent. More on that later.<\/p>\n\n\n\n<p>Before we jump in, a note on who this is for: you don\u2019t need much AI experience to follow the process we lay out here. In fact, most of the team building our QA agent didn\u2019t have previous AI or ML experience. A solid software engineering background, however, is super helpful \u2014 a sentiment <a href=\"https:\/\/twitter.com\/gdb\/status\/1729893902814192096\" target=\"_blank\" rel=\"noreferrer noopener\">echoed<\/a> by well-known people in the industry. 
Particularly, thinking deeply about how to test what you\u2019re building and constantly finding ways to optimize your workflow are really important.<\/p>\n\n\n\n<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_82_2 counter-hierarchy ez-toc-counter ez-toc-custom ez-toc-container-direction\">\n<div class=\"ez-toc-title-container\">\n<p class=\"ez-toc-title\" style=\"cursor:inherit\">Contents<\/p>\n<span class=\"ez-toc-title-toggle\"><a href=\"#\" class=\"ez-toc-pull-right ez-toc-btn ez-toc-btn-xs ez-toc-btn-default ez-toc-toggle\" aria-label=\"Toggle Table of Content\"><span class=\"ez-toc-js-icon-con\"><span class=\"\"><span class=\"eztoc-hide\" style=\"display:none;\">Toggle<\/span><span class=\"ez-toc-icon-toggle-span\"><svg style=\"fill: #999;color:#999\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" class=\"list-377408\" width=\"20px\" height=\"20px\" viewBox=\"0 0 24 24\" fill=\"none\"><path d=\"M6 6H4v2h2V6zm14 0H8v2h12V6zM4 11h2v2H4v-2zm16 0H8v2h12v-2zM4 16h2v2H4v-2zm16 0H8v2h12v-2z\" fill=\"currentColor\"><\/path><\/svg><svg style=\"fill: #999;color:#999\" class=\"arrow-unsorted-368013\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"10px\" height=\"10px\" viewBox=\"0 0 24 24\" version=\"1.2\" baseProfile=\"tiny\"><path d=\"M18.2 9.3l-6.2-6.3-6.2 6.3c-.2.2-.3.4-.3.7s.1.5.3.7c.2.2.4.3.7.3h11c.3 0 .5-.1.7-.3.2-.2.3-.5.3-.7s-.1-.5-.3-.7zM5.8 14.7l6.2 6.3 6.2-6.3c.2-.2.3-.5.3-.7s-.1-.5-.3-.7c-.2-.2-.4-.3-.7-.3h-11c-.3 0-.5.1-.7.3-.2.2-.3.5-.3.7s.1.5.3.7z\"\/><\/svg><\/span><\/span><\/span><\/a><\/span><\/div>\n<nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/www.rainforestqa.com\/blog\/building-reliable-systems-out-of-unreliable-agents\/#Start_with_simple_prompts\" >Start with simple prompts<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" 
href=\"https:\/\/www.rainforestqa.com\/blog\/building-reliable-systems-out-of-unreliable-agents\/#Use_an_eval_system_to_do_prompt_engineering\" >Use an eval system to do prompt engineering<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/www.rainforestqa.com\/blog\/building-reliable-systems-out-of-unreliable-agents\/#Improve_with_observability\" >Improve with observability<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/www.rainforestqa.com\/blog\/building-reliable-systems-out-of-unreliable-agents\/#Invest_in_RAG\" >Invest in RAG<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-5\" href=\"https:\/\/www.rainforestqa.com\/blog\/building-reliable-systems-out-of-unreliable-agents\/#Fine-tune_your_model\" >Fine-tune your model<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-6\" href=\"https:\/\/www.rainforestqa.com\/blog\/building-reliable-systems-out-of-unreliable-agents\/#Use_complementary_agents\" >Use complementary agents<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-7\" href=\"https:\/\/www.rainforestqa.com\/blog\/building-reliable-systems-out-of-unreliable-agents\/#Final_notes\" >Final notes<\/a><\/li><\/ul><\/nav><\/div>\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Start_with_simple_prompts\"><\/span>Start with simple prompts<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>The most obvious way to start using an LLM to solve a problem is simply asking it to do the thing in your own words. This approach often works well enough at the beginning to give you hope, but starts falling down as soon as you want any reliability. The answers you get might be mostly correct, but not good enough for production. 
And you\u2019ll quickly notice scenarios where the answers are consistently wrong.<\/p>\n\n\n\n<p>The best LLMs today are amazing generalists, but not very good specialists \u2014 and generally, you want specialists to solve your business problems. They need to have enough general knowledge to not be tedious, but at the same time they need to know how exactly to handle the specifics in the gray areas of your problem space.<\/p>\n\n\n\n<p>Let\u2019s take a trivial example: you want to get a list of ingredients needed to prepare different things to eat. You start with the first thing that comes to mind:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"916\" height=\"1024\" src=\"https:\/\/www.rainforestqa.com\/blog\/wp-content\/uploads\/2024\/04\/pbj1-2-916x1024.jpg\" alt=\"\" class=\"wp-image-1989\" srcset=\"https:\/\/www.rainforestqa.com\/blog\/wp-content\/uploads\/2024\/04\/pbj1-2-916x1024.jpg 916w, https:\/\/www.rainforestqa.com\/blog\/wp-content\/uploads\/2024\/04\/pbj1-2-268x300.jpg 268w, https:\/\/www.rainforestqa.com\/blog\/wp-content\/uploads\/2024\/04\/pbj1-2-768x858.jpg 768w, https:\/\/www.rainforestqa.com\/blog\/wp-content\/uploads\/2024\/04\/pbj1-2-1374x1536.jpg 1374w, https:\/\/www.rainforestqa.com\/blog\/wp-content\/uploads\/2024\/04\/pbj1-2.jpg 1426w\" sizes=\"(max-width: 916px) 100vw, 916px\" \/><\/figure>\n\n\n\n<p>This is good \u2014 the list of ingredients is right there. But there\u2019s also a bunch of other stuff that you don\u2019t need. For example, you might use the same knife for both jars and not care for being lectured about \u200cknife hygiene. 
You can fiddle with the prompt pretty easily:<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"638\" src=\"https:\/\/www.rainforestqa.com\/blog\/wp-content\/uploads\/2024\/04\/pbj2-2-1024x638.jpg\" alt=\"\" class=\"wp-image-1990\" srcset=\"https:\/\/www.rainforestqa.com\/blog\/wp-content\/uploads\/2024\/04\/pbj2-2-1024x638.jpg 1024w, https:\/\/www.rainforestqa.com\/blog\/wp-content\/uploads\/2024\/04\/pbj2-2-300x187.jpg 300w, https:\/\/www.rainforestqa.com\/blog\/wp-content\/uploads\/2024\/04\/pbj2-2-768x478.jpg 768w, https:\/\/www.rainforestqa.com\/blog\/wp-content\/uploads\/2024\/04\/pbj2-2.jpg 1362w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n<\/div>\n\n\n<p>Better, but there\u2019s still some unnecessary commentary there. Let&#8217;s have another shot:<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"442\" src=\"https:\/\/www.rainforestqa.com\/blog\/wp-content\/uploads\/2024\/04\/pbj3-2-1024x442.jpg\" alt=\"\" class=\"wp-image-1991\" srcset=\"https:\/\/www.rainforestqa.com\/blog\/wp-content\/uploads\/2024\/04\/pbj3-2-1024x442.jpg 1024w, https:\/\/www.rainforestqa.com\/blog\/wp-content\/uploads\/2024\/04\/pbj3-2-300x130.jpg 300w, https:\/\/www.rainforestqa.com\/blog\/wp-content\/uploads\/2024\/04\/pbj3-2-768x332.jpg 768w, https:\/\/www.rainforestqa.com\/blog\/wp-content\/uploads\/2024\/04\/pbj3-2.jpg 1172w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n<\/div>\n\n\n<p>OK, I guess that\u2019ll do. Now we just need to make it JSON so we can integrate it with the rest of our product. 
Also, just to be safe, let\u2019s run it a few times to make sure it\u2019s reliable:<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"572\" src=\"https:\/\/www.rainforestqa.com\/blog\/wp-content\/uploads\/2024\/04\/pbj4-2-1024x572.jpg\" alt=\"\" class=\"wp-image-1992\" srcset=\"https:\/\/www.rainforestqa.com\/blog\/wp-content\/uploads\/2024\/04\/pbj4-2-1024x572.jpg 1024w, https:\/\/www.rainforestqa.com\/blog\/wp-content\/uploads\/2024\/04\/pbj4-2-300x167.jpg 300w, https:\/\/www.rainforestqa.com\/blog\/wp-content\/uploads\/2024\/04\/pbj4-2-768x429.jpg 768w, https:\/\/www.rainforestqa.com\/blog\/wp-content\/uploads\/2024\/04\/pbj4-2.jpg 1394w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n<\/div>\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"619\" src=\"https:\/\/www.rainforestqa.com\/blog\/wp-content\/uploads\/2024\/04\/pbj5-2-1024x619.jpg\" alt=\"\" class=\"wp-image-1993\" srcset=\"https:\/\/www.rainforestqa.com\/blog\/wp-content\/uploads\/2024\/04\/pbj5-2-1024x619.jpg 1024w, https:\/\/www.rainforestqa.com\/blog\/wp-content\/uploads\/2024\/04\/pbj5-2-300x181.jpg 300w, https:\/\/www.rainforestqa.com\/blog\/wp-content\/uploads\/2024\/04\/pbj5-2-768x464.jpg 768w, https:\/\/www.rainforestqa.com\/blog\/wp-content\/uploads\/2024\/04\/pbj5-2.jpg 1396w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n<\/div>\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"587\" src=\"https:\/\/www.rainforestqa.com\/blog\/wp-content\/uploads\/2024\/04\/pbj6-2-1024x587.jpg\" alt=\"\" class=\"wp-image-1994\" srcset=\"https:\/\/www.rainforestqa.com\/blog\/wp-content\/uploads\/2024\/04\/pbj6-2-1024x587.jpg 1024w, 
https:\/\/www.rainforestqa.com\/blog\/wp-content\/uploads\/2024\/04\/pbj6-2-300x172.jpg 300w, https:\/\/www.rainforestqa.com\/blog\/wp-content\/uploads\/2024\/04\/pbj6-2-768x440.jpg 768w, https:\/\/www.rainforestqa.com\/blog\/wp-content\/uploads\/2024\/04\/pbj6-2.jpg 1378w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n<\/div>\n\n\n<p>As you can see, we\u2019re reliably getting JSON out, but it\u2019s not consistent: the capitalization of the keys and what&#8217;s included in each value varies. These specific problems are easy to deal with and even this might be good enough for your use case, but we\u2019re still far from the reliable and reproducible behavior we\u2019re looking for.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Minimum necessary product integration<\/h3>\n\n\n\n<p>At this point, it\u2019s time to integrate a minimal version of your \u201cAI component\u201d with your product, so the next step is to start using the API instead of the console. Grab your favorite LLM-provider client (or just use their API \u2014 there are <a href=\"https:\/\/x.com\/simonw\/status\/1728141822063767857?s=20\" target=\"_blank\" rel=\"noreferrer noopener\">some good reasons to stick with HTTP<\/a>) and integrate it into your product in the most minimal way possible. The point is to start building out the infrastructure, knowing that the results won&#8217;t be great yet.<\/p>\n\n\n\n<p>Some cheat codes at this point, based on my experience:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>If you use <code>mypy<\/code>, you might be tempted to use strongly-typed inputs when interacting with the client (e.g., the OpenAI client), but my advice would be: don\u2019t. 
While I like having <code>mypy<\/code> around, it\u2019s much easier to work with plain dictionaries to build your messages and you\u2019re not risking a lot of bugs.<\/li>\n\n\n\n<li>In my experience, it\u2019s a good idea to set <code>temperature=0.0<\/code> in all your model calls if you care about reliability. You still won\u2019t get perfect reproducibility, but it\u2019s usually the best place to start your explorations.<\/li>\n\n\n\n<li>If you\u2019re thinking about using a wrapper like <a href=\"https:\/\/github.com\/jxnl\/instructor\" target=\"_blank\" rel=\"noreferrer noopener\"><code>instructor<\/code><\/a> to get structured data out of the LLM: it\u2019s <em>really<\/em> cool and makes some use-cases very smooth, but also makes your code a little less flexible. I\u2019d usually start without it and then bring it in at a later point, once I\u2019m confident in the shape of my data.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Use_an_eval_system_to_do_prompt_engineering\"><\/span>Use an eval system to do prompt engineering<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>The first thing you should try after the naive \u201cask a simple question\u201d approach is prompt engineering. While the phrase is common, I don\u2019t think many people have an accurate definition of what \u201cprompt engineering\u201d actually means, so let\u2019s define it first.<\/p>\n\n\n\n<p>When I say \u201cprompt engineering,\u201d I mean something like, \u201citerative improvement of a prompt based on measurable success criteria\u201d, where \u201citerative\u201d and \u201cmeasurable success criteria\u201d are the key phrases. (I like&nbsp;<a href=\"https:\/\/mitchellh.com\/writing\/prompt-engineering-vs-blind-prompting\" target=\"_blank\" rel=\"noreferrer noopener\">this post<\/a>&nbsp;from last year as an early attempt to define this.) 
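<\/p>\n\n\n\n<p>As a concrete sketch of the \u201ccheat codes\u201d above (plain-dict messages, <code>temperature=0.0<\/code>), a minimal integration might look like the following. It assumes the official <code>openai<\/code> v1 Python client; the model name and prompt wording are illustrative:<\/p>

```python
import json

def build_messages(dish: str) -> list[dict]:
    # Plain dictionaries instead of typed client objects: easier to
    # tweak while you iterate, and you're not risking a lot of bugs.
    return [
        {"role": "system",
         "content": "Respond with only a JSON object mapping "
                    "ingredient names to amounts."},
        {"role": "user", "content": f"List the ingredients for: {dish}"},
    ]

def list_ingredients(dish: str) -> dict:
    from openai import OpenAI  # assumes the openai v1 client is installed
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4-turbo",            # illustrative model name
        messages=build_messages(dish),
        temperature=0.0,                # favor reproducibility
    )
    return json.loads(response.choices[0].message.content)
```

<p>Keeping the message-building separate from the API call also makes the prompt easy to unit-test without hitting the network.<\/p>\n\n\n\n<p>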
The key is to have some way of determining whether an answer you get from an LLM is correct and then measuring correctness across examples to give you a number you can compare over time.<\/p>\n\n\n\n<p>Create an evaluation loop so you have a way of checking any change you make, and then make that loop as fast as it can be so you can iterate effectively. For an overview of how to think about your eval systems, see&nbsp;<a href=\"https:\/\/hamel.dev\/blog\/posts\/evals\/\" target=\"_blank\" rel=\"noreferrer noopener\">this excellent blog post<\/a>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Evaluating when there are multiple correct answers<\/h3>\n\n\n\n<p>This whole procedure rhymes with \u201ctesting\u201d and \u201ccollecting a validation set,\u201d except that for any given question there might be multiple correct answers, so it\u2019s not obvious how to do this. Here are some situations and how to deal with them:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>It\u2019s possible to apply deterministic transformations on the output of the LLM before you compare it to the \u201cknown good\u201d answer. An example might be the capitalization issue from earlier, or maybe you only ever want the first three words of the output. Running these transformations will make it trivial to compare what you get with what you expect.<\/li>\n\n\n\n<li>You might be able to use some heuristics to validate your output. E.g., if you\u2019re working on summarization, you might be able to say something like \u201cto summarize this story accurately, the following words are absolutely necessary\u201d and then get away with doing string matching on the response.<\/li>\n\n\n\n<li>Maybe you need the output in a certain format. E.g., you\u2019re getting function calls or their arguments from an LLM or you\u2019re expecting country codes or other well-defined outputs. In these cases, you can validate what you get against a known schema and retry in case of errors. 
In practice, you can use something like&nbsp;<a href=\"https:\/\/github.com\/jxnl\/instructor\" target=\"_blank\" rel=\"noreferrer noopener\">instructor<\/a> \u2014 if you\u2019re comfortable with the constraints it imposes, including around code flexibility \u2014 and then you\u2019re left with straightforwardly comparing structured data.<\/li>\n\n\n\n<li>In true Inception fashion, you might want to use a simpler, smaller, and cheaper LLM to evaluate the outputs of your big-LLM-using-component. Comparing two differently written but equivalent lists of ingredients is an easy task even for something like GPT 3.5 Turbo. Just keep in mind: even though it\u2019s pretty reliable, you\u2019re now introducing <em>some<\/em> flakiness into your test suite. Trade-offs!<\/li>\n\n\n\n<li>To evaluate an answer, you might have to \u201cexecute\u201d the entire set of instructions the agent gives you and check if you\u2019ve reached the goal. This is more complex and time-consuming, but sometimes the only way. For example, we often ask an agent to achieve a goal inside a browser by outputting a series of instructions that might span multiple screens. The only way for us to check its answer is to execute the instructions inside Playwright and run some assertions on the final state.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Building your validation set<\/h3>\n\n\n\n<p>Once you have your evaluation loop nailed down, you can build a validation set of example inputs and the corresponding outputs you&#8217;d like the agent to produce.<\/p>\n\n\n\n<p>As you evaluate different strategies and make changes to your prompts, it\u2019s ideal if you have a single metric to compare over time. 
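<\/p>\n\n\n\n<p>As a toy sketch of the first strategy from the list above, here\u2019s an eval loop that applies a deterministic transformation (lowercasing keys) before comparing against the known-good answer, and reduces the results to one number. The function names and the definition of \u201ccorrect\u201d are made up for illustration:<\/p>

```python
def normalize(ingredients: dict) -> dict:
    # Deterministic transformation: lowercase and strip the keys so
    # capitalization differences between runs don't count as failures.
    return {k.strip().lower(): v for k, v in ingredients.items()}

def fraction_correct(examples: list[tuple[dict, dict]]) -> float:
    # examples: (model_output, known_good_answer) pairs. Here "correct"
    # means the same set of ingredient names; amounts are ignored.
    correct = sum(
        1 for got, want in examples
        if normalize(got).keys() == normalize(want).keys()
    )
    return correct / len(examples)
```

<p>How strict the comparison should be is up to you; the important part is that the loop runs in milliseconds, so every prompt change can be re-checked immediately.<\/p>\n\n\n\n<p>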
Something like \u201c% prompts answered correctly\u201d is the most obvious, but something like precision\/recall might be more informative depending on your use case.<\/p>\n\n\n\n<p>It\u2019s also possible you won\u2019t be able to use a single metric if you\u2019re evaluating fuzzy properties of your answers, in which case you can at least look at what breaks after each change and make a judgment call.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Prompt engineering tricks<\/h3>\n\n\n\n<p>Now\u2019s your chance to try all the prompt engineering tricks you\u2019ve heard about. Let\u2019s cover some of them!<\/p>\n\n\n\n<p>First of all, provide all the context you\u2019d need to give to an intelligent human operator who\u2019s unfamiliar with the nuances and requirements of the task. This is a necessary (but not sufficient!) condition to making your system work. E.g., if you know you absolutely always want some salted butter under your peanut butter, you need to include that information in your prompt.<\/p>\n\n\n\n<p>If you\u2019re not getting the correct responses, ask the agent to think step-by-step before providing the actual answer. 
This can be a little tricky if you\u2019re expecting structured data out \u2014 you\u2019ll have to somehow give the agent a way to do some reasoning&nbsp;<em>before<\/em> it makes any consequential decisions and locks itself in.<\/p>\n\n\n\n<p>E.g., if you use the <a href=\"https:\/\/platform.openai.com\/docs\/api-reference\/chat\/create#chat-create-tools\" target=\"_blank\" rel=\"noreferrer noopener\">tool-calling API<\/a>, which is really nifty, you might be tempted to tell the agent to do some chain-of-thought reasoning by having a JSON schema similar to this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&#91;\n    {\n        \"type\": \"function\",\n        \"function\": {\n            \"name\": \"scoop_out\",\n            \"description\": \"scoop something out\",\n            \"parameters\": {\n                \"type\": \"object\",\n                \"properties\": {\n                    \"chain_of_thought\": {\n                        \"type\": \"string\",\n                        \"description\": \"the reasoning for this action\"\n                    },\n                    \"jar\": {\n                        \"type\": \"string\",\n                        \"description\": \"the jar to scoop out of\"\n                    },\n                    \"amount\": {\n                        \"type\": \"integer\",\n                        \"description\": \"how much to scoop out\"\n                    }\n                },\n                \"required\": &#91;\"chain_of_thought\", \"jar\",\"amount\"]\n            }\n        }\n    },\n    {\n        \"type\": \"function\",\n        \"function\": {\n            \"name\": \"spread\",\n            \"description\": \"spread something on a piece of bread\",\n            \"parameters\": {\n                \"type\": \"object\",\n                \"properties\": {\n                    \"chain_of_thought\": {\n                        \"type\": \"string\",\n                        \"description\": \"the reasoning for this 
action\"\n                    },\n                    \"substance\": {\n                        \"type\": \"string\",\n                        \"description\": \"what to spread\"\n                    }\n                },\n                \"required\": &#91;\"chain_of_thought\", \"substance\"]\n            }\n        }\n    }\n]<\/code><\/pre>\n\n\n\n<p>Unfortunately, this will make the agent output the function name <em>before<\/em> it produces the chain-of-thought, locking it into the early decision and defeating the whole point. One way to get around this is to pass in the schema with the available functions, but ask the agent to output a JSON that wraps the function spec, similar to this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>You are an assistant helping prepare food. Given a dish name, respond with a JSON object of the following structure:\n\n{{\n    \"chain_of_thought\": \"Describe your reasoning for the next action\",\n    \"function_name\": \"scoop_out\" | \"spread\",\n    \"function_args\": {{\n        &lt; appropriate arguments for the function &gt;\n    }}\n}}<\/code><\/pre>\n\n\n\n<p>This has fewer guarantees, but works well enough in practice with GPT 4 Turbo.<\/p>\n\n\n\n<p>Generally, chain-of-thought has a speed\/cost vs. accuracy trade-off, so pay attention to your latencies and token usage in addition to correctness.<\/p>\n\n\n\n<p>Another popular trick is <em>few-shot prompting<\/em>. In many cases, you\u2019ll get a noticeable bump in performance if you include a few examples of questions and their corresponding answers in your prompt, though this isn\u2019t always feasible. E.g., your actual input might be so large that including more than a single \u201cshot\u201d in your prompt isn\u2019t practical.<\/p>\n\n\n\n<p>Finally, you can try offering the agent a bribe or telling it something bad will happen if it answers incorrectly. 
We didn\u2019t find these tactics worked for us, but they might for you \u2014 they\u2019re worth trying, assuming you trust your eval process.<\/p>\n\n\n\n<p>Every one of the tricks listed above will change things: some things will hopefully get better, but it\u2019s likely you\u2019ll break something else at the same time. It\u2019s really important you have a representative set of examples that you can work with, so commit to spending time on building it.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Improve_with_observability\"><\/span>Improve with observability<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>At this point, you\u2019ll likely have something that\u2019s good enough to deploy as an alpha-quality product. Which you should absolutely do as soon as you can so you can get real user data and feedback.<\/p>\n\n\n\n<p>It\u2019s important you\u2019re open about where your system is in terms of robustness, but it\u2019s impossible to overstate the value of users telling you what they think. It\u2019s tempting to get everything working correctly before opening up to users, but you\u2019ll hit a point of diminishing returns without real user feedback. This is an absolutely necessary ingredient to making your system better over time \u2014 don\u2019t skip it.<\/p>\n\n\n\n<p>Before release, just make sure your observability practices are solid. From my perspective, this basically means logging all of your LLM input\/output pairs so you can a) look at them and learn what your users need and b) label them manually to build up your eval set.<\/p>\n\n\n\n<p>There are many options to go with here, from big monitoring providers you might already be using to open-source libraries that help you trace your LLM calls. 
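<\/p>\n\n\n\n<p>Whichever tooling you choose, the core requirement is small: persist every input\/output pair somewhere you can inspect and label later. A minimal sketch (the file path and record shape are illustrative; in production the records might go to S3 or a tracing backend instead):<\/p>

```python
import json
import time

def log_llm_call(messages: list, response_text: str,
                 path: str = "llm_calls.jsonl") -> None:
    # Append one JSON line per LLM call: enough to a) review what users
    # actually ask for and b) label the pair for your validation set.
    record = {
        "timestamp": time.time(),
        "messages": messages,
        "response": response_text,
        "label": None,  # filled in later by a human reviewer
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

<p>JSON lines are easy to bulk-load into whatever labeling interface you end up building or buying.<\/p>\n\n\n\n<p>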
Some, like <a href=\"https:\/\/github.com\/traceloop\/openllmetry\" target=\"_blank\" rel=\"noreferrer noopener\">openllmetry<\/a> and <a href=\"https:\/\/github.com\/Arize-ai\/openinference\" target=\"_blank\" rel=\"noreferrer noopener\">openinference<\/a>, even use OpenTelemetry under the hood, which seems like a great idea. I haven\u2019t seen a tool focused on labeling the data you\u2019ve gathered and turning it into a validation set, however, which is why we built our own solution: store some JSON files in S3 and have a web interface to look at and label them. It doesn\u2019t have as many bells and whistles as off-the-shelf options, but it\u2019s enough for what we need at the moment.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Invest_in_RAG\"><\/span>Invest in RAG<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>Once you\u2019ve exhausted all your prompt-engineering tricks and you feel like you\u2019re out of ideas and your performance is at a plateau, it might be time to invest in a RAG pipeline. Roughly, RAG is runtime prompt engineering where you build a system to dynamically add relevant things to your prompt before you ask the agent for an answer.<\/p>\n\n\n\n<p>An example might be answering questions about very recent events. This isn\u2019t something LLMs are good at, because they\u2019re not usually retrained to include the latest news. However, it\u2019s relatively straightforward to run a web search and include some of the most relevant news articles in your prompt before asking the LLM to give you the answer. If you have relevant data of your own you can leverage, it\u2019s likely to give you another noticeable improvement.<\/p>\n\n\n\n<p>Another example from our world: we\u2019ve got an agent interacting with the UI of an application based on plain-English prompts. 
We also have more than ten years worth of data from our clients writing testing prompts in English and our crowd of human testers executing those instructions. We can use this data to tell the <a href=\"https:\/\/www.rainforestqa.com\/ai-accelerated-testing\">AI UI testing<\/a> agent something like \u201cit looks like most human testers executing similar tasks clicked on button X and then typed Y into field Z\u201d to guide it.<\/p>\n\n\n\n<p>Retrieval is great and very powerful, but it has real trade-offs, complexity being the main one. Again, there are many options: you can roll your own solution, use an external provider, or have some combination of the two (e.g., using <a href=\"https:\/\/platform.openai.com\/docs\/guides\/embeddings\/how-to-get-embeddings\" target=\"_blank\" rel=\"noreferrer noopener\">OpenAI\u2019s embeddings<\/a> and <a href=\"https:\/\/github.com\/pgvector\/pgvector\" target=\"_blank\" rel=\"noreferrer noopener\">storing the vectors in your Postgres<\/a> instance).<\/p>\n\n\n\n<p>A particular library that looked great (a little more on the do-it-yourself end of the spectrum) is <a href=\"https:\/\/github.com\/bclavie\/RAGatouille\" target=\"_blank\" rel=\"noreferrer noopener\">RAGatouille<\/a>, but I wasn\u2019t able to make it work and gave up after a couple of hours. In the end, we used BigQuery to get data out, OpenAI for producing embeddings, and <a href=\"https:\/\/pinecone.io\/\" target=\"_blank\" rel=\"noreferrer noopener\">Pinecone<\/a> for storage and nearest-neighbor search because that was the easiest way for us to deploy something without setting up a lot of new infrastructure. Pinecone makes it very easy to store and search through embeddings with their associated metadata to augment your prompts.<\/p>\n\n\n\n<p>There\u2019s more we can do here \u2014 we didn\u2019t evaluate any alternative embedding engines, we only find top-3 related samples, and get limited data out of those samples currently. 
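<\/p>\n\n\n\n<p>Mechanically, the retrieval step boils down to nearest-neighbor search over embeddings plus prompt assembly. Here\u2019s a toy sketch with hand-rolled cosine similarity over made-up two-dimensional vectors (in our pipeline, embeddings come from OpenAI and the search happens in Pinecone):<\/p>

```python
import math

def cosine(a: list, b: list) -> float:
    # Cosine similarity between two vectors of equal length.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list, store: list, k: int = 3) -> list:
    # store: (vector, text) pairs; returns the k most similar texts.
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[0]),
                    reverse=True)
    return [text for _, text in ranked[:k]]

def augment_prompt(question: str, retrieved: list) -> str:
    # Splice the retrieved samples into the prompt before asking the LLM.
    context = "\n".join(f"- {text}" for text in retrieved)
    return f"Relevant past examples:\n{context}\n\nTask: {question}"
```

<p>A real vector store replaces the linear scan with an approximate index, but the prompt-assembly half stays this simple.<\/p>\n\n\n\n<p>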
Looking at alternative embeddings, including more samples, and getting more detailed information about each sample is something we plan to look at in the future.<\/p>\n\n\n\n<p>You can spend quite a while on this level of the ladder. There&#8217;s a lot of room for exploration. If you exhaust all the possibilities and ways of building your pipeline, still aren\u2019t getting good enough results and can\u2019t think of any more sources of useful data, it\u2019s time to consider fine-tuning.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Fine-tune_your_model\"><\/span>Fine-tune your model<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>Now we\u2019re getting to the edges of the known universe. If you\u2019ve done everything above: created an eval system, shipped an AI product, observed it running in production, got real user data, and even have a useful RAG pipeline, then congratulations! You\u2019re on the bleeding edge of applied AI.<\/p>\n\n\n\n<p>Where to go from here is unclear. Fine-tuning a model based on the data you\u2019ve gathered so far seems like the obvious choice. But beware \u2014 I\u2019ve heard conflicting opinions in the industry about the merits of fine-tuning relative to the effort required.<\/p>\n\n\n\n<p>It&nbsp;<em>seems<\/em>&nbsp;like it should work better, but there are unresolved practical matters: OpenAI <a href=\"https:\/\/platform.openai.com\/docs\/guides\/fine-tuning\/fine-tuning\" target=\"_blank\" rel=\"noopener\">only allows you to fine-tune older models<\/a>, and <a href=\"https:\/\/docs.anthropic.com\/claude\/docs\/glossary#fine-tuning\" target=\"_blank\" rel=\"noopener\">Anthropic is kind of promising to make it available<\/a> soon with a bunch of caveats.<\/p>\n\n\n\n<p>Fine-tuning and hosting your own models is a whole different area of expertise. Which model do you choose as the base? How do you gather data for fine-tuning? How do you evaluate any improvements? And so on. 
In the case of self-hosted models, I\u2019d caution against hoping to save money vs. hosted solutions \u2014 you\u2019re very unlikely to have the expertise and the economies of scale to get there, even if you choose smaller models.<\/p>\n\n\n\n<p>My advice would be to wait a few months for the dust to settle a bit before investing here, unless you\u2019ve tried everything else already and still aren\u2019t getting good-enough results. We haven\u2019t had to do this so far because we still haven\u2019t exhausted all the possibilities mentioned above, so we\u2019re postponing the increase in complexity.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Use_complementary_agents\"><\/span>Use complementary agents<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>Finally, I want to share a trick that might be applicable to your problem, but is sort of independent of the whole process described above \u2014 you can apply it at any of the stages I\u2019ve described.<\/p>\n\n\n\n<p>It involves a bit of a computation vs. reliability trade-off: it turns out that in many situations it\u2019s possible to throw more resources at a problem to get better results. The only question is: can you find a balance that\u2019s both fast and cheap enough while being accurate enough?<\/p>\n\n\n\n<p>You\u2019ll often feel like you\u2019re playing whack-a-mole when trying to fix specific problems with your LLM prompts. For instance, I often find there\u2019s a tension between creating the correct high-level plan of execution and the ability to precisely execute it. This reminded me of the idea behind&nbsp;<a href=\"https:\/\/blog.gardeviance.org\/2015\/03\/on-pioneers-settlers-town-planners-and.html\" target=\"_blank\" rel=\"noreferrer noopener\">\u201cpioneers, settlers, and city planners\u201d<\/a>: different people have different skills and approaches and thrive in different situations. 
It\u2019s rare for a single person to both have a good grand vision and to be able to precisely manage the vision&#8217;s execution.<\/p>\n\n\n\n<p>Of course, LLMs aren\u2019t people, but some of their properties make the analogy work. While it\u2019s difficult to prompt your way to an agent that always does the right thing, it\u2019s much easier to plan what\u2019s needed and create a team of specialists that complement each other.<\/p>\n\n\n\n<p>I\u2019ve seen a similar approach called an \u201censemble of agents,\u201d but I prefer \u201ccomplementary agents\u201d for this approach because it highlights that the agents are meaningfully different and support each other in ways that identical agents couldn\u2019t.<\/p>\n\n\n\n<p>For example, to achieve a non-obvious goal, it helps to create a high-level plan first, before jumping into the details. While creating high-level plans, it\u2019s useful to have a very broad and low-resolution view of the world without getting bogged down by details. Once you have a plan, however, executing each subsequent step is much easier with a narrow, high-resolution view of the world. Specific details can make or break your work, and at the same time, seeing irrelevant information can confuse you. How do we square this circle?<\/p>\n\n\n\n<p>One answer is creating teams of complementary agents to give each other feedback. LLMs are pretty good at correcting themselves if you tell them what they got wrong, and it\u2019s not too difficult to create a \u201cverifier\u201d agent that checks specific aspects of a given response.<\/p>\n\n\n\n<p>An example conversation between \u201chigh level planner\u201d and \u201cverifier\u201d agents might look something like the following:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Planner\nOK, we need to make a PB&amp;J sandwich! 
To do that, we need to get some peanut\nbutter, some jelly and some bread, then take out a plate and a knife.\n\nVerifier\nCool, that sounds good.\n\nPlanner\nOK, now take the peanut butter and spread it on the bread.\n\nVerifier\n(noticing there's no slice of bread visible) Wait, I can't see the bread in\nfront of me, you can't spread anything on it because it's not there.\n\nPlanner\nAh, of course, we need to take the slice of bread out first and put it on a\nplate.\n\nVerifier\nYep, that seems reasonable, let's do it.\n<\/code><\/pre>\n\n\n\n<p>The two agents complement each other, and neither can work on its own. Nobody is perfect, but we can build a reliable system out of flawed pieces if we\u2019re thoughtful about it.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How we&#8217;re using complementary agents<\/h3>\n\n\n\n<p>This is exactly what we did for our agents: there\u2019s a planner and a verifier. The planner knows the overall goal and tries to achieve it. It&#8217;s creative and can usually find a way to get to the goal even if it isn\u2019t immediately obvious. E.g., if you ask it to click on a product that\u2019s not on the current page, it&#8217;ll often use the search functionality to look for the product. But sometimes the planner is <em>too<\/em> optimistic and wants to do things that seem like they should be possible, but in fact aren\u2019t.<\/p>\n\n\n\n<p>For example, it might want to click on a \u201cPay now\u201d button on an e-commerce checkout page, because the button <em>should<\/em> be there, but it\u2019s just below the fold and not currently visible. 
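<\/p>\n\n\n\n<p>Mechanically, this planner-verifier interplay is just a retry loop: propose a step, let the verifier object, and feed the objection back to the planner. A minimal sketch with stubbed-out agents (in reality each function would be an LLM call; all names here are made up, not our production code):<\/p>\n\n\n\n

```python
def plan_next_step(goal, objections):
    """Stub planner: in reality an LLM call that sees the goal and past objections."""
    if objections:  # the verifier pushed back, so fix the precondition first
        return "take a slice of bread out and put it on a plate"
    return "spread peanut butter on the bread"

def verify(step, visible_objects):
    """Stub verifier: only checks the immediate situation, knows nothing of the goal."""
    if step.startswith("spread") and "slice of bread" not in visible_objects:
        return "I can't see a slice of bread in front of me"
    return None  # no objection

def run(goal, visible_objects, max_tries=3):
    """Propose, verify, and retry until the verifier accepts a step."""
    objections = []
    for _ in range(max_tries):
        step = plan_next_step(goal, objections)
        objection = verify(step, visible_objects)
        if objection is None:
            return step
        objections.append(objection)  # feed the pushback to the planner and retry
    raise RuntimeError("planner and verifier could not agree on a step")

print(run("make a PB&J sandwich", visible_objects=["jar of peanut butter", "plate"]))
```

\n\n\n\n<p>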
In such cases, the verifier (who doesn\u2019t know the overall goal and is only looking at the immediate situation) can correct the planner and point out that the concrete task we\u2019re trying to do right now isn&#8217;t possible.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Final_notes\"><\/span><strong>Final notes<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>Putting all this together again, we now have a pretty clear process for building LLM-based products that deal with their inherent unreliability.<\/p>\n\n\n\n<p>Making something usable in the real world still isn\u2019t a well-explored area, and things are changing really quickly, so you can\u2019t follow well-explored paths. It\u2019s basically applied research, rather than pure engineering. Still, having a clear process to follow while you\u2019re iterating will make your life easier and allow you to set the right expectations at the start.<\/p>\n\n\n\n<p>The process I\u2019ve described should follow a step-by-step increase in complexity, starting with naive prompting and finally doing RAG and possibly fine-tuning. At each step, evaluate whether the increased complexity is worth it \u2014 you might get good-enough results pretty early, depending on your use case. I bet getting through the first four steps will be enough to handle most of your problems.<\/p>\n\n\n\n<p>Keep in mind that this is going to be a very iterative process \u2014 there\u2019s no way to design and build AI systems in a single try. You won\u2019t be able to predict what works and how your users will bend your tool, so you absolutely need their feedback. You\u2019ll build the first, inadequate version, use it, notice flaws, improve it, release it more widely, etc. And if you\u2019re successful and build something that gets used, your prize will be more iterations and improvements. Yay!<\/p>\n\n\n\n<p>Also, don\u2019t sweat having a single, optimizable success metric too much. 
It\u2019s the ideal scenario, but it might take you a while to get to that point. Despite having a collection of tests and examples, we still rely on \u201cvibe checks\u201d when evaluating whether a new prompt version is an improvement. Just counting passing examples might not be enough if some examples are more important than others.<\/p>\n\n\n\n<p>Finally, try the \u201ccomplementary agents\u201d trick to work around weaknesses you notice. It\u2019s often very difficult to make a single agent do the right thing reliably, but detecting the wrong thing so you can retry tends to be easier.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">What\u2019s next for us<\/h3>\n\n\n\n<p>There\u2019s still a bunch of things that we\u2019re planning to work on in the coming months, so our product won&#8217;t be static. I don\u2019t expect, however, to deviate much from the process we described here. We\u2019re continuously speaking with our customers, monitoring how our product is being used as part of <a href=\"https:\/\/www.rainforestqa.com\/no-code-test-automation\">our no-code testing platform<\/a> and <a href=\"https:\/\/www.rainforestqa.com\/test-automation-services\">test automation services<\/a>, and finding edge cases to fix. We\u2019re also certainly not at the global optimum when it comes to reliability, speed, and cost, so we\u2019ll continue experimenting with alternative models, providers, and ways of working.<\/p>\n\n\n\n<p>Specifically, I\u2019m really intrigued by the latest Anthropic models (e.g., what can we usefully do with a small model like Haiku, which still has vision capabilities?) and I\u2019m deeply curious about the ideas <a href=\"https:\/\/github.com\/stanfordnlp\/dspy\" target=\"_blank\" rel=\"noopener\">DSPy<\/a> is promoting. 
I suspect there are some unrealized wins in the way we structure our prompts.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This is the process our engineering team uses to create reliable AI systems out of unreliable AI agents.<\/p>\n","protected":false},"author":13,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"content-type":"","inline_featured_image":false,"footnotes":""},"categories":[6],"tags":[15],"class_list":["post-1986","post","type-post","status-publish","format-standard","hentry","category-engineering","tag-ai"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.rainforestqa.com\/blog\/wp-json\/wp\/v2\/posts\/1986","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.rainforestqa.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.rainforestqa.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.rainforestqa.com\/blog\/wp-json\/wp\/v2\/users\/13"}],"replies":[{"embeddable":true,"href":"https:\/\/www.rainforestqa.com\/blog\/wp-json\/wp\/v2\/comments?post=1986"}],"version-history":[{"count":22,"href":"https:\/\/www.rainforestqa.com\/blog\/wp-json\/wp\/v2\/posts\/1986\/revisions"}],"predecessor-version":[{"id":2315,"href":"https:\/\/www.rainforestqa.com\/blog\/wp-json\/wp\/v2\/posts\/1986\/revisions\/2315"}],"wp:attachment":[{"href":"https:\/\/www.rainforestqa.com\/blog\/wp-json\/wp\/v2\/media?parent=1986"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.rainforestqa.com\/blog\/wp-json\/wp\/v2\/categories?post=1986"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.rainforestqa.com\/blog\/wp-json\/wp\/v2\/tags?post=1986"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}