{"id":3482,"date":"2026-02-12T23:37:37","date_gmt":"2026-02-12T23:37:37","guid":{"rendered":"https:\/\/www.rainforestqa.com\/blog\/?p=3482"},"modified":"2026-02-16T17:10:39","modified_gmt":"2026-02-16T17:10:39","slug":"web-application-crawler","status":"publish","type":"post","link":"https:\/\/www.rainforestqa.com\/blog\/web-application-crawler","title":{"rendered":"5 Lessons learned building a web application crawler"},"content":{"rendered":"\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/www.rainforestqa.com\/blog\/wp-content\/uploads\/2026\/02\/Web-Crawler-App-Blog-Post-Jiri-Header-Image.png\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"576\" src=\"https:\/\/www.rainforestqa.com\/blog\/wp-content\/uploads\/2026\/02\/Web-Crawler-App-Blog-Post-Jiri-Header-Image-1024x576.png\" alt=\"\" class=\"wp-image-3511\" srcset=\"https:\/\/www.rainforestqa.com\/blog\/wp-content\/uploads\/2026\/02\/Web-Crawler-App-Blog-Post-Jiri-Header-Image-1024x576.png 1024w, https:\/\/www.rainforestqa.com\/blog\/wp-content\/uploads\/2026\/02\/Web-Crawler-App-Blog-Post-Jiri-Header-Image-300x169.png 300w, https:\/\/www.rainforestqa.com\/blog\/wp-content\/uploads\/2026\/02\/Web-Crawler-App-Blog-Post-Jiri-Header-Image-768x432.png 768w, https:\/\/www.rainforestqa.com\/blog\/wp-content\/uploads\/2026\/02\/Web-Crawler-App-Blog-Post-Jiri-Header-Image-1536x864.png 1536w, https:\/\/www.rainforestqa.com\/blog\/wp-content\/uploads\/2026\/02\/Web-Crawler-App-Blog-Post-Jiri-Header-Image-2048x1152.png 2048w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/a><\/figure>\n\n\n\n<p><em>Building a web application crawler came with plenty of challenges\u2014here&#8217;s what we learned.<\/em><br><br>Recently, we built a web application crawler from scratch\u2014which had some scratching their heads, asking why we&#8217;d undertake such a thing. 
Here&#8217;s our answer to that, plus some interesting technical challenges we ran into and how we tackled them.<\/p>\n\n\n\n<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_82_2 counter-hierarchy ez-toc-counter ez-toc-custom ez-toc-container-direction\">\n<div class=\"ez-toc-title-container\">\n<p class=\"ez-toc-title\" style=\"cursor:inherit\">Contents<\/p>\n<span class=\"ez-toc-title-toggle\"><a href=\"#\" class=\"ez-toc-pull-right ez-toc-btn ez-toc-btn-xs ez-toc-btn-default ez-toc-toggle\" aria-label=\"Toggle Table of Content\"><span class=\"ez-toc-js-icon-con\"><span class=\"\"><span class=\"eztoc-hide\" style=\"display:none;\">Toggle<\/span><span class=\"ez-toc-icon-toggle-span\"><svg style=\"fill: #999;color:#999\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" class=\"list-377408\" width=\"20px\" height=\"20px\" viewBox=\"0 0 24 24\" fill=\"none\"><path d=\"M6 6H4v2h2V6zm14 0H8v2h12V6zM4 11h2v2H4v-2zm16 0H8v2h12v-2zM4 16h2v2H4v-2zm16 0H8v2h12v-2z\" fill=\"currentColor\"><\/path><\/svg><svg style=\"fill: #999;color:#999\" class=\"arrow-unsorted-368013\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"10px\" height=\"10px\" viewBox=\"0 0 24 24\" version=\"1.2\" baseProfile=\"tiny\"><path d=\"M18.2 9.3l-6.2-6.3-6.2 6.3c-.2.2-.3.4-.3.7s.1.5.3.7c.2.2.4.3.7.3h11c.3 0 .5-.1.7-.3.2-.2.3-.5.3-.7s-.1-.5-.3-.7zM5.8 14.7l6.2 6.3 6.2-6.3c.2-.2.3-.5.3-.7s-.1-.5-.3-.7c-.2-.2-.4-.3-.7-.3h-11c-.3 0-.5.1-.7.3-.2.2-.3.5-.3.7s.1.5.3.7z\"\/><\/svg><\/span><\/span><\/span><\/a><\/span><\/div>\n<nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/www.rainforestqa.com\/blog\/web-application-crawler\/#The_blank_page_problem_in_QA\" >The blank page problem in QA<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/www.rainforestqa.com\/blog\/web-application-crawler\/#The_path_to_building_our_web_application_crawler\" 
>The path to building our web application crawler<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/www.rainforestqa.com\/blog\/web-application-crawler\/#Option_3_Crawl_application_UIs\" >Option 3. Crawl application UIs<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/www.rainforestqa.com\/blog\/web-application-crawler\/#The_decision_Build_our_own_web_application_crawler\" >The decision: Build our own web application crawler<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-5\" href=\"https:\/\/www.rainforestqa.com\/blog\/web-application-crawler\/#Building_the_crawler\" >Building the crawler<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-6\" href=\"https:\/\/www.rainforestqa.com\/blog\/web-application-crawler\/#Final_thoughts_Expect_the_unexpected\" >Final thoughts: Expect the unexpected<\/a><\/li><\/ul><\/nav><\/div>\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"The_blank_page_problem_in_QA\"><\/span>The blank page problem in QA<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>Many customers of our QA platform face the same questions when it comes to testing their apps:<\/p>\n\n\n\n<p>\u201c<em>What<\/em> should I test?\u201d \u201dHow do I split my test cases?\u201d \u201dWhere do I even start?\u201d<\/p>\n\n\n\n<p>We sometimes refer to this as the \u201cblank page problem.\u201d<\/p>\n\n\n\n<p>It\u2019s challenging to determine which test cases to build and how to organize them, especially if you weren\u2019t intimately involved in building the entire application (and even if you were).<\/p>\n\n\n\n<p>We\u2019ve long been aware of the blank page problem in QA, but we didn\u2019t see a viable path to solving this problem until recently.<\/p>\n\n\n\n<p>The advent of more powerful AI models made it 
possible to help our customers automate this process. So, in 2025, we added a <a href=\"https:\/\/www.rainforestqa.com\/blog\/ai-test-planner-launch-ai-testing-tools\">test planning feature<\/a> to our product roadmap.<\/p>\n\n\n\n<p>After launching it, we spoke with one of our long-time customers. They told us, \u201cI knew mapping test coverage was going to be a big project. Then Rainforest released <a href=\"https:\/\/www.rainforestqa.com\/blog\/ai-test-planner-launch-ai-testing-tools\">AI Test Planner<\/a>\u2014it felt like the solution just dropped straight into my lap.\u201d That further validated our decision to prioritize this feature.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"The_path_to_building_our_web_application_crawler\"><\/span><strong>The path to building our web application crawler<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>Before getting started, I considered several paths forward:<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Option 1: Collect analytic events or track user interactions<\/h3>\n\n\n\n<p><em>Our questions: Could we tap into existing analytic events to find the most frequently used user flows? Or should we build our own JavaScript library to track that?<\/em><\/p>\n\n\n\n<p>I knew existing analytics setups would be diverse and of varying depth and quality. It would be hard to normalize the events and turn them into something useful. It didn\u2019t seem like a good way to get what we wanted.<\/p>\n\n\n\n<p>Building our own JS library for tracking user interactions would give us more control. Building good user interaction tracking would be challenging, but probably doable given enough time. Another challenge would be convincing clients to integrate such a library into their applications. <br><br>However, doing this properly \u2014 ensuring we don\u2019t collect sensitive information, remaining compliant with security and privacy standards, etc. \u2014 
seemed like an unnecessarily steep hill to climb.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Option 2: Analyze client code repositories<\/h3>\n\n\n\n<p><em>Our question: Can we use LLMs to analyze client codebases and extract user-facing flows out of them?<\/em><\/p>\n\n\n\n<p>I tried to analyze our own frontend codebase to see where this path could lead us. Although there were some reasonable user flow definitions in the output, there were also a lot of irrelevant details an end-user would never see in the UI. Filtering those out properly was tricky.<\/p>\n\n\n\n<p>It was also quite time consuming and relatively expensive to analyze the whole codebase. It has to be done in small pieces to get cleaner output, leading to a process that goes file by file, then summarizes the whole folder, and so on. Each of those steps would require a call to an LLM.<\/p>\n\n\n\n<p>Even if I managed to define smart enough prompts to deliver useful output (and put aside the cost and time requirements), there was still one big hurdle with this approach: We\u2019d need direct access to our clients\u2019 code repositories.<\/p>\n\n\n\n<p>While many of our competitors do require this level of access, we believe it\u2019s best to avoid this for security and privacy reasons.<\/p>\n\n\n\n<p>The alternative would be to ask clients to run the script themselves and then return the output to us. However, this still presented some security issues: What if their repository contained some authentication tokens, secrets, etc.? These could be leaked in the process.<\/p>\n\n\n\n<p>Although this was an interesting possible path, I realized either the output would be messy or we\u2019d be dependent on code access, presenting security issues we prefer to avoid.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Option_3_Crawl_application_UIs\"><\/span>Option 3. 
Crawl application UIs<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p><em>Our question: Could we crawl the client\u2019s application with a web application crawler and collect information that could be used to suggest which user flows should be tested?<\/em><\/p>\n\n\n\n<p>First, I looked around to see if there was an existing web application crawler I could use for this purpose. Quite a few are sold \u201cas a service,\u201d but some open source crawlers \/ scrapers are also available (e.g., <a href=\"https:\/\/crawlee.dev\/\" target=\"_blank\" rel=\"noopener\">Crawlee<\/a>, <a href=\"https:\/\/www.scrapy.org\/\" target=\"_blank\" rel=\"noopener\">Scrapy<\/a>). <br><br>All the tools I evaluated seemed to have one thing in common: They\u2019re built for publicly accessible, static web pages. They can extract URLs from anchor tags, which they visit to collect more URLs, and so on. They can collect some data from the pages after navigating to them.<\/p>\n\n\n\n<p>That\u2019s all fine and dandy, but I realized:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>We\u2019d need to get information about interactive applications, including ones that use authentication.<\/li>\n\n\n\n<li>Applications are often built as <a href=\"https:\/\/developer.mozilla.org\/en-US\/docs\/Glossary\/SPA\" target=\"_blank\" rel=\"noopener\">SPAs<\/a> and sometimes don\u2019t use anchor tags with URLs for navigation.<\/li>\n\n\n\n<li>We\u2019d need to collect as many details about the UI as possible, which includes parts that are only shown after an interactive element is clicked.<\/li>\n<\/ol>\n\n\n\n<p>Because of these three necessities, I knew that if we wanted to crawl our clients\u2019 applications to build a potential test map, I would have to build our own crawler that could handle complex, interactive UIs better than off-the-shelf options.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" 
id=\"The_decision_Build_our_own_web_application_crawler\"><\/span>The decision: Build our own web application crawler<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>After doing all this research and testing, I put together a proof of concept for our own crawler using Playwright. The initial results were promising \u2014 enough to convince us this was worth pursuing. <br><br>I knew it would be challenging to make it work for a wide variety of web applications, because there are so many different ways they can be built. Any assumptions about web standards would have to be thrown out the window. <br><br>Still, I felt confident that, given enough time and patience, I could make the crawler work for most applications \u2014 and that it would be worth it for our clients.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Building_the_crawler\"><\/span>Building the crawler<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>As explained above, our goal was to map the UI surface of an application by discovering as many unique UI states as possible to build a strong QA test plan. <br><br>To do this, the crawler would need to navigate through URLs, click on interactive elements, and handle authentication hurdles. The gathered data could then be used to define test flows covering the application\u2019s full functionality.<\/p>\n\n\n\n<p>Now, let\u2019s talk about how I actually built the crawler, including some challenges and lessons encountered while implementing it that others may find useful.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Challenge 1: Authentication \u2014 getting in (and staying in)<\/h3>\n\n\n\n<p>The web application crawler is mostly used on apps that require authentication. 
This means it has to have the ability to log in to apps with different login flows and forms, such as:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Standard username and password<\/li>\n\n\n\n<li>Logins requiring email addresses, phone numbers, user IDs, etc.<\/li>\n<\/ul>\n\n\n\n<p>To handle these varying login flows and forms, I built an AI agent that is assigned a username and password and uses Playwright to navigate through login forms. This was fairly straightforward at first, but more interesting cases started popping up during beta testing.<\/p>\n\n\n\n<p>For example, consider this flow:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Land on a page that has a login link in the header<\/li>\n\n\n\n<li>Click it to reveal a modal that asks for an email address<\/li>\n\n\n\n<li>Fill out the form and submit it, leading to a Microsoft Cloud login page that asks for an email<\/li>\n\n\n\n<li>After pressing a \u201ccontinue\u201d button, it asks for a password<\/li>\n\n\n\n<li>Submitting the form redirects back to the initial page \u2014 but this time with a different modal that shows two drop-downs; one of them has no selected option and the form can\u2019t be submitted until an option is selected<\/li>\n\n\n\n<li>Submitting the form after choosing an option from the dropdown finally leads to the actual app UI (phew)<\/li>\n<\/ul>\n\n\n\n<p>Needless to say, I had to adjust the crawler to handle this sort of login flow.<\/p>\n\n\n\n<p>The agent currently doesn\u2019t have the ability to handle two-factor authentication because it comes in so many shapes and forms today, but fortunately the crawler still seems to work well for most apps we\u2019ve tried it against.<\/p>\n\n\n\n<p>Keeping the crawler signed in was another challenge. Since it navigates through the application and clicks around, there\u2019s always a chance it could sign out of the application and get stuck. 
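<\/p>\n\n\n\n<p>In practice, a lost session tends to show up in one of a few recognizable ways: a redirect to a login route, or a login prompt rendered in place. Here\u2019s a minimal sketch of that kind of check in plain JavaScript (the URL patterns and phrases are illustrative examples, not our production list):<\/p>

```javascript
// Heuristic check for a lost session, using two signals the crawler
// already has at hand: the current URL and the page's visible text.
// Both pattern lists are illustrative examples, not an exhaustive set.
function sessionLooksExpired(currentUrl, visibleText) {
  const { pathname } = new URL(currentUrl);
  // Many apps bounce unauthenticated users to a login route.
  if (/\/(login|signin|sign-in|auth)\b/i.test(pathname)) return true;
  // Others render a login prompt in place without changing the URL.
  return /\b(sign in|log in|session (has )?expired)\b/i.test(visibleText);
}
```

\n\n\n\n<p>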
<br><br>To get around this, I implemented ways to detect whether the authentication session is still alive. I added a re-authentication process that gets triggered when the session has been lost. The crawler also tries to avoid buttons and URLs that could end the session.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Challenge 2: Defining page boundaries when drawing an application map<\/h3>\n\n\n\n<p>Navigating web apps is easy, right? You just follow anchor tags with URLs in href attributes and voila! Well, not exactly\u2026 That\u2019s the ideal scenario, of course. Web standards do exist, and we like to pretend everyone follows them, but the reality is that people can and do get very, well, creative\u2026 <\/p>\n\n\n\n<p>In the real world, we come across myriad different web app navigation styles. So, to make the web application crawler work, I had to make it smart enough to deduce the \u201cintention\u201d of various elements.<\/p>\n\n\n\n<p>Consider these examples:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<ol class=\"wp-block-list\">\n<li>A navigation menu with plain DIV tags as items and some JavaScript attached to them resulting in route changes when clicked: Happens.<\/li>\n\n\n\n<li>An application where the URL is static although it contains many different pages: Not as uncommon as you might think.<\/li>\n\n\n\n<li>Anchor tags with URLs that aren\u2019t meant to be visited, just clicked on? JavaScript that stops the standard URL-based navigation and instead just fetches a chunk of HTML from the URL and replaces a section of content on the screen with it? Rare, but it\u2019s out there.<\/li>\n<\/ol>\n<\/blockquote>\n\n\n\n<p>Our web application crawler is currently set up in a way where it\u2019s helpful to know that a new page has been reached. 
This is because the crawler is exploring the app by finding links and by clicking on interactive elements.<\/p>\n\n\n\n<p>Each new page is scanned for interactive elements, and, if some are found, new crawl tasks are added to the queue. That said, it\u2019s effectively impossible to clearly define application page boundaries for all web apps.<\/p>\n\n\n\n<p>One application already forced us to prototype a different approach to crawling where:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Anchor tags with href URLs don\u2019t get special treatment (i.e., they\u2019re not scanned to collect new URLs that should be added to the crawler queue); all interactive elements are handled the same way<\/li>\n\n\n\n<li>The URL state doesn\u2019t matter<\/li>\n\n\n\n<li>Significant new UIs are detected using visual diffing<\/li>\n<\/ol>\n\n\n\n<p>The drawback to this approach is that the location of a new page is described by element interactions that preceded reaching it. So instead of navigating to a specific URL at the start of a task, the crawler has to execute a chain of interactions to get to a task location (which is naturally more error-prone). (Note: This prototype hasn\u2019t made it to production yet, since apps that need this type of special treatment are fairly rare.)<\/p>\n\n\n\n<p>Detecting page transitions is one problem, but what if the crawler navigates to a page with a different hostname?<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Challenge 3: Defining crawling boundaries<\/h3>\n\n\n\n<p>Which URLs should we crawl in a given application? Once again, this is a problem that looks relatively well-defined and easy to handle at first glance. But on closer examination, it&#8217;s more complex than it seems.<\/p>\n\n\n\n<p>This is basically about telling the crawler what is a valid hostname to crawl and what isn\u2019t. Subdomains make for an even more interesting challenge. 
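<\/p>\n\n\n\n<p>An obvious starting point is a static rule that accepts the starting hostname plus any sibling subdomain of the same parent domain. A sketch of that baseline (a hypothetical helper, not our actual implementation):<\/p>

```javascript
// Naive crawl-boundary check: accept a candidate URL if its hostname
// equals the starting hostname or shares the same parent domain.
// Deliberately simple baseline; among other things it ignores
// multi-part TLDs like .co.uk, where keeping only the last two
// labels gives the wrong parent domain.
function inCrawlScope(startUrl, candidateUrl) {
  const start = new URL(startUrl).hostname;            // e.g. "dashboard.app.com"
  const candidate = new URL(candidateUrl).hostname;
  const parent = start.split('.').slice(-2).join('.'); // "app.com"
  return (
    candidate === start ||
    candidate === parent ||
    candidate.endsWith('.' + parent)
  );
}
```

\n\n\n\n<p>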
For example, consider these two contrasting scenarios:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>The user lands on <code>dashboard.app.com<\/code>. After logging in, the navigation menu leads to other pages which are each hosted on another subdomain of <code>app.com<\/code>, like <code>demo.app.com<\/code> or <code>settings.app.com<\/code>. All of these should be crawled.<\/li>\n\n\n\n<li>The user lands on <code>client.app.com<\/code>. After logging in, the navigation menu leads to other parts of the app on the same domain. But the app also contains links to <code>support.app.com<\/code>, <code>blog.app.com<\/code>, and <code>statuspage.app.com<\/code>. Only content on <code>client.app.com<\/code> should be crawled; the rest is static content that\u2019s not useful for figuring out what tests need to be defined to cover the main application logic.<\/li>\n<\/ol>\n\n\n\n<p>I couldn\u2019t think of any heuristic that would avoid these issues altogether. I ended up creating a subdomain validation agent which checks the contents of newly discovered subdomains and decides whether they\u2019re relevant for the crawler or not.<\/p>\n\n\n\n<p>Once the web application crawler knows where it\u2019s allowed to go, the next step is finding out which elements to interact with.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Challenge 4: Finding interactive elements (a.k.a. just give me a <code>&lt;button&gt;<\/code>, please)<\/h3>\n\n\n\n<p>Unfortunately for our purposes when building the crawler, finding interactive elements on a page can be hard in certain cases. Web developers are a creative bunch, and it shows. Try to think of the most obscure way a button could be defined on a web page. 
There\u2019s a good chance somebody has already done it \u2014 and it\u2019s out there in production generating money.<\/p>\n\n\n\n<p>As a result, the process of describing interactive elements for the crawler leads to an ever-growing set of heuristics and CSS selectors, scanning webpages for any and all unexpected engineering choices. Here\u2019s a peek into some of the exceptions we\u2019ve had to build around:<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>No Semantic HTML<\/strong> <\/h4>\n\n\n\n<p>Everything would be so much simpler if developers consistently used <code>button<\/code> tags or at least <code>role=\"button\"<\/code> attributes for buttons and other accessibility roles like <code>menu<\/code> and <code>menu-item<\/code> in the UIs they build.<\/p>\n\n\n\n<p>Instead, in some real-world apps, all you get is a plain <code>div<\/code> element with no significant attributes, maybe with a class name that includes the word \u201cbutton.\u201d Another app might use table elements instead of buttons \u2014 and by table elements I mean <code>table -&gt; tbody -&gt; tr -&gt; td<\/code>, the whole hierarchy (to keep the HTML valid). Of course, these table buttons are used within a table-based UI. Seeing applications like this can feel like taking a peek into a time capsule&#8230; that maybe shouldn\u2019t have been opened\u2026<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>iframes, shadow DOM and custom HTML<\/strong><\/h4>\n\n\n\n<p>Even once I had a relatively good set of rules put together that could find most elements in most apps, I started running into pages that used iframes, custom HTML tags, and shadow DOM (oh my!) 
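<\/p>\n\n\n\n<p>One wrinkle worth calling out: <code>document.querySelectorAll<\/code> doesn\u2019t descend into shadow roots, so a crawler has to walk them explicitly via each element\u2019s <code>shadowRoot<\/code> property. A simplified sketch of that walk, written against a minimal node shape so it can run outside a browser (real DOM elements expose the same properties):<\/p>

```javascript
// Collect elements matching a predicate, descending into open shadow
// roots as well as regular children. document.querySelectorAll() stops
// at shadow boundaries, so a crawler needs an explicit walk like this.
// Uses a minimal node shape ({ tagName, children, shadowRoot }) so the
// function is unit-testable outside a browser.
function collectElements(root, matches, found = []) {
  for (const child of root.children || []) {
    if (matches(child)) found.push(child);
    // Closed shadow roots expose shadowRoot === null and stay opaque.
    if (child.shadowRoot) collectElements(child.shadowRoot, matches, found);
    collectElements(child, matches, found);
  }
  return found;
}
```

\n\n\n\n<p>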
Same-origin iframes and open shadow DOM nodes are manageable, but custom HTML elements were an interesting challenge.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Mapping custom HTML tag interactivity<\/strong><\/h4>\n\n\n\n<p>The only way to identify a custom tag is by its name, which must contain a dash character (<code>-<\/code>). CSS selectors don\u2019t support wildcards in tag names, so finding custom tags is only possible by checking the name of every element on the page for a dash. Not all custom tags are interactive, though, so I decided to develop an exploratory phase in which the web application crawler tries to interact with the custom tags it finds. It keeps track of which tags caused the page to update when interacted with and which did nothing. When the exploratory phase ends, the crawler will continue using the custom tags from the interactive bucket and forget about the rest.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Challenge 5: Generating selectors for interactive elements<\/h3>\n\n\n\n<p>Once the crawler finds an interactive element, it needs a selector that can reliably locate that same element again later. I ended up with a prioritized list of usable attributes (like <code>data-test-id<\/code>) and selector generation options, with an ugly final fallback to a CSS selector that maps the whole path through the tag hierarchy from the root of the page.<\/p>\n\n\n\n<p>At one point, I suddenly realized it was naive to think that element IDs could be used as stable selectors. Alas, some frontend frameworks generate random element IDs that are different each time the page is loaded, which means IDs can\u2019t be trusted. (At least not all of them.)<\/p>\n\n\n\n<p><strong>One useful tip:<\/strong> XPaths can give you something CSS selectors can\u2019t: the ability to search by text content. I decided to use XPath selectors whenever an element had text content. A combination of an element\u2019s tag name and text content can often provide a good selector. 
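<\/p>\n\n\n\n<p>As a sketch of the idea (a hypothetical helper, not our actual selector generator):<\/p>

```javascript
// Build an XPath that locates an element by tag name and text content.
// normalize-space() trims and collapses whitespace, which smooths over
// formatting differences in the HTML source.
// Caveat: XPath 1.0 string literals have no escape syntax, so text
// containing double quotes would need concat(); this sketch skips that.
function xpathByText(tagName, text) {
  return `//${tagName}[normalize-space(.)="${text}"]`;
}
// e.g. xpathByText('button', 'Save changes')
//   -> '//button[normalize-space(.)="Save changes"]'
```

\n\n\n\n<p>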
Having XPaths in the toolbox has proved useful.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Final_thoughts_Expect_the_unexpected\"><\/span>Final thoughts: Expect the unexpected<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>It\u2019s impossible to be prepared for all the different ways a web application can be built. Building a web application crawler like Rainforest\u2019s (or a similar project) requires testing against a wide variety of real-world web apps. The best you can do is learn as you go, observe and adjust, throw away assumptions one by one, and embrace the reality that it will never be perfect.<\/p>\n\n\n\n<p>I recognized from the very start that we couldn\u2019t build a perfect crawler \u2014 no one can (at least at this point in time). And we\u2019re okay with that. As they say, \u201cdon\u2019t let perfect be the enemy of good.\u201d <br><br>Our crawler works alongside LLMs that are also imperfect (by nature). The resulting test plans aren\u2019t perfect, but they provide a really solid start for users who need to generate an initial test plan or extend coverage.<\/p>\n\n\n\n<p>Most importantly, customers have shared how helpful the feature our web application crawler powers (<a href=\"https:\/\/www.rainforestqa.com\/blog\/ai-test-planner-launch-ai-testing-tools\">AI Test Planner<\/a>) has proven for their QA workflow. That makes solving these challenges well worth it to us.<\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>How we built a custom web crawler to solve QA test planning. 
Learn about handling authentication, SPAs, interactive elements, and the unexpected challenges of crawling modern web applications.<\/p>\n","protected":false},"author":4,"featured_media":1601,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"content-type":"","inline_featured_image":false,"footnotes":""},"categories":[6],"tags":[15,29,37],"class_list":["post-3482","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-engineering","tag-ai","tag-engineering","tag-quality-assurance"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.rainforestqa.com\/blog\/wp-json\/wp\/v2\/posts\/3482","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.rainforestqa.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.rainforestqa.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.rainforestqa.com\/blog\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/www.rainforestqa.com\/blog\/wp-json\/wp\/v2\/comments?post=3482"}],"version-history":[{"count":8,"href":"https:\/\/www.rainforestqa.com\/blog\/wp-json\/wp\/v2\/posts\/3482\/revisions"}],"predecessor-version":[{"id":3528,"href":"https:\/\/www.rainforestqa.com\/blog\/wp-json\/wp\/v2\/posts\/3482\/revisions\/3528"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.rainforestqa.com\/blog\/wp-json\/wp\/v2\/media\/1601"}],"wp:attachment":[{"href":"https:\/\/www.rainforestqa.com\/blog\/wp-json\/wp\/v2\/media?parent=3482"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.rainforestqa.com\/blog\/wp-json\/wp\/v2\/categories?post=3482"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.rainforestqa.com\/blog\/wp-json\/wp\/v2\/tags?post=3482"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}