Building a web application crawler came with plenty of challenges—here’s what we learned.

Recently, we built a web application crawler from scratch—which had some scratching their heads, asking why we’d undertake such a thing. Here’s our answer to that, plus some interesting technical challenges we ran into and how we tackled them.

The blank page problem in QA

Many customers of our QA platform face the same questions when it comes to testing their apps:

“What should I test?” “How do I split my test cases?” “Where do I even start?”

We sometimes refer to this as the “blank page problem.”

It’s challenging to determine which test cases to build and how to organize them, especially if you weren’t intimately involved in building the entire application (and even if you were).

We’ve long been aware of the blank page problem in QA, but we didn’t see a viable path to solving this problem until recently.

The advent of more powerful AI models made it possible to help our customers automate this process. So, in 2025, we added a test planning feature to our product roadmap.

After launching it, we spoke with one of our long-time customers. They told us, “I knew mapping test coverage was going to be a big project. Then Rainforest released AI Test Planner—it felt like the solution just dropped straight into my lap.” That further validated our decision to prioritize this feature.

The path to building our web application crawler

Before getting started, I considered several paths forward:

Option 1: Collect analytics events or track user interactions

Our questions: Could we tap into existing analytics events to find the most frequently used user flows? Or should we build our own JavaScript library to track that?

I knew existing analytics setups would be diverse and of varying depth and quality. It would be hard to normalize the events and turn them into something useful. It didn’t seem like a good way to get what we wanted.

Building our own JS library for tracking user interactions would give us more control. Building good user interaction tracking would be challenging, but probably doable given enough time. Another challenge would be convincing clients to integrate such a library into their applications.

However, doing this properly (ensuring we don’t collect sensitive information, remaining compliant with security and privacy standards, and so on) seemed like an unnecessarily steep hill to climb.

Option 2: Analyze client code repositories

Our question: Can we use LLMs to analyze client codebases and extract user-facing flows out of them?

I tried to analyze our own frontend codebase to see where this path could lead us. Although there were some reasonable user flow definitions in the output, there were also a lot of irrelevant details an end-user would never see in the UI. Filtering those out properly was tricky.

It was also quite time-consuming and relatively expensive to analyze the whole codebase. It had to be done in small pieces to get cleaner output, leading to a process that went file by file, then summarized the whole folder, and so on. Each of those steps would require a call to an LLM.

Even if I managed to define smart enough prompts to deliver useful output (and put aside the cost and time requirements), there was still one big hurdle with this approach: We’d need direct access to our clients’ code repositories.

While many of our competitors do require this level of access, we believe it’s best to avoid this for security and privacy reasons.

The alternative would be to ask clients to run the script themselves and then return the output to us. However, this still presented some security issues: What if their repository contained some authentication tokens, secrets, etc.? These could be leaked in the process.

Although this was an interesting possible path, I realized either the output would be messy or we’d be dependent on code access, presenting security issues we prefer to avoid.

Option 3: Crawl application UIs

Our question: Could we crawl the client’s application with a web application crawler and collect information that could be used to suggest which user flows should be tested?

First, I looked around to see if there was an existing web application crawler I could use for this purpose. Quite a few are sold “as a service,” but some open source crawlers / scrapers are also available (e.g., Crawlee, Scrapy).

All the tools I evaluated seemed to have one thing in common: They’re built for publicly accessible, static web pages. They can extract URLs from anchor tags, which they visit to collect more URLs, and so on. They can collect some data from the pages after navigating to them.

That’s all fine and dandy, but I realized:

  1. We’d need to get information about interactive applications, including ones that use authentication.
  2. Applications are often built as SPAs and sometimes don’t use anchor tags with URLs for navigation.
  3. We’d need to collect as many details about the UI as possible, which includes parts that are only shown after an interactive element is clicked.

Because of these three necessities, I knew that if we wanted to crawl our clients’ applications to build a potential test map, I would have to build our own crawler that could handle complex, interactive UIs better than off-the-shelf options.

The decision: Build our own web application crawler

After doing all this research and testing, I put together a proof of concept for our own crawler using Playwright. The initial results were promising, enough to convince us this was worth pursuing.

I knew it would be challenging to make it work for a wide variety of web applications, because there are so many different ways they can be built. Any assumptions about web standards would have to be thrown out the window.

Still, I felt confident that, given enough time and patience, I could make the crawler work for most applications — and that it would be worth it for our clients.

Building the crawler

As explained above, our goal was to map the UI surface of an application by discovering as many unique UI states as possible to build a strong QA test plan.

To do this, the crawler would need to navigate through URLs, click on interactive elements, and handle authentication hurdles. The gathered data could then be used to define test flows covering the application’s full functionality.

Now, let’s talk about how I actually built the crawler, including some challenges and lessons encountered while implementing it that others may find useful.

Challenge 1: Authentication — getting in (and staying in)

The web application crawler is mostly used on apps that require authentication. This means it has to have the ability to log in to apps with different login flows and forms, such as:

  • Standard username and password
  • Logins requiring email addresses, phone numbers, user IDs, etc.

To handle these varying login flows and forms, I built an AI agent that is assigned a username and password and uses Playwright to navigate through login forms. This was fairly straightforward at first, but more interesting cases started popping up during beta testing.
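To give a sense of the simplest case the agent handles, here’s a minimal Playwright sketch of a standard username/password login. The selectors and the credentials type are illustrative assumptions for the sketch, not our actual implementation (which leaves these decisions to the agent):

```typescript
// A minimal login sketch. The selectors below are illustrative guesses;
// the real agent decides what to fill and click based on the page it sees.
import { Page } from 'playwright';

interface LoginCredentials {
  username: string; // could also be an email, phone number, user ID, etc.
  password: string;
}

async function loginWithForm(page: Page, creds: LoginCredentials): Promise<void> {
  // Fill whichever username-like field is present.
  const userField = page
    .locator('input[type="email"], input[type="text"], input[name*="user" i]')
    .first();
  await userField.fill(creds.username);

  // Some flows only reveal the password field after submitting the identifier.
  const passwordField = page.locator('input[type="password"]').first();
  if ((await passwordField.count()) === 0) {
    await page.locator('button[type="submit"]').first().click();
    await passwordField.waitFor({ state: 'visible' });
  }

  await passwordField.fill(creds.password);
  await page.locator('button[type="submit"]').first().click();
  await page.waitForLoadState('networkidle');
}
```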

For example, consider this flow:

  • Land on a page that has a login link in the header
  • Click it to reveal a modal that asks for an email address
  • Fill out the form and submit it, leading to a Microsoft Cloud login page that asks for an email
  • After pressing a “continue” button, it asks for a password
  • Submitting the form redirects back to the initial page — but this time with a different modal that shows two drop-downs; one of them has no selected option and the form can’t be submitted until an option is selected
  • Submitting the form after choosing an option from the dropdown finally leads to the actual app UI (phew)

Needless to say, I had to adjust the crawler to handle this sort of login flow.

The agent currently doesn’t have the ability to handle two-factor authentication because it comes in so many shapes and forms today, but fortunately the crawler still seems to work well for most apps we’ve tried it against.

Keeping the crawler signed in was another challenge. Since it navigates through the application and clicks around, there’s always a chance it could sign out of the application and get stuck.

To get around this, I implemented ways to detect whether the authentication session is still alive, plus a re-authentication process that gets triggered when the session has been lost. The crawler also tries to avoid buttons and URLs that could end the session.
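As a rough illustration, the liveness check can be as simple as looking for signed-out signals between tasks. The signals in the sketch below (a visible password field or a login-looking URL) are assumptions; the real checks are tuned per app:

```typescript
// Sketch of a session check run between crawl tasks. The "signed-out" signals
// here are illustrative; real apps need app-specific heuristics.
import { Page } from 'playwright';

async function isSessionAlive(page: Page): Promise<boolean> {
  const passwordFields = await page.locator('input[type="password"]').count();
  const onLoginPage = /login|signin/i.test(page.url());
  return passwordFields === 0 && !onLoginPage;
}

async function ensureAuthenticated(
  page: Page,
  reauthenticate: () => Promise<void> // re-runs the login agent
): Promise<void> {
  if (!(await isSessionAlive(page))) {
    await reauthenticate();
  }
}
```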

Challenge 2: Defining page boundaries when drawing an application map

Navigating web apps is easy, right? You just follow anchor tags with URLs in href attributes and voila! Well, not exactly… That’s the ideal scenario, of course. Web standards do exist, and we like to pretend everyone follows them, but the reality is that people can and do get very, well, creative…

In the real world, we come across myriad different web app navigation styles. So, to make the web application crawler work, I had to make it smart enough to deduce the “intention” of various elements.

Consider these examples:

  1. A navigation menu with plain DIV tags as items and some JavaScript attached to them resulting in route changes when clicked: Happens.
  2. An application where the URL is static although it contains many different pages: Not as uncommon as you might think.
  3. Anchor tags with URLs that aren’t meant to be visited, just clicked on? JavaScript that stops the standard URL-based navigation and instead just fetches a chunk of HTML from the URL and replaces a section of content on the screen with it? Rare, but it’s out there.

Our web application crawler is currently set up in a way where it’s helpful to know that a new page has been reached. This is because the crawler is exploring the app by finding links and by clicking on interactive elements.

Each new page is scanned for interactive elements, and, if some are found, new crawl tasks are added to the queue. That said, it’s effectively impossible to clearly define application page boundaries for all web apps.
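In rough terms, the default crawl loop looks like the sketch below: replay the interactions that define a task, scan the resulting state, and enqueue a follow-up task per discovered URL or interactive element. The task shape and selector list are simplified assumptions; boundary checks and the real element heuristics are covered in the next two challenges:

```typescript
// Simplified shape of the crawl loop: anchor hrefs become URL-based tasks,
// other interactive elements become click-based tasks on the current page.
// Boundary checks (Challenge 3) and the full element heuristics (Challenge 4)
// are omitted here.
import { Page } from 'playwright';

const CLICKABLE_SELECTORS = ['button', '[role="button"]'];

interface CrawlTask {
  url: string;                                        // where the task starts
  clicks: Array<{ selector: string; index: number }>; // interactions to replay
}

const queue: CrawlTask[] = [];
const seen = new Set<string>();

function enqueue(task: CrawlTask): void {
  const key = JSON.stringify(task);
  if (!seen.has(key)) {
    seen.add(key);
    queue.push(task);
  }
}

async function processTask(page: Page, task: CrawlTask): Promise<void> {
  await page.goto(task.url);
  for (const click of task.clicks) {
    await page.locator(click.selector).nth(click.index).click();
  }

  // Anchor tags get special treatment: their hrefs are collected as new URLs.
  const hrefs = await page.$$eval('a[href]', (anchors) =>
    anchors.map((a) => (a as HTMLAnchorElement).href)
  );
  for (const href of hrefs) enqueue({ url: href, clicks: [] });

  // Everything else becomes a click task replayed on top of this one.
  for (const selector of CLICKABLE_SELECTORS) {
    const count = await page.locator(selector).count();
    for (let index = 0; index < count; index++) {
      enqueue({ url: task.url, clicks: [...task.clicks, { selector, index }] });
    }
  }
}
```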

One application already forced us to prototype a different approach to crawling where:

  1. Anchor tags with href URLs don’t get special treatment (i.e., they’re not scanned to collect new URLs that should be added to the crawler queue); all interactive elements are handled the same way
  2. The URL state doesn’t matter
  3. Significant new UIs are detected using visual diffing.

The drawback to this approach is that the location of a new page is described by the element interactions that preceded reaching it. So instead of navigating to a specific URL at the start of a task, the crawler has to execute a chain of interactions to get to a task location (which is naturally more error-prone). (Note: This prototype hasn’t made it to production yet, since apps that need this type of special treatment are fairly rare.)
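For the curious, the core of that prototype’s visual diffing can be sketched roughly as below, comparing screenshots taken before and after an interaction with pixelmatch. The thresholds are illustrative, not the values we settled on:

```typescript
// Sketch of the visual-diffing check from the prototype: screenshot before and
// after an interaction and treat a large pixel diff as a significant new UI state.
// The 0.1 / 5% thresholds are illustrative, not production values.
import { Page } from 'playwright';
import { PNG } from 'pngjs';
import pixelmatch from 'pixelmatch';

async function isSignificantNewUi(page: Page, previousShot: Buffer): Promise<boolean> {
  const currentShot = await page.screenshot();
  const before = PNG.sync.read(previousShot);
  const after = PNG.sync.read(currentShot);

  // Different dimensions almost certainly means a different page layout.
  if (before.width !== after.width || before.height !== after.height) return true;

  const changedPixels = pixelmatch(
    before.data,
    after.data,
    null,
    before.width,
    before.height,
    { threshold: 0.1 }
  );
  return changedPixels / (before.width * before.height) > 0.05;
}
```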

Detecting page transitions is one problem, but what if the crawler navigates to a page with a different hostname?

Challenge 3: Defining crawling boundaries

Which URLs should we crawl in a given application? Once again, this is a problem that looks relatively well-defined and easy to handle at first glance. But on closer examination, it’s more complex than it seems.

This is basically about telling the crawler what is a valid hostname to crawl and what isn’t. Subdomains make for an even more interesting challenge. For example, consider these two contrasting scenarios:

  1. The user lands on dashboard.app.com. After logging in, the navigation menu leads to other pages which are each hosted on another subdomain of app.com, like demo.app.com or settings.app.com. All of these should be crawled.
  2. The user lands on client.app.com. After logging in, the navigation menu leads to other parts of the app on the same domain. But the app also contains links to support.app.com, blog.app.com, and statuspage.app.com. Only content on client.app.com should be crawled; the rest is static content that’s not useful for figuring out what tests need to be defined to cover the main application logic.

I couldn’t think of any heuristic that would avoid these issues altogether. I ended up creating a subdomain validation agent which checks the contents of newly discovered subdomains and decides whether they’re relevant for the crawler or not.
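The gating logic itself is simple once the validation agent exists; the sketch below shows the general shape, with the agent reduced to a callback. The names and structure are illustrative:

```typescript
// Sketch of the crawl-boundary check: sharing the app's root domain is a
// necessary condition, and each newly discovered subdomain is handed to a
// validation step (reduced to a callback here) that decides whether it's part
// of the app or just static content like a blog or status page.
const approvedHosts = new Set<string>();
const rejectedHosts = new Set<string>();

function sharesRootDomain(hostname: string, rootDomain: string): boolean {
  return hostname === rootDomain || hostname.endsWith(`.${rootDomain}`);
}

async function shouldCrawl(
  url: string,
  rootDomain: string,
  validateSubdomain: (hostname: string) => Promise<boolean> // the validation agent
): Promise<boolean> {
  const { hostname } = new URL(url);
  if (!sharesRootDomain(hostname, rootDomain)) return false;
  if (approvedHosts.has(hostname)) return true;
  if (rejectedHosts.has(hostname)) return false;

  // First time we see this subdomain: let the validation agent inspect it.
  const relevant = await validateSubdomain(hostname);
  (relevant ? approvedHosts : rejectedHosts).add(hostname);
  return relevant;
}
```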

Once the web application crawler knows where it’s allowed to go, the next step is finding out which elements to interact with.

Challenge 4: Finding interactive elements (a.k.a. just give me a <button>, please)

Unfortunately for our purposes when building the crawler, finding interactive elements on a page can be hard in certain cases. Web developers are a creative bunch, and it shows. Try to think of the most obscure way a button could be defined on a web page. There’s a good chance somebody has already done it — and it’s out there in production generating money.

As a result, the process of describing interactive elements for the crawler leads to an ever-growing set of heuristics and CSS selectors, scanning webpages for any and all unexpected engineering choices. Here’s a peek into some of the exceptions we’ve had to build around:

No Semantic HTML

Everything would be so much simpler if developers consistently used button tags or at least role="button" attributes for buttons and other accessibility roles like menu and menuitem in the UIs they build.

Instead, in some real-world apps, all you get is a plain div element with no significant attributes, maybe with a class name that includes the word “button.” Another app might use table elements instead of buttons — and by table elements I mean table -> tbody -> tr -> td, the whole hierarchy (to keep the HTML valid). Of course, these table buttons are used within a table-based UI. Seeing applications like this can feel like taking a peek into a time capsule… that maybe shouldn’t have been opened…
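To make that concrete, here’s a small slice of what such a heuristic scan can look like. The selector list below is an illustrative sample, not our full (and ever-growing) set:

```typescript
// A small slice of the heuristic element scan: semantic HTML and ARIA roles
// first, then fallbacks for patterns like divs with "button" in the class name
// or table cells wired up as buttons. The list is illustrative, not exhaustive.
import { Page, Locator } from 'playwright';

const INTERACTIVE_SELECTORS = [
  'button, [role="button"], [role="menuitem"], a[href]', // semantic / ARIA
  'input[type="submit"], input[type="button"]',
  '[onclick]',                                           // inline click handlers
  'div[class*="button" i], span[class*="button" i]',     // "button" hiding in a class name
  'td[class*="button" i]',                                // table-cell "buttons"
];

async function findInteractiveElements(page: Page): Promise<Locator[]> {
  const found: Locator[] = [];
  for (const selector of INTERACTIVE_SELECTORS) {
    const matches = page.locator(selector);
    const count = await matches.count();
    for (let i = 0; i < count; i++) {
      found.push(matches.nth(i));
    }
  }
  return found;
}
```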

iframes, shadow DOM and custom HTML

Even once I had a relatively good set of rules put together that could find most elements in most apps, I started running into pages that used iframes, custom HTML tags, and shadow DOM (oh my!). Same-origin iframes and open shadow DOM nodes are manageable, but custom HTML elements were an interesting challenge.

Mapping custom HTML tag interactivity

The only way to identify a custom tag is by its name, which has to contain a dash character (-). CSS selectors don’t allow wildcards in tag names, so finding custom tags is only possible by checking the name of every element on the page for a dash.

Not all custom tags are interactive, though, so I decided to develop an exploratory phase in which the web application crawler tries to interact with the custom tags it finds. It keeps track of the tags that resulted in the page getting updated when interacted with, and of the tags that did nothing. When the exploratory phase ends, the crawler continues using the custom tags from the interactive bucket and forgets about the rest.
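Here’s a rough sketch of that exploratory phase. Comparing the serialized page HTML before and after a click is a stand-in for the real change detection, and the timeout is arbitrary:

```typescript
// Sketch of the exploratory phase for custom elements: find every tag name that
// contains a dash, try clicking one instance of each, and keep only the tags
// that caused the page to change.
import { Page } from 'playwright';

async function discoverInteractiveCustomTags(page: Page): Promise<Set<string>> {
  // CSS can't express "tag name contains a dash", so enumerate elements instead.
  const customTags: string[] = await page.evaluate(() =>
    Array.from(
      new Set(
        Array.from(document.querySelectorAll('*'))
          .map((el) => el.tagName.toLowerCase())
          .filter((tag) => tag.includes('-'))
      )
    )
  );

  const interactive = new Set<string>();
  for (const tag of customTags) {
    const before = await page.content();
    await page.locator(tag).first().click({ timeout: 2000 }).catch(() => {});
    const after = await page.content();
    if (before !== after) interactive.add(tag); // this tag did something
  }
  return interactive;
}
```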

Challenge 5: Generating selectors for interactive elements

I ended up with a prioritized list of usable attributes (like data-test-id) and selector generation options, with an ugly final fallback to a CSS selector that maps the whole path through the tag hierarchy from the root of the page.

At one point, I suddenly realized it was naive to think that element IDs could be used as stable selectors. Alas, some frontend frameworks generate random element IDs that are different each time the page is loaded, which means IDs can’t be trusted. (At least not all of them.)

One useful tip: XPaths can give you something CSS selectors can’t, namely the ability to search by text content. I decided to use XPath selectors when an element had text content, since a combination of an element’s tag name and text content can often make a good selector. Having XPaths in the toolbox has proved useful.
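Put together, the selector-generation priority can be sketched like this. The attribute name, the text-escaping shortcut, and the ElementInfo shape are simplified assumptions:

```typescript
// Sketch of the selector priority: a test-id attribute if present, then an
// XPath built from the tag name and text content, then the ugly full CSS path.
interface ElementInfo {
  tagName: string;  // e.g. "button"
  testId?: string;  // value of data-test-id, if present
  text?: string;    // trimmed text content, if any
  cssPath: string;  // precomputed root-to-element CSS path (the fallback)
}

function buildSelector(el: ElementInfo): string {
  if (el.testId) {
    return `[data-test-id="${el.testId}"]`;
  }
  // XPath can match on text content, which CSS selectors can't.
  if (el.text && !el.text.includes('"')) {
    return `//${el.tagName}[normalize-space(.)="${el.text}"]`;
  }
  // Element IDs are skipped on purpose: some frameworks randomize them per load.
  return el.cssPath;
}
```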

Final thoughts: Expect the unexpected

It’s impossible to be prepared for all the different ways a web application can be built. Building a web application crawler like Rainforest’s (or a similar project) requires testing against a wide variety of real-world web apps. The best you can do is learn as you go, observe and adjust, throw away assumptions one by one, and embrace the reality that it will never be perfect.

I recognized from the very start that we couldn’t build a perfect crawler — no one can (at least at this point in time). And we’re okay with that. As they say, “don’t let perfect be the enemy of good.”

Our crawler works alongside LLMs that are also imperfect (by nature). The resulting test plans aren’t perfect, but they provide a really solid start for users who need to generate an initial test plan or extend coverage.

Most importantly, customers have shared how helpful the feature our web application crawler powers (AI Test Planner) has been for their QA workflows. That makes solving these challenges well worth it to us.