Agents have a problem: they fail. A lot.
Why failure matters
Agent performance is the current name of the game. To understand why, consider the following toy model for agent value. Normally, we can calculate the net value of a given query $i$ with the following equation:

$$V_i = v_i - c_i$$

where $v_i$ is the value of completing query $i$ and $c_i$ is its cost. However, agents behave a bit differently. We can break the cost out into two components: the cost of the query itself, denoted as $c_i$, and the expected cost of the failure state of query $i$, denoted as $\mathbb{E}[f_i]$:

$$V_i = v_i - c_i - \mathbb{E}[f_i]$$

The cost of a query here is not an expectation, since both people and LLM APIs charge set rates; the cost of failure, by contrast, depends on which failure state the agent ends up in.

This distinction is important because the set of possible failure states is essentially unbounded. To illustrate, consider the case where you want an agent to order groceries. For simplicity’s sake, let’s say you would be willing to pay ten dollars for someone to do this task, $v_i = \$10.00$, and the cost of employing an agent in this case is ten cents, $c_i = \$0.10$.

The value our agent can deliver is 100x its cost. Pretty good, right? Not so fast. We still need to consider $\mathbb{E}[f_i]$. Consider the following end states our agent might find itself in:
- The agent successfully orders your groceries;
- The agent fails to order groceries;
- The agent buys the wrong items; or
- The agent puts the wrong address in the delivery information.
In scenario one, the value is $10.00. In scenario two, the value is -$0.10, since no transaction actually happens. Scenarios three and four are where things really start to get bad. Not only did the agent fail to complete the task, but now you have to clean up the mess it made. If canceling takes the same amount of time as ordering, we’re looking at an expected cost of failure of -$10.10. If the agent seriously bungles the order, the cost can balloon well beyond that.
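To make the arithmetic concrete, here is a minimal sketch of the toy model in Python. The scenario probabilities are made up for illustration; only the $10.00 task value and $0.10 query cost come from the example above.

```python
# Toy model of agent value: V_i = v_i - c_i - E[f_i]
# The probabilities below are hypothetical, not measured.

TASK_VALUE = 10.00   # v_i: what you'd pay a human to do the task
QUERY_COST = 0.10    # c_i: flat cost of running the agent (not an expectation)

# (probability, net outcome before subtracting the query cost)
scenarios = {
    "success":       (0.80, TASK_VALUE),   # groceries ordered
    "no_op_failure": (0.10, 0.00),         # nothing happened at all
    "wrong_items":   (0.05, -10.00),       # your time spent cleaning up
    "wrong_address": (0.05, -10.00),
}

expected_value = sum(p * outcome for p, outcome in scenarios.values()) - QUERY_COST
print(f"Expected net value per query: ${expected_value:.2f}")
```

Note how the two cleanup scenarios, despite being only 10% of outcomes combined, drag the expected value down by a full dollar.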
Benchmark environments for web agents
Generating benchmarks from live sites is a pain. You can get hit with captchas, get rate limited, and have to explain to your credit card company that those purchases weren’t technically fraud since the AI was acting on your behalf but — nonetheless — you don’t actually want 32 handbags from Nordstrom. (Not that we have any experience with that.)
But the biggest problem is that sites change. They can change for normal reasons like old products disappearing, new user reviews getting written, or a CRM updating the sales total for each new order. However, the tricky thing about trying to benchmark a web agent is that the agent itself is often asked to change the state of a website. As a result, benchmarking on actual websites is out of the question.
That’s why we’re using WebArena. WebArena comprises eight (we’re using six) realistic websites that are either open-source clones or designed to simulate a real website. They are fully functional in the sense that you can browse products, make posts, look up directions, etc. But they are neutered so you’ll never be charged for testing a checkout flow or get yelled at because your agent deleted all the draft blog posts 😅.
Arenas divorce actions taken on the web from real world outcomes. This means that any actions carried out within an arena, such as browsing products, making posts, or interacting with website features, have no actual impact beyond the arena environment. Developers can freely experiment and test their agents without concerns about unintended consequences or affecting real users.
But the most significant advantage of arenas is that they have a reset button. Resetting allows the arena's state, including any modifications made by the agent, to be effortlessly reverted. This aspect is especially critical when evaluating web agents since these agents frequently need to alter the website's state during their tasks. By being able to reset the state, each benchmarking attempt begins from a pristine and consistent state when needed, enabling precise comparisons and evaluations.
Sample WebArena tasks
- What is the top-1 best-selling product in 2022
- List out reviewers, if exist, who mention about ear cups being small
- Find the page of the place where Mr. Rogers was filmed on the map.
- Find the page of the university that has most Turing Award winners on the map.
- Assign the issue regarding 404 in a11yproject to myself.
The WebArena benchmark dataset contains 812 tasks. These tasks range from simple information retrieval, such as listing negative reviewers, to complex manipulation tasks such as querying a content management system.
As you might imagine from such a range of tasks, success rates vary wildly. For some tasks, agents succeed almost 100% of the time. For others, the failure rate is 100%. From our experience working on these benchmarks over the past few weeks, agent failure rate is directly correlated with the horizon length of the task. In plain English: the more steps a task takes, the higher the failure rate.
However, our experience also shows that it is possible to bring error rates down. We think of long horizon tasks as being composed of a series of primitive tasks. In order to buy your groceries, an agent needs to navigate to the page, use the search bar, add items to cart, etc. Therefore, the first step to agents that work is getting really good at the basics.
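The link between horizon length and failure rate is easy to see with a back-of-the-envelope model: if a long-horizon task is a chain of n primitive steps that each succeed independently with probability p, the whole task succeeds with probability p^n. The per-step rates below are illustrative assumptions, not numbers from our benchmarks.

```python
# Illustrative only: per-step success rates are assumptions,
# not measurements from our benchmark runs.
def task_success_rate(per_step_success: float, num_steps: int) -> float:
    """Success rate of a task that chains num_steps independent primitive steps."""
    return per_step_success ** num_steps

# Even a 95%-reliable primitive decays quickly over long horizons.
for steps in (1, 5, 10, 20):
    print(f"{steps:>2} steps: {task_success_rate(0.95, steps):.1%}")
```

This is why getting the primitives near-perfect matters: even at 99% per step, a 20-step task still fails about 18% of the time.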
That’s why we’re releasing our benchmark status page. We want builders to have a clear understanding of what it is possible to do with agents while giving them a place to test things out (audit us) that limits the blast radius when things go wrong. We’re still in private beta, but we encourage you to check back here for updates and sign up for the waitlist.