Agents have a problem: they fail. A lot.
## Why failure matters
Agent performance is the current name of the game. To understand why, consider the following toy model for agent value. Normally, we can calculate the net value of a given query with the following equation:

$$V_i = v_i - c_i$$

where $v_i$ is the value of a successful query $i$ and $c_i$ is its cost. Agents, however, behave a bit differently. We can break the cost out into two components: the cost of the query itself, denoted $c_i$, and the expected cost of the failure state of query $i$, denoted $\mathbb{E}[f_i]$:

$$V_i = v_i - c_i - \mathbb{E}[f_i]$$
The cost of a query here is not an expected value, since both people and LLM APIs charge set rates.
This distinction is important because the set of possible failure states is essentially unbounded. To illustrate, consider the case where you want an agent to order groceries. For simplicity’s sake, let’s say you would be willing to pay ten dollars for someone to do this task, $v_i = \$10.00$, and the cost of employing an agent in this case is ten cents, $c_i = \$0.10$.
The value our agent can deliver is 100x its cost. Pretty good, right? Not so fast. We still need to consider $\mathbb{E}[f_i]$. Consider the following end states our agent might find itself in:
- The agent successfully orders your groceries;
- The agent fails to order your groceries;
- The agent buys the wrong items; or
- The agent puts the wrong address in the delivery information.
In scenario one, the value is $10.00. In scenario two, the value is -$0.10, since no transaction actually happens but you still paid for the query. Scenarios three and four are where things really start to get bad. Not only did the agent fail to complete the task, but now you have to clean up the mess it made. If cancelling the order takes as much of your time as placing it yourself would have, we’re looking at a value of -$10.10. And if the agent seriously bungles the order, the cost can balloon well beyond that.
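To make the arithmetic concrete, here is a minimal sketch of the toy model in Python. The end-state probabilities are made-up numbers for illustration; only the $10.00 task value and $0.10 query cost come from the example above.

```python
# Toy model: net value of an agent query, accounting for failure states.
# Probabilities below are illustrative assumptions, not measurements.

task_value = 10.00   # what you'd pay a person to order the groceries
query_cost = 0.10    # what the agent query costs

# Hypothetical end states: (name, probability, value delivered, cleanup cost)
end_states = [
    ("orders correctly",  0.70, 10.00,  0.00),
    ("fails to order",    0.10,  0.00,  0.00),
    ("buys wrong items",  0.10,  0.00, 10.00),
    ("wrong address",     0.10,  0.00, 10.00),
]

expected_net_value = sum(
    p * (value - query_cost - cleanup)
    for _, p, value, cleanup in end_states
)

for name, p, value, cleanup in end_states:
    print(f"{name:20s} p={p:.2f} net={value - query_cost - cleanup:+.2f}")
print(f"expected net value: {expected_net_value:+.2f}")
```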
## Benchmark environments for web agents
Generating benchmarks from live sites is a pain. You can get hit with CAPTCHAs, get rate limited, and have to explain to your credit card company that those purchases weren’t technically fraud, since the AI was acting on your behalf, but that you nonetheless don’t actually want 32 handbags from Nordstrom. (Not that we have any experience with that.)
But the biggest problem is that sites change. They can change for normal reasons: old products disappear, new user reviews get written, a CRM updates the sales total with each new order. The trickier issue with benchmarking a web agent, though, is that the agent itself is often asked to change the state of the website, so no two runs start from the same place. As a result, benchmarking on actual websites is out of the question.
That’s why we’re using WebArena. WebArena consists of eight realistic websites (we’re using six) that are either open-source clones or designed to simulate a real website. They are fully functional in the sense that you can browse products, make posts, look up directions, etc. But they are neutered, so you’ll never be charged for testing a checkout flow or get yelled at because your agent deleted all the draft blog posts 😅.
Arenas divorce actions taken on the web from real-world outcomes. Any action carried out within an arena, such as browsing products, making posts, or interacting with website features, has no impact beyond the arena environment. Developers can freely experiment and test their agents without worrying about unintended consequences or affecting real users.
But the most significant advantage of arenas is that they have a reset button. Resetting reverts the arena’s state, including any modifications made by the agent. This is especially critical when evaluating web agents, since these agents frequently need to alter the website’s state during their tasks. Because the state can be reset, each benchmarking attempt begins from a pristine, consistent starting point, enabling precise comparisons and evaluations.
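In practice, that reset-then-run loop might look something like the sketch below. `reset_arena` and `run_agent` are hypothetical stand-ins for whatever your harness uses to restore the site snapshot and drive the agent; the point is simply that every attempt starts from the same state.

```python
# A minimal benchmarking loop: reset the arena to a known snapshot before
# every attempt so each run starts from the same state.
# reset_arena() and run_agent() are hypothetical placeholders.

def reset_arena(arena_url: str) -> None:
    """Restore the arena's backing site to its pristine snapshot (implementation-specific)."""
    ...

def run_agent(arena_url: str, objective: str) -> bool:
    """Drive the agent against the arena and report whether it met the objective."""
    ...

def benchmark(arena_url: str, tasks: list[str], attempts: int = 3) -> float:
    successes = 0
    for objective in tasks:
        for _ in range(attempts):
            reset_arena(arena_url)            # undo anything the last run changed
            if run_agent(arena_url, objective):
                successes += 1
    return successes / (len(tasks) * attempts)
```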
## Sample WebArena tasks
| Arena | Task ID | Objective |
|---|---|---|
| http://ec2-3-131-244-37.us-east-2.compute.amazonaws.com:7780/admin | https://api.hdr.is/benchmarks/task/000 | What is the top-1 best-selling product in 2022 |
| http://ec2-3-131-244-37.us-east-2.compute.amazonaws.com:9999/forums/all | https://api.hdr.is/benchmarks/task/21 | List out reviewers, if exist, who mention about ear cups being small |
| http://ec2-3-131-244-37.us-east-2.compute.amazonaws.com:3000/ | https://api.hdr.is/benchmarks/task/424 | Find the page of the place where Mr. Rogers was filmed on the map. |
| http://ec2-3-131-244-37.us-east-2.compute.amazonaws.com:8888/wikipedia_en_all_maxi_2022-05/A/User:The_other_Kiwix_guy/Landing | https://api.hdr.is/benchmarks/task/427 | Find the page of the university that has most Turning Award winners on the map. |
| http://ec2-3-131-244-37.us-east-2.compute.amazonaws.com:8023/explore | | Assign the issue regarding 404 in a11yproject to myself. |
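The task IDs in the table are plain HTTP endpoints, so a sketch like the following could pull a task definition down for inspection. We’re assuming the endpoint returns JSON; check the actual response shape before building on it.

```python
# Fetch one of the benchmark task definitions listed above.
# Assumption: the endpoint returns JSON describing the task (arena URL,
# objective, etc.); inspect the real response before relying on a schema.
import requests

resp = requests.get("https://api.hdr.is/benchmarks/task/000", timeout=10)
resp.raise_for_status()
print(resp.json())
```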