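This article traces one complete line of thinking, from traditional browser automation testing all the way to AI-agent-driven browser operation. The key terms are Playwright, Rust, playwright-rs, agent-browser, the Accessibility Tree, and LLMs. If you are working out how to make browser automation more stable and more intelligent, or how to connect an LLM to real web page operations, this summary can serve as a systematic reference.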

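1. Three routes to browser automation

Before looking at specific tools, it helps to separate three common routes:

Official Playwright (Node.js / Python / Java / .NET)
The strongest engineering capabilities and the most mature ecosystem; suited to serious E2E testing.
Rust + Playwright
Calls Playwright through Rust bindings (such as playwright-rs), so browser automation lives inside a Rust project.
Rust WebDriver / Selenium ecosystem
Strongly standardized, but with limited support for modern browser capabilities (auto-waiting, accessibility, trace).

This article focuses on the second route (Rust + Playwright) and on the agent-browser + LLM approach that grows out of it.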

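2. playwright-rs: writing browser automation tests in Rust

playwright-rs currently looks like the most promising Rust binding for Playwright:

Built on the official Playwright Server (Node.js)
The Rust side talks to it over JSON-RPC
Supports Chromium / Firefox / WebKit
Supports Linux / macOS / Windows
Already runs cross-platform on GitHub Actions

A minimal Rust test case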

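// Assumes the playwright-rs crate (imported as playwright_rs) and tokio for the async test runtime.
use playwright_rs::Playwright;

#[tokio::test]
async fn smoke_test() -> Result<(), Box<dyn std::error::Error>> {
    // Start the Playwright server and launch a Chromium instance.
    let pw = Playwright::launch().await?;
    let browser = pw.chromium().launch().await?;
    let page = browser.new_page().await?;

    page.goto("https://example.com", None).await?;

    // Locate by accessibility role rather than by CSS selector.
    let heading = page.locator("role=heading").await;
    assert!(heading.is_visible().await?);

    browser.close().await?;
    Ok(())
}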

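The key point: prefer locators based on accessibility semantics, such as role and name, over CSS selectors.

3. What is the Accessibility Tree, and why is it more stable?

It is not the DOM

The Accessibility Tree is a semantic tree that the browser maintains internally for screen readers and other assistive technologies:
- Its nodes are button / textbox / heading / link
- rather than div / span
- Each node carries a role, an accessible name, and state (disabled / checked / expanded)
The browser computes it automatically from the DOM, CSS, and ARIA.

Why does it suit automation and AI?
- It is closer to the page structure as a human understands it
- It is insensitive to class names, layout, and nesting
- It naturally answers questions like "what is this button called?" and "what is this input for?"

Playwright natively supports locating elements by accessibility semantics, which is one of its clear advantages over Selenium.

ARIA (Accessible Rich Internet Applications) is a specification for adding semantics, names, and state to web elements;
the browser uses it when computing the Accessibility Tree, which assistive technologies, testing tools, and AI then consume.
ARIA is not the "switch" for the Accessibility Tree but a "calibrator": without ARIA the browser still computes a tree, only its accuracy may differ.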
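
To make the role / name / state idea concrete, here is a minimal sketch of what a node in such a tree might look like and how a lookup by role and accessible name works. The `A11yNode` type, its fields, and the sample page are hypothetical simplifications for illustration; real browsers expose far richer structures.

```rust
/// Hypothetical, simplified model of one accessibility-tree node.
#[derive(Debug, Clone)]
struct A11yNode {
    role: String,     // e.g. "button", "textbox", "heading"
    name: String,     // accessible name, e.g. "Continue with email"
    disabled: bool,   // one example of node state
    children: Vec<A11yNode>,
}

impl A11yNode {
    fn new(role: &str, name: &str) -> Self {
        A11yNode {
            role: role.to_string(),
            name: name.to_string(),
            disabled: false,
            children: Vec::new(),
        }
    }

    fn with_children(mut self, children: Vec<A11yNode>) -> Self {
        self.children = children;
        self
    }

    /// Depth-first search for the first node with the given role and accessible name.
    fn find(&self, role: &str, name: &str) -> Option<&A11yNode> {
        if self.role == role && self.name == name {
            return Some(self);
        }
        self.children.iter().find_map(|child| child.find(role, name))
    }
}

/// A made-up login page, loosely modeled on a typical sign-in form.
fn login_page() -> A11yNode {
    A11yNode::new("document", "").with_children(vec![
        A11yNode::new("heading", "Sign in to Miro"),
        A11yNode::new("main", "").with_children(vec![
            A11yNode::new("textbox", "Email"),
            A11yNode::new("button", "Continue with email"),
        ]),
    ])
}

fn main() {
    let page = login_page();
    // The lookup ignores nesting depth and anything like CSS classes.
    match page.find("button", "Continue with email") {
        Some(node) => println!(
            "found {} \"{}\" (disabled: {})",
            node.role, node.name, node.disabled
        ),
        None => println!("not found"),
    }
}
```

Because the lookup depends only on role and name, wrapping the button in extra containers or renaming its classes would not break it, which is the stability argument made above.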

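4. Why is it hard for Selenium to support the Accessibility Tree?

Conclusion first:
Selenium is very unlikely to fully support the accessibility tree in the short term.

The reasons include:

The WebDriver standard is built mainly around the DOM (it can report a single element's computed role and label, but it does not expose the tree)
The Accessibility Tree belongs to the browser's internal semantic model
AOM (Accessibility Object Model) is still at the experimental / discussion stage
Accessibility implementations differ widely across browsers and operating systems

As a result, Selenium is usually limited to:

Running accessibility checks with tools such as axe-core
rather than reading and operating on the accessibility tree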

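5. agent-browser: a browser control layer designed for AI agents

What agent-browser is
- Built by Vercel Labs
- A tool with a Rust CLI on top of a Playwright backend
- Not a testing framework, but a CLI implementation of a browser operation protocol

Its core innovation is:

An interaction model based on snapshot + ref-id

How snapshot works
1. Read the page's accessibility tree from Playwright
2. Prune it semantically (drop irrelevant nodes)
3. Assign each node a reference ID (such as @e2) that is stable within that snapshot
4. Output a description of the page's current operable state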
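
The four steps above can be sketched roughly as follows. This is not agent-browser's actual implementation: the `Node` type, the `KEPT_ROLES` list, and the pruning rule are assumptions chosen only to show the prune-then-number idea.

```rust
/// A flattened accessibility node (hypothetical shape): role plus accessible name.
struct Node {
    role: &'static str,
    name: &'static str,
}

/// Roles that survive pruning. This list is an assumption for illustration.
const KEPT_ROLES: [&str; 5] = ["heading", "textbox", "button", "link", "checkbox"];

/// Prune uninteresting nodes, then number the survivors in document order
/// as @e1, @e2, ... and render one line per node.
fn snapshot(nodes: &[Node]) -> Vec<String> {
    nodes
        .iter()
        .filter(|n| KEPT_ROLES.contains(&n.role) && !n.name.is_empty())
        .enumerate()
        .map(|(i, n)| format!("@e{}  {} \"{}\"", i + 1, n.role, n.name))
        .collect()
}

fn main() {
    // Step 1 stand-in: a tree already read from the browser and flattened.
    let tree = [
        Node { role: "generic", name: "" },
        Node { role: "heading", name: "Log in to Miro" },
        Node { role: "generic", name: "" },
        Node { role: "textbox", name: "Email" },
        Node { role: "button", name: "Continue" },
    ];
    for line in snapshot(&tree) {
        println!("{}", line);
    }
}
```

Numbering happens after pruning, so the IDs depend on what the page contains at that moment. That is why a ref is only trustworthy for the snapshot that produced it.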

Example:

@e1  heading "Log in to Miro"
@e2  textbox "Email"
@e3  button "Continue"

Operating the page with ref-ids

agent-browser fill @e2 "test@example.com"
agent-browser click @e3

No CSS selectors are needed at all.

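Before starting, reload your shell configuration so that PATH changes are picked up:

source ~/.bashrc

Step 1: Confirm your environment prerequisites

agent-browser has three practical prerequisites, and all three are required.

First, macOS itself is fine, on either Intel or Apple Silicon. Second, you need Node.js (for Playwright). Third, you need Rust (for the agent-browser CLI).

You can check quickly: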

node -v
npm -v
rustc --version
cargo --version

v20.19.4
10.8.2
rustc 1.58.1 (db9d1b20b 2022-01-20)
cargo 1.58.0 (f01b232bc 2022-01-19)

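If Node is not installed, the simplest option is:

brew install node

The source build in Step 3 uses pnpm, so set up its global directory as well:

pnpm setup
pnpm config get global-dir
mkdir -p ~/.pnpm-global
pnpm config set global-dir ~/.pnpm-global
source ~/.zshrc

If Rust is not installed:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

rustup update stable

Or, for a non-interactive minimal install:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y --profile minimal

rustup update stable

If rustc reports an old release, as in the output above, rustup update stable brings it up to date. After installing, open a new shell so that cargo is on your PATH.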

Step 2: Install Playwright (the step most often overlooked)

agent-browser does not drive the browser directly; underneath, it still depends on Playwright's browser binaries.

Run this once in a clean directory:

npx playwright install

Or:

npm install -g playwright@latest

playwright --version

Version 1.58.2

playwright install

This downloads Chromium / Firefox / WebKit. The first time a browser runs on macOS, the system may show a "Security & Privacy" prompt; allow it.

If you skip this step, agent-browser will later start but fail to open pages.

Step 3: Install the agent-browser CLI

Now for agent-browser itself.

The official repository is:

https://github.com/vercel-labs/agent-browser

The most direct and reliable way is to install from source:

git clone https://github.com/vercel-labs/agent-browser.git
cd agent-browser
pnpm install
pnpm build
pnpm build:native   # Requires Rust (https://rustup.rs)
pnpm link --global  # Makes agent-browser available globally
agent-browser install

After a successful install, you should see:

agent-browser --help

agent-browser - fast browser automation CLI for AI agents

Usage: agent-browser <command> [args] [options]

Core Commands:
  open <url>                 Navigate to URL
  click <sel>                Click element (or @ref)
  dblclick <sel>             Double-click element
  type <sel> <text>          Type into element
  fill <sel> <text>          Clear and fill
  press <key>                Press key (Enter, Tab, Control+a)
  hover <sel>                Hover element
  focus <sel>                Focus element
  check <sel>                Check checkbox
  uncheck <sel>              Uncheck checkbox
  select <sel> <val...>      Select dropdown option
  drag <src> <dst>           Drag and drop
  upload <sel> <files...>    Upload files
  download <sel> <path>      Download file by clicking element
  scroll <dir> [px]          Scroll (up/down/left/right)
  scrollintoview <sel>       Scroll element into view
  wait <sel|ms>              Wait for element or time
  screenshot [path]          Take screenshot
  pdf <path>                 Save as PDF
  snapshot                   Accessibility tree with refs (for AI)
  eval <js>                  Run JavaScript
  connect <port|url>         Connect to browser via CDP
  close                      Close browser

Navigation:
  back                       Go back
  forward                    Go forward
  reload                     Reload page

Get Info:  agent-browser get <what> [selector]
  text, html, value, attr <name>, title, url, count, box, styles

Check State:  agent-browser is <what> <selector>
  visible, enabled, checked

Find Elements:  agent-browser find <locator> <value> <action> [text]
  role, text, label, placeholder, alt, title, testid, first, last, nth

Mouse:  agent-browser mouse <action> [args]
  move <x> <y>, down [btn], up [btn], wheel <dy> [dx]

Browser Settings:  agent-browser set <setting> [value]
  viewport <w> <h>, device <name>, geo <lat> <lng>
  offline [on|off], headers <json>, credentials <user> <pass>
  media [dark|light] [reduced-motion]

Network:  agent-browser network <action>
  route <url> [--abort|--body <json>]
  unroute [url]
  requests [--clear] [--filter <pattern>]

Storage:
  cookies [get|set|clear]    Manage cookies (set supports --url, --domain, --path, --httpOnly, --secure, --sameSite, --expires)
  storage <local|session>    Manage web storage

Tabs:
  tab [new|list|close|<n>]   Manage tabs

Debug:
  trace start|stop [path]    Record trace
  record start <path> [url]  Start video recording (WebM)
  record stop                Stop and save video
  console [--clear]          View console logs
  errors [--clear]           View page errors
  highlight <sel>            Highlight element

Sessions:
  session                    Show current session name
  session list               List active sessions

Setup:
  install                    Install browser binaries
  install --with-deps        Also install system dependencies (Linux)

Snapshot Options:
  -i, --interactive          Only interactive elements
  -c, --compact              Remove empty structural elements
  -d, --depth <n>            Limit tree depth
  -s, --selector <sel>       Scope to CSS selector

Options:
  --session <name>           Isolated session (or AGENT_BROWSER_SESSION env)
  --profile <path>           Persistent browser profile (or AGENT_BROWSER_PROFILE env)
  --state <path>             Load storage state from JSON file (or AGENT_BROWSER_STATE env)
  --headers <json>           HTTP headers scoped to URL's origin (for auth)
  --executable-path <path>   Custom browser executable (or AGENT_BROWSER_EXECUTABLE_PATH)
  --extension <path>         Load browser extensions (repeatable)
  --args <args>              Browser launch args, comma or newline separated (or AGENT_BROWSER_ARGS)
                             e.g., --args "--no-sandbox,--disable-blink-features=AutomationControlled"
  --user-agent <ua>          Custom User-Agent (or AGENT_BROWSER_USER_AGENT)
  --proxy <server>           Proxy server URL (or AGENT_BROWSER_PROXY)
                             e.g., --proxy "http://user:pass@127.0.0.1:7890"
  --proxy-bypass <hosts>     Bypass proxy for these hosts (or AGENT_BROWSER_PROXY_BYPASS)
                             e.g., --proxy-bypass "localhost,*.internal.com"
  --ignore-https-errors      Ignore HTTPS certificate errors
  --allow-file-access        Allow file:// URLs to access local files (Chromium only)
  -p, --provider <name>      Browser provider: ios, browserbase, kernel, browseruse
  --device <name>            iOS device name (e.g., "iPhone 15 Pro")
  --json                     JSON output
  --full, -f                 Full page screenshot
  --headed                   Show browser window (not headless)
  --cdp <port>               Connect via CDP (Chrome DevTools Protocol)
  --debug                    Debug output
  --version, -V              Show version

Environment:
  AGENT_BROWSER_SESSION          Session name (default: "default")
  AGENT_BROWSER_EXECUTABLE_PATH  Custom browser executable path
  AGENT_BROWSER_PROVIDER         Browser provider (ios, browserbase, kernel, browseruse)
  AGENT_BROWSER_STREAM_PORT      Enable WebSocket streaming on port (e.g., 9223)
  AGENT_BROWSER_IOS_DEVICE       Default iOS device name
  AGENT_BROWSER_IOS_UDID         Default iOS device UDID

Examples:
  agent-browser open example.com
  agent-browser snapshot -i              # Interactive elements only
  agent-browser click @e2                # Click by ref from snapshot
  agent-browser fill @e3 "test@example.com"
  agent-browser find role button click --name Submit
  agent-browser get text @e1
  agent-browser screenshot --full
  agent-browser --cdp 9222 snapshot      # Connect via CDP port
  agent-browser --profile ~/.myapp open example.com  # Persistent profile

iOS Simulator (requires Xcode and Appium):
  agent-browser -p ios open example.com                    # Use default iPhone
  agent-browser -p ios --device "iPhone 15 Pro" open url   # Specific device
  agent-browser -p ios device list                         # List simulators
  agent-browser -p ios swipe up                            # Swipe gesture
  agent-browser -p ios tap @e1                             # Touch element

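If you get "command not found", check that the directory the CLI was installed into is on your PATH: pnpm's global bin directory for the source install above, or $HOME/.cargo/bin if the binary was installed through cargo.

Step 4: Run a minimal viable test (without an LLM)

Do not connect an LLM right away. First verify three things:

Playwright can launch a browser
agent-browser can read the accessibility tree
snapshot produces output normally

In order: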

agent-browser open https://miro.com/login/ --headed

✓ Sign in | Miro | The Visual Workspace for Innovation
  https://miro.com/login/

You should see a browser window open.

Next:

agent-browser snapshot

- document:
  - button "Skip to main content" [ref=e1]
  - banner:
    - link "Go to Miro main page" [ref=e2]:
      - /url: /index/
      - text: Miro Logo
    - button "Language" [ref=e3]: en
    - text: "Current language is: English"
    - link "Sign up" [ref=e4]:
      - /url: https://miro.com/signup/
  - main:
    - heading "Sign in to Miro" [ref=e5] [level=1]
    - group "Sign-in options":
      - button "SSO" [ref=e6]
      - button "Google" [ref=e7]
      - button "Microsoft 365" [ref=e8]
      - button "Show more sign-in options" [ref=e9]
    - text: Email
    - textbox "Email" [ref=e10]:
      - /placeholder: Enter your email address
    - text: Password
    - textbox "Password" [ref=e11]:
      - /placeholder: Enter your password
    - link "Forgot password?" [ref=e12]:
      - /url: https://miro.com/recover/
    - checkbox "Remember me" [ref=e13] [checked]
    - paragraph:
      - paragraph: Remember me
    - button "Continue with email" [ref=e14]
    - button "Sign in to a different region or a private organization" [ref=e15]
    - paragraph:
      - text: Trouble signing in?
      - link "Request a magic sign in link" [ref=e16]:
        - /url: /login/passwordless/
      - text: or
      - link "reset your password" [ref=e17]:
        - /url: /recover/
      - text: .
  - alert
  - iframe

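On a much simpler page such as example.com, the snapshot reduces to something like this (shown in a simplified @ref notation):

@e1 heading "Example Domain"
@e2 paragraph "This domain is for use in illustrative examples..."
@e3 link "More information..."

As long as this step succeeds, your macOS + Playwright + agent-browser environment is healthy.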
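
A controller will eventually need to turn snapshot text into refs it can act on. The sketch below parses lines of the form `- role "name" [ref=eN]`, as seen in the Miro snapshot output above, and looks up a ref by role and name. The `Element` type and the `parse_line` / `find_ref` helpers are hypothetical, and the parser deliberately ignores edge cases such as escaped quotes inside names.

```rust
#[derive(Debug, PartialEq)]
struct Element {
    role: String,
    name: String,
    reference: String, // e.g. "e10"
}

/// Parse one snapshot line of the form `- role "name" [ref=eN] ...`.
/// Returns None for lines without a quoted name or a ref (for example `- main:`).
fn parse_line(line: &str) -> Option<Element> {
    let rest = line.trim_start().strip_prefix("- ")?;
    let (role, after_role) = rest.split_once(" \"")?;
    let (name, after_name) = after_role.split_once('"')?;
    let (_, after_ref) = after_name.split_once("[ref=")?;
    let (reference, _) = after_ref.split_once(']')?;
    Some(Element {
        role: role.to_string(),
        name: name.to_string(),
        reference: reference.to_string(),
    })
}

/// Find the ref (as "@eN") of the first element with the given role and name.
fn find_ref(snapshot: &str, role: &str, name: &str) -> Option<String> {
    snapshot
        .lines()
        .filter_map(parse_line)
        .find(|e| e.role == role && e.name == name)
        .map(|e| format!("@{}", e.reference))
}

fn main() {
    // A shortened excerpt in the same format as the real snapshot output.
    let snapshot = r#"- document:
  - main:
    - heading "Sign in to Miro" [ref=e5] [level=1]
    - textbox "Email" [ref=e10]:
      - /placeholder: Enter your email address
    - button "Continue with email" [ref=e14]"#;

    if let Some(r) = find_ref(snapshot, "textbox", "Email") {
        println!("agent-browser fill {} \"test@example.com\"", r);
    }
}
```

Resolving refs by role and name on every fresh snapshot avoids hard-coding numbers like @e2, which shift whenever the page content changes.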

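Step 5: Operate the page (verifying the ref-id model)

Continue on the page you have open, using a ref taken from your own latest snapshot (the refs below come from the author's session):

agent-browser click @e22
agent-browser snapshot

You will see the page change and every ref-id reassigned. This step matters a great deal, because it checks your understanding of agent-browser's core principle:

A ref-id is only meaningful within the current snapshot

agent-browser click @e4

agent-browser fill @e6 "andy@dtype.info"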
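
The rule that a ref-id is only meaningful within the current snapshot can be enforced mechanically in a controller. Below is a minimal sketch of such a guard. `RefGuard` and its methods are hypothetical helpers, not part of agent-browser.

```rust
use std::collections::HashSet;

/// Tracks which refs came from the most recent snapshot.
/// Hypothetical controller-side helper, not an agent-browser API.
struct RefGuard {
    valid: HashSet<String>,
}

impl RefGuard {
    fn new() -> Self {
        RefGuard { valid: HashSet::new() }
    }

    /// Call after every snapshot: replaces the set of usable refs.
    fn on_snapshot(&mut self, refs: &[&str]) {
        self.valid = refs.iter().map(|r| r.to_string()).collect();
    }

    /// Call after any command that may change the page: all refs become stale.
    fn on_page_change(&mut self) {
        self.valid.clear();
    }

    /// Ok if the ref was seen in the latest snapshot, Err otherwise.
    fn check(&self, reference: &str) -> Result<(), String> {
        if self.valid.contains(reference) {
            Ok(())
        } else {
            Err(format!(
                "{} is stale or unknown; take a new snapshot first",
                reference
            ))
        }
    }
}

fn main() {
    let mut guard = RefGuard::new();
    guard.on_snapshot(&["@e1", "@e2", "@e3"]);
    println!("{:?}", guard.check("@e2")); // usable right after the snapshot

    guard.on_page_change(); // e.g. after a click that navigates
    println!("{:?}", guard.check("@e2")); // rejected until the next snapshot
}
```

Treating every click or fill as a potential page change is conservative, but it turns the most common mistake (acting on old refs) into an explicit error instead of a silent misclick.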

Step 6: Test a real, complex page (recommended)

Do not jump straight to login pages such as Miro or Google. Start with a page that is complex but has no anti-bot measures, for example:

agent-browser open https://news.ycombinator.com
agent-browser snapshot

You will see a long list like:

@e1 link "Hacker News"
@e2 link "new"
@e3 link "past"
...

Then try:

agent-browser click @e2
agent-browser snapshot

If this is stable, accessibility tree + ref-id works on real pages.

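Step 7: Some very common pitfalls on macOS

These are the ones I strongly suggest watching for in advance:

Chinese input method interference: when filling a textbox, use an English input method where possible; otherwise some pages receive odd key events.
System permissions: the first time a Playwright-controlled browser runs, macOS may block automation control. Go to System Settings → Privacy & Security → Accessibility / Automation and allow your terminal (iTerm / Terminal).
Snapshot whenever the page changes: continuing to click / fill without a fresh snapshot is the most common logic error when using agent-browser.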

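Step 8: Prepare to connect an LLM (but do not rush)

On macOS, the recommended next-step architecture is:

agent-browser as a pure execution layer
A controller you write yourself (Rust / Python / Node all work)
The LLM is responsible only for:
- Reading the snapshot text
- Emitting the next command (strictly constrained)

At this stage, skip MCP and skip multi-agent setups; first get the "single step → snapshot → single step" chain running smoothly.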
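
As a starting point for such a controller, here is a minimal Rust sketch that shells out to the CLI one step at a time, with no LLM involved. It uses only commands and the `--session` option listed in the `agent-browser --help` output. The `build_args` and `run` helpers, the session name "demo", and the assumption that failures are signaled by a non-zero exit status are all illustrative.

```rust
use std::process::Command;

/// Build the argument vector for one agent-browser invocation.
/// Global options go before the command, as in the help examples.
fn build_args(session: &str, action: &[&str]) -> Vec<String> {
    let mut args = vec!["--session".to_string(), session.to_string()];
    args.extend(action.iter().map(|a| a.to_string()));
    args
}

/// Run one command and return its stdout, or an error message.
/// Assumption: the CLI exits with a non-zero status when a command fails.
fn run(session: &str, action: &[&str]) -> Result<String, String> {
    let output = Command::new("agent-browser")
        .args(build_args(session, action))
        .output()
        .map_err(|e| format!("failed to start agent-browser: {}", e))?;
    if output.status.success() {
        Ok(String::from_utf8_lossy(&output.stdout).into_owned())
    } else {
        Err(String::from_utf8_lossy(&output.stderr).into_owned())
    }
}

fn main() {
    // Single step -> snapshot -> single step, driven by a fixed script.
    let steps: [&[&str]; 3] = [
        &["open", "https://example.com"],
        &["snapshot", "-i"],
        &["close"],
    ];
    for step in steps {
        match run("demo", step) {
            Ok(out) => println!("{}", out),
            Err(err) => {
                eprintln!("step {:?} failed: {}", step, err);
                break;
            }
        }
    }
}
```

Passing arguments as a vector rather than a shell string sidesteps quoting problems when a fill value contains spaces or quotes.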

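6. Operating the Miro login page with agent-browser (example)

agent-browser open https://miro.com/login/
agent-browser snapshot
agent-browser fill @e2 "test@example.com"
agent-browser click @e3
agent-browser snapshot

The refs here are illustrative: in the full snapshot captured in Step 4, the Email textbox was e10 and the "Continue with email" button was e14, so always take refs from your own snapshot.

The core principle fits in one sentence:
Whenever the page changes, take a new snapshot.
A ref-id is a logical anchor for the current page state, not a global ID.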

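7. agent-browser vs playwright-mcp

Dimension	agent-browser	playwright-mcp
Form	CLI tool	MCP Server
Audience	Scripts / AI agents	MCP clients (IDE / desktop)
Interaction	Commands + snapshot	Tool calls
State management	External controller	MCP session
Best for	Custom agent loops, CI, scripts	AI assistants inside an IDE

The difference in one line each:

agent-browser: browser capability = a set of commands
playwright-mcp: browser capability = an MCP tool set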

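8. How do you combine an LLM with agent-browser?

A typical ReAct / tool loop:

open the page
snapshot to capture the page state
Send the snapshot plus the goal to the LLM
The LLM emits the next command (in a restricted DSL)
Execute the command
Repeat until done or failed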
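
The loop above can be sketched in Rust without committing to any particular LLM or browser backend. Everything here is a stand-in: the `Browser` trait, the `Decision` enum, the in-memory `FakeBrowser`, and the rule-based `fake_llm` are hypothetical, chosen so the control flow can run on its own.

```rust
/// Abstraction over the execution layer. A real implementation would shell out
/// to the agent-browser CLI; this trait is hypothetical.
trait Browser {
    fn snapshot(&mut self) -> String;
    fn run(&mut self, command: &str) -> Result<(), String>;
}

/// What the decision step can return.
enum Decision {
    Command(String),
    Done,
}

/// snapshot -> decide -> execute, until Done, an error, or the step limit.
/// Returns the number of commands executed.
fn run_loop<B: Browser>(
    browser: &mut B,
    goal: &str,
    decide: impl Fn(&str, &str) -> Decision,
    max_steps: usize,
) -> Result<usize, String> {
    for step in 0..max_steps {
        let snapshot = browser.snapshot(); // always decide on fresh state
        match decide(goal, &snapshot) {
            Decision::Done => return Ok(step),
            Decision::Command(cmd) => browser.run(&cmd)?,
        }
    }
    Err(format!("gave up after {} steps", max_steps))
}

/// In-memory stand-in for a browser showing a one-field login form.
struct FakeBrowser {
    email: String,
    submitted: bool,
    log: Vec<String>,
}

impl Browser for FakeBrowser {
    fn snapshot(&mut self) -> String {
        if self.submitted {
            "@e1 heading \"Check your inbox\"".to_string()
        } else {
            format!(
                "@e1 textbox \"Email\" value=\"{}\"\n@e2 button \"Continue\"",
                self.email
            )
        }
    }

    fn run(&mut self, command: &str) -> Result<(), String> {
        self.log.push(command.to_string());
        if let Some(value) = command.strip_prefix("fill @e1 ") {
            self.email = value.to_string();
            Ok(())
        } else if command == "click @e2" {
            self.submitted = true;
            Ok(())
        } else {
            Err(format!("unsupported command: {}", command))
        }
    }
}

/// Rule-based stand-in for the LLM call.
fn fake_llm(_goal: &str, snapshot: &str) -> Decision {
    if snapshot.contains("Check your inbox") {
        Decision::Done
    } else if snapshot.contains("value=\"\"") {
        Decision::Command("fill @e1 test@example.com".to_string())
    } else {
        Decision::Command("click @e2".to_string())
    }
}

fn main() {
    let mut browser = FakeBrowser { email: String::new(), submitted: false, log: Vec::new() };
    let result = run_loop(&mut browser, "log in", fake_llm, 10);
    println!("{:?} after commands {:?}", result, browser.log);
}
```

Keeping the decision step behind a plain function makes it easy to swap the rule-based stub for a real model call later, while the loop and its step limit stay unchanged.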

Key engineering points:

Command whitelist
Domain restrictions
Limits on step count / time
Automatic screenshot on failure
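
The first three guardrails can be combined into a single validation step that runs before any LLM-proposed command is executed. This is a minimal sketch under stated assumptions: the allowed command names are taken from the `agent-browser --help` output, while the host list, the `validate` and `host_of` helpers, and the simplified URL parsing (no IPv6 literals) are hypothetical.

```rust
/// Commands the controller is willing to execute (a policy choice).
const ALLOWED_COMMANDS: [&str; 5] = ["open", "snapshot", "click", "fill", "press"];

/// Hosts the agent may navigate to (hypothetical example policy).
const ALLOWED_HOSTS: [&str; 2] = ["example.com", "news.ycombinator.com"];

/// Extract the host from an http(s) URL using only the standard library.
/// Simplified: does not handle IPv6 literals.
fn host_of(url: &str) -> Option<&str> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    let authority = rest.split(|c: char| c == '/' || c == '?' || c == '#').next()?;
    let host_port = authority.rsplit('@').next()?; // drop any userinfo
    let host = host_port.split(':').next()?; // drop any port
    if host.is_empty() {
        None
    } else {
        Some(host)
    }
}

/// Check one proposed command line against the whitelist, the host policy,
/// and the step budget.
fn validate(command_line: &str, steps_used: usize, max_steps: usize) -> Result<(), String> {
    if steps_used >= max_steps {
        return Err("step budget exhausted".to_string());
    }
    let mut parts = command_line.split_whitespace();
    let command = parts.next().ok_or("empty command")?;
    if !ALLOWED_COMMANDS.contains(&command) {
        return Err(format!("command not allowed: {}", command));
    }
    if command == "open" {
        let url = parts.next().ok_or("open needs a URL")?;
        let host = host_of(url).ok_or("only http(s) URLs are accepted")?;
        if !ALLOWED_HOSTS.contains(&host) {
            return Err(format!("host not allowed: {}", host));
        }
    }
    Ok(())
}

fn main() {
    for cmd in ["click @e2", "open https://example.com/", "eval document.cookie"] {
        println!("{:?} -> {:?}", cmd, validate(cmd, 0, 20));
    }
}
```

A time limit can be added the same way with `std::time::Instant`, and a failed step is a natural place to call the `screenshot` command before aborting.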

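The essential division of labor:
The LLM makes the decisions; agent-browser carries them out.

Here is also how to use agent-browser together with Claude Code.

Describe how to operate agent-browser in AGENTS.md / CLAUDE.md, as below, and Claude Code gains the ability to operate a browser.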

## Browser Automation

Use `agent-browser` for web automation. Run `agent-browser --help` for all commands.

Core workflow:
1. `agent-browser open <url>` - Navigate to page
2. `agent-browser snapshot -i` - Get interactive elements with refs (@e1, @e2)
3. `agent-browser click @e1` / `fill @e2 "text"` - Interact using refs
4. Re-snapshot after page change

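Installing UTM / Lume

Installing Ubuntu Server for ARM / macOS

Near the end of the Ubuntu installation, the console may stop at:

[FAILED] Failed unmounting cdrom.mount - /cdrom
Please remove the installation medium, then press ENTER

In UTM, handle it as follows.

Click the virtual machine's settings button in the menu at the top of the UTM window
Find CD/DVD or Drives
Remove the mounted Ubuntu ISO file (Eject / Clear / delete the ISO)

When that is done, return to the virtual machine window and press ENTER.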

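# Pick the first network interface other than loopback
# (-x matches the whole name, so interfaces that merely contain "lo" are kept)
IFACE=$(ip -o link show | awk -F': ' '{print $2}' | grep -vx lo | head -n1)

# Write a minimal netplan config that enables DHCP on that interface
sudo bash -c "cat > /etc/netplan/01-netcfg.yaml <<EOF
network:
  version: 2
  renderer: networkd
  ethernets:
    ${IFACE}:
      dhcp4: true
EOF"

sudo chmod 600 /etc/netplan/01-netcfg.yaml
sudo netplan generate
sudo netplan apply

# Verify the address and the default route
ip a
ip route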

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
echo >> /home/hijirii/.bashrc
echo 'eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv bash)"' >> /home/hijirii/.bashrc
eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv bash)"

echo "$(whoami) ALL=(ALL) NOPASSWD: ALL" | sudo tee /etc/sudoers.d/$(whoami) > /dev/null

curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.7/install.sh | bash
export NVM_DIR="$HOME/.nvm"
[ -s "$NVM_DIR/nvm.sh" ] && \. "$NVM_DIR/nvm.sh"  # This loads nvm
[ -s "$NVM_DIR/bash_completion" ] && \. "$NVM_DIR/bash_completion"  # This loads nvm bash_completion

nvm install 22 && nvm alias default 22 && nvm use 22

Installing and configuring OpenClaw

npm install -g openclaw@latest
openclaw onboard --install-daemon

Configuring LAN access to the OpenClaw Dashboard

Configure Tailscale
Configure Nginx
1. Set up an Nginx HTTPS reverse proxy
2. Edit the OpenClaw configuration: add controlUi.dangerouslyDisableDeviceAuth: true (Device pairing)
3. Restart the Gateway so that the new configuration takes effect

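Configuring OpenClaw

Disabling web_search

If you do not want to use the Brave API (which requires a linked credit card), you can disable web_search in the configuration:

openclaw config set 'tools.web.search.enabled' false

After a restart the web_search tool is disabled, while web_fetch remains available. To disable web_fetch as well:

openclaw config set 'tools.web.fetch.enabled' false

Adding environment variables when creating a Skill

When a Skill you create needs environment variables (such as API keys or paths), there are three ways to configure them:

1. Pass them on each call (recommended for custom variables)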

{
  "tool": "exec",
  "command": "agent-browser open https://example.com",
  "env": {
    "AGENT_BROWSER_HOME": "/home/hijirii/.openclaw/workspace/agent-browser",
    "MY_API_KEY": "xxx"
  }
}

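2. Configure directories to prepend to PATH

openclaw config set tools.exec.pathPrepend '["/home/hijirii/.cargo/bin", "/opt/bin"]'

⚠️ Note: the exec tool does not read ~/.bashrc or ~/.profile, so use one of the approaches above. For host=gateway, env.PATH is rejected (a security restriction); use pathPrepend instead.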

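9. Overall summary

playwright-rs: suited to automated testing inside a Rust project
Accessibility Tree: the semantic foundation for stable automation and AI operation
agent-browser: turns accessibility semantics into an executable protocol
LLM + agent-browser: moves browser automation from "writing selectors" to "making decisions"

10. Closing in one sentence

The DOM is for the browser to draw the page,
the Accessibility Tree is for people to understand the page,
and agent-browser is for AI to operate the page.

This is the paradigm shift now under way in modern browser automation.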

From Playwright to agent-browser: Browser Automation and AI Agent Practice Based on the Accessibility Tree
