AI-powered mobile automation agent for Android and iOS. Tell it what to do in plain English — it figures out what to tap, type, and swipe.
**Prerequisites**

- Node.js 18+
- Device connected — USB, emulator, or simulator
- Gemini API key from Google AI Studio
```bash
npm install -g appclaw
```

Create a `.env` file in your working directory:
```bash
cp .env.example .env
```

To run from source instead:

```bash
git clone https://github.com/AppiumTestDistribution/appclaw.git
cd appclaw
npm install
cp .env.example .env
```

Edit `.env` based on your preferred mode:
**Vision + Stark (recommended)**

Screenshot-first mode using Stark (df-vision + Gemini) for element location. Requires a Gemini API key.

```
LLM_PROVIDER=gemini
LLM_API_KEY=your-gemini-api-key
LLM_MODEL=gemini-3.1-flash-lite-preview
AGENT_MODE=vision
VISION_LOCATE_PROVIDER=stark
```

**Vision + Appium MCP**
Screenshot-first mode using appium-mcp's server-side AI vision for element location. See appium-mcp AI Vision setup for details.
```
LLM_PROVIDER=gemini
LLM_API_KEY=your-gemini-api-key
LLM_MODEL=gemini-3.1-flash-lite-preview
AGENT_MODE=vision
VISION_LOCATE_PROVIDER=appium_mcp
AI_VISION_ENABLED=true
AI_VISION_API_BASE_URL=https://generativelanguage.googleapis.com/v1beta/openai
AI_VISION_API_KEY=your-vision-api-key
AI_VISION_MODEL=gemini-2.0-flash
```

**DOM mode**
Uses XML page source to find elements by accessibility ID, xpath, etc. No vision needed — works with any LLM provider.
```
LLM_PROVIDER=gemini # or anthropic, openai, groq, ollama
LLM_API_KEY=your-api-key
AGENT_MODE=dom
```

```bash
# Interactive mode
appclaw

# Pass goal directly
appclaw "Open Settings"
appclaw "Search for cats on YouTube"
appclaw "Turn on WiFi"
appclaw "Send hello on WhatsApp to Mom"

# Or with npx (no global install)
npx appclaw "Open Settings"
```

When running from a local clone, use `npm start` instead:
```bash
npm start
npm start "Open Settings"
```

Run declarative steps from a YAML file:

```bash
appclaw --flow examples/flows/google-search.yaml
```

All configuration is via `.env`:
| Variable | Default | Description |
|---|---|---|
| `LLM_PROVIDER` | `gemini` | LLM provider (currently only `gemini` is supported for vision) |
| `LLM_API_KEY` | — | Gemini API key |
| `LLM_MODEL` | (auto) | Model override (e.g. `gemini-2.0-flash`) |
| `AGENT_MODE` | `vision` | `dom` (XML locators) or `vision` (screenshot-first) |
| `VISION_LOCATE_PROVIDER` | `stark` | Vision backend for locating elements |
| `MAX_STEPS` | `30` | Max steps per goal |
| `STEP_DELAY` | `500` | Milliseconds between steps |
| `SHOW_TOKEN_USAGE` | `false` | Print token usage and cost per step |
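Resolving these variables into a typed config with defaults might look like the sketch below. The `loadConfig` helper and field names are illustrative, not AppClaw's actual code; only the variable names and defaults come from the table above.

```typescript
// Sketch: apply the documented defaults when a variable is unset.
function loadConfig(env: Record<string, string | undefined>) {
  return {
    llmProvider: env.LLM_PROVIDER ?? "gemini",
    llmApiKey: env.LLM_API_KEY,                 // no default: required
    llmModel: env.LLM_MODEL,                    // undefined = auto-selected
    agentMode: env.AGENT_MODE ?? "vision",
    visionLocateProvider: env.VISION_LOCATE_PROVIDER ?? "stark",
    maxSteps: Number(env.MAX_STEPS ?? 30),
    stepDelayMs: Number(env.STEP_DELAY ?? 500),
    showTokenUsage: (env.SHOW_TOKEN_USAGE ?? "false") === "true",
  };
}

const config = loadConfig(process.env);
```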
On each step, AppClaw:
- Perceives — reads the device screen (UI elements or screenshot)
- Reasons — sends the goal + screen state to an LLM, which decides the next action
- Acts — executes the action (tap, type, swipe, launch app, etc.)
- Repeats until the goal is complete or max steps reached
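The loop above can be sketched in TypeScript. The `Agent` interface, `run` function, and action shape are illustrative assumptions, not AppClaw's real API:

```typescript
// Minimal perceive → reason → act loop, bounded by a max step count.
type Action = { kind: string; target?: string; text?: string };

interface Agent {
  perceive(): Promise<string>;                            // current screen state
  reason(goal: string, screen: string): Promise<Action>;  // LLM picks next action
  act(action: Action): Promise<void>;                     // execute on the device
}

async function run(agent: Agent, goal: string, maxSteps = 30): Promise<boolean> {
  for (let step = 0; step < maxSteps; step++) {
    const screen = await agent.perceive();
    const action = await agent.reason(goal, screen);
    if (action.kind === "done") return true;              // goal complete
    await agent.act(action);
  }
  return false;                                           // max steps reached
}
```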
| Action | Description |
|---|---|
| `tap` | Tap an element |
| `type` | Type text into an input |
| `scroll` / `swipe` | Scroll or swipe gesture |
| `launch` | Open an app |
| `back` / `home` | Navigation buttons |
| `long_press` / `double_tap` | Touch gestures |
| `find_and_tap` | Scroll to find, then tap |
| `ask_user` | Pause for user input (OTP, CAPTCHA) |
| `done` | Goal complete |
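Dispatching these actions onto a device driver could look like the following sketch. The `Driver` interface and `dispatch` function are hypothetical; AppClaw's real driver layer differs, and the subset of actions shown is abbreviated:

```typescript
// Sketch: route a decoded action to the matching driver call.
interface Driver {
  tap(target: string): void;
  type(target: string, text: string): void;
  swipe(direction: string): void;
  launch(app: string): void;
  pressKey(key: "back" | "home"): void;
}

type Action =
  | { kind: "tap"; target: string }
  | { kind: "type"; target: string; text: string }
  | { kind: "swipe"; direction: string }
  | { kind: "launch"; app: string }
  | { kind: "back" }
  | { kind: "home" }
  | { kind: "done" };

// Returns false when the agent should stop looping.
function dispatch(driver: Driver, a: Action): boolean {
  switch (a.kind) {
    case "tap": driver.tap(a.target); break;
    case "type": driver.type(a.target, a.text); break;
    case "swipe": driver.swipe(a.direction); break;
    case "launch": driver.launch(a.app); break;
    case "back":
    case "home": driver.pressKey(a.kind); break;
    case "done": return false;
  }
  return true;
}
```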
| Mechanism | What it does |
|---|---|
| Stuck detection | Detects repeated screens/actions, injects recovery hints |
| Checkpointing | Saves known-good states for rollback |
| Human-in-the-loop | Pauses for OTP, CAPTCHA, or ambiguous choices |
| Action retry | Feeds failures back to the LLM for re-planning |
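Stuck detection, for instance, can be as simple as hashing consecutive screen states and flagging repeats. This is a hypothetical sketch of the idea, not AppClaw's implementation; the class name and threshold are assumptions:

```typescript
import { createHash } from "node:crypto";

// Flags a likely loop when the screen hash repeats `threshold` times in a row.
class StuckDetector {
  private last = "";
  private repeats = 0;
  constructor(private threshold = 3) {}

  observe(screenState: string): boolean {
    const h = createHash("sha256").update(screenState).digest("hex");
    this.repeats = h === this.last ? this.repeats + 1 : 0;
    this.last = h;
    return this.repeats >= this.threshold; // true => inject a recovery hint
  }
}
```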
Licensed under the Apache License, Version 2.0. See LICENSE for the full text.
