Younes Laaroussi

Cybersecurity and Software Engineer

Building Marionette: Can Small On-Device LLMs Handle Complex Tasks?

The Google Chrome Built-in AI Challenge 2025 posed an interesting question for me, but not the one the organizers intended. While most participants focused on what they could build with Chrome's new AI APIs, I was fixated on something more fundamental: can a small, resource-constrained on-device model handle genuinely complex tasks if we hyper-optimize its inputs and context management?

The Thesis

Large language models running in the cloud have seemingly unlimited context windows and massive parameter counts. But they come with a cost: your data flows through external servers, gets logged, analyzed, and increasingly feeds the training pipelines of AI labs. The industry's hunger for data is insatiable.

I wanted to prove that a 3-billion-parameter model running entirely on-device could accomplish meaningful agentic work—browser automation, semantic memory, document retrieval—without sending a single byte to the cloud. Not as a compromise, but as a genuine alternative.

Marionette became that proof of concept: a Chrome extension that automates web browsing through natural language, operates 100% offline after initial setup, and maintains absolute user privacy.

The Architecture Problem

Gemini Nano ships with a 9,216-token context window. That sounds reasonable until you realize that a single webpage's accessibility tree can easily consume 6,000+ tokens. Add conversation history, system prompts, tool definitions, and memory retrieval, and you've overflowed before the model even thinks about responding.

The traditional approach—just dumping everything into the prompt—was impossible. I needed to fundamentally rethink how information flows into the model.

The solution came from aggressive context engineering:

Playbook-based tool loading. Instead of exposing all 22 automation tools in the system prompt (~2,400 tokens), I exposed 9 core tools (~850 tokens) and loaded specialized toolsets on-demand. When the agent needs to fill a form, it requests the form-filling playbook, which injects domain-specific context and unlocks relevant tools. This alone saved ~1,550 tokens—17% of the entire context window.
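A minimal sketch of on-demand toolset loading. The tool names beyond those mentioned in this post, and the playbook contents, are illustrative stand-ins, not Marionette's actual definitions:

```typescript
// Illustrative sketch: only a small core toolset lives in the system prompt;
// playbook toolsets are merged into the active set when the agent asks.
type Tool = { name: string; description: string };

const CORE_TOOLS: Tool[] = [
  { name: "findElements", description: "Find elements by natural-language query" },
  { name: "clickElement", description: "Click an element by numbered reference" },
  { name: "getPlaybook", description: "Load a specialized toolset on demand" },
];

// Hypothetical playbook registry; real playbooks also inject domain context.
const PLAYBOOKS: Record<string, Tool[]> = {
  "form-filling": [
    { name: "fillField", description: "Type a value into a form field" },
    { name: "selectOption", description: "Choose an option in a dropdown" },
  ],
};

// Merge a playbook's tools into the active set, skipping duplicates.
function loadPlaybook(active: Tool[], playbook: string): Tool[] {
  const extra = PLAYBOOKS[playbook] ?? [];
  return [...active, ...extra.filter(t => !active.some(a => a.name === t.name))];
}
```

The point is that the token cost of a toolset is paid only when the task actually needs it.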

Chunk-based semantic retrieval. Every webpage the user visits gets captured, cleaned with Readability.js, split into 500-character overlapping chunks, and embedded using Transformers.js running all-MiniLM-L6-v2. When the agent needs information, it retrieves only the most semantically relevant chunks—not entire pages. A 5,000-word article that would consume 6,500 tokens becomes three relevant paragraphs at ~200 tokens.
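The chunk-and-retrieve flow can be sketched as follows. The 50-character overlap and the plain number arrays are assumptions for illustration; Marionette produces the actual vectors with all-MiniLM-L6-v2 via Transformers.js:

```typescript
// Split cleaned page text into fixed-size overlapping chunks.
function chunkText(text: string, size = 500, overlap = 50): string[] {
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break;
  }
  return chunks;
}

// Cosine similarity between two embedding vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]; na += a[i] ** 2; nb += b[i] ** 2;
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return only the k chunks most similar to the query embedding.
function topK(query: number[], chunks: { text: string; vec: number[] }[], k = 3) {
  return [...chunks]
    .sort((x, y) => cosine(query, y.vec) - cosine(query, x.vec))
    .slice(0, k);
}
```

Retrieval returns a handful of high-similarity chunks instead of the whole page, which is where the token savings come from.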

Intelligent summarization triggers. When conversation history approaches 80% of the context window, Chrome's Summarizer API compresses the history while preserving critical state: which form fields were filled, what values the user provided, what actions remain. The agent continues seamlessly without losing track of multi-step workflows.
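The trigger itself is simple; a sketch, assuming a rough four-characters-per-token heuristic (the actual call into Chrome's Summarizer API is omitted here):

```typescript
const CONTEXT_WINDOW = 9216; // Gemini Nano's token budget
const TRIGGER_RATIO = 0.8;   // summarize when history nears 80% of it

// Rough heuristic: ~4 characters per token. An assumption, not the
// tokenizer's real count.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Decide whether conversation history must be compressed before the
// next turn, given the tokens already claimed by prompt/tools/retrieval.
function shouldSummarize(historyTokens: number, otherTokens: number): boolean {
  return historyTokens + otherTokens >= CONTEXT_WINDOW * TRIGGER_RATIO;
}
```

When the check fires, the summarization prompt has to explicitly ask for the critical state (filled fields, provided values, remaining actions) so the compressed history stays actionable.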

The Tooling Revelation

The most significant insight came from tool design. Small models don't reason like large models. They need explicit guidance, not abstract capabilities.

Early iterations exposed generic tools: "query the DOM," "execute JavaScript," "parse the page." The model would get lost in possibilities, hallucinate element selectors, and struggle to connect observations to actions.

The breakthrough was building tools that matched how the model naturally thinks. Instead of "find element by CSS selector," I built findElements with a natural language query: "search button," "email input field," "submit." The tool searches the accessibility tree by semantic similarity and returns numbered references. The model sees [12] Button: "Submit" and calls clickElement with index 12. No guessing at selectors. No XPath confusion.

This pattern extended everywhere. captureScreenshot returns a visual snapshot the model can reason about. searchVault takes a natural question and returns relevant content chunks. getPlaybook loads domain knowledge when the agent recognizes it needs specialized help.

The toolset you provide fundamentally shapes what the model can accomplish. Generic tools produce generic failures. Purpose-built tools unlock genuine capability.

Taming the Small Model

Gemini Nano occasionally forgets its instructions. It hallucinates phone numbers. It gets stuck in loops calling the same tool repeatedly. These aren't bugs—they're characteristics of small models that must be engineered around.

Format drift correction. The model outputs tool calls in a strict XML format. After long conversations, it sometimes wraps calls in code blocks or forgets closing braces. The parser detects malformed syntax and returns explicit correction messages: "STOP using code blocks. Write the function_call tag directly." The model self-corrects 90% of the time within one iteration.
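A minimal sketch of this drift detection, assuming a `<function_call>` tag format; the checks and second correction string are illustrative, not the parser's full rule set:

```typescript
const FENCE = "`".repeat(3); // markdown code fence the model must not emit

// Inspect raw model output and either accept it or return an explicit
// correction message to feed back into the next turn.
function checkToolCall(output: string): { ok: boolean; correction?: string } {
  if (output.includes(FENCE)) {
    return {
      ok: false,
      correction: "STOP using code blocks. Write the function_call tag directly.",
    };
  }
  const opens = (output.match(/<function_call>/g) ?? []).length;
  const closes = (output.match(/<\/function_call>/g) ?? []).length;
  if (opens !== closes) {
    return {
      ok: false,
      correction: "Your function_call tag is not closed. Emit a matching closing tag.",
    };
  }
  return { ok: true };
}
```

The correction message goes back into the conversation, and the model usually fixes its own formatting on the next attempt.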

Loop detection. When the agent calls the same tool three times consecutively, or enters a cyclic pattern (A → B → C → A → B → C), the system injects: "LOOP DETECTED. Stop calling tools and describe what you've learned." This breaks the repetition and forces the model to synthesize its observations.
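A sketch of the detector, checking cycle lengths up to three (the exact thresholds and cycle lengths are assumptions):

```typescript
// Scan the recent tool-call history for repetition worth interrupting.
function detectLoop(history: string[]): boolean {
  const n = history.length;
  // Case 1: the same tool three times in a row.
  if (n >= 3 && history[n - 1] === history[n - 2] && history[n - 2] === history[n - 3]) {
    return true;
  }
  // Case 2: a short cycle repeated twice, e.g. A, B, C, A, B, C.
  for (let len = 2; len <= 3; len++) {
    if (n >= 2 * len) {
      const recent = history.slice(n - len);
      const prior = history.slice(n - 2 * len, n - len);
      if (recent.every((t, i) => t === prior[i])) return true;
    }
  }
  return false;
}
```

When it fires, the injected "LOOP DETECTED" message replaces another pointless tool call with a forced synthesis step.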

Hallucination guardrails. When filling forms, the model occasionally invents plausible-looking data. Pattern detection catches fake phone numbers and email addresses, warning the agent before it submits garbage.
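A sketch of such a pre-submit check. The specific heuristics here (555 exchanges, repeated digits, placeholder domains and names) are common hallucination tells I'm assuming for illustration, not Marionette's exact rules:

```typescript
// Flag values that look like invented placeholder data before submission.
function looksFabricated(field: "phone" | "email", value: string): boolean {
  if (field === "phone") {
    const digits = value.replace(/\D/g, "");
    if (/^\d{3}555\d{4}$/.test(digits)) return true;   // fictional 555 exchange
    if (/^(\d)\1+$/.test(digits)) return true;          // e.g. 1111111111
    if (digits === "1234567890" || digits === "0123456789") return true;
  }
  if (field === "email") {
    if (/@(example|test|email)\.(com|org|net)$/i.test(value)) return true;
    if (/^(john|jane)\.?doe@/i.test(value)) return true; // classic placeholder
  }
  return false;
}
```

A flagged value triggers a warning back to the agent instead of a form submission.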

These aren't elegant solutions. They're pragmatic ones. Small models require scaffolding that large models don't need. The question is whether the scaffolding can be robust enough to deliver reliable results. For Marionette, the answer turned out to be yes.

The Privacy Stance

After the one-time model download (~2GB for Gemini Nano, ~23MB for the embeddings model), Marionette operates completely offline. No API keys. No telemetry. No analytics. You can verify this by watching the Network tab in DevTools during normal operation—zero requests leave your machine.

This wasn't just a technical decision. AI companies are accumulating unprecedented amounts of user data. Every query to ChatGPT, every document uploaded to Claude, every conversation with Gemini feeds training pipelines that we have no visibility into. The value exchange is increasingly asymmetric: we provide intimate data about our lives; they provide a service that could, at any moment, start using that data in ways we never anticipated.

On-device inference flips this dynamic. Your browsing history, your documents, your form data, your voice commands—none of it leaves your device. The privacy guarantee isn't a policy promise; it's an architectural fact.

The tradeoff is performance. Cloud inference returns in under a second; on-device takes 1-3 seconds. But users consistently tell me they're willing to wait for the guarantee that their data stays private. That's a meaningful signal about what people actually value.

Results

The Google Chrome Built-in AI Challenge attracted over 14,000 participants. Marionette won Honorable Mention—a result I'm genuinely proud of, even if I hoped for more.

Competing against that many developers validates the technical approach. The judges recognized that this wasn't just a demo of Chrome's APIs; it was a serious attempt to answer whether on-device models can handle complex agentic workflows. The answer, with enough engineering effort, is yes.

What's Next

Marionette will remain a research project. Chrome's built-in AI capabilities are still evolving—new APIs ship regularly, model capabilities improve, context windows may expand. Each advancement opens new possibilities for what on-device agents can accomplish.

The codebase is open source. If you're interested in on-device AI, browser automation, or privacy-preserving machine learning, I'd welcome contributions and conversations.

The broader question—can small models replace cloud inference for complex tasks—remains open. Marionette proves it's possible in constrained domains with sufficient engineering investment. Whether that investment is worthwhile compared to just calling an API depends on how much you value privacy, offline capability, and independence from external services.

For me, the answer is clear. The future of AI should include options that respect user autonomy. Marionette is my small contribution toward that future.