SmudgeAI
SmudgeAI is an autonomous AI agent designed to control Windows desktop applications. It understands the content displayed on the screen and can navigate various programs, even those it hasn't encountered before. The agent can execute complex, multi-step workflows automatically, reducing the need for manual intervention. Developers and automation engineers would find it useful for streamlining repetitive tasks and building sophisticated desktop automation solutions. Its key strength lies in its ability to adapt to different user interfaces and maintain security through built-in safeguards. SmudgeAI prioritizes reliability and safety, ensuring actions are verified and permissions are managed effectively. It is a production-ready solution with robust logging and error recovery capabilities.
SmudgeAI solves the problem of automating complex, multi-step tasks within Windows applications, which are often tedious and error-prone to perform manually. Instead of relying on manual clicks and keystrokes or simpler scripting tools that require extensive customization for each application, users can leverage SmudgeAI's AI-powered reasoning to automate workflows across diverse software.
# SmudgeAI
SmudgeAI is a Windows-native autonomous desktop AI agent that understands screen context through vision models and UI element trees, enabling it to navigate any application and execute complex multi-step workflows with minimal human intervention. It leverages LLM-based reasoning to adapt to unknown interfaces while maintaining security through built-in permission systems and safeguards.
## Table of Contents
1. [Overview](#overview)
2. [Architecture](#architecture)
3. [Core Components](#core-components)
4. [AI Engine & Multimodal Processing](#ai-engine--multimodal-processing)
5. [Desktop Automation Stack](#desktop-automation-stack)
6. [UI Detection & Element Matching](#ui-detection--element-matching)
7. [Security & Safety Systems](#security--safety-systems)
8. [Error Handling & Recovery](#error-handling--recovery)
9. [Multi-Monitor & DPI Scaling Support](#multi-monitor--dpi-scaling-support)
10. [Internationalization & Localization](#internationalization--localization)
11. [Logging & Observability](#logging--observability)
12. [Capabilities Summary](#capabilities-summary)
13. [Comparison with OpenClaw](#comparison-with-openclaw)
14. [Bug Fixes & Security Hardening](#bug-fixes--security-hardening)
15. [Getting Started](#getting-started)
16. [Configuration](#configuration)
17. [Roadmap](#roadmap)
---
## Overview
SmudgeAI is designed to be a **fully autonomous desktop control agent** that can:
- **Understand screen context** through vision models and UI element trees
- **Navigate any Windows application** using UIA-based element discovery
- **Execute multi-step workflows** with verification and rollback
- **Adapt to unknown interfaces** through LLM-based reasoning
- **Operate safely** with permission systems and safeguards
### Key Design Goals
1. **Reliability First** - Every action is verified before proceeding
2. **Security by Default** - Permission systems and input sanitization on all dangerous operations
3. **Cross-UI Adaptability** - Works with any Windows application through UIA and CV fallback
4. **Production Ready** - Structured logging, error recovery, and observability
---
## Architecture
### High-Level System Diagram
```
┌─────────────────────────────────────────────────────────────────────────────┐
│                            SmudgeAI Architecture                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌──────────────┐     ┌──────────────────────────────────────────────┐      │
│  │     GUI      │────▶│                  AI Engine                   │      │
│  │   (PyQt5)    │     │  ┌─────────────┐  ┌─────────────────────┐    │      │
│  │              │     │  │ Groq Client │  │    Gemini Client    │    │      │
│  │  - Input     │     │  │  (Primary)  │  │     (Fallback)      │    │      │
│  │  - Display   │     │  └─────────────┘  └─────────────────────┘    │      │
│  │  - Status    │     │  ┌──────────────────────────────────────┐    │      │
│  └──────────────┘     │  │     Rate Limiter & Model Cycling     │    │      │
│         │             │  └──────────────────────────────────────┘    │      │
│         │             │  ┌──────────────────────────────────────┐    │      │
│         │             │  │     Conversation History Manager     │    │      │
│         │             │  └──────────────────────────────────────┘    │      │
│         │             └──────────────────────────────────────────────┘      │
│         ▼                                                                   │
│  ┌──────────────────────────────────────────────────────────────┐           │
│  │                         Task Manager                         │           │
│  │  ┌────────────┐    ┌────────────┐    ┌────────────────────┐  │           │
│  │  │ Permission │    │    Tool    │    │       Error        │  │           │
│  │  │   System   │    │  Registry  │    │     Classifier     │  │           │
│  │  └────────────┘    └────────────┘    └────────────────────┘  │           │
│  └──────────────────────────────────────────────────────────────┘           │
│                                  │                                          │
│             ┌────────────────────┼──────────────────────┐                   │
│             ▼                    ▼                      ▼                   │
│      ┌─────────────┐      ┌─────────────┐      ┌─────────────────┐          │
│      │   Desktop   │      │    CV/UI    │      │    Local VLM    │          │
│      │    State    │      │ Integration │      │   (Optional)    │          │
│      │    (UIA)    │      │             │      │                 │          │
│      └─────────────┘      └─────────────┘      └─────────────────┘          │
│                                  │                                          │
│                                  ▼                                          │
│                         ┌─────────────────┐                                 │
│                         │     Action      │                                 │
│                         │    Execution    │                                 │
│                         │   (pyautogui)   │                                 │
│                         └─────────────────┘                                 │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
### Component Responsibilities
| Component | Responsibility | Language/Framework |
|-----------|----------------|-------------------|
| **GUI** | User input, status display, permission dialogs | PyQt5 |
| **AI Engine** | LLM orchestration, vision analysis, tool calling | Python (async) |
| **Task Manager** | Tool execution, permission checks, caching | Python |
| **Desktop State** | UIA element tree, window management | pywinauto + pygetwindow |
| **CV/UI Integration** | Template matching, LLM coordinate verification | OpenCV + PIL |
| **Error Handler** | Error classification, retry strategies | Python |
| **Structured Logging** | Correlation IDs, action tracking | Python (custom) |
---
## Core Components
### 1. AI Engine (`ai_engine.py`)
The AI Engine is the brain of SmudgeAI, orchestrating all LLM interactions.
#### Features
- **Multi-Provider Support**: Groq (primary), Google Gemini (fallback)
- **Vision Analysis**: Llama 3.2 Vision on Groq, falling back to Gemini 1.5 Flash
- **Model Cycling**: Automatic failover to the next model when a rate limit is hit
- **Rate Limiting**: Built-in rate limiter with exponential backoff
- **Conversation History**: Properly serialized message history (dict-based)
- **Tool Schema Generation**: Dynamic tool schema from Python functions
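
To illustrate the tool schema generation feature, the sketch below derives a function-calling schema from a Python signature using `inspect`. The helper `build_tool_schema`, the type mapping, and the example tool `click_element` are hypothetical and not taken from `ai_engine.py`. The output shape follows the common OpenAI-style function format, which is an assumption about what the providers receive.

```python
import inspect
from typing import Any, Callable, Dict

# Hypothetical mapping from Python annotations to JSON Schema type names.
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def build_tool_schema(func: Callable[..., Any]) -> Dict[str, Any]:
    """Build an OpenAI-style function schema from a function signature."""
    properties: Dict[str, Any] = {}
    required = []
    for name, param in inspect.signature(func).parameters.items():
        # Unannotated or unmapped parameters fall back to "string".
        properties[name] = {"type": _JSON_TYPES.get(param.annotation, "string")}
        # Parameters without a default value are treated as required.
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "type": "function",
        "function": {
            "name": func.__name__,
            "description": inspect.getdoc(func) or "",
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


# Example tool, for illustration only.
def click_element(title: str, double_click: bool = False) -> str:
    """Click the UI element with the given title."""
    return f"clicked {title}"


schema = build_tool_schema(click_element)
```

Deriving the schema from the signature keeps the tool description and the implementation from drifting apart, at the cost of supporting only simple scalar annotations in this form.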
#### Rate Limiter Implementation
```python
class RateLimiter:
    requests_in_window: int      # Requests in current window
    window_size: float = 60.0    # 60-second window
    max_requests: int = 30       # Max 30 requests per window
    blocked_until: float         # Timestamp when block expires
    consecutive_errors: int      # Track consecutive failures
```
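
A minimal sketch of how these fields could gate requests, assuming a fixed window that resets once `window_size` seconds have elapsed. The class name `WindowedLimiter`, the `window_start` field, and the `allow` method are hypothetical; the actual logic in `ai_engine.py` may differ.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class WindowedLimiter:
    window_size: float = 60.0     # Window length in seconds
    max_requests: int = 30        # Requests allowed per window
    requests_in_window: int = 0   # Requests counted in the current window
    blocked_until: float = 0.0    # Timestamp when a backoff block expires
    window_start: float = field(default_factory=time.monotonic)  # hypothetical field

    def allow(self, now: Optional[float] = None) -> bool:
        """Return True and count the request if it fits in the current window."""
        now = time.monotonic() if now is None else now
        if now < self.blocked_until:
            return False  # Still inside a backoff block
        if now - self.window_start >= self.window_size:
            # The previous window has elapsed, so start a fresh one.
            self.window_start = now
            self.requests_in_window = 0
        if self.requests_in_window >= self.max_requests:
            return False
        self.requests_in_window += 1
        return True
```

Passing `now` explicitly makes the limiter easy to test without sleeping; in normal use the caller would omit it.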
**Backoff Strategy**:
- Base delay: 30 seconds
- Exponential: 2^consecutive_errors
- Jitter: Random 0-10 seconds
- Max block: 300 seconds (5 minutes)
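
One plausible reading of the bullets above is `delay = min(base * 2^consecutive_errors + jitter, max_block)`. The README does not state how the terms combine, so the formula, the function name `backoff_delay`, and the choice to apply the cap after adding jitter are all assumptions.

```python
import random
from typing import Optional

BASE_DELAY = 30.0   # seconds
MAX_JITTER = 10.0   # seconds
MAX_BLOCK = 300.0   # seconds (5 minutes)


def backoff_delay(consecutive_errors: int, rng: Optional[random.Random] = None) -> float:
    """Seconds to block after `consecutive_errors` failures in a row."""
    rng = rng or random.Random()
    exponential = BASE_DELAY * (2 ** consecutive_errors)
    jitter = rng.uniform(0.0, MAX_JITTER)
    # Cap the total so a long error streak never blocks for more than 5 minutes.
    return min(exponential + jitter, MAX_BLOCK)
```

Under these assumptions the delay falls in the range 30 to 40 seconds with no prior errors, 60 to 70 after one, 120 to 130 after two, 240 to 250 after three, and is pinned at 300 seconds from four errors onward.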
#### Vision Pipeline
```
Screenshot → Groq Llama 3.2 Vision → JSON Elements → Coordinate Verification → Click
                      ↓ (fallback)
           Gemini 1.5 Flash Vision
```
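
The primary-then-fallback order in this pipeline can be sketched as a loop over providers. The README does not show the real Groq or Gemini calls, so the providers here are injected callables; `analyze_screenshot`, `VisionProvider`, and both stub functions are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# A provider takes screenshot bytes and returns detected elements as dicts.
VisionProvider = Callable[[bytes], List[Dict]]


def analyze_screenshot(
    image: bytes, providers: Sequence[Tuple[str, VisionProvider]]
) -> Tuple[str, List[Dict]]:
    """Try each provider in order and return the first successful result."""
    errors = []
    for name, provider in providers:
        try:
            return name, provider(image)
        except Exception as exc:  # e.g. a rate limit or timeout
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all vision providers failed: " + "; ".join(errors))


# Stand-in providers; real ones would call the Groq and Gemini APIs.
def groq_vision_stub(image: bytes) -> List[Dict]:
    raise TimeoutError("rate limited")


def gemini_vision_stub(image: bytes) -> List[Dict]:
    return [{"label": "OK button", "x": 120, "y": 340}]


used, elements = analyze_screenshot(
    b"fake-png-bytes",
    [("groq", groq_vision_stub), ("gemini", gemini_vision_stub)],
)
```

Returning the provider name alongside the elements lets the caller log which model produced the coordinates before they go to verification.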
### 2. Desktop State (`desktop_state.py`)
Captures and maintains the UI element hierarchy of all windows.
#### COM Threading Fix
Initializing pywinauto previously caused race conditions. COM is now initialized explicitly on the calling thread before pywinauto is used:
```python
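def _ensure_pywinauto_com_init():
    """Initialize COM on the calling thread before pywinauto is used."""
    import pythoncom
    # CoInitializeEx takes a single argument: the concurrency-model flag.
    pythoncom.CoInitializeEx(pythoncom.COINIT_MULTITHREADED)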
```
#### Element Types Supported
- WINDOW, BUTTON, EDIT, MENU, MENU_ITEM
- TAB, CHECKBOX, RADIO_BUTTON, COMBOBOX
- LIST, LIST_ITEM, TEXT, UNKNOWN
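
The element types listed above map naturally onto an enum. The sketch below is one possible shape: the member values, the `_CONTROL_TYPE_MAP` table, and the helper `element_type_from_control_type` are assumptions rather than the contents of `desktop_state.py`. The keys are standard UIA control-type names.

```python
from enum import Enum


class ElementType(Enum):
    WINDOW = "window"
    BUTTON = "button"
    EDIT = "edit"
    MENU = "menu"
    MENU_ITEM = "menu_item"
    TAB = "tab"
    CHECKBOX = "checkbox"
    RADIO_BUTTON = "radio_button"
    COMBOBOX = "combobox"
    LIST = "list"
    LIST_ITEM = "list_item"
    TEXT = "text"
    UNKNOWN = "unknown"


# Hypothetical mapping from UIA control-type names to ElementType members.
_CONTROL_TYPE_MAP = {
    "Window": ElementType.WINDOW,
    "Button": ElementType.BUTTON,
    "Edit": ElementType.EDIT,
    "Menu": ElementType.MENU,
    "MenuItem": ElementType.MENU_ITEM,
    "TabItem": ElementType.TAB,
    "CheckBox": ElementType.CHECKBOX,
    "RadioButton": ElementType.RADIO_BUTTON,
    "ComboBox": ElementType.COMBOBOX,
    "List": ElementType.LIST,
    "ListItem": ElementType.LIST_ITEM,
    "Text": ElementType.TEXT,
}


def element_type_from_control_type(control_type: str) -> ElementType:
    """Map a UIA control-type name to an ElementType, defaulting to UNKNOWN."""
    return _CONTROL_TYPE_MAP.get(control_type, ElementType.UNKNOWN)
```

Falling back to `UNKNOWN` instead of raising keeps tree capture robust when an application exposes a control type outside the supported set.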
#### Data Structures
```python
from dataclasses import dataclass
from typing import List


@dataclass
class UIElement:
    title: str
    element_type: ElementType    # One of the supported element types
    rect: tuple                  # (x, y, width, height)
    automation_id: str           # UIA AutomationId
    class_name: str              # Window class name
    is_visible: bool
    is_enabled: bool
    children: List["UIElement"]  # Quoted: the class is not yet defined at this point


@dataclass
class WindowInfo:
    title: str
    process_name: str
    rect: tuple
    # [truncated…]
```