SmudgeAI

provenance:github:ShaikhWarsi/SmudgeAI

WHAT THIS AGENT DOES

SmudgeAI is an autonomous AI agent that controls Windows desktop applications. It reads what is displayed on screen and can navigate programs it has not encountered before, executing multi-step workflows with little manual intervention. It is aimed at developers and automation engineers who want to automate repetitive tasks or build desktop automation solutions. Its main strengths are adapting to unfamiliar user interfaces and enforcing safety: actions are verified and permissions are managed through built-in safeguards. The project describes itself as production ready, with logging and error recovery built in.

PROBLEM IT SOLVES

SmudgeAI automates complex, multi-step tasks in Windows applications that are tedious and error-prone to perform manually. Rather than relying on manual clicks and keystrokes, or on simpler scripting tools that must be customized for each application, users can rely on its LLM-based reasoning to automate workflows across different software.

TECH & STACK

python, windows, automation, ui, llm, desktop, uia, groq

README

# SmudgeAI

SmudgeAI is a Windows-native autonomous desktop AI agent that understands screen context through vision models and UI element trees, enabling it to navigate any application and execute complex multi-step workflows with minimal human intervention. It leverages LLM-based reasoning to adapt to unknown interfaces while maintaining security through built-in permission systems and safeguards.


## Table of Contents

1. [Overview](#overview)
2. [Architecture](#architecture)
3. [Core Components](#core-components)
4. [AI Engine & Multimodal Processing](#ai-engine--multimodal-processing)
5. [Desktop Automation Stack](#desktop-automation-stack)
6. [UI Detection & Element Matching](#ui-detection--element-matching)
7. [Security & Safety Systems](#security--safety-systems)
8. [Error Handling & Recovery](#error-handling--recovery)
9. [Multi-Monitor & DPI Scaling Support](#multi-monitor--dpi-scaling-support)
10. [Internationalization & Localization](#internationalization--localization)
11. [Logging & Observability](#logging--observability)
12. [Capabilities Summary](#capabilities-summary)
13. [Comparison with OpenClaw](#comparison-with-openclaw)
14. [Bug Fixes & Security Hardening](#bug-fixes--security-hardening)
15. [Getting Started](#getting-started)
16. [Configuration](#configuration)
17. [Roadmap](#roadmap)

---

## Overview

SmudgeAI is designed to be a **fully autonomous desktop control agent** that can:

- **Understand screen context** through vision models and UI element trees
- **Navigate any Windows application** using UIA-based element discovery
- **Execute multi-step workflows** with verification and rollback
- **Adapt to unknown interfaces** through LLM-based reasoning
- **Operate safely** with permission systems and safeguards
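
The "verification and rollback" behavior in the list above could look roughly like the following. This is a minimal sketch, not SmudgeAI's actual implementation: `Step`, `run_workflow`, and the three callables on each step are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One workflow step (hypothetical structure, not taken from the SmudgeAI source)."""
    name: str
    action: Callable[[], None]   # performs the step
    verify: Callable[[], bool]   # returns True if the step took effect
    undo: Callable[[], None]     # reverts the step


def run_workflow(steps: List[Step]) -> bool:
    """Run steps in order; if a verification fails, undo everything in reverse order."""
    done: List[Step] = []
    for step in steps:
        step.action()
        if not step.verify():
            # Roll back the failing step first, then the steps completed before it.
            for finished in reversed(done + [step]):
                finished.undo()
            return False
        done.append(step)
    return True
```

Undoing in reverse order means later steps, which may depend on earlier ones, are reverted first.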

### Key Design Goals

1. **Reliability First** - Every action is verified before proceeding
2. **Security by Default** - Permission systems and input sanitization on all dangerous operations
3. **Cross-UI Adaptability** - Works with any Windows application through UIA, with a computer-vision (CV) fallback
4. **Production Ready** - Structured logging, error recovery, and observability

---

## Architecture

### High-Level System Diagram

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                            SmudgeAI Architecture                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌──────────────┐     ┌──────────────────────────────────────────────┐      │
│  │     GUI      │────▶│                  AI Engine                   │      │
│  │   (PyQt5)    │     │ ┌─────────────┐  ┌─────────────────────────┐ │      │
│  │              │     │ │ Groq Client │  │ Gemini Client           │ │      │
│  │  - Input     │     │ │ (Primary)   │  │ (Fallback)              │ │      │
│  │  - Display   │     │ └─────────────┘  └─────────────────────────┘ │      │
│  │  - Status    │     │ ┌──────────────────────────────────────────┐ │      │
│  └──────────────┘     │ │ Rate Limiter & Model Cycling             │ │      │
│         │             │ └──────────────────────────────────────────┘ │      │
│         │             │ ┌──────────────────────────────────────────┐ │      │
│         │             │ │ Conversation History Manager             │ │      │
│         │             │ └──────────────────────────────────────────┘ │      │
│         │             └──────────────────────────────────────────────┘      │
│         ▼                                                                   │
│  ┌────────────────────────────────────────────────────────────────┐         │
│  │                          Task Manager                          │         │
│  │  ┌────────────┐  ┌────────────┐  ┌────────────────────┐        │         │
│  │  │ Permission │  │ Tool       │  │ Error              │        │         │
│  │  │ System     │  │ Registry   │  │ Classifier         │        │         │
│  │  └────────────┘  └────────────┘  └────────────────────┘        │         │
│  └────────────────────────────────────────────────────────────────┘         │
│                            │                                                │
│         ┌──────────────────┼────────────────────┐                           │
│         ▼                  ▼                    ▼                           │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────┐                  │
│  │ Desktop     │    │ CV/UI       │    │ Local VLM       │                  │
│  │ State       │    │ Integration │    │ (Optional)      │                  │
│  │ (UIA)       │    │             │    │                 │                  │
│  └─────────────┘    └─────────────┘    └─────────────────┘                  │
│                            │                                                │
│                            ▼                                                │
│                   ┌─────────────────┐                                       │
│                   │  Action         │                                       │
│                   │  Execution      │                                       │
│                   │  (pyautogui)    │                                       │
│                   └─────────────────┘                                       │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

### Component Responsibilities

| Component | Responsibility | Language/Framework |
|-----------|----------------|-------------------|
| **GUI** | User input, status display, permission dialogs | PyQt5 |
| **AI Engine** | LLM orchestration, vision analysis, tool calling | Python (async) |
| **Task Manager** | Tool execution, permission checks, caching | Python |
| **Desktop State** | UIA element tree, window management | pywinauto + pygetwindow |
| **CV/UI Integration** | Template matching, LLM coordinate verification | OpenCV + PIL |
| **Error Handler** | Error classification, retry strategies | Python |
| **Structured Logging** | Correlation IDs, action tracking | Python (custom) |
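
To illustrate what "correlation IDs, action tracking" in the Structured Logging row might involve, here is a small sketch built on the standard `logging` and `contextvars` modules. The names `JsonFormatter`, `start_task`, and the `action` field are assumptions for illustration; the README only says the logging is custom.

```python
import contextvars
import json
import logging
import uuid

# Correlation ID of the task currently being processed (hypothetical design).
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object that carries the correlation ID."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
            "action": getattr(record, "action", None),
        })


def start_task() -> str:
    """Assign a fresh correlation ID to the current task and return it."""
    cid = uuid.uuid4().hex[:12]
    correlation_id.set(cid)
    return cid


logger = logging.getLogger("smudge_sketch")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

task_id = start_task()
logger.info("clicked Save button", extra={"action": "click"})
```

Every log line emitted while a task runs then shares one `correlation_id`, so the actions belonging to a single user request can be grouped afterwards.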

---

## Core Components

### 1. AI Engine (`ai_engine.py`)

The AI Engine is the brain of SmudgeAI, orchestrating all LLM interactions.

#### Features

- **Multi-Provider Support**: Groq (primary), Google Gemini (fallback)
- **Vision Analysis**: Llama 3.2 Vision (Groq) → Gemini 1.5 Flash fallback
- **Model Cycling**: Automatic failover to another model when rate limits are hit
- **Rate Limiting**: Built-in rate limiter with exponential backoff
- **Conversation History**: Properly serialized message history (dict-based)
- **Tool Schema Generation**: Dynamic tool schema from Python functions
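
As an illustration of the last feature, a tool schema can be derived from a function's signature with the standard `inspect` module. The sketch below assumes an OpenAI-style function-calling format; the helper `tool_schema` and the example tool `click_element` are hypothetical and not taken from `ai_engine.py`.

```python
import inspect
from typing import Any, Callable, Dict

# Python annotations mapped to JSON Schema type names (illustrative subset).
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def tool_schema(func: Callable[..., Any]) -> Dict[str, Any]:
    """Build a function-calling schema from a function's signature and docstring."""
    properties: Dict[str, Any] = {}
    required = []
    for name, param in inspect.signature(func).parameters.items():
        # Unannotated or unrecognized parameters fall back to "string".
        properties[name] = {"type": _JSON_TYPES.get(param.annotation, "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "type": "function",
        "function": {
            "name": func.__name__,
            "description": inspect.getdoc(func) or "",
            "parameters": {"type": "object", "properties": properties, "required": required},
        },
    }


def click_element(title: str, double: bool = False) -> str:
    """Click the UI element with the given title (hypothetical tool)."""
    return f"clicked {title}"


schema = tool_schema(click_element)
```

Parameters without a default value are treated as required, so `title` is required here and `double` is optional.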

#### Rate Limiter Implementation

```python
from dataclasses import dataclass


# Field summary; the zero defaults are illustrative assumptions.
@dataclass
class RateLimiter:
    requests_in_window: int = 0    # Requests in current window
    window_size: float = 60.0      # 60-second window
    max_requests: int = 30         # Max 30 requests per window
    blocked_until: float = 0.0     # Timestamp when block expires
    consecutive_errors: int = 0    # Track consecutive failures
```

**Backoff Strategy**:
- Base delay: 30 seconds
- Exponential: 2^consecutive_errors
- Jitter: Random 0-10 seconds
- Max block: 300 seconds (5 minutes)
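
The README lists these parameters but not how they combine. One plausible reading, shown here as an assumption, is `delay = base * 2^consecutive_errors + jitter`, capped at the maximum block. The function name `backoff_delay` is hypothetical.

```python
import random

BASE_DELAY = 30.0   # seconds, from the list above
MAX_JITTER = 10.0   # seconds, upper bound of the random jitter
MAX_BLOCK = 300.0   # seconds (5 minutes)


def backoff_delay(consecutive_errors: int) -> float:
    """Return the block duration, assuming base * 2^errors + jitter, capped at MAX_BLOCK."""
    exponential = BASE_DELAY * (2 ** consecutive_errors)
    jitter = random.uniform(0.0, MAX_JITTER)
    return min(exponential + jitter, MAX_BLOCK)
```

Under this reading, a first failure with no prior errors blocks for 30 to 40 seconds, and the cap is reached from the fourth consecutive error onward (30 × 2^4 = 480 seconds, clamped to 300).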

#### Vision Pipeline

```
Screenshot → Groq Llama 3.2 Vision → JSON Elements → Coordinate Verification → Click
                ↓ (fallback)
         Gemini 1.5 Flash Vision
```
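
The fallback arrow in the pipeline above can be sketched as an ordered list of providers that are tried in turn. The two provider functions below are stand-ins that make no network calls; their names, the `detect_elements` helper, and the shape of the returned dicts are assumptions rather than the actual `ai_engine.py` interface.

```python
from typing import Callable, List, Sequence

# A vision provider takes raw screenshot bytes and returns detected elements as dicts.
VisionProvider = Callable[[bytes], List[dict]]


def detect_elements(screenshot: bytes, providers: Sequence[VisionProvider]) -> List[dict]:
    """Try each provider in order and return the first successful result."""
    errors = []
    for provider in providers:
        try:
            return provider(screenshot)
        except Exception as exc:  # e.g. a rate limit or malformed model output
            errors.append(f"{getattr(provider, '__name__', 'provider')}: {exc}")
    raise RuntimeError("all vision providers failed: " + "; ".join(errors))


# Stand-ins for the real API clients (hypothetical; no requests are sent).
def groq_vision(screenshot: bytes) -> List[dict]:
    raise TimeoutError("rate limited")


def gemini_vision(screenshot: bytes) -> List[dict]:
    return [{"label": "Save", "x": 120, "y": 48}]


elements = detect_elements(b"fake-png-bytes", [groq_vision, gemini_vision])
```

Here the primary provider fails, so the result comes from the fallback; coordinate verification would then run on `elements` before any click is issued.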

### 2. Desktop State (`desktop_state.py`)

Captures and maintains the UI element hierarchy of all windows.

#### COM Threading Fix

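Previously, pywinauto initialization caused race conditions. COM is now initialized explicitly before pywinauto is used:

```python
def _ensure_pywinauto_com_init():
    import pythoncom
    # pythoncom.CoInitializeEx takes only the concurrency flag (no reserved argument).
    pythoncom.CoInitializeEx(pythoncom.COINIT_MULTITHREADED)
```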

#### Element Types Supported

- WINDOW, BUTTON, EDIT, MENU, MENU_ITEM
- TAB, CHECKBOX, RADIO_BUTTON, COMBOBOX
- LIST, LIST_ITEM, TEXT, UNKNOWN

#### Data Structures

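```python
from __future__ import annotations  # lets UIElement refer to itself in annotations

from dataclasses import dataclass
from typing import List


@dataclass
class UIElement:
    title: str
    element_type: ElementType  # one of the element types listed above
    rect: tuple                # (x, y, width, height)
    automation_id: str         # UIA AutomationId
    class_name: str            # Window class name
    is_visible: bool
    is_enabled: bool
    children: List[UIElement]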

@dataclass
class WindowInfo:
    title: str
    process_name: str
    rect: tuple

```

[truncated…]

PUBLIC HISTORY

First discovered: Apr 1, 2026

IDENTITY

inferred

Identity inferred from code signals. No PROVENANCE.yml found.

METADATA

platform: github

first seen: Mar 26, 2025

last updated: Mar 21, 2026

last crawled: 1 month ago

version: —
