
TwoShakes

provenance:github:Steve-Git9/TwoShakes
WHAT THIS AGENT DOES

TwoShakes is an AI tool that cleans messy data and turns it into an analysis-ready format. It saves time and prevents errors by automatically identifying and correcting issues such as missing values, duplicates, and outliers. It is aimed at data analysts, business intelligence professionals, and anyone who works with spreadsheets or data files. What sets TwoShakes apart is that it always asks for your approval before making any changes, so you stay in control of your data: it is like having an assistant that prepares your data for you, while you review and approve every step.

README
# Two Shakes Data Cleaning
<img src="https://raw.githubusercontent.com/Steve-Git9/TwoShakes/main/frontend/static/tsLogo.png" width="150"/>

AI-Powered Data Preparation: From Messy to Analysis-Ready in **Two Shakes of a Lamb's tail**

![Python](https://img.shields.io/badge/Python-3.11+-blue)
![Azure](https://img.shields.io/badge/Azure-Deployed-0078D4)
![Microsoft Foundry](https://img.shields.io/badge/Microsoft_Foundry-Powered-purple)
![Agent Framework](https://img.shields.io/badge/Azure_AI_Agents-azure--ai--projects-green)
![MCP](https://img.shields.io/badge/MCP-Server_Enabled-orange)
![License](https://img.shields.io/badge/License-MIT-green)

> 🏗️ Built for the **Microsoft Purpose-Built AI Platform Hackathon**
> Category: **Best Use of Microsoft Foundry** · Also targeting: **Best Multi-Agent System** · **Best Enterprise Solution**

---

# Azure Deployment Link

https://dataprepagent-499e361a.azurewebsites.net/

---

## DEMO VIDEO

https://youtu.be/wbrhYIQtNJQ

---

<img src="https://raw.githubusercontent.com/Steve-Git9/TwoShakes/main/docs/gif_TS.gif" width="600"/>

---

## How It Works in 60 Seconds

1. **Upload** any messy data file — CSV, Excel, JSON, XML, or even a PDF with tables
2. **AI profiles** every column: detects types, missing values, outliers, duplicates, and scores quality 0–100
3. **Review a cleaning plan** — approve, reject, or tweak each AI-proposed action before anything runs
4. **Optionally prepare for ML** — the AI recommends encoding, scaling, and feature transforms tailored to your data
5. **Download** your analysis-ready dataset as CSV, Excel, or Parquet

The LLM decides *what* to fix. Python executes it deterministically. Your data, your call: nothing changes without your approval.
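The approval gate can be sketched as follows (a minimal illustration with hypothetical names; the real plan objects live in the agent sources under `src/agents/`):

```python
# Hypothetical sketch of the human-in-the-loop gate: the AI proposes a
# plan, the user toggles each action, and only approved actions run.
from dataclasses import dataclass
from typing import Callable

@dataclass
class PlanAction:
    name: str            # e.g. "dedupe"
    approved: bool = False

def run_approved(plan: list[PlanAction], executors: dict[str, Callable], data):
    """Apply only the user-approved actions, in order."""
    for action in plan:
        if action.approved:
            data = executors[action.name](data)
    return data

rows = [1, 1, 2]
plan = [PlanAction("dedupe", approved=True), PlanAction("drop_rows", approved=False)]
executors = {"dedupe": lambda d: sorted(set(d)), "drop_rows": lambda d: []}
run_approved(plan, executors, rows)  # -> [1, 2]; "drop_rows" never runs
```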

---

## Microsoft Hero Technologies — Where & How They're Used

This section maps every required hackathon technology to the exact source files that implement it.

### ☁️ Microsoft Foundry (Azure AI Foundry)

All LLM calls in DataPrepAgent go through models hosted on Microsoft Foundry. The single LLM client in [`src/agents/__init__.py`](src/agents/__init__.py) connects to the Foundry endpoint using the `AZURE_AI_PROJECT_ENDPOINT` and `AZURE_AI_MODEL_DEPLOYMENT_NAME` environment variables. Three agents make LLM calls — the Profiler (semantic analysis), the Strategy Agent (cleaning plan generation), and the Validator (quality certificate) — plus the Feature Engineering Agent for ML recommendations. Azure Foundry's built-in content filters are active on every call.

**Key code:**
```python
# src/agents/__init__.py — AgentClient constructor
import os

from azure.ai.projects import AIProjectClient
from azure.core.credentials import AzureKeyCredential

client = AIProjectClient(
    endpoint=os.getenv("AZURE_AI_PROJECT_ENDPOINT"),
    credential=AzureKeyCredential(os.getenv("AZURE_AI_PROJECT_KEY")),
)
# Creates an Azure AI Agent with per-call threads
agent = client.agents.create_agent(
    model=os.getenv("AZURE_AI_MODEL_DEPLOYMENT_NAME"),
    name=self.name,
    instructions=self.instructions,
)
```

### 🤖 Microsoft Agent Framework (`azure-ai-projects`)

The [`AgentClient`](src/agents/__init__.py) class uses `azure.ai.projects.AIProjectClient` as its tier-1 backend — the actual Microsoft Agent Framework SDK. It creates real Azure AI Agents with per-call threads and message-based conversations. If the SDK is unavailable (e.g., in environments without the preview package), it falls back gracefully to `openai.AzureOpenAI` pointing at the same Foundry-hosted model. Every agent in the system (Profiler, Strategy, Cleaner, Validator, Feature Engineering, Feature Transformer) uses this single client.
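The tier-1 / fallback selection described above can be sketched as an availability probe (an illustrative helper, not the actual `AgentClient` code):

```python
import importlib.util

# Sketch of the graceful fallback: prefer the Agent Framework SDK,
# fall back to the plain OpenAI client if it is not installed.
def pick_backend(preferred=("azure.ai.projects", "openai")):
    for module_name in preferred:
        try:
            if importlib.util.find_spec(module_name) is not None:
                return module_name
        except ModuleNotFoundError:
            # Parent package of a dotted name is missing; keep looking.
            continue
    return None  # caller reports a configuration error
```

`find_spec` checks installability without importing the heavy SDK, so the probe is cheap even when both packages are present.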

**Files:** [`src/agents/__init__.py`](src/agents/__init__.py) · [`src/agents/orchestrator_agent.py`](src/agents/orchestrator_agent.py) · [`src/agents/profiler_agent.py`](src/agents/profiler_agent.py) · [`src/agents/strategy_agent.py`](src/agents/strategy_agent.py) · [`src/agents/validator_agent.py`](src/agents/validator_agent.py) · [`src/agents/feature_engineering_agent.py`](src/agents/feature_engineering_agent.py) · [`src/agents/feature_transformer_agent.py`](src/agents/feature_transformer_agent.py)

### 🔌 MCP Server (7 tools)

[`src/mcp_server.py`](src/mcp_server.py) exposes the full pipeline as 7 MCP tools via stdio transport. Any MCP-compatible client — including **GitHub Copilot Agent Mode** — can call these tools programmatically. The tools are: `profile_data`, `suggest_cleaning_plan`, `clean_data`, `validate_cleaning`, `list_supported_formats`, `recommend_feature_engineering`, `apply_feature_engineering`.
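Conceptually, the server maps each tool name to a pipeline function. The stand-in below illustrates that dispatch with a plain registry (the real server in `src/mcp_server.py` uses the MCP stdio transport; the format list here is taken from the README and is illustrative only):

```python
# Simplified stand-in for an MCP tool registry: each pipeline stage is
# registered under the tool name an MCP client would call.
TOOLS = {}

def tool(name):
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("list_supported_formats")
def list_supported_formats():
    # Formats named in "How It Works"; illustrative only.
    return ["csv", "xlsx", "json", "xml", "pdf"]

def call_tool(name, **kwargs):
    """Dispatch a tool call by name, as an MCP client would."""
    return TOOLS[name](**kwargs)
```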

### 🧑‍💻 GitHub Copilot Agent Mode

The repo includes a [`.vscode/mcp.json`](.vscode/mcp.json) configuration file that registers DataPrepAgent's MCP server as a tool source for GitHub Copilot Agent Mode in VS Code. With this config, a developer can ask Copilot: *"Profile the file test_data/messy_sales.csv and suggest a cleaning plan"* and Copilot will call the MCP tools automatically.

```json
// .vscode/mcp.json — already in the repo
{
  "servers": {
    "dataprepagent": {
      "command": "python",
      "args": ["src/mcp_server.py"],
      "env": { "AZURE_AI_PROJECT_ENDPOINT": "...", "AZURE_AI_PROJECT_KEY": "...", "AZURE_AI_MODEL_DEPLOYMENT_NAME": "gpt-4o-mini" }
    }
  }
}
```

### 📄 Azure AI Document Intelligence

[`src/parsers/pdf_parser.py`](src/parsers/pdf_parser.py) uses Azure AI Document Intelligence's `prebuilt-layout` model to extract tables from PDF files and scanned images. This is a second Azure AI service beyond the LLM, demonstrating multi-service integration on the Azure platform.

### ☁️ Azure App Service (Deployment)

[`infra/deploy.sh`](infra/deploy.sh) provides one-command deployment to Azure App Service. The script creates a resource group, a B1 Linux App Service plan, and a web app with the Python 3.11 runtime, then configures all environment variables and deploys the code. [`startup.sh`](startup.sh) runs Streamlit on port 8000 for the Azure container. Full step-by-step instructions are in [`infra/azure-deployment.md`](infra/azure-deployment.md).

---

## Architecture — 8-Agent Orchestrated Pipeline

![Architecture](docs/architecture.png)

**Agentic design patterns used:**
- **Multi-agent collaboration**: 8 specialized agents, each with a single responsibility
- **Agent-to-agent messaging**: Orchestrator sends structured `AgentMessage` objects to sub-agents
- **Orchestrator supervisor**: Central coordinator drives the pipeline, manages state, handles errors
- **Self-healing retry loop**: If quality score < target after cleaning, Orchestrator re-runs Strategy + Cleaner (up to 2 retries)
- **Human-in-the-loop checkpoints**: Pipeline pauses twice for user approval (cleaning plan + FE plan)
- **Tool-using agents**: MCP server exposes all agent capabilities as callable tools
- **Deterministic execution**: LLM reasons about *what* to do; Python code executes it. No AI-generated data values.
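The self-healing pattern from the list above can be sketched as a bounded retry loop (hypothetical stubs for the Strategy, Cleaner, and scoring steps; the real coordination lives in `src/agents/orchestrator_agent.py`):

```python
# Sketch of the self-healing loop: re-run Strategy + Cleaner while the
# quality score misses the target, capped at a fixed number of retries.
def clean_until_good(data, strategize, clean, score, target=90, max_retries=2):
    attempts = 0
    while True:
        plan = strategize(data)      # Strategy agent proposes a plan
        data = clean(data, plan)     # Cleaner executes it deterministically
        if score(data) >= target or attempts >= max_retries:
            return data              # good enough, or retry budget spent
        attempts += 1
```

The cap guarantees termination even when the score never reaches the target, matching the "up to 2 retries" behavior described above.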

---

## The Problem

Data scientists spend **60–80% of their time** on data cleaning and preparation. Messy CSVs with mixed date formats, Excel exports with merged cells, nested JSON APIs with missing fields — every dataset needs hours of manual wrangling before any real analysis can begin.

## The Solution

DataPrepAgent automates the entire data preparation pipeline using **8 AI agents** orchestrated by a supervisor. Upload a messy file, get a detailed quality report, review the AI-generated cleaning plan action by action, then optionally apply ML feature engineering — all in minutes.

---

## What Makes This Different

**🧠 LLM reasons, Python executes.**
The model analyzes your data and proposes a plan. But actual transformations are deterministic pandas and scikit-learn functions. The AI never generates or modifies data values directly — no hallucinated data, no surprises.
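That split can be sketched as a plan-to-pandas dispatch (illustrative action names and schema; the actual executor is the Cleaner agent):

```python
import pandas as pd

# The LLM emits action names and parameters; each name maps to a
# deterministic pandas operation, so the model never produces a value.
EXECUTORS = {
    "drop_duplicates": lambda df, p: df.drop_duplicates(),
    "fill_missing":    lambda df, p: df.fillna(p.get("value", 0)),
}

def execute_plan(df: pd.DataFrame, plan: list[dict]) -> pd.DataFrame:
    """Run an approved plan step by step, purely with pandas."""
    for step in plan:
        df = EXECUTORS[step["action"]](df, step.get("params", {}))
    return df
```

An unknown action name raises a `KeyError` rather than being improvised, which is the point: the executable vocabulary is fixed in code.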

**👤 Human-in-the-loop at every decision point.**
Both the cleaning plan and the feature engineering plan are presented as reviewable lists. Toggle each action on or off. Edit fill strategies. Change scaling methods. Nothing runs until you approve it.

**🔄 Self-healing pipeline.**
If the cleaned data

[truncated…]

PUBLIC HISTORY

First discovered: Mar 21, 2026

IDENTITY

inferred

Identity inferred from code signals. No PROVENANCE.yml found.


METADATA

platform: github
first seen: Mar 7, 2026
last updated: Mar 14, 2026
last crawled: 18 days ago
version

README BADGE

Add to your README:

![Provenance](https://getprovenance.dev/api/badge?id=provenance:github:Steve-Git9/TwoShakes)