Source repository: `AlMIGHTY-HARDIK/LLM-Business` on GitHub. An autonomous enterprise AI data analyst agent built with Python, Groq, and Streamlit.
# 🧠 AI Analyst: Enterprise Edition
[Python](https://www.python.org/downloads/) · [Streamlit](https://streamlit.io) · [Groq](https://groq.com) · [License: MIT](https://opensource.org/licenses/MIT)
An autonomous, compound AI data analysis system. This application ingests raw enterprise data (CSV/Excel), dynamically generates and executes Python code to find insights, and synthesizes the results into executive-ready narratives.
🚀 **[Try the Live Application Here](https://llm-business-ainocular.streamlit.app/)** 🚀
Designed with a modular, scalable architecture separating the frontend UI from the AI orchestration and data manipulation layers.
---
## ✨ Enterprise Features
- **Autonomous Code Execution:** Translates natural language queries into executable Pandas and Plotly code.
- **Self-Healing Logic:** Features an automatic retry and self-correction loop. If the AI-generated code encounters a Python runtime error, the engine reads the traceback and rewrites the code to fix the issue dynamically.
- **Adaptive Data Ingestion:** Automatically standardizes messy headers, parses complex date formats, and cleans dirty currency strings (e.g., stripping `$`, `₹`, `,`).
- **Pre-Execution Health Checks:** Scans data for heavily null columns and warns the LLM to prevent data hallucinations.
- **Executive Persona:** Generates final outputs using a predefined elite strategy consultant prompt (McKinsey/BCG style) to ensure actionable, business-focused insights.
---
## 🏗️ Architecture & Core Components
The application follows a strict separation of concerns, divided into UI, configuration, data management, and LLM orchestration.
```text
ai_analyst_app/
├── app.py # User Interface (Frontend)
├── requirements.txt # Project Dependencies
└── src/ # Backend Engine
├── __init__.py
├── config.py # Environment & Persona configuration
├── data_engine.py # Data Ingestion & Normalization
└── llm_engine.py # LLM Orchestration & Sandboxed Execution
```
### 1. The Frontend UI (`app.py`)
Built with Streamlit, this serves as the presentation layer. It handles file uploads, maintains the chat history state, tracks token usage, and dynamically renders dataframes and Plotly charts based on the backend's output.
### 2. Configuration (`src/config.py`)
Acts as the central nervous system for environment variables and system prompts. It securely loads the Groq API keys and houses the get_narrative_prompt, enforcing strict "hallucination checks" and structuring the AI's final narrative output.
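A minimal sketch of what this configuration layer might look like. The function name `get_narrative_prompt` comes from the README; the prompt wording, the `columns` parameter, and the use of `os.environ` (instead of Streamlit's secrets store, which the real app uses) are illustrative assumptions to keep the sketch framework-free:

```python
import os

# Illustrative: the real app reads this from .streamlit/secrets.toml via st.secrets.
GROQ_API_KEY = os.environ.get("GROQ_API_KEY", "")

def get_narrative_prompt(columns: list[str]) -> str:
    """Build the system prompt for the narrative (consultant) LLM call.

    The hallucination check anchors the model to the actual result columns.
    """
    return (
        "You are an elite strategy consultant (McKinsey/BCG style).\n"
        "Produce an Executive Headline, Evidence & Analysis, and "
        "Strategic Recommendations.\n"
        f"HALLUCINATION CHECK: reference ONLY these columns: {', '.join(columns)}. "
        "Do not mention dates or values that are not in the result table."
    )
```

The key design choice is that the prompt is parameterized by the real schema, so every narrative call is grounded in the columns that actually exist.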
### 3. Data Engine (`src/data_engine.py`)
The deterministic data ingestion layer.
- **load_and_adapt_data**: Uses Streamlit caching (@st.cache_data) to prevent redundant processing. It normalizes headers and intelligently casts data types.
- **get_data_health_report**: Analyzes missing values and injects warnings directly into the LLM context window.
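The header-normalization step inside `load_and_adapt_data` might look like the following sketch. The helper name `normalize_headers` and the exact regex are assumptions (the `@st.cache_data` decorator is omitted so the sketch runs without Streamlit):

```python
import re
import pandas as pd

def normalize_headers(df: pd.DataFrame) -> pd.DataFrame:
    """Lowercase headers, replace symbols/spaces with underscores, collapse repeats."""
    df = df.copy()
    df.columns = [
        re.sub(r"_+", "_", re.sub(r"[^0-9a-z]+", "_", c.strip().lower())).strip("_")
        for c in df.columns
    ]
    return df
```

With this, a messy header like `Sales (USD)` becomes `sales_usd`, matching the column names the LLM is shown in the schema prompt.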
### 4. LLM & Execution Engine (`src/llm_engine.py`)
The core reasoning component powered by Groq (Llama-3.3-70b-versatile).
- **generate_code_prompt**: Provides the AI with the DataFrame's "DNA" (schema) and specific coding constraints.
- **execute_with_self_correction**: Uses a MAX_RETRIES loop to execute the AI-generated code using Python's exec() function in a controlled environment. If an error occurs, the exception is passed back to the LLM for correction.
- **generate_narrative**: Takes the mathematical results of the code execution and passes them to a secondary LLM call to synthesize the final strategic summary.
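The retry loop described above can be sketched as follows. The function name and `MAX_RETRIES` come from the README; the signature is an assumption, and `llm_generate` is a stand-in for the real Groq call:

```python
import traceback

MAX_RETRIES = 3  # illustrative value

def execute_with_self_correction(llm_generate, code: str, env: dict) -> dict:
    """Run AI-generated code; on failure, feed the traceback back for a rewrite.

    `llm_generate(prompt)` is any callable returning new code (here, a stand-in
    for the Groq chat-completion call).
    """
    for _attempt in range(MAX_RETRIES):
        try:
            exec(code, env)  # controlled dictionary environment
            return env
        except Exception:
            tb = traceback.format_exc()
            code = llm_generate(
                "PREVIOUS ATTEMPT FAILED. Rewrite the code to fix this "
                f"specific error:\n{tb}\n\nOriginal code:\n{code}"
            )
    raise RuntimeError("Generated code failed after all retries")
```

Passing the full traceback (not just the exception message) gives the model the failing line and error class, which is usually enough to self-correct a `KeyError` or `SyntaxError`.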
---
## ✨ Deep-Dive: Core Enterprise Features
The true power of this application lies in its "Agentic" backend. Rather than relying on a single LLM call to guess answers, the system utilizes a multi-step, tool-use pipeline.
### 1. 🤖 Autonomous Code Execution
Standard LLMs struggle with precise mathematics on large datasets. This system bypasses that limitation entirely. When a user asks a question (e.g., "Show me the top 5 regions by gross revenue"), the AI does not attempt to calculate the answer. Instead, it acts as a lead data scientist, translating the natural-language query into pandas data-manipulation code and plotly.express charting logic. The application then executes this code in a controlled dictionary environment (`exec(code, env)`), so every calculation is performed deterministically by pandas rather than estimated by the model. (A dictionary namespace limits which objects the generated code can reference, though it is not a hardened security sandbox.)
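A sketch of that controlled execution environment, under the assumption that the generated code leaves its outputs in conventional names (`result_df`, `fig`); the function name and the convention are illustrative, and plotly is omitted here so the sketch depends only on pandas:

```python
import pandas as pd

def run_generated_code(code: str, df: pd.DataFrame) -> dict:
    """Execute generated analysis code with only the names it should need."""
    # The 'sandboxed dictionary environment'; plotly.express would also be
    # injected here in the real app.
    env = {"pd": pd, "df": df}
    exec(code, env)
    # By convention, generated code leaves its outputs in these names:
    return {"result_df": env.get("result_df"), "fig": env.get("fig")}
```

Because the arithmetic happens inside pandas, the numbers in `result_df` are exact regardless of how large the dataset is.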
### 2. 🔄 Self-Healing Logic & Auto-Correction
Generated code is prone to runtime or syntax errors, especially with complex data structures. To ensure enterprise reliability, the llm_engine implements an automatic retry loop (MAX_RETRIES). If the executed Python code fails, the engine catches the exception (e.g., KeyError, SyntaxError) and feeds the exact traceback back to the Groq LLM along with the prompt: "PREVIOUS ATTEMPT FAILED. Rewrite the code to fix this specific error." The AI then debugs its own code and re-submits it. This creates a highly resilient system that rarely fails on the user's end.
### 3. 🧹 Adaptive Data Ingestion
Real-world enterprise data is inherently messy. The data_engine.py pipeline normalizes data before the AI ever sees it:
- Header Standardization: Automatically strips spaces and special characters from column names and lowercases them (e.g., `Sales (USD)` becomes `sales_usd`), preventing `KeyError` exceptions during code generation.
- Currency Cleaning: Uses Regex to detect and strip financial symbols ($, ₹, ,) and forcefully casts string columns to numeric floats so they can be aggregated mathematically.
- Smart Date Parsing: Scans for keywords like "date" or "invoice" and utilizes dynamic datetime parsing, falling back to dayfirst=True if standard ISO formats fail.
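The currency-cleaning and date-parsing steps might be implemented roughly like this; the helper names `clean_currency` and `parse_dates`, and the 50% failure threshold for the day-first fallback, are illustrative assumptions:

```python
import pandas as pd

def clean_currency(series: pd.Series) -> pd.Series:
    """Strip currency symbols and thousands separators, then cast to numeric."""
    cleaned = series.astype(str).str.replace(r"[$₹,]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")  # unparseable values -> NaN

def parse_dates(series: pd.Series) -> pd.Series:
    """Try standard parsing first; fall back to day-first if it mostly fails."""
    parsed = pd.to_datetime(series, errors="coerce")
    if parsed.isna().mean() > 0.5:  # mostly failed: retry assuming DD/MM/YYYY
        parsed = pd.to_datetime(series, errors="coerce", dayfirst=True)
    return parsed
```

Using `errors="coerce"` means a dirty value degrades to `NaN` instead of crashing the whole ingestion pipeline, which the health checks below can then flag.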
### 4. 🛡️ Pre-Execution Health Checks & Guardrails
To prevent AI "hallucinations" (where the model invents data or makes assumptions about missing fields), the system scans the dataset immediately upon upload. The get_data_health_report function evaluates the sparsity of every column. If a column is missing more than 50% of its data, it injects a CRITICAL WARNING directly into the LLM's system prompt (e.g., "Column 'discount_amount' is 85% empty. DO NOT USE IT"). This forces the AI to route its logic around corrupt or missing data, ensuring reliable insights.
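A sketch of `get_data_health_report` under the assumptions stated above (the 50% threshold comes from the text; returning the warnings as one prompt-ready string is an illustrative choice):

```python
import pandas as pd

NULL_THRESHOLD = 0.5  # flag columns that are more than 50% empty

def get_data_health_report(df: pd.DataFrame) -> str:
    """Return warnings about sparse columns, for injection into the LLM prompt."""
    warnings = []
    for col in df.columns:
        null_ratio = df[col].isna().mean()
        if null_ratio > NULL_THRESHOLD:
            warnings.append(
                f"CRITICAL WARNING: column '{col}' is "
                f"{null_ratio:.0%} empty. DO NOT USE IT."
            )
    return "\n".join(warnings)
```

An empty return string means the dataset passed the check and no guardrail text is injected.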
### 5. 👔 Executive Persona Synthesis
Raw numbers are not actionable without context. Once the deterministic Python code generates the metrics (summary_stats) and the filtered dataframe (result_df), a secondary LLM pipeline is triggered. Using a predefined "Elite Strategy Consultant (McKinsey/BCG)" persona housed in config.py, the system forces the AI to output a structured narrative:
1. An Executive Headline
2. Evidence & Analysis
3. Strategic Recommendations
Crucially, the prompt includes strict Hallucination Checks (e.g., "Look at the columns... DO NOT mention '2025' unless explicitly in the table"), anchoring the narrative strictly to the mathematically proven output of the previous step.
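Assembling that secondary call might look like the sketch below. The helper name and message wording are illustrative; the model name is taken from the README, and the Groq client call (the Groq Python SDK is OpenAI-compatible) is shown commented out so the sketch stays self-contained:

```python
def build_narrative_messages(system_prompt: str, summary_stats: str,
                             result_table: str, question: str) -> list[dict]:
    """Bundle the deterministic outputs into a chat payload for the narrative call."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": (
            f"Question: {question}\n\n"
            f"Computed statistics:\n{summary_stats}\n\n"
            f"Result table:\n{result_table}\n\n"
            "Write the structured narrative based ONLY on the data above."
        )},
    ]

# from groq import Groq
# client = Groq()
# resp = client.chat.completions.create(
#     model="llama-3.3-70b-versatile",
#     messages=build_narrative_messages(prompt, stats, table, question),
# )
```

Because the user message contains only machine-computed numbers, the narrative model summarizes evidence rather than inventing it.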
---
## 🚀 Installation & Quick Start
### Prerequisites
- Python 3.9 or higher
- A Groq API Key
### Installation
1. Clone the repository
```bash
git clone https://github.com/yourusername/ai-analyst-enterprise.git
cd ai-analyst-enterprise
```
2. Set up a virtual environment
```bash
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
```
3. Install dependencies
```bash
pip install -r requirements.txt
```
4. Configure Environment Variables
Create a `.streamlit/secrets.toml` file in the project root and add your Groq API key:
```toml
GROQ_API_KEY = "gsk_your_api_key_here"
```
### Running the Application
Start the Streamlit server from the project root with `streamlit run app.py`.