hallucination-elimination-benchmark
The hallucination-elimination-benchmark is a multi-tier benchmark for measuring large language model (LLM) hallucinations, with the aim of eliminating them. It evaluates models such as Claude 4.6, GPT-5.2, Mistral 7B, and Gemini 2.5 Pro on a dataset of 222 adversarial question-answer pairs about Ancient Rome. Reported accuracy is 95-100%, and the benchmark includes a novel topological paradox detection component. It is aimed at developers and researchers who want to assess and improve the reliability of LLMs.
This agent targets LLM hallucinations, which produce inaccurate and unreliable output. Because the benchmark is model-agnostic, it can evaluate any LLM automatically, saving developers the time and effort that manual evaluation requires.
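The model-agnostic evaluation described above can be sketched as a loop that treats any model as a plain "prompt in, answer out" function and scores it against question-answer pairs. This is a minimal illustration, not the project's code: the names `QAPair`, `normalize`, `evaluate`, and `always_augustus` are hypothetical, exact-match scoring after normalization is an assumed metric, and the two sample questions are illustrative rather than taken from the 222-pair dataset.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class QAPair:
    """One question with its reference answer (hypothetical schema)."""
    question: str
    answer: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are not scored as errors."""
    return " ".join(text.lower().split())


def evaluate(model: Callable[[str], str], dataset: Iterable[QAPair]) -> float:
    """Return the fraction of questions answered with a normalized exact match.

    `model` is any callable mapping a prompt to a completion, which keeps the
    harness model-agnostic: each provider only needs a small adapter function.
    Exact-match scoring is an assumption made for this sketch.
    """
    pairs = list(dataset)
    if not pairs:
        raise ValueError("dataset is empty")
    correct = sum(normalize(model(p.question)) == normalize(p.answer) for p in pairs)
    return correct / len(pairs)


# Illustrative questions only; not drawn from the benchmark's dataset.
dataset = [
    QAPair("Who was the first Roman emperor?", "Augustus"),
    QAPair("In what year was Julius Caesar assassinated?", "44 BC"),
]


def always_augustus(prompt: str) -> str:
    """Stand-in model that ignores the prompt; replace with a real API adapter."""
    return "Augustus"


print(evaluate(always_augustus, dataset))  # 0.5
```

Passing the model as a callable keeps provider-specific code (API clients, authentication, retries) outside the scoring logic, so adding a new model does not change how answers are graded.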
IDENTITY
Identity inferred from code signals. No PROVENANCE.yml found.