Benchmarking platform for evaluating how effectively large language models explore and reason about data. Features a live leaderboard, task registry, CLI documentation, and agent performance visualizations.

Build a clean, research-grade web platform that presents LLM benchmark results in a way researchers and practitioners can actually use. The challenge was designing an interface that communicates complex model comparison data clearly — with a live leaderboard, task browsing, and citation tooling — while keeping the experience fast and accessible.
Developed a full-stack benchmark website using Next.js 16 with App Router that surfaces live leaderboard rankings, a browsable task registry, and animated agent performance visualizations. Task data is pulled dynamically from GitHub and presented in a minimal, research-grade UI.
Built an MDX-powered documentation system integrated directly into the app, covering CLI installation, usage guides, and first-steps walkthroughs — with consistent navigation across the entire site.
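A minimal sketch of how MDX can be wired into a Next.js App Router project using the official `@next/mdx` plugin; the exact file name and plugin options shown here are illustrative, not the project's actual config:

```typescript
// next.config.ts — illustrative @next/mdx setup (TypeScript config
// files are supported in recent Next.js versions).
import type { NextConfig } from 'next'
import createMDX from '@next/mdx'

const nextConfig: NextConfig = {
  // Let .md/.mdx files act as routable pages alongside .tsx pages.
  pageExtensions: ['js', 'jsx', 'md', 'mdx', 'ts', 'tsx'],
}

const withMDX = createMDX({
  // remark/rehype plugins (e.g. for syntax highlighting) would go here.
})

export default withMDX(nextConfig)
```

With this in place, a file like `app/docs/installation/page.mdx` renders as a normal route and can share the site's layout and navigation components.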
Built an animated agent performance chart that ranks models by task resolution success rate. Each leaderboard row tracks model, organization, score, task count, and version, with bars animating smoothly on load via GSAP.
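The ranking step behind a chart like this is easy to sketch. The `LeaderboardEntry` shape and helper below are hypothetical, a sort-and-rank pass whose output would drive both the table and the animated bars:

```typescript
// Hypothetical leaderboard row shape; field names are illustrative.
interface LeaderboardEntry {
  model: string
  organization: string
  score: number // task resolution success rate, 0–100
  taskCount: number
  version: string
}

// Sort entries by success rate (descending) and attach a 1-based rank.
// The returned order drives the leaderboard table; on mount, GSAP would
// tween each bar from width 0 to a width proportional to `score`.
function rankEntries(
  entries: LeaderboardEntry[],
): Array<LeaderboardEntry & { rank: number }> {
  return [...entries]
    .sort((a, b) => b.score - a.score)
    .map((entry, i) => ({ ...entry, rank: i + 1 }))
}
```

The animation itself could then use GSAP's real `gsap.fromTo` with a `stagger` option to grow each bar in sequence; the selectors and timing would be specific to the component.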
Implemented dynamic task fetching from GitHub to power a filterable task browser. Tasks are organized by domain, with featured tasks surfaced on the homepage.
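A sketch of the fetch-and-group flow, under assumed field names and a placeholder repository URL (the real registry layout may differ):

```typescript
// Hypothetical task record as it might appear in a JSON registry in
// the eda-bench GitHub repo; the field names are assumptions.
interface Task {
  id: string
  domain: string
  featured?: boolean
}

// Group a flat task list by domain for the filterable browser.
function groupByDomain(tasks: Task[]): Map<string, Task[]> {
  const groups = new Map<string, Task[]>()
  for (const task of tasks) {
    const bucket = groups.get(task.domain) ?? []
    bucket.push(task)
    groups.set(task.domain, bucket)
  }
  return groups
}

// A server component could pull the registry from raw.githubusercontent.com.
// In Next.js, passing { next: { revalidate: 3600 } } as a fetch option
// would cache the result and re-pull it hourly.
async function fetchTasks(): Promise<Task[]> {
  const res = await fetch(
    'https://raw.githubusercontent.com/OWNER/REPO/main/tasks.json', // placeholder URL
  )
  if (!res.ok) throw new Error(`GitHub fetch failed: ${res.status}`)
  return res.json()
}
```

Featured tasks for the homepage fall out of the same data: filter on the `featured` flag before grouping.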
Gave the MDX docs a persistent sidebar spanning installation, CLI reference, and first-steps guides, styled to match the rest of the site's design system.
Implemented one-click copy for both plain text and BibTeX citation formats with toast feedback, supporting proper academic attribution.
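The BibTeX side of this can be sketched as a pure formatter plus a thin copy wrapper; the `Citation` fields and `showToast` callback are placeholders for whatever shapes the site actually uses:

```typescript
// Hypothetical citation metadata; fields are illustrative.
interface Citation {
  key: string
  title: string
  author: string
  year: number
  url: string
}

// Render a citation as a BibTeX @misc entry.
function toBibtex(c: Citation): string {
  return [
    `@misc{${c.key},`,
    `  title  = {${c.title}},`,
    `  author = {${c.author}},`,
    `  year   = {${c.year}},`,
    `  url    = {${c.url}}`,
    `}`,
  ].join('\n')
}

// Copy text via the async Clipboard API, then surface a toast either way.
// `showToast` stands in for the site's actual toast library.
async function copyCitation(
  text: string,
  showToast: (msg: string) => void,
): Promise<void> {
  try {
    await navigator.clipboard.writeText(text)
    showToast('Citation copied')
  } catch {
    showToast('Copy failed')
  }
}
```

`navigator.clipboard.writeText` is the standard async Clipboard API; it requires a secure context, which is why the failure path also needs a toast.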
Delivered a polished platform that makes complex benchmark data easy to explore and cite, giving the eda-bench project a credible public face and a solid foundation for community growth.