About APEX Testing
What is APEX Testing?
Every week there's a new model that's "the best ever." Every provider promises 10x performance at a fraction of the cost. Benchmarks get cherry-picked, demos get curated, and people keep falling for it.
APEX exists because I got tired of the hype. Models get dropped into real codebases with real bugs and real feature requests, and they have to figure it out like a developer would. The suite is 65 tasks across 8 categories, all based on work you'd encounter on the job. You get to see what actually works and what's just marketing.
ELO Rating System
Models are rated using a Bradley-Terry model with Item Response Theory (IRT) adjustments. When two models attempt the same task, the higher-scoring model wins the matchup. ELO updates account for task difficulty, so winning a matchup on a hard task moves a rating more than winning one on an easy task.
All models start at 1500 ELO. Category-specific ratings are tracked independently, so a model can be strong at debugging but weaker at frontend work.
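To make the matchup idea concrete, here is a rough sketch of a difficulty-weighted rating update. It is not the actual APEX implementation: the logistic win probability is the standard Elo / Bradley-Terry form, but `K_FACTOR`, the `difficulty_weight` multiplier, and treating equal scores as a draw are all assumptions made for illustration. The real IRT adjustment may work differently.

```python
START_RATING = 1500.0  # from the text: every model starts at 1500
K_FACTOR = 32.0        # assumption: a conventional Elo step size


def expected_win_probability(rating_a: float, rating_b: float) -> float:
    """Logistic (Elo / Bradley-Terry style) probability that A beats B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_ratings(
    rating_a: float,
    rating_b: float,
    score_a: float,
    score_b: float,
    difficulty_weight: float = 1.0,  # assumption: > 1.0 for harder tasks
) -> tuple[float, float]:
    """Return new (rating_a, rating_b) after both models attempt one task."""
    if score_a > score_b:
        outcome_a = 1.0   # the higher task score wins the matchup
    elif score_a < score_b:
        outcome_a = 0.0
    else:
        outcome_a = 0.5   # assumption: equal scores count as a draw

    expected_a = expected_win_probability(rating_a, rating_b)
    delta = K_FACTOR * difficulty_weight * (outcome_a - expected_a)
    return rating_a + delta, rating_b - delta


# Two fresh models, same matchup result, on an easy task and a hard task.
easy = update_ratings(START_RATING, START_RATING, 82.0, 74.0, difficulty_weight=1.0)
hard = update_ratings(START_RATING, START_RATING, 82.0, 74.0, difficulty_weight=2.0)
print(easy)  # (1516.0, 1484.0)
print(hard)  # (1532.0, 1468.0)
```

Category-specific ratings could be kept the same way by storing one rating per (model, category) pair and only updating the pair that matches the task's category.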
Scoring Weights
| Criterion | Weight |
|---|---|
| Correctness | 40% |
| Completeness | 25% |
| Code Quality | 20% |
| Efficiency | 15% |
Overall score = correctness × 0.40 + completeness × 0.25 + code_quality × 0.20 + efficiency × 0.15
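As a quick worked example, the formula above can be written as a small function. The weights come straight from the table; the 0 to 100 scale for each criterion and the function name `overall_score` are assumptions for illustration.

```python
# Weights from the scoring table above.
WEIGHTS = {
    "correctness": 0.40,
    "completeness": 0.25,
    "code_quality": 0.20,
    "efficiency": 0.15,
}


def overall_score(scores: dict[str, float]) -> float:
    """Weighted sum of the four criteria (assumed to share one scale, e.g. 0-100)."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return sum(scores[name] * weight for name, weight in WEIGHTS.items())


example = {
    "correctness": 90,
    "completeness": 80,
    "code_quality": 70,
    "efficiency": 60,
}
# 90*0.40 + 80*0.25 + 70*0.20 + 60*0.15 = 36 + 20 + 14 + 9
print(round(overall_score(example), 2))  # 79.0
```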
Multi-Judge Evaluation
Multiple SOTA models grade each submission independently, and their scores are then aggregated for consistency. But that's only half of it. I go through every single output myself to make sure no model got screwed by a timeout, infra hiccup, or bad luck. If something went wrong that wasn't the model's fault, I reset the run and start it over.
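The aggregation method isn't spelled out here, so the following is only one plausible sketch: take the median of the judges' scores so a single outlier judge can't swing the result, and flag submissions where judges disagree a lot so they get a closer manual look. The median, the `max_spread` threshold, and the function name are all assumptions, not the actual APEX pipeline.

```python
from statistics import median


def aggregate_judge_scores(
    judge_scores: list[float],
    max_spread: float = 15.0,  # assumption: disagreement threshold in score points
) -> tuple[float, bool]:
    """Return (aggregated score, needs_review) for one submission."""
    if not judge_scores:
        raise ValueError("at least one judge score is required")
    spread = max(judge_scores) - min(judge_scores)
    return median(judge_scores), spread > max_spread


print(aggregate_judge_scores([80, 85, 90]))  # (85, False)
print(aggregate_judge_scores([40, 85, 90]))  # (85, True)
```

In the second call one judge is far off from the other two, so the median stays at 85 but the submission is flagged for review.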
Task Categories
| Category | Focus |
|---|---|
| Frontend | React, Next.js, CSS, accessibility, performance |
| Backend | APIs, databases, queues, caching, auth |
| Full-Stack | End-to-end features spanning client and server |
| Debugging | Race conditions, memory leaks, security vulns |
| Refactoring | Code cleanup, modularization, pattern migration |
| Code Review | Finding bugs, writing tests, security audits |
| From Scratch | Building new projects from requirements |
| Multi-Language | Cross-language ports and polyglot tasks |
Built by
APEX Testing is a solo project built, funded, and maintained by HauhauCS. Every benchmark run costs real money out of my own pocket (that's why total cost is on the homepage). Got questions, feedback, or want to contribute? Reach out on Discord.
Discord: hauhau