APEX

About APEX Testing

What is APEX Testing?

Every week there's a new model that's "the best ever." Every provider promises 10x performance at a fraction of the cost. Benchmarks get cherry-picked, demos get curated, and people keep falling for it.

APEX exists because I got tired of the hype. Models get dropped into real codebases with real bugs and real feature requests, and they have to figure it out like a developer would. 65 tasks across 8 categories, all based on work you'd actually encounter on the job. You get to see what actually works and what's just marketing.

ELO Rating System

Models are rated using a Bradley-Terry model with Item Response Theory (IRT) adjustments. When two models attempt the same task, the higher-scoring model wins the matchup. ELO updates account for task difficulty, so winning a matchup on a hard task contributes more than winning one on an easy task.

All models start at 1500 ELO. Category-specific ratings are tracked independently, so a model can be strong at debugging but weaker at frontend work.
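
As a rough illustration of how a difficulty-aware update can work, here is a minimal Python sketch. It is not the actual APEX implementation: it uses a plain Elo-style update (whose expected-score formula is a Bradley-Terry win probability on a log scale), and the K-factor of 32 and the `difficulty` multiplier are assumptions standing in for the IRT adjustment.

```python
START_RATING = 1500.0  # from the text: all models start at 1500


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(
    rating_a: float,
    rating_b: float,
    score_a: float,
    score_b: float,
    difficulty: float = 1.0,  # hypothetical multiplier: > 1 for harder tasks
    k: float = 32.0,          # assumed K-factor, not taken from APEX
) -> tuple[float, float]:
    """Return new (rating_a, rating_b) after both models attempt one task."""
    # The higher task score wins the matchup; equal scores count as a draw.
    if score_a > score_b:
        outcome = 1.0
    elif score_a < score_b:
        outcome = 0.0
    else:
        outcome = 0.5
    delta = k * difficulty * (outcome - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Two fresh models; A outscores B on an average task and on a harder one.
easy = update_ratings(START_RATING, START_RATING, 80, 60)
hard = update_ratings(START_RATING, START_RATING, 80, 60, difficulty=1.5)
print(easy, hard)
```

With equal starting ratings, the same win moves both models further apart when the assumed `difficulty` multiplier is larger, which is the behavior described above. Category-specific ratings could be tracked by keeping one such rating per category for each model.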

Scoring Weights

Criterion      Weight
Correctness    40%
Completeness   25%
Code Quality   20%
Efficiency     15%

Overall score = correctness × 0.40 + completeness × 0.25 + code_quality × 0.20 + efficiency × 0.15
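
The formula above translates directly into code. This is a small sketch, assuming each criterion is scored on a 0-100 scale (the scale and the function name `overall_score` are my assumptions, not something stated on this page):

```python
# Weights taken from the Scoring Weights table.
WEIGHTS = {
    "correctness": 0.40,
    "completeness": 0.25,
    "code_quality": 0.20,
    "efficiency": 0.15,
}


def overall_score(scores: dict[str, float]) -> float:
    """Weighted sum of the four criterion scores."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return sum(weight * scores[name] for name, weight in WEIGHTS.items())


# Hypothetical submission, for illustration only.
example = {
    "correctness": 90,
    "completeness": 80,
    "code_quality": 70,
    "efficiency": 60,
}
print(round(overall_score(example), 2))  # 36 + 20 + 14 + 9, about 79
```

Because the weights sum to 1, the overall score stays on the same scale as the individual criteria.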

Multi-Judge Evaluation

Multiple SOTA models independently score each submission, and their scores are then aggregated for consistency. But that's only half of it. I go through every single output myself to make sure no model got screwed by a timeout, infra hiccup, or bad luck. If something went wrong that wasn't the model's fault, I reset the run and start it over.
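
To make the aggregation step concrete, here is one possible way to combine judge scores. The page doesn't say which aggregation APEX uses, so the median below is purely an assumption, chosen because a single outlier judge can't drag it far. The function name and the judge scores are hypothetical.

```python
from statistics import median


def aggregate_judge_scores(judge_scores: list[float]) -> float:
    """Combine independent judge scores into one score (assumed: median)."""
    if not judge_scores:
        raise ValueError("need at least one judge score")
    return float(median(judge_scores))


# Three hypothetical judges score the same submission; one is an outlier.
print(aggregate_judge_scores([78, 82, 95]))  # 82.0
```

A mean would work too, but it lets one unusually harsh or generous judge shift the result more than the median does.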

Task Categories

Frontend

React, Next.js, CSS, accessibility, performance

Backend

APIs, databases, queues, caching, auth

Full-Stack

End-to-end features spanning client and server

Debugging

Race conditions, memory leaks, security vulns

Refactoring

Code cleanup, modularization, pattern migration

Code Review

Finding bugs, writing tests, security audits

From Scratch

Building new projects from requirements

Multi-Language

Cross-language ports and polyglot tasks

Built by

APEX Testing is a solo project built, funded, and maintained by HauhauCS. Every benchmark run costs real money out of my own pocket (that's why total cost is on the homepage). Have questions or feedback, or want to contribute? Reach out on Discord.

Discord: hauhau