AGI
Figuring out what intelligence actually requires. Not by scaling LLMs. Somewhere between biology and brute-force engineering. ARC-AGI-3 is the test I measure against; below is the engine I've built so far.
ARC-AGI-3 is the hardest test of fresh-environment reasoning. Frontier LLMs barely score on it.
Got a solver through all 7 levels of a game cold. No neural network, no hardcoded coordinates, no LLM, runs on a laptop. Whether the general architecture ends up using an LLM somewhere, I don't know yet; I'm reverse-engineering the benchmark to find out. General transfer across games is the open problem I'm on now.
No LLM does the thinking. Pure computational operations. No neural network. No GPU. No training. Give it raw data from a game, medical literature, or a codebase — it figures out what's happening and what to do about it.
Operations grounded in how biology solves problems, engineered further where that's what works. Not neural networks mimicking neurons.
Same architecture. No benchmark tuning. Each row = more biology added.
| Date | Version | BALROG Crafter score |
|---|---|---|
| May 4 | 8 operations | 3.44% |
| May 10 | 10 operations | 27.80% |
| May 10 | 12 operations | 46.87% |
13.62× improvement. Zero parameters tuned. 19 of 22 achievements unlocked.
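The headline numbers can be recomputed from the table. A minimal sketch; the variable names are illustrative and the scores are copied from the rows above.

```python
# BALROG Crafter scores in percent, copied from the table above.
scores = {"8 operations": 3.44, "10 operations": 27.80, "12 operations": 46.87}

baseline = scores["8 operations"]
for version, score in scores.items():
    # Ratio of each version's score to the first row.
    print(f"{version}: {score:.2f}% ({score / baseline:.2f}x the first row)")

fold_improvement = scores["12 operations"] / baseline  # last row vs. first row
achievement_coverage = 19 / 22                         # achievements unlocked
```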
Published scores from balrogai.com, all using LLMs at runtime:
| Agent | Crafter score | Uses LLM? |
|---|---|---|
| Grok-4 | 57.3% | Yes |
| Gemini 3.1 Pro Thinking | 55.0% | Yes |
| Claude Opus 4.5 | 49.5% | Yes |
| This architecture | 46.87% | No |
| Gemini 3.1 Pro | 46.8% | Yes |
| Gemini 3 Flash | 45.0% | Yes |
| GPT-5 minimal-think | 39.1% | Yes |
| DeepSeek-R1 | 36.4% | Yes |
Between GPT-5 minimal-think (39.1%) and Claude Opus 4.5 (49.5%), level with Gemini 3.1 Pro (46.8%). Without a neural network.
The same operations that learned to survive in a game also replicate landmark scientific discoveries.
5/5 foundational findings from Swanson's 1986–1996 research — the work that founded computational drug repurposing. Zero false positives across 11 drugs. Real associations rank above false ones 94% of the time. End-to-end: raw PubMed text in, known scientific discovery out. No LLM. 4.7 seconds.
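For context, Swanson's findings are usually described as an A-B-C pattern: A and B co-occur in one set of papers, B and C in another, and A and C never appear together. The sketch below illustrates that pattern on invented toy documents. It also shows one way to read "real associations rank above false ones 94% of the time" as a pairwise ranking accuracy. Both are assumptions for illustration, not this engine's implementation; the function names and toy data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(docs):
    """Map each term to the set of terms it shares a document with."""
    links = defaultdict(set)
    for terms in docs:
        for a, b in combinations(sorted(set(terms)), 2):
            links[a].add(b)
            links[b].add(a)
    return links


def abc_candidates(links, source):
    """Return {C: bridging B terms} for terms C never seen directly with `source`."""
    direct = links[source]
    candidates = defaultdict(set)
    for b in direct:
        for c in links[b]:
            if c != source and c not in direct:
                candidates[c].add(b)
    return dict(candidates)


def pairwise_ranking_accuracy(true_scores, false_scores):
    """Fraction of (true, false) pairs where the true item scores higher; ties count half."""
    wins = sum(
        1.0 if t > f else 0.5 if t == f else 0.0
        for t in true_scores
        for f in false_scores
    )
    return wins / (len(true_scores) * len(false_scores))


# Toy documents, invented for illustration; each is reduced to a set of terms.
docs = [
    {"fish oil", "blood viscosity"},
    {"fish oil", "platelet aggregation"},
    {"blood viscosity", "raynaud"},
    {"platelet aggregation", "raynaud"},
    {"aspirin", "headache"},
]
links = cooccurrence(docs)
found = abc_candidates(links, "fish oil")
accuracy = pairwise_ranking_accuracy([0.9, 0.8, 0.4], [0.5, 0.1])
```

In this toy corpus, `found` links "fish oil" to "raynaud" through two bridging terms, even though no single document mentions both.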
| Substrate | What it does |
|---|---|
| Crafter | 46.87% BALROG, 19/22 achievements, multi-step crafting from raw data |
| BabyAI | 86% success with zero training. RL needs millions of frames. |
| PubMed | 5/5 Swanson, calibration across 11 drugs |
| Python codebases | Structure discovery without modification |
| Math | Generates identities from axioms that weren't specified as targets |
| Social modeling | Infers other agents' goals from observed behavior across 2 substrate types |
Same operations. Different data. No retraining.
Can't hallucinate. By construction. Every output traces to observed data. It surfaces what exists or reports nothing.
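One way to get "every output traces to observed data" by construction is to make it impossible to build an output without evidence attached. A minimal sketch, assuming a hypothetical `Observation`/`Finding` pair; none of these names come from the engine itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    source_id: str  # hypothetical identifier, e.g. a frame index or document id
    content: str


@dataclass(frozen=True)
class Finding:
    statement: str
    evidence: tuple  # the observations this finding was derived from

    def __post_init__(self):
        # Refuse to construct a finding that cites nothing.
        if not self.evidence:
            raise ValueError("a finding must cite at least one observation")


def surface(observations, keyword):
    """Return a Finding backed by matching observations, or None if nothing matches."""
    matches = tuple(o for o in observations if keyword in o.content)
    if not matches:
        return None  # report nothing rather than guess
    return Finding(f"'{keyword}' observed {len(matches)} time(s)", matches)
```

The design choice is that the empty case returns `None` instead of a low-confidence answer, so there is no code path that emits an unsupported statement.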
Edits its own source code to improve — rolls back automatically if a change makes things worse.
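The edit-then-rollback loop can be sketched as follows. This is an illustration under stated assumptions, not the engine's actual mechanism: `try_edit` and the `evaluate` callback are hypothetical, and "worse" is taken to mean a strictly lower score.

```python
from pathlib import Path


def try_edit(path, new_source, evaluate):
    """Write `new_source` to `path`; restore the original if the score drops.

    `evaluate` is a caller-supplied function (hypothetical) that takes the
    file path and returns a number where higher is better.
    Returns True if the edit was kept, False if it was rolled back.
    """
    path = Path(path)
    original = path.read_text()
    baseline = evaluate(path)

    path.write_text(new_source)
    try:
        score = evaluate(path)
    except Exception:
        # An edit that breaks evaluation counts as worse than any baseline.
        score = float("-inf")

    if score < baseline:
        path.write_text(original)  # roll back
        return False
    return True
```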
Runs on an RTX 4060. No cloud. The speed advantage over LLM agents isn't optimization — it's not needing a neural network.
What it doesn't do yet.
3 of 22 Crafter achievements still at 0%. The hardest ones need 4–5 prerequisites simultaneously. Sequential is solved. Conjunctive isn't. Working on it.
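The difference between the two cases, as a minimal sketch. The step and requirement names are hypothetical, loosely modeled on Crafter-style crafting, and are not a statement of which achievements are missing.

```python
def next_sequential_step(chain, done):
    """Sequential: each step needs only the one before it, so walk the chain in order."""
    for step in chain:
        if step not in done:
            return step
    return None  # chain complete


def conjunctive_ready(requirements, state):
    """Conjunctive: every prerequisite must hold in the same state at the same time."""
    return all(state.get(name, 0) >= amount for name, amount in requirements.items())


# Hypothetical chain: one prerequisite at a time.
chain = ["collect_wood", "place_table", "make_wood_pickaxe"]
step = next_sequential_step(chain, done={"collect_wood"})

# Hypothetical goal with five prerequisites that must all be true at once.
requirements = {"near_table": 1, "near_furnace": 1, "wood": 1, "coal": 1, "iron": 1}
partial = {"near_table": 1, "wood": 1, "coal": 1, "iron": 1}
complete = {**partial, "near_furnace": 1}
```

A greedy agent can make progress on the sequential chain one step at a time. The conjunctive goal gives no partial credit: four of five prerequisites held at once is the same as none.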
Tested against one 9B model directly. Frontier models compared via published BALROG scores. Medical discoveries are all previously documented — finding something humans genuinely missed is still open. Can't generate natural language yet.