AGI

Figuring out what intelligence actually requires. Not by scaling LLMs. Somewhere between biology and brute-force engineering. ARC-AGI-3 is the test I measure against; below is the engine I've built so far.

ARC-AGI-3

The hardest test of fresh-environment reasoning. Frontier LLMs barely score on it.

Got a solver through all 7 levels of a game cold. No neural network, no hardcoded coordinates, no LLM, runs on a laptop. Whether the general architecture ends up using an LLM somewhere, I don't know yet; I'm reverse-engineering the benchmark to find out. General transfer across games is the open problem I'm on now.

Intelligence Framework

No LLM does the thinking. Pure computational operations. No neural network. No GPU. No training. Give it raw data from a game, medical literature, or a codebase — it figures out what's happening and what to do about it.

Operations grounded in how biology solves problems, engineered further where that's what works. Not neural networks mimicking neurons.

Same architecture. No benchmark tuning. Each row = more biology added.

| Date | Version | BALROG Crafter score |
|---|---|---|
| May 4 | 8 operations | 3.44% |
| May 10 | 10 operations | 27.80% |
| May 10 | 12 operations | 46.87% |

13.62× improvement. Zero parameters tuned. 19 of 22 achievements unlocked.
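
The improvement figure is the ratio of the last row to the first. A quick check, using only the numbers reported above (the variable names are illustrative):

```python
# Scores copied from the version table above (BALROG Crafter, percent).
scores = {"8 operations": 3.44, "10 operations": 27.80, "12 operations": 46.87}

first = scores["8 operations"]
last = scores["12 operations"]

# Ratio of the final score to the initial score.
improvement = last / first
print(f"{improvement:.3f}x")

# Share of the 22 Crafter achievements unlocked.
unlocked_fraction = 19 / 22
print(f"{unlocked_fraction:.1%}")
```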

Published scores from balrogai.com, all using LLMs at runtime:

| Agent | Crafter score | Uses LLM? |
|---|---|---|
| Grok-4 | 57.3% | Yes |
| Gemini 3.1 Pro Thinking | 55.0% | Yes |
| Claude Opus 4.5 | 49.5% | Yes |
| This architecture | 46.87% | No |
| Gemini 3.1 Pro | 46.8% | Yes |
| Gemini 3 Flash | 45.0% | Yes |
| GPT-5 minimal-think | 39.1% | Yes |
| DeepSeek-R1 | 36.4% | Yes |

Just behind Claude Opus 4.5, level with Gemini 3.1 Pro, well ahead of GPT-5 minimal-think. Without a neural network.

The same operations that learned to survive in a game also replicate landmark scientific discoveries.

5/5 foundational findings from Swanson's 1986–1996 research — the work that founded computational drug repurposing. Zero false positives across 11 drugs. Real associations rank above false ones 94% of the time. End-to-end: raw PubMed text in, known scientific discovery out. No LLM. 4.7 seconds.
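
One way to read the 94% figure is as a pairwise ranking accuracy: over every (real, false) pair of associations, how often the real one gets the higher score. That reading is an assumption, the scores below are made up for illustration, and `pairwise_ranking_accuracy` is a hypothetical helper, not part of the engine.

```python
from itertools import product


def pairwise_ranking_accuracy(real_scores, false_scores):
    """Fraction of (real, false) pairs where the real association scores higher.

    Ties count as half a win.
    """
    pairs = list(product(real_scores, false_scores))
    if not pairs:
        raise ValueError("need at least one real and one false score")
    wins = sum(1.0 if r > f else 0.5 if r == f else 0.0 for r, f in pairs)
    return wins / len(pairs)


# Hypothetical scores, for illustration only.
real = [0.9, 0.8, 0.4]
false = [0.5, 0.3, 0.2, 0.1]

acc = pairwise_ranking_accuracy(real, false)
print(f"{acc:.3f}")  # 11 of the 12 pairs are ordered correctly
```

Counting pairs rather than picking a threshold keeps the number independent of any cutoff: it only asks whether real associations tend to sit above false ones.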

| Substrate | What it does |
|---|---|
| Crafter | 46.87% BALROG, 19/22 achievements, multi-step crafting from raw data |
| BabyAI | 86% success with zero training. RL needs millions of frames. |
| PubMed | 5/5 Swanson, calibration across 11 drugs |
| Python codebases | Structure discovery without modification |
| Math | Generates identities from axioms that weren't specified as targets |
| Social modeling | Infers other agents' goals from observed behavior across 2 substrate types |

Same operations. Different data. No retraining.

Can't hallucinate. By construction. Every output traces to observed data. It surfaces what exists or reports nothing.

Edits its own source code to improve — rolls back automatically if a change makes things worse.
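
The edit-then-roll-back loop can be sketched in a few lines. This is a minimal illustration, not the engine's actual mechanism: `try_edit` and `toy_evaluate` are hypothetical names, and the "benchmark" here is just a `SCORE` variable the module reports about itself.

```python
import runpy
import shutil
import tempfile
from pathlib import Path


def try_edit(path, new_source, evaluate):
    """Write new_source to path; restore the old file if the score gets worse."""
    path = Path(path)
    backup = path.with_name(path.name + ".bak")
    baseline = evaluate(path)
    shutil.copy2(path, backup)  # snapshot before touching anything
    try:
        path.write_text(new_source)
        candidate = evaluate(path)
    except Exception:
        candidate = float("-inf")  # an edit that crashes counts as worse
    kept = candidate >= baseline
    if not kept:
        shutil.copy2(backup, path)  # roll back to the snapshot
    backup.unlink()
    return kept


def toy_evaluate(path):
    # Hypothetical stand-in for a real benchmark run.
    return runpy.run_path(str(path))["SCORE"]


with tempfile.TemporaryDirectory() as tmp:
    module = Path(tmp) / "engine.py"
    module.write_text("SCORE = 10\n")

    kept_better = try_edit(module, "SCORE = 12\n", toy_evaluate)
    after_better = module.read_text()

    kept_worse = try_edit(module, "SCORE = 5\n", toy_evaluate)
    after_worse = module.read_text()

    kept_broken = try_edit(module, "SCORE = (\n", toy_evaluate)
    after_broken = module.read_text()

print(kept_better, kept_worse, kept_broken)
```

Treating a crash as the worst possible score means a syntactically broken edit is rolled back by the same path as a merely worse one.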

Runs on an RTX 4060. No cloud. The speed advantage over LLM agents isn't optimization — it's not needing a neural network.

What it doesn't do yet.

3 of 22 Crafter achievements still at 0%. The hardest ones need 4–5 prerequisites simultaneously. Sequential is solved. Conjunctive isn't. Working on it.
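
The gap between sequential and conjunctive can be made concrete with a toy check. This is a sketch under assumed definitions: "sequential" means each prerequisite was met at some point, in order; "conjunctive" means all of them hold in the same step. The condition names and traces are hypothetical, not taken from Crafter or from the engine.

```python
def met_sequentially(trace, conditions):
    """True if each condition held at some step, in the listed order."""
    i = 0
    for state in trace:
        # Advance past every condition this step satisfies, in order.
        while i < len(conditions) and conditions[i] in state:
            i += 1
    return i == len(conditions)


def met_conjunctively(trace, conditions):
    """True if at least one single step had every condition holding at once."""
    needed = set(conditions)
    return any(needed <= state for state in trace)


# Hypothetical prerequisites for one hard achievement.
conditions = ["has_wood", "has_coal", "has_iron", "near_table"]

# Each prerequisite is reached, but never all in the same step.
one_at_a_time = [{"has_wood"}, {"has_coal"}, {"has_iron"}, {"near_table"}]

# The final step holds everything together.
all_together = one_at_a_time + [{"has_wood", "has_coal", "has_iron", "near_table"}]

print(met_sequentially(one_at_a_time, conditions))   # True
print(met_conjunctively(one_at_a_time, conditions))  # False
print(met_conjunctively(all_together, conditions))   # True
```

An agent can pass the first check on every run and still never pass the second, which is the failure mode described above.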

Tested against one 9B model directly. Frontier models compared via published BALROG scores. Medical discoveries are all previously documented — finding something humans genuinely missed is still open. Can't generate natural language yet.
