AGI

Figuring out what intelligence actually requires. Not by scaling LLMs. Somewhere between biology and brute-force engineering. ARC-AGI-3 is the test I measure against; below is the engine I've built so far.

ARC-AGI-3

The hardest test of fresh-environment reasoning. Frontier LLMs barely score on it.

Got a solver through all 7 levels of a game cold. No neural network, no hardcoded coordinates, no LLM, runs on a laptop. Whether the general architecture ends up using an LLM somewhere, I don't know yet; I'm reverse-engineering the benchmark to find out. General transfer across games is the open problem I'm on now.

Intelligence Framework

No LLM does the thinking. Pure computational operations. No neural network. No GPU. No training. Give it raw data from a game, medical literature, or a codebase — it figures out what's happening and what to do about it.

Operations grounded in how biology solves problems, engineered further where that's what works. Not neural networks mimicking neurons.

Same architecture. No benchmark tuning. Each row = more biology added.

Date	Version	BALROG Crafter score
May 4	8 operations	3.44%
May 10	10 operations	27.80%
May 10	12 operations	46.87%

13.62× improvement. Zero parameters tuned. 19 of 22 achievements unlocked.
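The improvement factor is the ratio of the last row to the first. A quick check, using only the three scores in the table above (the variable names are illustrative):

```python
# Scores copied from the table above, in percent.
scores = {"8 operations": 3.44, "10 operations": 27.80, "12 operations": 46.87}

first = scores["8 operations"]
middle = scores["10 operations"]
last = scores["12 operations"]

# Overall factor from the first row to the last row.
improvement = last / first

# Per-step factors, to see where most of the gain came from.
step_one = middle / first
step_two = last / middle

print(f"overall: {improvement:.3f}x")
print(f"8 -> 10 operations: {step_one:.2f}x")
print(f"10 -> 12 operations: {step_two:.2f}x")
```

Most of the gain comes from the first step (8 to 10 operations); the second step adds a smaller multiplicative factor on top.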

Published scores from balrogai.com, all using LLMs at runtime:

Agent	Crafter score	Uses LLM?
Grok-4	57.3%	Yes
Gemini 3.1 Pro Thinking	55.0%	Yes
Claude Opus 4.5	49.5%	Yes
This architecture	46.87%	No
Gemini 3.1 Pro	46.8%	Yes
Gemini 3 Flash	45.0%	Yes
GPT-5 minimal-think	39.1%	Yes
DeepSeek-R1	36.4%	Yes

Between GPT-5 minimal-think and Claude Opus 4.5, just above Gemini 3.1 Pro. Without a neural network.

The same operations that learned to survive in a game also replicate landmark scientific discoveries.

5/5 foundational findings from Swanson's 1986–1996 research — the work that founded computational drug repurposing. Zero false positives across 11 drugs. Real associations rank above false ones 94% of the time. End-to-end: raw PubMed text in, known scientific discovery out. No LLM. 4.7 seconds.
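For readers unfamiliar with Swanson's work: his ABC model links a term A to a term C when both co-occur with intermediate terms B but never with each other. The sketch below is a toy version of that idea plus a pairwise ranking check of the kind the 94% figure describes. It is not the engine's code; the corpus, the term names, and the functions `abc_candidates` and `pairwise_ranking_accuracy` are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence(docs):
    """Map each term to the set of terms it shares at least one document with."""
    links = defaultdict(set)
    for terms in docs:
        for a, b in combinations(sorted(set(terms)), 2):
            links[a].add(b)
            links[b].add(a)
    return links

def abc_candidates(docs, a_term):
    """Score C terms that reach A only through intermediate B terms."""
    links = cooccurrence(docs)
    direct = links[a_term]  # the B terms: everything A co-occurs with
    scores = defaultdict(int)
    for b in direct:
        for c in links[b]:
            if c != a_term and c not in direct:
                scores[c] += 1  # one more B term supports the A-C link
    return dict(scores)

def pairwise_ranking_accuracy(scores, true_terms, false_terms):
    """Fraction of (true, false) pairs in which the true term scores higher."""
    pairs = [(t, f) for t in true_terms for f in false_terms]
    wins = sum(scores.get(t, 0) > scores.get(f, 0) for t, f in pairs)
    return wins / len(pairs)

# Toy corpus (hypothetical): each "paper" is reduced to a set of terms.
docs = [
    {"fish oil", "blood viscosity"},
    {"fish oil", "platelet aggregation"},
    {"blood viscosity", "raynaud"},
    {"platelet aggregation", "raynaud"},
    {"blood viscosity", "unrelated condition"},
]

scores = abc_candidates(docs, "fish oil")
accuracy = pairwise_ranking_accuracy(scores, ["raynaud"], ["unrelated condition"])
print(scores, accuracy)
```

In this toy corpus "raynaud" is supported by two intermediate terms and "unrelated condition" by one, so the real association ranks first. Real PubMed text would need term extraction and far stricter filtering than a raw co-occurrence count.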

Substrate	What it does
Crafter	46.87% BALROG, 19/22 achievements, multi-step crafting from raw data
BabyAI	86% success with zero training. RL needs millions of frames.
PubMed	5/5 Swanson, calibration across 11 drugs
Python codebases	Structure discovery without modification
Math	Generates identities from axioms that weren't specified as targets
Social modeling	Infers other agents' goals from observed behavior across 2 substrate types

Same operations. Different data. No retraining.

Can't hallucinate. By construction. Every output traces to observed data. It surfaces what exists or reports nothing.
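One way to picture "traces to observed data or reports nothing" is an output type that cannot be built without pointers back to the observations behind it. This is a minimal sketch under that assumption, not the engine's implementation; `Finding` and `surface` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Finding:
    claim: str
    evidence: tuple  # indices of the observations that support the claim

def surface(query: str, observations: list) -> Optional[Finding]:
    """Return a finding only if some observation literally contains the query."""
    hits = tuple(i for i, obs in enumerate(observations) if query in obs)
    if not hits:
        return None  # nothing observed, so nothing is reported
    return Finding(claim=query, evidence=hits)

# Hypothetical observations from a game episode.
observations = ["wood near table", "stone at north", "table placed"]

found = surface("table", observations)
missing = surface("diamond", observations)
print(found, missing)
```

Every returned `Finding` carries the indices of the observations that produced it, and a query with no support yields `None` rather than a guess.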

Edits its own source code to improve — rolls back automatically if a change makes things worse.
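The edit-then-rollback loop can be sketched as: score the current code, score the candidate, and keep the candidate only if it is not worse. The snippet below is an illustration under stated assumptions, not the actual self-modification mechanism; `evaluate`, `apply_or_rollback`, and the toy `double` function are all hypothetical.

```python
def evaluate(source: str) -> float:
    """Hypothetical fitness: fraction of test cases the defined `double` passes."""
    namespace = {}
    try:
        exec(source, namespace)
        fn = namespace["double"]
        cases = [(1, 2), (3, 6), (0, 0)]
        return sum(fn(x) == y for x, y in cases) / len(cases)
    except Exception:
        return 0.0  # code that crashes counts as the worst possible score

def apply_or_rollback(current: str, candidate: str, score=evaluate):
    """Keep `candidate` unless it scores worse than `current`."""
    if score(candidate) < score(current):
        return current, False  # rolled back
    return candidate, True  # accepted

current = "def double(x):\n    return x + x\n"
worse = "def double(x):\n    return x * x\n"
equal = "def double(x):\n    return 2 * x\n"

kept, accepted = apply_or_rollback(current, worse)
kept_equal, accepted_equal = apply_or_rollback(current, equal)
print(accepted, accepted_equal)
```

The rollback decision is only as good as the scoring function: a change that breaks something the evaluation never exercises would still be accepted.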

Runs on an RTX 4060. No cloud. The speed advantage over LLM agents isn't optimization — it's not needing a neural network.

What it doesn't do yet

3 of 22 Crafter achievements still at 0%. The hardest ones need 4–5 prerequisites simultaneously. Sequential is solved. Conjunctive isn't. Working on it.
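The difference between the two cases, as a sketch: a sequential goal is met if its steps happened in order at any point in the past, while a conjunctive goal needs every condition to hold in the same moment. The step and condition names below are illustrative, not taken from the engine.

```python
def sequential_ready(history: list, chain: list) -> bool:
    """True if the steps in `chain` appear in `history` in order (gaps allowed)."""
    remaining = iter(history)
    # `step in remaining` consumes the iterator, which enforces the ordering.
    return all(step in remaining for step in chain)

def conjunctive_ready(state: set, required: set) -> bool:
    """True only if every required condition holds right now."""
    return required <= state

# Hypothetical episode: what has been done so far, in order.
history = ["collect wood", "place table", "make wood pickaxe", "collect stone"]
chain = ["collect wood", "place table", "make wood pickaxe"]

# Hypothetical snapshot of the current moment and a five-way requirement.
state = {"near table", "has wood", "has coal"}
required = {"near table", "near furnace", "has wood", "has coal", "has iron"}

chain_ok = sequential_ready(history, chain)
all_ok = conjunctive_ready(state, required)
missing = required - state
print(chain_ok, all_ok, sorted(missing))
```

A sequential planner can tick off steps one at a time. The conjunctive case also has to keep earlier conditions true while it pursues the missing ones.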

Tested against one 9B model directly. Frontier models compared via published BALROG scores. Medical discoveries are all previously documented — finding something humans genuinely missed is still open. Can't generate natural language yet.
