What Good Code Has That AI Code Often Doesn't
If you use AI coding tools long enough, you start to recognize a certain smell.
The code passes tests. It satisfies the linter. It may even look tidy at first glance. But it has a sameness to it. Too much repetition. Too little shape. Functions that feel assembled rather than designed. You can often tell the code works before you can say why it feels wrong.
That is the gap Eigenhelm is trying to measure.
The premise is simple. A lot of what experienced engineers call code quality is not local and syntactic. It is global and structural. Linters are good at catching unused imports, missing types, or style drift. They are not built to detect boilerplate gravity, flat abstractions, or the odd structural uniformity that AI code often has.
So the question behind Eigenhelm was straightforward: can that broader sense of quality be measured as distance from a learned distribution of strong code?
How Eigenhelm Sees Structure
Eigenhelm begins by parsing source code with tree-sitter and extracting a 69-dimensional feature vector from each function or class.
Those dimensions come from three places.
Halstead complexity contributes 3 dimensions: Volume, Difficulty, and Effort. These are old metrics, but still useful for capturing vocabulary richness and cognitive load.
Cyclomatic complexity contributes 2 more: McCabe's count and a density term, so a short knot and a long knot do not look the same.
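As a sketch, the first five dimensions can be computed from standard textbook definitions. The Halstead formulas below are the classic ones; normalizing McCabe's count by lines of code is an assumption about what the density term looks like, not a confirmed detail of Eigenhelm's implementation.

```python
import math

def halstead(n1, n2, N1, N2):
    """Classic Halstead metrics from operator/operand counts.
    n1/n2: distinct operators/operands; N1/N2: total occurrences."""
    vocabulary = n1 + n2
    length = N1 + N2
    volume = length * math.log2(vocabulary)   # vocabulary richness
    difficulty = (n1 / 2) * (N2 / n2)         # cognitive-load proxy
    effort = volume * difficulty
    return volume, difficulty, effort

def cyclomatic_density(decision_points, lines_of_code):
    """McCabe complexity plus a length-normalized density term,
    so a short knot and a long knot score differently."""
    mccabe = decision_points + 1
    return mccabe, mccabe / max(lines_of_code, 1)
```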
Weisfeiler-Leman AST hashing contributes the other 64. This is the unusual part. Each AST node is hashed by type, then iteratively re-labeled based on the labels of its children:

$$\ell^{(t+1)}(v) = \mathrm{hash}\!\left(\ell^{(t)}(v),\ \{\ell^{(t)}(u) : u \in \mathrm{children}(v)\}\right)$$
After a few rounds, all node labels are binned into a 64-slot histogram and normalized into a probability distribution. What drops out is a structural fingerprint of the code's shape: branching, nesting, subtree diversity, repetition.
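A minimal sketch of the relabel-and-bin idea, using a toy tuple tree of `(type_name, children)` in place of a real tree-sitter AST. The choice of hash function, the round count, and the modulo binning are assumptions for illustration, not Eigenhelm's exact scheme.

```python
import hashlib

def wl_fingerprint(node, rounds=3, bins=64):
    """Weisfeiler-Leman style structural fingerprint of a tree.
    `node` is (type_name, [children]); returns a normalized histogram."""
    nodes = []
    def walk(n):
        nodes.append(n)
        for c in n[1]:
            walk(c)
    walk(node)

    def h(s):
        return int.from_bytes(
            hashlib.blake2b(s.encode(), digest_size=8).digest(), "big")

    # round 0: label each node by its type
    labels = {id(n): h(n[0]) for n in nodes}
    # iterative relabeling from child labels
    for _ in range(rounds):
        new = {}
        for n in nodes:
            child_labels = sorted(labels[id(c)] for c in n[1])
            new[id(n)] = h(f"{labels[id(n)]}|{child_labels}")
        labels = new

    # bin all final labels into a fixed-size histogram, normalize
    hist = [0.0] * bins
    for lab in labels.values():
        hist[lab % bins] += 1.0
    total = sum(hist)
    return [v / total for v in hist]
```

Two trees with different branching or nesting end up with mass in different bins, which is exactly the "structural fingerprint" described above.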
That matters because two pieces of code can solve the same problem and still have very different structural character. Deeply nested conditionals, flat boilerplate, and composed code each leave different traces in that fingerprint.
How It Learns the Shape of Strong Code
A raw 69-dimensional vector is not useful by itself. The dimensions live on different scales and mean different things. A Halstead volume and a histogram bin cannot be compared directly.
So Eigenhelm trains a PCA model on what we call an elite corpus: 1,000 to 3,000 functions per language drawn from strong open-source projects. Each feature vector is standardized, then projected into a principal subspace using the economy SVD:

$$\tilde{X} = U \Sigma V^\top, \qquad W = V_k$$

The projection matrix $W$, built from the top-$k$ right singular vectors, defines an eigenspace for each language. New code is standardized against the training distribution, projected into that space, then scored along two axes:
Drift measures reconstruction error: $d = \lVert \tilde{x} - W W^\top \tilde{x} \rVert_2$. If drift is high, the code sits far from the subspace spanned by the training corpus. The model cannot explain it well.
Alignment measures the norm of the projected coordinates: $a = \lVert W^\top \tilde{x} \rVert_2$. If alignment is high, the code may still lie inside the learned space, but it sits far from the center of what strong code usually looks like.
That distinction matters. Some code is strange because it lies off the manifold. Other code is strange because it lives at the far edge of it.
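The two scores can be sketched in a few lines of NumPy. The feature extraction, the value of $k$, and the variable names are placeholders; the point is the split between reconstruction error (off the subspace) and projected norm (far out within it).

```python
import numpy as np

def fit_eigenspace(X, k=8):
    """Fit a PCA model on an elite feature matrix X (n_samples, n_features).
    Returns per-feature mean/std and the top-k right singular vectors W."""
    mu, sigma = X.mean(axis=0), X.std(axis=0) + 1e-12
    Z = (X - mu) / sigma
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)  # economy SVD
    return mu, sigma, Vt[:k].T                         # W: (n_features, k)

def score(x, mu, sigma, W):
    """Drift = reconstruction error off the subspace;
    alignment = distance from the center within it."""
    z = (x - mu) / sigma
    coords = W.T @ z
    drift = np.linalg.norm(z - W @ coords)
    alignment = np.linalg.norm(coords)
    return drift, alignment
```

A point lying exactly in the learned subspace gets zero drift but can still have large alignment, which is the off-the-manifold versus edge-of-the-manifold distinction.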
Why It Needs More Than PCA
PCA gives Eigenhelm two signals, but not the whole picture. The rest comes from information theory.
Byte entropy measures how much information is packed into the raw source text:

$$H = -\sum_{b=0}^{255} p(b)\,\log_2 p(b)$$
Code tends to live in a fairly narrow entropy band. When entropy drops too low, it often means repetition, template churn, or boilerplate drag.
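The Shannon entropy of the byte stream is a few lines of standard-library Python; this is the textbook definition, in bits per byte.

```python
import math
from collections import Counter

def byte_entropy(source: bytes) -> float:
    """Shannon entropy of the raw byte stream, in bits per byte.
    Low values usually signal repetition or boilerplate."""
    if not source:
        return 0.0
    counts = Counter(source)
    n = len(source)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```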
Compression structure uses a Birkhoff-style measure adapted for source code:

$$M = \frac{N \cdot H}{C}$$

Here $N$ is raw byte count, $H$ is entropy in bits per byte, and $C$ is compressed byte count. The implementation uses this as a heuristic rather than a dimensionally pure equation. What matters is the signal: if code compresses much more than its entropy suggests, there is usually a lot of repeated structure hiding in it.
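A sketch of the heuristic using zlib as the compressor. Which compressor and compression level Eigenhelm actually uses are assumptions here; the interesting property is only the ratio of entropy-predicted information to actual compressed size.

```python
import math
import zlib
from collections import Counter

def compression_structure(source: bytes) -> float:
    """Birkhoff-style order/complexity heuristic: entropy-predicted
    information over actual compressed size. High values mean the code
    compresses far better than its byte statistics predict, i.e. lots
    of repeated structure."""
    if not source:
        return 0.0
    n = len(source)
    counts = Counter(source)
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())  # bits/byte
    compressed = len(zlib.compress(source, 9))
    return (n * h) / compressed  # deliberately mixes bits and bytes
```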
NCD exemplar distance compares the code under review against exemplar functions from the training corpus:

$$\mathrm{NCD}(x, y) = \frac{C(xy) - \min(C(x), C(y))}{\max(C(x), C(y))}$$
This captures nonlinear structural similarity through compression. PCA tells you how far the code sits from a learned space. NCD tells you how much the code resembles known strong examples in a different way.
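The NCD formula is standard; below is a minimal zlib-based sketch. The `exemplar_distance` helper, which takes the minimum over a list of exemplars, is a hypothetical aggregation, not a confirmed detail of Eigenhelm.

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: near 0 for structurally similar
    inputs, approaching 1 for unrelated ones."""
    cx = len(zlib.compress(x, 9))
    cy = len(zlib.compress(y, 9))
    cxy = len(zlib.compress(x + y, 9))
    return (cxy - min(cx, cy)) / max(cx, cy)

def exemplar_distance(code: bytes, exemplars: list[bytes]) -> float:
    """Hypothetical helper: distance to the closest training exemplar."""
    return min(ncd(code, e) for e in exemplars)
```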
Each metric is normalized to $[0, 1]$ and combined into a weighted loss score:
| Dimension | Weight (5-dim) | Weight (4-dim) | Weight (2-dim fallback) |
|---|---|---|---|
| Manifold drift | 0.30 | 0.35 | — |
| Manifold alignment | 0.30 | 0.35 | — |
| Byte entropy | 0.15 | 0.15 | 0.50 |
| Compression structure | 0.15 | 0.15 | 0.50 |
| NCD exemplar distance | 0.10 | — | — |
The fallback matters. Even without a trained model, Eigenhelm can still say something useful about entropy and structural repetition.
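The weight tables and the fallback cascade can be expressed directly. The weights below are the ones from the table; the metric names used as dictionary keys are placeholders.

```python
def combined_loss(metrics: dict) -> float:
    """Weighted loss over whichever normalized (0-1) metrics are available,
    mirroring the 5-dim / 4-dim / 2-dim weight tables."""
    weights_5 = {"drift": 0.30, "alignment": 0.30, "entropy": 0.15,
                 "compression": 0.15, "ncd": 0.10}
    weights_4 = {"drift": 0.35, "alignment": 0.35, "entropy": 0.15,
                 "compression": 0.15}
    weights_2 = {"entropy": 0.50, "compression": 0.50}
    # fall through to the richest table the available metrics can support
    for weights in (weights_5, weights_4, weights_2):
        if set(weights) <= set(metrics):
            return sum(weights[k] * metrics[k] for k in weights)
    raise ValueError("need at least entropy and compression metrics")
```

With no trained model at all, only entropy and compression structure survive, which is the 2-dim fallback the table describes.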
Does It Hold Up
That is the real question.
We validated the scoring pipeline in two ways.
First, we measured whether it separates elite code from mid-tier code across five languages.
| Language | Cohen's $d$ | Verdict |
|---|---|---|
| Rust | 1.000 | Strong |
| JavaScript | 0.966 | Strong |
| Python | 0.724 | Medium |
| TypeScript | 0.658 | Medium |
| Go | 0.516 | Passes |
All five clear the bar for meaningful discrimination. That does not make the score perfect. It does mean the model is picking up a real difference.
Second, we compared Eigenhelm against human expert ratings on 92 code samples across five languages.
- Overall: ,
- Python-only:
That is not a replacement for human judgment, and it is not meant to be. It is a triage signal. It tells you which code deserves a harder look.
What It Is And Isn't
Eigenhelm does not understand semantics. It cannot tell you whether a function is correct, whether an algorithm is optimal, or whether a name is misleading.
It is also not a linter. Linters enforce conventions. Eigenhelm measures deviation from a learned structural distribution.
The right mental model is closer to a densitometer than a reviewer. It measures a property. It does not explain the whole program.
That is exactly why it is useful. It lives in the gap between syntax and semantics, in the place where experienced engineers often say some code just feels off.
Where It Fits in the Loop
Eigenhelm runs as a CLI, pre-commit hook, GitHub Action, or HTTP API. It emits terminal output, JSON, and SARIF. Scores are classified as accept, marginal, or reject based on the training distribution.
For human reviewers, that gives a fast way to spot structurally suspicious code. For autonomous agents, it gives something even better: a concrete target. Instead of vague instructions to improve quality, the agent can see whether the problem is drift, alignment, entropy, compression structure, or exemplar distance.
That makes the feedback actionable.
The Bet
Eigenhelm is open source under AGPL-3.0 at github.com/metacogdev/eigenhelm. Trained models ship for Python, JavaScript, TypeScript, Go, and Rust, plus a polyglot model that covers all five.
The larger point is not that code quality can be reduced to one number. It cannot. The point is that there is a real class of structural signals that linters miss and experienced engineers still notice.
Eigenhelm tries to measure that class directly.
That is the bet. Not that taste can be automated away. Just that some of what good engineers notice can be made legible.