lambdaone 6 hours ago

I find this fascinating, as it raises the possibility of a single framework that can unify neural and symbolic computation by "defuzzing" activations into what are effectively symbols. Has anyone looked at the possibility of going the other way, by fuzzifying logical computation?

  • calebh 5 hours ago

    Yes, you can relax logic gates into continuous versions, which makes the system differentiable. An AND gate can be constructed with the function x*y and NOT by 1-x (on inputs in the range [0,1]). From there you can construct a NAND gate, which is universal and can be used to construct all other gates. A sigmoid can be used to squash the inputs into [0,1] if necessary.

    This paper lists out all 16 possible logic gates in Table 1 if you're interested in this sort of thing: https://arxiv.org/abs/2210.08277
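    For concreteness, a minimal sketch of these relaxed gates (plain NumPy, my own illustration rather than anything from the paper):

    ```python
    import numpy as np

    def soft_not(x):
        # NOT relaxed to 1 - x on inputs in [0, 1]
        return 1.0 - x

    def soft_and(x, y):
        # AND relaxed to x * y
        return x * y

    def soft_nand(x, y):
        # NAND = NOT(AND); universal, so every other gate can be built from it
        return soft_not(soft_and(x, y))

    def soft_or(x, y):
        # De Morgan: OR(x, y) = NAND(NOT x, NOT y)
        return soft_nand(soft_not(x), soft_not(y))

    def squash(z):
        # sigmoid to map arbitrary real activations into [0, 1]
        return 1.0 / (1.0 + np.exp(-z))

    # At the corners {0, 1} these reproduce the Boolean truth tables,
    # and in between everything stays differentiable.
    for a in (0.0, 1.0):
        for b in (0.0, 1.0):
            print(a, b, soft_and(a, b), soft_or(a, b), soft_nand(a, b))
    ```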

  • radarsat1 6 hours ago

    > fuzzifying logical computation?

    Isn't that basically what the sigmoid operator does? Or, more in the direction of averaging many logical computations, there are random forests.

oli5679 6 hours ago

This ties directly into the superposition theory.

It is believed dense models cram many features into shared weights, making circuits hard to interpret.

Sparsity reduces that pressure by giving features more isolated space, so individual neurons are more likely to represent a single, interpretable concept.

  • HarHarVeryFunny 4 hours ago

    Yes, although the sparsity doesn't need to be inherent to the model - another approach is to try to decode the learned weights using approaches like sparse auto-encoders or transcoders.

    https://transformer-circuits.pub/2025/attribution-graphs/met...
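    For anyone unfamiliar, a rough sketch of the SAE idea (illustrative only, not Anthropic's or OpenAI's actual implementation): train a wide, overcomplete autoencoder to reconstruct a model's activations under a sparsity penalty, so that each latent hopefully corresponds to a single feature.

    ```python
    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        """Overcomplete autoencoder trained on activations of the model being studied."""
        def __init__(self, d_model=768, d_hidden=16384):
            super().__init__()
            self.encoder = nn.Linear(d_model, d_hidden)
            self.decoder = nn.Linear(d_hidden, d_model)

        def forward(self, acts):
            latents = torch.relu(self.encoder(acts))  # sparse feature activations
            recon = self.decoder(latents)
            return recon, latents

    def sae_loss(recon, acts, latents, l1_coeff=1e-3):
        # Reconstruction error plus an L1 penalty that pushes most latents to zero.
        return (recon - acts).pow(2).mean() + l1_coeff * latents.abs().mean()

    # Usage: collect residual-stream activations `acts` of shape (batch, d_model)
    # from the model you want to interpret, then train the SAE with sae_loss.
    ```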

    • leogao an hour ago

      I'm also very excited about SAE/Transcoder based approaches! I think the big tradeoff is that our approach (circuit sparsity) is aiming for a full complete understanding at any cost, whereas Anthropic's Attribution Graph approach is more immediately applicable to frontier models, but gives handwavier circuits. It turns out "any cost" is really quite a lot of cost - we think this cost can be reduced a lot with further research, but it means our main results are on very small models, and the path to applying any of this to frontier models involves a lot more research risk. So if accepting a bit of handwaviness lets us immediately do useful things on frontier models, this seems like a worthwhile direction to explore.

      See also some work we've done on scaling SAEs: https://arxiv.org/abs/2406.04093

m_ke 4 hours ago

We really need new hardware optimized for sparse compute. Deep Learning models would work way better with much higher dimensional sparse vectors but current hardware only excels at dense GEMMs and structured sparsity.

  • leogao 2 hours ago

    For what it's worth, we think it's unfortunately quite unlikely that frontier models will ever be trained with extreme unstructured sparsity, even with custom sparsity optimized hardware. Our main hope is that understanding sub-frontier models can still help a lot with ensuring safety of frontier models; an interpretable GPT-3 would be a very valuable object to have. It may also be possible to adapt our method to only explaining very small but important subsets of the model.

    • m_ke an hour ago

      yeah it's not happening anytime soon, especially with the whole economy betting trillions of dollars on brute force scaling of transformers on Manhattan-sized GPU farms that will use more energy than most Midwestern states.

      Brains do it somehow, so sparsely / locally activated architectures are probably the way to go long term, but we're decades away from that being commercially viable.

    • esafak 2 hours ago

      As the lead author, why do you think so?

      • leogao 33 minutes ago

        I'm not an expert at hardware, so take this with a grain of salt, but there are two main reasons:

        - Discrete optimization is always going to be harder than continuous optimization. Learning the right sparsity mask is fundamentally a very discrete operation (a minimal illustration of one standard workaround appears after this list). So even just matching fully continuous dense models in optimization efficiency is likely to be difficult. Though perhaps we can get some hope from the fact that MoE is also similarly fundamentally discrete, and it works in practice (we can think of MoE as incurring some penalty from imperfect gating, which is more than offset by the systems benefits of not having to run all the experts on every forward pass). Also, the optimization problem gets harder when the backwards pass needs to be entirely sparsified computation (see appendix B).

        - Dense matmuls are just fundamentally nicer to implement in hardware. Systolic arrays have nice, predictable data flows that are very local. Sparse matmuls with the same number of flops nominally need only (up to a multiplicative factor) the same memory bandwidth as an equivalent dense matmul, but they need to be able to route data from any memory unit to any vector compute unit. The locality of dense matmuls means that the computation of each tile only requires a small slice of both input matrices, so we only need to load those slices into shared memory; on the other hand, because GPU-to-GPU transfers are way slower, when we op-shard matmuls we replicate the data that is needed. Sparse matmuls would need either more replication within each compute die or more all-to-all internal bandwidth, which means spending way more die space on huge crossbars and routing. Thankfully, the crossbars consume much less power than actual compute, so perhaps this could match dense in energy efficiency and not make thermals worse.
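        Minimal illustration of the discreteness problem mentioned above (my own sketch of a common workaround, a straight-through top-k magnitude mask; not necessarily what the paper does):

        ```python
        import torch
        import torch.nn as nn

        class TopKSparseLinear(nn.Module):
            """Linear layer masked to the k largest-magnitude weights per output row."""
            def __init__(self, d_in, d_out, k):
                super().__init__()
                self.weight = nn.Parameter(torch.randn(d_out, d_in) * 0.02)
                self.k = k

            def forward(self, x):
                w = self.weight
                # Hard (discrete) choice: keep only the top-k magnitudes in each row.
                thresh = w.abs().topk(self.k, dim=1).values[:, -1:]
                w_masked = w * (w.abs() >= thresh).float()
                # Straight-through estimator: the forward pass uses the masked weights,
                # but gradients flow to all of w, so pruned weights can still compete.
                w_ste = w + (w_masked - w).detach()
                return x @ w_ste.t()
        ```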

        It also seems very likely that once we create the interpretable GPT-1 (or 2, or 3) we will find that making everything unstructured sparse was overkill, and there are much more efficient pretraining constraints we can apply to models to 80/20 the interpretability. In general, a lot of my hope routes through learning things like this from the intermediate artifact (interpretable GPT-n).

        To be clear, it doesn't seem literally impossible that with great effort, we could create custom hardware, and vastly improve the optimization algorithms, etc, such that weight-sparse models could be vaguely close in performance to weight-dense models. It's plausible that with better optimization the win from arbitrary connectivity patterns might offset the hardware difficulties, and I could be overlooking something that would make the cost less than I expect. But this would require immense effort and investment to merely match current models, so it seems quite unrealistic compared to learning something from interpretable GPT-3 that helps us understand GPT-5.

  • yvdriess 3 hours ago

    Yes! I've been advocating for it inside the industry for a decade, but it is an uphill battle. Researchers can't easily publish that kind of work (even Google researchers) because they don't have hardware that can realistically train decently large models. The hardware companies don't want to take the risk of rethinking the CPU or accelerator architecture for sparse compute because there are no large existing customers.

  • carterschonwald 3 hours ago

    There also need to be tools that can author that code!

    I'm starting to dust off some ideas I developed over a decade ago to build such a toolkit. I recently realized, “egads, my stuff can express almost every major GPU/CPU optimization that's relevant for modern deep learning… I need to do a new version with an eye towards adoption in that area.” Plus every flavor of sparse.

    I also need to figure out whether some of the open-core ideas I have in mind would be attractive to early-stage investors who focus on the so-called deep-tech end of the space. It definitely looks like I'll have to take ye olde “ask friends and acquaintances if they can point me to those folks” approach, since cold outreach has historically been full of fail.

  • p1esk 3 hours ago

    > Deep Learning models would work way better with much higher dimensional sparse vectors

    Citations?

    • yvdriess 3 hours ago

      There has been plenty of evidence over the years. I don't have my bibliography handy right now, but you can find them by looking for sparse training or lottery ticket hypothesis papers.

      The intuition is that ANNs make better predictions on high-dimensional data, that sparse weights let you train the sparsity pattern as you train the weights, that the effective part of dense models is actually sparse (cf. pruning/sparsification research), and that dense models grow too quickly in compute cost to further increase model dimension sizes.
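      A toy version of the magnitude-pruning experiments those papers run (illustrative only): train dense, zero out the smallest-magnitude weights, and check how much accuracy survives.

      ```python
      import torch
      import torch.nn as nn

      def magnitude_prune(model, sparsity=0.9):
          """Zero out the smallest-magnitude fraction of each Linear weight, in place."""
          for module in model.modules():
              if isinstance(module, nn.Linear):
                  w = module.weight.data
                  k = int(w.numel() * sparsity)
                  if k == 0:
                      continue
                  # Threshold at the k-th smallest absolute value; everything below -> 0.
                  thresh = w.abs().flatten().kthvalue(k).values
                  w.mul_((w.abs() > thresh).float())

      # Lottery-ticket-style loop: train the dense model, prune, then retrain
      # (or rewind the surviving weights to their initial values) and compare accuracy.
      ```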

      • noosphr 2 hours ago

        If you can give that bibliography, I'd love to read it. I have the same intuition, and a few papers seem to support it, but more (and more explicit) ones would be much better.

      • p1esk 3 hours ago

        I could not find any evidence that sparse models work better than dense models.

        • tripplyons 2 hours ago

          All of the best open source LLMs right now use mixture-of-experts, which is a form of sparsity. They only use a small fraction of their parameters to process any given token.

          Examples:

          - GPT OSS 120b
          - Kimi K2
          - DeepSeek R1

          • leogao 2 hours ago

            Mixture of experts sparsity is very different from weight sparsity. In a mixture of experts, all weights are nonzero, but only a small fraction get used on each input. On the other hand, weight sparsity means only very few weights are nonzero, but every weight is used on every input. Of course, the two techniques can also be combined.
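            A toy sketch to make the contrast concrete (illustrative code, not any production implementation):

            ```python
            import torch
            import torch.nn as nn

            d, n_experts = 64, 8

            # MoE sparsity: every expert's weights are dense and nonzero,
            # but each token is routed to only one (or a few) of them.
            experts = nn.ModuleList(nn.Linear(d, d) for _ in range(n_experts))
            router = nn.Linear(d, n_experts)

            def moe_forward(x):                     # x: (batch, d)
                expert_idx = router(x).argmax(dim=-1)
                out = torch.empty_like(x)
                for i, expert in enumerate(experts):
                    sel = expert_idx == i
                    if sel.any():
                        out[sel] = expert(x[sel])   # only routed tokens touch this expert
                return out

            # Weight sparsity: one weight matrix that is mostly zeros,
            # but every token multiplies against the same (sparse) matrix.
            sparse_weight = torch.randn(d, d) * (torch.rand(d, d) < 0.05).float()

            def sparse_forward(x):
                return x @ sparse_weight.t()
            ```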

            • tripplyons 2 hours ago

              Correct. I was more focused on giving an example of sparsity being useful in general, because the comment I was replying to didn't specifically mention which kind of sparsity.

              For weight sparsity, I know the BitNet 1.58 paper has some claims of improved performance by restricting weights to be either -1, 0, or 1, eliminating the need for multiplying by the weights, and allowing the weights with a value of 0 to be ignored entirely.

              Another kind of sparsity, while we're on the topic, is activation sparsity. I think there was an Nvidia paper that used a modified ReLU activation function to set more of the model's activations to 0.
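              As a toy illustration of activation sparsity (not the specific method from that paper), one can keep only the top-k activations per token and zero the rest:

              ```python
              import torch

              def topk_relu(x, k=32):
                  """ReLU, then keep only the k largest activations in each row."""
                  x = torch.relu(x)
                  if k >= x.shape[-1]:
                      return x
                  thresh = x.topk(k, dim=-1).values[..., -1:]
                  return x * (x >= thresh).float()

              acts = torch.randn(4, 512)
              print((topk_relu(acts) != 0).float().mean())  # roughly k / 512 nonzero
              ```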

              • p1esk an hour ago

                “Useful” does not mean “better”. It just means “we could not do dense”. All modern state-of-the-art models use dense layers (both weights and inputs). Quantization is also used to make models smaller and faster, but never better in terms of quality.

                Based on all examples I’ve seen so far in this thread it’s clear there’s no evidence that sparse models actually work better than dense models.

        • m_ke 2 hours ago

          https://transformer-circuits.pub/2022/toy_model/index.html

          https://arxiv.org/abs/1803.03635

          EDIT: don't have time to write it up, but here's gemini 3 with a short explanation:

          To simulate the brain's efficiency using Transformer-like architectures, we would need to fundamentally alter three layers of the stack: the *mathematical representation* (moving to high dimensions), the *computational model* (moving to sparsity), and the *physical hardware* (moving to neuromorphic chips).

          Here is how we could simulate a "Brain-Like Transformer" by combining High-Dimensional Computing (HDC) with Spiking Neural Networks (SNNs).

          ### 1. The Representation: Hyperdimensional Computing (HDC)

          Current Transformers use "dense" embeddings—e.g., a vector of 4,096 floating-point numbers (like `[0.1, -0.5, 0.03, ...]`). Every number matters. To mimic the brain, we would switch to *Hyperdimensional Vectors* (e.g., 10,000+ dimensions), but make them *binary and sparse*.

            * **Holographic Representation:** In HDC, concepts (like "cat") are stored as massive randomized vectors of 1s and 0s. Information is distributed "holographically" across the entire vector. You can cut the vector in half, and it still retains the information (just noisier), similar to how brain lesions don't always destroy specific memories.
            * **Math without Multiplication:** In this high-dimensional binary space, you don't need expensive floating-point matrix multiplication. You can use simple bitwise operations:
                * **Binding (Association):** XOR operations (`A ⊕ B`).
                * **Bundling (Superposition):** Majority rule (voting).
                * **Permutation:** Bit shifting.
            * **Simulation Benefit:** This allows a Transformer to manipulate massive "context windows" using extremely cheap binary logic gates instead of energy-hungry floating-point multipliers.
          
          ### 2. The Architecture: "Spiking" Attention Mechanisms

          Standard Attention is $O(N^2)$ because it forces every token to query every other token. A "Spiking Transformer" simulates the brain's "event-driven" nature.

            * **Dynamic Sparsity:** Instead of a dense matrix multiplication, neurons would only "fire" (send a signal) if their activation crosses a threshold. If a token's relevance score is low, it sends *zero* spikes. The hardware performs *no* work for that connection.
            * **The "Winner-Take-All" Circuit:** In the brain, inhibitory neurons suppress weak signals so only the strongest "win." A simulated Sparse Transformer would replace the Softmax function (which technically keeps all values non-zero) with a **k-Winner-Take-All** function.
                * *Result:* The attention matrix becomes 99% empty (sparse). The system only processes the top 1% of relevant connections, similar to how you ignore the feeling of your socks until you think about them.
          
          ### 3. The Hardware: Neuromorphic Substrate

          Even if you write sparse code, a standard GPU (NVIDIA H100) is bad at running it. GPUs like dense, predictable blocks of numbers. To simulate the brain efficiently, we need *Neuromorphic Hardware* (like Intel Loihi or IBM NorthPole).

            * **Address Event Representation (AER):** Instead of a "clock" ticking every nanosecond forcing all neurons to update, the hardware is asynchronous. It sits idle (consuming nanowatts) until a "spike" packet arrives at a specific address.
            * **Processing-in-Memory (PIM):** To handle the high dimensionality (e.g., 100,000-dimensional vectors), the hardware moves the logic gates *inside* the RAM arrays. This eliminates the energy cost of moving those massive vectors back and forth.
          
          ### Summary: The Hypothetical "Spiking HD-Transformer"

          | Feature | Standard Transformer | Simulated "Brain-Like" Transformer |
          | :--- | :--- | :--- |
          | *Dimension* | Low (~4k), Dense, Float32 | *Ultra-High* (~100k), Sparse, Binary |
          | *Operation* | Matrix Multiplication (MACs) | *Bitwise XOR / Popcount* |
          | *Attention* | Global Softmax ($N^2$) | *Spiking k-Winner-Take-All* (Linear) |
          | *Activation* | Continuous (RELU/GELU) | *Discrete Spikes* (Fire-or-Silence) |
          | *Hardware* | GPU (Synchronous) | *Neuromorphic* (Asynchronous) |
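          To make section 1 above concrete, here's a tiny sketch of those HDC primitives on binary vectors (my own illustration, using NumPy):

          ```python
          import numpy as np

          D = 10_000
          rng = np.random.default_rng(0)

          def random_hv():
              # Random hyperdimensional bit vector; no single bit carries the meaning.
              return rng.integers(0, 2, size=D, dtype=np.uint8)

          def bind(a, b):
              # Binding (association) via XOR; bind(bind(a, b), b) recovers a.
              return np.bitwise_xor(a, b)

          def bundle(vectors):
              # Bundling (superposition) via bitwise majority vote.
              return (np.sum(vectors, axis=0) > len(vectors) / 2).astype(np.uint8)

          def permute(a, shift=1):
              # Permutation via a cyclic shift (used to encode sequence order).
              return np.roll(a, shift)

          def similarity(a, b):
              # ~0.5 for unrelated vectors, 1.0 for identical ones.
              return 1.0 - np.mean(a != b)

          cat, animal = random_hv(), random_hv()
          pair = bind(cat, animal)
          print(similarity(bind(pair, animal), cat))  # ~1.0: unbinding recovers "cat"
          print(similarity(pair, cat))                # ~0.5: the pair looks unrelated to "cat"
          ```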

          • p1esk an hour ago

            I’m not sure why you’re talking about efficiency when the question is “do sparse models work better than dense models?” The answer is no, they don’t.

            Even the old LTH paper you cited trains a dense model and then tries to prune it without too much quality loss. Pruning is a well known method to compress models - to make them smaller and faster, not better.

            • m_ke 15 minutes ago

              Before we had proper GPUs everyone said the same thing about Neural Networks.

              Current model architectures are optimized to get the most out of GPUs, which is why we have transformers dominating as they're mostly large dense matrix multiplies.

              There's plenty of work showing transformers improve with inner dimension size, but it's not feasible to scale them up further because it blows up parameter and activation sizes (including KV caches), so people turn to low-rank ("sparse") decompositions like MLA.

              The lottery ticket hypothesis shows that most of the weights in current models are redundant and that we could get away with much smaller sparse models, but currently there's no advantage to doing so, because on GPUs you still end up doing dense multiplies.

              Plenty of mech interp work shows that models are forced to commingle different concepts to fit them into the "low" dimensional vector space. (https://www.neelnanda.io/mechanistic-interpretability/glossa...)

              https://arxiv.org/abs/2210.06313

              https://arxiv.org/abs/2305.01610

peter_d_sherman 8 hours ago

>"To assess the interpretability of our models, we isolate the small sparse circuits that our models use to perform each task using a novel pruning method. Since interpretable models should be easy to untangle, individual behaviors should be implemented by compact standalone circuits.

Sparse circuits are defined as a set of nodes connected by edges."

...which could also be considered/viewed as Graphs...

(Then from earlier in the paper):

>"We train models to have more understandable circuits by constraining most of their weights to be zeros, so that each neuron only has a few connections. To recover fine-grained circuits underlying each of several hand-crafted tasks, we prune the models to isolate the part responsible for the task. These circuits often contain neurons and residual channels that correspond to natural concepts, with a small number of straightforwardly interpretable connections between them.

And (jumping around a bit more in the paper):

>"A major difficulty for interpreting transformers is that the activations and weights are not directly comprehensible; for example, neurons activate in unpredictable patterns that don’t correspond to human-understandable concepts. One hypothesized cause is superposition (Elhage et al., 2022b), the idea that dense models are an approximation to the computations of a much larger untangled sparse network."

A very interesting paper -- and a very interesting postulated potential relationship with superposition! (which also could be related to data compression... and if so, in turn, by relationship, potentially entropy as well...)

Anyway, great paper!