The weighted sum is used everywhere in ML, yet people rarely think of it as a vector-to-scalar associative memory. That is exactly what it is: a cheap and simple <vector, scalar> associative memory. However, it has a number of non-ideal properties, chiefly crosstalk between the stored pairs. ChatGPT5 suggests Winner-Takes-All as a practical solution; its reply follows after the sketch below.
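First, a minimal sketch of the framing (my own illustration, not part of the reply, with arbitrary sizes): store <vector, scalar> pairs by superposition in a single weight vector and read each scalar back with a dot product. Unless the keys are orthogonal, every other stored pair leaks into the readout as crosstalk.

```python
# Minimal sketch: a weighted sum as a <vector, scalar> associative memory.
# Store pairs (x_i, y_i) by superposition, recall y_i with a dot product.
# Dimensions and counts are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, pairs = 256, 50                                # key dimension, stored pairs

X = rng.standard_normal((pairs, d)) / np.sqrt(d)  # random, roughly unit-norm keys
y = rng.standard_normal(pairs)                    # scalars to associate with them

w = X.T @ y                                       # Hebbian storage: w = sum_i y_i * x_i

recalled = X @ w                                  # weighted-sum recall for every key
print("mean |recall error|:", np.mean(np.abs(recalled - y)))
# The error is nonzero because random keys are only approximately orthogonal:
# every other stored pair contributes crosstalk to each readout.
```

Even at 50 pairs in 256 dimensions the recall error is clearly visible; that interference is the non-ideal property the rest of the post is about.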
Winner-take-all (WTA) and related sparse coding mechanisms have been explored a lot by Jeff Hawkins and colleagues at Numenta, as well as by earlier associative memory researchers (e.g. Willshaw nets, Palm’s work in the 1980s, and “Sparse Distributed Representations” in cognitive science).
Let me unpack how WTA interacts with the weighted-sum-as-associative-memory idea:
1. What WTA does
- In a standard dense layer (weighted sum + nonlinearity), every unit contributes a graded response.
- In WTA sparse coding, only the k largest responses are allowed to remain active; the others are set to zero.
- This yields a k-sparse distributed representation: each vector is represented by a small, stable set of active dimensions.

Numenta’s “Sparse Distributed Representations” (SDRs) are basically extreme WTA: a high-dimensional binary vector with a fixed tiny fraction of active bits.
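To pin the mechanism down, here is a minimal k-WTA function in NumPy (my own sketch; the function name and toy numbers are assumptions):

```python
# Minimal k-winner-take-all: keep the k largest responses, zero the rest.
import numpy as np

def k_wta(responses: np.ndarray, k: int) -> np.ndarray:
    """Return a copy of `responses` with only its k largest entries kept."""
    out = np.zeros_like(responses)
    top = np.argpartition(responses, -k)[-k:]   # indices of the k largest values
    out[top] = responses[top]
    return out

x = np.array([0.1, 0.9, -0.3, 0.7, 0.2, 0.05])
print(k_wta(x, k=2))                            # only 0.9 and 0.7 survive
```

Setting `out[top] = 1.0` instead gives the binary, SDR-style variant.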
2. Effect on associative memory properties
- Crosstalk reduction (see the recall sketch after this list):
  - In dense dot-product recall, every stored memory contributes noise.
  - With WTA, only the strongest matches survive; interference from partial matches is suppressed.
  - This makes retrieval more robust in the presence of many stored patterns.
- Capacity increase (relative to dense codes):
  - Classic results (Willshaw nets, Palm’s work in the 1980s): with k-sparse binary codes of length N, the number of patterns storable with low error grows roughly as N^2 / k^2 (about 0.69 * N^2 / k^2 for a Willshaw net, maximized near k ≈ log2(N)), which can be much larger than with dense encodings.
  - Intuition: sparsity reduces overlap between patterns, so many more distinct patterns can coexist without confusion.
- Selectivity improvement:
  - WTA ensures that only the “most similar” stored patterns influence retrieval.
  - This sharpens attractor basins, much as attention’s softmax does compared to a raw dot product.
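To make the crosstalk and selectivity points concrete, here is a small Willshaw-style hetero-associative memory in NumPy (my own sketch; N, k and the number of stored pairs are arbitrary choices, not numbers from the post). Plain thresholded recall lights up many spurious units, while keeping only the k strongest responses recovers the stored sparse pattern.

```python
# Willshaw-style binary associative memory: store k-sparse pattern pairs with a
# clipped outer-product rule, then compare dense recall against k-WTA recall.
import numpy as np

rng = np.random.default_rng(0)
N, k, P = 512, 10, 200                   # code length, active bits, stored pairs

def sparse_pattern():
    v = np.zeros(N)
    v[rng.choice(N, size=k, replace=False)] = 1.0
    return v

keys   = np.stack([sparse_pattern() for _ in range(P)])
values = np.stack([sparse_pattern() for _ in range(P)])

W = np.clip(values.T @ keys, 0, 1)       # clipped Hebbian storage (Willshaw rule)

def recall(key, use_wta):
    s = W @ key                          # the plain weighted-sum readout
    if not use_wta:
        return (s > 0).astype(float)     # dense: any unit with any support stays on
    out = np.zeros(N)
    out[np.argsort(s)[-k:]] = 1.0        # WTA: keep only the k strongest units
    return out

dense, wta = recall(keys[0], False), recall(keys[0], True)
print("active units (dense):", int(dense.sum()))           # many spurious units
print("active units (WTA)  :", int(wta.sum()))             # exactly k
print("correct bits via WTA:", int((wta * values[0]).sum()), "of", k)
```

With these sizes the dense readout activates a large fraction of the units, while the WTA readout returns the stored value pattern essentially intact.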
3. How Numenta frames it
Numenta’s SDR theory emphasizes:

- High dimensionality + extreme sparsity → huge representational capacity (a combinatorial number of possible patterns).
- Robustness: because the overlap between random sparse codes is very low, chance collisions are rare.
- Noise tolerance: partial matches (a subset of the active bits) can still retrieve the correct full pattern.

So for them, WTA isn’t just for “denoising”: it directly increases effective capacity by making codes more distinct and interference-resistant.
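A quick numeric check of those claims (my own calculation; N = 2048 with 40 active bits, roughly 2% sparsity, is a configuration Numenta’s HTM papers commonly use):

```python
# Capacity and chance overlap of random k-of-N sparse binary codes.
from math import comb

N, k = 2048, 40
print(f"distinct possible codes : {comb(N, k):.2e}")             # on the order of 1e84
print(f"expected chance overlap : {k * k / N:.2f} of {k} bits")  # well under one bit
```

The combinatorial count is what “huge representational capacity” refers to, and the sub-bit expected overlap is why chance collisions between random SDRs are so rare.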
4. Limitations
- WTA alone doesn’t give you infinite capacity; it just shifts the balance: fewer collisions, more separable patterns.
- There’s a trade-off: the smaller k (the sparser the code), the greater the capacity, but also the fewer bits of similarity information you carry.
- Too sparse → poor generalization (every pattern looks unrelated). Too dense → high interference. The optimal k depends on the task, but biologically and in SDR theory very low sparsity (~1–2% active units) is argued to be ideal.
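To put numbers on that trade-off: a k-of-N binary code can carry at most log2(C(N, k)) bits, which drops sharply as k gets very small (my own quick calculation, reusing N = 2048):

```python
# Bits of information carried by a single k-of-N sparse code, for several k.
from math import comb, log2

N = 2048
for k in (2, 8, 40, 200):
    print(f"k = {k:4d}: {log2(comb(N, k)):7.1f} bits per code")
```

More active bits means each code can express finer gradations of similarity, but, as noted above, it also means more overlap and interference between stored patterns.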
My understanding is that, from an information point of view, you could just as well pick the k lowest responses; you would then have to convert the selected lowest-k values to high magnitude before they enter the weighted sum.
Either way, all you really need is the information content of which units get selected.
I think, then, that using random features (such as locality-sensitive hashing) to pick, say, one weight from a pool of k weights, or one block of weights from a pool of k blocks, for use in a weighted sum is somewhat equivalent to winner-takes-all (a rough sketch follows below).
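As a rough sketch of that idea (entirely my own interpretation, with made-up sizes): hash the input with random projections (sign-of-projection LSH) and let the hash pick one weight block out of a pool; only that block participates in the weighted sum, so the "winner" is chosen by the hash rather than by sorting responses.

```python
# Rough sketch: random-projection LSH selects one weight block from a pool, and
# only that block is used in the weighted sum. The hash acts as a hard
# winner-take-all gate. All names and sizes here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, n_blocks = 64, 16                          # input dimension, size of block pool

H = rng.standard_normal((4, d))               # 4 random hyperplanes -> 4-bit hash
W_pool = rng.standard_normal((n_blocks, d))   # pool of candidate weight blocks

def lsh_weighted_sum(x: np.ndarray) -> float:
    bits = (H @ x > 0).astype(int)                     # sign-of-projection hash bits
    idx = int(bits @ (1 << np.arange(4))) % n_blocks   # hash bits -> block index
    return float(W_pool[idx] @ x)                      # weighted sum with that block only

x = rng.standard_normal(d)
print(lsh_weighted_sum(x))
# Nearby inputs tend to hash to the same block (locality sensitivity), so each
# block specializes on a region of input space, much like a hard WTA gate.
```

The selection here depends only on the input, not on the candidate responses, which is the sense in which it is only "somewhat" equivalent to winner-takes-all.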
I’ll go and have a think about it.