Winner Takes All and the Weighted Sum

The weighted sum is used everywhere in ML, though people rarely think of it as a vector-to-scalar associative memory. Yes, it's a cheap and simple <vector, scalar> associative memory. However, it has a number of non-ideal properties. ChatGPT5 suggests Winner-Takes-All as a practical solution:

Winner-take-all (WTA) and related sparse coding mechanisms have been explored a lot by Jeff Hawkins and colleagues at Numenta, as well as by earlier associative memory researchers (e.g. Willshaw nets, Palm’s work in the 1980s, and “Sparse Distributed Representations” in cognitive science).

Let me unpack how WTA interacts with the weighted-sum-as-associative-memory idea:


1. What WTA does

  • In a standard dense layer (weighted sum + nonlinearity), every unit contributes a graded response.

  • In WTA sparse coding, only the k largest responses are allowed to remain active (others are set to zero).

  • This yields a k-sparse distributed representation — each vector is represented by a small, stable set of active dimensions.

Numenta’s “Sparse Distributed Representations” (SDRs) are basically extreme WTA: a high-dimensional binary vector with a fixed tiny fraction of active bits.
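
As a concrete anchor for the rest of the discussion, here is a minimal k-WTA sketch in NumPy (the function name and the example values are just illustrative, not from any of the sources above):

```python
import numpy as np

def k_wta(x, k):
    """Keep the k largest entries of x and zero the rest (basic winner-take-all)."""
    out = np.zeros_like(x)
    if k <= 0:
        return out
    winners = np.argpartition(x, -k)[-k:]   # indices of the k largest responses
    out[winners] = x[winners]
    return out

h = np.array([0.1, 2.3, -0.4, 1.7, 0.05, 0.9])
print(k_wta(h, 2))   # only the two strongest responses survive
```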


2. Effect on associative memory properties

  • Crosstalk reduction:

    • In dense dot-product recall, every memory contributes noise.

    • With WTA, only the strongest matches survive — interference from partial matches is suppressed.

    • This makes retrieval more robust in the presence of many stored patterns.

  • Capacity increase (relative to dense codes):

    • Classic results (Willshaw nets, Palm 1980s): if you use k-sparse binary codes of length N, the number of storable patterns with low error grows as $\sim O\!\left(\frac{N^2}{k}\log N\right)$, which can be much larger than with dense encodings.

    • Intuition: sparsity reduces overlap between patterns, so many more distinct patterns can coexist without confusion.

  • Selectivity improvement:

    • WTA ensures only the “most similar” stored patterns influence retrieval.

    • This sharpens attractor basins, similar to how attention’s softmax does compared to raw dot product.
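
To make the crosstalk and selectivity points concrete, here is a toy heteroassociative readout of my own construction (random keys and values, weighted-sum recall), comparing dense recall with a top-k-filtered recall; the sizes and the k=3 choice are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
d, P = 64, 200                         # key dimension, number of stored pairs
K = rng.standard_normal((P, d))
K /= np.linalg.norm(K, axis=1, keepdims=True)   # stored keys (unit length)
V = rng.standard_normal((P, d))                  # stored values

def recall(query, k=None):
    s = K @ query                      # dot-product match against every stored key
    if k is not None:                  # WTA: keep only the k strongest matches
        keep = np.argpartition(s, -k)[-k:]
        mask = np.zeros_like(s)
        mask[keep] = 1.0
        s = s * mask
    return s @ V                       # weighted sum of stored values

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

q = K[0] + 0.1 * rng.standard_normal(d)          # noisy cue for the first stored pair
print("dense readout:", cosine(recall(q), V[0]))       # crosstalk from all 200 stored values pulls this down
print("WTA readout:  ", cosine(recall(q, k=3), V[0]))  # typically much closer to the target value
```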


3. How Numenta frames it

  • Numenta’s SDR theory emphasizes:

    • High dimensionality + extreme sparsity → huge representational capacity (combinatorial number of possible patterns).

    • Robustness: because overlap between random sparse codes is very low, chance collisions are rare.

    • Noise tolerance: partial matches (subset of active bits) can still retrieve the correct full pattern.

So for them, WTA isn’t just for “denoising” — it directly increases effective capacity by making codes more distinct and interference-resistant.
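
The low-overlap claim is easy to quantify: two independent random k-sparse binary codes of length N share k²/N active bits in expectation. A quick check at roughly the sparsity regime SDR theory argues for (these particular N and k are my own choice for illustration):

```python
import numpy as np

N, k = 2048, 40                        # code length and active bits (~2% sparsity)
print("expected overlap:", k * k / N)  # hypergeometric expectation: k^2 / N ≈ 0.78 bits

rng = np.random.default_rng(1)
def random_sdr():
    code = np.zeros(N)
    code[rng.choice(N, size=k, replace=False)] = 1.0
    return code

overlaps = [random_sdr() @ random_sdr() for _ in range(2000)]
print("simulated mean overlap:", np.mean(overlaps))   # should land close to k^2 / N
```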


4. Limitations

  • WTA alone doesn’t give you infinite capacity — it just shifts the balance: fewer collisions, more separable patterns.

  • There’s a trade-off: the smaller k (the sparser the code), the greater the capacity, but also the fewer bits of similarity information you carry.

  • Too sparse → poor generalization (every pattern looks unrelated). Too dense → high interference. The optimal k depends on the task, but biologically and in SDR theory, very sparse activity (~1–2% active units) is argued to be ideal.


My understanding is that, from an information point of view, you could just as well pick the lowest k; it's just that you would then have to convert the selected k lowest to high magnitude for the weighted sum.

Anyway, you only need the information content of which units get selected.

I think, then, that using random features (such as locality-sensitive hashing) to pick, say, a weight from a pool of k weights, or a block of weights from a pool of k blocks, to use with a weighted sum is somewhat equivalent to winner-takes-all.
I’ll go and have a think about it.


ChatGPT5 suggests that winner-take-all sparsity and random-feature-based weight selection may combine nicely:

1. WTA as a kind of “learned hash”

A dense weighted sum is a continuous similarity measure (dot product).

WTA turns it into a discrete hash: only the top-k components survive, effectively assigning the input vector to a “bucket” defined by which indices won.

This is almost identical to Winner-Take-All Hashing (WTALSH) in the approximate nearest-neighbor (ANN) literature: you keep the indices of the largest coordinates after projection as a hash key.

So WTA can be viewed as a locality-sensitive hashing scheme, where the similarity metric is “overlap of winners.”
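
A minimal sketch of that reading, using a plain Gaussian projection rather than the permutation scheme of the original WTA hashing paper; the similarity between two inputs is simply the overlap of their winner sets:

```python
import numpy as np

rng = np.random.default_rng(2)
d, m, k = 32, 128, 4                     # input dim, projection dim, winners kept
R = rng.standard_normal((m, d))          # fixed random projection

def wta_hash(x):
    """Hash key = the set of indices of the k largest coordinates after projection."""
    z = R @ x
    return frozenset(np.argpartition(z, -k)[-k:].tolist())

a = rng.standard_normal(d)
b = a + 0.05 * rng.standard_normal(d)    # near-duplicate of a
c = rng.standard_normal(d)               # unrelated vector
print(len(wta_hash(a) & wta_hash(b)))    # near-duplicates share most winners
print(len(wta_hash(a) & wta_hash(c)))    # unrelated vectors share few or none
```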

2. Random feature weight selection (LSH-style)

In random features for kernels, you normally pick random weights (Fourier features, etc.) and compute dot products.

In LSH, you often sample random projections or select subsets of coordinates (pool/block sampling).

Per-weight pool: Each feature weight is chosen from a small pool of random candidates → encourages diversity, lower compute.

Block-wise pools: Partition dimensions into blocks, then pick one (or a few) weights per block → balances coverage and sparsity.

These are effectively structured sparsifications of the weight matrix.
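
A sketch of the per-block pool idea; the block size, pool size, and the two-hyperplane sign hash are all arbitrary choices of mine, just to show the mechanism:

```python
import numpy as np

rng = np.random.default_rng(3)
n_blocks, block, pool = 8, 8, 4          # input split into 8 blocks of 8 dims; 4 candidates per block
d = n_blocks * block
W = rng.standard_normal((n_blocks, pool, block))  # pool of candidate weight vectors per block
H = rng.standard_normal((n_blocks, 2, block))     # two random hyperplanes per block -> 2-bit hash

def block_pool_sum(x):
    """Weighted sum where each block's weights are picked from its pool by a sign-LSH hash."""
    xb = x.reshape(n_blocks, block)
    y = 0.0
    for b in range(n_blocks):
        bits = (H[b] @ xb[b]) > 0                 # locality-sensitive: similar inputs hash alike
        choice = int(bits[0]) * 2 + int(bits[1])  # index into this block's pool of 4
        y += W[b, choice] @ xb[b]                 # only the selected weights take part
    return y

x = rng.standard_normal(d)
print(block_pool_sum(x))
```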

3. How WTA and random selection interact

Now imagine combining the two:

Step 1: Random feature generation. Instead of a full dense projection, you generate candidate responses via a small pool of random weights (per dimension, or per block).

Step 2: Winner-take-all selection. From these candidate responses, you only keep the top-k (globally or per block).

This gives you a double filter:

randomness ensures coverage and diversity (like LSH),

WTA ensures sparsity and interference reduction (like SDRs/associative memory).
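
A rough sketch of the two steps combined; the block layout, ReLU, and per-block k are my own arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(4)
d, n_blocks, units, k = 32, 4, 16, 2            # input dim, blocks, random units per block, winners per block
R = rng.standard_normal((n_blocks, units, d))   # Step 1: pools of random features

def sparse_code(x):
    """Step 2: per-block winner-take-all over the random responses."""
    code = np.zeros((n_blocks, units))
    for b in range(n_blocks):
        z = np.maximum(R[b] @ x, 0.0)     # random nonlinear features for this block
        winners = np.argpartition(z, -k)[-k:]
        code[b, winners] = z[winners]     # keep only the k winners per block
    return code.ravel()                   # a sparse distributed code for x

x = rng.standard_normal(d)
code = sparse_code(x)
print(np.count_nonzero(code), "of", code.size, "units active")
```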

4. Theoretical perspective

Capacity / crosstalk: WTA reduces interference between features by making codes sparse (like Willshaw capacity scaling). Randomness ensures different samples don’t collapse into the same indices systematically. Together you get high capacity, low collision.

Information-theoretic view: randomization alone spreads information, but doesn’t control overlap well; WTA alone reduces overlap but may “waste” dimensions if winners are always similar. Combining them maximizes entropy of which features fire → closer to optimal coding.

Connection to compressive sensing: picking random weights is like designing a random measurement matrix; WTA is like enforcing sparsity in the measurement vector. Jointly, this is compressive sensing’s sweet spot: recover sparse signals from few random projections.

5. Practical precedents

You’re not inventing in a vacuum — there are lines of work along these lines:

Winner-Take-All Hashing (WTALSH, Yagnik et al. 2011): for ANN search, exactly combines random projections + WTA to create sparse hash codes.

K-Winner activations (Numenta, 2010s): applied on top of random encoders to create sparse distributed representations.

Randomized k-sparse autoencoders: restrict activations to k winners per block of random features for efficient representations.

Product quantization / block hashing: partition weights into blocks, then apply local competition.

6. Big picture

Random weight selection = diversity / coverage.

WTA = sparsity / robustness.

Combination = structured sparse high-capacity associative hash.

So in a sense, you get the best of both:

lower compute than full dense features,

higher effective capacity than random features alone,

more robust to interference than WTA alone.

✅ Summary:
WTA and random feature selection are two sides of locality-sensitive hashing. Random selection diversifies, WTA sparsifies. Combining them yields representations with both high capacity and robustness, closely related to WTALSH, sparse distributed representations, and compressive sensing.  

You would think that winner-take-all pre-selection into a weighted sum, random-feature (locality-sensitive hashing) weight selection from a pool, double descent, and noise sensitivity would just be a couple of chapters in a book you could look up.

It should be a worn-out topic, not something you have to piece together from papers and ChatGPT yourself. After all, it is only the weighted sum.

Anyway, I looked a little at block weight selection vs single weight selection using random features. Block weight selection can be like having associative memory work on part of the problem as a group, while single weight selection is less coherent.

That’s nice from a few viewpoints, like CPU memory and caching, and probably generalization. The negative is that you have to choose a hyper-parameter (block size). Anyway, I can only take things so far as a hobbyist, and not very far at all. I would prefer to just do vibe coding to create games at this stage, but there are no two chapters of a book to read and be done with it.


I also coded WTA with a simple extreme learning machine. I haven’t yet checked for increased capacity, sharper decision boundaries, etc.

I’ll get gpt to run through this “golden oldie” paper and see what extra viewpoints I can get on it: https://arxiv.org/abs/1903.11257


I wouldn’t bother you with this except that there seems to be a biological insight. (I’ll leave the raw LaTeX math.)

If you were to create an extreme learning machine using WTA in the read-out layer, it sounds like the random layer should have rather non-linear properties to get the best out of WTA.

ChatGPT said:

That’s a really sharp observation :+1: — you’re right that if we bolt winner-take-all (WTA) sparsity onto the readout layer of an Extreme Learning Machine (ELM), then the properties of the random hidden layer matter a lot. Let’s unpack why:


1. Quick Refresher: Extreme Learning Machine (ELM)

  • Standard ELM:
    Input → Random projection (random weights, fixed) → Nonlinearity → Linear readout (learned).

  • Training = only solve a linear regression on the readout layer.

So the power of ELMs comes from the richness of the random hidden layer embedding.


2. Adding WTA to the Readout

Suppose instead of a dense linear readout we have:

$$y = \sum_{i \in \text{Top-}k} w_i h_i$$

where $h_i$ are the hidden activations from the random feature map.

  • Now, only the top-k hidden features (per input) contribute to prediction.

  • That means the random feature layer has to be expressive enough so that:

    1. The strongest activations actually correspond to informative features.

    2. The ranking of features (not just their magnitudes) is meaningful.
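
A minimal ELM-with-WTA-readout sketch of the setup above, with ridge regression for the readout and a hard top-k applied per input; the sizes, target function, and regularizer are all placeholder choices of mine:

```python
import numpy as np

rng = np.random.default_rng(5)
d, m, k = 16, 256, 32                      # input dim, hidden units, winners kept per input
W_in = rng.standard_normal((m, d))         # fixed random input weights (never trained)
b_in = rng.standard_normal(m)

def hidden(X):
    return np.maximum(X @ W_in.T + b_in, 0.0)    # nonlinear random feature map (ReLU)

def top_k(H):
    """Zero all but the k largest activations in each row."""
    out = np.zeros_like(H)
    idx = np.argpartition(H, -k, axis=1)[:, -k:]
    np.put_along_axis(out, idx, np.take_along_axis(H, idx, axis=1), axis=1)
    return out

# Toy regression target; only the readout w is learned (ridge regression on the WTA-filtered code).
X = rng.standard_normal((500, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(500)
Hk = top_k(hidden(X))
w = np.linalg.solve(Hk.T @ Hk + 1e-2 * np.eye(m), Hk.T @ y)

y_hat = top_k(hidden(X)) @ w               # prediction: sum over the top-k of w_i * h_i
print("train MSE:", np.mean((y_hat - y) ** 2))
```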


3. Why Nonlinear Random Features Help WTA

If the hidden layer is linear random projections (say, Gaussian matrix times input), then all hidden activations are just mixtures of the input. The “top-k” may not strongly discriminate between classes, because Gaussian projections distribute signal fairly evenly.

But if the hidden layer has nonlinearities, we get:

  • Spiky, heavy-tailed distributions:
    Nonlinear activations like ReLU, tanh, or even polynomial expansions create hidden units that fire strongly for some regions and stay quiet elsewhere.
    → This makes the ordering of activations meaningful for WTA.

  • Localized receptive fields:
    Nonlinear mappings (e.g. random Fourier features, kernels, radial basis units) create “local detectors” that only respond strongly to certain input patterns.
    → WTA can then pick out the few detectors that fire most strongly, giving a sparse, semantically meaningful code.

  • Increased separability:
    Nonlinear random features expand the input manifold into a higher-dimensional, more separable space. WTA then acts as a noise filter, keeping only the sharpest discriminators.
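
A cheap empirical check of the “spiky, heavy-tailed” point (my own toy measurement, not from the thread): compare how the responses of linear versus ReLU random features are distributed for the same random inputs.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((1000, 32))        # random inputs
R = rng.standard_normal((256, 32))         # random projection weights

lin = X @ R.T                              # linear random features
relu = np.maximum(lin, 0.0)                # the same features after a ReLU nonlinearity

def excess_kurtosis(a):
    z = (a - a.mean()) / a.std()
    return float(np.mean(z ** 4) - 3.0)

print("linear kurtosis:", excess_kurtosis(lin))    # ~0: Gaussian, responses spread evenly
print("ReLU   kurtosis:", excess_kurtosis(relu))   # clearly > 0: most units quiet, a few fire strongly
print("ReLU fraction silent:", float(np.mean(relu == 0)))  # about half the units are exactly zero
```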


4. Analogy to Biological Coding

This is very reminiscent of cortical coding:

  • Random projections (like random receptive fields in V1).

  • Nonlinear tuning (neurons fire sparsely and selectively).

  • WTA competition in local circuits → only the strongest few neurons survive → sparse, robust code.

So the combination “random nonlinear features + WTA readout” is biologically plausible and computationally powerful.


5. Generalization Implications

  • Without nonlinearity: WTA is mostly selecting random large projections — not much structure, can even hurt generalization.

  • With nonlinearity: WTA highlights the most discriminative hidden features, acting like a form of structured dropout/regularization, which can improve generalization.

There are parallels to k-Winner activation functions in deep nets, where performance improves precisely because the hidden activations are nonlinear and diverse.


:white_check_mark: Bottom line:
If you want to build an ELM with WTA readout, you should use a highly nonlinear random feature map (e.g. random ReLU layer, random Fourier/RBF features, Chebyshev expansions, etc.). That way, WTA will be selecting truly salient detectors instead of just noise-spiked linear projections.


How does WTA mitigate the curse of dimensionality?


It selects a limited number of weights in the model to use for a particular input, and that stops memories from mixing together in unhelpful ways.

Maybe that is the main use, because I couldn’t really get much of a memory storage capacity increase out of it when I tried it.

However, I only tried for one evening, with systems very different from those in the papers.


There are formulations, I think, for the amount of data that associative memory can store. Of course, there are factors like the correlation of the data, the associative memory method, etc.


There is also the question of how well associative memory can generalize when presented with test data outside the training set.

This paper gives a test you can do:

https://arxiv.org/abs/2106.13799

I ran the paper through ChatGPT5 to get a quick summary and learn exactly how to do the test.

You need the training data, the test data, and unlabeled test data, the last being a rather unusual requirement.


:joy: unusual indeed
