Distributed Representations (1984) by Hinton

What do you have to say about this report Distributed Representations (1984) by Hinton (2019 Turing Award)?

See also https://web.stanford.edu/~jlmcc/papers/PDP/Chapter3.pdf.

I noticed that Kanerva’s Sparse distributed memory is referenced in Properties of Sparse Distributed Representations and their Application to Hierarchical Temporal Memory, though.


Prophetic. Still a good read.


We’ve talked a lot about this stuff recently in research meetings. Watch:


I read the PDP books in the 1980’s and they still stand as one of the necessary foundation references in this field. You pointed to chapter 3 and a related paper. Both end with summaries of the “hard problems” that have to be solved to make effective use of distributed representations.

In the 1984 Hinton paper we find:
There are several very hard problems that must be solved before distributed representations can be used effectively. One is to decide on the pattern of activity that is to be used for representing an item. The similarities between the chosen pattern and other existing patterns will determine the kinds of generalization and interference that occur. The search for good patterns to use is equivalent to the search for the underlying regularities of the domain. This learning problem is not addressed here.

I won’t put words in their mouth but it seems that Numenta is addressing this representation issue by building from the bottom up.

I am working from the foundation offered by Calvin and diving right in at the association areas using my hex-grid coding scheme. I did not get here right away and if you had described this scheme to me when I was reading the PDP book I would have been bewildered - there are conceptual layers that have to be traversed to see how the concepts fit together.

The conceptual seeds of deep learning were clearly described in the PDP book, but it took decades to see the concepts come to full flower. You may not like point neurons, but you can’t denigrate the power of these tools. Language translation has been one of the “Holy Grail” targets of the machine learning community from the beginning, and while the current apps miss “the meaning” of the text, the utility of these tools has become indispensable in my job. I expect good things to come as Parsey McParseface filters into chatbot applications.

The distributed representation concept is just now reaching a stage of maturity where progress will start to accelerate. With large memory, fast processing units, and some interesting theoretical architectures, the seeds for the next step in implementing the PDP book ideas have been planted.

Get ready for a wild ride!


Or does the trio have the wrong basic handle on neural networks and AI in general?
That would be very embarrassing: a failure to understand the filtering/CLT aspects of neural networks, the way non-linear behavior introduces high-frequency components that are repeatedly smoothed down again, and other basic viewpoints.
And while the weighted sum has many nice properties as a linear associative memory, it also has terrible flaws in the form of additive/subtractive spurious responses.
An all-round failure to look at, calculate, and instrument the behavior of the basic elements.
It’s likely there won’t be another AI winter, but there may need to be a restart - especially getting a proper scientific, engineering, and mathematical grip on things right from the very basics.
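To make those additive/subtractive spurious responses concrete, here is a toy measurement (my own sketch - the sizes and the Hebbian outer-product storage rule are illustrative choices, not anything from the papers above): superimpose a few bipolar key/value pairs in one weight matrix, then recall with one key. The output is the stored value plus an additive crosstalk term contributed by the other pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 64, 5  # vector size, number of stored pairs

# Random bipolar (+1/-1) keys and values.
keys = rng.choice([-1.0, 1.0], size=(m, n))
vals = rng.choice([-1.0, 1.0], size=(m, n))

# Superimpose all associations in one weight matrix (Hebbian outer products).
W = sum(np.outer(v, k) for k, v in zip(keys, vals)) / n

# Recall with the first key: the stored value comes back, plus an
# additive spurious component contributed by the other stored pairs.
recall = W @ keys[0]
crosstalk = recall - vals[0]
print(np.max(np.abs(crosstalk)))            # almost surely nonzero
print(np.mean(np.sign(recall) == vals[0]))  # signs still mostly recoverable
```

Thresholding (taking the sign) usually cleans the answer up, but the spurious component is always present and grows with the number of superimposed pairs.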

The AI winter(s?) came from a profound failure to deliver anything useful for all the money poured into research. This go-round has delivered the goods, and there is still a positive correlation between research dollars spent and results. We have ignition; as long as this continues, progress will continue. There is now enough room in the arena for alternative lines of research to flourish without endangering the “bread and butter” applications that are paying the bills.


Distributed representations are all the rage now, but I think it’s gone too far. On the contrary, what we need is more localization, as in “encapsulation”: CapsNets or other versions of cortical columns.


Ignition. :grinning:
The (weighted) sums in a fully connected layer of a neural network take n squared operations. It seems to me all the sums can be replaced by a Walsh-Hadamard transform using n log2(n) operations. Anyway, it is working out in the code I am trying at the moment.
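For the record, here is a minimal in-place version of that transform in Python (the function name and the lack of normalization are my own choices) - the butterfly structure is what gives the n log2(n) add/subtract count:

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform (Sylvester ordering, unnormalized).
    Uses n*log2(n) additions/subtractions; n must be a power of two."""
    x = np.asarray(x, dtype=float).copy()
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b          # sums
            x[i + h:i + 2 * h] = a - b  # differences
        h *= 2
    return x

v = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0])
print(fwht(v))        # [ 4.  2.  0. -2.  0.  2.  0.  2.]
print(fwht(fwht(v)))  # n * v, since H @ H = n * I
```

Applying the transform twice recovers the input scaled by n, which makes a handy sanity check.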
Stephen Hawking went through relativity again right from the very basics - justifying, verifying and probing every aspect. That is missing from neural network research at the moment. Instead researchers are immobilized by a hand-me-down diagrammatic model from the 1950s or 60s.

I see using emojis gets the spell checker to work again on my particular browser, very useful.

As far as I am aware, the sensory streams maintain topology through to the association regions. We have discussed “crazy quilting” as the gross parcellation of topological maps but this is distinctly different from a convergence model. There is no “convergence” as the sensations flow from map to map, and the lateral reach of the components of the various layers is mostly very local. I think that the only way you could describe this is as a distributed representation.

As you can see from the “Crazy Quilting” thread, I have been exploring this concept with an open mind and have come back to the distributed representation as the best fit for the papers I looked at.

I invite you to offer a defense of the cortical hierarchical convergence model with any known biological observations. I don’t consider this as a “done deal” and am very willing to be shown support for a “spreading of representation” model in the actual biology.

I won’t argue that it is not possible to construct such a system - only that it is not observed in the biology.

Ultimate representations are definitely distributed in the brain; the only way to localize them is in local memory, separated from processing - which neurons simply don’t have, but we don’t need to emulate this handicap. Encapsulation in the brain is probably similar to localization in your mapping:

only I think this mapping may be many-to-many, not just one map to another map. But hierarchy is a separate issue. I didn’t go through the whole thread, but it seems to focus on lower areas: V1-to-V2. Organization doesn’t have to be hierarchical everywhere, hierarchical mappings interspersed with and on top of modality-specific lateral mappings would make the whole cortex hierarchical.

Well, how do you think this crazy quilting emerged? It’s not really about “spreading of representation” so much as spreading search: the learning that formed those quilted mappings - I don’t mean learning in the last hour, but also during early development, if not evolution. Mappings that spread far from input areas must’ve been learned by long-range search, hence hierarchical spreading. The problem is that search must be locally restricted at any given step of this expansion, else you get combinatorial explosion. But most of this macro-structural learning is over by the end of adolescence; what we see in the brain is the fossilized result.

I think convergence here also mediates divergence: broad concepts contain multiple levels of detail, which would also be transferred from lower to higher areas. I don’t think there is any dispute that higher areas, especially dlPFC and inferior parietal cortex, represent such broad concepts. And they can only be discovered by search across accumulated experience, even if this search is continuously pruned or localized. Going further, I think the default mode network is hierarchically higher, in the generality of its contents, than the task-positive network. But it’s pretty hard to study.


In digital signal processing terms a summing operation is a particular type of filter, one that accepts some inputs and rejects many others. And obviously the central limit theorem (CLT) strongly applies, producing a low-variance output from high-variance inputs.
It is sometimes overlooked that the CLT applies not only to sums but also to differences and mixtures of sums and differences.
The Walsh Hadamard transform is a collection of filters based on sums and differences and the CLT applies to each of its outputs as strongly as simple summing. I can see that CLT smoothing effect very clearly with the neural networks I have where the WHT replaces all the summing operations in a neural network layer.
Each filter of the WHT accepts some signals and rejects others.
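That smoothing is easy to instrument (my own toy measurement, not from any paper): push strongly non-Gaussian bipolar data through a normalized Hadamard matrix and compare excess kurtosis before and after - the sums and the differences both drive the outputs toward Gaussian.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1024

# Strongly non-Gaussian input: each component is +1 or -1.
x = rng.choice([-1.0, 1.0], size=(500, n))

# Build a Sylvester-ordered Hadamard matrix (a fast WHT would give
# identical results; the dense matrix is just simpler to show).
H = np.array([[1.0]])
while H.shape[0] < n:
    H = np.block([[H, H], [H, -H]])

# Normalized transform of each row.
y = x @ H.T / np.sqrt(n)

def excess_kurtosis(a):
    a = (a - a.mean()) / a.std()
    return float((a ** 4).mean() - 3.0)

print(excess_kurtosis(x.ravel()))  # about -2: far from Gaussian
print(excess_kurtosis(y.ravel()))  # about 0: close to Gaussian
```

Every output, whether built from sums or differences, ends up with a near-Gaussian distribution, which is the CLT point above.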
If you replace all the summing operators in a neural network layer with a WHT you have no means of incorporating the n squared weight parameters. However, as noted before, a number of viewpoints suggest that conventional neural networks use the weight parameters very inefficiently.
The thing to do, then, is to use only n weights - or rather n parameterized non-linear functions - as the input to the WHT. Those non-linear functions you have to pick carefully in terms of trainability and information flow/loss etc. Also you may choose to make the network a bit wider than usual, e.g. 2*W.
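A sketch of that arrangement as I read it (the two-slope non-linearity and the 1/sqrt(n) scaling are my illustrative guesses, not a quoted design): n individually parameterized non-linear functions, then a fixed WHT as the mixing stage, giving 2n parameters per layer instead of n squared.

```python
import numpy as np

def fwht(x):
    # Fast Walsh-Hadamard transform: n*log2(n) adds/subtracts, n a power of two.
    x = np.asarray(x, dtype=float).copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x

def wht_layer(x, alpha, beta):
    """One layer: per-element two-slope non-linearity (slope alpha[i] for
    x[i] >= 0, beta[i] for x[i] < 0), then a fixed normalized WHT.
    Only the 2n slope parameters would be trained; the mixing stage is free."""
    z = np.where(x >= 0, alpha * x, beta * x)
    return fwht(z) / np.sqrt(len(z))

n = 8
rng = np.random.default_rng(2)
x = rng.standard_normal(n)
y = wht_layer(x, rng.standard_normal(n), rng.standard_normal(n))
print(y.shape)  # (8,)
```

With alpha = beta = 1 the layer reduces to a plain normalized WHT, which is a convenient sanity check on the implementation.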

I guess you have a choice: use fixed non-linear functions and adjust the filters (weighted sums) to get the wanted responses, as in a conventional artificial neural network, or use fixed filter functions (e.g. WHT, FFT) and adjust the non-linear functions by parameterizing them.
Top researchers are heavily invested (career/status/pride) in the conventional arrangement, and if it were left to them it would be 30 years before they could cede anything. However, there are big commercial pressures - big electricity bills, big hardware bills and big competition - that may cause a more rapid reevaluation if improvements can be obtained by out-of-the-box thinking.