Obstacles to widespread commercial adoption of sparsity in the ML industry?

It’s also done more recently here:

2 Likes

Ah yes, I remember it well. Made a big impression at the time. Where is Marvin when you really need him?

1 Like

That’s a pretty big misunderstanding right there. I can easily concatenate both representations, and the result can be perceived as 3 by another ANN. (This is not going into what numbers are exactly and how they arise from nothingness, yadda yadda.)

I can’t agree at all. An ANN is a recogniser (not a representation). An ANN that recognises ‘1’ will do so for one kind of presentation of ‘1’ – perhaps as an image or text or a spoken sound – but no others. Combining recognisers may produce a new recogniser for some blend or combination of the original recognised items.

An SDR is (in theory) a representation of something (that can be recognised). A feasible design for a brain might use ANNs to recognise things, outputting them as SDRs. The thing that interests me is the next step: how to take the SDRs (recognised) for 1, 2 and plus and combine them to output an SDR for 3.

In humans (and apparently other critters including bees) you recognize spatial groupings for small integers. Think dice.

Combinations are not concatenations or additions but, instead, remembered mappings from the combined items to a result (see the sketch below).
Much of simple mental arithmetic shares this mechanism.
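Here is a minimal sketch of what such a remembered mapping could look like, as a plain lookup table (the entries and names are purely illustrative, not a claim about how the brain actually stores them):

```python
# Hypothetical sketch: mental arithmetic as remembered mappings of
# combinations rather than computed sums. Keys are recognised groupings
# (think dice faces); values are the recalled results.
arithmetic_memory = {
    ("two", "plus", "one"): "three",
    ("three", "plus", "two"): "five",
    ("two", "times", "three"): "six",
}

def recall(a, op, b):
    """Return the remembered combination, or None if it was never learned."""
    return arithmetic_memory.get((a, op, b))

print(recall("two", "plus", "one"))    # -> "three"
print(recall("four", "plus", "four"))  # -> None: no stored mapping, no answer
```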

With speech, you add naming and new tricks that come with the related symbolic representation.

2 Likes

Agreed with you here. An SDR is just a specification for a representation format; it says nothing about the algorithm for meaningfully transforming percepts and other information into that format.

Sure, in theory you can check for the union of concepts by unioning SDRs, but in practice you need to either learn or hardcode a mapping from percepts to bitvectors that are meaningfully distributed (nearby bits denote related concepts) and compositional (important separable degrees of variation are coded by separate bits). That is the problem ANNs and representation learning in general are aimed at tackling.
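To make that concrete, here is a minimal numpy sketch (the sizes and the random bit assignments are arbitrary assumptions): unioning and overlap-checking SDRs is trivial set arithmetic, but nothing in the format says which bits a given percept *should* activate.

```python
import numpy as np

N, ACTIVE = 2048, 40  # illustrative SDR size and number of active bits

def random_sdr(rng):
    """An SDR as a set of active bit indices. Random here, which is exactly
    the problem: the format doesn't say which bits should be on for a concept."""
    return set(rng.choice(N, size=ACTIVE, replace=False))

rng = np.random.default_rng(0)
sdr_cat, sdr_dog = random_sdr(rng), random_sdr(rng)

union = sdr_cat | sdr_dog              # union of the two concepts
overlap = len(sdr_cat & union)         # membership check by overlap
print(overlap == len(sdr_cat))         # True: 'cat' is fully contained in the union
```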

3 Likes

I think even the Facebook book seems to be at the statistical-mechanics level.
I think you have to go to an even lower level than that to start removing all the assumptions, so to speak.
Also, if you have tried interpolation in higher-dimensional space with, say, extensions of Chebyshev polynomials, the space is too big to get any kind of good result.
It would appear to me that the weighted sum is about the only thing simple enough to interpolate in any meaningful way in higher-dimensional space, contrary to the expectation that you could deal with the issue by throwing complexity at it.

Just for example, if you use a weighted sum to store three <vector, scalar> associations, you get smooth interpolation as you move between any two stored vector keys (see the sketch below).
And actually the geometry of those stored associations is quite interesting in a number of ways. Some of the details only become apparent when you use multiple <vector, scalar> associations to store <vector, vector> associations.
The assumption, though, is that all of this has been looked into by people in the past and there was nothing to report. But what if, following that chain of assumptions back decades, no one had actually done the scientific work at all?
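A minimal numpy sketch of the interpolation claim (the dimension, the three stored pairs, and the least-squares fit are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16                                # dimension of the key vectors
keys = rng.standard_normal((3, d))    # three stored <vector, scalar> associations
scalars = np.array([1.0, -2.0, 0.5])

# A single weighted sum w such that w . key_i ~= scalar_i (least-squares fit).
w, *_ = np.linalg.lstsq(keys, scalars, rcond=None)

# Moving linearly between two stored keys gives a smooth response in between:
for t in np.linspace(0.0, 1.0, 5):
    x = (1 - t) * keys[0] + t * keys[1]
    print(round(t, 2), round(float(w @ x), 3))   # interpolates 1.0 -> -2.0
```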

It seems dropout is a widely adopted ML technique that has consequences similar to sparsity of nodes, and DropConnect does something similar for connections.
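A minimal numpy sketch of the distinction (shapes and the drop rate are arbitrary): dropout zeroes whole unit activations, DropConnect zeroes individual weights.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(8)         # activations of one layer
W = rng.standard_normal((4, 8))    # weights into the next layer
p = 0.5                            # drop probability (illustrative)

# Dropout: zero whole units, rescale the survivors (inverted dropout).
unit_mask = rng.random(8) > p
y_dropout = W @ (x * unit_mask / (1 - p))

# DropConnect: zero individual connections instead of whole units.
weight_mask = rng.random(W.shape) > p
y_dropconnect = (W * weight_mask / (1 - p)) @ x
```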

Anything much more complex than the weighted sum in higher dimensions only allows very local interpolation. Of course there are basic issues with something as simple as the weighted sum; however, introducing even a small amount of prior nonlinear behaviour tends to solve that. Even the nonlinear behaviour of an imaging sensor would do.
Of course you would like to be able to switch weighted sums for different contexts to allow more sophistication (a quick sketch below). I said some things before about ReLU neural networks.
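A tiny numpy sketch of that switching view, under my reading of it: a ReLU layer is a bank of weighted sums whose active subset depends on the input, so different inputs effectively select different linear maps (the weights here are just random).

```python
import numpy as np

rng = np.random.default_rng(2)
W1 = rng.standard_normal((6, 4))   # six weighted sums over a 4-d input
W2 = rng.standard_normal((1, 6))   # readout weighted sum

def relu_net(x):
    h = W1 @ x
    gate = (h > 0).astype(float)   # which weighted sums are switched on
    return W2 @ (gate * h), gate   # output, plus the active "context" pattern

x_a, x_b = rng.standard_normal(4), rng.standard_normal(4)
_, gate_a = relu_net(x_a)
_, gate_b = relu_net(x_b)
print(gate_a, gate_b)              # different inputs select different linear maps
```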

It’s useful for small models, to be specific. Larger models don’t use dropout/DropConnect at all, and it doesn’t really help in some scenarios. It’s more of a coincidence that dropout resembles sparsity (and works well), but it’s mostly an old concept now, what with the new era of transfer learning and pre-training eliminating the need for it…

Dropout kind of looks like adding Gaussian noise to the input, which is a consequence of the Central Limit Theorem applied to the weighted sum.
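A quick numpy check of that intuition (the vector length, drop rate, and sample count are arbitrary): across many independent dropout masks, the dropped-out weighted sum clusters roughly Gaussian around the full weighted sum, which is what the Central Limit Theorem predicts.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 1000, 0.5
w, x = rng.standard_normal(n), rng.standard_normal(n)

full_sum = w @ x

# Resample the dropout mask many times; by the CLT the resulting weighted
# sums are approximately Gaussian, centred on the full (rescaled) sum.
samples = np.array([(w * (rng.random(n) > p) / (1 - p)) @ x for _ in range(5000)])
print(full_sum, samples.mean(), samples.std())  # mean ~ full_sum, plus noise
```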

As an outsider to all this, I have the unfortunate feeling that artificial neural network researchers are insensible to the foundational aspects of what they are doing.
Progress may still be made under such circumstances, of an inferior sort.

I suppose someone will elaborate the fundamental insights in an exceptionally belated book, eventually, sometime in the distant future.
I don’t expect it to be the near future, from what I’ve seen.

I think it’s much less being insensitive to that, and much more that the foundational theories that predict how DNNs behave haven’t been discovered yet, so empiricism is the only yardstick we have.

1 Like

It seems to be commonly used in relatively large GNNs. Maybe you are one of the obstacles the OP is referring to :wink:

1 Like

The architecture used for the largest DL models in history, the transformer, incorporates dropout.
It’s one of the oldest DL regularization techniques, but it has proven simple and effective enough to be used even to this day, regardless of the size of the model.

Speaking of transformers, the attention matrices end up consisting mostly of entries very close to zero, with only a few significantly large ones. Consequently, an output unit is affected by only a small subset of input units, which is essentially a form of sparsity.
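A rough numpy illustration of what that looks like (here I correlate the keys with the queries to loosely mimic a trained head; the sizes are arbitrary): after the row-wise softmax, most weights are tiny and each output row is dominated by a handful of inputs.

```python
import numpy as np

rng = np.random.default_rng(4)
T, d = 64, 64                               # sequence length, head dim (illustrative)
Q = rng.standard_normal((T, d))
K = Q + 0.3 * rng.standard_normal((T, d))   # keys correlated with queries (assumption)

scores = Q @ K.T / np.sqrt(d)
scores -= scores.max(axis=-1, keepdims=True)
attn = np.exp(scores)
attn /= attn.sum(axis=-1, keepdims=True)    # row-wise softmax

print("fraction of weights below 0.01:", (attn < 0.01).mean())
print("largest weight per row (mean):", attn.max(axis=-1).mean())
```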

3 Likes

Dropout is sometimes used in transformers, but these days it isn’t viewed as necessary. For example, Google’s new PaLM transformer (540B parameters) was trained without dropout.

You’re 100% right about the attention matrix being sparse, though.

2 Likes

I’d assume no one has considered dropout “necessary”, as it’s just a way of improving generalization with a limited amount of data, as opposed to HTM’s sparsity, which is crucial to getting it to work at all.
Regarding the sparsity of an attention matrix, I’d argue it’s necessary to some degree for the transformer to work, as dense/homogeneous attention matrices would amount to regressing back to an MLP with weight sharing and simple aggregation, like MLP-Mixers or PointNets do, which are proven to be much less powerful than transformers (see the degenerate case sketched below).
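To see the degenerate case concretely, a minimal numpy sketch (shapes illustrative): with a perfectly uniform attention matrix, every output row collapses to the same average of the value vectors, i.e. plain aggregation with no token-specific mixing.

```python
import numpy as np

rng = np.random.default_rng(5)
T, d = 8, 4
V = rng.standard_normal((T, d))             # value vectors

uniform_attn = np.full((T, T), 1.0 / T)     # dense, homogeneous attention
out = uniform_attn @ V                       # every output row is just mean(V)

print(np.allclose(out, V.mean(axis=0)))     # True: no token-specific mixing left
```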

4 Likes

I feel it’s more correct to say that attention heads in transformers automatically learn the required amount of sparsity implicitly. There was a paper recently showing that, even without explicit positional encodings, the model was able to learn them well enough to be usable, approaching its peers.

It’s this idea of learning everything implicitly which I find attractive in Deep Learning. IMO, implicit is better; explicit goes down the GOFAI path of hardcoding domain knowledge, which gives zero generalization. So I’d like everything in the algorithm to be implicit, and to learn to meta-learn :stuck_out_tongue_winking_eye:

PaLM, which @cfoster0 linked above, illustrates my point. Without being explicitly taught to do any of these tasks, given a couple of examples (2-shot) it’s able to explain complex jokes and recall and manipulate concepts to a certain degree:


It’s impressive that a system is able to learn to do this implicitly, understanding and bridging decently complicated concepts on its own, while still maintaining generalizability across dozens of other tasks all at once. Again, I find this very attractive and feel it to be much more useful in real-world situations than narrow/specific models.

So if sparsity is indeed key, well, the model should learn it itself :slight_smile: and if it’s consistently the case that such sparse models outperform dense ones, well, I suppose one would use those models then. It’s already the case with MoEs AFAIK, which were shown to be more scalable than their dense counterparts, but I’d need to double-check that.

Large in the context of DL means multi-billion-parameter models, like ViTs and GPT-3/LLMs, which certainly do not use dropout at all.

The reason is that such large models are already under-trained (except a 70B one from DeepMind), and dropout empirically doesn’t really help performance at all, so using it is pointless.

1 Like

SDRs as a way to use content-addressable memory was, I thought, an important point. That also often implies non-von Neumann architectures for speed.
An interesting recent paper on such compute-memory distinctions:

and obviously anything by Kanerva in the last 10 years on Hyperdimensional Computing also covers much of this.
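For a flavour of what content-based (rather than address-based) lookup means here, a minimal numpy sketch in the Kanerva spirit (the vector width, number of items, and noise level are arbitrary): retrieval finds the stored hypervector with the best overlap with the query content, not the one at a known address.

```python
import numpy as np

rng = np.random.default_rng(6)
N, items = 10_000, 50
memory = rng.integers(0, 2, size=(items, N))   # stored binary hypervectors

# Query: a noisy copy of item 7; we pretend we do not know the address 7.
query = memory[7].copy()
flip = rng.random(N) < 0.2                     # corrupt 20% of the bits
query[flip] ^= 1

overlaps = (memory == query).sum(axis=1)       # Hamming similarity to every item
print(int(np.argmax(overlaps)))                # -> 7, recovered from content alone
```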

1 Like

Reductionism always feels like this and is never a good argument. Why shouldn’t I torture your entire family, if they’re just a bunch of chemicals in a bag, deluded into thinking they are alive and conscious? If they’re just a bunch of cells? A collection of atoms? :thinking:

You can break down almost anything like this, but the truth is, it’s the interactions between simple components that yield the most complicated behaviour. As a reminder, Conway’s Game of Life is Turing complete.

The difference is that these simple matrix multiplications are able to show more and more sophisticated intellectual behaviour. Explaining jokes is not a joke: showing that it understands the wordplay and nuance of a joke, and is able to manipulate and interpret concepts, is not a small deal.

So far, there hasn’t been a single system that can do multiple tasks, explain jokes, answer questions, complete stories/documents/text, solve basic arithmetic, reason within limits and do literally hundreds of other things, all in a single package.

So yes, in essence, it’s nothing more than a few matmuls. But it’s the behaviour this leads to that puzzles and astounds even the best minds, and why it’s such an intriguing field to research…

1 Like