“Abundance of recently obtained datasets on brain structure (connectomics) and function (neuronal population activity) calls for a theoretical framework that can relate them to each other and to neural computation algorithms. In the conventional, so-called, reconstruction approach to neural computation, population activity is thought to represent the stimulus. Instead, we propose that similar stimuli are represented by similar population activity vectors. From this similarity alignment principle, we derive online algorithms that can account for both structural and functional observations. Our algorithms perform online clustering and manifold learning on large datasets.” Mitya Chklovskii
You can probably use this associative memory as a replacement for each neural layer:
https://github.com/S6Regen/Associative-Memory-and-Self-Organizing-Maps-Experiments
Speculating a bit, you could try having 2 randomly initialized associative memories. For each input you gradually train them toward the average of the 2 recalled vectors. That might be enough to get clustering. Or maybe you have to do something a bit smarter than that to get the separation of classes you see in the video.
Perhaps feedback is doing that. It seems like aligning the learning on IT (or coupling columns in IT) is what you have there. It just happens that the similarity matrices are “aligned” because physically close layer-6 neurons in higher regions (which will fire for similar samples in the matrix) will condition which mini-columns downstream will learn.
Unfortunately the word feedback doesn’t appear a single time there… I’m starting to wonder if there is an overdose of math around all these computational neuroscience ideas (and perhaps math is not the right tool for the task).
If you could disentangle real-world manifolds in such a way it would be a very powerful result, worthy of a prize. I’ll consider the matter from the perspective of associative memory (AM). If you have 2 AMs you would aim to get their outputs the same for each particular input. Each AM should presumably be randomly initialized. The exact mechanism for getting the AMs to agree would likely be important. I don’t know if that would result in exactly what Chklovskii is saying.
Anyway there is a connection to feedback alignment which may also help separate manifolds:
https://medium.com/@rilut/neural-networks-without-backpropagation-direct-feedback-alignment-30d5d4848f5
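To make the feedback alignment connection concrete, here is a minimal sketch of direct feedback alignment on a toy task. It follows the idea in the linked article (the output error reaches the hidden layer through a fixed random matrix `B` rather than the transposed forward weights backpropagation would use); the network size, learning rate, and XOR task are my own illustrative choices, not from the article.

```python
# Direct feedback alignment (DFA) sketch: the hidden layer's error signal
# comes from a FIXED random matrix B, never from W2.T as in backprop.
# Toy task (my choice for illustration): XOR with one hidden layer.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
y = np.array([[0.], [1.], [1.], [0.]])

n_hidden = 16
W1 = rng.normal(0, 0.5, (2, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(0, 0.5, (n_hidden, 1))
b2 = np.zeros(1)
B = rng.normal(0, 0.5, (1, n_hidden))   # fixed random feedback matrix

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for _ in range(20000):
    h = np.tanh(X @ W1 + b1)            # forward pass
    out = sigmoid(h @ W2 + b2)
    e = out - y                         # output error
    dh = (e @ B) * (1.0 - h ** 2)       # DFA: random B instead of W2.T
    W2 -= lr * h.T @ e / len(X)
    b2 -= lr * e.mean(0)
    W1 -= lr * X.T @ dh / len(X)
    b1 -= lr * dh.mean(0)
```

Despite the feedback weights being random and never updated, the forward weights tend to align with them over training, which is what makes the scheme work at all.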
Similarity alignment sounds more principled though. Also you will be able to buy 100Tbyte SSD drives for associative memory:
https://www.theverge.com/circuitbreaker/2018/3/19/17140332/worlds-largest-ssd-nimbus-data-exadrive-dc100-100tb
Extreme learning machines = associative memory (= reservoir computing):
https://www.quora.com/What-are-extreme-learning-machines
Is the last one somehow related to the lecture by Mitya Chklovskii?
I don’t think so. This is just a bunch of chips based on 64+-layer QLC NAND cells. The FTL and endurance in those disks will be hard to handle. With NAND-based devices it is impossible to do anything fancy below the block/page level (a few KB).
The extreme learning thing isn’t part of anything Chklovskii mentioned. I just noted that with either similarity alignment or feedback alignment you can replace a single neural network layer with an extreme learning machine (= associative memory) unit.
The 100 Tbyte SSD has a 5-year warranty for continuous read/write, with no data limit. Also, the way I organized the massive associative memory, you access say 100 or 1000 contiguous memory blocks per store/recall. So that will cost you 100 to 1000 times the latency (address setup time) of the drive, the time to transfer a block of data being negligible compared to the setup time. If that is still not okay you can switch to Intel Optane non-volatile memory, which has quite low latency.
ah ok,… I see.
I’m quite new on this forum, so I haven’t followed your research. May I ask, do you have some benchmarks of your approach, or is it still at a very early stage?
The ideas about associative memory all work fine. I’ve been experimenting with them for many years. The rate-limiting step is the Walsh-Hadamard transform, hence for benchmarking you just need to time that. Written in plain C you can get 1000 65536-point WHTs per second on a single CPU core, and it increases from there with more specialized assembly language instructions, and again using a GPU.
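For reference, the transform being timed is the standard in-place fast Walsh-Hadamard transform, which runs in O(n log n) for n a power of two. This Python/numpy version is just to show the algorithm; it is far slower than the plain-C figure quoted above.

```python
# Unnormalized fast Walsh-Hadamard transform (butterfly form).
# Applying it twice returns n times the original vector.
import numpy as np

def fwht(x):
    """Walsh-Hadamard transform of a 1-D array whose length is a power of two."""
    x = np.asarray(x, dtype=float).copy()
    n = x.size
    assert n and (n & (n - 1)) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):        # butterfly over blocks of size 2h
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h]
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x
```

An impulse input spreads over every output coefficient, e.g. `fwht([1, 0, 0, 0])` gives `[1, 1, 1, 1]`, which is why the transform makes a cheap global mixing step for associative memory.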
As to similarity alignment I’ve kind of finished the information gathering stage and will think about it for a while. There was the idea of the brain being a data reservoir with information racing around it on random paths. Then when an event occurred such as bright light and loud noise occurring together the brain could do unsupervised learning of intersecting activity from that event. A simple readout layer could then be trained to respond any time the intersecting activity occurred again. Disentangling followed by a readout layer as Chklovskii is doing in a somewhat different way.
If one way or another you can disentangle incoming information by unsupervised learning then all you need is a little supervised learning of a simple readout layer to program wanted responses.
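The "fixed features plus a small trained readout" recipe can be shown in a few lines. This is a generic extreme-learning-machine-style sketch (not Chklovskii's algorithm): the random hidden features never see the labels, and all the supervised learning is one closed-form least-squares fit of the readout.

```python
# Fixed random nonlinear features + least-squares readout.
# Toy data (my choice): XOR, which no linear readout on raw inputs can solve.
import numpy as np

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
y = np.array([0., 1., 1., 0.])

W = rng.normal(0, 2.0, (2, 64))             # fixed random projection
b = rng.uniform(-1, 1, 64)
H = np.tanh(X @ W + b)                      # random nonlinear feature layer
w, *_ = np.linalg.lstsq(H, y, rcond=None)   # the only supervised step
pred = H @ w
```

With 64 random features and 4 training points the readout fits exactly; the point is that all the "disentangling" happened in the fixed feature layer, and the supervised part is trivial.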
Another paper, this time from 2009 illustrating the same sort of ideas:
https://arxiv.org/abs/0906.5190
Or this one:
https://link.springer.com/content/pdf/10.1186/s41044-016-0010-4.pdf
There are a number of papers around the internet that directly demonstrate the usefulness of unsupervised feature learning followed by a readout layer, and then some other papers, such as those on feedback alignment, where that may be the actual mode of action. Very interesting. You can think up all kinds of schemes based on that notion, such as random networks of Bloom filters which can gradually build up very rich sets of unsupervised features. The output of each Bloom filter is then weighted and summed to give a readout value, with a little supervised learning to decide the weights.
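One way the Bloom filter idea might look in code (this is a hypothetical sketch of my own; the class name, hashing scheme, and parameters are all illustrative, not from any of the linked papers): each filter watches a random subset of input bits, memorizes the sub-patterns it has seen during an unsupervised pass, and outputs 1 when the current sub-pattern is familiar.

```python
# Hypothetical "Bloom filter feature" unit: memorizes sub-patterns of a
# random input subset (unsupervised); query() returns 1.0 when the current
# sub-pattern was (probably) seen before.
import zlib
import numpy as np

class SubPatternBloom:
    def __init__(self, n_inputs, subset_size, n_bits, n_hashes, seed):
        rng = np.random.default_rng(seed)
        self.idx = rng.choice(n_inputs, subset_size, replace=False)
        self.salts = [int(s) for s in rng.integers(1, 2**31, n_hashes)]
        self.bits = np.zeros(n_bits, bool)
        self.n_bits = n_bits

    def _slots(self, x):
        # deterministic hash of the watched sub-pattern, salted per hash
        key = zlib.crc32(bytes(x[self.idx]))
        return [(key ^ s) % self.n_bits for s in self.salts]

    def add(self, x):        # unsupervised step: memorize this sub-pattern
        for s in self._slots(x):
            self.bits[s] = True

    def query(self, x):      # 1.0 only if every hash slot is already set
        return float(all(self.bits[s] for s in self._slots(x)))

bf = SubPatternBloom(n_inputs=16, subset_size=4, n_bits=256, n_hashes=3, seed=0)
seen = np.zeros(16, np.uint8)
novel = np.ones(16, np.uint8)
bf.add(seen)
```

A bank of such filters over different random subsets gives a binary feature vector; weighting each filter's output and summing, with the weights fit by a little supervised learning, would be the readout the post describes.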
I find this paper interesting: https://openreview.net/pdf?id=HyFaiGbCW
They show that reservoir neural nets can generalize better than ordinary deep neural nets.
Although this paper was criticized by the reviewers.
You can dump all the input information into a reservoir at once or feed it in slowly.
If you feed it in slowly you can get digital-signal-processing-like filtering effects, or you might say complicated resonances. Just as throwing a stone into a lake will cause waves that can reflect off the edges and cause interference patterns. I suppose you could find some similarities between convolutional neural networks and standing waves set up by incoming data in a reservoir.
I use very simple reservoirs which are just vector-length-preserving linear random projections feeding back on themselves. The authors use nonlinear reservoirs.
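The length-preserving property is easy to demonstrate: use a random orthogonal matrix as the feedback map, so each step rotates the state without growing or shrinking it. This sketch uses an explicit QR-derived orthogonal matrix for clarity; the choice of dimension and step count is arbitrary.

```python
# A vector-length-preserving linear reservoir: the feedback matrix is a
# random orthogonal matrix, so iterating it never changes the state norm.
import numpy as np

rng = np.random.default_rng(2)
n = 64
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))   # random orthogonal matrix

state = rng.normal(size=n)
norm_before = np.linalg.norm(state)
for _ in range(100):
    state = Q @ state                          # pure feedback: a rotation
norm_after = np.linalg.norm(state)
```

In practice a Walsh-Hadamard transform preceded by random sign flips gives the same kind of norm-preserving random rotation in O(n log n) instead of the O(n²) matrix multiply here, which I assume is why the WHT was called the rate-limiting step earlier in the thread. An input can be injected each step (`state = Q @ state + u`) without losing the bounded-norm behavior, since the rotation itself neither amplifies nor damps.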
A reservoir containing nonlinear functions will throw away information about the input at each step. Obviously an extreme nonlinear function such as a binary threshold function will throw away most of the information even in one step. After a while the reservoir states will follow a fixed trajectory, with the input no longer able to influence any change in course. The same is true in deep neural networks. If you organize things in a reservoir or deep neural network so that information about the input is preserved or available again, then you have the ability to start from scratch. If, say, 50 layers into a deep neural network the network concludes “the answer I have come up with here is nonsense,” it can start again and find a better answer over the next 50 layers, because the input information is still available to do so.
And actually the situation with decision trees is similar. If you use, say, ID3 or something of that type, you effectively discard information about the input as you go along and can never use it again. However, there are ways of constructing decision trees using random projections where information about all the input is always available, so that, say, somewhere in the middle of the tree it can start again from scratch. That could be a little wasteful in some ways but extremely pragmatic in others.
Similarity alignment is said to have some extra capabilities beyond k-means clustering.
You can maybe relate that to some of Adam Coates work on invariant unsupervised feature learning:
https://youtu.be/wZfVBwOO0-k
Adam Coates speaks a lot like Sean Carroll the physicist.
It would be very interesting if you could use convergent sets of associative memory as a substitute for k-means.
@Sean_O_Connor if you are interested in Information bottleneck theory, this paper may be interesting for you: https://openreview.net/forum?id=ry_WPG-A-
OK, I’ll look at it.
I suppose there are dozens of ways of using associative memory (AM) for clustering.
-
Like I said, take 2 randomly initialized AMs and for each input train toward the vector-length-normalized sum of the 2 recalled vectors. And maybe make the training rate inversely proportional to the distance between the recalled vectors, to sharpen up the clustering.
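A minimal sketch of that scheme, under simplifying assumptions of my own: each "AM" is just a linear map whose recall is its normalized output, both maps are trained toward the normalized sum of their two recalls, and the rate shrinks with agreement as suggested. This is an illustration of the idea, not Chklovskii's algorithm, and the rate cap and constants are arbitrary.

```python
# Two randomly initialized linear "AMs" trained toward their consensus
# recall, with a training rate inversely proportional to their disagreement.
import numpy as np

rng = np.random.default_rng(3)
d = 8
M1 = rng.normal(size=(d, d))
M2 = rng.normal(size=(d, d))

def recall(M, x):
    r = M @ x
    return r / np.linalg.norm(r)

X = rng.normal(size=(200, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

def agreement(A, B, X):
    # mean cosine similarity between the two memories' recalls
    return float(np.mean([recall(A, x) @ recall(B, x) for x in X]))

before = agreement(M1, M2, X)
for _ in range(20):
    for x in X:
        r1, r2 = recall(M1, x), recall(M2, x)
        t = r1 + r2
        t /= np.linalg.norm(t) + 1e-9                  # consensus target
        rate = min(0.5, 0.05 / (np.linalg.norm(r1 - r2) + 1e-3))
        M1 += rate * np.outer(t - r1, x)               # pull each memory
        M2 += rate * np.outer(t - r2, x)               # toward the consensus
after = agreement(M1, M2, X)
```

Each update subtracts a multiple of the recall difference from both maps, so their disagreement shrinks monotonically in expectation; whether the resulting fixed points correspond to useful clusters is the open question in the post.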
-
Or actually do k-means clustering and assign each centroid a (random) orthogonal vector. Then train a single AM with each (input vector, centroid vector) pair. Maybe that would give some nice effects with decision boundaries/responses.
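A sketch of that second scheme on synthetic blobs (the data, initialization, and use of a plain linear least-squares fit as the "AM" are my own simplifying choices): cluster, give each centroid a row of a random orthogonal matrix as its code, then fit one map from inputs to codes.

```python
# k-means, then one linear "AM" mapping each input to its centroid's
# orthogonal code vector; decoding picks the code with the largest dot product.
import numpy as np

rng = np.random.default_rng(4)
d, k = 4, 3
centers = np.array([[5., 0, 0, 0], [0, 5., 0, 0], [0, 0, 5., 0]])
X = np.concatenate([c + rng.normal(0, 0.3, (50, d)) for c in centers])

# plain k-means, initialized with one point from each blob for simplicity
C = X[[0, 50, 100]].copy()
for _ in range(10):
    labels = np.argmin(((X[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
    C = np.array([X[labels == j].mean(0) for j in range(k)])

# one random orthogonal code vector per centroid (rows of an orthogonal matrix)
codes, _ = np.linalg.qr(rng.normal(size=(k, k)))
T = codes[labels]                             # target code for every sample
M, *_ = np.linalg.lstsq(X, T, rcond=None)     # the linear "AM": input -> code

def nearest_code(x):
    return int(np.argmax(codes @ (x @ M)))    # decode recall by dot product
```

Because the codes are orthonormal, the dot-product decoding is unambiguous, and the decision boundaries come out of one global linear fit rather than per-cluster distance tests.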
-
A batch-wise process: start with a randomly initialized (vector-to-vector) AM, then for a small batch of input vectors, and for one output element at a time, train that element to be either -1 or 1, depending on whether the recalled element value was positive or negative for one of the batch vectors. Which is some sort of feature learning.
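A sketch of that batch-wise sign training, with a delta rule standing in for whatever AM update the post has in mind (the learning rate, dimensions, and error measure are my own illustrative choices):

```python
# Batch-wise sign training of a random linear vector-to-vector "AM":
# each output element is pushed toward +/-1 matching its current sign
# on a randomly chosen batch vector, one element at a time.
import numpy as np

rng = np.random.default_rng(5)
d = 16
M = rng.normal(0, 1 / np.sqrt(d), (d, d))    # random vector-to-vector map
batch = rng.normal(size=(8, d))
batch /= np.linalg.norm(batch, axis=1, keepdims=True)

def binarization_error(M, batch):
    # how far the recalled elements are from +/-1 over the batch
    R = batch @ M.T
    return float(np.mean((np.abs(R) - 1.0) ** 2))

before = binarization_error(M, batch)
lr = 0.1
for _ in range(200):
    for j in range(d):                        # one output element at a time
        x = batch[rng.integers(len(batch))]   # one of the batch vectors
        r = M[j] @ x
        target = 1.0 if r > 0 else -1.0       # train toward the current sign
        M[j] += lr * (target - r) * x         # delta-rule step
after = binarization_error(M, batch)
```

Since the targets are the outputs' own signs, the rule amplifies whatever distinctions the random initialization happened to make, which is the "some sort of feature learning" flavor of the idea.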
A student paper on blocked direct feedback alignment:
http://eyvind.me/blocked-direct-feedback.pdf
I suspect feedback alignment is a form of unsupervised learning followed by a readout layer. The topology is somewhat off for doing that well, but nevertheless there are things you can learn from the idea. When I’m set up to do coding again I’ll try some variant topologies.