How does learning happen in the HTM algorithm?
I’m a little confused. In the article "The HTM Spatial Pooler – A Neocortical algorithm for online Sparse Distributed Coding", why is there only a test phase for the hot gym dataset, while there are both a training and a test phase for the MNIST dataset?
This paper trains on the MNIST database for one epoch, while it trains on the random-SDR database for 50 epochs.
My question is: why is learning in the SP algorithm different for different datasets?
Please help me to understand.
Without really knowing much about your background, I’d suggest that you take time to go through and understand this playlist. It builds up some of the fundamentals, basic operations, before exploring the encoding space, the spatial pooling algorithm, the temporal pool, and extending slightly beyond that.
Otherwise, wish you a peaceful weekend.
Thanks for your explanation. I think I did not understand what you meant.
There is a training and a testing phase for SP (as in MNIST classification), but only a testing phase for TM (as in hot gym prediction)? Is that because HTM neurons are meant to act like the human brain?
I did not understand why one training epoch was used in the MNIST classification problem, but 40 epochs were used in the training phase in random SDR problem. Please explain to me.
Thank you very much
At least on the MNIST dataset, the reason for having separate training and testing phases is that this is how the benchmark was designed: to compare various machine learning algorithms, all of which have to obey the same rule of using the first 60000 digits for training and the last 10000 for testing.
It is not a time series test and it is not meant to make time predictions, only classification. Classification in ML does not care about, and should be insensitive to, the order in which input patterns arrive.
TM is meant to find time-sensitive patterns in which the order of the input data is significant. The problem itself is to predict the next input pattern after the algorithm has been fed a series of predecessor patterns, so in this case order matters a great deal.
That’s why patterns need to be repeated during learning: so that the algorithm can figure out “cycles” within the data.
Just as the SP is shown many versions of the digits “2”, “3”, etc. in order to learn the differences between classes and the similarities within a class, the TM is fed several input cycles in order to “understand” what is similar and what is dissimilar in the input series.
Does learning occur in 1 epoch for the MNIST problem, or does it require more epochs? According to the code sent below, I got the classification accuracy in different epochs, but I do not know how many epochs are enough?
I expected that the higher the number of epochs, the higher the classification accuracy, but as you can see in the figure below, this did not happen.
Well, as you see it doesn’t improve much after a couple epochs. The top 1% is reached after a single one.
From HTM’s perspective one epoch should be sufficient considering the training dataset provides ~6000 samples for each digit.
In order to make sense of what all that means, you have to look more closely at how SP is used in the MNIST example.
The whole program performs classification in two layers. The first layer is the spatial pooler (SP) itself.
The second layer is the so-called SDRClassifier, which is not itself part of the HTM model but an online variation of a “classic” single-layer, multi-class perceptron, trained on the 6241-bit SDR produced by the Spatial Pooler.
This might sound a bit complicated.
What does all of the above mean, and how does Numenta’s two-layer neural network differ from a “normal” two-layer neural network?
Starting with a “classical”, equivalent two-layer neural network:
- there is an input layer (784 pixels of an MNIST image), a hidden layer (6241 values in our example) and a 10-value output layer representing digits 0 to 9.
- using prediction errors on training data and backpropagation, all weights in the first (input->hidden) and second (hidden->output) layers are adjusted together, incrementally, to reduce prediction errors.
What is important to recall is that at every prediction error, weights are adjusted in both of the above layers to improve the output. Backpropagation, in short, first “measures” how the last layer’s weights need to be changed in order to improve prediction, and based on that measurement moves one layer back and changes those weights too.
Now, how is Numenta’s example different?
- As I mentioned, the first layer (input->hidden) is the Spatial Pooler; the second (and last) one is the SDRClassifier, which is pretty much the same as in “normal”, backpropagated neural networks.
- The difference is that there is no backpropagation in between: the SP layer doesn’t know or care about the adjustments made within the last layer (SDRClassifier). The Spatial Pooler layer simply runs its own “inner learning”, which changes the weights in the first layer so that it produces more useful information (hidden output) for the second, classifier layer.
So there is no back propagation, which in classical neural networks is an algorithm designed to make incremental changes in multiple (more than 1) layers from output, back through all existing layers down to the input layer.
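To make the contrast concrete, here is a minimal pure-Python sketch of the two-stage arrangement described above. It is not Numenta's code: `ToyPooler`, `ToyClassifier`, the 0.5 connection threshold and the increment/decrement sizes are all illustrative assumptions. The only point is that the first stage updates itself from its input alone, while the second stage is trained on labels and never sends an error signal back.

```python
import random


class ToyPooler:
    """Hypothetical stand-in for the first layer: learns from inputs only."""

    def __init__(self, n_inputs, n_columns, k, seed=0):
        rng = random.Random(seed)
        self.k = k  # number of winning columns per input
        # One permanence per (column, input bit), started near the assumed threshold 0.5.
        self.perm = [[rng.uniform(0.4, 0.6) for _ in range(n_inputs)]
                     for _ in range(n_columns)]

    def compute(self, x, learn):
        # Overlap = number of active input bits seen through connected synapses.
        overlaps = [sum(1 for bit, p in zip(x, row) if bit and p >= 0.5)
                    for row in self.perm]
        winners = sorted(range(len(overlaps)),
                         key=lambda c: (-overlaps[c], c))[:self.k]
        if learn:
            # Local rule: winners reinforce synapses to active bits, weaken the rest.
            for c in winners:
                self.perm[c] = [min(1.0, p + 0.05) if bit else max(0.0, p - 0.02)
                                for bit, p in zip(x, self.perm[c])]
        return winners


class ToyClassifier:
    """Hypothetical second layer: a multi-class perceptron on the pooler output."""

    def __init__(self, n_columns, n_classes):
        self.w = [[0.0] * n_columns for _ in range(n_classes)]

    def predict(self, active_columns):
        scores = [sum(row[c] for c in active_columns) for row in self.w]
        return scores.index(max(scores))

    def learn(self, active_columns, label):
        guess = self.predict(active_columns)
        if guess != label:
            # The error only touches the classifier's own weights.
            for c in active_columns:
                self.w[label][c] += 1.0
                self.w[guess][c] -= 1.0


pooler = ToyPooler(n_inputs=8, n_columns=16, k=4)
classifier = ToyClassifier(n_columns=16, n_classes=2)
x = [1, 1, 1, 1, 0, 0, 0, 0]
active = pooler.compute(x, learn=True)   # stage 1 learns without any label
classifier.learn(active, label=1)        # stage 2 learns from the label
```

Nothing in `ToyClassifier.learn` can reach `pooler.perm`, which is the structural difference from a backpropagated two-layer network.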
OK, so here is my potentially biased summary:
- you can’t replicate (as we don’t really know) how we humans go through the whole chore of learning new stuff like handwritten digits.
- Numenta made some biologically informed assumptions about how a small patch of the cortex does “learning”, and implemented them in SP.
- In order to test that their assumptions are plausible, and without the capability to simulate a whole cortex, they use the SDRClassifier layer as a shortcut to show that SP does its tiny part of the whole, currently unknown, learning pipeline:
After it is exposed to and “learns” a dataset, it produces an output more useful to whatever comes next (the following layers) than its input itself.
So the SDRClassifier is doing the actual “classification”; the Spatial Pooler only adjusts itself (by “learning”) to transform its input data into a more … “meaningful” representation.
Two key points here:
- The SDR classifier makes better predictions when it “sees” the output of SP than when it sees the original digits themselves.
- The overall prediction improves after SP does its own internal “learning”, which shows that there is a learning process involved and not just a random projection of the 784 input points into a 6241-point SDR in the hidden layer.
What really bugs me about the MNIST example is that the spatial pooler breaks rule (3) in Numenta’s own presentation, namely:
These properties include …
(3) forming fixed sparsity representations
In the MNIST example the untrained SP produces 5% sparse output, but the classification improvements are accompanied by a higher density of the spatial pooler output.
In order to sustain their whole argument about how Sparse Distributed Representations are better (=necessary and sufficient) than “dense” representations, I think the SP in the example should NOT increase the density of its output SDR from 5% to 7% (or even >9% in higher ranking results) during learning.
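For reference, the numbers in this argument can be checked with a few lines of Python. The sketch below assumes a plain global k-winners-take-all selection, which is one simple way to keep output sparsity fixed. The `k_winners` and `sparsity` helpers are hypothetical and are not claimed to be what the MNIST example actually runs.

```python
def k_winners(overlaps, k):
    """Return the indices of the k columns with the highest overlap."""
    ranked = sorted(range(len(overlaps)), key=lambda i: (-overlaps[i], i))
    return set(ranked[:k])


def sparsity(active_indices, size):
    """Fraction of the `size` bits that are active."""
    return len(active_indices) / size


SDR_SIZE = 6241                              # hidden-layer size quoted in the thread
bits_at_5_percent = round(0.05 * SDR_SIZE)   # about 312 active bits
bits_at_7_percent = round(0.07 * SDR_SIZE)   # about 437 active bits

# With a fixed k, the output sparsity cannot drift, whatever the overlaps are.
overlaps = [(i * 37) % 101 for i in range(SDR_SIZE)]  # arbitrary stand-in scores
active = k_winners(overlaps, bits_at_5_percent)
print(sparsity(active, SDR_SIZE))
```

Going from 5% to 7% therefore means roughly 125 more active bits per digit, which is the drift being objected to here.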
Perhaps what is missing from the spatial pooler training algorithm is a sparsity prior. For example something like an L1 or L0 regularization (penalty) term. If it’s simply optimizing on reducing the number of errors produced by the SDRclassifier, then it may be allowed to use as many (or few) connections/activations as it needs to attain that goal.
There’s no feedback from the SDRClassifier back to the SP, which is totally blind to who or what is consuming its output SDR. Which is fine: it proves that a brain-inspired algorithm does learn something without crutches from other component(s) with much weaker biological justification.
Thanks a lot for your excellent explanation and your kindness. Can you also explain learning in TM (for example, hot gym database)? How is the next data predicted in the TM algorithm? How can this algorithm learn data without having a training step?
For applications where the data is not in a huge database but is rather streamed from a sensor or other source of real-time data, training is done online, or as the data arrives. In this case, training must be done on the data as it is seen by the network, hence the term online training. (Or online learning, but that has different connotations.)
The network is typically initialized with no connections or some random connections with permanences set very close to the threshold value. HTM uses permanences to determine if a synapse is connected rather than weights to determine its strength because real synapses are more binary (i.e. more like a switch than a transistor). If the permanence is at or above the threshold value, then it is able to transmit activations from the presynaptic neuron to the postsynaptic neuron.
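As a minimal sketch of the permanence idea, assuming a threshold of 0.5 and hypothetical helper names (`is_connected`, `overlap`), the difference from a weighted synapse is that a permanence only decides whether the synapse counts at all:

```python
CONNECTED_THRESHOLD = 0.5  # assumed value, for illustration only


def is_connected(permanence, threshold=CONNECTED_THRESHOLD):
    """A synapse transmits only if its permanence is at or above the threshold."""
    return permanence >= threshold


def overlap(permanences, presynaptic_active, threshold=CONNECTED_THRESHOLD):
    """Count active presynaptic cells that reach the neuron through connected synapses."""
    return sum(1 for p, active in zip(permanences, presynaptic_active)
               if active and is_connected(p, threshold))


permanences = [0.49, 0.50, 0.51, 0.90]
inputs = [True, True, True, False]
# Prints 2: the 0.49 synapse is not connected, and the 0.90 synapse has no active input.
print(overlap(permanences, inputs))
```

Note that the 0.90 synapse contributes no more than the 0.50 one would: above the threshold, the exact permanence value does not scale the signal.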
Training occurs by increasing or decreasing the permanence based on whether or not the presynaptic neuron fired in close temporal proximity to the postsynaptic neuron. The permanence value for a synapse is increased when the presynaptic neuron fires just before the postsynaptic neuron (Hebbian learning) and is decreased when the reverse happens (anti-Hebbian).
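The timing rule just described can be sketched as below. The function name, the one-step window and the increment/decrement sizes are assumptions chosen for illustration, and time is treated as an integer step counter.

```python
def update_permanence(permanence, pre_time, post_time,
                      window=1, increment=0.10, decrement=0.05):
    """Adjust one synapse's permanence from the relative firing times.

    pre_time / post_time are the time steps at which the presynaptic and
    postsynaptic neurons fired.
    """
    delta = post_time - pre_time
    if 0 < delta <= window:
        permanence += increment      # pre fired just before post: reinforce
    elif -window <= delta < 0:
        permanence -= decrement      # post fired just before pre: weaken
    # Firing far apart in time (or simultaneously) leaves the synapse unchanged.
    return min(1.0, max(0.0, permanence))
```

Starting from 0.45, a single reinforcing event moves the permanence to about 0.55. With a threshold of 0.5 that is enough to take the synapse from disconnected to connected in one step, which is why initial permanences are set close to the threshold.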
It is possible to have both excitatory and inhibitory neurons in the same network, in which case the excitatory neurons would encourage a postsynaptic neuron to fire, while the inhibitory ones would decrease that likelihood.
In HTM, there are typically two types of synaptic connections which are associated with the type of dendrite on which they reside: proximal and distal, and sometimes a third: apical. Each of these serves a distinct purpose.
The synapses on the proximal dendrites are typically considered to be the feed-forward connections. These are the synapses that will typically cause the neuron to fire when enough synapses are active within a small interval of time.
The distal dendrites are more recurrent in nature, in that their connections are often made between neurons in the same layer and/or region (as opposed to the feed-forward connections, which are more likely to be between neurons in different layers and/or regions). These distal connections will typically not cause the neuron to fire directly, but will instead bias the neuron so that it will either fire with fewer activations on its proximal dendrites, or fire more quickly than other nearby peer neurons when their proximal activations arrive. In the latter case, the neuron that fires first is able to generate a local inhibitory signal that suppresses the activation of other peer neurons in close proximity. We describe such a neuron as having been placed in a predictive state by the activity on its distal dendrites. When the proximal input arrives, the predicted neurons will have the opportunity to fire before any of their peers, and thus reinforce their connections while preventing other nearby neurons from taking on a similar role in the network.
The apical connections (if present) are typically associated with top-down inputs (e.g. hints for what to expect from further up in the network hierarchy). They function in a manner similar to the distal dendrites, although they can sometimes be made to generate neural activations if their signal is strong enough. (As far as I know the activation part is still speculative, but may explain various observed behaviors.)
The network generated by these two or three connection types is able to learn about its input and generate predictions of what it expects to see on the fly. When a novel sequence of inputs arrives, many neurons may receive enough activations on their proximal dendrites to activate. When this happens, there is competition to determine which ones will form distal connections to other nearby peer neurons that were recently active. If some of these neurons are able to form enough distal connections, then the next time that sequence arrives they will be able to detect the precursor patterns through the distal activations and thus be put into a predictive state. This will then allow them to fire before their peers and further strengthen their role in detecting and encoding that particular sequence of inputs. This also frees up the other nearby peer neurons that could also have become a part of that sequence to be used in other sequences.
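The predictive-state mechanism above can be sketched for a single group of peer neurons as follows. This is a simplification under stated assumptions: `predictive_cells` and `activate_column` are hypothetical helpers, a "column" stands for a set of peers sharing the same proximal input, and the case where every cell fires is what the HTM literature usually calls bursting.

```python
def predictive_cells(distal_segments, previously_active, threshold):
    """Cells with at least one distal segment seeing enough recently active cells.

    distal_segments maps a cell id to a list of segments; each segment is the
    set of presynaptic cell ids it is connected to.
    """
    return {cell for cell, segments in distal_segments.items()
            if any(len(segment & previously_active) >= threshold
                   for segment in segments)}


def activate_column(column_cells, predicted):
    """Pick which cells fire in a column that just received proximal input."""
    winners = [cell for cell in column_cells if cell in predicted]
    if winners:
        return winners            # predicted cells fire first and inhibit their peers
    return list(column_cells)     # nothing was predicted: every cell fires


segments = {"b1": [{"a1", "a2"}]}          # b1 has learned to follow a1 + a2
column = ["b1", "b2", "b3"]

known = predictive_cells(segments, {"a1", "a2"}, threshold=2)
print(activate_column(column, known))      # only b1 fires

novel = predictive_cells(segments, {"a9"}, threshold=2)
print(activate_column(column, novel))      # b1, b2 and b3 all fire
```

In the novel case, the cells that fired would then compete to grow distal connections to the recently active cells, so that the same context is predicted next time.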
It might be interesting to set up a small experiment where a result from psychology was reproduced. I am thinking of this paper:
Very old study that showed consciousness/attention/awareness (pick one, not really significant here) is not required for learning.
It’s been a while since I’ve looked at the SP implementation. What is the learning rule or objective function that is being optimized by the spatial pooler? Is it simple Hebbian learning, or does it also include the boosting term to help improve the overall utilization of all of the available neurons? (Or is it something else?)
I have no idea; all I have is a black-box-user perspective. All I can say is that the function you ask for is within the black box.
Or at least that is how SP is used in the MNIST example.
In the hot gym example TM is supposed to play the role of a real-time anomaly detector. This might sound confusing, so what does it actually mean?
Assume you have a car factory and want a high-level AI watching the whole process to check whether something isn’t running correctly.
Now let’s narrow this view down to the paint drying oven. That’s like a garage with electric heaters all around, into which a freshly painted chassis slides, is heated for a while, ventilated to trap noxious fumes and left to slowly cool down.
Finally it is pulled out of the oven and a new freshly painted chassis repeats the cycle.
If this were the whole story, it would be simple to record one “correct” cycle and test all following ones against it, but real factory life isn’t that simple. There are several types of cars, paints and car parts that need different temperature curves and timings.
There are scheduled production breaks - like nights, shift changes and weekends - or unscheduled production breaks which are important.
All these overlap in a pattern that is much harder to code manually than the simply repeating one-type-of-chassis, one-kind-of-paint cycle.
An anomaly detector (AD) is supposed to watch a few variables over time (e.g. temperature, noxious fumes), learn the actual cycles within the process, assume (unless told otherwise) that it sees a “normal” process, remember it, and raise a flag when anything unexpected happens from the AD’s point of view, so that a human or a higher-level program or AI gets notified.
Of course the AD will raise the flag every time something in the process is changed, but that is fine: the supervising humans/programs would expect anomalies to be detected all over the factory every time the production changes.
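One common way HTM-style detectors turn “unexpected” into a number is the fraction of currently active columns that were not predicted one step earlier. The sketch below assumes that definition; the function name is hypothetical and the inputs are plain sets of column indices.

```python
def raw_anomaly_score(active_columns, predicted_columns):
    """Fraction of active columns that were not predicted (0.0 = expected, 1.0 = surprise)."""
    active = set(active_columns)
    if not active:
        return 0.0                      # nothing active, nothing to be surprised by
    unexpected = active - set(predicted_columns)
    return len(unexpected) / len(active)


print(raw_anomaly_score({1, 2, 3, 4}, {1, 2, 3, 4}))  # fully predicted input
print(raw_anomaly_score({1, 2, 3, 4}, {1, 2}))        # half of the input was unexpected
print(raw_anomaly_score({1, 2, 3, 4}, {7, 8}))        # nothing was predicted
```

How high the score must be before the flag is raised is a separate, application-specific choice.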
That was too long a story. What is important to notice is that the anomaly detector cannot be programmed or trained in advance, because we cannot know how processes will change (new parts, new paints, changes in the production line, etc.), so we cannot have real-life data about the paint drying oven of the future.
The anomaly detector has to learn these changes and very quickly, preferably from a few examples of new versions on the cycle.
And that’s why the TM isn’t fed “training” and “test” data sets: real life happens online. Every new measurement is used for both testing and training.
TM could be trained the “classic ai” style - with thousands or millions of successive measurements, and thousands of test measurements, but we cannot have that in a production line.
It has to handle (==learn and predict) online data, updating its internal predictor with every change in the input pattern, because the production line can not afford hundreds or thousands of failed parts just to teach the anomaly detector which is a “good” and which is a “bad” cycle.
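The “test and train on every measurement” loop can be illustrated with a deliberately tiny stand-in. `FirstOrderMemory` below is a hypothetical toy that only remembers which value followed which; it is not the TM algorithm, but it is driven the same way: each new value is first checked against the current prediction and then immediately learned.

```python
class FirstOrderMemory:
    """Toy online learner: remembers which value followed which (not a real TM)."""

    def __init__(self):
        self.transitions = {}   # previous value -> set of values seen right after it
        self.previous = None

    def step(self, value):
        # 1. Test: was this value expected after the previous one?
        expected = self.transitions.get(self.previous, set())
        anomaly = 0.0 if value in expected else 1.0
        # 2. Train: record the transition that was just observed.
        if self.previous is not None:
            self.transitions.setdefault(self.previous, set()).add(value)
        self.previous = value
        return anomaly


stream = [1, 2, 3] * 3 + [1, 9, 3]      # three normal cycles, then a disturbed one
memory = FirstOrderMemory()
scores = [memory.step(v) for v in stream]
print(scores)
```

The first cycle is flagged throughout because nothing is known yet, the repeated cycles quickly become expected, and the disturbed cycle is flagged at the `9` and at the value after it. There is no separate training set anywhere: the disturbed cycle has already been learned by the time it ends.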