Application of HTM in today’s ML frameworks

Nice, very interesting. Definitely want to see the results now. But yeah, I have seen the entire HTM School series.

1 Like

I’d take a guess that if y - y^hat is greater than a certain threshold, then the connections activated in a given forward pass in the SP are weakened, and vice-versa. But I guess we’ll see, won’t we.
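Something like this is what I have in mind — a minimal sketch of that guess, where all the names, the threshold, and the learning rate are purely illustrative (not from any paper or codebase):

```python
import numpy as np

def sp_error_update(perms, active_synapses, y, y_hat,
                    err_threshold=0.1, lr=0.02):
    # Hypothetical rule: if the prediction error exceeds the threshold,
    # weaken the synapses that fired on this forward pass; otherwise
    # strengthen them.
    delta = -lr if abs(y - y_hat) > err_threshold else lr
    perms[active_synapses] += delta
    np.clip(perms, 0.0, 1.0, out=perms)
    return perms
```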

I suspect (but haven’t tested) that this would create noise-resistant feature pooling. I’ve been wanting to do this with image classification, but as life works, I just haven’t gotten around to trying it.

It’s impossible to backpropagate through the true gradient of a binary step function, as the gradient is 0 everywhere except at the threshold, where it’s infinite. Nothing makes sense.
Especially for the SP, where the threshold, or even the activation function itself, isn’t well defined, since activation is determined by competition rather than directly by the input. It’s just a bit hard to formulate as a mathematical function, although I think it’s possible.
So I used k-winners with linear activation, in a similar manner to a max pooling layer in CNNs, and used a similar derivative as well.
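For concreteness, here’s a rough sketch of that kind of k-winner layer in PyTorch — an illustration of the idea, not the exact code I used. The forward pass keeps only the k largest activations, and the backward pass routes gradients only through those winners, just like max pooling does:

```python
import torch

class KWinners(torch.autograd.Function):
    """k-winners-take-all with a max-pooling-style derivative."""

    @staticmethod
    def forward(ctx, x, k):
        # Keep the k largest activations per sample, zero out the rest.
        topk = torch.topk(x, k, dim=1)
        mask = torch.zeros_like(x)
        mask.scatter_(1, topk.indices, 1.0)
        ctx.save_for_backward(mask)
        return x * mask

    @staticmethod
    def backward(ctx, grad_output):
        # Like max pooling: gradients flow only through the winners.
        (mask,) = ctx.saved_tensors
        return grad_output * mask, None

# usage: sparse_out = KWinners.apply(activations, 40)
```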

I assume what you’re suggesting is a classic SP (permanences, not weights; thresholds, not biases) with just the learning rule tweaked to take advantage of backprop.
It seems a tad weird to me, as I don’t know why you would choose the winner columns (nodes?) at the backpropagation (feedback) stage when it’s already been decided at the feedforward stage.
I’m not sure if I’ve got it right. Correct me if I’m wrong.

I assume the yhat is the binary activation determined by competition.
I have kinda tried this approach, only for inactive columns, with a threshold of zero.
But somehow during training, the columns lose their semantic meaning and become more and more vague about what they’re representing, so I ditched the idea.
I guess that might not be the case if you do this for every column instead of only the inactive ones.
Also, what you’re suggesting doesn’t backpropagate gradients from the layers after the SP layer, resulting in a failure to learn efficiently beyond the SP layer.
The easiest solution would be to just add in the gradients from the layer after the SP layer; it’s common practice in DL when there are per-layer constraints like this one.

What I’m suggesting is that the SP layer wouldn’t even be factored into the gradient system, and would instead be divorced from backpropagation. If the results were correct, the columns that assisted in that correct guess would be strengthened, while when answers are incorrect, the SP columns could either be left alone, or even weakened.

Thus (with correct guess):

  1. Input “cat” picture.
  2. CNN, including SP layers, processes the data.
  3. Network guess == “cat”.
    3a. Normal DL layers are updated via backprop
    3b. SP layers are updated according to which columns participated in the correct answer (voting for column selection via 1-2 sigma standard deviation or other methodology)
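
To make 3a/3b concrete, here’s a hedged sketch of what that two-track update might look like. Everything here is a placeholder name; in particular `sp_layer.reinforce` is a hypothetical method standing in for whichever column-level strengthen/weaken rule you pick:

```python
import torch

def train_step(model, sp_layer, x, y, optimizer, loss_fn):
    logits = model(x)                  # forward pass runs through sp_layer too
    loss = loss_fn(logits, y)
    optimizer.zero_grad()
    loss.backward()                    # 3a: normal DL layers learn via backprop
    optimizer.step()

    correct = bool((logits.argmax(dim=1) == y).all())
    # 3b: the SP layer only hears "correct" or "incorrect" and adjusts the
    # columns that participated; it never sees the gradients.
    sp_layer.reinforce(correct)
```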

There are certainly folks smarter than myself, who given time and resources can surely figure out how to make this work :wink: … and it seems that they already may be on the way to doing so.

1 Like

So is it an adversarial learning between SP and DL layers?
I’m not quite sure if I’m picturing right.
I don’t know how the information would flow through such a structure.
Could you draw a simple diagram or something if you don’t mind?

I’ll do that tomorrow. After midnight here, so off to bed.

Pretend that the neurons in your network are just like a bucket brigade, passing buckets of water through the crowd. You’re at the end of the line giving feedback and direction to people on the line. The main idea of my thought is that the SP is just a random, helpful stranger in the middle of your crowd of neurons… He doesn’t care about your gradients or your backpropagation values at all. He’s just there doing his own thing while helping you out… At the same time, if you (during the backprop process) shout at all your neurons to tell them that they were wrong, he’ll take a moment to look at what he did, and maybe do less of that. If you’re shouting at all your neurons that they did great, again he’ll take a look at what he did, and strengthen that.

The backprop stage, as far as this SP stranger is concerned, is just a moment for self-reflection, to determine whether what he did in this pass was helpful or not. He doesn’t care about what gradients you assign to all the neurons in the bucket line, because they don’t apply to him. You don’t control him so much in that way. He only cares about himself, and the binary (maybe trinary: “good”, “neutral”, “bad”) state of whether he helped or hurt the overall goal.

Last edit:
Imagine though, that you have one of your neurons at the front, receiving end of the line, with a whole bunch of these SP strangers helping out… at that point, you would have a little bit of game theory starting to take place.

Morning edit:
Maybe a better analogy would be to think of a college classroom. Your traditional DL neurons would be graded students, who receive information, take tests/quizzes/assignments, and receive a grade. Based on that grade, they adjust appropriately (low grades inspire harder work; high grades prompt very little adjustment). Then you have those folks who are auditing the class. They might participate in discussions and other activities, contributing to the overall success of the classroom experience, but they’re essentially self-monitoring: an instructor might take a quick look and say “That looks good” or “That’s not right”, but no percentage or grade is assigned to any of their work.

1 Like

The whole purpose of backpropagation is just to move the probability in the right direction. If you know the direction you need when the input is at or above the inflection point, then you give it the amount you want to adjust, and likewise for below. Sure, mathematically a vertical line’s slope is undefined or infinite, but we aren’t looking for the slope of the on/off step itself; we are looking for a slope that corrects toward a maximum of 1 and a minimum of 0, given a threshold or inflection point of 0.4 or whatever. We don’t need to stick ourselves in a box; we can just account for the possible states and correct toward them.
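In code, that attitude is basically a surrogate (straight-through-style) gradient for the step function. A minimal sketch — the 0.4 threshold and the window width are just illustrative values:

```python
import torch

THRESHOLD = 0.4  # illustrative inflection point

class BinaryStepSTE(torch.autograd.Function):
    """Hard threshold forward, surrogate gradient backward: instead of the
    true derivative (zero everywhere, undefined at the threshold), we pass
    the error back so the input gets nudged toward 1 or 0."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return (x >= THRESHOLD).float()

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Only correct inputs within a window around the threshold; units far
        # above or below are treated as saturated.
        window = ((x > THRESHOLD - 0.5) & (x < THRESHOLD + 0.5)).float()
        return grad_output * window
```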

For a moment I didn’t get what you were saying, but then I realized: I was assuming you would only choose the winners after the backpropagation, and that forward propagation would just be normal matrix multiplication with linear activation. But it sounds like your thought is that you choose the winners AS the activation function, which I hadn’t considered. That’s an interesting take on how it might work. I was thinking the winners would be calculated by how close they were to the answer; I’ve been in a different mindset of softmax selection and winner-take-all algorithms for the past two and a half months, so that’s where my bias was coming from.

After thinking about it just now, I can see how either one could mess up the math. I’d have to run actual tests to see what the results would be; I’m running up against the limit of what scenarios I can run in my head. There is a lot of stuff I haven’t accounted for, which I suppose is fine. It’s mostly just Numenta research paper conspiracy theories rattling around up there at this point :smile:. I’m gonna be writing on the walls and eating scooby snacks here pretty soon.

But I’m too lazy to test any of this out since it’s already been done and I have a million other things I need to do anyways, so I’ll just wait for the paper to come out and read it then. Anyways have a nice night.

2 Likes

I think I get it now. Thanks for the explanation. :smiley:
That’s such an interesting concept! I’ve never thought of something like that before.
It’s like the SP working along with a DL layer and learning from/with one another, rather than from backprop directly.
I can see it could also form a hierarchy independent of DL layers after learning.
I wonder if it’d actually work. I would certainly like to try!

That’s such a nice interpretation of backprop! I was kinda aware of that, though. I’m also aware there are actual binary DL networks that are trained in such a manner. I just wanted to point that out about gradient descent in its pure definition. I’m sure most people are aware that even ReLU, the most widely used activation function, is itself not differentiable at zero.

Choosing the winners during learning is an interesting concept, but I wonder if it would make a significant difference.
And I suppose what you’re referring to as linear activation is ReLU-like (linear growth, with a nonlinearity at some point)?

I agree that this is a hard concept for people coming from the DL side of it. SP is not really a mathematical function, it is a learning algorithm itself.

The SP does not need backprop to learn. It learns via an internal mechanism (inhibition or the “minicolumn competition”). Even if there is no topology in this configuration (meaning global inhibition), the SP still learns to pool spatial patterns into sparse representations. And it can learn this completely unsupervised and online (although the “online” part is not going to work in today’s DL frameworks like pytorch).
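
For anyone coming from the DL side, here is roughly what one unsupervised SP step looks like, stripped down — no boosting, no topology, and the constants are illustrative. The point is that there is no gradient anywhere, just competition plus a Hebbian-style permanence nudge:

```python
import numpy as np

def sp_step(x, permanences, k=40, connect_thresh=0.5, inc=0.03, dec=0.015):
    connected = permanences >= connect_thresh        # binary synapses
    overlaps = connected.astype(float) @ x           # overlap score per column
    winners = np.argsort(overlaps)[-k:]              # global inhibition: top-k win
    for col in winners:
        # Hebbian update: grow synapses to active input bits, shrink the rest.
        permanences[col] += np.where(x > 0, inc, -dec)
    np.clip(permanences, 0.0, 1.0, out=permanences)
    sdr = np.zeros(permanences.shape[0])
    sdr[winners] = 1.0
    return sdr, permanences
```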

1 Like

I’m sorry for the misleading statement but I was referring to only the feedforward and k-winner part of SP.

Why is that? I have no experience with those kinds of platforms, so I have no knowledge about them.
Are there any reasons why online learning wouldn’t work with them?

1 Like

I’m not so sure about that. I’m using a Variable to represent neuron energy/boosting in TensorFlow. I think a Variable could also be used for a sparse array representing connections, as long as the size doesn’t change too much.

I have found it funky to work with and optimize, but it should be doable.
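
For what it’s worth, here is roughly what I mean by keeping boosting in a Variable. The boosting rule below is the usual HTM-style exponential one, and the sizes and constants are just illustrative:

```python
import tensorflow as tf

n_columns = 2048
# Boosting / "neuron energy" lives in a non-trainable Variable so it can be
# updated in place between (or during) training steps.
boost = tf.Variable(tf.ones([n_columns]), trainable=False, name="boost")

def update_boosting(active_duty_cycle, target_density=0.02, strength=10.0):
    # Columns firing below the target density get boosted up; overly busy
    # columns get damped down.
    boost.assign(tf.exp(-strength * (active_duty_cycle - target_density)))
```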

2 Likes

The SP has to have boosting to spread meaning across all the bits, but as far as I can tell you can only perform boosting between training epochs; it cannot learn continuously.

1 Like

I looked into this a little, and I think it has to do with GPU hardware. It might be solvable by using GPU alternatives like the Intel Xeon Phi, or something from OpenPower. Advanced deep learning GPUs might have some solutions to this, but I’d need to spend a lot of time looking into them.

I’m definitely not an expert on this, but I think the DRAM that GPU cores can use already takes hundreds of cycles to read/write, so another dedicated thread or piece of hardware handles those accesses, and the GPU program still needs to do many more calculations than shared, in-block memory accesses to be efficient. On top of that, blocks don’t communicate directly, so they’d need to go through shared memory further up the hierarchy, which takes even longer.

Edit: I’m not sure it’d be too much of an effort to enable a minimal ability for local boosting on a GPU. The problem is sharing memory between neighboring blocks, and general memory access speeds, so adding a very small amount of storage (read: a few bytes) accessible by neighboring blocks/cores could help.

1 Like

It could be inefficient. But it shouldn’t be impossible.
Today’s GPUs are general purpose enough to perform any computations that CPUs can as far as I know.

1 Like

Probably true, given the number of AI-centric GPUs out there. I think it’d be a simple add-on.

All I need to make my rudimentary 3x3 spatial pooler work fully on the GPU is for each compute unit to share a 1 or 2 byte cache with the 3x3 surrounding compute units, which shouldn’t be that much of an addition to store the results of a convolution operation on the GPU.

However, even if there aren’t any GPUs like that, there is open-source code that can create a virtual GPU on FPGAs, and there are FPGA cloud services. It could be worth creating a full TensorFlow HTM implementation with Variables, then creating custom ops for a custom GPU to replace those Variables one by one.

Edit: replace all “spatial pooling” mentions with “boosting”.

1 Like

The problem isn’t the hardware; at my old work we used both the Phis and NVIDIA GPUs. There are some pretty big differences in the hardware, but that has nothing to do with online or offline learning.

The problem with online learning is that it is essentially going to be SGD every time you infer, and there is a reason batch normalization exists in the first place: it’s so that you can control the randomness in the prediction after every update. It’s well known that you can get wildly different predictions even after one SGD update, which also means longer training times. There has been a lot of work in the DL community to make sure there are standardized, reliable ways to reproduce models, like seeded random values, the same training order, batch normalization, etc. Online learning throws all of that away, for no real reason.

Regardless, even if the SP solved the noisiness of SGD, you would still have to run backpropagation after every inference, so on top of the existing inference time you are adding backpropagation and training time to each run. The time and resources for that kind of inference don’t scale well, and it’s easier to just save money, do the learning once at really high accuracy, and then only do inference. If you do only inference, even mobile phones can easily run models. This is why even people like Hinton would like to replace backpropagation.

You’re just going to be chasing rabbits if you try to force an online learning environment onto DL systems. There is no real need to, and it doesn’t really give you any benefit. It’s still cheaper to collect samples from inference runs and then update the model after x number of samples have been collected.

But it really doesn’t have anything to do with the hardware; if you made hardware optimized for backpropagation, it would also be optimized for inference, in which case the same situation applies, just faster.

1 Like

How? Online learning in HTM on the CPU has had these properties. Why would putting it on the GPU remove them? As for handling batch normalization and stochastic gradient descent… I don’t know if spatial poolers have been tested with that yet.

Actually, after reading a bit on batch normalization, it seems like spatial pooling does something similar, if you replace the value of a neuron in deep learning with the firing rate of a neuron in HTM.

That’s completely fine. If learning were applied at the end of a training session, that’s fine because it’s a training session and not an interactive situation.

However, due to the sparsity, at some point before training ends, learning would need to be applied so new synapses can form, and accuracy can be tested against those. Most likely, the net would need these updates a lot to begin with, but would need them less often later in training. I’m not sure how different that is from standard deep learning.

So, I guess I agree. I still think it’d be nice to have some slightly optimized hardware for things like spatial pooling or other convolution operations.

Edit: replace all “spatial pooling” mentions with “boosting”.

1 Like

You’re saying the problem is with how DL works, not the hardware nor the frameworks, right?
I kinda ran into this problem when I made DeepHTM.
I just solved it by updating the parameters every 10 steps instead of updating them every time.
You can think of it as having a minibatch size of 10.
It was a temporary solution and might not scale well.
But there are many techniques, such as layer normalization, that can help with online learning.
I’ve tried them, and it kinda worked.
And I used pure SGD, but optimizer tricks like RMSProp or even momentum could help as well.
Correct me if I’m wrong, but wouldn’t much of deep RL not work if online learning weren’t possible?
Also, shouldn’t an HTM system, or any other online learning system implemented with the ML frameworks, work just fine if the problem is with how DL works and not with the frameworks?
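
In case it helps, the “update every 10 steps” trick is just plain gradient accumulation; something like this, where the model and data are throwaway placeholders, not the actual DeepHTM code:

```python
import torch

ACCUM_STEPS = 10                                  # effective minibatch of 10

model = torch.nn.Linear(32, 4)                    # placeholder model
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(1000):                          # pretend online stream
    x = torch.randn(1, 32)                        # one sample at a time
    y = torch.randint(0, 4, (1,))
    loss = loss_fn(model(x), y) / ACCUM_STEPS     # scale so the sum averages
    loss.backward()                               # gradients accumulate
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()                          # one update per 10 samples
        optimizer.zero_grad()
```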

1 Like

This is not how we’re doing it. We’re not calculating firing rates or anything. We just use the SP to make the connections sparse, but we keep them as floats. Stay tuned, I’ll be working on this all week. Will have code as soon as possible.
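
Very roughly, the shape of it — a toy placeholder sketch, not the actual code — is a binary SP-derived mask gating a dense float weight matrix:

```python
import torch

weights = torch.nn.Parameter(torch.randn(256, 128))   # dense float weights
sp_mask = (torch.rand(256, 128) < 0.05).float()        # sparse connectivity (stand-in for the SP's choice)

def forward(x):
    # Only SP-selected connections contribute; gradients still flow through
    # the surviving float weights.
    return x @ (weights * sp_mask).t()
```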

3 Likes