Application of HTM in today’s ML frameworks


#21

I think I get it now. Thanks for the explanation. :smiley:
That’s such an interesting concept! I’ve never thought of something like that before.
It’s like SP working along with a DL layer and learning from/with another rather than backprop directly.
I can see it could also form a hierarchy independent of DL layers after learning.
I wonder if it’d actually work. I would certainly like to try!


#22

That’s such a nice interpretation of backprop! I was kinda aware of that though. I’m also aware there are actual binary DL networks that are trained in such a manner. I just wanted to point that out about gradient descent in its pure definition. I’m sure most people are aware that even ReLU, the most widely used activation function, is not differentiable at zero.

Choosing the winners during learning is an interesting concept, but I wonder if it would make a significant difference.
And I suppose what you’re referring to as linear activation is ReLU-like (linear growth with a nonlinearity at some point)?


#23

I agree that this is a hard concept for people coming from the DL side of it. SP is not really a mathematical function, it is a learning algorithm itself.

The SP does not need backprop to learn. It learns via an internal mechanism (inhibition or the “minicolumn competition”). Even if there is no topology in this configuration (meaning global inhibition), the SP still learns to pool spatial patterns into sparse representations. And it can learn this completely unsupervised and online (although the “online” part is not going to work in today’s DL frameworks like pytorch).
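
For readers coming from the DL side, the idea above can be sketched in plain Python with no framework and no gradients. This is only a rough illustration under stated assumptions: the function names, the 0.5 connection threshold, and the increment/decrement values are made up for the example and are not Numenta's implementation.

```python
import random

def make_pooler(n_inputs, n_columns, seed=0):
    """Random permanences in [0, 1) for every (column, input) pair."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(n_inputs)] for _ in range(n_columns)]

def overlaps(perms, x, threshold=0.5):
    """Per column: count active input bits seen through connected synapses."""
    return [sum(1 for p, bit in zip(row, x) if bit and p >= threshold)
            for row in perms]

def k_winners(scores, k):
    """Global inhibition: keep the k highest-scoring columns (ties go to the lower index)."""
    order = sorted(range(len(scores)), key=lambda c: (-scores[c], c))
    return sorted(order[:k])

def learn(perms, x, winners, inc=0.05, dec=0.02):
    """Hebbian-style update on winning columns only; no backprop involved."""
    for c in winners:
        perms[c] = [min(1.0, p + inc) if bit else max(0.0, p - dec)
                    for p, bit in zip(perms[c], x)]

def step(perms, x, k, train=True):
    """One unsupervised step: compete, then (optionally) learn from the same input."""
    winners = k_winners(overlaps(perms, x), k)
    if train:
        learn(perms, x, winners)
    return winners
```

Calling `step(perms, x, k=2)` on each incoming binary vector both produces the sparse output and updates the permanences, which is the "online" part.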


#24

I’m sorry for the misleading statement but I was referring to only the feedforward and k-winner part of SP.

Why is that? I have no experience in those kinds of platforms so I have no knowledge about them.
Are there any reasons why online learning wouldn’t work with them?


#25

I’m not so sure about that. I’m using a Variable to represent neuron energy/boosting in TensorFlow. I think a Variable could also be used for a sparse array representing connections, as long as the size doesn’t change too much.

I have found it funky to work with and optimize, but it should be doable.


#26

The SP has to have boosting to spread meaning across all the bits, but you can only perform boosting between epochs. It cannot continuously learn as far as I can tell. So you perform boosting between training epochs.
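
A rough sketch of what boosting between epochs could look like, in plain Python. The exponential boost rule, the `strength` value, and all function names here are assumptions for illustration; this is not the code from either implementation discussed in this thread.

```python
import math

def duty_cycles(win_counts, n_steps):
    """Fraction of steps in the last epoch on which each column won."""
    return [count / n_steps for count in win_counts]

def boost_factors(duty, target_density, strength=2.0):
    """Columns that won too often get a factor < 1, starved columns get > 1."""
    return [math.exp(strength * (target_density - d)) for d in duty]

def boosted_winners(scores, boost, k):
    """Pick the k winners after scaling each column's score by its boost factor."""
    boosted = [s * b for s, b in zip(scores, boost)]
    order = sorted(range(len(boosted)), key=lambda c: (-boosted[c], c))
    return sorted(order[:k])
```

The win counts would be gathered during an epoch, and `boost_factors` recomputed once the epoch ends, so the factors stay fixed while the framework is running its training step.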


#27

I looked into this a little, and I think it has to do with GPU hardware. It might be solvable by using GPU alternatives like the Intel Xeon Phi, or something from OpenPower. Advanced deep learning GPUs might have some solutions to this, but I’d need to spend a lot of time looking into them.

I’m definitely not an expert on this, but I think the in-block DRAM that GPU cores can use already takes hundreds of cycles to read/write, so another dedicated thread or piece of hardware talks to it. That still means the GPU program needs to do many more calculations than shared, in-block memory accesses to be efficient. After that, blocks don’t communicate directly, so they’d need to access shared memory further up, which takes even longer.

Edit: not sure if it’d be too much of an effort to enable minimal ability for local boosting on a GPU. The problem is sharing memory between neighboring blocks, and general memory access speeds, so adding a very small amount of storage (read: a few bytes) accessible by neighboring blocks/cores could help.


#28

It could be inefficient. But it shouldn’t be impossible.
Today’s GPUs are general purpose enough to perform any computations that CPUs can as far as I know.


#29

Probably true, given the number of AI-centric GPUs out there. I think it’d be a simple add-on.

All I need to make my rudimentary 3x3 spatial pooler work fully on the GPU is for each compute unit to share a 1 or 2 byte cache with the 3x3 surrounding compute units, which shouldn’t be that much of an addition to store the results of a convolution operation on the GPU.

However, even if there aren’t any GPUs like that, there is open source GPU code that can create a virtual GPU on FPGAs, and FPGA cloud services. It could be worth it to create a full tensorflow HTM implementation with variables, then create custom ops for a custom GPU to replace those variables one by one.

Edit: replace all “spatial pooling” mentions with “boosting”.


#30

The problem isn’t the hardware. At my old work we used both the Phis and Nvidia GPUs. There are some pretty big differences in the hardware, but that has nothing to do with online or offline learning.

The problem with online learning is that it is essentially going to be SGD every time you infer, and there is a reason batch normalization exists in the first place. It’s so that you can control the randomness in the prediction after every update. It’s well known that you can get wildly different predictions even after one update with SGD, which also means longer training times. There has been a lot of work in the DL community to make sure that there are standardized, reliable ways to reproduce models, like seeded random values, same training order, batch normalization etc. Online learning throws all of that away, for no real reason.

Regardless, even if the SP solved the noisiness of SGD, you would still have to run backpropagation after every inference, so on top of the already existing inference times, you are adding backpropagation and training times to each run. The times and resources for that type of inference don’t scale well, and it’s easier just to save money and do the learning once at really high accuracy and then only do inference. If you do only inference, even mobile phones can easily run models. This is why even people like Hinton would like to replace backpropagation.

You’re just going to be chasing rabbits if you are trying to force an online learning environment for DL systems. There is no real need to and it doesn’t really give you any benefit. It’s still cheaper to collect the samples from inference runs and then update the model after x amount of samples have been collected.

But it really doesn’t have anything to do with hardware. If you made hardware optimized for backpropagation, it would be optimized for inference as well, in which case the same situation applies, just faster.


#31

How? Online learning in HTM on the CPU has had these properties. Why would putting it on the GPU remove them? As for handling batch normalization and stochastic gradient descent… I don’t know if spatial poolers have been tested with that yet.

Actually, after reading a bit on batch normalization, it seems like spatial pooling does something similar, if you replace the value of a neuron in deep learning with the firing rate of a neuron in HTM.

That’s completely fine. If learning were applied at the end of a training session, that’s fine because it’s a training session and not an interactive situation.

However, due to the sparsity, at some point before training ends, learning would need to be applied so new synapses can form, and accuracy can be tested against those. Most likely, the net would need these updates a lot to begin with, but would need them less often later in training. I’m not sure how different that is from standard deep learning.

So, I guess I agree. I still think it’d be nice to have some slightly optimized hardware for things like spatial pooling or other convolution operations.

Edit: replace all “spatial pooling” mentions with “boosting”.


#32

You’re saying the problem is with how DL works, not the hardware nor the frameworks, right?
I’ve kinda run into the problem when I made DeepHTM.
I just solved it by updating the parameters every 10 steps instead of updating them every time.
You can think of it as having the size of minibatch as 10.
It was a temporary solution and might not scale well.
But there are many techniques such as layer normalization that can help with online learning.
I’ve tried them, then it kinda worked.
And I used the pure SGD but backprop techniques like RMSProp or even momentum could help as well.
Correct me if I’m wrong, but wouldn’t many deep RL methods not work if online learning weren’t possible?
Also, shouldn’t an HTM system or any other online learning system that’s implemented with the ML frameworks work just fine if the problem is with how DL works not with the frameworks?
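
To make the "update every 10 steps" workaround concrete, here is a minimal plain-Python sketch on a toy one-weight model. It is not the DeepHTM code; the model, the learning rate, and the name `train_stream` are assumptions chosen only to show the accumulate-then-apply pattern.

```python
def train_stream(samples, lr=0.05, update_every=10):
    """Fit y ~ w * x from a stream, applying one averaged update per group of samples."""
    w = 0.0
    grad_sum = 0.0
    pending = 0
    for x, y in samples:
        error = w * x - y             # inference runs on every sample
        grad_sum += 2.0 * error * x   # gradient of squared error w.r.t. w, accumulated
        pending += 1
        if pending == update_every:
            w -= lr * grad_sum / pending  # single step, like a minibatch of that size
            grad_sum, pending = 0.0, 0
    return w
```

With `update_every=1` this reduces to pure per-sample SGD, so the same loop covers both cases being compared here.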


#33

This is not how we’re doing it. We’re not calculating firing rates or anything. We just use the SP to make the connections sparse, but we keep them floats. Stay tuned I’ll be working on this all week. Will have code as soon as possible.


#34

True, but over a long period of time, boosting would have about the same effect on firing rate as batch normalization would have on the values of neurons. I mixed up my terminology.

I’m planning on working on my code every day after work this week. It’s all in python and tensorflow. Could I help out?


#35

I’m just amazed by your insight.
That’s such a great way of viewing them!


#36

How can SP form sparse connections if the inputs are dense?
Doesn’t SP form sparse connections because of its sparse inputs?
Wouldn’t the connections turn out to be at least as dense as the inputs?


#37

Sorry @SimLeek and @hsgo I should clarify, I was talking only about the spatial pooler in the context of deep learning frameworks as a layer. Not the spatial pooler by itself, apologies.

Yes, there are plenty of SGD methods. You can even find a useful list with enough information on the Wikipedia page under Stochastic Gradient Descent. Here is a link so you can read up on it. They have plenty of examples. What I don’t think it mentions is that most of these methods will benefit from batch normalization. But again, that’s not the problem. You are adding time to every inference run.

If i is the inference run time, and b is the time for whatever backpropagation and learning method you use, then i < i + b, always. Even if you optimize i, it will still be less than any version of i + b. There is no getting around it.

It depends on how they are doing it, but if online learning were impossible then yes, it wouldn’t work. But it’s not impossible so it does work.


#38

@SimLeek I should also mention that I have not experimented with any HTM or Spatial Pooler concepts on Phis or GPUs, only CPUs. I was assuming you were talking about HTM concepts within the context of HTM Integrated Deep Learning Systems. My bad. My comments were on my experience with DL on Phis and GPUs.


#39

Heh. Seems like there’s confusion all around.

I was talking about HTM Integrated Deep Learning Systems though. But I was mostly referring to boosting, and a few actual spatial pooler mentions, since that was brought in.

I want to take things one step at a time. Boosting is useful for my image recognition, since I can get more sparsity while still getting the whole input over time, which limits the calculations I run on the CPU while still eventually getting the full image, as well as important updates.


#40

Please review the mechanism. Numenta uses a k-winners-take-all vote to select the winners for a given area; this reduces the local population activation and produces the sparsification.
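
As a rough illustration of how picking winners per area yields sparsity even when the input scores are dense, here is a small plain-Python sketch over a 1-D row of columns. The 1-D layout, the `radius` parameter, and the function name are assumptions for the example, not Numenta's code.

```python
def local_k_winners(scores, radius, k):
    """Return columns that rank in the top k within their own neighbourhood.

    A column is kept only if its score is positive and fewer than k
    neighbours (within `radius` on either side) score strictly higher.
    Ties can let slightly more than k columns through in one area.
    """
    active = []
    for c, s in enumerate(scores):
        lo, hi = max(0, c - radius), min(len(scores), c + radius + 1)
        stronger = sum(1 for n in range(lo, hi) if scores[n] > s)
        if s > 0 and stronger < k:
            active.append(c)
    return active

# Six of these eight columns have non-zero scores, but only two survive.
print(local_k_winners([1, 5, 2, 0, 3, 4, 1, 0], radius=1, k=1))  # [1, 5]
```

The output density is set by `radius` and `k`, not by how many inputs were active.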