A spatial pooling layer in the middle of a DL model would essentially just be a hidden k-means linear activation layer. Why are you making a paper on that? Those already exist.
The SP is non linear. We also show how noise resistance improves and computation costs go down. Stay tuned.
I can see how computation would go down, just like any other dropout can reduce computation. Being more noise resistant by default means you won’t be backtracking in the backpropagation portion. How is SP non-linear, though? It goes up and down a fixed amount, unless you are saying the activation you use is non-linear. In which case, I guess, it’s basically just a dropout layer with a bit more intent behind it. I don’t know; I’ll just read the paper, I guess. It just seems a bit of an odd choice.
Ignore the whole non-linear part. I think I remember that the weights don’t go below zero, which makes it like a ReLU I suppose, and capping at 1 I guess would make it non-linear as well.
Sorry for asking an off topic question but how is the gradient evaluated if SP activation is binary?
Wouldn’t backprop not work without tricks such as using a differentiable activation function instead of binary or approximating the derivative by a normal distribution?
SP activation does not have to be binary. In our experiment in pytorch we kept everything scalar. Stay tuned we just submitted this to a conference, and I’m not sure what the privacy rules are about submissions. But I’ve read the paper and it looks like exciting results to me. I promise I’ll share it as soon as we are able, but I don’t know all the details yet.
So is it a layer-wise activation function (k-winner) using linear activation (y = x) on each node?
That’s what I used when I made DeepHTM.
But wouldn’t it violate some HTM philosophy (the neocortex uses binary)?
P.S. super excited for the paper! XD
Hey @hsgo, thought I might give you some of my thoughts. Take it with a grain of salt because it could be completely off the mark in terms of how they do it, but you might come to better conclusions than me with the way I’m thinking it might work.
(You might know this so this is for others unfamiliar with how to calculate the derivatives)
So all you need to know for a derivative is the amount to adjust a value, given an input. For a ReLU, one of the simplest cases, the function returns the input itself if it is above 0 and returns 0 if it is below 0, so the derivative is just 1 above 0 and 0 below. You are just looking for the slope of the function when you calculate a derivative.
In complex functions like sigmoids, soft arg maxes, tanhs, sines, etc. you can’t store every single correction given an input or make an if statement for every input because there is an infinite number of cases, so you calculate the derivative. But with simple if-statement functions the derivative is equally simple.
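To make the "if statement" point concrete, here is a minimal sketch in plain Python. The function names are made up for illustration, and picking a slope of 0 at exactly 0 is a common convention rather than anything the math forces:

```python
def relu(x: float) -> float:
    """Forward pass: return the input if it is positive, otherwise 0."""
    return x if x > 0 else 0.0


def relu_grad(x: float) -> float:
    """Slope of ReLU: 1 where the input is positive, 0 elsewhere.

    The slope at exactly 0 is undefined; returning 0 there is a convention.
    """
    return 1.0 if x > 0 else 0.0


def backprop_through_relu(x: float, upstream_grad: float) -> float:
    """Chain rule: scale the gradient from later layers by the local slope."""
    return upstream_grad * relu_grad(x)
```

So the backward pass for a ReLU is one comparison and one multiply, with no lookup table or closed-form derivative needed.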
My assumption on the Spatial Pooler as a layer is that those with the smallest error rate fall into the true case and are either incremented by a flat amount of 0.1 or by a learningRate-modified version of the error. The rest, which fall into the false case, are decremented by 0, by a flat decay amount, or by the learningRate-modified version of the error, and then everything is capped at 0 and 1. That’s just my assumption though. Without modifying the purpose of the spatial pooler too much, I would guess that it will be something along those lines. But that would be one of the simpler and more straightforward ways, especially if you are trying to reduce computation.
The reason I call it an odd choice, though, is that it doesn’t really use or prove (if that is actually what the method is) any other HTM theory concept. It is just taking advantage of the sparsity of a matrix, which is already a well-known concept in machine learning. I was under the assumption that Numenta was trying to move away from the optimizations arena and into cortical algorithms, so I’m really curious to know what the purpose of the paper is when it comes out.
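A rough sketch of the update I am guessing at, in plain Python. This is only my assumption written out, not the method from the paper; `sp_update`, `increment`, and `decay` are hypothetical names, and the flat-amount variant is shown rather than the learningRate one:

```python
def sp_update(weights, is_winner, increment=0.1, decay=0.0):
    """Apply the guessed rule: winners go up, the rest decay, all clipped to [0, 1].

    weights   -- current values, one per unit
    is_winner -- booleans marking units that fell into the "true" case
    increment -- flat amount added to winners (0.1 in the guess above)
    decay     -- flat amount subtracted from everyone else (may be 0)
    """
    updated = []
    for w, won in zip(weights, is_winner):
        w = w + increment if won else w - decay
        updated.append(min(1.0, max(0.0, w)))  # cap at 0 and 1
    return updated
```

For example, `sp_update([0.95, 0.5, 0.02], [True, False, False], decay=0.05)` pushes the first value up to the cap of 1, lowers the second slightly, and floors the third at 0.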
But like I said, this could be all way off the mark.
For the past couple years, we have not been focused on applications at all. We’ve changed that a bit recently. @Subutai talked about how we are now looking for ways to apply components of our theory into ML frameworks in Numenta's 2018 Year in Review. The most obvious place to start is with SDRs. And the way the SP creates sparsity retains semantic meaning, so it is more valuable than something like dropout. Instead of removing random connections, the SP is learning.
Right, well, I can see how it retains spatial relations, but almost all of the functions are very localist by nature, and things like CNNs already capture extremely localist features. So wouldn’t it be better to test out spatial poolers with LSTMs or some other method that is really good at sequences? Offload all of the spatial context from traditional functions to the spatial pooler and discard them, and then let the LSTM’s abilities shine? Especially if we know that cortical columns have more in common with recurrent networks than with other types of models and layer types.
Kind of feels like you are inserting a marginally better method into an already cluttered domain when you could just remove the competitors entirely and use it with LSTMs. That would really cut down on computation.
You’re right, we could look into running with / against LSTMs doing sequence memory, but what can we bring to the huge spatial game everyone is playing today? We are looking for the biggest bang for the least amount of effort. If injecting an SP to provide sparsity retains performance while significantly improving noise tolerance, that could be a very big win in a very big domain.
I suppose. Just noise tolerance alone is worth something to the industry. So I guess I can see how the industry would find it pretty useful. 5% extra accuracy could mean 5% extra sales, so even if the accuracy gains are marginal I could see how that could be super enticing.
The noise tolerance gains are not marginal. I don’t know if you saw my earlier SDR episodes of HTM School, but all those noise tolerance visualizations apply here. SDRs are extremely noise tolerant.
Nice, very interesting. Definitely want to see the results now. But yeah I have seen the entire htm school series.
I’d take a guess that if y - y^hat is greater than a certain threshold, then the connections activated in a given forward pass in the SP are weakened, and vice-versa. But I guess we’ll see, won’t we.
I suspect (but haven’t tested) that this would create noise-resistant feature pooling. I’ve been wanting to do this with image classification, but as life works, I just haven’t gotten around to trying it.
It’s impossible to backpropagate by the true gradient of a binary step function as the gradient would be 0 everywhere except for the threshold where it’d be infinite. Nothing makes sense.
Especially for SP, where the threshold or even the activation function itself is indefinite, as the activation is determined by competition and not directly by its input. It’s just a bit hard to formulate as a mathematical function, although I think it’s possible.
So I used k-winner with linear activation in a similar manner as a max pooling layer in CNNs and used a similar derivative as well.
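Roughly what that looks like, as a minimal plain-Python sketch rather than the actual DeepHTM code. The function names are made up here, and ties are broken by lowest index, which is just one possible choice:

```python
def k_winners_forward(x, k):
    """Keep the k largest inputs unchanged (linear, y = x) and zero the rest.

    Returns the output and a 0/1 mask of winners, saved for the backward pass.
    """
    order = sorted(range(len(x)), key=lambda i: x[i], reverse=True)
    winners = set(order[:k])
    y = [x[i] if i in winners else 0.0 for i in range(len(x))]
    mask = [1.0 if i in winners else 0.0 for i in range(len(x))]
    return y, mask


def k_winners_backward(upstream_grad, mask):
    """Route the upstream gradient to the winners only, as max pooling does."""
    return [g * m for g, m in zip(upstream_grad, mask)]
```

Because the winners pass their input through unchanged, their local slope is 1 and the losers' slope is 0, so the backward pass is just the saved mask times the incoming gradient.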
I assume what you’re suggesting is a classic SP (permanences not weights, thresholds not biases) with just the learning rule tweaked to take advantage of backprop.
It seems a tad weird to me, as I don’t know why you would choose winner columns (nodes?) at the backpropagation (feedback) stage when that’s already decided at the feedforward stage.
I’m not sure if I’ve got it right. Correct me if I’m wrong.
I assume the yhat is the binary activation determined by competition.
I have kinda tried this approach only for inactive columns with threshold of zero.
But somehow during training, columns lose their semantic meaning and become more and more vague about what they’re representing, so I ditched the idea.
I guess it could not be the case if you do this for every column instead of only inactive columns.
Also, what you’re suggesting doesn’t backpropagate gradients from the layers after the SP layer, so learning beyond the SP layer would not be efficient.
The easiest solution would be to just add the gradients from the layer after the SP layer, which is common practice in DL when there are per-layer constraints like this one.
What I’m suggesting is that the SP layer wouldn’t even be factored into the gradient system and would instead be divorced from backpropagation. If the results were correct, the columns that assisted in that correct guess would be strengthened, while when answers are incorrect, columns in the SP could either be left alone or even weakened.
Thus (with correct guess):
1. Input “cat” picture.
2. CNN, including SP layers, processes the data.
3. Network guess == “cat”.
3a. Normal DL layers are updated via backprop.
3b. SP layers are updated according to which columns participated in the correct answer (voting for column selection via 1-2 sigma standard deviation or other methodology).
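The SP half of that last step could be sketched like this in plain Python. It is a toy version of my idea only: `update_sp_columns`, `reward`, and `penalty` are hypothetical names, the set of participating columns is taken as given instead of being voted on, and the normal backprop update for the DL layers is left out:

```python
def update_sp_columns(strengths, active, guess, label, reward=0.05, penalty=0.0):
    """Adjust only the SP columns that took part in this forward pass.

    strengths -- one value in [0, 1] per column
    active    -- set of column indices that participated in the guess
    guess     -- the network's predicted label
    label     -- the true label
    reward    -- amount added to active columns when the guess is correct
    penalty   -- amount removed when it is wrong (0 means "left alone")
    """
    delta = reward if guess == label else -penalty
    return [
        min(1.0, max(0.0, s + delta)) if i in active else s
        for i, s in enumerate(strengths)
    ]
```

Note that no gradient reaches the SP here; the only feedback it sees is whether the overall guess was right.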
There are certainly folks smarter than myself, who given time and resources can surely figure out how to make this work … and it seems that they already may be on the way to doing so.
So is it an adversarial learning between SP and DL layers?
I’m not quite sure if I’m picturing right.
I don’t know how the information would flow through such a structure.
Could you draw a simple diagram or something if you don’t mind?
I’ll do that tomorrow. After midnight here, so off to bed.
Pretend that your neurons in your network are just like a bucket brigade, passing buckets of water through the crowd. You’re at the end of the line giving feedback and direction to people on the line. The main idea of my thought is that the SP is just a random, helpful stranger in the middle of your crowd of neurons… He doesn’t care about your gradients or your backpropagation values at all. He’s just there doing his own thing while helping you out… At the same time, if you (during the backprop process) shout at all your neurons to tell them that they were wrong, he’ll take a moment to look at what he did, and maybe do less of that. If you’re shouting at all your neurons that they did great, again he’ll take a look at what he did, and strengthen that. The backprop stage, as far as this SP stranger is concerned, is just a moment for self-reflection, to determine if what he did in this pass was helpful or not. He doesn’t care about what gradients you assign to all the neurons in the bucket line, because it doesn’t apply to him. You don’t control him so much in that way. He only cares about himself, and the binary (maybe trinary “good”, “neutral”, “bad”) state of whether or not he helped or hurt the overall goal.
Imagine though, that you have one of your neurons at the front, receiving end of the line, with a whole bunch of these SP strangers helping out… at that point, you would have a little bit of game theory starting to take place.
Maybe a better analogy would be to think of a college classroom. Your traditional DL neurons would be graded students, who receive information, take tests/quizzes/assignments, and receive a grade. Based on that grade, they adjust appropriately (low grades inspire harder work, high grades lead to very little adjustment). Then you have those folks who are auditing the class. They might participate in discussions and other activities, contributing to the overall success of the classroom experience, but they’re essentially self-monitoring, where an instructor might take a quick look and say “That looks good” or “That’s not right”, but no percentage or grade is assigned to any of the work.
The whole purpose of backpropagation is just to move the probability in the right direction. If you know the direction you need when the input is at or above the inflection point, then you give it the amount you want to adjust. Likewise for below. Sure, mathematically a vertical line’s slope is undefined or infinite, but we aren’t looking for the slope of on or off; we are looking for the slope to correct towards a maximum of 1 and a minimum of 0, given a threshold or inflection point of 0.4 or whatever. We don’t need to stick ourselves in a box; we can just account for the possible states and correct towards them.
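As a toy illustration of "correct towards the possible states", here is a plain-Python sketch. It is not true gradient descent, the names `is_on` and `nudge` and the step size are made up, and the 0.4 threshold is just the example number from above:

```python
def is_on(value, threshold=0.4):
    """Binary output: the unit fires when its value reaches the threshold."""
    return value >= threshold


def nudge(value, should_be_on, step=0.1):
    """Move the value a fixed step towards 1 (should fire) or 0 (should not).

    The step function's slope is never used; only the desired direction is.
    """
    target = 1.0 if should_be_on else 0.0
    if value < target:
        return min(target, value + step)
    if value > target:
        return max(target, value - step)
    return value
```

A unit sitting at 0.35 that should have fired gets moved to about 0.45 and now clears the threshold, which is the kind of correction I mean.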
For a moment I didn’t get what you were saying, but then I realized. I was making the assumption that you would only choose the winners after the backpropagation, and in forward propagation it would just be normal matrix multiplication with linear activation. But it sounds like your thought is that you choose the winners AS the activation function, which I hadn’t considered. That’s an interesting take on how it might work. I was thinking the winners would be calculated by how close they were to the answer. I’ve been in a different mindset of softmax selection and winner-take-all algorithms for the past 2 and a half months, so that’s where my bias was coming from.
After thinking about it just right now, I can see how either one could mess up the math. I’d have to do physical tests to see what the results would be. I’m running up to the edge of running the scenarios in my head. There is a lot of stuff I haven’t accounted for, which I suppose is fine. It’s mostly just Numenta research paper conspiracy theories rattling around up there at this point. I’m gonna be writing on the walls and eating scooby snacks here pretty soon.
But I’m too lazy to test any of this out since it’s already been done and I have a million other things I need to do anyways, so I’ll just wait for the paper to come out and read it then. Anyways have a nice night.