And I don’t mean just going on intuition. From the perspective of functional gradient descent, can we ensure the steps of the TM algorithm coincide with a greedy minimization of a well-defined loss function? Can we use margin theory to ensure the generalization behavior of TM is in line with the problem of minimizing streaming SDR prediction error? Can we use decision theory to ensure the predictions of column activation are used in a way that is consistent with minimizing the expected error of unseen samples?
Without the aid of theoretical justification, we’re arguably just flying by the seat of our pants with new development in HTM, no?
I have just discussed the same issue with my professor, but we ended up with some major problems.
First, the prediction error is given by num_bits_not_predicted / num_bits_should_be_predicted, which means the easiest way to get 0 prediction error is to simply predict every bit to be 1. That gives us a useless model but guarantees 0 error (see the sketch after my second point).
Secondly, TM is not differentiable, which makes verification so much harder.
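To make the first point concrete, here is a minimal sketch in plain NumPy (the function name, vector size, and sparsity are just illustrative, not anything from the NuPIC codebase):

```python
import numpy as np

def prediction_error(predicted, actual):
    """num_bits_not_predicted / num_bits_should_be_predicted."""
    should_be_predicted = actual.astype(bool)
    missed = should_be_predicted & ~predicted.astype(bool)
    return missed.sum() / max(should_be_predicted.sum(), 1)

actual = np.zeros(2048, dtype=bool)
actual[np.random.choice(2048, 40, replace=False)] = True   # a ~2% sparse SDR

predict_everything = np.ones(2048, dtype=bool)              # the useless "all ones" model
print(prediction_error(predict_everything, actual))         # 0.0 -- zero error, zero information
```

Presumably any usable objective would also have to penalize predicted-but-inactive bits (false positives), not just missed ones.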
Just out of curiosity, do these same objections apply to using Recurrent Neural Networks and Markov chains?
If there are good theoretical tools for RNNs, why don’t they apply in any way to HTM?
I see both as fitting in the temporal-predictive domain. Both are required to capture some aspect(s) of a past stream and fit it to the current stream to select a best-fit match.
Yet it works; that suggests that the problem lies in the tools.
Why?
Real neurons work more like HTM than like ANN models. Noting that real neurons don’t fit our tools very well suggests that the tools are in need of a re-think, for all of the reasons you just outlined.
It may be useful to get back to basics and ask what these tools are supposed to be measuring and how that maps to the domain being analyzed.
I am an embedded systems programmer, and I assure you that what gets presented to my system input frequently does not look like nice clean continuous values; sometimes it is ugly binary. In these cases I have to step back and ask: what is the end product I would get if it were a well-behaved value? I usually find that thinking differently (in terms of set memberships, for example) gives me access to different tools that allow some sort of clean (and, all-important in an embedded system, verifiable) functions.
I guess my point is this. To put it plainly, surely HTM networks optimize something. We know HTM learns something because its performance improves over time. Anything that “learns” is optimizing something. That’s how we measure whether it has learned anything of value or not. Its learned knowledge is measured by the difference in performance on the task after that learned knowledge affects its behavior.
In the case of RNNs, the optimization objective is well-defined. We know precisely what the RNN is optimizing. That is powerful knowledge all on its own. It leads to plenty of insight, like intuitive explanations of the network’s inner workings and theoretical guarantees. Furthermore, we understand how it performs that optimization, in such a way that we can ensure every step of the algorithm is consistent with the overall optimization goal. Does the theory explain everything? Hell no. But it’s enough to work with comfortably.
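For concreteness, the textbook example of what “well-defined” means here: a next-step-prediction RNN trained with backpropagation through time minimizes one explicit scalar, the average negative log-likelihood of the next input,

$$\mathcal{L}(\theta) = -\frac{1}{T}\sum_{t=1}^{T} \log p_\theta(x_{t+1} \mid x_1, \dots, x_t),$$

and every parameter update moves along $-\nabla_\theta \mathcal{L}$, which is exactly what lets you check that each step of the algorithm is consistent with the overall goal.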
By contrast, there seems to be no understanding of what HTM networks as a whole are actually optimizing, much less how. That is problematic. HTM makes deep nets look like Sesame Street in terms of explainability. With new developments in HTM (sensorimotor theory, grid cells, etc.), the central question I find myself always asking is this: how does anyone know for sure that this is a better course of action? With other algorithms that boast sound theoretical justifications, it’s possible to answer that question definitively. We can defer to the mathematics. Sure, you can test HTM empirically all you want. But empirical evidence is not enough, for obvious reasons. Is it a problem with our tools? Very possibly. Maybe brand new tools need to be developed to truly explain HTM and frame it mathematically in a comprehensive way. That seems to be of utmost importance to me, and it seems to be a step that should precede any significant algorithmic changes. Because again, without it, how can we possibly know for sure that any changes are even consistent with a clearly defined goal?
You can say “well because that’s how it works in the brain!” But come on…let’s not kid ourselves. Yes, HTM networks are vastly more biologically plausible than ANNs and I fully agree that that is the right direction to push AI materially ahead. But even HTM is still a gross simplification of real neurons and real neural networks. There are still many mysteries about how real neurons operate, anyway. “Because there is evidence of it in neuroscience” is by no means a rock-solid justification. It takes faith to trust that justification. It’s unscientific. It is deferring justification to a field which itself is heavily lacking completeness. We all know how often results in neuroscience change, get disputed and sometimes become irrelevant over time.
P.S. I’m in the pre-infant stages of devising a PhD project. These are the kinds of questions I’m interested in working on. I’d love nothing more than to help devise a biologically plausible model that also has theoretical rigor. The way I see it, the lack thereof is by far the biggest stumbling block that HTM (and other biologically plausible models) face in becoming adopted by the larger AI community.
To address the basic question of ‘what does HTM optimize’ in its learning process, it minimizes the TM anomaly scores. The TM sees a new sequence (say ‘A,B,C,D’) and bursts at each non-predicted input, learning the transitions so that when that sequence repeats the activated SP columns do not burst and the anomaly scores are 0 instead of 1.
It’s true that if one wanted to minimize all anomaly scores in the greediest way, you could fill each neuron with dendrites connecting all over, so that every input basically predicts every other possible input. The TM algorithm however learns in a very selective and sparse way, by forming dendrites that generate predictions to minimize the anomaly scores. So the TM is sort of being greedy, but doing so within this mechanism of sparsely forming & modulating synaptic connections. This is how HTM can be quite effective at One Shot Learning while maintaining very sparse structures, yielding high learning capacity and minimizing false positives.
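For anyone following along, here is roughly the quantity being driven to zero, as a minimal sketch in plain Python (the function and the column indices are mine for illustration, not NuPIC’s API):

```python
def anomaly_score(active_columns, predicted_columns):
    """Fraction of currently active SP columns that the TM did not predict
    at the previous step: 1.0 = total surprise (full burst), 0.0 = fully anticipated."""
    active, predicted = set(active_columns), set(predicted_columns)
    if not active:
        return 0.0
    return len(active - predicted) / len(active)

# First pass through 'A,B,C,D': nothing is predicted yet, every step bursts.
print(anomaly_score(active_columns=[3, 17, 42], predicted_columns=[]))            # 1.0
# After learning the transitions, the same columns are anticipated.
print(anomaly_score(active_columns=[3, 17, 42], predicted_columns=[3, 17, 42]))   # 0.0
```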
I apologize in advance for adding another answer based on intuition rather than a mathematical proof. Thought I would add it to the discussion in case it might trigger some ideas.
I think what HTM is optimizing toward is a specific target count of predicted active cells in each time step. This count is based on the target sparsity level (i.e. one predicted active cell per active minicolumn), and addresses the point that @marty1885 mentioned about predicting every bit to be 1.
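A rough sketch of the quantity I have in mind (the cell-to-column indexing is just an assumed convention for illustration, with cells numbered consecutively within each minicolumn):

```python
def predicted_active_per_column(predicted_active_cells, active_columns, cells_per_column=32):
    """Count, for each active minicolumn, how many of its cells were both
    predicted and became active; the target described above is exactly 1 per column."""
    counts = {col: 0 for col in active_columns}
    for cell in predicted_active_cells:
        col = cell // cells_per_column
        if col in counts:
            counts[col] += 1
    return counts

# Column 7 has converged to a single predicted-active cell; column 5 is still bursting.
print(predicted_active_per_column(predicted_active_cells=[7 * 32 + 3],
                                  active_columns=[5, 7]))   # {5: 0, 7: 1}
```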
The tools listed for DL assume the network is describing a line, plane, volume, or a higher dimensional surface.
The figure in question may be a single contiguous area, or multiple contiguous areas or volumes.
The NN is attempting to define the surface of this shape, with an error function that defines the goodness of the fit. The tools available are drawn from the geometry/calculus domain.
The HTM canon describes set membership. These sets do not have to be constrained by geometry, and in fact, usually are not. The correct tools to describe the properties of an HTM network should more properly be drawn from set theory. Venn diagrams come to mind. Perhaps combinatorics are also involved here.
The temporal parts of HTM describe a trajectory through set space.
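To make the set-theoretic framing concrete, a small sketch (plain Python, nothing drawn from any HTM codebase): if SDRs are treated as sets of active bit indices, the natural similarity measures are intersection size and Jaccard overlap rather than geometric distance, and a temporal sequence becomes a trajectory through that space of sets.

```python
def overlap(a, b):
    """Number of shared active bits -- the basic SDR 'match' quantity."""
    return len(set(a) & set(b))

def jaccard(a, b):
    """Set-theoretic similarity in [0, 1]: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

sdr_cat = {12, 87, 301, 944, 1500}
sdr_dog = {12, 87, 350, 944, 1610}
print(overlap(sdr_cat, sdr_dog))   # 3 shared bits
print(jaccard(sdr_cat, sdr_dog))   # ~0.43
```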
I do not have the mathematical background to identify the best tools to describe and analyze these networks, but I suspect that a quick visit to the math department will turn up some good answers.
I would say that at least two things are happening in HTM: one is mapping out the domain that is available through sensory input, and the other is training dendrites to minimize spurious activation and maximize the probability of activation at a specific point in a sequence of sensory inputs. I claim that the mapping is done through the creation of dendrites, and that the training of the created dendrites is separate from the mapping.
As such, any analysis of what HTM does, and how, should separate these two mechanics.
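As a sketch of what I mean by the second mechanic (the increment, decrement, and threshold values here are placeholders, not canonical parameters):

```python
import numpy as np

def train_segment(permanences, presyn_active, inc=0.10, dec=0.02, thresh=0.5):
    """The 'training of created dendrites' half: synapses whose presynaptic
    cell was active are reinforced, the rest are punished; a synapse only
    contributes to activation once its permanence crosses the threshold."""
    permanences = np.clip(permanences + np.where(presyn_active, inc, -dec), 0.0, 1.0)
    return permanences, permanences >= thresh

perms = np.array([0.45, 0.48, 0.20, 0.55])
active = np.array([True, True, False, False])
perms, connected = train_segment(perms, active)
print(perms)       # [0.55 0.58 0.18 0.53]
print(connected)   # [ True  True False  True]
```

The first mechanic, mapping, would correspond to growing new segments and synapses in the first place, which this sketch deliberately leaves out.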
I’m most likely misunderstanding your “how” but, to me, nothing in the training relies on anything new or revolutionary. Showing that the training of dendrites, as done in HTM, minimizes spurious activations and maximizes the probability of activations at a specific point in a sequence of sensory inputs comes across as covered ground. In the context of HTM and the mathematical underpinnings, I’m just a hobbyist and not a mathematician so maybe I’m overlooking some complicating factor.
Why is this problematic? What is the actual problem?
What is obvious? If empirical evidence (or experimentation, as I see it) is the precursor to actually understanding what is going on, then it is exactly what is needed.
Why? If there is an already working system (the human brain), there is no reason not to update the algorithms to reflect new findings, especially if the change follows serious attempts to formalize those findings and they fit well enough with what is already established.
I’m going out on a limb here, but, I have not seen any attempts to define a clear goal apart from using the human brain as inspiration for a system that exhibits intelligent behaviour. If a clearly defined goal exists, I’d like to know what it is.
If a gross simplification can generate the kind of results as HTM can, refining and expanding it sounds like the sane way forward. As far as I know, the goal isn’t to replicate the human brain but to build a system that can do brain-like things.
I’m sure we are all grateful that early scientists pushed forward even though their fields lacked completeness. New ideas can always be formulated and tested and I view the current imaging techniques that let us map even the actual structure of individual dendrites as what the introduction of the microscope was to germ theory. It opens the door to a big realm of new ideas that can be tested.
This is a very interesting question, and I myself am trying to express HTM using math (mostly probabilistic tools).
What is your motivation for comparing TM to gradient descent? Have you done any research into HTM’s problem-solving nature or its properties in the optimization/search domain? For example, is it doing local or global search, is it a heuristic or a metaheuristic, does it converge, and what did you find?