It would be great to have a history section - somewhere that would summarise the lineage of ideas associated with HTM. I stumbled upon the paper A cortical sparse distributed coding model linking mini- and macrocolumn-scale functionality, which introduces the concept of an SDC (sparse distributed code), which at first glance seems to be the same idea as an SDR. Can someone give me a brief summary of where the SDR idea as per HTM comes from? Thanks.
Hi Mark. That’s my paper from 2010. The SDR model described in it, TEMECOR (now called Sparsey), was originally described in my 1996 thesis. There are several major differences between Sparsey’s SDR format, now called Modular Sparse Distributed Codes (MSDCs), and Numenta’s (which I believe evolved from Kanerva’s approach). First, as the name implies, an MSDC coding field is modular, whereas to my knowledge all others (Numenta’s, Kanerva’s) use a flat coding field (a difference described here). A second major difference is that Sparsey does not pipeline the processing into a spatial pooler followed by a temporal pooler. Rather, both types of information, spatial input from the current time step and temporal context from prior time steps (over extremely long time windows), are combined simultaneously, on each time step T, to determine the MSDC activated at T. In general, the number of sources that can be simultaneously combined is not limited to just spatial (bottom-up) and temporal (recurrent, or “horizontal”) information, but can be arbitrarily large, e.g., also including top-down signals. There are many other differences as well, which I’m happy to discuss if you are interested. -Rod Rinkus
Hi Rod,
It is a great opportunity to communicate with you; thanks for responding. Sorry for my delayed reply - I wanted to read your paper first. I am less interested in the biological mapping and more interested in the algorithm - that is not to take away from your work; it is just a lot of effort to follow those details, and it is a fast-moving target. It would be great to explore the history in more detail.
Your paper starts with “No generic function for the minicolumn – i.e., one that would apply equally well to all cortical areas and species – has yet been proposed,” but hadn’t HTM already been postulated by Jeff Hawkins at that time?
I really like the idea of using the local knowledge of similarity within the minicolumn (mC). I also like the use of noise/randomization. This feels like something strange enough to provide a different perspective on how computation can be achieved, by embracing what we typically avoid (e.g., sparsity and randomness).
I’m surprised you did not stress the correlation of reward with the G function. Was that novel?
In your model, is each neuron connected to all inputs of the mC?
Now, 10 years on, “Is the proposal that the L2/3 cells engage in two rounds of competition in each computational (putatively, gamma) cycle plausible?” Is there now evidence for multiple winners per mC in some situations?
It seems to me that your approach would be more suitable than Numenta’s SDR for the types of applications cortical.io develops.
Do you agree that Numenta’s HTM has a similar concept, enforcing macrocolumn (MC) sparsity by k-winner voting among populations of mCs?
Have you integrated a predictive aspect into Sparsey? I guess that is the major contribution of Numenta - but I would also like to hear more about that put into a broader perspective.
Thanks!
Hi Mark,
Yes, HTM was postulated, but I don’t believe there was a specific functional claim for the minicolumn in HTM at that point. Note that it was only in 2008-9 that Numenta rewrote the core model to use SDRs. Prior to that, I believe it used Dileep George’s core model (essentially a Bayesian belief net), which used a localist representation. I just looked back at the 2009 PLOS paper “Towards a Mathematical Theory of Cortical Micro-circuits” and see that while minicolumns are alluded to, in Figs 9 and 10, their internal mechanism is not elaborated. In contrast, the internal mechanism of a minicolumn, i.e., that its L2/3 portion implements a WTA function, was explicit in Sparsey from 1996 (when Sparsey was called TEMECOR).
Thanks, yes, the idea of using the local knowledge of similarity (G) is essential to Sparsey. BTW, a similar concept was recently described in Dasgupta et al. 2018 with respect to the fly olfactory system, though they do not propose using the novelty signal to control the amount of noise (randomness) in the process of choosing the SDR. Thus, noise is of course also essential to Sparsey. The third thing that is essential here is the explicit modularity of the coding field. That is, it is essential that the decision as to the overall SDC occurs as Q formally (mechanistically) independent decisions, i.e., one in each of the Q WTA minicolumns. I think it’s actually the conjunction of these three ingredients, SDC, (variable) noise, and modularity (within each individual coding field), that places Sparsey way outside the mainstream. BTW, I have a short bioRxiv paper focusing on explaining the core principle of adding novelty-contingent noise to achieve approximate similarity preservation. And I’ve just made a GitHub project public: a Java GUI app that allows one to experiment with Sparsey’s core algorithm and see how different parameters affect the preservation of similarity.
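To give a feel for how those three ingredients interact, here is a minimal sketch: Q independent WTA draws, with a softmax temperature driven by how novel the input is. The temperature schedule and function names are my illustrative assumptions, not the exact schedule in Sparsey’s published CSA.

```python
import numpy as np

def choose_msdc(input_sums, G, rng):
    """Choose one winner in each of Q WTA minicolumns.

    input_sums : (Q, K) array of summed evidence for the K cells in
                 each of the Q minicolumns.
    G          : scalar familiarity in [0, 1]. G near 1 (familiar)
                 -> low noise, near-deterministic max; G near 0
                 (novel) -> high noise, near-uniform draw.
    """
    temperature = 0.01 + (1.0 - G)   # assumed schedule, for illustration
    winners = np.empty(input_sums.shape[0], dtype=int)
    for q, sums in enumerate(input_sums):
        # Each minicolumn makes a formally independent decision.
        p = np.exp((sums - sums.max()) / temperature)
        p /= p.sum()
        winners[q] = rng.choice(sums.size, p=p)
    return winners

rng = np.random.default_rng(0)
sums = rng.random((8, 16))           # 8 minicolumns of 16 cells each
familiar_code = choose_msdc(sums, G=0.95, rng=rng)  # ~argmax per column
novel_code    = choose_msdc(sums, G=0.05, rng=rng)  # much more random
```

The point of the sketch: similar (familiar) inputs get assigned near-identical codes, while novel inputs get largely random, hence well-separated, codes.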
Of course, reward/punishment is crucial as well. But it’s a different measure than novelty per se. As defined, G is an inverse novelty measure, more precisely, a familiarity measure. Adding an explicit scalar reward input to Sparsey would be simple; I’ve just never gotten around to it, but it’s on the list.
Yes, each unit in the input field is connected to all units in the coding field (“macrocolumn”). I realize that this is a much denser connectivity than for overall cortex, but it’s only locally dense. And, the theory still works even if the local connectivity is less than full, e.g., 70%. So I don’t think this is damning regarding Sparsey’s relevance to biology, but in any case, it’s definitely not damning for hardware realizations of the model (e.g., on crossbars of memristors).
Regarding the requirement that some principal units need to compete twice in each computational cycle, it’s actually 25 years on! And to my knowledge, this has never been experimentally tested. But it is certainly well within biological plausibility. A gamma cycle is, say, 25-40 ms. A pyramidal can integrate its inputs and fire a spike in just a few, e.g., ~5, ms. Only one of the competing pyramidals in a WTA group (the L2/3 portion of a minicolumn) actually needs to fire twice within the (local to a macrocolumn) gamma cycle. I sketch this basic neural operation in the 2010 paper. It’s only now that experimentalists are getting the tools to directly vet the theory: one needs to see what all units in the L2/3 volume of a macrocolumn are doing. Calcium imaging can show that, but it needs to be on a ms time scale, and calcium is really too slow. Hopefully some voltage indicator with the spatial resolution of calcium imaging will come along soon.
I haven’t kept abreast of cortical.io, but in general, Sparsey can find very long-term time dependencies, based on single trials, for essentially any form of multivariate time series data.
Yes, Numenta’s k-winner approach is similar to Sparsey’s, i.e., similarly vast codespaces, but as stated in my earlier response, Sparsey’s modular SDC field is computationally more efficient.
The first results for Sparsey (in the 1996 thesis) showed that after having experienced a large number of complex sequences (with single trials), prompting with the first item of any of the sequences could read out, i.e., predict, the rest of the items, with almost no errors. These results are summarized on this page. These were for sequences that were 10 items long, but the thesis has results for much longer sequences. My 2005 SFN poster, which generalizes Sparsey to the hierarchical case, shows a far more powerful prediction capability, i.e., the ability to recognize nonlinearly time-warped instances of the (single-trial) learned sequences, by combining the influence of horizontal recurrent signals from previously active SDCs at the same level with top-down recurrent signals from SDCs at the superjacent level (where the superjacent codes have longer persistences (time constants)).
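To illustrate the combination of recurrent sources just described, here is a toy sketch. The multiplicative combination and the random placeholder values are illustrative assumptions, not the published algorithm; the structural point is that all sources act in a single step, and that the slower top-down context is what tolerates time warping.

```python
import numpy as np

rng = np.random.default_rng(7)
Q, K = 8, 16   # 8 WTA minicolumns of 16 cells at the lower level

# Evidence for each cell from the different sources at time step T.
bottom_up  = rng.random((Q, K))  # spatial input at T
horizontal = rng.random((Q, K))  # from SDCs active at this level at T-1
top_down   = rng.random((Q, K))  # from the superjacent level, whose codes
                                 # persist over several lower-level steps

# All sources combine in a single step to pick the code at T; there is
# no separate spatial-pooling stage followed by a temporal-pooling stage.
combined = bottom_up * horizontal * top_down   # illustrative combination
predicted_code = combined.argmax(axis=1)       # one winner per minicolumn
```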
Thanks very much for your interest and pointed questions.
-Rod
Hi Rod,
Thanks for the detailed response!
The story makes more sense with the Bayesian approach associated with a localist representation.
Sparsey broke a lot of fresh ground. I assume Sparsey is expecting a dense input, similar to what an HTM encoder would output.
The reward input is interesting, I wonder if at the lowest level the reward should be “success in predicting inputs that were impacted by outputs”. It is unclear to me if Sparsey is used to generate behavior, for example simulating the movement of a sensor.
At a more abstract (meaningful) level, I think rewards need to be learned. Given that the system has identified properties of the environment that it can manipulate with predictable outcomes, the next level of reward is a more abstract level of prediction: reward = successful anticipation. For example, the machine could learn by copying, and then new behavior becomes part of its repertoire. I have the impression many people hope the machine discovers how to reach a reward - I think this misses the point about how we solve problems (basically, we learn/copy).
The full connectivity of cells to the macrocolumn input is a major difference from HTM as I understand it. Once the input field is bigger than a macrocolumn’s input, this will raise similar questions. Personally, I’m more interested in bio-inspired algorithms than biomimicry.
I’m not sure what cortical.io do now, but I was thinking of their semantic text processing, not about the time series data. It seems Sparsey would have nice properties for “fuzzy” matching semantic information.
I’m surprised that Sparsey is more computationally efficient if it has full connectivity of each cell to all MC inputs. Is it by not requiring the SP that it gains efficiency?
I really like the poster, it is great to actually see work on hierarchy. The H in HTM does not seem to get enough attention!
Hey Mark,
Just wanted to comment real quick. I think Rinkus forgot this post. He is usually busy and he has been trying to get back into refactoring some of his code lately. But I can comment on a couple of points.
Sparsey does not take in a dense input. You have to preprocess the incoming data before it is fed to a Sparsey model. There is no generic encoder as of yet. For images, we just use edge detection and skeletonize the input. For other types of input, though, we would have to use something like the data encoders that Numenta has developed.
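For image input, that preprocessing might look roughly like the following sketch using scikit-image; the exact pipeline and parameters used with Sparsey are not specified here.

```python
from skimage import data
from skimage.feature import canny
from skimage.morphology import skeletonize

img = data.camera()                  # any grayscale image
edges = canny(img / 255.0)           # boolean edge map
binary_input = skeletonize(edges)    # thin edges to a 1-px-wide skeleton
# `binary_input` is the kind of sparse binary array fed to the model.
```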
Rinkus and I have thought about how we could integrate some sort of reward-based learning. But recently I have been thinking that you don’t need any rewards. I agree with your intuition about behavioral learning, although I don’t think there is a need for an additional reward array. An overlap in input is really the reward already, so if you take in the experience and its code and then map it to an action, I think that is sufficient as a reward. If you see that scenario again, the most probable action would be the action whose weights overlap. So behavioral learning/mapping with no explicit rewards has been my recent thinking.
The next question, though, becomes: “Well, I don’t want to touch fire again, so how do I save the experience but not repeat the action?” I haven’t talked to Rinkus about it, but I think we could save the weights as negative values or something. Currently the winning neurons are chosen by a softmax where the probabilities are based on the overlap between saved weights and the signal. So, if there were negative weights, I think that would effectively inhibit a related bad action if the scenario is presented again.
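To make that concrete, here is a hedged sketch. The overlap-softmax selection follows the description above; the negative-weight trick is speculation from this post, not a published Sparsey mechanism.

```python
import numpy as np

rng = np.random.default_rng(1)
K, n_inputs = 4, 32                  # 4 candidate actions/cells

# Hypothetical saved weights over the input bits for each cell.
weights = rng.random((K, n_inputs))
weights[2] *= -1.0                   # punished action stored as negative
                                     # weights (speculative)

signal = (rng.random(n_inputs) > 0.5).astype(float)  # active input bits

overlap = weights @ signal           # overlap of saved weights and signal
p = np.exp(overlap - overlap.max())
p /= p.sum()                         # softmax over the overlaps

# The punished action's negative overlap drives its probability toward
# zero, effectively inhibiting it when the scenario recurs.
winner = rng.choice(K, p=p)
```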
As to the full connectivity of neurons: the macrocolumns in general only view a portion of the input field. One mac won’t be connected to the whole input field, just a portion of it. Some of Rinkus’ papers show full connectivity for simple demonstrative models, but the practical application models have macs with overlapping fields that don’t span the entirety of the input. The “full connectivity” is to say that the receptive field is fully connected, with no explicit sparsity. The sparsity comes from other neurons competing, and a neuron’s receptive field naturally stays sparse. This is how it differs from Numenta’s explicit sparse connectivity.
Currently, though, there is no set standard for the column and neuron counts or the receptive field sizes. There is a lot of parameter swarming that needs to be done. But in general, all of the macs as a whole will cover the entire input field, while the receptive field of each mac may overlap with other macs’ fields without covering the entire input.
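For concreteness, a small sketch of that tiling over a 1-D input; the geometry and numbers are made up for illustration.

```python
def mac_receptive_fields(input_width, n_macs, rf_width):
    """Return (start, end) index ranges, one per mac, over a 1-D input.

    Macs are spaced so their fields overlap and jointly cover the
    input; within its field a mac is fully connected (no explicit
    sparsity), as described above.
    """
    stride = (input_width - rf_width) / max(n_macs - 1, 1)
    return [(int(round(i * stride)), int(round(i * stride)) + rf_width)
            for i in range(n_macs)]

fields = mac_receptive_fields(input_width=100, n_macs=6, rf_width=30)
# -> [(0, 30), (14, 44), (28, 58), (42, 72), (56, 86), (70, 100)]
# Adjacent fields overlap, and their union covers indices 0..99.
```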
I’ll give Rinkus a shout and remind him about this post and he may comment more on it.
@Cairo thanks very much for your clarifications. Your remarks about avoiding an action make me wonder whether that happens at the same level of hierarchy. For example, if one level of the hierarchy learns about touching things and not touching things, then higher up the system learns when to implement one strategy or another. Then it would not be negative weights on the “touching” behavior but reinforcement of the “don’t touch” behavior.
I would like to clarify the idea of the “receptive field is fully connected.” Does each mac (which I assume is the abbreviation for macrocolumn) have a predefined connectivity to the receptive field in Sparsey? My impression of HTM (I am not sure, because I’ve not played with an implementation) is that the full receptive field (associated with a mac) will also be covered by the mac. In HTM, the sparsity is introduced by the spatial pooling (basically k-winner among subsets of the minicolumns in a mac), not at the receptive field. But maybe you will clarify my misunderstanding.
It seems to me that a major distinction between Sparsey and HTM is that Sparsey maintains a degree of invariant representation in the output of the mac, while HTM loses that and is more like an indication of a single context.
Sorry about that Mark, yeah when I say ‘mac’ I mean macrocolumn or hypercolumn.
So in HTM, a neuron is randomly connected to other neurons in a layer, and that connectivity is sparse. So there will be neurons that it can never learn information from. In Sparsey, neurons in a macrocolumn are connected to neurons below within a predefined receptive field radius (or one calculated at model initialization). Within that receptive field it is fully connected; anything outside that receptive field it can never learn from. There is also full connectivity to layers above and to adjacent neighboring macs on the same layer, but that is for temporal context, since you only read activity from the last timestep. It is highly recurrent connectivity.
A huge difference, and I think the most important one, is that HTM is unstructured, whereas Sparsey is highly structured. HTM doesn’t have a notion of minicolumns or macrocolumns, at least not in the versions I studied about 2 or 3 years back. Rinkus talks about the advantages of structure here: Structured SDR Coding field has big advantage of flat SDR field (sparsey.com)
This picture, also from his website, kind of shows how a mac (described as hypercolumn in the graphic) connects to neurons below:
You can imagine the green hexagons as other macs below. It works the same for the receptive field looking up and horizontally, but with the last timestep’s activity.
If you take a look at some of his papers on his website, sparsey.com, he describes a full macrocolumn structure. I’m putting together a Unity visualizer in my spare time so it’s easier to inspect a model, mostly for debugging purposes, but I imagine it will be useful for people to see how a full model works.
I’ll give HTM one thing, though: it is an easier model to understand, in my opinion. In fact, I used some of the learning material to grasp some Sparsey concepts. Sparsey’s highly recurrent connectivity makes it a brain twister for sure.
Just thought I’d provide some input on this point, in case it causes confusion for newbies on the forum.
The concept of minicolumns is central to the TM algorithm, and has been so for quite a bit longer than 2-3 years. For example, the HTM whitepaper from 2011 described them (in that paper they were simply referred to as “columns”), and I’ve seen them described by Jeff in videos as far back as 2009.
Discovering the function of macrocolumns is in fact the central focus of HTM theory, and is a common theme in Numenta’s papers. Mountcastle’s proposal that all regions of the neocortex are fundamentally the same is in many ways a founding principle of HTM theory. It is in fact because of this almost singular focus on the function of a single macrocolumn that little focus has yet been devoted to the “H” in HTM.
An interesting bit of history is that the hypothetical role of the Hierarchy changed.
Everyone used to assume that the hierarchy of cortical regions had a close correspondence with a conceptual hierarchy of knowledge in the world. They grouped things into semantic categories, and put the categories into a hierarchy, and then matched that structure to the brain’s structure. For example: V1 might process lines and edges, V2 might process letters and numbers, and V4 might process whole words.
Now we think that each cortical region is much more powerful than previously assumed, and that a hierarchy of cortical regions is not strictly required for understanding hierarchical concepts. For example: V1 can understand anything that fits in a V1-sized receptive field (RF). If a letter, number, or entire word fits in a V1 RF, then it is processed in V1. In practice this happens when you look at a road sign from very far away.
Is there experimental evidence for that, e.g., neurons in V1 correlating with words/concepts?
No, that is a hypothetical thing that Mr. Hawkins discussed at some point. As Paul said, the hierarchy has received less research attention.
I think it is still important to point out the difference between a cortical column learning to recognize a complex object like a word on its own, versus understanding the meaning of that word. For example, it is well established that there are regions of the neocortex responsible for understanding language (damage those areas, and language is significantly impacted).
I can’t speak for others, but personally I think that hierarchy is still at play for learning very complex abstractions, even while accepting that a single cortical column may be a lot more capable than it has traditionally been viewed. In any case, I am always on the lookout for folks in this community who are experimenting with hierarchy.
That’s absolutely my understanding. There is some lack of clarity as to what scale we’re dealing with: 200 million columns of 100 neurones each, or 20 million columns of 1000 neurones each? Either way the mammalian neocortex has a functional repeating unit visible in the architecture which is central to HTM.
There is even less clarity around the intent of the H, which I suspect has drifted over time. But the core concepts of the CC as a computational unit, along with SDRs, have endured.
Sorry Paul,
It’s likely up for academic debate or semantics, but I should clarify my position on my claim. I took a quick look over the HTM School material again, and the BAMI document, and I am reminded why I was thinking there were no minicolumns in HTM.
In HTM, as I recall, the individual cell, the lowest processing structure, is called a “column”. A column connects itself to the input field and has weights. In most algorithms, ranging from optimization-based techniques to Sparsey and other brain-inspired algorithms, this is called a neuron.
In Sparsey, an individual unit that has weights directly connected to the input is a neuron/cell. In my code it’s a neuron. In Rinkus’ literature I believe he refers to it as a cell, depending on the paper. Neurons that exist in a column together inhibit one another. Only one neuron/cell can be active.
Unlike HTM, neurons in a Sparsey column are defined and structured explicitly. In HTM, the “columns” have an inhibition radius, which in an abstract, less structured way I suppose could be considered the closest thing to a Sparsey minicolumn.
As for the macrocolumn, I’m not aware of any HTM structure that relates to Sparsey macrocolumns. Essentially, Sparsey went the route of maintaining a more structured hierarchical model. I had linked Mark to a paper describing some of the reasons for that.
Thanks.
No, a (mini)column in HTM is a collection of cells (typically around 32, but configurable) which share a receptive field, and are able to inhibit each other if they fire sooner than the others. An individual minicolumn represents a bit of the input space in all contexts. An individual cell within a minicolumn represents that bit of the input space in a specific context.
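To make the “bit in all contexts vs. bit in a specific context” distinction concrete, here is a minimal sketch; the bursting rule follows the standard Temporal Memory description, and the names are mine.

```python
CELLS_PER_MINICOLUMN = 32   # typical, but configurable

def activate_minicolumn(predicted_cells):
    """Given the cells in one active minicolumn that were predicted
    (via their distal segments) on the previous step, return the
    cells that become active.

    - Some cells predicted: only those fire (the input bit in a
      specific learned context).
    - No cells predicted: all cells fire ("bursting" -- the input
      bit is recognized, but in no known context).
    """
    if predicted_cells:
        return set(predicted_cells)
    return set(range(CELLS_PER_MINICOLUMN))

activate_minicolumn({5})      # -> {5}: known context
activate_minicolumn(set())    # -> all 32 cells: burst, unknown context
```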
Anyway, you are right that the semantics of the two terms are likely different between the two frameworks. I was simply pointing out that concepts which use these terms exist in HTM (and in fact are fundamental), in case your comment about them not existing in HTM was confusing to anyone.
No, I think that’s a fair way off from how Numenta and HTM look at things. The ‘lowest processing structure’ is a column, which is visible anatomically and consists of many neurons. There are no ‘weights’, but there are SDRs. Rather than worrying about cell behaviour, the driver for HTM is the idea of:
- Discover operating principles of neocortex
- Build systems based on these principles
The HTM model is an attempt to do that. There is no requirement that it match the physical architecture of the neocortex, as long as it can follow the same operating principles.
Hi Cairo, I think the spatial pooler (SP) operates at the scale of a macro-column (MC) in HTM.
There is a cheat sheet that defines some basic terms. You might find that helpful.