I wouldn’t be sure that any sensory input can be characterised as sparse. In vision, I believe, a perceived edge is represented as quite a dense stream of signals from the retina; it only becomes sparse in the brain.
It’s not the same with the cochlea: its stream can be sparse at any given moment because it is first of all a temporal sequence. Vision, by contrast, encodes spatial patterns, and temporal sequences are used only to combine them into more complex patterns (up to a specific face and the whole scene at the end).
There are several levels of processing between the photosensitive cells in the retina and the cortex, including structures outside the eye such as the thalamus. By the time the signal hits the cortex it has become reasonably sparse.
When we engineer biomorphic sensors for vision, the output is indeed sparse, since this is what we understand biological vision is doing as well. See below for an example of a biomorphic vision sensor that outputs pretty sparse events.
It looks impressive. Is this input used for recognition? It’s super stable and the 3D reconstruction is very interesting, but it’s hard to imagine that such data is useful for recognition using HTM.
Professor Scaramuzza’s group has been using these event-based cameras for geometric reconstruction, mapping, localizing, recognizing locations, and controlling flying robots with great success.
I am not aware of anyone using these sensors with HTM, although it seems like a ripe opportunity to test HTM on some pretty realistic sensory data. The topic has been suggested in the following thread:
Do I understand correctly that this camera provides the same information about edges that can be obtained with ordinary cameras, but with close to zero latency?
Not exactly. It is a two-dimensional pixel array like a traditional camera, but each pixel is independent of the others, and triggers an event whenever the pixel intensity increases or decreases by some threshold, say 5%. These events are triggered at a very fine time resolution, effectively on the order of 10,000 Hz. The events have no scalar value, however; they only indicate a positive or negative change in intensity. So if the camera is stationary you’ll see events caused by moving objects, and if the camera is moving you’ll see events caused by moving objects and stationary edges.
Also, since each pixel is its own autonomous unit, these cameras exhibit dynamic ranges many orders of magnitude greater than those of traditional cameras. Intuitively, this means that if you pointed it at a bright light, only the pixels on the light would get saturated and the rest could function as normal, unlike a traditional camera, where the global exposure control would make all of the other pixels very dark in response to the light.
These properties are similar to retinal sensing, in which all-or-nothing spike events are triggered in response to changes in illumination, and sensitivity is adapted near-independently across retinal “pixels”.
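To make the mechanism concrete, here is a minimal sketch of the event-generation idea, simulated from ordinary frames in Python/NumPy. The log-intensity reference and the 5% threshold are modelling assumptions for illustration, not a description of the actual sensor circuitry:

```python
import numpy as np

def generate_events(frames, timestamps, threshold=0.05):
    """Emit (t, x, y, polarity) events whenever a pixel's log intensity
    drifts past `threshold` relative to its last event.
    `frames` is an iterable of 2-D arrays of positive intensities."""
    events = []
    reference = None  # log intensity at each pixel's last event
    for frame, t in zip(frames, timestamps):
        log_i = np.log(frame + 1e-6)
        if reference is None:
            reference = log_i.copy()
            continue
        diff = log_i - reference
        ys, xs = np.nonzero(np.abs(diff) > threshold)
        for y, x in zip(ys, xs):
            polarity = 1 if diff[y, x] > 0 else -1
            events.append((t, x, y, polarity))
            reference[y, x] = log_i[y, x]  # reset this pixel's reference
    return events
```

In the real sensor each pixel does this continuously and asynchronously in its own circuitry rather than over a frame loop; the sketch only reproduces the output format of timestamped polarity events.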
This is getting away from the original topic, but the intent was to indicate that input to the brain is often very sparse.
I figured out an answer to the second question: with the current algorithm for selecting the active neurons, the ones that are sensitive to spatially smaller patterns in most cases lose the competition to neurons sensitive to less specific but larger patterns.
I believe it’s possible to create a boosting algorithm which can solve this issue (I’m not talking about existing boosting here).
@rhyolight, @scott, is there any experience or ideas in this direction?
Those cameras are just the best sensor for testing sensorimotor concepts, where the camera, like the retina, must move slightly to better see non-moving objects. Unfortunately, they provide asynchronous events, so we either have to modify HTM to handle them or somehow convert the sensor data into regular time steps.
A common post-processing step is to integrate events over a short synchronous window, e.g. 1ms.
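For example (a sketch only, assuming timestamped (t, x, y, polarity) events like those described above, sorted by time, and a known sensor resolution), binning into 1 ms binary frames could look like this:

```python
import numpy as np

def bin_events(events, width, height, window=0.001):
    """Accumulate (t, x, y, polarity) events into binary frames,
    one frame per `window` seconds (1 ms by default)."""
    if not events:
        return []
    t0, t_end = events[0][0], events[-1][0]
    n_frames = int((t_end - t0) // window) + 1
    frames = [np.zeros((height, width), dtype=bool) for _ in range(n_frames)]
    for t, x, y, _polarity in events:
        frames[int((t - t0) // window)][y, x] = True
    return frames
```

Windows containing no events simply come out as all-zero frames, which is the situation raised below for static scenes.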
Thank you for the more detailed explanation; I didn’t get this from the materials you recommended. It really is similar to the behaviour of the retina, especially if it is possible to support receptors sensitive to crossing the threshold upwards and downwards at the same time (I mean via two different groups of receptors).
At the same time, unlike your camera, the retina has a very uneven distribution of receptors, with many more cones in the fovea, which is responsible for detailed vision. So there the representation of a solid edge should be quite dense.
In any case, the retina itself is just one part of the visual encoder, and I’m not sure about the sparsity of the resulting input to the neocortex. I think it should be quite dense just to be economically reasonable, and I know it can be dense because it works for me.
What I see from my experience is that it is more important to organize the input to maximize semantics and the capacity for generalization (isomorphism, in the case of visual data) than to maintain low sparsity.
@jakebruce It is clear that we can collect all the events within a time window like 1 ms. However, you then have to deal with many frames containing no events, e.g. for a scene in your garden at midnight. We discussed this topic some time ago, but as I understand it, the current NuPIC does not support this.
Just wanted to say that you gave a very good answer to your own question. Also, to prevent confusion of other readers, I think you meant active columns rather than neurons.
@jakebruce The video and approach seem a very good fit for HTM. Lately I was thinking about a visual sensor that just captures the edges (I can access the environment geometry) or color changes in the visual data to sparsify the input. The RGB color sensor I am using at the moment has fixed sparsity, but it is not sparse at all, and I think I am crippling HTM because of that: columns need to map to a very large subset of the input bits. A sparse visual sensor became one of the priorities for the agent at the moment. I can detect the changes in intensity as well, as discussed above. Thanks for the direction.
I am looking for the simplest starting point. Would it be good enough if I just turned on the bits whose intensity changed by more than a threshold? Also, do we apply inhibition to the neighboring pixels? If not, why not?
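Concretely, I imagine something as bare-bones as this (just a sketch; the grayscale frames with values in [0, 1] and the threshold value are assumptions about my setup):

```python
import numpy as np

def change_bits(prev_frame, curr_frame, threshold=0.05):
    """Return a flat binary array with bits on wherever the pixel
    intensity changed by more than `threshold` since the last frame."""
    changed = np.abs(curr_frame - prev_frame) > threshold
    return changed.astype(np.uint8).flatten()  # input vector for the SP
```

This is basically the accumulated-event frame discussed above, collapsed to one bit per pixel per frame.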
It’s a good point in general, but in this particular case, since I was talking about the SP only, it was about neurons, or the input to the corresponding columns, depending on how you’d like to look at it.
Could you elaborate on what you mean by that? Every element of the SP is potentially connected to a fixed number of input elements, so from this perspective there shouldn’t be any difference whether the representation is sparse or not.
Below are just observational thoughts which may be off.
It makes a difference to the overlaps of the representations for similar inputs. If a column learns dense patterns, then each column actually encodes more of the whole image (input space) rather than bits and pieces. So if you change the image slightly, either the active columns are not affected or almost all of them change, and the situation worsens as you increase density. In addition, columns with the same activation start representing the same stuff, and the things they represent overlap more as the density increases, so you lose the distributedness of your representations. You can shrink the size of the columns’ potential fields to limit what they learn, but then you are not using all the information in your input, which leads to underfitting if your possible input patterns are rich enough.
Of course you can adjust some SP parameters to remedy some of this, which is mostly what I do, so there is that. However, just because the brain can recognize objects in chaotic images does not mean that its ability is fully utilized. That’s my general concern with dense inputs.
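One way I could probe this empirically (just a sketch with random, untrained proximal connections and assumed sizes, not a real SP run) is to measure how much the winning column set changes when a sparse versus a dense input is slightly perturbed:

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_columns, n_active = 2048, 1024, 20  # ~2% active columns

# Random binary proximal connections covering 70% of the input space.
connections = rng.random((n_columns, n_inputs)) < 0.7

def winners(x):
    """Top-k columns by overlap with the binary input vector x."""
    return set(np.argsort(connections @ x)[-n_active:])

def churn(density, flip_fraction=0.05):
    """Fraction of winning columns that change after moving a small
    fraction of the input's on-bits to previously off positions."""
    on = rng.choice(n_inputs, size=int(density * n_inputs), replace=False)
    x = np.zeros(n_inputs)
    x[on] = 1
    before = winners(x)
    n_flip = max(1, int(flip_fraction * len(on)))
    off = np.setdiff1d(np.arange(n_inputs), on)
    x2 = x.copy()
    x2[rng.choice(on, size=n_flip, replace=False)] = 0
    x2[rng.choice(off, size=n_flip, replace=False)] = 1
    return 1 - len(before & winners(x2)) / n_active

for d in (0.02, 0.1, 0.4):
    print("input density %.2f: winner churn ~ %.2f" % (d, churn(d)))
```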
Oh, I forgot another reason for sparsity that is specific to my case: performance. Dense representations cost more because of the increased number of synapses per column needed to encode them.
How is that possible? The part of the input space covered by proximal connections is determined by the percentage of potential connections you use, so it doesn’t matter whether the input is sparse or not.
Let’s say you use 70% potential connections, so you initially connect each SP element to 70% of all the elements in your input space. Whether 1% or 80% of these connections are active, it is still a representation of 70% of your input space.
It confuses me too. At the level of columns, you always have fixed sparsity (usually 2%), so you should have the same number of distal connections for sparse and dense input, and it can’t affect your performance. Unless you use your own implementation with non-fixed sparsity in the SP.
Efficient implementations usually only consider the bits in the input that are on. Sparser input saves computational cost in that case.
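As a toy illustration (a sketch of the general idea, not of any particular implementation), the overlap computation can iterate over just the active input indices, so its cost scales with the number of on-bits rather than the full input size:

```python
from collections import defaultdict

def build_index(connections):
    """Map each input bit to the columns connected to it.
    `connections` is a dict: column -> set of connected input indices."""
    by_input = defaultdict(list)
    for column, inputs in connections.items():
        for i in inputs:
            by_input[i].append(column)
    return by_input

def overlaps(active_bits, by_input):
    """Overlap scores, touching only the active input bits."""
    scores = defaultdict(int)
    for i in active_bits:              # cost ~ number of on-bits
        for column in by_input[i]:
            scores[column] += 1
    return scores
```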
At the level of the SP you’ll have 2% of active neurons in any case, so how can a denser input affect the computations in the TM?