HTM "Sparsity" vs Reciprocal of Information Density?

In HTM “sparsity” of a binary matrix is defined the way “density” is defined in sparse matrix representation:

HTM sparsity = w/n
WHERE
w = count of 1 bits
n = count of all bits

In HTM “density” is also used in a different way from sparse matrix representation terminology, to mean incompressibility or information density. However, to say that HTM sparsity = 0.5 maximizes information density isn’t correct, since all of the 1’s could be grouped together, separated from the 0’s, leaving the vector highly compressible.

Would the reciprocal of information density be a useful metric for HTM theory?

If so, would it be as useful as “HTM sparsity”?

If so, might it be better, for the purpose of consilience with other fields, to define HTM sparsity as the reciprocal of information density?


Good question.

Clearly, HTM sparsity is not enough, since it says nothing about the “topology” of the information.
It is nothing more than a count. It is not about the information itself, only about the support of the information.

What is the definition of information density?
Is it the amount of information contained (but how?) per unit of time in a sparse vector or matrix?

Proposed definition:

Information density of an encoding is its Kolmogorov complexity divided by its capacity.

Capacity is the time integral of Shannon’s capacity of the channel transmitting the encoding.
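
Kolmogorov complexity is uncomputable, so any concrete use of this definition needs a proxy. A rough sketch of how it might be estimated, using a general-purpose compressor as a stand-in for K and a single number standing in for the time integral of the channel’s Shannon capacity (both substitutions are my assumptions, not part of the definition above):

```python
import zlib

def approx_information_density(bits: str, capacity_bits: float) -> float:
    # zlib-compressed length is only an upper-bound proxy for Kolmogorov
    # complexity; capacity_bits stands in for the time integral of the
    # channel's Shannon capacity while the encoding is transmitted.
    compressed_bits = 8 * len(zlib.compress(bits.encode()))
    return compressed_bits / capacity_bits

# Example: a 1000-bit string at 2% sparsity with evenly spaced 1s
# compresses well, so its estimated information density is low.
sdr = ("1" + "0" * 49) * 20
print(approx_information_density(sdr, 1000.0))
```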

Thank you.
I was a bit afraid that someone would reply with Kolmogorov complexity.
Kolmogorov complexity is about something else.
The problem is the following:
What is the difference between these two vectors (let’s say 10 elements each)?
1000000001
0100000010
Clearly, the difference is enormous.
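
To make that concrete: both vectors have exactly the same HTM sparsity, yet they share no active bits, so their overlap (the usual SDR similarity measure) is zero. A minimal sketch (the helper names are mine):

```python
a = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
b = [0, 1, 0, 0, 0, 0, 0, 0, 1, 0]

def sparsity(v):
    return sum(v) / len(v)          # w / n

def overlap(u, v):
    return sum(x & y for x, y in zip(u, v))

print(sparsity(a), sparsity(b))     # 0.2 0.2 -- identical sparsity
print(overlap(a, b))                # 0       -- no shared active bits
```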

What would the K program for those two sequences look like and why would one be much different than the other?

Could you point me to a definition of Kolmogorov complexity that effectively deals with short encodings?

My understanding is that Kolmogorov complexity is only theoretically relevant when the number of bits in the encoding is much longer than the program required to emulate one Turing machine equivalent on another. For instance, a LISP interpreter is 90 lines of Python – about 4kB. Therefore a relevant length would be much larger than 4kB.

I know of one way short encodings might be addressed by a more recent take on Kolmogorov complexity:

Marcus Hutter’s AIXI theory (circa 2000) reified something he called a “natural Turing machine” as the basis of Kolmogorov complexity which is, itself, the basis of Solomonoff induction.

This “natural” Turing machine gets around the constant additive factor of 4k (or thereabouts) and permits one to compare short K programs without bias.

The “natural Turing machine” is, in some sense, the “simplest” computer (Turing machine equivalent). Defining the degree of simplicity of a computer does seem to beg the question but it has intuitive appeal.

I think I may have a way to actually define Hutter’s natural Turing machine in operational terms – which is to say it must involve physical laws so that the computer can emulate itself, minus some memory capacity and speed. The length of the emulation program – on that computer – would be a measure of that computer’s “natural Kolmogorov complexity”.

Having said that, I don’t see that such a minimal computer would result in greatly differing lengths of programs outputting the two short encodings:

1000000001
0100000010

Indeed, it seems the shortest programs for those strings would be of identical length:

"1000000001"
and
"0100000010"

Thank you for the reply.
I will check the reference you mention.
The intuition here is that HTM sees only specific sparse information arrays. Those arrays must, of course, be produced by an encoder, and you are right on that point.
But once the arrays are encoded, the notion of complexity used in HTM becomes different.
I guess you are referring mostly to the encoding process.

Honestly, you guys are talking over my head. I’m not sure what you are discussing is directly related to HTM theory. Is there a suggestion buried in here? Or are we just trying to define terminology?

My question originates in my interest in finding consilience with information theoretic treatments of artificial general intelligence – in particular Solomonoff induction and, ultimately, Hutter’s AIXI which formally unifies sequential decision theory (aka “supervised learning”) with Solomonoff induction (aka “unsupervised learning”) in what is sometimes called a “gold standard” abstract definition of AGI.

Understand that neither Solomonoff induction nor AIXI is, in any way, shape or form, competitive with machine learning techniques such as HTM. They are a theoretical framework that operates at a higher level of abstraction, within which machine learning techniques may be made commensurable, thereby facilitating interdisciplinary cross-fertilization.

The complaint has been lodged against HTM that it lacks mathematical rigor. I cannot vouch for this complaint but if it has merit, there may be theoretic merit in this discussion beyond mere terminology.

Indeed, we’ve heard that complaint since we took our code open source. In response, we have been publishing a lot of papers recently, many of which include mathematical explanations of some aspects of HTM theory.

The core of the issue is that HTM is not a mathematical construct, but a neurobiological one. It is hard to “prove” with maths. We can show statistically that it is working in some ways, but there is really no proof to be done that I can see. Many people coming from an ML background cannot accept that.

Anyway, if you think this conversation can help quantify HTM mathematically in some way, carry on! :smile:

I’ll look at the papers for possible relations with AIXI.

Looking at the problem of compressing the sparse distributed representations, there may need to be elaboration of algorithmic information theory (based on Kolmogorov complexity) to render it more relevant to the real world.

For instance, let’s say you have a large enough sample of SDRs to make general statements about their optimal compression comparing run length encoding with vectors of coordinates. Kolmogorov complexity’s definition does not permit one to factor out the specific algorithm’s length for a given context, such as HTM. Although it may well exist, I know of no notion of “relative Kolmogorov complexity” that would reify the metric of the encoding length limited to the RLE representation or the vector of coordinates representation. I can see how such a metric may be more relevant to HTM and if such a metric does not exist in the literature, it may be an area of theoretic importance justifying some effort by the AGI community.
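
To illustrate the comparison (the encoders and byte estimates below are simplistic assumptions of mine, not a claim about optimal compression): for a typical SDR-like vector, a coordinate list needs one entry per active bit, while run-length encoding needs roughly two runs per isolated active bit.

```python
import random

n, w = 2048, 40                                   # SDR-like size and active count
active = sorted(random.sample(range(n), w))
active_set = set(active)
sdr = [1 if i in active_set else 0 for i in range(n)]

def run_length_encode(bits):
    """Encode as (value, run_length) pairs."""
    runs, prev, count = [], bits[0], 0
    for b in bits:
        if b == prev:
            count += 1
        else:
            runs.append((prev, count))
            prev, count = b, 1
    runs.append((prev, count))
    return runs

rle = run_length_encode(sdr)
coords = active                                   # vector-of-coordinates form

# Crude size estimates: 2 bytes per run, 2 bytes per coordinate.
print("RLE runs:   ", len(rle), "~", 2 * len(rle), "bytes")
print("Coordinates:", len(coords), "~", 2 * len(coords), "bytes")
```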

I guess this is what I don’t understand. Why is the compression of SDRs interesting in this scenario? Our brains don’t compress SDRs; it doesn’t seem to be a necessary component of intelligence in the cortex. Compression is just an engineering problem. We only do it when we want to pass SDRs between nodes or serialize them. During the actual running of the algorithm, there is no compression (subsampling, yes, but no compression).


Let me caveat that by saying that we do store some SDRs in memory by index simply for efficiency, but it is not a component of the algorithm, just an optimization for our code.

One of the key insights of Solomonoff induction is that optimal lossless compression (the K program) leads to optimal prediction based on the implied model. Optimal prediction is defeated by overfitting and confirmation bias – both of which are, themselves, defeated by optimal lossless compression. This has been borne out empirically (see Jurgen Schmidhuber’s work with recurrent neural networks, in which compressed representations of weights are used for large networks).

I don’t know whether HTMs suffer from overfitting when too many neurons are thrown at a fairly simple problem with noise, but that is something I’d want to see measured, along with the argument for how overfitting is avoided without pruning.

Confirmation bias can be thought of as throwing away the “noise” (ie: lossy compression) in one’s metrics. That “noise” frequently turns out to be underfitting. One model’s noise is frequently another model’s prediction. To compare them fairly requires keeping a measure of the error commensurable with the measure of the model’s complexity.
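
One way to make “commensurable” concrete is a two-part (MDL-style) total: bits to describe the model plus bits to encode its residual errors under an assumed noise model. The numbers, noise model, and code-length formula below are illustrative assumptions of mine, not a claim about how HTM should be scored:

```python
import math

def total_description_length(model_bits, errors, noise_std=1.0):
    # Code length of each residual as -log2 of a Gaussian density
    # (ignoring the discretization a real code would need).
    per_sample = [0.5 * math.log2(2 * math.pi * noise_std ** 2)
                  + e ** 2 / (2 * noise_std ** 2 * math.log(2))
                  for e in errors]
    return model_bits + sum(per_sample)

# A simple model with moderate errors vs. a complex model that fits the
# same four samples almost perfectly.
simple   = total_description_length(64,   [0.9, -1.1, 0.8, -0.7])
complex_ = total_description_length(4096, [0.01, -0.02, 0.01, 0.0])
print(simple, complex_)   # the simpler model wins despite larger errors
```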

Now, having said all that, and recognizing that Schmidhuber was one of Hutter’s mentors – moreover recognizing that Schmidhuber’s LSTM RNNs deal with temporal structure, I think Hawkins’s strategy (and that of Hecht-Nielsen) is the better one:

Test and refine models of the neocortical unit of computation in practical applications.

Ultimately I believe we’ll find that the best model of the neocortex will do some form of lossless compression of its inputs in order to optimally predict the next inputs and optimally predict consequences of actions.


I’m having an interesting time at the moment experimenting with single-layer nets with binary “features” followed by a floating-point (FP) weighting and summation. I find that it learns far faster and better than with FP features (which is kind of surprising). I am also trying sparse binary features with recurrence. What I see is the effect of the recurrence dissipating quickly: as the system goes around, more and more misses occur. In contrast, sparse and non-sparse FP features and non-sparse binary features often cause the system to tumble around chaotically, indefinitely. A brain that persists in chaos is probably not so good. A small amount of chaos that exists for only a short time could be preferable.
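
For readers wondering roughly what “binary features followed by FP weighting and summation” looks like, here is a minimal sketch; the binarization scheme, learning rule, and toy task are my own assumptions, not a description of the actual experiments above:

```python
import random

n_features = 32
thresholds = [random.uniform(-1, 1) for _ in range(n_features)]
weights = [0.0] * n_features
lr = 0.05

def binary_features(x):
    # Binarize the raw input against fixed random thresholds.
    return [1.0 if xi > t else 0.0 for xi, t in zip(x, thresholds)]

def predict(x):
    f = binary_features(x)
    return sum(w * fi for w, fi in zip(weights, f)), f

def train_step(x, target):
    y, f = predict(x)
    err = target - y
    for i, fi in enumerate(f):
        weights[i] += lr * err * fi      # FP weight update on binary features
    return err

# Toy task: regress the mean of the raw input vector.
for _ in range(2000):
    x = [random.uniform(-1, 1) for _ in range(n_features)]
    train_step(x, sum(x) / len(x))
```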