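We define the entropy as

$$S = \sum_{i=1}^{N} \left[ -P(a_i)\log_2 P(a_i) - \bigl(1 - P(a_i)\bigr)\log_2 \bigl(1 - P(a_i)\bigr) \right]$$

where P(a_i) is defined by

$$P(a_i) = \frac{1}{M}\sum_{j=1}^{M} a_i^{j}$$

(with a_i^j the binary activity of the i’th mini-column at timestep j and N the number of mini-columns), which indicates the average activation frequency of the i’th mini-column during M input timesteps.
The function curve of the per-column term is: [Figure: binary entropy plotted against P(a_i)]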
From the paper "The HTM Spatial Pooler—A Neocortical Algorithm for Online Sparse Distributed Coding", we know that:
The SP will have low entropy if a small number of the SP mini-columns are active very frequently and the rest are inactive. Therefore, the entropy metric quantifies whether the SP efficiently utilizes all mini-columns.
This raises a doubt. Clearly, the per-column entropy reaches its maximum when P(a_i) equals 0.5. Suppose we set the activation density to 2% (i.e. the sparsity should be 2%), but some error causes the sparsity to be 50%. Then the entropy will be much larger than in the correct case, and we would conclude that the SP efficiently utilizes all mini-columns. That is not reasonable, is it?
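To make the concern concrete, here is a minimal Python sketch. It is not taken from the paper or from any HTM codebase: the function names, the toy sizes (N = 1024 columns, M = 500 timesteps) and the use of uniformly random SDRs are all assumptions for illustration. It compares the raw entropy of a well-behaved 2% SP, a faulty 50% one, and a 2% SP that reuses the same columns at every step.

```python
import math
import random


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable; defined as 0 at p = 0 and p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def activation_frequencies(sdrs, num_columns):
    """P(a_i) for each column, given one set of active column indices per timestep."""
    counts = [0] * num_columns
    for active in sdrs:
        for i in active:
            counts[i] += 1
    return [c / len(sdrs) for c in counts]


def sp_entropy(frequencies):
    """Sum of the per-column binary entropies."""
    return sum(binary_entropy(p) for p in frequencies)


def random_sdrs(num_columns, sparsity, timesteps, rng):
    """Hypothetical stand-in for SP output: k uniformly random active columns per step."""
    k = round(num_columns * sparsity)
    return [set(rng.sample(range(num_columns), k)) for _ in range(timesteps)]


rng = random.Random(0)
N, M = 1024, 500  # assumed toy sizes

sparse = activation_frequencies(random_sdrs(N, 0.02, M, rng), N)
dense = activation_frequencies(random_sdrs(N, 0.50, M, rng), N)
# Same 20 columns active at every timestep: about 2% sparsity, no utilization.
stuck = activation_frequencies([set(range(20))] * M, N)

print("2% sparsity, well spread :", sp_entropy(sparse))
print("50% sparsity (faulty)    :", sp_entropy(dense))
print("2% sparsity, stuck       :", sp_entropy(stuck))
```

The faulty 50% run scores far higher than the correct 2% run, so the raw number is only comparable between runs that share the same sparsity.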
Good skeptical thinking!
I think the assumption is that the density of the SDRs an SP generates is constant. Under that condition I don’t think there is a way to exploit the formula.
Without boosting, the SP will not efficiently utilize all mini-columns. And yes, we do use a constant activation sparsity throughout the process. We don’t change it as time passes or depending on what is being represented.
It is possible to “normalize” the entropy into the range 0-1. To do this, divide by the entropy of the average activation frequency, i.e. the entropy the SP would have if every mini-column were active at that average frequency (this is either the hardcoded target frequency OR it can be calculated from the data). A result of 1 or 100% indicates maximum utilization, and 0% means the program has serious problems. This normalization makes entropy into a useful debugging tool.
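The normalization can be sketched in a few lines of Python. This is only my reading of the description above, not code from any HTM implementation; `normalized_entropy` and its `target` parameter are hypothetical names. When the average is calculated from the data, the result cannot exceed 1, because binary entropy is concave. With a hardcoded target that differs from the measured average, it can.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable; defined as 0 at p = 0 and p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def normalized_entropy(frequencies, target=None):
    """Entropy of the column frequencies divided by the entropy obtained if
    every column were active at the average frequency.

    target: optional hardcoded activation frequency (assumption: e.g. 0.02);
            if omitted, the average is calculated from the data.
    """
    n = len(frequencies)
    avg = target if target is not None else sum(frequencies) / n
    max_entropy = n * binary_entropy(avg)
    if max_entropy == 0.0:
        return 0.0  # degenerate case: all columns always off or always on
    return sum(binary_entropy(p) for p in frequencies) / max_entropy


# Toy frequency profiles over 100 columns, all with average frequency 0.02.
uniform = [0.02] * 100                # every column used equally
half_used = [0.04] * 50 + [0.0] * 50  # half of the columns never active
stuck = [1.0] * 2 + [0.0] * 98        # two columns always on, the rest dead

print("uniform  :", normalized_entropy(uniform))
print("half used:", normalized_entropy(half_used))
print("stuck    :", normalized_entropy(stuck))
```

Because the denominator depends on the sparsity, a run at the wrong density no longer scores well just for being dense.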
Perhaps an interesting technique to fine-tune boosting.
Distantly related to this thread, folks have looked at classes of logic circuits that maintain the same number of “0-nodes” and “1-nodes” for stable power consumption.
What might be more relevant – if thinking about entropy as a metric correlated with efficiency – is whether higher layers of cognition follow some sort of Boltzmann distribution in physical count or functional activity.
For example, incoming audio at 16 bits of resolution and a 5 kHz cutoff – i.e. 10K samples per second – shrinks from 20 Kbytes per second to about 20 bytes per second if reduced to a single voice speaking.
And, intuitively contrasting today’s speech recognition with that of a couple of decades ago, there’s much more exception-archiving and real-time comparison today.
Which, to this non-expert, is somewhat how a child learns. A couple of dozen or hundred approximate rules to get the gist – and then fine-tuning to the mainstream word and idiom levels of understanding.