Firstly, to the mods, I’m not certain whether this falls into HTM Theory, or HTM Hacking. Due to the practical focus I chose Hacking. Please feel free to move if required.

To everyone else: the preamble is fairly long, so feel free to skip it. I am coming to this from a mostly uneducated perspective, however, and wanted to provide some background to help explain any peculiarities in my questions. It mostly covers my abilities (or lack thereof, really) and what I’m trying to do. I’m hoping to use this as my “learning” thread to keep track of what I’ve done, and hopefully someone else might find it useful as well.

As for my background: I have no grounding in math and cannot read mathematical notation beyond multiplication and division. I find I can usually understand math when it is converted to programming syntax, though complex algorithms are still beyond me. I understand some C and Lua, and can read Python with some effort. I never finished school and have no tertiary education. Lastly, I’m simply a hobbyist programmer and thus not very good. I have, however, been considered competent enough for an internship in kernel development, and although I was sweating blood trying to learn everything (I was responsible for producing a working graphics driver), I did make some progress until the company failed.

As for what I’m trying to accomplish: SDRs and HTMs caught my attention (yesterday), and I’m trying to make sense of how and why they work from a practical perspective. I am thus attempting to implement my own limited version. The reason I’m asking about encoding specifically is that HTMs are based on SDRs, SDRs represent other data, and that data needs to be encoded somehow into well-formed SDRs. Encoders are therefore the logical starting point.

As for what I am hoping to get out of this: I’m trying to incrementally learn how these systems work by coding them. I’ve read a fair amount of theory, but I don’t understand how to turn it into a practical understanding. And without that practical grounding, other parts of the theory are beyond my reach. Chicken, meet egg.

So, getting to my first questions (I have a lot):

What is “semantic similarity” defined as exactly?

Why does “related data” end up in similar positions in an SDR?

For example, taking the MD5 sums of the words “tiger” and “jaguar” (the BaMI paper suggested a deterministic hash function) will not give representations that are in *any* way similar to each other. Adding “is a car” and “is a cat” will not improve the situation at all. The scalar encoder demonstrated does not help either, as it assumes that the SDR has a fixed layout: position one will always relate to the value 1, and thus the learning system needs to know this. My understanding is clearly lacking here, as one of the key points of SDRs is that one can use the same algorithms to operate on them regardless of what they encode. The way I currently understand them, they seem more like a cute way of encoding ints, which does not seem accurate at all.
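To make the hashing problem concrete, here is a minimal sketch in Python of what I mean. The helper names `md5_bits` and `overlap` are my own and do not come from any HTM library, and treating the raw 128 digest bits as the representation is my assumption about how a hash-based encoder might work. I am also assuming that "similarity" between two representations means the number of ON bits they share.

```python
import hashlib

N_BITS = 128  # an MD5 digest is 128 bits long


def md5_bits(word):
    """Return the set of bit positions that are ON in the MD5 digest of word."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    value = int.from_bytes(digest, "big")
    return {i for i in range(N_BITS) if (value >> i) & 1}


def overlap(a, b):
    """Count the ON bits that two representations have in common."""
    return len(a & b)


tiger = md5_bits("tiger")
jaguar = md5_bits("jaguar")
banana = md5_bits("banana")

# If the hash captured meaning, tiger/jaguar should overlap far more than
# tiger/banana. I would expect both numbers to look like chance instead.
print("tiger  vs jaguar:", overlap(tiger, jaguar))
print("tiger  vs banana:", overlap(tiger, banana))
print("tiger  vs tiger :", overlap(tiger, tiger))
```

Note that roughly half of the digest bits are ON, so this is not even sparse. The only thing the hash seems to guarantee is that the same word always maps to the same bits.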

Is it a requirement that there be a specific number of “ON” bits in an SDR? For example, no more and no fewer than ten. If so, how on earth does one achieve that?
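The only approach I can see so far is the contiguous-block idea from the scalar encoder, sketched below in Python. This is my own reading of it and not a reference implementation: the function name `encode_scalar`, the default range of 0 to 100, and the sizes `n=40` and `w=10` are all assumptions I picked for illustration.

```python
def encode_scalar(value, min_val=0.0, max_val=100.0, n=40, w=10):
    """Encode value as a list of n bits with exactly w contiguous ON bits."""
    if not min_val <= value <= max_val:
        raise ValueError("value out of range")
    buckets = n - w + 1  # number of possible start positions for the block
    # Map the value onto a start position, rounding to the nearest bucket.
    start = int((value - min_val) / (max_val - min_val) * (buckets - 1) + 0.5)
    return [1 if start <= i < start + w else 0 for i in range(n)]


def overlap(a, b):
    """Count the positions where both bit lists are ON."""
    return sum(x & y for x, y in zip(a, b))


a = encode_scalar(50)
b = encode_scalar(52)
c = encode_scalar(90)

print("ON bits:", sum(a), sum(b), sum(c))  # always w, by construction
print("50 vs 52:", overlap(a, b))          # nearby values share most bits
print("50 vs 90:", overlap(a, c))          # distant values share few or none
```

This gives a fixed count because the block is always `w` bits wide, and nearby numbers overlap because their blocks slide past each other. What I cannot see is how to get the same guarantee for data that is not a number.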

I’ll stop here for now. Thanks to everyone who has read this far; it’s greatly appreciated.