Set encoder

Hi. I’m trying to use HTM for predicting future sets in a sequence of sets.

Example: There are around 500 unique elements that can go in a set. In my particular case the number of elements in each set in a sequence can vary from 1 to 80 (extreme cases), with a mean of around 15.

I’ve been wanting to try building a simple custom encoder for this with 1 (or multiple) bit(s) per unique element. We can also assume that element pairs don’t have overlapping bits (no cross-item semantic similarity), although I might be interested in adding that eventually. I’ve read the BAMI chapter on encoders and realized that one of the rules might make HTM completely unsuitable for my application:

“The output should have similar sparsity for all inputs and have enough one-bits to handle noise and
subsampling”

I wanted to ask here whether anyone has dealt with a similar problem and still managed to get decent results with HTM. Thanks in advance.


I think it will have to be multiple bits. You’ll need at least enough bits to store 500 unique values, so that’s 9 bits (because 2^9 == 512). So say there are 80 items in a set, which means 720 (80 * 9) bits. That doesn’t seem unreasonable. That being said, I would be more comfortable with a sparser encoding (there are fewer random collisions that way).

Values will have randomly overlapping bits, but I understand that there are no semantic relationships between the set items. The sparser the encoding, the less of a problem this will be (making the encoding sparser would require increasing the input space size).

We can get around this by creating one more unique item in the set called “none” or something. We make the input size always the same (80 slots), and any missing elements will be “none”. All set items will be encoded similarly, so hopefully the sparsity will be consistent.
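A minimal sketch of this fixed-slot idea, under my own assumptions: items are numbered 1–500, id 0 is reserved for the “none” padding item, and each slot holds a 9-bit binary code (the numbering and slot-filling order are illustrative, not from the thread):

```python
NUM_ITEMS = 500          # unique elements; id 0 is reserved for "none"
BITS_PER_ITEM = 9        # 2**9 = 512 >= 501 codes needed
MAX_SET_SIZE = 80        # fixed number of slots in every encoding

NONE_CODE = 0            # the reserved "none" padding item

def encode_set(items):
    """Encode a set of item ids (1..500) as a fixed 720-bit vector.

    Slots are filled in sorted order so the same set always yields
    the same encoding; unused slots hold the "none" code.
    """
    assert len(items) <= MAX_SET_SIZE
    codes = sorted(items) + [NONE_CODE] * (MAX_SET_SIZE - len(items))
    bits = []
    for code in codes:
        # 9-bit big-endian binary representation of the item id
        bits.extend((code >> b) & 1 for b in reversed(range(BITS_PER_ITEM)))
    return bits
```

One consequence of padding with a dense binary code for “none” is that the all-zero code keeps total sparsity fairly stable regardless of set size, which is the point of the trick.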

Some questions…

  1. should all potential set elements be considered of equal importance?
  2. does the order of the set matter at all?
  3. will the number of unique elements ever change?
  4. are you sure this data has no topology?

When I said 1 bit per unique item, I meant that each bit represents whether an item is in the set or not. Order doesn’t matter.

This notion of having a “none” item or “none” bits sounds good. I will definitely have to try it out.

Yes

No. I believe that by definition a set doesn’t have an order (i.e. programming languages usually don’t guarantee any particular order when iterating over a set).

No. The number of unique elements is fixed.

I tried looking it up, but I’m afraid I don’t know exactly what to answer. I’m guessing I’m not well versed in what it means for data to have topology.

I see, so you imagine a 500-bit input space where each bit represents the presence of one possible item in the set. Makes sense, and it’s simple. Have you tried this yet? It’s true that the sparsity won’t be uniform, because different numbers of items occur in the set over time, but you might still try it out.
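The 500-bit scheme is simple enough to sketch directly. Here item ids 0–499 are an assumed numbering; this is just the multi-hot membership vector described above:

```python
NUM_ITEMS = 500  # one bit per possible item

def encode_set(items):
    """One bit per possible item: bit i is 1 iff item i is in the set."""
    vec = [0] * NUM_ITEMS
    for i in items:
        vec[i] = 1
    return vec
```

The number of active bits equals the set size, so a 1-item set and an 80-item set differ a lot in sparsity; that is exactly the BAMI rule the original question worries about.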

If the set items truly have no relationship to each other, then there is no topology.

I’ll try it in the following days.

There isn’t any explicit relationship between items, the way there would be with scalars or dates. However, they can be split into a few groups, so perhaps I could use that in the future. Items within the same group get some overlap… or perhaps even better, a hierarchy of groupings where initially all items form a single group and then split into less and less abstract groups (a similar idea to hierarchical clustering). That grouping could either be done manually by common sense or potentially by clustering on co-appearance in a sequence. Lots of ideas come to mind, but initially I want to try with no topology assumption.
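The “items within the same group get some overlap” idea could be sketched like this. The group names, bit counts, and the bit-assignment scheme are all my assumptions for illustration: each item gets a few bits shared with its group plus some private bits of its own:

```python
def build_item_map(groups, private_bits=3, group_bits=2):
    """Assign each item a set of active bit indices: a few bits shared
    with its group (semantic overlap) plus some bits unique to the item.

    groups: dict mapping group name -> list of item ids.
    Returns (mapping, width): item id -> tuple of bit indices, and the
    total encoding width.
    """
    mapping = {}
    next_bit = 0
    for items in groups.values():
        # bits shared by every item in this group
        shared = list(range(next_bit, next_bit + group_bits))
        next_bit += group_bits
        for item in items:
            # bits unique to this item
            private = list(range(next_bit, next_bit + private_bits))
            next_bit += private_bits
            mapping[item] = tuple(shared + private)
    return mapping, next_bit

def encode_set(items, mapping, width):
    """Union of the active bits of every item in the set."""
    vec = [0] * width
    for item in items:
        for b in mapping[item]:
            vec[b] = 1
    return vec
```

With this layout, two items in the same group overlap in exactly `group_bits` positions, while items from different groups share no bits, which is one way to get the “like items encoded similarly” effect without a full topology.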

That sounds like a good idea. It’s not topology, just semantics. But it makes sense to encode like items similarly.