Ideas for better word encoding

I’m testing different variations of encoding WORDS (as sequences of characters) and passing them through bbHTM:SPooler to see what happens.

So far I have tried 3 variants:

  1. 5 bits per character, enough to cover all 26 letters (dense encoding)
  2. A one-hot 26-bit vector per character
  3. 3 active bits out of 78 bits per character (i.e. a category encoder)
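For concreteness, here is a minimal sketch of what variant 3 might look like. The layout is my assumption, not something stated above: each of the 26 letters gets its own block of 3 contiguous bits inside a 78-bit vector, and a word is encoded by concatenating the per-character vectors.

```python
def encode_char(c):
    """Return a 78-bit list with 3 active bits for one lowercase letter.
    ASSUMPTION: letter i occupies the contiguous bits 3*i .. 3*i+2."""
    bits = [0] * 78
    idx = ord(c) - ord('a')      # letter index 0..25
    for offset in range(3):      # 3 contiguous active bits
        bits[3 * idx + offset] = 1
    return bits

def encode_word(word):
    """Concatenate per-character encodings: len(word) * 78 bits total."""
    encoded = []
    for c in word:
        encoded.extend(encode_char(c))
    return encoded

sdr = encode_word("cat")
print(len(sdr), sum(sdr))        # 234 bits total, 9 of them active
```

With this layout, words that differ in one character differ in exactly 3 bit positions per mismatched slot, which is why an off-by-one shift in the characters destroys most of the overlap.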

So far option 3 behaves the best…
I compare the SP-generated word-SDRs and get, for example:

13 bits of overlap for “interface” and “intersection”, which is good … the interesting part is that I get zero overlap for
“management” and “measurement”; you can see why if you put them one below the other:

  management
  measurement
Do you see how the second example “slips” by one character :slight_smile:

The question is: do you have any other ideas for an encoder that would capture the slippage?!

What about letter n-grams? That’s what I use for my project.
Here is the Python to convert a string into n-grams:

  def create_letter_n_grams(s, N):
    # Yield every contiguous substring of length N from s.
    for i in range(len(s) - N + 1):
      yield s[i:i + N]
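For reference, here is what the generator yields for one of the example words (redefined here so the snippet runs standalone):

```python
def create_letter_n_grams(s, N):
    # Yield every contiguous substring of length N from s.
    for i in range(len(s) - N + 1):
        yield s[i:i + N]

print(list(create_letter_n_grams("interface", 3)))
# ['int', 'nte', 'ter', 'erf', 'rfa', 'fac', 'ace']
```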

Using 1-, 2-, and 3-letter n-grams I get these similarity results:
45.8% interface, intersection
47.8% management, measurement
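The exact similarity metric behind those percentages isn’t stated; one common choice is the Jaccard similarity over the combined 1/2/3-gram sets, sketched below. Note this is only an assumption about the metric, so the numbers it produces will differ from the ones quoted if a different formula was used.

```python
def create_letter_n_grams(s, N):
    for i in range(len(s) - N + 1):
        yield s[i:i + N]

def ngram_set(word, sizes=(1, 2, 3)):
    """All distinct 1-, 2- and 3-letter n-grams of a word."""
    return {g for n in sizes for g in create_letter_n_grams(word, n)}

def jaccard(a, b):
    """Shared n-grams divided by total distinct n-grams."""
    sa, sb = ngram_set(a), ngram_set(b)
    return len(sa & sb) / len(sa | sb)

print(round(jaccard("interface", "intersection"), 3))
print(round(jaccard("management", "measurement"), 3))
```

The key point survives regardless of the metric: “management” and “measurement” share plenty of n-grams (em, me, en, nt, eme, men, ent, …), so the one-character slip no longer zeroes out the similarity.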


That’s good, but I want to pass an encoded WORD to the SP, not an encoded string!
Also, how do you convert the n-grams to binary? What encoding do you use for them? If I use 2-grams I can no longer encode with a per-character 26-category encoder, because there are too many combinations (325)!
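One common way to sidestep that combinatorial blow-up (my own sketch, not anything from this thread or from bbHTM) is not to enumerate the n-grams at all, but to hash each one to a few deterministic bit positions in a fixed-width sparse vector. Two words that share n-grams then automatically share active bits:

```python
import hashlib

def encode_ngram(gram, width=1024, bits_per_gram=3):
    """Map one n-gram to up to `bits_per_gram` deterministic
    positions in [0, width). Collisions may yield fewer positions."""
    positions = set()
    for salt in range(bits_per_gram):
        h = hashlib.md5(f"{salt}:{gram}".encode()).hexdigest()
        positions.add(int(h, 16) % width)
    return positions

def encode_word(word, sizes=(1, 2, 3), width=1024):
    """Union of the hashed bit positions of all the word's n-grams."""
    active = set()
    for n in sizes:
        for i in range(len(word) - n + 1):
            active |= encode_ngram(word[i:i + n], width)
    return active

a = encode_word("management")
b = encode_word("measurement")
print(len(a & b))   # shared n-grams hash to the same bits, so overlap survives the slip
```

The vector width, the number of bits per n-gram, and the hash are all tunable; the only requirement is that the hash is deterministic so the same n-gram always lights the same bits.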