NEW: SimHash Distributed Scalar Encoder (SHaDSE) - DEPRECATED

Introducing:

SimHash Distributed Scalar Encoder (SHaDSE)

A Locality-Sensitive Hashing approach towards encoding semantic
data into Sparse Distributed Representations, ready to be fed
into an Hierarchical Temporal Memory, like NuPIC by
Numenta. This uses the SimHash algorithm to
accomplish this. LSH and SimHash come from the world of
nearest-neighbor document similarity searching.

This encoder is sibling with the original Scalar Encoder,
and the Random Distributed Scalar Encoder (RDSE). The static
bucketing strategy here is generally lifted straight from the RDSE,
although the “contents” and representations are created differently.

Instead of creating a random hash for our target bucket, we first
generate a SHA-3+SHAKE256 hash of the bucket index (the
SHAKE extension provides a variable-length hash (n)). Using that
hash, and hash values from nearby neighbor buckets (within
bucketRadius), we then create a weighted SimHash for our target
bucket index. This SimHash output will represent both individual bucket
value, and represent the relations between nearby neighbor values in
bucket space. A stable number of On bits (w) is achieved during final
collapsing step of SimHashing.

Scalar Encoder Feature Comparison

Original
Scalar
Random
Distributed
SimHash
Distributed
Collisions
Allowed
No Yes Yes
Semantic
Continuity
Physical
Adjacency
Custom
Random Map
SimHashing
Periodic
Wrapping
Yes No Yes1
Lookup
Tables
No Yes No
Topology
Supported
Yes No No
Required
Params
minval
maxval
resolution resolution
Other
Params
radius
resolution
periodic
bucketRadius
periodic
   1Encodings for wrapped edge buckets will adapt with bucket growth

Scalar Encoder Performance Comparison

Tests were run against the simple model presented in the
NuPIC Docs Algo Tutorial. The usual “HotGym”
rec-center-hourly.csv data was used (first 3000 rows, as
per tutorial code). The body of that article contains all settings
used for the Spatial Pooler, Temporal Memory, and SDR Classifier.
Any changes to Encoder settings are noted below.

Original
Scalar
Random
Distributed
SimHash
Distributed
Settings minval=0
maxval=100
resolution=0.4 resolution=0.25
MAPE 0.327 0.300 0.294
MAE 5.599 5.438 5.954
RMSE 9.257 8.838 10.296
NLL 1k 3.396 2.626 3.368
Time 45.32s 46.06s 65.71s1
Distance2 2 (stable) 2 (stable) 4-10 (variable)
   1Speed can easily be brought up with internal lookup tables
   2Average Hamming distance between two adjacent buckets

How It Works

Step 1 - Input some scalar values:
Time Input
0 27.7
1 10.2
2 16.3
3 16.4
4 7.6
5 22.9
Step 2 - Map inputs to buckets:

Map input values to bucket indexes, using the
formula from the RDSE.

  bucketIndex = (
    (bucketsWidth / 2) + int(round(
      (input - bucketOffset) / resolution)
    )
  )
Input Bucket
27.7 5
10.2 2
16.3 3
16.4 3
7.6 1
22.9 4
Step 3 - Hash bucket index

Hash bucket index value of a target bucket (3), and
neighbors (bucketRadius=2).

Bucket Hash
1 00110101
2 11011010
3 01110111
4 00011011
5 10101100
Step 4 - Convert binary 0’s in hashes to integer -1:
Bucket Hash Columns
1 -1 -1 +1 +1 -1 +1 -1 +1
2 +1 +1 -1 +1 +1 -1 +1 -1
3 -1 +1 +1 +1 -1 +1 +1 +1
4 -1 -1 -1 +1 +1 -1 +1 +1
5 +1 -1 +1 -1 +1 +1 -1 -1
Step 5 - Weight bucket hashes:

Weight bucket hashes. The target bucket (3) in center of
bucketRadius is heaviest.

Bucket Weight Hash Columns
1 1 -1 -1 +1 +1 -1 +1 -1 +1
2 2 +2 +2 -2 +2 +2 -2 +2 -2
3 3 -3 +3 +3 +3 -3 +3 +3 +3
4 2 -2 -2 -2 +2 +2 -2 +2 +2
5 1 +1 -1 +1 -1 +1 +1 -1 -1
Step 6 - Sum weighted binary columns:
Bucket Hash Column Summations
3 -3 1 1 7 1 1 5 3
Step 7 - Collapse integer sums back to binary:

Collapse the sums back to binary for a final SimHash value for our
target bucket (3).

A regular SimHash will change all sums >= 0 to a binary 1, while
all sums < 0 are changed to a binary 0. This SimHash will usually
result in about 50% sparsity.

We can further sparsify a SimHash for our needs by collapsing down
based on a certain number of the highest sums (here, w=3):

Bucket Sparse SimHash
3 0 0 0 1 0 0 1 1

This SimHash encoded representation will identify both our specific
target bucket, and the relation of our target bucket to it’s
near neighbors (descending outward with bucketRadius).

Source Code

Next Steps

  • Thanks for any feedback.
  • I would love any Code Reviews, as this is my first real Python
    project.
  • Should I add Python tests? Cap’n’proto Serialization? I wasn’t
    sure if that would be wasted work or not.
  • Create C++ version for Community NuPIC.cpp (I plan on doing this).
  • Find a way to lower and stabilize the average Hamming distance
    between buckets, but guarantee it doesn’t bottom out.
    This video seems helpful, but is beyond me.
  • Adaptive versions to compare against Adaptive Scalar Encoder?
  • What about other Locality-Sensitive Hashing algorithms
    and methods, like MinHash?
  • Math Readers: Does Locality-Preserving Hashing also
    apply? Would it add support for topology?
  • It seems to me like the SimHash algorithm could
    easily be performed biologically, in accordance with our
    current theories. I can’t help but wonder if this is somehow
    related to how the neocortex encodes and pools.

Learn More

7 Likes

Hi Brev,

This encoder is interesting! If I understand it correctly, the goal of this encoder is to greatly increase the radius of the encoding, and to have a non-linear semantic similarity, like in this sketch?

Should I add Python tests?

I wrote the unit tests for the RDSE in the community-fork of nupic.cpp. The unit tests check for semantic similarity by running through a range of values and measuring the overlap between consecutive output SDRs. Then it checks the min,mean,max of the overlaps. It does this test twice: once with a small step which results in a high semantic similarity between inputs, and again with a large step which results in no similarity between inputs.
Essentially it makes & checks the above graph, except that it does it many times and it only checks two points on the curve.

C++ version

C++ is not necessary for acceptance into the community fork. Python is acceptable.

Cap’n’proto Serialization … Adaptive Scalar Encoder

I wouldn’t bother with either of these. The community fork version of nupic has already dropped capnproto. The adaptive scalar encoder is obsoleted by the RDSE.

SHA-3+SHAKE256

I would recommend against using a cryptographic hash function. It’s not critical but here are some potential issues with hash functions:

  1. Crypto hashs are intentionally time consuming to compute. It makes it harder to brute force attack them.
  2. Crypto hashs are typically seeded with random numbers, such as the system time. It makes it harder to guess the outputs of the hash (which makes it more secure). However if you save an HTM, reboot your computer, and load the HTM back, then the encoder will be different because the seed which the hash uses has changed.

P.S. Here is my goto reference about the encoding process: https://arxiv.org/pdf/1602.05925.pdf

3 Likes

Hi @dmac, thanks for the reply. The goal is really the exact same as the RDSE, it just accomplishes it via a different method. By tuning some of the Encoder parameters, one could maybe accomplish something like you have in your graphic, but it was not a goal.

These are very helpful details, thanks!

Oh, great, I will file a PR as a first step, until the C++ verison is ready.

:+1:

I was wondering about that! Any ideas how I can generate a variable-length hash digest, if I’m not using SHAKE? I’ll check out the link. Thanks!

2 Likes

Hi,
Very interesting, thanks for posting.
After reading the description of the algorithm. How does SimHash fair against 1D Grid Cell? Maybe I’m wrong but sounds like they have similar properties. How do they differ?

3 Likes

Thanks for sharing your interesting encoder. I am very interested in knowing for which tasks/application the new encoder is better than the existing encoders? Any hint?

3 Likes

Great job @brev! The biggest drawback of the RDSE, IMHO, is it’s lack of periodicity.

1 Like

1 Like

The community fork has a copy of MurmurHash3 which works well and is used by the RDSE. See also: https://en.wikipedia.org/wiki/MurmurHash

1 Like

Great question, I was wondering if there might be some overlap with grid cells, but I have no idea myself. thanks.

So far, I don’t know that it will perform better/worse - probably about the same. It’s more about alternate ideas and methods of accomplishing the encoding. thanks.

Hmm, I’m only seeing 32, 64, 128, and 160-bit output sizes available. An encoding is usually 400-bits wide, so I need the variable length output somehow.

Hi @rhyolight, thanks for taking a look!

Reminds of the Coordinate Encoder

Yes, I’ll be thinking about that (and hex grid cells) more. There may be more interesting things in that direction.

Lookup tables needed for decoding

I was excited that this new encoder could do it’s job without saving any state in memory. But adding lookup tables could be helpful for reconstructive decoding work, and speeding it up.

Hamming distance changes with resolution?

Yes, and with changes to bucketRadius, also.

With the Scalar and RDSE encoders, the hamming distance between buckets is perfectly and manually crafted.

With SimHashing, the hamming distance between buckets is more rough, coarse, and random, because it’s based on the statistical distribution of the input hashes. But, even though it’s not as pretty, it will generally approach the same state as the other encoders.

It’s perhaps more like growing an encoding, than building one.

2 Likes

I forgot to mention: With this SimHash Scalar encoder, if you set bucketRadius=0, it can also act like a category encoder.

And, I created a separate Document encoder with this same SimHash method. More here:

1 Like

@brev ís your c++ version available for testing? Thanks

Hi all,

It turns out I missed some important historical context when it comes to Scalar Encoders, and just wanted to close the loop:

In Section 3.2 of Encoding Data for HTM Systems, @scott lists a method of hash-based scalar encoding, accomplished by simply hashing the indexes of a simple scalar encoding. I was so focused on content hashing when I read it that I missed the focus on index hashing, and missed the whole point. It’s a simple and beautiful solution (probably ideal for scalars without min/max).

Luckily, @dmac was paying attention, and he used this algorithm when building the random scalar encoder that is already working in in htm.core. This has all the same benefits of my SimHash Scalar Encoder, but is more simple, quick, and ideal. (Please note that this algo is different from the RDSE of Old NuPIC, which has no hashing).

So, my SimHash Scalar Encoder is deprecated. Just use what is already in htm.core.

I still plan on building my SimHash Document Encoder for htm.core in the near future, as it still seems useful. I’ll update in that thread when I have something to show (cc @thanh-binh.to).

Also, just today I learned of the Sampling Linear Scalar Encoder by @mrcslws, which seems to be a similarly ideal solution for encoding scalars which do have min/max requirements.

5 Likes