NEW: SimHash Distributed Scalar Encoder (SHaDSE)



SimHash Distributed Scalar Encoder (SHaDSE)

A Locality-Sensitive Hashing approach towards encoding semantic
data into Sparse Distributed Representations, ready to be fed
into an Hierarchical Temporal Memory, like NuPIC by
Numenta. This uses the SimHash algorithm to
accomplish this. LSH and SimHash come from the world of
nearest-neighbor document similarity searching.

This encoder is sibling with the original Scalar Encoder,
and the Random Distributed Scalar Encoder (RDSE). The static
bucketing strategy here is generally lifted straight from the RDSE,
although the “contents” and representations are created differently.

Instead of creating a random hash for our target bucket, we first
generate a SHA-3+SHAKE256 hash of the bucket index (the
SHAKE extension provides a variable-length hash (n)). Using that
hash, and hash values from nearby neighbor buckets (within
bucketRadius), we then create a weighted SimHash for our target
bucket index. This SimHash output will represent both individual bucket
value, and represent the relations between nearby neighbor values in
bucket space. A stable number of On bits (w) is achieved during final
collapsing step of SimHashing.

Scalar Encoder Feature Comparison

No Yes Yes
Random Map
Yes No Yes1
No Yes No
Yes No No
resolution resolution
   1Encodings for wrapped edge buckets will adapt with bucket growth

Scalar Encoder Performance Comparison

Tests were run against the simple model presented in the
NuPIC Docs Algo Tutorial. The usual “HotGym”
rec-center-hourly.csv data was used (first 3000 rows, as
per tutorial code). The body of that article contains all settings
used for the Spatial Pooler, Temporal Memory, and SDR Classifier.
Any changes to Encoder settings are noted below.

Settings minval=0
resolution=0.4 resolution=0.25
MAPE 0.327 0.300 0.294
MAE 5.599 5.438 5.954
RMSE 9.257 8.838 10.296
NLL 1k 3.396 2.626 3.368
Time 45.32s 46.06s 65.71s1
Distance2 2 (stable) 2 (stable) 4-10 (variable)
   1Speed can easily be brought up with internal lookup tables
   2Average Hamming distance between two adjacent buckets

How It Works

Step 1 - Input some scalar values:
Time Input
0 27.7
1 10.2
2 16.3
3 16.4
4 7.6
5 22.9
Step 2 - Map inputs to buckets:

Map input values to bucket indexes, using the
formula from the RDSE.

  bucketIndex = (
    (bucketsWidth / 2) + int(round(
      (input - bucketOffset) / resolution)
Input Bucket
27.7 5
10.2 2
16.3 3
16.4 3
7.6 1
22.9 4
Step 3 - Hash bucket index

Hash bucket index value of a target bucket (3), and
neighbors (bucketRadius=2).

Bucket Hash
1 00110101
2 11011010
3 01110111
4 00011011
5 10101100
Step 4 - Convert binary 0’s in hashes to integer -1:
Bucket Hash Columns
1 -1 -1 +1 +1 -1 +1 -1 +1
2 +1 +1 -1 +1 +1 -1 +1 -1
3 -1 +1 +1 +1 -1 +1 +1 +1
4 -1 -1 -1 +1 +1 -1 +1 +1
5 +1 -1 +1 -1 +1 +1 -1 -1
Step 5 - Weight bucket hashes:

Weight bucket hashes. The target bucket (3) in center of
bucketRadius is heaviest.

Bucket Weight Hash Columns
1 1 -1 -1 +1 +1 -1 +1 -1 +1
2 2 +2 +2 -2 +2 +2 -2 +2 -2
3 3 -3 +3 +3 +3 -3 +3 +3 +3
4 2 -2 -2 -2 +2 +2 -2 +2 +2
5 1 +1 -1 +1 -1 +1 +1 -1 -1
Step 6 - Sum weighted binary columns:
Bucket Hash Column Summations
3 -3 1 1 7 1 1 5 3
Step 7 - Collapse integer sums back to binary:

Collapse the sums back to binary for a final SimHash value for our
target bucket (3).

A regular SimHash will change all sums >= 0 to a binary 1, while
all sums < 0 are changed to a binary 0. This SimHash will usually
result in about 50% sparsity.

We can further sparsify a SimHash for our needs by collapsing down
based on a certain number of the highest sums (here, w=3):

Bucket Sparse SimHash
3 0 0 0 1 0 0 1 1

This SimHash encoded representation will identify both our specific
target bucket, and the relation of our target bucket to it’s
near neighbors (descending outward with bucketRadius).

Source Code

Next Steps

  • Thanks for any feedback.
  • I would love any Code Reviews, as this is my first real Python
  • Should I add Python tests? Cap’n’proto Serialization? I wasn’t
    sure if that would be wasted work or not.
  • Create C++ version for Community NuPIC.cpp (I plan on doing this).
  • Find a way to lower and stabilize the average Hamming distance
    between buckets, but guarantee it doesn’t bottom out.
    This video seems helpful, but is beyond me.
  • Adaptive versions to compare against Adaptive Scalar Encoder?
  • What about other Locality-Sensitive Hashing algorithms
    and methods, like MinHash?
  • Math Readers: Does Locality-Preserving Hashing also
    apply? Would it add support for topology?
  • It seems to me like the SimHash algorithm could
    easily be performed biologically, in accordance with our
    current theories. I can’t help but wonder if this is somehow
    related to how the neocortex encodes and pools.

Learn More

ANN : Scalar Matrix-Hashing ?encoder?
Monotonic grey code representations of numbers
Also: SimHash Document Encoder

Hi Brev,

This encoder is interesting! If I understand it correctly, the goal of this encoder is to greatly increase the radius of the encoding, and to have a non-linear semantic similarity, like in this sketch?

Should I add Python tests?

I wrote the unit tests for the RDSE in the community-fork of nupic.cpp. The unit tests check for semantic similarity by running through a range of values and measuring the overlap between consecutive output SDRs. Then it checks the min,mean,max of the overlaps. It does this test twice: once with a small step which results in a high semantic similarity between inputs, and again with a large step which results in no similarity between inputs.
Essentially it makes & checks the above graph, except that it does it many times and it only checks two points on the curve.

C++ version

C++ is not necessary for acceptance into the community fork. Python is acceptable.

Cap’n’proto Serialization … Adaptive Scalar Encoder

I wouldn’t bother with either of these. The community fork version of nupic has already dropped capnproto. The adaptive scalar encoder is obsoleted by the RDSE.


I would recommend against using a cryptographic hash function. It’s not critical but here are some potential issues with hash functions:

  1. Crypto hashs are intentionally time consuming to compute. It makes it harder to brute force attack them.
  2. Crypto hashs are typically seeded with random numbers, such as the system time. It makes it harder to guess the outputs of the hash (which makes it more secure). However if you save an HTM, reboot your computer, and load the HTM back, then the encoder will be different because the seed which the hash uses has changed.

P.S. Here is my goto reference about the encoding process:


Hi @dmac, thanks for the reply. The goal is really the exact same as the RDSE, it just accomplishes it via a different method. By tuning some of the Encoder parameters, one could maybe accomplish something like you have in your graphic, but it was not a goal.

These are very helpful details, thanks!

Oh, great, I will file a PR as a first step, until the C++ verison is ready.


I was wondering about that! Any ideas how I can generate a variable-length hash digest, if I’m not using SHAKE? I’ll check out the link. Thanks!


Very interesting, thanks for posting.
After reading the description of the algorithm. How does SimHash fair against 1D Grid Cell? Maybe I’m wrong but sounds like they have similar properties. How do they differ?


Thanks for sharing your interesting encoder. I am very interested in knowing for which tasks/application the new encoder is better than the existing encoders? Any hint?


Great job @brev! The biggest drawback of the RDSE, IMHO, is it’s lack of periodicity.

1 Like

1 Like

The community fork has a copy of MurmurHash3 which works well and is used by the RDSE. See also:

1 Like

Great question, I was wondering if there might be some overlap with grid cells, but I have no idea myself. thanks.


So far, I don’t know that it will perform better/worse - probably about the same. It’s more about alternate ideas and methods of accomplishing the encoding. thanks.


Hmm, I’m only seeing 32, 64, 128, and 160-bit output sizes available. An encoding is usually 400-bits wide, so I need the variable length output somehow.


Hi @rhyolight, thanks for taking a look!

Reminds of the Coordinate Encoder

Yes, I’ll be thinking about that (and hex grid cells) more. There may be more interesting things in that direction.

Lookup tables needed for decoding

I was excited that this new encoder could do it’s job without saving any state in memory. But adding lookup tables could be helpful for reconstructive decoding work, and speeding it up.

Hamming distance changes with resolution?

Yes, and with changes to bucketRadius, also.

With the Scalar and RDSE encoders, the hamming distance between buckets is perfectly and manually crafted.

With SimHashing, the hamming distance between buckets is more rough, coarse, and random, because it’s based on the statistical distribution of the input hashes. But, even though it’s not as pretty, it will generally approach the same state as the other encoders.

It’s perhaps more like growing an encoding, than building one.


I forgot to mention: With this SimHash Scalar encoder, if you set bucketRadius=0, it can also act like a category encoder.

And, I created a separate Document encoder with this same SimHash method. More here:

1 Like

@brev ís your c++ version available for testing? Thanks