SimHash Distributed Scalar Encoder (SHaDSE)
A Locality-Sensitive Hashing (LSH) approach to encoding semantic
data into Sparse Distributed Representations, ready to be fed
into a Hierarchical Temporal Memory, like NuPIC by
Numenta. It uses the SimHash algorithm to accomplish this;
LSH and SimHash come from the world of
nearest-neighbor document similarity searching.
This encoder is a sibling of the original Scalar Encoder
and the Random Distributed Scalar Encoder (RDSE). The static
bucketing strategy here is lifted straight from the RDSE,
although the bucket “contents” and representations are created differently.
Instead of creating a random hash for our target bucket, we first
generate a SHA-3 SHAKE256 hash of the bucket index (the
SHAKE extension provides a variable-length hash of `n` bits). Using that
hash, and hash values from nearby neighbor buckets (within
`bucketRadius`), we then create a weighted SimHash for our target
bucket index. This SimHash output represents both the individual bucket
value and the relations between nearby neighbor values in
bucket space. A stable number of ON bits (`w`) is achieved during the final
collapsing step of SimHashing.
Scalar Encoder Feature Comparison
Scalar Encoder Performance Comparison
Tests were run against the simple model presented in the
NuPIC Docs Algo Tutorial. The usual “HotGym”
rec-center-hourly.csv data was used (first 3000 rows, as
per tutorial code). The body of that article contains all settings
used for the Spatial Pooler, Temporal Memory, and SDR Classifier.
Any changes to Encoder settings are noted below.
|Distance²||2 (stable)||2 (stable)||4–10 (variable)|

² Average Hamming distance between two adjacent buckets.
How It Works
Step 1 - Input some scalar values:
Step 2 - Map inputs to buckets:
Map input values to bucket indexes, using the
formula from the RDSE.
bucketIndex = ( (bucketsWidth / 2) + int(round( (input - bucketOffset) / resolution) ) )
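For illustration, the mapping can be sketched in Python (parameter names mirror the formula above; the actual RDSE defaults may differ):

```python
def bucket_index(value, buckets_width=100, bucket_offset=0.0, resolution=1.0):
    """Map a scalar input to a bucket index (RDSE-style formula above)."""
    return (buckets_width // 2) + int(round((value - bucket_offset) / resolution))

# Nearby inputs land in the same or adjacent buckets:
print(bucket_index(3.0))  # 53
print(bucket_index(3.4))  # 53
print(bucket_index(4.0))  # 54
```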
Step 3 - Hash bucket indexes:
Hash the bucket index value of the target bucket (3), and of its
nearby neighbor buckets (within `bucketRadius`).
Step 4 - Convert binary 0s in the hashes to integer -1:
|Bucket||Hash (as ±1)|
|1||-1 -1 +1 +1 -1 +1 -1 +1|
|2||+1 +1 -1 +1 +1 -1 +1 -1|
|3||-1 +1 +1 +1 -1 +1 +1 +1|
|4||-1 -1 -1 +1 +1 -1 +1 +1|
|5||+1 -1 +1 -1 +1 +1 -1 -1|
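Steps 3 and 4 can be sketched with Python's standard `hashlib` (hashing the stringified index here is an assumption; the actual encoder may feed bytes to SHAKE differently):

```python
import hashlib
import numpy as np

def bucket_hash(index, n=8):
    """SHAKE-256 hash of a bucket index, expanded to n values of +1/-1."""
    digest = hashlib.shake_256(str(index).encode()).digest((n + 7) // 8)
    bits = np.unpackbits(np.frombuffer(digest, dtype=np.uint8))[:n]
    return np.where(bits == 1, 1, -1)  # binary 1 -> +1, binary 0 -> -1

print(bucket_hash(3))  # deterministic vector of n values in {-1, +1}
```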
Step 5 - Weight bucket hashes:
Weight the bucket hashes. The target bucket (3), at the center of
`bucketRadius`, is weighted heaviest.
|Bucket||Weight||Weighted Hash|
|1||1||-1 -1 +1 +1 -1 +1 -1 +1|
|2||2||+2 +2 -2 +2 +2 -2 +2 -2|
|3||3||-3 +3 +3 +3 -3 +3 +3 +3|
|4||2||-2 -2 -2 +2 +2 -2 +2 +2|
|5||1||+1 -1 +1 -1 +1 +1 -1 -1|
Step 6 - Sum weighted binary columns:
|Bucket||Hash Column Summations|
|3||-3 1 1 7 1 1 5 3|
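Steps 5 and 6, reproduced with numpy using the worked-example numbers from the tables above:

```python
import numpy as np

# The +/-1 hashes of buckets 1-5, from the Step 4 table.
hashes = np.array([
    [-1, -1, +1, +1, -1, +1, -1, +1],  # bucket 1
    [+1, +1, -1, +1, +1, -1, +1, -1],  # bucket 2
    [-1, +1, +1, +1, -1, +1, +1, +1],  # bucket 3 (target)
    [-1, -1, -1, +1, +1, -1, +1, +1],  # bucket 4
    [+1, -1, +1, -1, +1, +1, -1, -1],  # bucket 5
])
weights = np.array([1, 2, 3, 2, 1])  # target bucket in the center is heaviest

sums = (hashes * weights[:, None]).sum(axis=0)
print(sums)  # [-3  1  1  7  1  1  5  3] -- matches the summation table
```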
Step 7 - Collapse integer sums back to binary:
Collapse the sums back to binary for a final SimHash value for our
target bucket (3).
A regular SimHash changes all sums >= 0 to a binary 1, and all
sums < 0 to a binary 0. This will usually
result in about 50% sparsity.
We can further sparsify a SimHash for our needs by instead collapsing down
based on a certain number of the highest sums (here, `w` = 3):
|Bucket||Final SimHash|
|3||0 0 0 1 0 0 1 1|
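Both collapsing variants, continued from the worked-example sums (how ties among equal sums break is left unspecified here):

```python
import numpy as np

sums = np.array([-3, 1, 1, 7, 1, 1, 5, 3])  # from the summation step

# Regular SimHash collapse: sums >= 0 become 1 (roughly 50% sparsity).
dense = (sums >= 0).astype(int)
print(dense)   # [0 1 1 1 1 1 1 1]

# Sparser variant: only the w highest sums become ON bits (here w = 3).
w = 3
sparse = np.zeros_like(sums)
sparse[np.argsort(sums)[-w:]] = 1
print(sparse)  # [0 0 0 1 0 0 1 1] -- matches the final SimHash table
```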
This SimHash encoded representation will identify both our specific
target bucket, and the relation of our target bucket to its
near neighbors (with similarity descending outward through `bucketRadius`).
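The whole pipeline above can be sketched end-to-end. This is a toy illustration under assumptions, not the actual implementation: the hashing convention, parameter names, and defaults are all placeholders.

```python
import hashlib
import numpy as np

def shadse_encode(value, n=8, w=3, bucket_radius=2,
                  buckets_width=100, bucket_offset=0.0, resolution=1.0):
    """Toy end-to-end sketch of the SHaDSE steps above (illustrative names)."""
    # Step 2: map the input to a target bucket index (RDSE-style formula).
    target = (buckets_width // 2) + int(round((value - bucket_offset) / resolution))
    sums = np.zeros(n, dtype=int)
    for offset in range(-bucket_radius, bucket_radius + 1):
        # Steps 3-4: SHAKE-256 hash each neighbor bucket, as +/-1 values.
        digest = hashlib.shake_256(str(target + offset).encode()).digest((n + 7) // 8)
        bits = np.unpackbits(np.frombuffer(digest, dtype=np.uint8))[:n]
        # Steps 5-6: weight by closeness to the target bucket, then sum.
        weight = bucket_radius + 1 - abs(offset)
        sums += weight * np.where(bits == 1, 1, -1)
    # Step 7: collapse, keeping the w highest sums as ON bits.
    sdr = np.zeros(n, dtype=int)
    sdr[np.argsort(sums)[-w:]] = 1
    return sdr

print(shadse_encode(3.0))  # an SDR with exactly w ON bits
```

In practice `n` would be far larger (hundreds of bits) so that `w` ON bits yield a properly sparse SDR.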
- Pull Request against Old NuPIC with all the final goods.
- Research Repo with my test runners and original research code.
- Thanks for any feedback.
- I would love any Code Reviews, as this is my first real Python
project.
- Should I add Python tests? Cap'n Proto serialization? I wasn't
sure if that would be wasted work or not.
- Create C++ version for Community NuPIC.cpp (I plan on doing this).
- Find a way to lower and stabilize the average Hamming distance
between buckets, while guaranteeing it never bottoms out.
This video seems helpful, but is beyond me.
- Adaptive versions to compare against Adaptive Scalar Encoder?
- What about other Locality-Sensitive Hashing algorithms
and methods, like MinHash?
- Math Readers: Does Locality-Preserving Hashing also
apply? Would it add support for topology?
- It seems to me like the SimHash algorithm could
easily be performed biologically, in accordance with our
current theories. I can’t help but wonder if this is somehow
related to how the neocortex encodes and pools.