Introducing:
SimHash Distributed Scalar Encoder (SHaDSE)
A LocalitySensitive Hashing approach towards encoding semantic
data into Sparse Distributed Representations, ready to be fed
into an Hierarchical Temporal Memory, like NuPIC by
Numenta. This uses the SimHash algorithm to
accomplish this. LSH and SimHash come from the world of
nearestneighbor document similarity searching.
This encoder is sibling with the original Scalar Encoder,
and the Random Distributed Scalar Encoder (RDSE). The static
bucketing strategy here is generally lifted straight from the RDSE,
although the â€ścontentsâ€ť and representations are created differently.
Instead of creating a random hash for our target bucket, we first
generate a SHA3+SHAKE256 hash of the bucket index (the
SHAKE extension provides a variablelength hash (n
)). Using that
hash, and hash values from nearby neighbor buckets (within
bucketRadius
), we then create a weighted SimHash for our target
bucket index. This SimHash output will represent both individual bucket
value, and represent the relations between nearby neighbor values in
bucket space. A stable number of On bits (w
) is achieved during final
collapsing step of SimHashing.
Scalar Encoder Feature Comparison
Original Scalar 
Random Distributed 
SimHash Distributed 


Collisions Allowed 
No  Yes  Yes 
Semantic Continuity 
Physical Adjacency 
Custom Random Map 
SimHashing 
Periodic Wrapping 
Yes  No  Yes^{1} 
Lookup Tables 
No  Yes  No 
Topology Supported 
Yes  No  No 
Required Params 
minval maxval 
resolution  resolution 
Other Params 
radius resolution periodic 
bucketRadius periodic 
Scalar Encoder Performance Comparison
Tests were run against the simple model presented in the
NuPIC Docs Algo Tutorial. The usual â€śHotGymâ€ť
reccenterhourly.csv
data was used (first 3000 rows, as
per tutorial code). The body of that article contains all settings
used for the Spatial Pooler, Temporal Memory, and SDR Classifier.
Any changes to Encoder settings are noted below.
Original Scalar 
Random Distributed 
SimHash Distributed 


Settings  minval=0 maxval=100 
resolution=0.4  resolution=0.25 
MAPE  0.327  0.300  0.294 
MAE  5.599  5.438  5.954 
RMSE  9.257  8.838  10.296 
NLL 1k  3.396  2.626  3.368 
Time  45.32s  46.06s  65.71s^{1} 
Distance^{2}  2 (stable)  2 (stable)  410 (variable) 
Â Â Â ^{2}Average Hamming distance between two adjacent buckets
How It Works
Step 1  Input some scalar values:
Time  Input 

0  27.7 
1  10.2 
2  16.3 
3  16.4 
4  7.6 
5  22.9 
Step 2  Map inputs to buckets:
Map input values to bucket indexes, using the
formula from the RDSE.
bucketIndex = (
(bucketsWidth / 2) + int(round(
(input  bucketOffset) / resolution)
)
)
Input  Bucket 

27.7  5 
10.2  2 
16.3  3 
16.4  3 
7.6  1 
22.9  4 
Step 3  Hash bucket index
Hash bucket index value of a target bucket (3), and
neighbors (bucketRadius
=2).
Bucket  Hash 

1  00110101 
2  11011010 
3  01110111 
4  00011011 
5  10101100 
Step 4  Convert binary 0â€™s in hashes to integer 1:
Bucket  Hash Columns 

1  1 1 +1 +1 1 +1 1 +1 
2  +1 +1 1 +1 +1 1 +1 1 
3  1 +1 +1 +1 1 +1 +1 +1 
4  1 1 1 +1 +1 1 +1 +1 
5  +1 1 +1 1 +1 +1 1 1 
Step 5  Weight bucket hashes:
Weight bucket hashes. The target bucket (3) in center of
bucketRadius
is heaviest.
Bucket  Weight  Hash Columns 

1  1  1 1 +1 +1 1 +1 1 +1 
2  2  +2 +2 2 +2 +2 2 +2 2 
3  3  3 +3 +3 +3 3 +3 +3 +3 
4  2  2 2 2 +2 +2 2 +2 +2 
5  1  +1 1 +1 1 +1 +1 1 1 
Step 6  Sum weighted binary columns:
Bucket  Hash Column Summations 

3  3 1 1 7 1 1 5 3 
Step 7  Collapse integer sums back to binary:
Collapse the sums back to binary for a final SimHash value for our
target bucket (3).
A regular SimHash will change all sums >= 0
to a binary 1
, while
all sums < 0
are changed to a binary 0
. This SimHash will usually
result in about 50% sparsity.
We can further sparsify a SimHash for our needs by collapsing down
based on a certain number of the highest sums (here, w
=3):
Bucket  Sparse SimHash 

3  0 0 0 1 0 0 1 1 
This SimHash encoded representation will identify both our specific
target bucket, and the relation of our target bucket to itâ€™s
near neighbors (descending outward with bucketRadius
).
Source Code
 Pull Request against Old NuPIC with all the final goods.
 Research Repo with my test runners and original research code.
Next Steps
 Thanks for any feedback.
 I would love any Code Reviews, as this is my first real Python
project.  Should I add Python tests? Capâ€™nâ€™proto Serialization? I wasnâ€™t
sure if that would be wasted work or not.  Create C++ version for Community NuPIC.cpp (I plan on doing this).
 Find a way to lower and stabilize the average Hamming distance
between buckets, but guarantee it doesnâ€™t bottom out.
This video seems helpful, but is beyond me.  Adaptive versions to compare against Adaptive Scalar Encoder?
 What about other LocalitySensitive Hashing algorithms
and methods, like MinHash? 
Math Readers: Does LocalityPreserving Hashing also
apply? Would it add support for topology?  It seems to me like the SimHash algorithm could
easily be performed biologically, in accordance with our
current theories. I canâ€™t help but wonder if this is somehow
related to how the neocortex encodes and pools.