NEW: SimHash Distributed Scalar Encoder (SHaDSE) - DEPRECATED

brev · April 19, 2019, 2:25am

Introducing:

SimHash Distributed Scalar Encoder (SHaDSE)

A Locality-Sensitive Hashing approach towards encoding semantic
data into Sparse Distributed Representations, ready to be fed
into an Hierarchical Temporal Memory, like NuPIC by
Numenta. This uses the SimHash algorithm to
accomplish this. LSH and SimHash come from the world of
nearest-neighbor document similarity searching.

This encoder is sibling with the original Scalar Encoder,
and the Random Distributed Scalar Encoder (RDSE). The static
bucketing strategy here is generally lifted straight from the RDSE,
although the “contents” and representations are created differently.

Instead of creating a random hash for our target bucket, we first
generate a SHA-3+SHAKE256 hash of the bucket index (the
SHAKE extension provides a variable-length hash (n)). Using that
hash, and hash values from nearby neighbor buckets (within
bucketRadius), we then create a weighted SimHash for our target
bucket index. This SimHash output will represent both individual bucket
value, and represent the relations between nearby neighbor values in
bucket space. A stable number of On bits (w) is achieved during final
collapsing step of SimHashing.

Scalar Encoder Feature Comparison

	Original Scalar	Random Distributed	SimHash Distributed
Collisions Allowed	No	Yes	Yes
Semantic Continuity	Physical Adjacency	Custom Random Map	SimHashing
Periodic Wrapping	Yes	No	Yes¹
Lookup Tables	No	Yes	No
Topology Supported	Yes	No	No
Required Params	minval maxval	resolution	resolution
Other Params	radius resolution periodic		bucketRadius periodic

¹Encodings for wrapped edge buckets will adapt with bucket growth

Scalar Encoder Performance Comparison

Tests were run against the simple model presented in the
NuPIC Docs Algo Tutorial. The usual “HotGym”
rec-center-hourly.csv data was used (first 3000 rows, as
per tutorial code). The body of that article contains all settings
used for the Spatial Pooler, Temporal Memory, and SDR Classifier.
Any changes to Encoder settings are noted below.

	Original Scalar	Random Distributed	SimHash Distributed
Settings	minval=0 maxval=100	resolution=0.4	resolution=0.25
MAPE	0.327	0.300	0.294
MAE	5.599	5.438	5.954
RMSE	9.257	8.838	10.296
NLL 1k	3.396	2.626	3.368
Time	45.32s	46.06s	65.71s¹
Distance²	2 (stable)	2 (stable)	4-10 (variable)

¹Speed can easily be brought up with internal lookup tables
²Average Hamming distance between two adjacent buckets

How It Works

Step 1 - Input some scalar values:

Time	Input
0	27.7
1	10.2
2	16.3
3	16.4
4	7.6
5	22.9

Step 2 - Map inputs to buckets:

Map input values to bucket indexes, using the
formula from the RDSE.

  bucketIndex = (
    (bucketsWidth / 2) + int(round(
      (input - bucketOffset) / resolution)
    )
  )

Input	Bucket
27.7	5
10.2	2
16.3	3
16.4	3
7.6	1
22.9	4

Step 3 - Hash bucket index

Hash bucket index value of a target bucket (3), and
neighbors (bucketRadius=2).

Bucket	Hash
1	00110101
2	11011010
3	01110111
4	00011011
5	10101100

Step 4 - Convert binary 0’s in hashes to integer -1:

Bucket	Hash Columns
1	-1 -1 +1 +1 -1 +1 -1 +1
2	+1 +1 -1 +1 +1 -1 +1 -1
3	-1 +1 +1 +1 -1 +1 +1 +1
4	-1 -1 -1 +1 +1 -1 +1 +1
5	+1 -1 +1 -1 +1 +1 -1 -1

Step 5 - Weight bucket hashes:

Weight bucket hashes. The target bucket (3) in center of
bucketRadius is heaviest.

Bucket	Weight	Hash Columns
1	1	-1 -1 +1 +1 -1 +1 -1 +1
2	2	+2 +2 -2 +2 +2 -2 +2 -2
3	3	-3 +3 +3 +3 -3 +3 +3 +3
4	2	-2 -2 -2 +2 +2 -2 +2 +2
5	1	+1 -1 +1 -1 +1 +1 -1 -1

Step 6 - Sum weighted binary columns:

Bucket	Hash Column Summations
3	-3 1 1 7 1 1 5 3

Step 7 - Collapse integer sums back to binary:

Collapse the sums back to binary for a final SimHash value for our
target bucket (3).

A regular SimHash will change all sums >= 0 to a binary 1, while
all sums < 0 are changed to a binary 0. This SimHash will usually
result in about 50% sparsity.

We can further sparsify a SimHash for our needs by collapsing down
based on a certain number of the highest sums (here, w=3):

Bucket	Sparse SimHash
3	0 0 0 1 0 0 1 1

This SimHash encoded representation will identify both our specific
target bucket, and the relation of our target bucket to it’s
near neighbors (descending outward with bucketRadius).

Source Code

Pull Request against Old NuPIC with all the final goods.
Research Repo with my test runners and original research code.

Next Steps

Thanks for any feedback.
I would love any Code Reviews, as this is my first real Python
project.
Should I add Python tests? Cap’n’proto Serialization? I wasn’t
sure if that would be wasted work or not.
Create C++ version for Community NuPIC.cpp (I plan on doing this).
Find a way to lower and stabilize the average Hamming distance
between buckets, but guarantee it doesn’t bottom out.
This video seems helpful, but is beyond me.
Adaptive versions to compare against Adaptive Scalar Encoder?
What about other Locality-Sensitive Hashing algorithms
and methods, like MinHash?
Math Readers: Does Locality-Preserving Hashing also
apply? Would it add support for topology?
It seems to me like the SimHash algorithm could
easily be performed biologically, in accordance with our
current theories. I can’t help but wonder if this is somehow
related to how the neocortex encodes and pools.

Learn More

dmac · April 19, 2019, 2:16pm

Hi Brev,

This encoder is interesting! If I understand it correctly, the goal of this encoder is to greatly increase the radius of the encoding, and to have a non-linear semantic similarity, like in this sketch?

Should I add Python tests?

I wrote the unit tests for the RDSE in the community-fork of nupic.cpp. The unit tests check for semantic similarity by running through a range of values and measuring the overlap between consecutive output SDRs. Then it checks the min,mean,max of the overlaps. It does this test twice: once with a small step which results in a high semantic similarity between inputs, and again with a large step which results in no similarity between inputs.
Essentially it makes & checks the above graph, except that it does it many times and it only checks two points on the curve.

C++ version

C++ is not necessary for acceptance into the community fork. Python is acceptable.

Cap’n’proto Serialization … Adaptive Scalar Encoder

I wouldn’t bother with either of these. The community fork version of nupic has already dropped capnproto. The adaptive scalar encoder is obsoleted by the RDSE.

SHA-3+SHAKE256

I would recommend against using a cryptographic hash function. It’s not critical but here are some potential issues with hash functions:

Crypto hashs are intentionally time consuming to compute. It makes it harder to brute force attack them.
Crypto hashs are typically seeded with random numbers, such as the system time. It makes it harder to guess the outputs of the hash (which makes it more secure). However if you save an HTM, reboot your computer, and load the HTM back, then the encoder will be different because the seed which the hash uses has changed.

P.S. Here is my goto reference about the encoding process: https://arxiv.org/pdf/1602.05925.pdf

brev · April 20, 2019, 3:20am

Hi @dmac, thanks for the reply. The goal is really the exact same as the RDSE, it just accomplishes it via a different method. By tuning some of the Encoder parameters, one could maybe accomplish something like you have in your graphic, but it was not a goal.

These are very helpful details, thanks!

Oh, great, I will file a PR as a first step, until the C++ verison is ready.

I was wondering about that! Any ideas how I can generate a variable-length hash digest, if I’m not using SHAKE? I’ll check out the link. Thanks!

marty1885 · April 20, 2019, 4:48pm

Hi,
Very interesting, thanks for posting.
After reading the description of the algorithm. How does SimHash fair against 1D Grid Cell? Maybe I’m wrong but sounds like they have similar properties. How do they differ?

thanh-binh.to · April 21, 2019, 1:27pm

Thanks for sharing your interesting encoder. I am very interested in knowing for which tasks/application the new encoder is better than the existing encoders? Any hint?

rhyolight · April 22, 2019, 4:34pm

Great job @brev! The biggest drawback of the RDSE, IMHO, is it’s lack of periodicity.

rhyolight · April 22, 2019, 9:44pm

dmac · April 23, 2019, 12:55am

The community fork has a copy of MurmurHash3 which works well and is used by the RDSE. See also: MurmurHash - Wikipedia

brev · April 23, 2019, 1:42am

Great question, I was wondering if there might be some overlap with grid cells, but I have no idea myself. thanks.

brev · April 23, 2019, 1:43am

So far, I don’t know that it will perform better/worse - probably about the same. It’s more about alternate ideas and methods of accomplishing the encoding. thanks.

brev · April 23, 2019, 1:51am

Hmm, I’m only seeing 32, 64, 128, and 160-bit output sizes available. An encoding is usually 400-bits wide, so I need the variable length output somehow.

brev · April 23, 2019, 2:22am

Hi @rhyolight, thanks for taking a look!

Reminds of the Coordinate Encoder

Yes, I’ll be thinking about that (and hex grid cells) more. There may be more interesting things in that direction.

Lookup tables needed for decoding

I was excited that this new encoder could do it’s job without saving any state in memory. But adding lookup tables could be helpful for reconstructive decoding work, and speeding it up.

Hamming distance changes with resolution?

Yes, and with changes to bucketRadius, also.

With the Scalar and RDSE encoders, the hamming distance between buckets is perfectly and manually crafted.

With SimHashing, the hamming distance between buckets is more rough, coarse, and random, because it’s based on the statistical distribution of the input hashes. But, even though it’s not as pretty, it will generally approach the same state as the other encoders.

It’s perhaps more like growing an encoding, than building one.

brev · April 24, 2019, 5:01am

I forgot to mention: With this SimHash Scalar encoder, if you set bucketRadius=0, it can also act like a category encoder.

And, I created a separate Document encoder with this same SimHash method. More here:

thanh-binh.to · May 13, 2019, 6:38pm

@brev ís your c++ version available for testing? Thanks

brev · July 10, 2019, 3:57am

Hi all,

It turns out I missed some important historical context when it comes to Scalar Encoders, and just wanted to close the loop:

In Section 3.2 of Encoding Data for HTM Systems, @scott lists a method of hash-based scalar encoding, accomplished by simply hashing the indexes of a simple scalar encoding. I was so focused on content hashing when I read it that I missed the focus on index hashing, and missed the whole point. It’s a simple and beautiful solution (probably ideal for scalars without min/max).

Luckily, @dmac was paying attention, and he used this algorithm when building the random scalar encoder that is already working in in htm.core. This has all the same benefits of my SimHash Scalar Encoder, but is more simple, quick, and ideal. (Please note that this algo is different from the RDSE of Old NuPIC, which has no hashing).

So, my SimHash Scalar Encoder is deprecated. Just use what is already in htm.core.

I still plan on building my SimHash Document Encoder for htm.core in the near future, as it still seems useful. I’ll update in that thread when I have something to show (cc @thanh-binh.to).

Also, just today I learned of the Sampling Linear Scalar Encoder by @mrcslws, which seems to be a similarly ideal solution for encoding scalars which do have min/max requirements.

Topic		Replies	Views
Looking for some assistance with hashing functions Implementations question	2	506	May 30, 2020
Also: SimHash Document Encoder NuPIC	3	727	April 27, 2019
ANN : Scalar Matrix-Hashing ?encoder? Engineering	8	1432	April 19, 2019
RDSE implementation is wasting buckets? Numenta Theory	11	2658	July 10, 2019
SimHash Document Encoder now live in HTM.core (C++, Python) NuPIC Community Fork	1	797	October 21, 2019