SimHash Document Encoder now live in HTM.core (C++, Python)

The SimHash Document Encoder is now live in HTM.core, written in C++ with Python bindings. It provides simple, immediate encoding of text for use with HTM, and may be of interest to Natural Language Processing (NLP), search, and HTM engineers.

The SimHash Document Encoder converts text-based documents into SDR encodings. Similar documents result in similar encodings, while dissimilar documents result in differing encodings. “Similarity” here refers to bitwise similarity (small Hamming distance, high overlap), not semantic similarity (encodings for “apple” and “computer” will have no relation here).
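To make "bitwise similarity" concrete, here is a minimal pure-Python sketch of the two measures mentioned above. It does not use HTM.core at all: the toy 12-bit encodings and the helper names `overlap` and `hamming_distance` are made up for illustration.

```python
def overlap(a, b):
    """Count positions where both encodings have an active (1) bit."""
    return sum(x & y for x, y in zip(a, b))


def hamming_distance(a, b):
    """Count positions where the two encodings differ."""
    return sum(x ^ y for x, y in zip(a, b))


# Three toy 12-bit encodings with 4 active bits each (illustrative only).
doc_a = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
doc_b = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0]  # shares 3 active bits with doc_a
doc_c = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]  # shares no active bits with doc_a

print(overlap(doc_a, doc_b), hamming_distance(doc_a, doc_b))  # high overlap, small distance
print(overlap(doc_a, doc_c), hamming_distance(doc_a, doc_c))  # no overlap, large distance
```

When every encoding has the same number of active bits, the two measures carry the same information: more overlap always means a smaller Hamming distance.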

Usage

A wide selection of helpful parameters can be passed to the encoder, including options for setting token case sensitivity, vocabulary, weightings, exclusions, frequency ceiling/flooring, orphan handling, and character similarity sensitivity. The documentation in the header file has more details.

C++

The following is a usage example in C++:

#include <htm/types/Sdr.hpp>
#include <htm/encoders/SimHashDocumentEncoder.hpp>

using namespace htm;  // SDR and the encoder classes live in the htm namespace

SimHashDocumentEncoderParameters params;
params.size = 400u;
params.activeBits = 21u;

SDR output({ params.size });
SimHashDocumentEncoder encoder(params);

encoder.encode({ "bravo", "delta", "echo" }, output);
encoder.encode("bravo delta echo", output);  // same

The C++ Unit Tests provide more usage examples.

Python

The following is a usage example in Python:

from htm.bindings.encoders import (
  SimHashDocumentEncoder,
  SimHashDocumentEncoderParameters,
)

params = SimHashDocumentEncoderParameters()
params.size = 400
params.activeBits = 21

encoder = SimHashDocumentEncoder(params)

other = encoder.encode([ "bravo", "delta", "echo" ])
other = encoder.encode("bravo delta echo")  # same

The Python Unit Tests provide more usage information.

Python Example Runner

An example of the encoder in action is provided in Python. It generates many random documents, then finds the most and least similar among them. It also generates a visual chart of encoding space usage.

For help getting started:

python \
  -m htm.examples.encoders.simhash_document_encoder \
  --help

To run a simple example:

python \
  -m htm.examples.encoders.simhash_document_encoder \
  --size 400 \
  --activeBits 150
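The "most/least similar" search that the example runner reports can be pictured as a brute-force comparison of every pair of encodings. The following is only a sketch of that idea, not the runner's actual code: encodings are modelled as plain Python sets of active bit indices, and the function name `most_and_least_similar` and the toy data are hypothetical.

```python
from itertools import combinations


def overlap(a, b):
    """Number of active bit indices shared by two encodings."""
    return len(a & b)


def most_and_least_similar(encodings):
    """Return the pair of names with the highest and the lowest overlap."""
    pairs = list(combinations(sorted(encodings), 2))

    def score(pair):
        return overlap(encodings[pair[0]], encodings[pair[1]])

    return max(pairs, key=score), min(pairs, key=score)


# Toy encodings as sets of active bit indices (illustrative only).
encodings = {
    "doc1": {0, 3, 7, 9},
    "doc2": {0, 3, 7, 12},
    "doc3": {1, 4, 8, 12},
}

most, least = most_and_least_similar(encodings)
print("most similar: ", most)
print("least similar:", least)
```

Comparing all pairs grows quadratically with the number of documents, which is fine for a demo but is the cost that locality-sensitive hashing schemes are usually paired with indexing tricks to avoid.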

Python Module Help

Helpful documentation on encoder parameters and usage is available in Python module form:

python
>>> import htm.bindings.encoders
>>> help(htm.bindings.encoders.SimHashDocumentEncoder)

Learn More

HTM.core

HTM.core is the active HTM Community fork of Numenta’s hibernating NuPIC HTM codebase. Thanks again to @breznak, @dmac, and @david_keeney from the team for their help and support. They’ve got a beautiful codebase going, and are wonderful to work with.

SimHash

SimHash is a Locality-Sensitive Hashing (LSH) algorithm from the world of nearest-neighbor document similarity search. It is used by the Googlebot web crawler to find near-duplicate web pages.

We provide an encoder-specific README file for an in-depth tour of the SimHash algorithm.
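For readers who want a feel for the idea before opening the README, here is a minimal sketch of textbook SimHash in pure Python. It is not the HTM.core implementation: the choice of SHAKE-256, the 64-bit size, and the helper names are assumptions made for illustration. The HTM.core encoder also has to honor `activeBits` to produce a sparse output, so its final step presumably differs from the plain sign threshold used here; the README remains the authority.

```python
import hashlib


def token_bits(token, size):
    """Hash a token to `size` pseudo-random bits (SHAKE-256 is an arbitrary choice)."""
    digest = hashlib.shake_256(token.encode("utf-8")).digest((size + 7) // 8)
    return [(digest[i // 8] >> (i % 8)) & 1 for i in range(size)]


def simhash(tokens, size=64):
    """Textbook SimHash: every token votes +1 or -1 on each bit position."""
    votes = [0] * size
    for token in tokens:
        for i, bit in enumerate(token_bits(token, size)):
            votes[i] += 1 if bit else -1
    # A positive sum sets the output bit; ties and negative sums clear it.
    return [1 if v > 0 else 0 for v in votes]


def hamming_distance(a, b):
    """Count positions where two bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


doc_a = ["bravo", "delta", "echo"]
doc_b = ["bravo", "delta", "foxtrot"]  # shares two tokens with doc_a
doc_c = ["kilo", "lima", "mike"]       # shares none

# Documents sharing tokens typically land closer together in Hamming distance.
print(hamming_distance(simhash(doc_a), simhash(doc_b)))
print(hamming_distance(simhash(doc_a), simhash(doc_c)))
```

Because the votes are summed, token order does not matter, and changing one token out of many only flips the bits where that token's vote was decisive. That is what keeps near-duplicate documents close together.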

Semantic Similarity

For encodings that do support semantic similarity (where encodings for “apple” and “computer” will relate), @cogmission and the Cortical.io team offer their highly recommended Semantic Folding technology.

Great accomplishment, Brev! I am proud of you!