The SimHash Document Encoder is now live in HTM.core as C++ with Python bindings. It provides simple, immediate encoding of text for use with HTM. This may be of interest to Natural Language Processing (NLP), Search, or HTM engineers.
The SimHash Document Encoder converts text-based documents into SDR encodings. Similar documents will result in similar encodings, while dissimilar documents will have differing encodings. “Similarity” here refers to bitwise similarity (small Hamming distance, high overlap), not semantic similarity (encodings for “apple” and “computer” will have no relation here).
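To make "bitwise similarity" concrete, here is a minimal sketch in plain Python. It does not use the HTM.core API; the helper names `overlap` and `hamming_distance` and the toy 12-bit vectors are assumptions made up for illustration.

```python
def overlap(a, b):
    """Count positions where both binary vectors have an active (1) bit."""
    return sum(x & y for x, y in zip(a, b))


def hamming_distance(a, b):
    """Count positions where the two binary vectors differ."""
    return sum(x ^ y for x, y in zip(a, b))


# Hypothetical 12-bit encodings with 4 active bits each.
doc_a = [1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
doc_b = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0]  # shares 3 active bits with doc_a
doc_c = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1]  # shares none with doc_a

print(overlap(doc_a, doc_b), hamming_distance(doc_a, doc_b))  # 3 2
print(overlap(doc_a, doc_c), hamming_distance(doc_a, doc_c))  # 0 8
```

High overlap and small Hamming distance go together here: `doc_b` counts as "similar" to `doc_a`, `doc_c` does not, and neither measure says anything about meaning.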
Usage
A wide selection of helpful parameters can be passed to the encoder, including options for setting token case sensitivity, vocabulary, weightings, exclusions, frequency ceiling/flooring, orphan handling, and character similarity sensitivity. The documentation in the header file has more details.
C++
The following is a usage example in C++:
#include <htm/types/Sdr.hpp>
#include <htm/encoders/SimHashDocumentEncoder.hpp>

using namespace htm;

// Configure the output width and the number of active bits.
SimHashDocumentEncoderParameters params;
params.size = 400u;
params.activeBits = 21u;

SDR output({ params.size });
SimHashDocumentEncoder encoder(params);

// A document can be passed as a list of tokens or as a single string.
encoder.encode({ "bravo", "delta", "echo" }, output);
encoder.encode("bravo delta echo", output); // same encoding as above
The C++ Unit Tests provide more usage examples.
Python
The following is a usage example in Python:
from htm.bindings.encoders import (
    SimHashDocumentEncoder,
    SimHashDocumentEncoderParameters,
)

params = SimHashDocumentEncoderParameters()
params.size = 400
params.activeBits = 21

encoder = SimHashDocumentEncoder(params)

# A document can be passed as a list of tokens or as a single string.
output = encoder.encode([ "bravo", "delta", "echo" ])
other = encoder.encode("bravo delta echo")  # same encoding as output
The Python Unit Tests provide more usage information.
Python Example Runner
An example of the encoder in action is provided in Python. It will generate many random documents, and find the most/least similar. It will also generate a visual chart of encoding space usage.
For help getting started:
python \
-m htm.examples.encoders.simhash_document_encoder \
--help
To run a simple example:
python \
-m htm.examples.encoders.simhash_document_encoder \
--size 400 \
--activeBits 150
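The runner's own source is the reference for what it actually does. As a rough, HTM-free illustration of the "most/least similar" step only, the sketch below draws random sets of active bit indices as stand-ins for encodings (not real encoder output) and scans all pairs by overlap. The function names and the use of `random.sample` are assumptions for illustration.

```python
import itertools
import random


def random_encoding(size, active_bits, rng):
    """Stand-in for an encoding: a set of `active_bits` distinct indices below `size`."""
    return frozenset(rng.sample(range(size), active_bits))


def pair_overlap(encodings, pair):
    """Number of active bits shared by the two encodings in `pair`."""
    i, j = pair
    return len(encodings[i] & encodings[j])


def most_and_least_similar(encodings):
    """Return the (i, j) index pairs with the highest and lowest overlap."""
    pairs = list(itertools.combinations(range(len(encodings)), 2))
    most = max(pairs, key=lambda p: pair_overlap(encodings, p))
    least = min(pairs, key=lambda p: pair_overlap(encodings, p))
    return most, least


rng = random.Random(42)  # fixed seed so repeated runs give the same pairs
encodings = [random_encoding(400, 21, rng) for _ in range(50)]
most, least = most_and_least_similar(encodings)

print("most similar pair:", most, "overlap:", pair_overlap(encodings, most))
print("least similar pair:", least, "overlap:", pair_overlap(encodings, least))
```

An exhaustive pairwise scan like this is quadratic in the number of documents, which is fine for a demonstration but not for large collections.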
Python Module Help
Helpful documentation on encoder parameters and usage is available in Python module form:
python
>>> import htm.bindings.encoders
>>> help(htm.bindings.encoders.SimHashDocumentEncoder)
Learn More
HTM.core
HTM.core is the active HTM Community fork of Numenta’s hibernating NuPIC HTM codebase. Thanks again to @breznak, @dmac, and @david_keeney from the team for their help and support. They’ve got a beautiful codebase going and are wonderful to work with.
SimHash
SimHash is a Locality-Sensitive Hashing (LSH) algorithm from the world of nearest-neighbor document similarity search. It is used by the GoogleBot Web Crawler to find near-duplicate web pages.
We provide an encoder-specific README file for an in-depth tour of the SimHash algorithm.
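For orientation, the classic SimHash idea can be sketched in a few lines of plain Python: hash every token to a fixed number of bits, let each token vote +1 or -1 on every bit position, and keep the sign of each column sum. This is a generic textbook sketch, not the HTM.core implementation. The choice of SHAKE-256 as the token hash, the 64-bit width, and all function names are assumptions. It also produces a dense fingerprint and does not attempt to reproduce the encoder's fixed number of active bits (`activeBits`).

```python
import hashlib


def token_hash_bits(token, size):
    """Hash a token to a list of `size` bits using SHAKE-256 (variable-length digest)."""
    digest = hashlib.shake_256(token.encode("utf-8")).digest((size + 7) // 8)
    bits = []
    for byte in digest:
        for shift in range(8):
            bits.append((byte >> shift) & 1)
    return bits[:size]


def simhash(tokens, size=64):
    """Classic SimHash: each token votes +1/-1 per bit; keep the sign of each sum."""
    sums = [0] * size
    for token in tokens:
        for i, bit in enumerate(token_hash_bits(token, size)):
            sums[i] += 1 if bit else -1
    return [1 if s > 0 else 0 for s in sums]


def hamming_distance(a, b):
    """Count positions where the two fingerprints differ."""
    return sum(x != y for x, y in zip(a, b))


base = simhash(["alpha", "bravo", "delta", "echo", "foxtrot"])
near = simhash(["alpha", "bravo", "delta", "echo", "golf"])    # one token changed
far = simhash(["hotel", "india", "juliet", "kilo", "lima"])    # no tokens shared

print("base vs near:", hamming_distance(base, near))
print("base vs far: ", hamming_distance(base, far))
```

Because most votes are unchanged when a single token is swapped, the "near" fingerprint will typically differ from the base in far fewer bits than the unrelated "far" one. Token order does not matter in this sketch, since the votes are simply summed.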
Semantic Similarity
For encodings that do support semantic similarity (encodings for “apple” and “computer” will relate), @cogmission and the Cortical.io team offer their highly recommended Semantic Folding technology.
Other Links
- Previous SimHash encoder research discussions:
- Original Article (permalink)