Also: SimHash Document Encoder

brev · April 24, 2019, 4:47am

Hi all,

I took my new SimHash Scalar Encoder and was able to easily make a Document Encoder version of it (links below).

Similar documents receive similar sparse encodings, and dissimilar documents receive largely unrelated ones.

Similarity here means binary distance between strings; there is no linguistic or semantic understanding. You’ll want http://cortical.io for that.

How it works:

Take a document, split it into words, and hash each word.
Optionally, add weights to the word hashes of your document (to make some words more important in the output than others).
Combine the hashes into a sparse SimHash for the document (same method as the SimHash Scalar Encoder).
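
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the Pull Request: the names encode_document, token_bits and overlap are made up for this example, SHAKE-256 is just a convenient stand-in hash, and the sparsifying step (keep the columns with the largest summed votes) is an assumption about how the combine step works.

```python
import hashlib


def token_bits(token, size):
    """Hash one token to `size` pseudo-random bits (SHAKE-256 is this sketch's choice)."""
    digest = hashlib.shake_256(token.encode("utf-8")).digest((size + 7) // 8)
    return [(digest[i // 8] >> (i % 8)) & 1 for i in range(size)]


def encode_document(text, size=256, active_bits=32, weights=None):
    """Return a list of `size` bits with exactly `active_bits` ones."""
    weights = weights or {}
    sums = [0.0] * size
    for token in text.lower().split():
        weight = weights.get(token, 1.0)  # unweighted tokens count as 1.0
        # Each token votes +weight on columns where its hash bit is 1, -weight elsewhere.
        for i, bit in enumerate(token_bits(token, size)):
            sums[i] += weight if bit else -weight
    # Sparsify (assumed): keep the columns with the largest sums, ties broken by index.
    winners = sorted(range(size), key=lambda i: (-sums[i], i))[:active_bits]
    encoding = [0] * size
    for i in winners:
        encoding[i] = 1
    return encoding


def overlap(a, b):
    """Count the bits that are active in both encodings."""
    return sum(x & y for x, y in zip(a, b))


doc_a = encode_document("the quick brown fox jumps over the lazy dog near the river")
doc_b = encode_document("the quick brown fox jumps over the lazy cat near the river")
doc_c = encode_document("sparse distributed representations encode scalar values as bits")
# doc_a and doc_b differ by one word, so they should share far more active bits
# than doc_a and doc_c, which have no words in common.
print(overlap(doc_a, doc_b), overlap(doc_a, doc_c))
```

Passing something like weights={"fox": 3.0} makes that word's hash count three times as much in the column sums, which is one way to read the optional weighting step.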

Notes:

This can be used as a category encoder if each document consists of a single word.
This shares a problem with the Coordinate Encoders: there is no Classifier yet (no predictions).

Source:

It’s now in the same Pull Request as my Scalar encoder on Old NuPIC.
My personal research repo
Article Permalink (this)

Next Steps:

Put the Python version into Nupic.cpp, along with the SimHash Scalar Encoder.
Rewrite those Python encoders in C++ (also in Nupic.cpp).
Change each word's representation from a plain hash to a SimHash built from hashes of each letter in the word. That way near-spellings (“eat” vs. “eats”) will be considered similar, which they currently are not.
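
To illustrate the last item, here is a rough sketch of a per-word SimHash built from letter hashes. It is only one guess at how this could look: word_simhash, letter_bits and hamming are hypothetical names, SHAKE-256 is an arbitrary hash choice, and the result is a dense sign-bit SimHash rather than a sparse one.

```python
import hashlib


def letter_bits(letter, size):
    """Hash a single letter to `size` pseudo-random bits."""
    digest = hashlib.shake_256(letter.encode("utf-8")).digest((size + 7) // 8)
    return [(digest[i // 8] >> (i % 8)) & 1 for i in range(size)]


def word_simhash(word, size=256):
    """Each letter votes +1/-1 per column; the output bit is 1 where the sum is positive."""
    sums = [0] * size
    for letter in word:
        for i, bit in enumerate(letter_bits(letter, size)):
            sums[i] += 1 if bit else -1
    return [1 if s > 0 else 0 for s in sums]


def hamming(a, b):
    """Number of bit positions where the two hashes differ."""
    return sum(x != y for x, y in zip(a, b))


eat, eats, dog = word_simhash("eat"), word_simhash("eats"), word_simhash("dog")
# "eats" only adds one letter's votes to "eat", so few bits flip;
# "dog" shares no letters with "eat", so about half the bits differ.
print(hamming(eat, eats), hamming(eat, dog))
```

One caveat with this naive version: letter order is ignored, so anagrams such as "eat" and "tea" get identical hashes. Hashing letter pairs or position-tagged letters instead would be one way around that.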

thanks.

thanh-binh.to · April 24, 2019, 11:08am

@brev thanks for the documentation. I would like to test the C++ version. When do you plan to release it?

brev · April 27, 2019, 1:34am

Hi @thanh-binh.to, I hope within the next few weeks.

thanh-binh.to · April 27, 2019, 6:37am

@brev thanks
