Also: SimHash Document Encoder

Hi all,

I took my new SimHash Scalar Encoder, and was able to easily make a Document Encoder version (link below).

Similar documents will receive similar sparse encodings, and dissimilar documents will receive mostly unrelated ones.

Similarity here means binary distance between strings; there is no linguistic or semantic understanding. You’ll want http://cortical.io for that.

How it works:

  • Take a document, split up the words, and hash each.
  • You can optionally add weights to the word hashes of your document (in order to make big words more important in the output than small words).
  • Combine the hashes into a sparse SimHash for the document
    (same method as the SimHash Scalar Encoder).
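
For anyone who wants to see those steps in code, here is a minimal Python sketch. It is not the linked encoder, just an illustration under a few assumptions of mine: the function names are made up, SHAKE-256 is an arbitrary choice of hash, and sparsity is obtained by keeping the columns with the highest sums (the real encoder may do this differently).

```python
import hashlib

def token_bits(token, size):
    """Hash a token into a list of `size` bits (hash choice is an assumption)."""
    digest = hashlib.shake_256(token.encode("utf-8")).digest((size + 7) // 8)
    value = int.from_bytes(digest, "big")
    return [(value >> i) & 1 for i in range(size)]

def simhash_document(tokens, size=400, active_bits=84, weights=None):
    """Combine per-word hashes into one sparse binary encoding."""
    weights = weights or {}
    sums = [0] * size
    for token in tokens:
        w = weights.get(token, 1)  # optional per-word weight, default 1
        for i, bit in enumerate(token_bits(token, size)):
            # Classic SimHash step: a 1 bit votes +w, a 0 bit votes -w.
            sums[i] += w if bit else -w
    # Assumed sparsification: keep the `active_bits` columns with the
    # highest sums, breaking ties by lower column index.
    top = sorted(range(size), key=lambda i: (-sums[i], i))[:active_bits]
    encoding = [0] * size
    for i in top:
        encoding[i] = 1
    return encoding

def overlap(a, b):
    """Count the active bits two encodings share."""
    return sum(x & y for x, y in zip(a, b))
```

Usage would look something like `overlap(simhash_document(doc_a.split()), simhash_document(doc_b.split()))`, where documents that share most of their words should share more active bits than unrelated ones. Passing e.g. `weights={"fox": 3}` makes that word count three times as much in the column sums.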

Notes:

  • This can be used as a category encoder if each document consists of a single word.
  • This shares a problem with the Coordinate Encoders: there is no Classifier yet (no predictions).

Source:

Next Steps:

  • Put the Python version into Nupic.cpp, along with the SimHash Scalar Encoder.
  • Rewrite those Python encoders in C++ (also in Nupic.cpp).
  • Change each word from a plain hash to a SimHash built from hashes of each letter in the word. This way, near-spellings (“eat” vs. “eats”) will be considered similar, which they currently are not.
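
To make the last item concrete, here is a rough Python sketch of the per-letter idea. It is only my illustration of the plan, not existing code: the names are hypothetical, SHAKE-256 is an arbitrary hash choice, and the output here is a plain dense SimHash (bit is 1 when the column sum is positive) rather than the sparse form.

```python
import hashlib

def char_bits(ch, size):
    """Hash a single letter into `size` bits (hash choice is an assumption)."""
    digest = hashlib.shake_256(ch.encode("utf-8")).digest((size + 7) // 8)
    value = int.from_bytes(digest, "big")
    return [(value >> i) & 1 for i in range(size)]

def word_simhash(word, size=256):
    """Build a word's bits from its letters instead of hashing the whole word."""
    sums = [0] * size
    for ch in word:
        for i, bit in enumerate(char_bits(ch, size)):
            sums[i] += 1 if bit else -1
    # Dense SimHash: a column is on when more letters voted 1 than 0.
    return [1 if s > 0 else 0 for s in sums]

def hamming(a, b):
    """Number of bit positions where two encodings differ."""
    return sum(x != y for x, y in zip(a, b))
```

With this, `hamming(word_simhash("eat"), word_simhash("eats"))` should come out well below the distance between two words with no letters in common. One caveat of this simple version: it ignores letter order, so anagrams such as “eat” and “tea” get identical bits.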

thanks.

@brev thanks for the documentation. I'd like to test the C++ version. When do you plan to release it?

Hi @thanh-binh.to, I hope within the next few weeks.

@brev thanks