I took my new SimHash Scalar Encoder, and was able to easily make a Document Encoder version (link below).
Similar documents will receive similar sparse encodings, and dissimilar documents will receive largely unrelated ones.
Similarity here means binary distance between strings; there is no linguistic or semantic understanding. You’ll want http://cortical.io for that.
How it works:
- Take a document, split up the words, and hash each.
- You can optionally add weights to the word hashes of your document (in order to make big words more important in the output than small words).
- Combine the hashes into a sparse SimHash for the document (same method as the SimHash Scalar Encoder).
- This can be used as a category encoder if each document consists of a single word.
- This shares a problem with the Coordinate Encoders: there is no Classifier yet (no predictions).
- It’s now in the same Pull Request as my Scalar encoder on Old NuPIC.
- My personal research repo
- Article Permalink (this)
To do:
- Put the Python version into Nupic.cpp, along with the SimHash Scalar Encoder.
- Rewrite those Python encoders in C++ (also in Nupic.cpp).
- Change each word from being a hash, to being a simhash created from hashes of each letter in the word. This way, near-spellings will be considered similar (“eat” vs. “eats”), which they currently are not.
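
For anyone who wants to play with the idea, here is a rough Python sketch of the "How it works" steps above, plus the letter-level idea from the final bullet. It is not the code from the Pull Request. The function names (`hash_bits`, `vote_sums`, `letter_simhash_bits`, `encode_document`, `overlap`), the SHAKE-256 hash, the 400-bit size with 40 active bits, and the "keep the highest-voted positions" sparsification step are all my own assumptions for illustration.

```python
import hashlib


def hash_bits(token, size):
    """Hash a string to `size` pseudo-random bits (assumed hash: SHAKE-256)."""
    digest = hashlib.shake_256(token.encode("utf-8")).digest((size + 7) // 8)
    value = int.from_bytes(digest, "big")
    return [(value >> i) & 1 for i in range(size)]


def vote_sums(weighted_bit_rows, size):
    """Per-position totals: +weight for a 1 bit, -weight for a 0 bit."""
    sums = [0.0] * size
    for bits, weight in weighted_bit_rows:
        for i, bit in enumerate(bits):
            sums[i] += weight if bit else -weight
    return sums


def letter_simhash_bits(word, size):
    """Hypothetical variant: a word's bits come from a SimHash of its letters.

    Letter order is ignored here, so anagrams get identical bits.
    """
    sums = vote_sums(((hash_bits(ch, size), 1.0) for ch in word), size)
    return [1 if s > 0 else 0 for s in sums]


def encode_document(text, size=400, active_bits=40, weights=None,
                    word_bits=hash_bits):
    """Split into words, hash each, sum the weighted votes, keep the top positions."""
    weights = weights or {}  # optional per-word weights, default 1.0
    rows = [(word_bits(w, size), weights.get(w, 1.0)) for w in text.split()]
    sums = vote_sums(rows, size)
    # Assumed sparsification: the highest-voted positions become the 1 bits.
    # Ties are broken by index, which biases very short documents toward
    # low-numbered positions.
    top = sorted(range(size), key=lambda i: (-sums[i], i))[:active_bits]
    active = set(top)
    return [1 if i in active else 0 for i in range(size)]


def overlap(a, b):
    """Number of positions where both encodings are active."""
    return sum(x & y for x, y in zip(a, b))


if __name__ == "__main__":
    a = encode_document("the quick brown fox jumps over the lazy dog")
    b = encode_document("the quick brown fox jumps over the lazy cat")
    c = encode_document("sparse distributed representations encode meaning differently")
    print("similar docs overlap:  ", overlap(a, b))
    print("unrelated docs overlap:", overlap(a, c))
```

Passing `word_bits=letter_simhash_bits` to `encode_document` swaps the plain word hash for the letter-level one, so "eat" and "eats" contribute mostly the same bits. Passing something like `weights={"fox": 3.0}` makes that word count three times as much in the vote.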