I’ve built multiple versions of HTM for iOS 14.5, with the end goal of improving on the accuracy of Apple’s Natural Language framework in as many languages as possible.
The final class will be a baked-in version (unconnected synapses removed) of an HTM that can only compare SDRs (no learning). Specifically, it should be able to:
- Compare SDRs of text.
- Compare Unions of SDRs of text.
For example, if you compare the definitions of “miso soup” and “chicken soup” many bits should be overlapping. The union of definitions for all soups and the definition of a random soup should also have a significant number of bits overlapping.
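A minimal sketch of the two comparisons, assuming an SDR is stored as the set of its active bit indices. It is written in Python for brevity (the real class would be Swift), and the bit indices are made up for illustration, not produced by an actual HTM.

```python
from typing import Iterable, Set

def overlap(a: Set[int], b: Set[int]) -> int:
    """Number of active bits shared by two SDRs."""
    return len(a & b)

def union(sdrs: Iterable[Set[int]]) -> Set[int]:
    """Union SDR: every bit that is active in at least one input SDR."""
    result: Set[int] = set()
    for sdr in sdrs:
        result |= sdr
    return result

# Made-up active-bit indices, for illustration only.
miso_soup = {3, 17, 42, 88, 130}
chicken_soup = {3, 17, 42, 91, 205}
tomato_soup = {3, 42, 77, 88, 300}
bicycle = {5, 64, 99, 150, 412}

all_soups = union([miso_soup, chicken_soup, tomato_soup])

print(overlap(miso_soup, chicken_soup))  # two soups share bits
print(overlap(all_soups, miso_soup))     # a soup is fully covered by the union
print(overlap(all_soups, bicycle))       # an unrelated term shares none
```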
Apple’s NL framework allows for this, as well as other functionality, via word and sentence embeddings, but the accuracy is very poor and many words are missing.
Surprisingly, it does an amazing job at tokenization and lemmatization of text (somehow even for words that are missing from the embeddings), which I will use to process the text before I feed it to the HTM.
Currently, I have the HTM version that I need (spatial pooler only) and the training data (100,000 Wikipedia articles). I’m looking for ways to make this work. For the final version I will use as many Wikipedia articles as I can find (on the order of millions) for all supported languages.
One option is to use Francisco Webber’s method explained here. To be honest, that’s way beyond my current programming abilities, mainly because I have no prior experience with machine learning algorithms or networks in general. I just can’t wrap my head around exactly how he does it (document vectors and Kohonen maps), even though on the surface it looks and sounds very simple.
Alternatively, I can use Apple’s NL framework to extract spatial aspects of text (nouns and adjectives) and use the spatial pooler only on those. It makes sense since TM, grid cells and displacement cells aren’t implemented.
Using Apple’s NLP:
Tokenize articles into sentences, and sentences into words. Then add the title of the article to each sentence if it is missing. For example: “title: Socrates”, “extract: Socrates was a Greek philosopher. He was from Athens.”
Sentence1 = [“socrates”, “was”, “a”, “greek”, “philosopher”]
Sentence2 = [“socrates”, “he”, “was”, “from”, “athens”]
This connects sub-objects like philosopher and athens to the main object socrates.
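A rough sketch of the "add the title if missing" step. The regex-based `split_sentences` and `tokenize` helpers are naive stand-ins (assumptions for illustration) for the tokenization that Apple’s NL framework would do; only the title-prepending logic is the point here.

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Naive word tokenizer: lowercase runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", sentence.lower())

def sentences_with_title(title, extract):
    """Tokenize each sentence and prepend the title tokens when they are absent."""
    title_tokens = tokenize(title)
    result = []
    for sentence in split_sentences(extract):
        tokens = tokenize(sentence)
        if not all(t in tokens for t in title_tokens):
            tokens = title_tokens + tokens
        result.append(tokens)
    return result

for tokens in sentences_with_title(
    "Socrates", "Socrates was a Greek philosopher. He was from Athens."
):
    print(tokens)
```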
Using Apple’s NLP:
For each word use its lemma, or stem form (“apples” to “apple”, “was” to “be”, etc.).
Sentence1 = [“socrates”, “be”, “a”, “greek”, “philosopher”]
Sentence2 = [“socrates”, “I”, “be”, “from”, “athens”]
This solves the problem of having plural forms of nouns like philosophers.
Using Apple’s NLP:
Use only nouns and adjectives.
Sentence1 = [“socrates”, “greek”, “philosopher”]
Sentence2 = [“socrates”, “athens”]
This removes inconsistencies from having to capture temporal meaning with neural structures that can’t encode temporal meaning.
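The lemma and part-of-speech steps above can be sketched as a simple filter. This assumes the tagger has already produced (surface form, lemma, tag) triples; the tag labels `"noun"`, `"adjective"` and so on are hypothetical placeholders, not the framework’s actual tag values.

```python
# Hypothetical tag labels; a real tagger will use its own names.
KEEP_TAGS = {"noun", "adjective"}

def spatial_words(tagged_tokens):
    """Return the lemmas of tokens tagged as nouns or adjectives, in order."""
    return [lemma for _surface, lemma, tag in tagged_tokens if tag in KEEP_TAGS]

sentence1 = [
    ("socrates", "socrates", "noun"),
    ("was", "be", "verb"),
    ("a", "a", "determiner"),
    ("greek", "greek", "adjective"),
    ("philosopher", "philosopher", "noun"),
]

print(spatial_words(sentence1))  # ['socrates', 'greek', 'philosopher']
```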
Create a sensor for all the nouns and adjectives found in those articles. Probably over 100,000 words.
For a single word as input the output is an extremely sparse representation of 100,000 bits with only 1 active bit.
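One way to picture this sensor is as a one-hot encoder: every distinct word gets its own input bit. The sketch below assumes bit indices are handed out in order of first appearance, and that an unknown word yields an empty SDR; both are illustrative choices, not requirements.

```python
def build_vocabulary(sentences):
    """Assign each distinct word one input bit, in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for word in sentence:
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def encode_word(word, vocab):
    """One-hot SDR for a word: a set with exactly one active bit (empty if unknown)."""
    return {vocab[word]} if word in vocab else set()

sentences = [["socrates", "greek", "philosopher"], ["socrates", "athens"]]
vocab = build_vocabulary(sentences)

print(len(vocab))                    # size of the input space
print(encode_word("athens", vocab))  # a single active bit
```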
Create an HTM with 65,536 (256x256) columns and potentialPercent=0.1 (receptive field of 10,000 bits for each column).
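To make the receptive-field arithmetic concrete: with 100,000 input bits and potentialPercent=0.1, each column can potentially connect to 0.1 × 100,000 = 10,000 bits. The sketch below assumes no topology, so each column samples its pool uniformly from the whole input; `build_potential_pools` is a hypothetical helper, and the sizes are scaled down so it runs instantly.

```python
import random

def build_potential_pools(num_columns, input_size, potential_percent, seed=0):
    """Give each column a random subset of input bits it may connect to."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    pool_size = int(round(input_size * potential_percent))
    return [
        frozenset(rng.sample(range(input_size), pool_size))
        for _ in range(num_columns)
    ]

# Scaled-down example: 64 columns over 1,000 input bits.
pools = build_potential_pools(num_columns=64, input_size=1000, potential_percent=0.1)

print(len(pools), len(pools[0]))
print(int(round(100_000 * 0.1)))  # pool size at the full scale described above
```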
Create an SDR for each sentence and learn it.
Essentially, an HTM is a compression algorithm.
A single word may invoke a few hundred columns.
By going backwards I can get all the sensory bits (words) that those columns connect to.
Then by feeding all those words to the HTM again I should be able to get a much broader representation of related terms?
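A toy sketch of this forward, backward, forward idea. It assumes the trained HTM has been reduced to a map from each column to the input bits it has connected synapses on, and it activates columns with a plain overlap threshold rather than the spatial pooler’s k-winners inhibition, which is a simplification. The `connections` table is invented for illustration.

```python
def forward(input_bits, connections, threshold=1):
    """Columns whose connected synapses overlap the input by at least `threshold` bits."""
    return {
        col for col, bits in connections.items()
        if len(bits & input_bits) >= threshold
    }

def backward(active_columns, connections):
    """All input bits (words) that the given columns are connected to."""
    result = set()
    for col in active_columns:
        result |= connections[col]
    return result

def double_passthrough(input_bits, connections, threshold=1):
    """Forward pass, project back to words, then forward again on those words."""
    first = forward(input_bits, connections, threshold)
    related_words = backward(first, connections)
    return forward(related_words, connections, threshold)

# Invented toy network. Input bits: 0=socrates, 1=greek, 2=philosopher, 3=athens, 4=plato
connections = {
    0: {0, 2},  # column 0 is connected to socrates and philosopher
    1: {2, 4},  # column 1 is connected to philosopher and plato
    2: {1, 3},  # column 2 is connected to greek and athens
    3: {4},     # column 3 is connected to plato
}

print(forward({0}, connections))             # columns for "socrates" alone
print(double_passthrough({0}, connections))  # broader set after the second pass
```

With a low threshold the second pass can spread quickly, so in practice the number of winning columns would probably need to be capped to keep the result sparse.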
When comparing two pieces of text I will first pass them through Apple’s NLP to filter out any unwanted text and keep only the stem form of nouns and adjectives.
Then, I will create one SDR for each piece of text which will be the union of SDRs of all words inside that piece of text.
Finally, I will feed both to the HTM and compare active bits, or perform a double passthrough (going backwards from the active columns to their words and feeding those words in again, as described above) and then compare bits.
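The comparison itself could be wired together roughly like this. The `transform` parameter is a placeholder for the HTM pass (single or double) and defaults to the identity so the sketch runs on its own; normalizing the overlap by the smaller SDR is an assumption on my part, not something the plan above specifies.

```python
def text_to_sdr(words, vocab):
    """Union of the one-hot SDRs of every known word in the text."""
    return {vocab[w] for w in words if w in vocab}

def similarity(a, b):
    """Shared active bits as a fraction of the smaller SDR (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def compare_texts(words_a, words_b, vocab, transform=lambda sdr: sdr):
    """Encode both texts, pass them through `transform`, and score the overlap."""
    sdr_a = transform(text_to_sdr(words_a, vocab))
    sdr_b = transform(text_to_sdr(words_b, vocab))
    return similarity(sdr_a, sdr_b)

# Hypothetical vocabulary: word -> input bit index.
vocab = {"socrates": 0, "greek": 1, "philosopher": 2, "athens": 3, "plato": 4}

score = compare_texts(
    ["socrates", "greek", "philosopher"],
    ["plato", "greek", "philosopher"],
    vocab,
)
print(round(score, 3))
```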
What do you think?
EDIT: I’ll be using only nouns without the stem form of words.