Cortical.io implementation, thoughts and questions

nlp
semantic-folding

#1

I’ve spent a few days with the http://cortical.io white paper and implemented a very naïve retina supporting most of the API functionality that has been described. I’ve also added some simple operations of my own design. I’ve tested my retina with different types of texts and it looks promising.

I’ve ended up with two central questions: snippet distribution in the semantic space and the notion of “context” as described in the white paper.

When it comes to snippet distribution, I claim that it is more important to utilize as much of the space as possible to represent as much semantic information as possible rather than grouping semantically similar snippets close to each other. To this end, I’m using a hashing function to spread snippets in the space and make no attempts to inspect already added snippets.
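A minimal sketch of hash-based placement as described above, not the actual implementation. The 128 x 128 space is an assumption inferred from the session output later in this post (436 / 0.0266113 ≈ 16,384) and the 16K bits mentioned in the thread. The choice of SHA-256, the whitespace tokenizer, and the names `snippet_index` and `build_fingerprints` are all hypothetical.

```python
import hashlib

SPACE_SIZE = 128 * 128  # 16,384 positions; an assumed size for the semantic space


def snippet_index(snippet: str, space_size: int = SPACE_SIZE) -> int:
    """Map a snippet to a position using a stable hash of its text."""
    digest = hashlib.sha256(snippet.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % space_size


def build_fingerprints(snippets, tokenize=str.split):
    """Return {token: set of positions of the snippets that contain it}."""
    fingerprints = {}
    for snippet in snippets:
        index = snippet_index(snippet)  # no inspection of earlier snippets
        for token in tokenize(snippet.lower()):
            fingerprints.setdefault(token, set()).add(index)
    return fingerprints
```

Because the position depends only on the snippet text, two different snippets can collide on the same position; with a few hundred snippets in 16,384 positions this should be rare.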

As I cannot compare my implementation with the reference implementation, since it’s proprietary as far as I know, I do not have any objective measure of how well my retina works. To help with this, I’m wondering if anyone here can suggest a well-documented algorithm with known semantic properties for populating the semantic space that could be used for comparison.

(EDIT: Added clarification for my question.)

My question about context is much more basic: what is actually claimed to be performed through the context endpoint in the API? The documentation is very vague and the wording is weak when it comes to the claims being made.

The example being used is “apple” appearing in three different contexts: “software”, “fruit” and “desktop”. I cannot find any description of how these contexts are deduced, or of how choosing one of these contexts to further specify what should be returned in future semantic searches affects the algorithm.

Am I just blind or is this also part of the proprietary technology? Has anyone any ideas as to how these contexts are deduced and how they are used in semantic searches?


As for my own implementation, the following is a description (for anyone curious) of what I’ve implemented and the texts I’ve used for testing.

The basic functionality is the semantic search as described in the white paper: a fingerprint is used to find the best overlaps with other fingerprints, the bits of the best match are removed, and the search and culling are repeated until no bits are left. I’ve added a cutoff where at least three bits must overlap to generate a match. This limit is arbitrary, but it seems to remove coincidental snippet overlaps.
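The search loop described above can be sketched roughly as follows. This is an illustration under assumptions, not the actual code: fingerprints are modeled as Python sets of bit positions, and the names `best_matches` and `semantic_search` are hypothetical.

```python
MIN_OVERLAP = 3  # the arbitrary cutoff: at least three bits must overlap


def best_matches(fingerprint, fingerprints, exclude=(), limit=10):
    """Rank other tokens by how many bits they share with `fingerprint`."""
    scored = [
        (len(fingerprint & other), token)
        for token, other in fingerprints.items()
        if token not in exclude
    ]
    scored = [(n, t) for n, t in scored if n >= MIN_OVERLAP]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best overlap first
    return scored[:limit]


def semantic_search(token, fingerprints):
    """Repeatedly take the best match and remove its bits from the query."""
    remaining = set(fingerprints[token])
    used = {token}
    steps = []
    while remaining:
        matches = best_matches(remaining, fingerprints, exclude=used)
        if not matches:
            break  # nothing overlaps by at least MIN_OVERLAP bits
        overlap, best = matches[0]
        remaining -= fingerprints[best]
        used.add(best)
        steps.append((best, overlap, len(remaining)))
    return steps
```

Each returned step holds the matched token, its overlap with the remaining bits, and the number of bits left after culling.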

Every snippet contains at least two sentences and sometimes three.

Handling of stop words has been added so that semantically interesting matches are not pushed down by words such as “the” and “a”.

It is possible to list all snippets that are connected to a fingerprint and to generate a frequency list for all tokens in these snippets.

It is possible to perform a semantic search where two or more fingerprints are bitwise ORed or bitwise ANDed.
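With fingerprints modeled as sets of bit positions (an assumption made for illustration), combining them is a union or an intersection. The helper name `combine` is hypothetical.

```python
from functools import reduce


def combine(fingerprints, mode="or"):
    """Combine fingerprints (sets of bit positions) with bitwise OR or AND."""
    if not fingerprints:
        return set()
    if mode == "or":
        return set(reduce(set.union, fingerprints))
    if mode == "and":
        return set(reduce(set.intersection, fingerprints))
    raise ValueError(f"unknown mode: {mode!r}")
```

The combined fingerprint can then be used as the query of a semantic search in the same way as a single token's fingerprint.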

I have tested my implementation with the Wikipedia page for Sweden, the book “Laches” by Plato, a public corpus of American English with some 500k words from different domains, a book about Roman archaeology in Britain, “On the Origin of Species” by Charles Darwin and the novella “Little Brother” by Cory Doctorow.

Part of a session with the Wikipedia page on Sweden looks something like this:

Reading snippets
436 snippets
436 mapped indices, 0.0266113 (~3% of the semantic space is used)
4235 tokens
Enter token:
denmark
    denmark (17): (17 bits in the fingerprint)
        swedish (10) (best overlap)
        sweden (8)
        norway (7)
        finland (6)
        scandinavian (4)
        also (4)
        danish (4)
        countries (4)
        south-west (3)
        bridge (3)
      swedish (10) => (7) (the 10 bits are removed from the original fingerprint)
        sweden (4)
        bridge (3)
        countries (3)
      sweden (4) => (3) (no more matches are generated)

(EDIT: Fixed formatting.)


#2

Thanks for the post. I’m not sure if your questions should be directed at the HTM community (who no doubt are interested) or Cortical.IO the company (tagging @cogmission who still works for them I think!).

Feel free to edit your posts.


#3

I don’t have any insights into the official algorithm, but I have also developed my own implementation of semantic folding based on the videos that they posted. My approach is, in addition to generating the word SDRs, to also keep track of their frequency. The frequency can be used to produce “weighted SDRs” (non-binary SDRs, where each bit has a weight rather than just a zero or one value).

Two weighted SDRs can be compared to generate a weighted overlap score, which allows you to suggest other words which have a lot of overlap and are frequently used. The top results can be used as “contexts” when doing other SDR math.
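A rough sketch of the weighted comparison, under assumptions: a weighted SDR is modeled as a dict from bit position to weight, and the overlap score is taken to be the sum of weight products over shared bits. The post does not specify the scoring formula, so this is only one possible choice, and the function names are hypothetical.

```python
def weighted_overlap(a, b):
    """Sum of weight products over the bits active in both weighted SDRs."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller SDR
    return sum(weight * b[bit] for bit, weight in a.items() if bit in b)


def suggest_contexts(word, weighted_sdrs, top_n=5):
    """Return the words whose weighted SDRs overlap most with `word`."""
    query = weighted_sdrs[word]
    scores = [
        (weighted_overlap(query, sdr), other)
        for other, sdr in weighted_sdrs.items()
        if other != word
    ]
    scores = [(score, other) for score, other in scores if score > 0]
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scores[:top_n]]
```

The top results are the candidates that could serve as “contexts” in later SDR math.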

One other important point is that the words most frequently used in a language tend to add the least uniqueness to the context. This is a result of Zipf’s law. This property is very useful to keep in mind when designing “word math” algorithms, since it means that for many tasks you can throw out a large percentage of results and focus on a smaller subset. I bring this up because you will find that it is relevant to the “contexts” process I described above.
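One way to apply this, sketched under assumptions: rank the vocabulary by corpus frequency and discard candidates that fall in the most frequent share. The `discard_fraction` parameter and both function names are hypothetical, and a suitable fraction would have to be found by experiment.

```python
from collections import Counter


def frequency_ranks(tokens):
    """Map each token to its rank (0 = most frequent) in the corpus."""
    counts = Counter(tokens)
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    return {token: rank for rank, token in enumerate(ordered)}


def drop_generic(candidates, ranks, discard_fraction=0.2):
    """Discard candidates that are among the most frequent vocabulary items."""
    cutoff = int(len(ranks) * discard_fraction)
    # Tokens missing from `ranks` are treated as rare and therefore kept.
    return [c for c in candidates if ranks.get(c, len(ranks)) >= cutoff]
```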

Wouldn’t the total number of contexts (i.e. bits in the SDR) be the same, or does your hashing function have some scaling property? (EDIT: never mind, I get what you mean. You are essentially using a random distribution to map points into a smaller space, and relying on the property of SDRs that a lot of random overlap is virtually impossible.) Personally, I have found that the main advantage of positioning semantically similar snippets close to each other is that you can run a simple scaling algorithm on the massive original SDRs to produce much smaller working SDRs, so that points close to each other in the large SDR (which are semantically similar) map to the same point in the smaller SDR.
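A minimal sketch of such a scaling step, assuming the SDR is laid out row by row on a square grid and that neighbouring positions are semantically similar. The 128 x 128 grid, the factor of 8, and the name `downscale` are illustrative assumptions; `side` must be divisible by `factor`.

```python
def downscale(fingerprint, side=128, factor=8):
    """Map positions on a side x side grid onto a (side // factor) grid."""
    small_side = side // factor
    scaled = set()
    for position in fingerprint:
        row, col = divmod(position, side)
        # All positions in the same factor x factor block share one small position.
        scaled.add((row // factor) * small_side + (col // factor))
    return scaled
```

With these defaults a 16,384-bit SDR shrinks to 256 bits, and any on bits within the same 8 x 8 block collapse into one.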


#4

I suspect they are building a SOM (Self Organizing Map) as the SDRs are added. If you have some ranking criteria then the grouping could be formed automatically.

I have wondered myself what the axes of the map are and how closeness is judged when building this self-organizing map.


#5

One other point on topology. If you look at a typical word SDR where semantically similar snippets are close to each other, you see islands of clustered “on” bits. You can use various ML techniques to separate these islands and then do SDR overlap comparisons to suggest best matches for each of them. The results from this operation could also be used as “contexts”. I haven’t done this specifically myself, but it seems like another strategy worth exploring (and would be less computationally expensive than the one I described above).
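As a simple stand-in for the ML techniques mentioned, islands can be separated with a plain flood fill over 4-connected neighbours. This is a sketch under assumptions: the SDR is a set of positions laid out row by row on a square grid, and the name `find_islands` is hypothetical.

```python
from collections import deque


def find_islands(fingerprint, side=128):
    """Group on bits into 4-connected islands on a side x side grid."""
    unvisited = set(fingerprint)
    islands = []
    while unvisited:
        start = unvisited.pop()
        island = {start}
        queue = deque([start])
        while queue:
            row, col = divmod(queue.popleft(), side)
            for r, c in ((row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)):
                neighbour = r * side + c
                # The bounds check prevents wrapping from one row into the next.
                if 0 <= r < side and 0 <= c < side and neighbour in unvisited:
                    unvisited.remove(neighbour)
                    island.add(neighbour)
                    queue.append(neighbour)
        islands.append(island)
    return sorted(islands, key=len, reverse=True)  # largest island first
```

Each island can then be used as a fingerprint of its own in an overlap comparison.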


#6

This is an interesting approach. It is not clear to me whether you count how many times a token appears in the corpus or how many times a specific bit in the fingerprint is set, so I’ll assume that you mean the number of appearances in the corpus (as that makes the most sense to me). Please correct me if I’m wrong.

My initial response is that any observed frequency has a lower bound equal to the number of set bits in the fingerprint. As such, it looks like it doesn’t add much over just counting the number of set bits.

I have now done some quick tests. Token frequency by itself changes the results a bit, but tokens that are not stop words yet are still used often typically float to the top and are too wide in reach to add any value when it comes to context.

To handle this I changed the sorting to instead look at token frequency scaled by overlap. As in

overlap = (originalFingerprint & currentToken.fingerprint).count()
scaledFrequency = currentToken.frequency / overlap

This gives, as far as I can see, better results but popular tokens still affect the results a lot.
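A runnable Python rendering of the two lines above, with fingerprints modeled as sets of bit positions (an assumption). It adds a guard for tokens with no overlap, which would otherwise cause a division by zero. The ascending sort order is also an assumption, since the post does not say which direction was used.

```python
def scaled_frequency(original, token_fingerprint, token_frequency):
    """Token frequency divided by its bit overlap with the original fingerprint."""
    overlap = len(original & token_fingerprint)
    if overlap == 0:
        return None  # no shared bits: the ratio is undefined
    return token_frequency / overlap


def rank_by_scaled_frequency(original, tokens):
    """`tokens` maps token -> (fingerprint, frequency); returns sorted pairs."""
    scored = []
    for token, (fingerprint, frequency) in tokens.items():
        score = scaled_frequency(original, fingerprint, frequency)
        if score is not None:
            scored.append((score, token))
    return sorted(scored)  # ascending order is an assumption
```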

I have no math to back up the following, but my intuition tells me that subsampling the original 16K bits should generate an equally useful result without the semantic topology, on account of randomized proximal synapses and global inhibition.

I haven’t done any predictive tests with an HTM network yet but my plan is to test both with 16K mini-columns with no subsampling and something around 100 mini-columns with subsampling. Measuring the anomaly during these tests might give some answers.


#7

That looks like an interesting rabbit hole to go tumbling down. Do you have any special resource to recommend?


#8

I think where your approach may be limited (though it may not matter, depending on your use case) is if you want to support online continuous learning rather than compiling all the text snippets up front. The hashing function (if I understand the approach correctly) would take a novel snippet and give it a new random bit in the map. I imagine that if you repeated this continuously for every new text snippet, the word SDRs would increase in density over time until they eventually became less useful.

That said, not having access to their official algorithm, I don’t know whether it would fare any better. My most recent strategy for combining topology and online learning is to assign each new word a random grid cell pattern and a weighted representation (an array of weights, not an SDR). As words stream in, they form an eligibility trace. The grid pattern for the current word is used to modify the weighted representations of the previous words using a logarithmic decay algorithm (such that closer words are impacted more strongly than distant ones in the eligibility trace). The weights become highest in the areas of greatest overlap, thus establishing the topology. A sparsification algorithm can then be used to form a normal SDR from the weighted representation.
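A heavily simplified sketch of that strategy, with several assumptions that are not from the description: the “grid cell pattern” is reduced to a random set of positions, the logarithmic decay is taken to be 1 / log2(distance + 1), the trace holds the last five words, and sparsification keeps the highest-weighted bits. All names and constants are hypothetical.

```python
import math
import random
from collections import defaultdict, deque

SPACE_SIZE = 16_384   # assumed size of the representation
PATTERN_BITS = 40     # assumed number of bits per random pattern
TRACE_LENGTH = 5      # assumed length of the eligibility trace


class OnlineFolder:
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.patterns = {}
        self.weights = defaultdict(lambda: defaultdict(float))
        self.trace = deque(maxlen=TRACE_LENGTH)

    def pattern(self, word):
        """Assign each new word a fixed random pattern on first sight."""
        if word not in self.patterns:
            bits = self.rng.sample(range(SPACE_SIZE), PATTERN_BITS)
            self.patterns[word] = frozenset(bits)
        return self.patterns[word]

    def observe(self, word):
        """Let the current word's pattern reinforce the words in the trace."""
        current = self.pattern(word)
        for distance, previous in enumerate(reversed(self.trace), start=1):
            gain = 1.0 / math.log2(distance + 1)  # closer words change more
            for bit in current:
                self.weights[previous][bit] += gain
        self.trace.append(word)

    def sdr(self, word, active_bits=40):
        """Sparsify: keep the `active_bits` highest-weighted positions."""
        weights = self.weights[word]
        top = sorted(weights, key=lambda b: (-weights[b], b))[:active_bits]
        return set(top)
```

Words are fed one at a time through `observe`, and `sdr` can be called at any point to read out a binary representation, so no up-front pass over the corpus is needed.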

Just to clarify one minor point: in the case of the topology approach, the number of set bits in the scaled map is much smaller than the frequency count. In any case, if you find that the things floating to the top are too generic, you can adjust the percentage of results you are throwing out due to Zipf’s law. It has been some time since I did these particular experiments, so I apologize for the lack of specifics (I’m sure they are doing something more efficient than this anyway, probably based in some way on the islands of activity in the representation).


#9

See if this sparks some crazy ideas!

http://www.ai.univ-paris8.fr/~jmelka/IJCCI_2017_20.pdf