I’ve spent a few days with the http://cortical.io white paper and implemented a very naïve retina that supports most of the API functionality described there. I’ve also added some simple operations of my own design. I’ve tested my retina with several types of text, and the results look promising.
I’ve ended up with two central questions: how snippets should be distributed in the semantic space, and what the notion of “context” described in the white paper actually means.
When it comes to snippet distribution, I claim that utilizing as much of the space as possible, so as to represent as much semantic information as possible, matters more than grouping semantically similar snippets close to each other. To this end, I use a hashing function to spread snippets across the space and make no attempt to inspect snippets that have already been added.
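A minimal Python sketch of the idea (the names are mine, collision handling is omitted, and the 16384-position space is an inference from the 436 / 0.0266113 figures in the session output below):

```python
import hashlib

SPACE_SIZE = 128 * 128  # 16384 positions; inferred from 436 / 0.0266113 below

def snippet_index(snippet: str) -> int:
    """Map a snippet to a position in the semantic space by hashing.

    No attempt is made to place semantically similar snippets near each
    other; the hash simply spreads snippets uniformly over the space.
    Index collisions would need handling (e.g. probing), omitted here.
    """
    digest = hashlib.sha256(snippet.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % SPACE_SIZE
```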
Since I cannot compare my implementation with the reference implementation, which as far as I know is proprietary, I have no objective measure of how well my retina works. To help with this, does anyone have a suggestion for a well-documented algorithm, with known semantic properties, for populating the semantic space that I could use as a baseline for comparison?
(EDIT: Added clarification for my question.)
My question about context is much more basic: what is actually claimed to happen at the context endpoint in the API? The documentation is vague, and the wording of its claims is weak.
The example used is “apple” appearing in three different contexts: “software”, “fruit” and “desktop”. I cannot find any description of how these contexts are deduced, or of how selecting one of them to narrow subsequent semantic searches affects the algorithm.
Am I just blind, or is this also part of the proprietary technology? Does anyone have an idea of how these contexts are deduced and how they are used in semantic searches?
As for my own implementation, here is a description (for anyone curious) of what I’ve built and the texts I’ve used for testing.
The basic functionality is the semantic search described in the white paper: a fingerprint is used to find the best overlaps with other fingerprints, the bits of the best match are removed, and the search-and-cull step is repeated until no bits are left. I’ve added a cutoff requiring at least three overlapping bits to generate a match. The limit is arbitrary, but it seems to remove coincidental snippet overlaps.
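In rough Python terms the loop looks like this (a simplified sketch: fingerprints are sets of bit indices, excluding the query token itself is left out, and only the chain of best matches is returned, whereas the session output below also lists the other overlapping tokens at each step):

```python
MIN_OVERLAP = 3  # arbitrary cutoff; seems to remove coincidental overlaps

def semantic_search(query: set[int],
                    fingerprints: dict[str, set[int]]) -> list[tuple[str, int]]:
    """Repeatedly take the best-overlapping fingerprint and cull its bits.

    `query` is the fingerprint searched for; `fingerprints` maps tokens
    to their fingerprints. Returns (token, overlap) pairs in match order.
    """
    remaining = set(query)
    matches: list[tuple[str, int]] = []
    while remaining:
        token, overlap = max(
            ((t, len(remaining & fp)) for t, fp in fingerprints.items()),
            key=lambda pair: pair[1],
            default=("", 0),
        )
        if overlap < MIN_OVERLAP:
            break
        matches.append((token, overlap))
        remaining -= fingerprints[token]  # cull the matched bits, then repeat
    return matches
```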
Every snippet contains at least two sentences and sometimes three.
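Snippet formation amounts to grouping consecutive sentences, roughly like this (a sketch; the sentence splitter here is deliberately crude and only for illustration):

```python
import re

def make_snippets(text: str) -> list[str]:
    """Group sentences into snippets of two, folding a trailing lone
    sentence into the previous snippet (hence "sometimes three")."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = [sentences[i:i + 2] for i in range(0, len(sentences), 2)]
    if len(chunks) > 1 and len(chunks[-1]) == 1:
        chunks[-2].extend(chunks.pop())
    return [" ".join(chunk) for chunk in chunks]
```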
Handling of stop words has been added so that semantically interesting matches are not pushed down by words such as “the” and “a”.
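Tokenization with the stop-word filtering looks roughly like this (the stop-word list here is an illustrative subset, not the one I actually use):

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "of", "in", "to", "is", "it", "as"}

def tokens(snippet: str) -> list[str]:
    """Lower-cased word tokens with stop words removed, so that matches
    are not dominated by function words."""
    return [w for w in re.findall(r"[a-z']+", snippet.lower())
            if w not in STOP_WORDS]
```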
It is possible to list all snippets that are connected to a fingerprint and to generate a frequency list for all tokens in these snippets.
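A sketch of those two operations, assuming a `snippet_at` map from index to snippet text (a placeholder name of mine) and reusing `tokens()` from above:

```python
from collections import Counter

def snippets_for(fingerprint: set[int],
                 snippet_at: dict[int, str]) -> list[str]:
    """All snippets whose index is a set bit in the fingerprint."""
    return [snippet_at[i] for i in sorted(fingerprint) if i in snippet_at]

def frequency_list(fingerprint: set[int],
                   snippet_at: dict[int, str]) -> Counter:
    """Token frequencies over all snippets connected to the fingerprint."""
    return Counter(t
                   for s in snippets_for(fingerprint, snippet_at)
                   for t in tokens(s))
```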
It is possible to perform a semantic search where two or more fingerprints are combined with bitwise OR or bitwise AND.
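That is just a matter of merging the bit sets before running the same search loop, reusing `semantic_search()` from above:

```python
def combined_search(parts: list[set[int]],
                    fingerprints: dict[str, set[int]],
                    mode: str = "or") -> list[tuple[str, int]]:
    """Search on the union (OR) or intersection (AND) of fingerprints."""
    combined = set.union(*parts) if mode == "or" else set.intersection(*parts)
    return semantic_search(combined, fingerprints)
```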
I have tested my implementation with the Wikipedia page for Sweden, Plato’s Laches, a public corpus of American English with some 500k words from different domains, a book about Roman archaeology in Britain, “On the Origin of Species” by Charles Darwin and the novel “Little Brother” by Cory Doctorow.
Part of a session with the Wikipedia page on Sweden looks something like this:
Reading snippets
436 snippets
436 mapped indices, 0.0266113 (~3% of the semantic space is used)
4235 tokens
Enter token:
denmark
denmark (17): (17 bits in the fingerprint)
swedish (10) (best overlap)
sweden (8)
norway (7)
finland (6)
scandinavian (4)
also (4)
danish (4)
countries (4)
south-west (3)
bridge (3)
swedish (10) => (7) (the 10 bits are removed from the original fingerprint)
sweden (4)
bridge (3)
countries (3)
sweden (4) => (3) (no more matches are generated)
(EDIT: Fixed formatting.)