Text corpora used by cortical.io?

Does anyone happen to know what text corpora were used by cortical.io in their semantic folding process? I am writing a variation on the semantic folding process which generates contexts between inputs based on the proximity of other inputs (versus using sentences as contexts). This would be useful for situations where you want to distill semantics from an input stream which does not have clearly defined beginnings and endings like sentences do.
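To make the idea concrete, here is a minimal sketch of proximity-based contexts, assuming a simple sliding window over a token stream. The function names and the `radius` parameter are hypothetical choices for illustration; this is not cortical.io's method, only one way the variation described above could be set up.

```python
from collections import defaultdict


def window_contexts(tokens, radius=2):
    """Return one context per token position: the token plus its
    neighbours within `radius` positions on either side.
    (Assumption: a fixed-size window stands in for a sentence.)"""
    contexts = []
    for i in range(len(tokens)):
        lo = max(0, i - radius)
        hi = min(len(tokens), i + radius + 1)
        contexts.append(tuple(tokens[lo:hi]))
    return contexts


def token_to_context_ids(tokens, radius=2):
    """Map each token to the set of context ids it appears in.
    That set is a crude stand-in for a sparse word fingerprint."""
    index = defaultdict(set)
    for ctx_id, ctx in enumerate(window_contexts(tokens, radius)):
        for tok in ctx:
            index[tok].add(ctx_id)
    return index


if __name__ == "__main__":
    stream = "the jaguar is a large cat".split()
    for tok, ids in token_to_context_ids(stream, radius=1).items():
        print(tok, sorted(ids))
```

Because the window slides over the stream, no sentence boundaries are needed; the trade-off is that `radius` becomes a parameter to tune.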

I would like to start by using it on words, to see whether it can still derive some of the well-known semantic relationships that cortical.io has demonstrated (such as Jaguar - Porsche = Tiger, and Apple - Fruit = Computer) without relying on the concept of sentences. This implies that the text sources will need a good number of sentences about large cats, luxury cars, fruit, and computers.
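As a rough illustration of what a relationship like Jaguar - Porsche = Tiger means operationally, the sketch below treats each word fingerprint as a set of active bit positions, subtracts one set from another, and looks up the remaining term with the largest overlap. The fingerprints are made-up toy values and the helper names are hypothetical; real fingerprints would come from a trained semantic map.

```python
def subtract(a, b):
    """Remove the bits of fingerprint b from fingerprint a."""
    return a - b


def closest_term(query, fingerprints, exclude=()):
    """Return the term whose fingerprint shares the most bits with
    `query`, skipping any terms listed in `exclude`.
    Ties are broken alphabetically so the result is deterministic."""
    candidates = sorted(t for t in fingerprints if t not in exclude)
    return max(candidates, key=lambda t: len(query & fingerprints[t]))


# Toy fingerprints (assumed values, for illustration only):
# bits 1-4 loosely mean "big cat", bits 10-13 "luxury car".
fingerprints = {
    "jaguar": {1, 2, 3, 10, 11, 12},
    "porsche": {10, 11, 12, 13},
    "tiger": {1, 2, 3, 4},
    "banana": {20, 21, 22},
}

if __name__ == "__main__":
    remainder = subtract(fingerprints["jaguar"], fingerprints["porsche"])
    print(closest_term(remainder, fingerprints, exclude=("jaguar", "porsche")))
```

A test corpus only needs enough coverage of both senses of each ambiguous word for the shared and unshared bits to separate like this.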

Otherwise, I’ll probably look at writing a crawler with a built-in grammar checker to scrape good sentences from Wikipedia.

I believe they used a subset of Wikipedia (approximately 200K pages) but the exact data set is not public.

Perfect, that is what I was planning to do as well.