Text corpora used by cortical.io?


#1

Does anyone happen to know what text corpora cortical.io used in their semantic folding process? I am writing a variation on semantic folding that generates contexts between inputs based on the proximity of other inputs (rather than using sentences as contexts). This would be useful when you want to distill semantics from an input stream that does not have clearly defined beginnings and endings the way sentences do.
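To make the idea concrete, here is a minimal sketch of what I mean by proximity-based contexts: each token's context is just whatever falls within a fixed window around it, with no notion of sentence boundaries. The function name and window size are my own placeholders, not anything from cortical.io.

```python
from collections import defaultdict

def proximity_contexts(tokens, window=2):
    """Map each token to the tokens appearing within `window`
    positions of it, ignoring sentence boundaries entirely."""
    contexts = defaultdict(list)
    for i, tok in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                contexts[tok].append(tokens[j])
    return contexts

ctx = proximity_contexts("the jaguar stalked its prey".split(), window=2)
# ctx["jaguar"] -> ["the", "stalked", "its"]
```

In a real stream you would slide this window over the input as it arrives, so no segmentation step is ever needed.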

I would like to start by using it on words, to see whether, without relying on the concept of sentences, it can still derive some of the well-known semantic relationships that cortical.io has demonstrated (such as Jaguar - Porsche = Tiger, and Apple - Fruit = Computer). This obviously implies that the text sources will need a good number of sentences about large cats, luxury cars, fruit, and computers.
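For anyone unfamiliar with that kind of word arithmetic: in a sparse binary representation it amounts to set operations on active bits. The toy bit assignments below are entirely made up for illustration (they are not cortical.io's real SDRs), but they show the mechanism: subtracting Porsche's bits from Jaguar's removes the "car" features, and the nearest remaining word by overlap is Tiger.

```python
def nearest(query_bits, vocab):
    # rank vocabulary words by overlap with the query's active bits
    return max(vocab, key=lambda w: len(vocab[w] & query_bits))

# hypothetical bit positions: 1-4 ~ "big cat" features, 10-12 ~ "car" features
sdr = {
    "tiger":   {1, 2, 3, 4},
    "porsche": {10, 11, 12},
    "banana":  {20, 21},
}
jaguar = {1, 2, 3, 10, 11}  # shares cat bits with tiger, car bits with porsche

result = nearest(jaguar - sdr["porsche"], sdr)
# result -> "tiger"
```

Whether proximity-only contexts produce representations clean enough for this to work is exactly what I want to test.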

Otherwise, I’ll probably look at writing a crawler with a built-in grammar checker to scrape good sentences from Wikipedia.


#2

I believe they used a subset of Wikipedia (approximately 200K pages) but the exact data set is not public.


#3

Perfect, that is what I was planning to do as well.