Problem Description
Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human–computer interaction. Many challenges in NLP involve natural language understanding – that is, enabling computers to derive meaning from human or natural language input.
Design
NuPIC is not currently tuned for NLP, but it should be capable of some basic NLP functions. If letters are used as categories, it should be able to recognize common word and sentence structures. Without a hierarchy, however, it will not be able to form a deep understanding of input text, because its model covers only one small region of the brain.
Some interesting experiments are still possible within this limitation. For example, words could be encoded into SDRs (sparse distributed representations) externally, through the cortical.io API, and fed directly into the CLA using a “pass-through” encoder.
Encoding
A preliminary “Pass-Thru” encoder has been merged into master in PR #257, which could be used to pass raw SDRs from cortical.io into the CLA.
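As a rough illustration of what "passing through" means, the sketch below turns a list of on-bit positions into a dense binary vector with no further transformation. The function name `positions_to_dense` and the 128 x 128 default size are assumptions made for this example; they are not taken from the NuPIC or cortical.io APIs.

```python
def positions_to_dense(positions, width=128, height=128):
    """Convert on-bit positions into a dense list of 0/1 values.

    The 128 x 128 default is an assumed SDR size; use whatever
    dimensions the SDR source actually reports.
    """
    size = width * height
    dense = [0] * size
    for pos in positions:
        if not 0 <= pos < size:
            raise ValueError("position %d outside SDR of size %d" % (pos, size))
        dense[pos] = 1
    return dense


# Hypothetical on-bit positions for one word (not real cortical.io output).
sdr = positions_to_dense([3, 17, 4096, 16383])
print("%d of %d bits on" % (sum(sdr), len(sdr)))
```

A dense vector like this could then be handed to the model unchanged, which is all a pass-through encoder is meant to do.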
NLP Tools
Other sources
The Python Natural Language Toolkit (NLTK) has many tools that could make word, sentence, phrase, and term extraction easier when pre-processing a corpus.
Large pre-trained word vector sets from Google
https://code.google.com/p/word2vec/
(suggested by Ian Danforth on the mailing list)
NLP Videos
CEPT (now cortical.io) SDR Session with Francisco Webber (NuPIC 2013 Fall Hackathon)
NuPIC For NLP
Visualizing Predicted Word SDRs (hackathon demo)
Famous (hackathon demo)
Francisco Webber, Soeren, and Erik from CEPT are trying to teach NuPIC to understand complex word associations with CEPT word SDRs.
Challenges!
Predict plural forms of nouns
Train the CLA on a set of nouns and their plural forms, both regular and irregular. Then give it a new singular form and check its prediction for the plural form. Whether the noun is regular or irregular, the predicted plural should be correct.
- cat cats
- status statuses
- dolly dollies
- cactus cacti
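One possible way to lay out the training data and score the result is sketched below. Each singular/plural pair is treated as its own two-step sequence, with a reset flag marking where a new sequence starts. The helper names and the `predict_next` callable are hypothetical; the stand-in predictor is deliberately naive and is not a CLA.

```python
PLURALS = [
    ("cat", "cats"),
    ("status", "statuses"),
    ("dolly", "dollies"),
    ("cactus", "cacti"),
]


def training_stream(pairs, repeats=3):
    """Yield (word, reset) records; each pair is its own two-step sequence."""
    for _ in range(repeats):
        for singular, plural in pairs:
            yield singular, True    # reset: a new sequence starts here
            yield plural, False


def accuracy(pairs, predict_next):
    """Fraction of singular forms whose predicted next word is the plural."""
    hits = sum(1 for singular, plural in pairs if predict_next(singular) == plural)
    return hits / float(len(pairs))


def naive_predictor(word):
    """Stand-in for a trained model (an assumption): always appends 's'."""
    return word + "s"


print(accuracy(PLURALS, naive_predictor))
```

The naive baseline only gets the fully regular noun right, which gives a floor to compare a trained model against.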
Predict country capitals
Recently, Google blogged about their word2vec tool, and described how it could understand the association of countries and their capitals. Do the same thing!
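The word2vec association is usually shown as vector arithmetic: vec(capital A) - vec(country A) + vec(country B) lands near vec(capital B). The sketch below shows that arithmetic on hand-made toy vectors, chosen only to make the calculation visible; they are not real word2vec output, and the function names are assumptions for this example.

```python
import math

# Hand-made 4-d toy vectors (illustrative only, NOT learned embeddings).
# Dimensions: France-ness, Italy-ness, Japan-ness, capital-ness.
VECTORS = {
    "france": [1.0, 0.0, 0.0, 0.0],
    "paris":  [1.0, 0.0, 0.0, 1.0],
    "italy":  [0.0, 1.0, 0.0, 0.0],
    "rome":   [0.0, 1.0, 0.0, 1.0],
    "japan":  [0.0, 0.0, 1.0, 0.0],
    "tokyo":  [0.0, 0.0, 1.0, 1.0],
}


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("paris", "france", "italy", VECTORS))
```

For the CLA version of this challenge, the open question is whether a comparable association can be learned from sequences of word SDRs rather than computed from dense vectors.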
Predict part-of-speech (POS, word type)
This could be useful to many applications: train the CLA to tell a word’s type. There are two ways to approach this.
For the spatial pooler, train on pairs:
- cat → noun
- run → verb
- swiftly → adverb
The other way would be harder to learn but should yield better results: train the SP+TP on sequences of word-type pairs.
- T1: Boy-noun
- T2: ran-verb
…
Part of Speech Anomaly Detection
There is already some work done on predicting parts of speech based on NLTK POS input. Adding another experiment to this project that does anomaly detection instead of prediction could provide more value. Once the CLA has learned the typical patterns of POS within a corpus, it might identify grammatical errors with high anomaly scores.
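To make "anomaly score" concrete, here is a minimal sketch of a raw score computed as the fraction of currently active columns that were not predicted on the previous step. The function name and the column indices are assumptions for illustration; this is a stand-alone calculation, not a call into NuPIC.

```python
def raw_anomaly_score(active, predicted):
    """Fraction of active columns that were not predicted on the previous step.

    0.0 means the input was fully expected; 1.0 means nothing was predicted.
    """
    active = set(active)
    if not active:
        return 0.0
    unpredicted = active - set(predicted)
    return len(unpredicted) / float(len(active))


# Hypothetical column indices for two time steps.
print(raw_anomaly_score([1, 5, 9, 12], [1, 5, 9, 12]))   # fully predicted
print(raw_anomaly_score([1, 5, 9, 12], [1, 5, 40, 41]))  # half predicted
```

In the POS setting, an unexpected tag (say, a verb where the learned patterns expect a noun) would show up as a spike in this score.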
Creating Word SDR Images from TP Output
In this experiment, NuPIC produces SDR predictions based on the sequences of word SDRs it has been fed. These predictions are given to the cortical.io API to find the closest term for each SDR. It would be interesting to visualize these predicted SDRs the same way cortical.io does, as a pixel bitmap.
The predicted bitmaps created by this word association experiment should have a sort of Gaussian property to them.
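A minimal sketch of the bitmap idea follows: flat on-bit positions are mapped onto a 2-d grid and rendered as text. The row-major layout, the grid dimensions, and the function names are assumptions for this example; the real layout depends on how the SDR source orders its bits.

```python
def sdr_to_grid(positions, width, height):
    """Map flat on-bit positions to a row-major 2-d grid of 0/1 values."""
    grid = [[0] * width for _ in range(height)]
    for pos in positions:
        row, col = divmod(pos, width)
        if not 0 <= row < height:
            raise ValueError("position %d does not fit a %dx%d grid"
                             % (pos, width, height))
        grid[row][col] = 1
    return grid


def render_ascii(grid, on="#", off="."):
    """Render the grid as text, one line per row."""
    return "\n".join("".join(on if bit else off for bit in row) for row in grid)


# Tiny hypothetical SDR on a 4 x 4 grid so the output stays readable.
picture = render_ascii(sdr_to_grid([0, 5, 9, 15], width=4, height=4))
print(picture)
```

The same grid could be written out with any imaging library instead of being printed, which would make clustering in the predicted SDRs easier to see.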
Emergence of grammar (+ontology)
On a (simplified) text dataset, train the CLA to generate text (predict the next letter in a sequence) and observe whether it appears to learn grammatical structures.
E.g.:
- He likeS
- We like
- Tom like? (“s” or nothing?) …this example, where Tom is a previously unobserved term, would require the use of an ‘ontology’: an additional vector supplying ground truth (he=singular, we=plural, Tom=singular), so training would look like:
- T1: {‘H’, singular}, T2: {‘e’, singular}…
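The training records described above could be generated along these lines. The `NUMBER` table stands in for the 'ontology' and simply repeats the ground truth given in the text; the function name is an assumption, and taking the first word as the subject is a simplification that only holds for sentences shaped like these examples.

```python
# Stand-in "ontology": grammatical number of each subject (from the text above).
NUMBER = {"He": "singular", "We": "plural", "Tom": "singular"}


def letter_records(sentence, ontology):
    """Pair every character with the number tag of the sentence's subject.

    Assumes the subject is the first word, which is true only for
    simple sentences like the examples here.
    """
    subject = sentence.split()[0]
    tag = ontology[subject]
    return [(ch, tag) for ch in sentence]


for step, record in enumerate(letter_records("He likes", NUMBER), start=1):
    print("T%d: %r" % (step, record))
```

Feeding the tag alongside each letter gives the model a chance to generalize from "He" to "Tom" without having seen "Tom" during training.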
History
- Aug 27, 2013: Community member Chetan Surpur created the linguist project for predicting the rest of a user’s word, phrase, or sentence.
- Aug 28, 2013: Community member James Tauber created the nupic-texts project for pre-processing a corpus of children’s texts.
- Sept 12, 2013: Francisco Webber provided offline retina mapping of words within nupic-texts.
- Sept 12, 2013: Created the pycept project as a Python client for the CEPT API.
- Sept 15, 2013: Held the 2013 Hackathon NLP Planning Meeting to discuss the NLP focus of the upcoming hackathon.
- Sept 26, 2013: Matt published an initial project that uses the CEPT API to download a library of SDRs for nouns in singular and plural forms extracted from the corpus of text included within NLTK.
- Oct 6, 2013: Matt updated his NLP Experiments project with an example of part of speech prediction using NLTK tagging.