Natural Language Processing

Problem Description

From Wikipedia:

Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human–computer interaction. Many challenges in NLP involve natural language understanding – that is, enabling computers to derive meaning from human or natural language input.

Design

NuPIC is not currently tuned for NLP, but it should be capable of some basic NLP functions. If letters are used as categories, it should be able to recognize common word and sentence structures. However, without a hierarchy, it will not be able to form a deep understanding of input text, because its model is limited to a single small region of the brain.

Some interesting experiments are still possible within this limitation. For example, words could be encoded into SDRs externally through the cortical.io API and fed directly into the CLA using a “pass-through” encoder.

Encoding

A preliminary “Pass-Thru” encoder has been merged into master in PR #257, which could be used to pass raw SDRs from cortical.io into the CLA.
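
As a rough illustration of what passing a raw SDR through involves, here is a minimal sketch. It assumes the external service returns a word SDR as a list of active bit positions. The function name `passthrough_encode` and the toy width of 16 are hypothetical and are not NuPIC's encoder API.

```python
def passthrough_encode(active_bits, width):
    """Turn a list of active bit positions into a dense 0/1 list.

    Hypothetical helper for illustration; not NuPIC's own encoder API.
    """
    dense = [0] * width
    for position in active_bits:
        if not 0 <= position < width:
            raise ValueError("bit position %d outside width %d" % (position, width))
        dense[position] = 1
    return dense


# Toy SDR: 3 active bits out of 16 (real word SDRs are far wider and sparser).
encoded = passthrough_encode([2, 7, 11], width=16)
print(encoded)
```

The point of a pass-through step is that no semantic encoding happens locally: the bits chosen by the external service reach the CLA unchanged.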

NLP Tools

Other sources

The Python Natural Language Toolkit (NLTK) has many NLP tools that could make word, sentence, phrase, and term extraction easier for corpus pre-processing.

Large pre-trained word vector sets from Google

https://code.google.com/p/word2vec/
(Ian Danforth on the ML)

NLP Videos

CEPT (now cortical.io) SDR Session with Francisco Webber (NuPIC 2013 Fall Hackathon)

NuPIC For NLP

(starts at 21m 40s)

Visualizing Predicted Word SDRs (hackathon demo)

What Does the Fox Eat?

Famous (hackathon demo)

Francisco Webber, Soeren, and Erik from CEPT are trying to teach NuPIC to understand complex word associations with CEPT word SDRs.

Challenges!

Predict plural forms of nouns

Train the CLA on a set of nouns and their plural forms, both regular and irregular. Then give it a new singular form and check its prediction for the plural form. Whether the noun is regular or irregular, the predicted plural should be correct.

  • cat → cats
  • status → statuses
  • dolly → dollies
  • cactus → cacti
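
One way this challenge could be scored is sketched below. The predictor here is a deliberately naive stand-in (it always appends "s") and is not the CLA. The names `naive_plural` and `accuracy` are hypothetical. A trained model would be plugged in where the baseline is and evaluated on pairs it has not seen during training.

```python
# Singular/plural pairs taken from the examples above.
PAIRS = [
    ("cat", "cats"),
    ("status", "statuses"),
    ("dolly", "dollies"),
    ("cactus", "cacti"),
]


def naive_plural(singular):
    """Baseline stand-in for a trained model: always append 's'."""
    return singular + "s"


def accuracy(predict, pairs):
    """Fraction of pairs whose predicted plural matches the expected one."""
    correct = sum(1 for singular, plural in pairs if predict(singular) == plural)
    return correct / float(len(pairs))


score = accuracy(naive_plural, PAIRS)
print("baseline accuracy: %.2f" % score)
```

The baseline only gets "cats" right, which shows why the irregular forms are the interesting part of the challenge.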

Predict country capitals

Recently, Google blogged about their word2vec tool, and described how it could understand the association of countries and their capitals. Do the same thing!
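
The country-capital association in word2vec is usually demonstrated with vector arithmetic: the offset from a country to its capital is roughly the same for other countries. The sketch below shows that idea on toy three-dimensional vectors that were invented for illustration. They are not real word2vec output, and the `analogy` helper is hypothetical.

```python
import math

# Toy vectors, invented for illustration only (not real word2vec output).
VECTORS = {
    "france": [1.0, 0.0, 0.0],
    "paris": [1.0, 0.0, 1.0],
    "italy": [0.0, 1.0, 0.0],
    "rome": [0.0, 1.0, 1.0],
    "spain": [1.0, 1.0, 0.0],
    "madrid": [1.0, 1.0, 1.0],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def analogy(a, b, c):
    """Return the word d for which 'a is to b as c is to d' fits best."""
    target = [vb - va + vc
              for va, vb, vc in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    candidates = [word for word in VECTORS if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(VECTORS[word], target))


answer = analogy("france", "paris", "italy")
print(answer)
```

With word SDRs instead of dense vectors, the same experiment would need a different notion of offset and similarity, such as bit overlap. That choice is left open here.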

Predict part-of-speech (POS, word type)

This could be useful to many applications: train the CLA to tell the word type. I’m thinking of two ways.
The first uses only the spatial pooler (SP), trained on pairs:

  • cat -> noun
  • run -> verb
  • swiftly -> adverb

The other way would be harder to learn but should yield better results: train SP+TP on sequences of word-type pairs.

  • T1: Boy-noun
  • T2: ran-verb
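
The sequence layout above could be prepared along these lines. This is a minimal sketch that assumes each timestep is fed to the model as a single "word-type" category string. The tags are written by hand here, and `to_category_steps` is a hypothetical helper.

```python
# Hand-tagged toy sentence; in practice the tags could come from a POS tagger.
TAGGED_SENTENCE = [("Boy", "noun"), ("ran", "verb"), ("swiftly", "adverb")]


def to_category_steps(tagged_words):
    """Yield (timestep, category) pairs, one "word-type" category per step."""
    for step, (word, tag) in enumerate(tagged_words, start=1):
        yield step, "%s-%s" % (word, tag)


steps = list(to_category_steps(TAGGED_SENTENCE))
for step, category in steps:
    print("T%d: %s" % (step, category))
```

For the pair-based variant, the same tagged list can be used directly, with the word as input and the tag as the target category.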

Part of Speech Anomaly Detection

There is already some work done on predicting parts of speech based on NLTK POS input. Adding another experiment to this project that does anomaly detection instead of prediction could provide more value. Once the CLA has learned the typical POS patterns within a corpus, it might flag grammatical errors with high anomaly scores.
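
To make the idea of an anomaly score over POS sequences concrete, the sketch below uses a simple bigram frequency model as a stand-in. This is not the CLA and not how NuPIC computes anomaly scores. It only illustrates the principle that transitions rarely or never seen in training score close to 1. The tag sequences and function names are made up for the example.

```python
from collections import Counter, defaultdict

# Toy training corpus of POS tag sequences (invented for illustration).
TRAINING = [
    ["det", "noun", "verb"],
    ["det", "adj", "noun", "verb"],
    ["det", "noun", "verb", "adv"],
]


def train_bigrams(sequences):
    """Count how often each tag follows each other tag."""
    counts = defaultdict(Counter)
    for tags in sequences:
        for previous, current in zip(tags, tags[1:]):
            counts[previous][current] += 1
    return counts


def anomaly_scores(counts, tags):
    """Score each transition as 1 - P(current | previous); unseen gives 1.0."""
    scores = []
    for previous, current in zip(tags, tags[1:]):
        total = sum(counts[previous].values())
        probability = counts[previous][current] / float(total) if total else 0.0
        scores.append(1.0 - probability)
    return scores


counts = train_bigrams(TRAINING)
normal_scores = anomaly_scores(counts, ["det", "noun", "verb"])
odd_scores = anomaly_scores(counts, ["det", "verb", "noun"])
print(normal_scores, odd_scores)
```

A sequence model such as the CLA can use longer context than one previous tag, which is the reason to expect it to catch errors a bigram model would miss.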

Creating Word SDR Images from TP Output

In this experiment, NuPIC produces SDR predictions based on the sequences of word SDRs it has been fed. These predictions are given to the cortical.io API to find the closest term for each SDR. It would be interesting to visualize the predicted SDRs the same way cortical.io does, as a pixel bitmap like this:

CEPT SDR Image

The predicted bitmaps produced by this word association experiment should have a roughly Gaussian property to them.
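
A bitmap view of an SDR can be sketched without any imaging library by laying the bits out on a grid. The example below assumes the SDR is a list of active bit positions and that bits fill the grid row by row. Both the row-major layout and the tiny 4 x 4 grid are assumptions for illustration, and the actual cortical.io layout may differ.

```python
def sdr_to_rows(active_bits, width, height):
    """Lay out active bit positions on a width x height grid, row by row."""
    active = set(active_bits)
    rows = []
    for y in range(height):
        row = "".join(
            "#" if y * width + x in active else "." for x in range(width)
        )
        rows.append(row)
    return rows


# Toy SDR with four active bits on a 4 x 4 grid.
rows = sdr_to_rows([0, 5, 10, 15], width=4, height=4)
print("\n".join(rows))
```

The same rows could be written out as image pixels instead of characters to compare a predicted SDR with the bitmap of its closest term.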

Emergence of grammar (+ontology)

On a (simplified) text dataset, train the CLA on text generation (predicting the next letter in a sequence) and observe whether it appears to learn grammatical structures.

E.g.:

  • He likeS
  • We like
  • Tom like(s?)

The Tom example (Tom being a previously unobserved term) would require the use of an ‘ontology’: an additional vector carrying ground-truth information (he = singular, we = plural, Tom = singular). Training would then look like:

  • T1: {‘H’, singular}, T2: {‘e’, singular}, …
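
The letter-plus-ontology input described above could be built as in the sketch below. It assumes the subject is the first word of the sentence and that its grammatical number is looked up in a small hand-written table. `ONTOLOGY` and `letter_steps` are hypothetical names used only for this illustration.

```python
# Ground-truth grammatical number for each subject, as in the example above.
ONTOLOGY = {"He": "singular", "We": "plural", "Tom": "singular"}


def letter_steps(sentence, ontology):
    """Pair every character of a sentence with the number of its subject.

    Assumes the subject is the first word of the sentence.
    """
    subject = sentence.split()[0]
    number = ontology[subject]
    return [(char, number) for char in sentence]


steps = letter_steps("He likes", ONTOLOGY)
for index, step in enumerate(steps[:2], start=1):
    print("T%d: %s" % (index, step))
```

Each pair would then be encoded as one input record, with the letter and the ontology field as separate categories.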

History

  • Aug 27, 2013: Community member Chetan Surpur created linguist project for predicting the rest of a user’s word, phrase, or sentence.
  • Aug 28, 2013: Community member James Tauber created nupic-texts project for pre-processing a corpus of children’s texts.
  • Sept 12, 2013: Francisco Webber provided offline retina mapping of words within nupic-texts.
  • Sept 12, 2013: Created pycept project to use as python CEPT API client.
  • Sept 15, 2013: Had 2013 Hackathon NLP Planning Meeting to discuss the NLP focus of upcoming hackathon.
  • Sept 26, 2013: Matt published an initial project that uses the CEPT API to download a library of SDRs for nouns in singular and plural forms extracted from the corpus of text included within NLTK.
  • Oct 6, 2013: Matt updated his NLP Experiments project with an example of part of speech prediction using NLTK tagging.