Natural Language Processing

rhyolight · April 5, 2017, 9:23pm

Problem Description

Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human–computer interaction. Many challenges in NLP involve natural language understanding – that is, enabling computers to derive meaning from human or natural language input.

Design

NuPIC is not currently tuned for NLP, but it should be capable of some basic NLP functions. If letters are used as categories, it should be able to recognize common word and sentence structures. However, without a hierarchy, it will not be able to formulate a deep understanding of input text, because its model covers only one small region of the brain.

However, some interesting experiments could still be performed even with this limitation. For example, words could be encoded into SDRs externally, through the cortical.io API, and fed directly into the CLA using a “pass-through” encoder.

Encoding

A preliminary “Pass-Thru” encoder, merged into master in PR #257, could be used to pass raw SDRs from cortical.io into the CLA.
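To make the idea concrete, here is a minimal sketch of what a pass-through encoder does: it takes an SDR that was already computed elsewhere and hands it on unchanged, only converting a list of active bit positions into a fixed-width binary vector. The function name `pass_through_encode`, the 16-bit width, and the example fingerprint are assumptions made for illustration; this is a stand-in, not the encoder from the PR.

```python
def pass_through_encode(on_bits, n):
    """Return a length-n list of 0/1 with the given positions set to 1.

    No semantics are added here: the input is assumed to be an SDR already.
    """
    output = [0] * n
    for index in on_bits:
        if not 0 <= index < n:
            raise ValueError("bit index %d outside SDR of width %d" % (index, n))
        output[index] = 1
    return output


# Hypothetical fingerprint: positions of the active bits in a 16-bit SDR.
fingerprint = [1, 4, 9]
encoded = pass_through_encode(fingerprint, 16)
print(encoded)
```

A real word SDR would be far wider and sparser; the width is a parameter so the same sketch applies.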

NLP Tools

Other sources

The Python Natural Language Toolkit (NLTK) has a lot of NLP tools that could make word, sentence, phrase, and term extraction easier for corpus pre-processing.

Large pre-trained word vector sets from Google

https://code.google.com/p/word2vec/
(Ian Danforth on the ML)

NLP Videos

CEPT (now cortical.io) SDR Session with Francisco Webber (NuPIC 2013 Fall Hackathon)

NuPIC For NLP

(starts at 21m 40s)

Visualizing Predicted Word SDRs (hackathon demo)

Famous (hackathon demo)

Francisco Webber, Soeren, and Erik from CEPT are trying to teach NuPIC to understand complex word associations with CEPT word SDRs.

Challenges!

Predict plural forms of nouns

Train the CLA on a bunch of nouns and their plural forms, both regular and irregular. Then give it a new singular form and ask it to predict the plural form. Whether regular or irregular, the predicted plural should be correct.

cat cats
status statuses
dolly dollies
cactus cacti
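One way to set this challenge up, sketched under assumptions: each singular/plural pair becomes a short two-step sequence with a reset at the start, and any predictor is scored by the fraction of plurals it gets right. The helper names (`to_sequences`, `accuracy`, `add_s`) are hypothetical, and `add_s` is only a naive baseline to compare against, not the CLA.

```python
PLURAL_PAIRS = [
    ("cat", "cats"),
    ("status", "statuses"),
    ("dolly", "dollies"),
    ("cactus", "cacti"),
]


def to_sequences(pairs):
    """Flatten pairs into (token, reset) steps; reset marks a new sequence."""
    steps = []
    for singular, plural in pairs:
        steps.append((singular, True))   # start of a new two-step sequence
        steps.append((plural, False))    # the token the model should predict
    return steps


def accuracy(predict, pairs):
    """Fraction of pairs for which predict(singular) equals the plural."""
    correct = sum(1 for singular, plural in pairs if predict(singular) == plural)
    return correct / float(len(pairs))


def add_s(singular):
    """Naive baseline: always append 's'."""
    return singular + "s"


steps = to_sequences(PLURAL_PAIRS)
baseline = accuracy(add_s, PLURAL_PAIRS)
print(steps[:2], baseline)
```

On these four pairs the baseline only gets "cats" right, which shows why the irregular forms are the interesting part of the challenge.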

Predict country capitals

Recently, Google blogged about their word2vec tool, and described how it could understand the association of countries and their capitals. Do the same thing!
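For reference, the country/capital association in word2vec is usually demonstrated with vector offsets: capital minus country, plus another country, lands near that country's capital. The sketch below shows only the arithmetic. The three-dimensional vectors are invented toy values chosen so the example works; they are not real word2vec embeddings, and the function names are hypothetical.

```python
import math

# Invented toy vectors, for illustration only (not real embeddings).
VECTORS = {
    "France": [1.0, 0.0, 0.0],
    "Paris": [1.0, 1.0, 0.0],
    "Italy": [0.0, 0.0, 1.0],
    "Rome": [0.0, 1.0, 1.0],
    "Germany": [1.0, 0.0, 1.0],
    "Berlin": [1.0, 1.0, 1.0],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' using the offset b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(vectors[word], target))


answer = analogy("France", "Paris", "Italy", VECTORS)
print(answer)
```

Whether an equivalent operation exists for word SDRs is exactly what the challenge asks; this block only illustrates the behavior to reproduce.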

Predict part-of-speech (POS, word type)

This could be useful to many applications: train the CLA to tell word type. I’m thinking of two ways.
The first uses only the Spatial Pooler, trained on pairs:

cat → noun
run → verb
swiftly → adverb

The other way would be harder to learn but should yield better results: train the SP+TP on a sequence of word-type pairs.

T1: Boy-noun
T2: ran-verb
…
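The two framings can be sketched as plain data preparation. The tagged sentence, the helper names, and the choice to give each tag an integer category index are all assumptions for illustration; how the records would actually be encoded for the SP or TP is left open.

```python
# Hypothetical hand-tagged sentence used only for this sketch.
TAGGED = [("boy", "noun"), ("ran", "verb"), ("swiftly", "adverb")]


def build_tag_index(tagged):
    """Assign each distinct tag a stable integer category index."""
    index = {}
    for _, tag in tagged:
        if tag not in index:
            index[tag] = len(index)
    return index


def as_pairs(tagged, tag_index):
    """First framing: independent (word, tag category) records, order ignored."""
    return [{"word": word, "tag": tag_index[tag]} for word, tag in tagged]


def as_sequence(tagged):
    """Second framing: ordered time steps of combined word-tag tokens."""
    return [(step, "%s-%s" % (word, tag)) for step, (word, tag) in enumerate(tagged, 1)]


tag_index = build_tag_index(TAGGED)
pairs = as_pairs(TAGGED, tag_index)
sequence = as_sequence(TAGGED)
print(pairs[0], sequence[:2])
```

The first form discards word order, so only the second gives the temporal memory any context to learn from.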

Part of Speech Anomaly Detection

There is already some work done on predicting parts of speech based on NLTK POS input. Adding another experiment to this project that does anomaly detection instead of prediction could provide more value. Once the CLA has learned the typical patterns of POS within a corpus, it might identify grammatical errors with high anomaly scores.
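A raw anomaly score of this kind is commonly computed as the fraction of currently active columns that were not predicted at the previous step. The sketch below implements that formula on plain sets; the function name and the column indices are made up for illustration, and this is not a call into NuPIC itself.

```python
def anomaly_score(active, predicted):
    """Fraction of active columns that were not predicted.

    0.0 means the input was fully expected, 1.0 means fully unexpected.
    An empty active set is scored 0.0 by convention.
    """
    active = set(active)
    if not active:
        return 0.0
    overlap = len(active & set(predicted))
    return 1.0 - overlap / float(len(active))


# Hypothetical column indices for one time step.
expected = anomaly_score([1, 2, 3, 4], [1, 2, 3, 4])    # everything predicted
surprising = anomaly_score([1, 2, 3, 4], [3, 4, 5])     # half predicted
print(expected, surprising)
```

For the POS experiment, a tag that rarely follows the previous one would activate mostly unpredicted columns and so score close to 1.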

Creating Word SDR Images from TP Output

In this experiment, NuPIC produces SDR predictions based on the sequences of word SDRs it has been fed. These predictions are given to the cortical.io API to find the closest term for each SDR. It would be interesting to visualize these predicted SDRs the same way cortical.io does, as a pixel bitmap, like this:

CEPT SDR Image

The predicted bitmaps created by this word association experiment should have a sort of Gaussian property to them.
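A minimal sketch of the rendering step, assuming the SDR is given as a list of active bit positions and that bits map onto the grid in row-major order (both are assumptions, as are the function name and the tiny 4×3 grid). It draws text rows rather than an image so it runs without any imaging library.

```python
def sdr_to_rows(on_bits, width, height):
    """Render active bit positions as rows of '#' (on) and '.' (off)."""
    on = set(on_bits)
    rows = []
    for y in range(height):
        row = "".join("#" if y * width + x in on else "." for x in range(width))
        rows.append(row)
    return rows


# Hypothetical predicted SDR on a 4-wide, 3-high grid.
rows = sdr_to_rows([0, 5, 10, 11], 4, 3)
print("\n".join(rows))
```

Swapping the characters for pixel values would give the bitmap; the grid dimensions would need to match whatever layout the word SDRs actually use.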

Emergence of grammar (+ontology)

On a (simplified) text dataset, train text generation (prediction of the next letter in a sequence) and observe whether grammatical structures appear to be learned by the CLA.

E.g.:

He likeS
We like
Tom like? (should the next character be ‘s’ or a space?)

This Tom example (Tom being some term not observed in training) would require the use of an ‘ontology’: an additional input vector carrying ground-truth information (he = singular, we = plural, Tom = singular). Training would then look like:
T1: {‘H’, singular}, T2: {‘e’, singular}, …
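The training records described here can be sketched as follows, assuming the grammatical number is looked up from the first word of each sentence and attached to every letter. The lookup table, the function name, and the decision to tag spaces as well are illustrative assumptions.

```python
# Hypothetical ground-truth "ontology": grammatical number per subject.
NUMBER = {"he": "singular", "we": "plural", "tom": "singular"}


def letter_records(sentence, number_of):
    """Pair every character of the sentence with its subject's number."""
    subject = sentence.split()[0].lower()
    number = number_of[subject]  # raises KeyError for an unknown subject
    return [(char, number) for char in sentence]


records = letter_records("He likes", NUMBER)
print(records[:2])
```

At test time, "Tom like" would be presented with `singular` attached to each letter, and the question is whether the CLA then predicts "s".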

History

Aug 27, 2013: Community member Chetan Surpur created the linguist project for predicting the rest of a user’s word, phrase, or sentence.
Aug 28, 2013: Community member James Tauber created the nupic-texts project for pre-processing a corpus of children’s text.
Sept 12, 2013: Francisco Webber provided offline retina mapping of words within nupic-texts.
Sept 12, 2013: Created the pycept project to use as a Python CEPT API client.
Sept 15, 2013: Held the 2013 Hackathon NLP Planning Meeting to discuss the NLP focus of the upcoming hackathon.
Sept 26, 2013: Matt published an initial project that uses the CEPT API to download a library of SDRs for nouns in singular and plural forms extracted from the corpus of text included within NLTK.
Oct 6, 2013: Matt updated his NLP Experiments project with an example of part of speech prediction using NLTK tagging.
