On Friday, Sep 15, a planning meeting was held to decide the direction of the Natural Language Processing (NLP) focus for the upcoming 2013 Fall Hackathon. The meeting agenda and notes follow. In attendance were Francisco Webber, Fergal Byrne, Jeff Hawkins, Subutai Ahmad, and Matt Taylor. A partial video of the meeting is available on YouTube.
Meeting Agenda
Primary Goals
- Provide the tools, documentation, and guidance hackathon participants need as a foundation for NLP work.
- Identify higher-level building blocks for easier usage of NuPIC in general.
Current Tools
- Retina text-to-SDR mapping file (see the sketch after this list)
- CEPT API for word-to-SDR mapping
- linguist project for feeding letters into NuPIC
- nupic-texts project for converting text into better formats
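The format of the Retina mapping file was not pinned down in the meeting. As a rough illustration only, the sketch below assumes a hypothetical JSON file mapping each word to a list of active bit indices, and converts a word into a dense binary vector; the file name, format, and Retina size are placeholders.

```python
import json
import numpy as np

RETINA_SIZE = 128 * 128  # placeholder; use the dimensions of the actual mapping file

def load_word_sdrs(path):
    """Load a hypothetical JSON mapping of word -> list of active bit indices."""
    with open(path) as f:
        return json.load(f)

def word_to_sdr(word, mapping, size=RETINA_SIZE):
    """Convert a word into a dense 0/1 vector, or None if the word is not mapped."""
    indices = mapping.get(word.lower())
    if indices is None:
        return None
    sdr = np.zeros(size, dtype=np.uint8)
    sdr[indices] = 1
    return sdr

# Usage (file name is a placeholder):
# mapping = load_word_sdrs("retina_word_sdrs.json")
# fox_sdr = word_to_sdr("fox", mapping)
```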
Docs to Create
- General introduction to NLP
- How to convert text into SDRs
- How to feed raw SDRs into NuPIC (see the sketch after this list)
- How to tweak Spatial Pooler (SP) / Temporal Pooler (TP) settings
- Links to relevant external resources
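As a starting point for the "feed raw SDRs into NuPIC" doc, here is a minimal sketch of pushing a word SDR straight through the Spatial Pooler. The import path, SDR size, and parameter values are assumptions (they vary across NuPIC versions) and will need tuning.

```python
import numpy as np
# Import path differs across NuPIC versions; older releases expose it under nupic.research
from nupic.research.spatial_pooler import SpatialPooler

INPUT_SIZE = 128 * 128   # assumed Retina SDR size; match the actual mapping file
NUM_COLUMNS = 2048       # typical column count, tune as needed

sp = SpatialPooler(
    inputDimensions=(INPUT_SIZE,),
    columnDimensions=(NUM_COLUMNS,),
    potentialRadius=INPUT_SIZE,     # let every column see the whole input SDR
    globalInhibition=True,
    numActiveColumnsPerInhArea=40,  # roughly 2% sparsity over 2048 columns
)

def feed_sdr(word_sdr, learn=True):
    """Run one word SDR through the Spatial Pooler; return the active column indices."""
    active = np.zeros(NUM_COLUMNS, dtype=np.uint32)
    sp.compute(word_sdr.astype(np.uint32), learn, active)
    return np.nonzero(active)[0]
```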
Open Questions
- Is it possible to link multiple CLAs into a hierarchy for NLP use?
  - Get one level working well first
Meeting Notes
- Can we provide a “back conversion” from the CLA's output, translating an output SDR back into a word?
  - Does CEPT have an API endpoint to do this, returning the “closest” word to the input SDR? Yes
  - the returned word could actually be outside of the input language corpus
- provide a baseline
  - compare random word SDR input against real word SDR input
  - known sentences vs. unknown sentences
- anomaly detection guidance could be a good idea; it should be easier to do than prediction (see the anomaly-score sketch after these notes)
  - after training on sentences from a known corpus of words, feed in nonsense sentences; we should see high anomaly scores (syntactic anomalies)
  - use several sentences like “Tom drove a X”, where X is a reasonable word: train on lots of values for X, then send unreasonable values to confirm high anomaly scores (semantic anomalies)
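For reference during the anomaly experiments above: the raw anomaly score used in NuPIC is simply the fraction of currently active columns that were not predicted at the previous step. A minimal standalone sketch:

```python
def raw_anomaly_score(predicted_columns, active_columns):
    """Fraction of currently active columns that were NOT predicted last step.

    0.0 means the input was fully predicted (familiar); 1.0 means completely unexpected.
    """
    active = set(active_columns)
    if not active:
        return 0.0
    return 1.0 - len(active & set(predicted_columns)) / float(len(active))
```

After training on sentences from the known corpus, nonsense or semantically odd sentences should push this score toward 1.0.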
Tasks To Do
- Test with random word SDRs vs. real word SDRs and compare the results (see the sketch after this list)
- Train on the known language set, then feed in grammatically incorrect input for anomaly detection
- Aside from the existing children's text corpus, provide another, more general corpus of common nouns, verbs, etc., for experimentation
  - Francisco will gather the most frequent words in the language and create another mapping file
- Experiment: train the SP on the raw CEPT SDRs via the pass-through encoder, also train on SDRs normalized to a fixed density, and compare results
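A rough numpy sketch (not discussed in the meeting) for the first task's baseline comparison, plus one possible way to normalize an SDR to a fixed density; whether to subsample or pad with random bits is an assumption, not something the group decided.

```python
import numpy as np

def overlap(sdr_a, sdr_b):
    """Number of active bits shared by two binary SDRs."""
    return int(np.dot(sdr_a, sdr_b))

def random_sdr(size, num_active, rng=np.random):
    """Random SDR with the same size and active-bit count as a real word SDR."""
    sdr = np.zeros(size, dtype=np.uint8)
    sdr[rng.choice(size, num_active, replace=False)] = 1
    return sdr

def normalize_density(sdr, target_active, rng=np.random):
    """Force an SDR to exactly target_active bits by subsampling or random padding."""
    active = np.nonzero(sdr)[0]
    out = np.zeros_like(sdr)
    if len(active) >= target_active:
        keep = rng.choice(active, target_active, replace=False)
    else:
        inactive = np.nonzero(sdr == 0)[0]
        pad = rng.choice(inactive, target_active - len(active), replace=False)
        keep = np.concatenate([active, pad])
    out[keep] = 1
    return out
```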