MemoryError when predicting with NuPIC

Hi everyone,

I’m relatively new to NuPIC, and I’ve been trying to modify the word-prediction code from the NuPIC examples (https://github.com/numenta/nupic/tree/master/examples/prediction/category_prediction) to work with a larger data set (specifically, the Penn Treebank data set). However, it only runs successfully when I limit the number of categories by limiting the data set size. When I run the model on the whole data set (~10,000 unique words), it raises a MemoryError after 2 calls to Model.run().

The token files are generated successfully; the error only happens when I call Model.run(). I’ve already tried setting the “maxCategoryCount” parameter to 10000, but the same error occurred. I’m not entirely sure what the problem is. The exact error is reproduced below:

Any help would be greatly appreciated!

Hi @wtz5pp and thanks for posting to the forums. Can you try using the SDRCategoryEncoder and see if that helps?

SDRCategoryEncoder has another parameter “n”; what would be a good value of n and w to use in this case?

This space might still be too large to represent. Think about it like this. If you have over 10K unique categories to encode, you need an input SDR that is at least 10K cells. Each cell would represent a unique value all by itself. I think you have too many unique values to represent.
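
As a rough illustration of that scaling argument, here is a small sketch in plain Python. It is not NuPIC code: the function name is hypothetical, and it simply assumes an encoding in which every category gets its own non-overlapping block of w bits, so the width grows linearly with the number of categories.

```python
# Illustrative arithmetic only; this does not call NuPIC.

def dedicated_block_width(num_categories, w=1):
    """Width of an encoding where each category owns its own
    non-overlapping block of w bits (hypothetical helper)."""
    return num_categories * w


# With one cell per category, 10,000 words need 10,000 cells.
# With a wider block per category (w=21 is just an example value),
# the input grows by the same factor.
for count in (100, 1000, 10000):
    print(count, dedicated_block_width(count), dedicated_block_width(count, w=21))
```

With one cell per word the input is already 10,000 cells wide, and any redundancy per category multiplies that further.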

We typically try to generalize a bit when creating encoders. Each category probably is not entirely different from others. Some may need to be encoded so there is semantic similarity between them. What do the categories represent? Can you list some typical values?

Each category is a different word (e.g. the, and, it, although, sheep…). There are around 10,000 unique words in the data set.

Instead of representing words as categories, you should get semantic fingerprints from Cortical.IO. Here are some resources:

Thank you so much for the help!

The MemoryError doesn’t appear anymore after changing the encoder, so I think I’ll try the SDRCategoryEncoder first before looking into cortical.io.

Ok, but I’m not sure you will be very successful without encoding some semantic similarity between terms.
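
To make that concern concrete, here is a minimal sketch in plain Python (no NuPIC calls). It assumes each word is assigned an independent random SDR, which may not match exactly what SDRCategoryEncoder does internally; the helper names and the values n=2048 and w=41 are illustrative choices, not recommendations from this thread.

```python
import random


def random_sdr(n, w, rng):
    """Return the set of w active bit indices out of n total bits."""
    return frozenset(rng.sample(range(n), w))


def overlap(a, b):
    """Number of active bits two SDRs have in common."""
    return len(a & b)


rng = random.Random(42)   # fixed seed so runs are repeatable
n, w = 2048, 41           # illustrative sizes (assumption)

words = ["sheep", "goat", "although"]
sdrs = {word: random_sdr(n, w, rng) for word in words}

# A word overlaps fully with itself...
print("sheep vs sheep:", overlap(sdrs["sheep"], sdrs["sheep"]))
# ...but related and unrelated words look equally dissimilar,
# because the bits were chosen without regard to meaning.
print("sheep vs goat:", overlap(sdrs["sheep"], sdrs["goat"]))
print("sheep vs although:", overlap(sdrs["sheep"], sdrs["although"]))
```

Under this assumption the expected overlap between any two different words is about w*w/n, under one bit for these sizes, regardless of how related the words are. A semantic encoding would instead give "sheep" and "goat" many shared bits.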