I’m relatively new to using NuPIC and I’ve been trying to modify the code provided in the NuPIC examples for word prediction (https://github.com/numenta/nupic/tree/master/examples/prediction/category_prediction) to work with a larger data set (specifically, the Penn Treebank data set). However, it only runs successfully when I limit the number of categories by shrinking the data set. When I try to run the model through the whole data set (~10,000 unique words), it raises a MemoryError after two calls to Model.run().
The token files are generated successfully; the error only happens when I call Model.run(), and the exact traceback is reproduced below. I’ve already tried raising the “maxCategoryCount” parameter to 10000, but the same error occurred, so I’m not entirely sure what the problem is.
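For reference, this is roughly how I’m raising the limit. It’s a sketch from memory: the `model_params` import path, the `clParams` nesting for `maxCategoryCount`, and the `"word"` field name are assumptions based on my local copy of the category_prediction example, and they may differ across NuPIC versions.

```python
from nupic.frameworks.opf.model_factory import ModelFactory

import model_params  # the params file shipped with the example (assumed layout)

params = model_params.MODEL_PARAMS
# Raise the classifier's category limit to cover the full PTB vocabulary.
# (Putting "maxCategoryCount" inside clParams is my assumption.)
params["modelParams"]["clParams"]["maxCategoryCount"] = 10000

model = ModelFactory.create(params)
model.enableInference({"predictedField": "word"})  # field name is hypothetical
```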
This space might still be too large to represent. Think of it like this: if you have over 10K unique categories to encode, you need an input SDR of at least 10K cells, because each cell would have to represent a unique value all by itself. I think you simply have too many unique values to represent.
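To make that concrete, here is a back-of-envelope sketch. The bits-per-category figure, the column count, and the per-synapse cost are all assumed ballpark numbers for illustration, not NuPIC internals, but they show how quickly the footprint grows when every category needs its own slice of the input space:

```python
# Rough arithmetic only; every constant here is an assumed ballpark figure.
w = 21                      # active bits per category SDR (assumed)
num_categories = 10000      # unique words in the PTB vocabulary
n = w * num_categories      # naive input width if no categories share bits

columns = 2048              # typical spatial pooler column count (assumed)
bytes_per_synapse = 8       # rough bookkeeping cost per potential synapse

sp_footprint = columns * n * bytes_per_synapse
print("input width: %d bits" % n)                            # 210000 bits
print("rough SP footprint: %.1f GB" % (sp_footprint / 1e9))  # ~3.4 GB
```

A few gigabytes just for the spatial pooler’s synapse tables would be consistent with running out of memory after only a couple of calls to Model.run().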
We typically try to generalize a bit when creating encoders. Each category is probably not entirely different from the others; some may need to be encoded so that there is semantic similarity between them. What do the categories represent? Can you list some typical values?
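One way to get that kind of semantic similarity, sketched below purely as an illustration of the idea rather than as a NuPIC encoder, is to build SDRs in which related categories share a fraction of their active bits instead of each word getting a fully random code:

```python
import random

def make_sdr(n=1024, w=20, seed=None):
    """Return w random active bit positions out of an n-bit space."""
    rng = random.Random(seed)
    return set(rng.sample(range(n), w))

def similar_sdr(base, n=1024, w=20, shared=0.5, seed=None):
    """Build an SDR that reuses a fraction of `base`'s active bits."""
    rng = random.Random(seed)
    kept = set(rng.sample(sorted(base), int(w * shared)))  # bits shared with base
    free = [b for b in range(n) if b not in base]          # bits outside base
    kept |= set(rng.sample(free, w - len(kept)))           # fill the rest randomly
    return kept

cat = make_sdr(seed=1)             # SDR for, say, "cat"
kitten = similar_sdr(cat, seed=2)  # "kitten" overlaps "cat" by ~50% of its bits
print(len(cat & kitten))           # 10 shared bits out of 20 => high overlap
```

Overlap is what the spatial pooler and temporal memory treat as similarity, so grouping related words this way lets you get by with a much smaller input space than one-bit-set-per-category.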