I’m writing my own Python application using the nupic library. Currently, my encoder generates input vectors for the HTM model that are about 170,000 bits long.
I have no problem creating the SpatialPooler instance with inputDimensions=(170828,) and columnDimensions=(10000,). It takes more than 2 minutes to finish and consumes over 10 GB of RAM, but so far so good.
However, when I try to compute an SDR from the 170,828-bit input vector (with learning enabled), it crashes. The program simply terminates (without an error message) and I get the following exit code:
Process finished with exit code -1073741819 (0xC0000005)
I wonder whether this error comes from using a longer-than-allowed input vector, from not having enough free memory at the moment of computation, or from some more mysterious cause.
Any ideas? Thanks a lot in advance!
I just discovered that the reason for the abrupt exit was that the inputArray I passed to the Spatial Pooler’s compute method was a list instead of a numpy.array.
Now it doesn’t exit, so my problem is solved. However, I still have the question about the upper limit of the input vector, as I will need to make it longer in the future (I’m just testing a part of it). Probably there is no such upper limit (apart from the obvious RAM limit).
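To get a feel for how the RAM limit scales with input length, here is a rough back-of-envelope sketch. It is not how nupic actually accounts for memory: the function name is hypothetical, and both `potential_pct` (the fraction of inputs each column can connect to) and `bytes_per_synapse` are assumed values chosen only for illustration. The one thing it shows reliably is that, with everything else fixed, the estimate grows linearly with the number of input bits.

```python
def estimate_synapse_memory_gb(num_inputs, num_columns, potential_pct, bytes_per_synapse):
    """Rough size in GiB of storing one value per potential synapse.

    Assumes every column has potential synapses to a fixed fraction
    (potential_pct) of all input bits. Real implementations add overhead.
    """
    synapses = num_inputs * num_columns * potential_pct
    return synapses * bytes_per_synapse / float(1024 ** 3)


# Input sizes and column count come from the thread;
# potential_pct=0.5 and bytes_per_synapse=8 are assumptions.
for num_inputs in (10000, 42000, 170828):
    gb = estimate_synapse_memory_gb(num_inputs, 10000,
                                    potential_pct=0.5, bytes_per_synapse=8)
    print("%d input bits -> roughly %.1f GiB" % (num_inputs, gb))
```

Under these assumptions, halving the input length halves the estimate, which is why shrinking the encoding is the first thing to try.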
That is the largest input encoding I’ve ever heard of. I think you should brainstorm about ways to decrease it. Why is it so large?
Well, I’m trying to use the HTM model for Natural Language Processing (using the Spanish Wikipedia). The encoder tries to make a semantic representation of words, and that’s the reason why it is so large.
Cortical.IO creates semantic representations of words, and they create 4,000 bits per word / phrase / text block (typically). How many words are represented in the 170,000 bits? Please tell me more about the encoding so I can help.
I have also written my own implementation of semantic folding which creates word SDRs by crawling the English Wikipedia, so I can provide my own insights.
Cortical.IO’s implementation places similar contexts physically closer to each other on a 2D grid, so they should be able to do simple scaling math and round the coordinates to produce smaller SDRs that preserve the semantics.
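A minimal sketch of that scaling idea, assuming an SDR is stored as a list of active (row, col) coordinates on a 2D grid. The function name and the sample coordinates are hypothetical, and integer division stands in for the rounding step; this is not Cortical.IO’s actual code.

```python
def downscale_sdr(active_coords, factor):
    """Map each active (row, col) onto a grid that is `factor` times smaller.

    Neighbouring bits collapse into the same cell, so contexts that were
    close together on the large grid stay close (or merge) on the small one.
    """
    return sorted({(row // factor, col // factor) for row, col in active_coords})


# Hypothetical active bits on a large grid, downscaled by a factor of 2.
large = [(0, 0), (1, 1), (5, 7), (100, 64)]
small = downscale_sdr(large, 2)
print(small)
```

Note that bits which merge into one cell reduce the number of active bits, so the sparsity of the smaller SDR can differ from the original.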
In my implementation, I haven’t yet figured out an efficient method for getting similar contexts close to each other on a 2D grid, so my word SDRs do not have any topology (that isn’t something I have needed yet). I instead do a random sub-sampling of the large SDRs to create smaller working SDRs. Given the properties of SDRs, that is enough to also preserve semantics.
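Roughly, the sub-sampling can be sketched like this. It is an illustration rather than my actual implementation: the function names, sizes, and seed are made up for the example. The key point is that the same random set of bit positions is kept for every word, so bits shared by two large SDRs are still shared in the small ones.

```python
import random

LARGE_SIZE = 170828   # size of the full SDR (figure from this thread)
SMALL_SIZE = 10000    # size of the working SDR (assumed for the example)


def make_subsample_map(large_size, small_size, seed=42):
    """Pick a fixed random subset of bit positions and renumber them 0..small_size-1."""
    rng = random.Random(seed)
    kept = rng.sample(range(large_size), small_size)
    return {old: new for new, old in enumerate(sorted(kept))}


def subsample(active_bits, index_map):
    """Keep only the active bits that fall on sampled positions."""
    return sorted(index_map[b] for b in active_bits if b in index_map)


index_map = make_subsample_map(LARGE_SIZE, SMALL_SIZE)

# Two made-up word SDRs that share exactly 1000 of their 2000 active bits.
rng = random.Random(0)
bits = rng.sample(range(LARGE_SIZE), 3000)
word_a, word_b = bits[:2000], bits[1000:]

small_a = subsample(word_a, index_map)
small_b = subsample(word_b, index_map)
print("overlap before: %d" % len(set(word_a) & set(word_b)))
print("overlap after:  %d" % len(set(small_a) & set(small_b)))
```

The absolute overlap shrinks along with the SDR, but on average it shrinks by the same ratio as the number of active bits, so relative similarity between words is kept.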
I have read about the Cortical.IO implementation, but I didn’t know how many bits per word they were using. Those 170,000 bits would be for a single word.
In fact, technically I only need one fourth of that number (42,000), but I was testing with 4 times that number in case it improved the results. I could also cut that number down, probably to 10,000 or so. But for now I’m just testing the results and am not so worried about performance yet.
I don’t have topology either, as @Paul_Lamb writes in his message.
This is interesting, thanks for doing it. Please let us know what you find out with your testing. These types of tests are important to help define the system.
What do you think is missing without input topology? Do you even think it is necessary?