How can I encode data with a large number of categories?

I know I can use CategoryEncoders, but I have a big dataset (network logs) to encode. I have to cluster the logs into classes (maybe 1000 classes or more) and give each class a numeric ID, so that I can encode each class into an SDR. The question is: how can I decide the size of the SDR (the number of bytes) and the size of the buckets? Thanks a lot!

Just to be sure, the input is temporal, not spatial, yes? You are dealing with a stream of data, not just a spatial classification task?

That is a lot of classes. Do you want them to have semantic similarity? Using the category encoder you won’t get any. Ideally encoder representations carry semantic meaning, but either way I would not encode one category into each bit of a 1000-bit space, because then there is basically only one bit of difference between any two categories. I suggest you encode categories with a configurable w value for the width of the encoding in bits. But this makes your encoding very large. For example, if you gave each class 10 bits you would have a 10,000-bit input space, which is pretty large.
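To make the size arithmetic concrete, here is a minimal sketch of a non-overlapping category encoding in plain Python. The function names `encode_category` and `overlap` are hypothetical and are not part of any HTM library; the sketch only assumes that each class gets its own block of `w` consecutive bits, so the total input size is `num_classes * w`.

```python
def encode_category(class_id, num_classes, w):
    """Return a list of 0/1 bits where class_id owns a block of w active bits.

    Hypothetical helper for illustration; not an API of any HTM library.
    """
    if not 0 <= class_id < num_classes:
        raise ValueError("class_id out of range")
    bits = [0] * (num_classes * w)      # total size grows linearly with w
    start = class_id * w                # first bit of this class's block
    for i in range(start, start + w):
        bits[i] = 1
    return bits


def overlap(a, b):
    """Count the bits that are active in both encodings."""
    return sum(x & y for x, y in zip(a, b))


sdr_a = encode_category(3, num_classes=1000, w=10)
sdr_b = encode_category(4, num_classes=1000, w=10)

print(len(sdr_a))             # 1000 classes * 10 bits = 10000
print(sum(sdr_a))             # 10 active bits per class
print(overlap(sdr_a, sdr_b))  # 0: different classes share no bits
```

With this layout two different classes never share a bit, so the encoding carries no semantic similarity, which matches what the category encoder gives you.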


Yes, the input is temporal and ordered: it is a sequence of (clustered) data classes. I want to predict the next data class using HTM after it has learned the sequences. I know that HTM can learn and strengthen the relations between different data. I can feed HTM a data SDR at runtime, and HTM gives me a prediction SDR. Is this good practice? Thanks!

No, there is no obvious semantic similarity between two data classes, but the data points within the same class show high similarity.

Yes, it is a problem. I can merge the classes into 300 classes. Is that OK?

If there’s no spatial overlap between any 2 inputs then the SP doesn’t really help right? It seems the category encoding vectors could go right into the TM. I think this setup allows you to replace the SDR Classifier process with a simple decoding, making the TM output simpler to interpret.
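As a rough illustration of what such a "simple decoding" could look like, the sketch below maps predicted column indices back to class IDs. It assumes a hypothetical layout in which class `k` owns columns `k*w` through `k*w + w - 1`; the function name `decode_prediction` and the `min_overlap` threshold are illustrative choices, not part of any HTM library.

```python
def decode_prediction(predicted_columns, w, min_overlap=None):
    """Map predicted column indices back to class IDs.

    Assumes (hypothetically) that class k owns columns k*w .. k*w + w - 1.
    A class is reported when at least min_overlap of its columns are predicted.
    """
    if min_overlap is None:
        min_overlap = w // 2 + 1        # default: more than half of the block
    counts = {}
    for col in predicted_columns:
        class_id = col // w             # which block this column belongs to
        counts[class_id] = counts.get(class_id, 0) + 1
    return sorted(c for c, n in counts.items() if n >= min_overlap)


# Class 3 fully predicted, class 7 mostly predicted, one stray column at 123.
predicted = list(range(30, 40)) + list(range(70, 78)) + [123]
print(decode_prediction(predicted, w=10))  # [3, 7]
```

Because the TM can predict several next classes at once, the decoder returns a list rather than a single ID.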


Category encoding vectors? Is there an existing implementation in Java, or any articles? Is it something like a multi-encoder?
If I want to implement an encoder in Java and replace the default encoders in the examples based on the, is it very complex? I am familiar with Java. How can I test the new encoder?

He is saying that you could take the encoding of your semantic data in the input space, skip the SP step entirely, and feed the encoding into the TM logic as if its active bits were the active minicolumns. If you have no semantic overlap between the data you are entering into the system, you generally don’t need to perform Spatial Pooling. The TM will figure out the sequence fine without the SP.


Thanks a lot!
I see now. I can configure the system to skip the SP. How can I do this? I assume there is a tutorial for this somewhere, yes?

You still need to visualize a cellular structure with minicolumns and cells in each, but instead of computing the SP to get a list of active minicolumns to send to the TM for computation, just use the encoder activation to get the active minicolumns. Ensure the encoder output size is the same number as the number of minicolumns the TM is configured with. Then you only have to compute the TM.
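The size check and the hand-off described above can be sketched as follows. This is plain Python and deliberately not tied to a specific HTM library; `encoding_to_active_columns` and `NUM_COLUMNS` are hypothetical names, and the column count of 3000 is only an example (300 classes at 10 bits each).

```python
def encoding_to_active_columns(encoding, num_columns):
    """Treat the active encoder bits as the active minicolumn indices.

    Raises ValueError when the encoder output size does not match the
    number of minicolumns the TM is configured with.
    """
    if len(encoding) != num_columns:
        raise ValueError(
            "encoder output has %d bits but the TM has %d columns"
            % (len(encoding), num_columns)
        )
    return [i for i, bit in enumerate(encoding) if bit]


NUM_COLUMNS = 3000                      # example: 300 classes * 10 bits each

# Example encoder output with bits 50..59 active (one class's block).
encoding = [0] * NUM_COLUMNS
for i in range(50, 60):
    encoding[i] = 1

active_columns = encoding_to_active_columns(encoding, NUM_COLUMNS)
print(active_columns)  # [50, 51, 52, 53, 54, 55, 56, 57, 58, 59]

# These indices would then be handed to the TM's compute step in place of
# the SP output. The exact call depends on the implementation you use.
```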

I know I have some code somewhere that uses Cortical IO SDRs and passes them directly into the TM… let me look… here is the video where Subutai passes SDRs from Cortical IO into the TM. You can find all the code here, although I have not tried running this in years. Here is the part where a raw SDR is fed into the TM:
