How can I encode data with a large number of categories?

flash59 · December 30, 2019, 8:41am

Hi,
I know I can use CategoryEncoders, but I have a big dataset (network logs) to encode. I have to cluster the records into classes (maybe 1000 classes or more) and give each class a numeric ID, so that I can encode each class into an SDR. The question is: how can I decide the size of the SDR (the number of bytes) and the size of the buckets? Thanks a lot!

rhyolight · December 30, 2019, 4:02pm

Just to be sure, the input is temporal, not spatial, yes? You are dealing with a stream of data, not just a spatial classification task?

That is a lot of classes. Do you want them to have semantic similarity? Using the category encoder you won’t get any. You want encoder representations to have semantic meaning, but I would not encode one category into each bit of a 1000-bit space, because then there is basically only a one-bit difference between any two categories. I suggest you encode each category with w bits, where w is a configurable value for the width of the encoding. But this makes your encoding very large. For example, if you gave each of 1000 classes 10 bits, you would have a 10,000-bit input space, which is pretty large.
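
To make the size arithmetic concrete, here is a minimal sketch in plain Python of a non-overlapping category encoding. It assumes each class owns its own contiguous block of w bits. The names `encoding_size` and `encode_category` are made up for this illustration and are not part of NuPIC or HTM.java.

```python
def encoding_size(num_classes, w):
    """Total number of bits when every class owns its own block of w bits."""
    return num_classes * w


def encode_category(class_id, num_classes, w):
    """Return the active bit indices for one class (hypothetical helper).

    Assumption: class i owns bits [i * w, (i + 1) * w), so two different
    classes never share an active bit (no semantic overlap).
    """
    if not 0 <= class_id < num_classes:
        raise ValueError("class_id out of range")
    start = class_id * w
    return list(range(start, start + w))


# 1000 classes at 10 bits each need a 10,000-bit input space.
print(encoding_size(1000, 10))
# Active bits for class 2: the third block of 10 bits.
print(encode_category(2, 1000, 10))
```

Lowering w or the number of classes shrinks the input space proportionally, at the cost of fewer active bits per class or coarser classes.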

flash59 · December 31, 2019, 2:21am

Yes, the input is temporal and ordered: it is a sequence of (clustered) data classes. I want to predict the next data class using HTM after it has learned the sequences. I know that HTM can learn and strengthen the relations between different data. At runtime I can feed HTM a data SDR and HTM gives me a prediction SDR. Is this good practice? Thanks!

flash59 · December 31, 2019, 2:39am

No, there is no obvious semantic similarity between two data classes, but the data points within the same class show high similarity.

Yes, it is a problem. I can merge the classes into 300 classes. Is that OK?

sheiser1 · December 31, 2019, 3:10am

If there’s no spatial overlap between any two inputs, then the SP doesn’t really help, right? It seems the category encoding vectors could go right into the TM. I think this setup allows you to replace the SDR Classifier process with a simple decoding, making the TM output simpler to interpret.

flash59 · December 31, 2019, 6:19am

Category encoding vectors? Is there an existing implementation in Java, or are there any articles? Is it something like the multi-encoder?
If I want to implement an encoder in Java and replace the default encoders in the examples based on HTM.java, is that very complex? I am familiar with Java. How can I test the new encoder?

rhyolight · December 31, 2019, 3:20pm

He is saying that you could take the encoding of your data in the input space, skip the SP step entirely, and feed the encoding into the TM logic as if its active bits were the active minicolumns. If you have no semantic overlap between the data you are entering into the system, you generally don’t need to perform Spatial Pooling. The TM will figure out the sequence fine without the SP.

flash59 · January 4, 2020, 1:46am

Thanks a lot!
I see now. I can configure the system to skip the SP. How can I do this? I think there is a tutorial for this somewhere, yes?

rhyolight · January 4, 2020, 3:20pm

You still need to visualize a cellular structure with minicolumns and cells in each, but instead of computing the SP to get a list of active minicolumns to send to the TM for computation, just use the encoder activation to get the active minicolumns. Ensure the encoder output size equals the number of minicolumns the TM is configured with. Then you only have to compute the TM.
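
As a rough sketch of that wiring: the code below assumes a block-per-class encoding (class i owns w consecutive bits) and a `tm` object that exposes `compute(activeColumns, learn=...)`, like the one in the snippet quoted in this thread. `active_columns` and `feed_sequence` are hypothetical names, and some TM implementations may expect a NumPy `uint32` array rather than a plain list.

```python
def active_columns(class_id, num_classes, w, column_count):
    """Use the encoder output directly as the active minicolumn indices."""
    # The encoder output size must match the TM's minicolumn count.
    if num_classes * w != column_count:
        raise ValueError("encoder output size must equal the TM's minicolumn count")
    if not 0 <= class_id < num_classes:
        raise ValueError("class_id out of range")
    start = class_id * w
    return list(range(start, start + w))


def feed_sequence(tm, class_ids, num_classes, w, column_count, learn=True):
    """Feed each class to the TM in order, with no Spatial Pooler step."""
    for class_id in class_ids:
        columns = active_columns(class_id, num_classes, w, column_count)
        tm.compute(columns, learn=learn)
```

With 300 classes and w = 10, for example, the TM would be configured with 3,000 minicolumns.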

I know I have some code somewhere that uses Cortical IO SDRs and passes them directly into the TM… let me look… Here is the video where Subutai passes SDRs from Cortical IO into the TM. You can find all the code here, although I have not tried running this in years. Here is the part where a raw SDR is fed into the TM:

github.com

numenta/nupic.nlp-examples/blob/9feef7c06a1688c819229716fea6c6c6f977215d/nupic_nlp/nupic_words.py#L41-L48


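def feed(self, sdr, learn=True):
  # Requires `import numpy` at module level.
  tm = self.tm
  # Pass the raw SDR straight to the TM as its active minicolumns.
  narr = numpy.array(sdr, dtype="uint32")
  tm.compute(narr, learn=learn)
  predictiveCells = tm.getPredictiveCells()
  # Integer-divide each cell index by the cells per column to get the
  # indices of the predictive minicolumns (// also works on Python 3).
  return numpy.unique(numpy.array(predictiveCells) // tm.getCellsPerColumn())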
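
If each class owns its own block of w minicolumns (an assumption about the encoder, not something the linked code does), the predictive minicolumn indices returned above can be mapped back to class IDs by integer division. That is one possible form of the "simple decoding" mentioned earlier in the thread. `decode_predicted_columns` is a hypothetical helper:

```python
from collections import Counter


def decode_predicted_columns(predicted_columns, w, min_overlap=1):
    """Rank class IDs by how many of their w minicolumns are predicted.

    Assumption: class i owns minicolumns [i * w, (i + 1) * w).
    Returns (class_id, overlap) pairs, highest overlap first.
    """
    counts = Counter(int(col) // w for col in predicted_columns)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(class_id, n) for class_id, n in ranked if n >= min_overlap]
```

A result with several classes suggests the TM is predicting more than one possible next class. Raising `min_overlap` filters out classes that have only a few predicted minicolumns.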
