How to encode categorical data using CategoryEncoder

white · March 20, 2018, 9:46am

I am trying to use CategoryEncoder to encode categorical data for prediction.

In my case, I predict value based on caller and value.
The CSV header and sample data are as follows, where caller is categorical and is flagged with C in the header:

func,caller,callee,timestamp,value
string,string,string,datetime,float
,C,,T,
ord_IInvQueryCSV_funcQ:127.0.1.1,funcK,com.gyl.scm.center.query.service.impl.InvQueryCSVImpl.funcQ:127.0.1.1,2018-01-01 03:26:13.960,48

Here is the corresponding encoder definition for the categorical field caller, taken from the model params (JSON format):

            "caller": {
                "name": "caller",
                "fieldname": "caller",
                "w": 21,
                "categoryList": ["funcK","funcQ","funcM"],
                "type": "CategoryEncoder"
            }

Running the program raises an error. However, the error disappears when the special flag C is not set in the CSV header. I am not sure whether the caller field is still used in prediction without the C flag.

Could anyone tell me the right usage of categorical data?

Thanks

rhyolight · March 20, 2018, 1:36pm

What is the error you mentioned?

white · March 20, 2018, 2:35pm

    data_source = FileRecordStream(streamID=input_path)
  File "/usr/local/lib/python2.7/dist-packages/nupic/data/file_record_stream.py", line 249, in __init__
    FieldMetaType.integer)
AssertionError

rhyolight · March 20, 2018, 2:38pm

That’s a strange error. Can you take a look at this example and see if it helps?

white · March 20, 2018, 3:22pm

That helps. It seems that the encoding of categorical data is set up in the user-defined snippet rather than automatically by nupic, and the C flag does not have to be set in the CSV file either.
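To make the idea concrete, here is a minimal pure-Python sketch of what a category encoder conceptually does with `w` and `categoryList`: each category gets its own block of `w` active bits, and one extra block is reserved for unknown values. The function name `encode_category` and the exact bit layout are assumptions for illustration; this is not nupic's implementation.

```python
def encode_category(value, category_list, w=21):
    """Return a list of 0/1 bits with one block of w active bits per category."""
    # Slot 0 is reserved for values that are not in category_list (assumption).
    try:
        slot = category_list.index(value) + 1
    except ValueError:
        slot = 0

    # One block of w bits per known category, plus one block for "unknown".
    n = w * (len(category_list) + 1)
    bits = [0] * n
    start = slot * w
    for i in range(start, start + w):
        bits[i] = 1
    return bits


categories = ["funcK", "funcQ", "funcM"]

# Small w so the layout is easy to read: funcQ is the second category,
# so it lands in the third block (after "unknown" and funcK).
sdr = encode_category("funcQ", categories, w=3)
# sdr == [0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0]
```

With the settings from the original post (`w` of 21 and three categories), this layout gives 84 bits, 21 of them active, and no overlap between different categories.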

rhyolight · March 20, 2018, 4:00pm

Those file headers are used for swarming. Other than that, some example programs might read them, but most just ignore them. Remember that you don’t need a CSV file to do this; you can just feed one row at a time from anywhere.
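Feeding one row at a time could look roughly like the sketch below. It uses only the standard library, skips the three header rows (names, types, flags) from the original post, and converts each data row into a dict. The helper name `iter_records` is hypothetical, and the final step of passing each dict to a model (for example `model.run(record)` on an OPF model created elsewhere) is an assumption that is left out so the snippet runs without nupic.

```python
import csv
from datetime import datetime

HEADER_ROWS = 3  # field names, field types, special flags


def iter_records(lines):
    """Yield one dict per data row, with timestamp and value converted."""
    reader = csv.reader(lines)
    names = next(reader)  # first header row holds the field names
    for _ in range(HEADER_ROWS - 1):
        next(reader)  # skip the type row and the flag row
    for row in reader:
        record = dict(zip(names, row))
        record["timestamp"] = datetime.strptime(
            record["timestamp"], "%Y-%m-%d %H:%M:%S.%f")
        record["value"] = float(record["value"])
        yield record


# The same rows as in the original post; any iterable of lines works,
# including an open file object.
sample = [
    "func,caller,callee,timestamp,value",
    "string,string,string,datetime,float",
    ",C,,T,",
    "ord_IInvQueryCSV_funcQ:127.0.1.1,funcK,"
    "com.gyl.scm.center.query.service.impl.InvQueryCSVImpl.funcQ:127.0.1.1,"
    "2018-01-01 03:26:13.960,48",
]

records = list(iter_records(sample))
# Each record would then be handed to the model one at a time.
```

Because the flag row is simply skipped, the C flag plays no role here; the category handling comes entirely from the encoder definition in the model params.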

white · March 21, 2018, 1:04am

Great!

white · March 21, 2018, 6:22am

@rhyolight Thanks for your prompt reply!

I have a bunch of functions, and the task is to predict the elapsed time of calls to these functions. A sample of the data is as follows:

function_name,timestamp,elapsed_time
funcA,2018-01-01 03:26:13.960,48
funcD,2018-01-01 04:23:16.187,51
funcB,2018-01-01 04:24:26.957,43
funcC,2018-01-01 04:25:17.428,27
funcA,2018-01-01 04:26:38.059,41
funcB,2018-01-01 04:26:19.097,31
funcC,2018-01-01 04:26:59.376,26

As you can see, the records for calls to different functions are all put together in one data file. Is it possible to do the prediction for all of these functions in one model, or do I have to create one model per function?

Thanks

rhyolight · March 21, 2018, 3:40pm

You’ll likely have more success if you split it up into different models.
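One way to split the data without pre-splitting the file is to route each record to a model keyed by function name, creating models lazily the first time a function is seen. The sketch below assumes this layout; `ModelRouter` and `RunningMeanModel` are hypothetical names, and the running-mean class is only a stand-in so the example runs without nupic. In practice the factory would build a real model instead.

```python
class RunningMeanModel(object):
    """Placeholder model: predicts the mean of the values seen so far."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def run(self, value):
        # Predict before learning from the new value.
        prediction = self.total / self.count if self.count else None
        self.count += 1
        self.total += value
        return prediction


class ModelRouter(object):
    """Keep one model per function name, created on first use."""

    def __init__(self, make_model):
        self._make_model = make_model
        self.models = {}

    def run(self, function_name, value):
        if function_name not in self.models:
            self.models[function_name] = self._make_model()
        return self.models[function_name].run(value)


router = ModelRouter(RunningMeanModel)

# (function_name, elapsed_time) pairs from the sample data above.
rows = [
    ("funcA", 48.0), ("funcD", 51.0), ("funcB", 43.0), ("funcC", 27.0),
    ("funcA", 41.0), ("funcB", 31.0), ("funcC", 26.0),
]
predictions = [router.run(name, value) for name, value in rows]
```

Each function then sees only its own history, so the interleaving of records in the file no longer matters.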

white · March 21, 2018, 11:54pm

The point is that there are probably hundreds or thousands of functions. That’s a big issue. Any advice? Thanks

BTW, I found an interesting phenomenon: sometimes the anomalyScore is large even though the difference between the actual value and the predicted value is small. How can this be explained?

rhyolight · March 22, 2018, 4:36pm

See: Clarification on anomaly score and anomalylikelihood
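As background, a minimal sketch of how a raw anomaly score can be computed, assuming it is defined as the fraction of currently active columns that were not predicted on the previous step. The function name `raw_anomaly_score` is chosen for illustration. Under this definition the score depends on column overlap, not on the numeric distance between actual and predicted values, which is one way a small numeric error can still come with a large score.

```python
def raw_anomaly_score(active_columns, predicted_columns):
    """Fraction of active columns that were not predicted (0.0 to 1.0)."""
    active = set(active_columns)
    if not active:
        # Nothing active, so nothing can be unexpected.
        return 0.0
    predicted = set(predicted_columns)
    overlap = len(active & predicted)
    return 1.0 - overlap / float(len(active))


# Half of the active columns (3 and 4) were predicted.
half = raw_anomaly_score([1, 2, 3, 4], [3, 4, 5, 6])

# None of the active columns were predicted: fully unexpected input.
full = raw_anomaly_score([1, 2, 3, 4], [7, 8, 9])
```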
