How to encode categorical data using CategoryEncoder

I am trying to use CategoryEncoder for categorical data for prediction.

In my case, I predict value based on caller and value.
The CSV header and sample data are as follows, where caller is categorical and is flagged with C in the header:

func,caller,callee,timestamp,value
string,string,string,datetime,float
,C,,T,
ord_IInvQueryCSV_funcQ:127.0.1.1,funcK,com.gyl.scm.center.query.service.impl.InvQueryCSVImpl.funcQ:127.0.1.1,2018-01-01 03:26:13.960,48

Here is the corresponding encoder section for the categorical field caller, taken from the model params (JSON format):

            "caller": {
                "name": "caller",
                "fieldname": "caller",
                "w": 21,
                "categoryList": ["funcK","funcQ","funcM"],
                "type": "CategoryEncoder"
            }
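To illustrate what an encoder with these parameters is expected to produce, here is a minimal pure-Python sketch. It is not NuPIC's implementation. It assumes that each category gets its own non-overlapping block of `w` active bits, and that an extra first block is reserved for values missing from `categoryList`. The function name `encode_category` is hypothetical.

```python
def encode_category(value, category_list, w):
    """Return a list of 0/1 bits with one block of w active bits per category.

    Assumption: block 0 is reserved for values not found in category_list.
    """
    n_blocks = len(category_list) + 1      # +1 for the "unknown" block
    bits = [0] * (n_blocks * w)
    try:
        block = category_list.index(value) + 1
    except ValueError:
        block = 0                          # unknown category
    start = block * w
    for i in range(start, start + w):
        bits[i] = 1
    return bits


categories = ["funcK", "funcQ", "funcM"]
encoded = encode_category("funcK", categories, w=21)
print(len(encoded), sum(encoded))
```

Under these assumptions, two different categories never share an active bit, so the model sees them as unrelated inputs.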

The program fails with an error when I run it. However, the error disappears when the special flag C is not set in the CSV header. I am not sure whether the caller field is still involved in prediction without the C flag.

Could anyone tell me the right usage of categorical data?

Thanks

What is the error you mentioned?

    data_source = FileRecordStream(streamID=input_path)
  File "/usr/local/lib/python2.7/dist-packages/nupic/data/file_record_stream.py", line 249, in __init__
    FieldMetaType.integer)
AssertionError

That’s a strange error. Can you take a look at this example and see if it helps?

That helps. It seems that the encoding of categorical data is set up in the user-defined code rather than done automatically by NuPIC, and the C flag does not have to be set in the CSV file either.

Those file headers are used for swarming. Other than that, some example programs might read them, but most just ignore them. Remember that you don’t need a CSV file to do this; you can just feed one row at a time from anywhere.
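As a rough sketch of feeding one row at a time, the snippet below parses the sample from the question into plain dicts. It assumes the file starts with three header rows (names, types, flags), as in the sample above. The helper name `iter_records` is hypothetical.

```python
import csv
from datetime import datetime

SAMPLE = """func,caller,callee,timestamp,value
string,string,string,datetime,float
,C,,T,
ord_IInvQueryCSV_funcQ:127.0.1.1,funcK,com.gyl.scm.center.query.service.impl.InvQueryCSVImpl.funcQ:127.0.1.1,2018-01-01 03:26:13.960,48
"""


def iter_records(lines, header_rows=3):
    """Yield one dict per data row, skipping the type and flag header rows."""
    reader = csv.reader(lines)
    names = next(reader)                   # first header row: field names
    for _ in range(header_rows - 1):       # remaining header rows: types, flags
        next(reader)
    for row in reader:
        if not row:
            continue                       # skip blank lines
        record = dict(zip(names, row))
        record["timestamp"] = datetime.strptime(
            record["timestamp"], "%Y-%m-%d %H:%M:%S.%f")
        record["value"] = float(record["value"])
        yield record


records = list(iter_records(SAMPLE.splitlines()))
print(records[0]["caller"], records[0]["value"])
```

Each record could then be handed to the model one at a time, for example with `model.run(record)` on an OPF model, assuming the encoder field names in the model params match the keys of the dict.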

Great!

@rhyolight Thanks for your prompt reply!

I have a bunch of functions, and the task is to predict the elapsed time of calls to these functions. A sample of the data is as follows:

function_name,timestamp,elapsed_time
funcA,2018-01-01 03:26:13.960,48
funcD,2018-01-01 04:23:16.187,51
funcB,2018-01-01 04:24:26.957,43
funcC,2018-01-01 04:25:17.428,27
funcA,2018-01-01 04:26:38.059,41
funcB,2018-01-01 04:26:19.097,31
funcC,2018-01-01 04:26:59.376,26

All the records for the different functions’ calls are put together in one data file. Is it possible to do the prediction for all of these functions in one model, or do I have to create a separate model for each function?

Thanks

You’ll likely have more success if you split it up into different models.
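One way to split a mixed stream into per-function models is to route each record by its function_name and create a model the first time a name is seen. The sketch below assumes each model exposes a `run(record)` method. `ModelRouter` and `LastValueModel` are hypothetical names, and `LastValueModel` is only a placeholder that returns the previously seen value, not an HTM model.

```python
class ModelRouter(object):
    """Keep one model per function name, created lazily on first use."""

    def __init__(self, model_factory):
        self._factory = model_factory      # callable: function_name -> model
        self._models = {}

    def model_for(self, function_name):
        if function_name not in self._models:
            self._models[function_name] = self._factory(function_name)
        return self._models[function_name]

    def run(self, record):
        model = self.model_for(record["function_name"])
        return model.run({"timestamp": record["timestamp"],
                          "elapsed_time": record["elapsed_time"]})


class LastValueModel(object):
    """Placeholder model: 'predicts' the previously seen elapsed_time."""

    def __init__(self):
        self._last = None

    def run(self, record):
        prediction = self._last
        self._last = record["elapsed_time"]
        return prediction


rows = [
    ("funcA", "2018-01-01 03:26:13.960", 48.0),
    ("funcD", "2018-01-01 04:23:16.187", 51.0),
    ("funcB", "2018-01-01 04:24:26.957", 43.0),
    ("funcA", "2018-01-01 04:26:38.059", 41.0),
]
router = ModelRouter(lambda name: LastValueModel())
predictions = [
    router.run({"function_name": f, "timestamp": t, "elapsed_time": v})
    for f, t, v in rows
]
print(predictions)
```

With a real model, the factory would build it from a shared set of model params, so that only the per-function state differs.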

The problem is that there are probably hundreds or thousands of functions. That’s a big issue. Any advice? Thanks

BTW, I found an interesting phenomenon: the anomalyScore is sometimes large even though the difference between the actual value and the predicted value is small. How can that be explained?

See: Clarification on anomaly score and anomalylikelihood
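For context, the raw anomaly score in HTM is, as far as I understand it, derived from how many of the currently active columns were predicted at the previous step, not from the numeric gap between the actual and predicted values. A minimal sketch of that idea follows. The function name `raw_anomaly_score` is hypothetical, and returning 0.0 when there are no active columns is an assumption.

```python
def raw_anomaly_score(active_columns, prev_predicted_columns):
    """Fraction of currently active columns that were not predicted."""
    active = set(active_columns)
    if not active:
        return 0.0                         # assumption: nothing active, no anomaly
    predicted = set(prev_predicted_columns)
    overlap = len(active & predicted)
    return 1.0 - overlap / float(len(active))


# Half of the active columns were predicted.
print(raw_anomaly_score([1, 2, 3, 4], [1, 2]))
```

Under this definition, a close numeric prediction can coexist with a high score when the input arrives in a context the model did not expect.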