Performance optimization of

We were estimating the ability of to perform well in the condition when the data stream is very fast, but found out that in our case it was capable of processing just 100 records per second. For our real business cases it is definitely NOT enough, and if you can help me to find out what I am doing wrong - I will be very grateful.

We’ve made a POC application in Spark which read the data from Kafka and processed with mapWithState in order to enrich the obtained records with the anomaly scores. The dataset was quite noisy - the data was taken from - it was hourly ozone data for 2014-2016 years.
I’ve selected only those devices who has almost no missing values during the year (measurements are taken each hour). The file is located here:

Together with this I catched the flame graphs from the application using my method.

The SVG files of the flamegraphs can be found here:


I suggest to download them and open in Chrome - they are interactive.

First, we observed that fast serialization library works much worse than Kryo. But that was not a big deal for us making our beloved serialization (Kryo I mean).

The more important problem is this:

Half of computation time we spend in ArrayUtils. I investigated this code a little bit and I see there some work with java.lang.reflect.Array which is slow.

Obvious question to authors: can’t it really be fastened somehow?

1 Like

How large is your cellular network? How many columns are in the spatial pooler, and how many cells per column?

I’m sure it can be. @mrcslws spent considerable time speeding up the C++ and Python parts of NuPIC. I don’t think @cogmission was paying a lot of attention to speed when he was porting the code (correct me if I am wrong).

I’m sure it can be sped up too. I was paying attention at least to the point where I observed obvious rules of thumb for Java programming. There are some places where java.lang.reflect.Array is used due to having to “discover” array dimensions and construct arrays regardless of dimension - and there is a recursive function to do that on behalf of the SpatialPooling algorithm - I forget which method.

I am very surprised to hear that you have speed discrepancies with FST serialization - I chose FST because it WAS faster than Kryo for the most part, but also because there wasn’t any non-standard setup necessary (it is a drop-in replacement for standard Java serialization).

Anyway, the answer to your question is yes, I’m sure there is a good amount that can be done to increase its speed - one thing right off of the top is to use the UniversalRandom RNG instead of the others. However, as far as I know there are limitations to the algorithms for fast streaming to begin with?? (correct me if I’m wrong?)

Matt, David,

thank you very much for your answers and sorry for my long silence.

I’ve put my code at github - it will allow you to see what is going on. This is not the whole code which we have here (I mean not the whole Spark solution). I just left the code which is related to HTM,java performance and put it to a stand-alone console app.

Our MAIN concern when selecting this technology is the rate of 200 records/second which is NOT enough for us: our solution architect requested 10000 records per second.

So the question is: am I doing something wrong in my code? Or this is the best achievable result?

As to serialization: Kryo is really faster, believe me. We did a benchmark and measured two kinds of serialization. But this is a different topic, I wouldn’t like to mix them here (sorry for doing it at the beginning). Actually I did not understand what you meant by " However, as far as I know there are limitations to the algorithms for fast streaming to begin with?? (correct me if I’m wrong?)". Feel free to write me directly to - I will answer all the questions.

Thanks in advance for your attention.



Not that HTM.Java has exactly the same performance, but do you know the current throughput of nupic.core in records/second?

Of course it depends on the computer, the data, and the model configuration, but the number we usually try to use as a baseline for developer laptops os 20ms per cycle, or 50 records per second.

But this is generally with some datetime encoding, which increases the time required. With sub-second records, you should not use datetime encoding as discussed in other parts of the forum. However I really have no idea how fast a one-field scalar anomaly detection model would be if we tried to run it bare-bones.



I’ll take a look at your code when I can and see if anything pops out at me. I prefer to keep the conversation on the forums so others can benefit from it and solutions can contribute to the overall evolution of all of HTM Theoretical pursuits. I would also engage with questions in the NuPIC forum for general algorithmic performance hints? HTM is young (as compared to regular ML), in terms of performance optimization - as there is proportionally more of a frontier in its overall theoretical development, leaving no time to prematurely optimize, at this point… So I wouldn’t get my hopes up on 10^3 level performance at this point. The use cases mostly point to inference quality over quantity at this time…


I did all my best to make a nupic equivalent:

but unfortunately I am getting a strange error when running this:

ERR:  Unable to create region tm of type TMRegion [C:\projects\nupic-core\src\nupic\engine\RegionImplFactory.cpp line 272]
Traceback (most recent call last):
  File "D:\DevTools\JetBrains\PyCharm Community Edition 2017.1.3\helpers\pydev\", line 1585, in <module>
globals =['file'], None, None, is_module)
  File "D:\DevTools\JetBrains\PyCharm Community Edition 2017.1.3\helpers\pydev\", line 1015, in run
pydev_imports.execfile(file, globals, locals)  # execute the script
  File "D:/Projects_HTM/AnomalyDetection/htm-benchmark/htm-python/", line 125, in <module>
net = create_network()
  File "D:/Projects_HTM/AnomalyDetection/htm-benchmark/htm-python/", line 87, in create_network
network.addRegion("tm", "py.TMRegion", json.dumps(_TM_PARAMS))
  File "C:\Anaconda2\lib\site-packages\nupic\engine\", line 639, in addRegion
engine_internal.Network.addRegion(self, name, nodeType, nodeParams)
  File "C:\Anaconda2\lib\site-packages\nupic\bindings\", line 1454, in addRegion
return _engine_internal.Network_addRegion(self, name, nodeType, nodeParams)
TypeError: __init__() takes at least 4 arguments (6 given)

Matt, could you please help me with this?

OS: Windows 10 x64
Python: Anaconda 4.4.0, fresh installation (just did it now), python 2.7.13
then launched pip install nupic
and got success:

Successfully installed DBUtils-1.1 PyMySQL-0.6.2 PyYAML-3.10 apipkg-1.4 asteval-0.9.1 coverage-3.7.1 execnet-1.4.1 mock-
1.0.1 nupic-1.0.2 nupic.bindings-1.0.0 ordereddict-1.1 prettytable-0.7.2 psutil-1.0.1 pyproj-1.9.3 pytest-cov-2.5.0 pyte
st-xdist-1.16.0 python-dateutil-2.1 unittest2-0.5.1 validictory-0.9.1

You are missing inputWidth in the TM params:

You calculated it in the SP Params, but you also need it here. I think you want to match this signature:

I commited the corrected python code.

Obtained results:

Python: 840 records/second
Java: 198 records/second


  1. I noticed that Python version doesn’t find anomalies with score > 0.9 (contraty to Java version) except from those which are found in the very beginning. Did I do something wrong in my python code?

  2. Python version is just 4 times faster than java. As I see by the nupic code, c++ implementation is used inside the python classes. Besides, in _SP_PARAMS and _TM_PARAMS I explicitly set “cpp”. Should it be just 4 times faster? Is really C++ implementation used in my case? Actually I expected much MORE robustness…

  3. Aren’t my network parameters “too huge” for the task of prediction of a measurement of just one variable?

Thanks in advance.



as to the

you have several places which use java.lang.reflect.Array.get().

Inspired by this

I made some dummy changes which gave me 10% performance boost. That were QUICK changes, so maybe if you could think a little bit more in this (or other) directions, you would be able to do even better.

public static void setValue(Object array, int value, int... indexes) {
    int len = indexes.length;
    if (len == 1) {
        ((int[])array)[indexes[0]] = value;
    else if (len == 2){
        ((int[][])array)[indexes[0]][indexes[1]] = value;
    else if (len == 3){
        ((int[][][])array)[indexes[0]][indexes[1]][indexes[2]] = value;
    else {
        int i = 0;
        Object arr = array;
        while (i < len - 1){
            arr = Array.get(arr, indexes[i]);
        ((int[])arr)[indexes[i]] = value;

public static Object fastGet(Object array, int index){
    Class<?> c = array.getClass();
    if (int[].class == c) {
        return ((int[])array)[index];
    } else if (float[].class == c) {
        return ((float[])array)[index];
    } else if (boolean[].class == c) {
        return ((boolean[])array)[index];
    } else if (char[].class == c) {
        return ((char[])array)[index];
    } else if (double[].class == c) {
        return ((double[])array)[index];
    } else if (long[].class == c) {
        return ((long[])array)[index];
    } else if (short[].class == c) {
        return ((short[])array)[index];
    } else if (byte[].class == c) {
        return ((byte[])array)[index];
    return ((Object[])array)[index];

public static Object fastGet(Object array, int index1, int index2){
    Class<?> c = array.getClass();
    if (int[][].class == c) {
        return ((int[][])array)[index1][index2];
    } else if (float[][].class == c) {
        return ((float[][])array)[index1][index2];
    } else if (boolean[][].class == c) {
        return ((boolean[][])array)[index1][index2];
    } else if (char[][].class == c) {
        return ((char[][])array)[index1][index2];
    } else if (double[][].class == c) {
        return ((double[][])array)[index1][index2];
    } else if (long[][].class == c) {
        return ((long[][])array)[index1][index2];
    } else if (short[][].class == c) {
        return ((short[][])array)[index1][index2];
    } else if (byte[][].class == c) {
        return ((byte[][])array)[index1][index2];
    return ((Object[][])array)[index1][index2];

public static Object fastGet(Object array, int index1, int index2, int index3){
    Class<?> c = array.getClass();
    if (int[][][].class == c) {
        return ((int[][][])array)[index1][index2][index3];
    } else if (float[][][].class == c) {
        return ((float[][][])array)[index1][index2][index3];
    } else if (boolean[][][].class == c) {
        return ((boolean[][][])array)[index1][index2][index3];
    } else if (char[][][].class == c) {
        return ((char[][][])array)[index1][index2][index3];
    } else if (double[][][].class == c) {
        return ((double[][][])array)[index1][index2][index3];
    } else if (long[][][].class == c) {
        return ((long[][][])array)[index1][index2][index3];
    } else if (short[][][].class == c) {
        return ((short[][][])array)[index1][index2][index3];
    } else if (byte[][][].class == c) {
        return ((byte[][][])array)[index1][index2][index3];
    return ((Object[][][])array)[index1][index2][index3];

 * Get <tt>value</tt> for <tt>array</tt> at specified position <tt>indexes</tt>
 * @param array
 * @param indexes
public static Object getValue(Object array, int... indexes) {
    int len = indexes.length;
    if (len == 1)
        return fastGet(array, indexes[0]);
    else if (len == 2){
        return fastGet(array, indexes[0], indexes[1]);
    else if (len == 3){
        return fastGet(array, indexes[0], indexes[1], indexes[2]);
        Object slice = array;
        for(int i = 0; i < len; i++) {
            slice = Array.get(slice, indexes[i]);
        return slice;


There are probably a few places where low hanging fruit is available, but my main goal has been to arrive at the most complete feature set possible. Last year I took 3 1/2 months off of my day job to find a nasty bug that turned out to be a functional item that didn’t even exist in HTM.Java’s CodeBase, after forcing me to rewrite the TemporalMemory three times and the SpatialPooler twice (and their corresponding tests) in an effort to trace floating point variations bit by bit. It was a grueling experience that left me in a moral debt to my day job. Fortunately, I am at the tail end of completing the first version of IRIS for

I’ve never done a formal performance pass over the code, mostly because NuPIC isn’t anywhere near complete and I didn’t want to do premature optimization. Right now the code’s main purpose is as a learning tool.

I have a number of things going on in my personal life (finally settling my parent’s estate next week), however I could look into it in a week’s time and perhaps provide an interim boost to the performance since you seem to have more of an immediate need? But I can’t do anything before that.

I hope that works for you?

David, thank you very much for your answer.

Of course this works.

Just for your information, I have two plans:

  1. short term plan: to make a demo of HTM on Tuesday, 29th of August - I am going to make a speech about HTM at the Big Data User Group Meetup. No need to rush with optimization till this event.

  2. long lasting strategic plan: to present to our architects and managers that HTM is a suitable technology for anomaly detection tasks. I am working as a Big Data software engineer, and I often do different POCs. During my last POC I did investigation of HTM and liked it very much, and I’ve already made one internal presentation for my colleagues. However, some of our architects told that “100 records per second is not enough - we need 10000 per second”. So, I wonder if we can achieve 100x performance boost.

1 Like

We will see, but my guess is that that goal is considerably outside of the performance band of the HTM Algorithm (as a non-hardware solution). No promises but I’ll do the best I can?

(Remember, HTM does a lot more each cycle then a typical machine language application, remember it actually changes dimension by adding and subtracting dendrites and synapses.)

@ibobak I have been following your Spark POC with interest - unlike HTM, this is an area I have a decent understanding of. :grinning:

First of all, I’m curious about the need for 10,000 records per second in an anomaly detection scenario, perhaps you could describe the scenario a bit more? If it’s a high frequency measurement from a single source, does the input change enough over time to warrant that frequency, or can it be pre-aggregated/filtered? Similarly, if it’s a convergence of measurements from a large number of sources, is processing all the records necessary?

Aside from your current needs and the optimisations you’ve identified already, it seems to me that HTM implementations will almost certainly need to scale beyond an SMP architecture at some point. I understand this is not a current engineering goal of Numenta, but if nobody else in the community has investigated this yet then I’d certainly be keen to start looking into it.

I’ve taken a look at this today, and I can see there’s been an Apache Flink implementation already:

You should be able to use the above as-is on Amazon EMR, to test out performance. If the cluster gets expensive you could look at the Flink Python API and use NuPIC.

I haven’t actually used Flink before, but I assume the limitation here would be that for each record processed, the state of the Network won’t reflect the full history of previous records, only the subset that have been delivered to the same node.