We were evaluating whether HTM.java can keep up with a very fast data stream, but found that in our case it could process only about 100 records per second. For our real business cases that is definitely NOT enough, so if you can help me find out what I am doing wrong, I will be very grateful.
We built a POC application in Spark that reads data from Kafka and processes it with mapWithState in order to enrich the records with anomaly scores. The dataset is quite noisy: it is hourly ozone data for 2014-2016, taken from https://aqsdr1.epa.gov/aqsweb/aqstmp/airdata/download_files.html.
I selected only those devices that have almost no missing values over the year (measurements are taken every hour). The file is located here:
Together with this, I captured flame graphs from the application using my method.
The SVG files of the flamegraphs can be found here:
and
I suggest downloading them and opening them in Chrome, since they are interactive.
First, we observed that the fast-serialization library performs much worse than Kryo. That was not a big deal for us, though: we just used our beloved serialization library (Kryo, I mean) instead.
The more important problem is this:
We spend half of the HTM.java computation time in ArrayUtils. I looked into this code a little and saw that it does some work with java.lang.reflect.Array, which is slow.
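To illustrate the kind of overhead I mean, here is a minimal sketch that sums the same int[] once with direct indexing and once through java.lang.reflect.Array. It is not taken from HTM.java: the class and method names are made up for illustration, and the timing loop is only a crude check, not a proper benchmark (a tool like JMH would be needed for reliable numbers).

```java
import java.lang.reflect.Array;

// Hypothetical illustration only; none of these names come from HTM.java.
class ReflectiveAccessSketch {

    // Builds a test array containing 0, 1, 2, ..., n-1.
    static int[] makeData(int n) {
        int[] data = new int[n];
        for (int i = 0; i < n; i++) {
            data[i] = i;
        }
        return data;
    }

    // Sums the array with plain, statically typed indexing.
    static long sumDirect(int[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    // Sums the same array through the reflective API, which receives
    // the array as a plain Object and has to check its type on each call.
    static long sumReflective(Object array) {
        long sum = 0;
        int length = Array.getLength(array);
        for (int i = 0; i < length; i++) {
            sum += Array.getInt(array, i);
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = makeData(1_000_000);
        // Repeat a few rounds so the JIT has a chance to warm up.
        for (int round = 0; round < 5; round++) {
            long t0 = System.nanoTime();
            long direct = sumDirect(data);
            long t1 = System.nanoTime();
            long reflective = sumReflective(data);
            long t2 = System.nanoTime();
            System.out.printf("round %d: direct=%d us, reflective=%d us, equal=%b%n",
                    round, (t1 - t0) / 1_000, (t2 - t1) / 1_000, direct == reflective);
        }
    }
}
```

Both methods return the same sum, so any difference in the printed timings comes only from the way the elements are accessed. How large the gap is will depend on the JVM version and on how well the JIT handles the reflective calls.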
An obvious question to the authors: can't this be sped up somehow?