Thanks @mrcslws, I would like this twice if I could.
Hi,
Just some initial responses to some of your comments/queries…
Thanks, I realize that deep structural changes in internal representations are not the kind of “tweak” that is going to happen on an ongoing basis - so I’m good with just getting the next PR and capping it off at that.
-
I determined the SP state deviations (not really discrepancies) by monitoring and comparing each and every call to the RNG, and by outputting traces of the mapPotential array, the mapColumn() return values, and the selected indices for each column’s pool. From that I determined that the initialization, when using the UniversalRandom for both versions, was exact.
This determination is (obviously) independent of data because the SP’s initialization requires no data.
-
The finding that operation is equivalent for the first 42 lines was not intended to be a blanket assertion applicable across all circumstances. I compared the overlaps, the boosted overlaps, and the active columns, and like I said, it was with one specific NAB file - but it does give a sense of the consistency between the two versions, especially given that, as I said, the unit tests are exactly the same as well.
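For what it’s worth, pinning down the first point of divergence doesn’t require anything fancy - it’s basically a line-by-line diff of the two trace dumps. A minimal sketch (the file names are just placeholders for whatever traces get dumped from each side):

```python
# Minimal sketch: find the first line where two trace dumps (RNG calls, overlaps,
# active columns, etc.) diverge. The file names below are placeholders.
def first_diff(path_a, path_b):
    with open(path_a) as fa, open(path_b) as fb:
        for lineno, (a, b) in enumerate(zip(fa, fb), start=1):
            if a.rstrip("\n") != b.rstrip("\n"):
                return lineno, a.strip(), b.strip()
    return None  # identical, at least up to the length of the shorter file

print(first_diff("python_sp_trace.txt", "java_sp_trace.txt"))
```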
-
Unlike the TM, the SP makes absolutely no calls to the RNG during the compute() function (only during initialization), so floating point calculations are the only thing left to conclude as causing the increasing deviations. Especially given that the column activations are exactly the same at first and then start to deviate slowly.
The differences at line 42 are minimal (like one active column, maybe two). But the differences in which columns become activated start to accumulate from there due to the cumulative floating point differences. (Logically, if all the calls to the RNG result in the same output during initialization, and the basic algorithms are the same, which is verified by the unit tests, then the fp differences resulting from the increment/decrement and multiplication operations are all that’s left.) I determined this by looking at the differences in the calculations and the floating point arrays/lists used as additive and multiplicative arguments during exhaustive line-by-line debugging. It’s very clear from looking at the floating point numbers that their scale (number of digits to the right of the decimal point) is different between the two languages, and that they do vary.
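Whether the root cause is precision, library math, or just the order of operations, the mechanism is the same: tiny per-step differences compound. A toy illustration of how quickly that becomes visible (this is not HTM code, just the general effect, shown here as 32-bit vs 64-bit accumulation):

```python
import numpy as np

# Toy illustration only (not HTM code): the same repeated update carried out at
# 32-bit vs 64-bit precision drifts apart, and the drift compounds with each step.
x32 = np.float32(0.0)
x64 = 0.0  # Python floats are 64-bit
for _ in range(100000):
    x32 += np.float32(0.1)
    x64 += 0.1
print(x32, x64, abs(float(x32) - x64))  # the two sums visibly disagree
```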
No. I didn’t investigate that. The precise point of deviation being iteration 42 is not (I felt) really that important, as it’s understood that differences in input (from using different NAB data files), exciting different columns, could potentially result in deviations either earlier or later. I feel the main point is that initialization results in exactly the same pools and initial permanences, and that the active columns during processing are identical at first and then start to deviate, as expected, in a manner consistent with fp differences.
Marcus is not. But with major internal differences, I would like to get the latest if at all possible. I’m as anxious as everyone else (if not more) to finally validate the NAB results, but I’m trying to be patient and reach a good stopping point, and I ask that everyone else try to be patient as well.
I’m kind of unclear what the form of these “links” between NuPIC and HTM.Java will be. So far with the current version there have been 2 or 3 major structural overhauls of the TM, but they’re all occurring within the same release. So as it stands, I can’t really link to the current release until there is a release and the code freezes for a particular version, right? Otherwise, how do I “link” builds if I were going to do that? I could just document that HTM.Java version “X” is correlated to a particular NuPIC commit hash? Not sure how to treat this…
@alavin @mrcslws @lscheinkman @rhyolight
Status:
So while I was waiting for the new PR to be merged, I configured @lscheinkman’s NAB branch to run the locally deployed new version of HTM.Java and I got terrible scores:
Running score normalization step
Final score for 'htmjava' detector on 'reward_low_FP_rate' profile = -1.76
Final score for 'htmjava' detector on 'reward_low_FN_rate' profile = 5.28
Final score for 'htmjava' detector on 'standard' profile = 1.46
So far I’ve checked and proven:
- That the Network API has zero impact on the output because running both a raw assembly and the NAPI yields exactly the same results.
- That the new Java TM outputs exactly the same output as the Python version.
- That the new Java SP is initialized exactly the same as the Python version and initially runs yielding identical output (with the output then deviating gradually due to floating point differences).
… so my intuition is saying that the problem is somewhere around the NAB’s integration of the Java process, the NAB detector, the configuration, or the NAB assembly. But I’m going to check the processing by doing NAB-like calculations outside of the NAB infrastructure to first verify that I can get good anomaly results outside the NAB. (NOTE: I’ve proven that I get the exact same Anomaly Scores as the Python TM using pre-computed SP output - so now I have to check the combination of the SP and TM.)
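(For clarity, the raw Anomaly Score I keep referring to boils down to something like the following sketch - the function name is just for illustration - i.e. the fraction of currently active columns that were not predicted on the previous step:)

```python
import numpy as np

# Sketch of the raw anomaly score: the fraction of currently active columns that
# were NOT among the previously predicted columns.
def raw_anomaly_score(active_columns, prev_predicted_columns):
    active = np.asarray(active_columns, dtype=int)       # ints, not strings
    predicted = np.asarray(prev_predicted_columns, dtype=int)
    if active.size == 0:
        return 0.0
    overlap = np.in1d(active, predicted).sum()
    return (active.size - overlap) / float(active.size)
```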
So now I need to investigate:
- The combination of the SP and TM (as mentioned above, using the raw test harness). Scores should be in the same ballpark.
- Whether the configurations are what’s intended.
Questions/Help I need:
- Does the NAB perform any special “warm-up” (i.e. letting the SP run a bit before engaging the TM?)
- Can someone help me by giving me the formula for at least the “reward_low_FP_rate” calculation - like tell me how I could calculate it for a single file and multiple files? My intention is to do this outside the NAB and try and see where the problem is…
Thanks for your help!
David
BTW, let’s talk about this after we’ve established a release process using our new CI system.
Okay, but we’re here debugging htm.java because consistent unit tests != exact algorithms, as shown by the NAB performances.
This doesn’t make sense to me.
I was suggesting using another data file b/c it’s always best to run tests on different data to validate assumptions. Running the tests on a different NAB data file is trivial, and will either confirm what you’ve found so far or uncover something new.
No
Please see the NAB documentation as to how scores are calculated:
- Home · numenta/NAB Wiki · GitHub
- [1510.03336] Evaluating Real-time Anomaly Detection Algorithms - the Numenta Anomaly Benchmark
It’s straightforward to run a data file (or multiple) outside of NAB and output a CSV with the (raw) anomaly scores for each record, and then compare that to the CSV that results from running NAB.
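Something like this sketch would do it (the file paths and column name are placeholders for whatever your runs produce):

```python
import csv

# Sketch: compare the raw anomaly scores from a standalone run against the CSV
# produced by the NAB run, and report the first record where they disagree.
def first_score_divergence(standalone_csv, nab_csv, col="anomaly_score", tol=1e-6):
    with open(standalone_csv) as f1, open(nab_csv) as f2:
        for i, (a, b) in enumerate(zip(csv.DictReader(f1), csv.DictReader(f2))):
            if abs(float(a[col]) - float(b[col])) > tol:
                return i, a[col], b[col]
    return None  # the two runs agree within tolerance

print(first_score_divergence("htmjava_standalone.csv", "htmjava_nab_results.csv"))
```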
Alex, please consider that comparing NAB to NAB includes the comparison of many discrete combinations of composite code. There’s the algorithm(s); the combination of the algorithms together; the framework which abstracts the running of that combination; the NAB detector, which communicates across language process boundaries by co-opting standard in/out; the string conversion within the NAB detector to convert the standard output to Python internal data types; and the NAB itself. I’m sure I’m leaving out many layers of “abstraction”…
My claim that the algorithm-vs.-algorithm (i.e. TemporalMemory-to-TemporalMemory) output is the same is perfectly sound and accurate. EDIT: This is also shown by the comparison of the 4031 lines of Python and Java TM output I posted above…
I will run both the TM and SP comparison using another data file (the effort of which far exceeds claims of triviality, because I have to read and write interim files and run/read them from different language test harnesses - not quite trivial - and also validate the RNG in both languages to make sure the input conditions are the same). Also, considering the large number of things that have to come together in order to get exact output, the assumption of “sameness” between the TMs is not, to me, an unfounded one. But I will work on it now and report back…
…also, when I do, please take the time to look over the comparisons (and see the large number of different outputs that have to align in order to get the exact same output), so that your instinct to dismiss my claims of sameness is backed by due diligence on your part?
I’m posting the previous comparison again here: Python, Java - so you can see what I mean. The outputs of both are huge, so you will have to download them in order to compare them (they are difficult to compare directly in the browser due to their size).
EDIT: By output interim files (above) I mean I have to run the Encoder and SP in Python, save the SP output, then run the saved SP output through the Python TM by itself and save that; then read the SP output into Java and pipe that through the Java TM and save that - then compare… A tad bit more than “trivial”…
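In rough pseudo-Python the round trip looks like this (run_sp and run_tm are hypothetical stand-ins for the real harness code on each side; the point is that both TMs are fed the exact same SP output):

```python
import csv

# Rough sketch of the interim-file round trip described above.
def save_sp_output(records, run_sp, path):
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for record in records:
            writer.writerow(run_sp(record))   # active column indices per record

def replay_through_tm(path, run_tm):
    scores = []
    with open(path) as f:
        for row in csv.reader(f):
            scores.append(run_tm([int(c) for c in row]))  # note: ints, not strings
    return scores
```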
I’m looking for a little partnership here? This is a gargantuan undertaking, so any little help that Numenta can give me to avoid having to lump on huge research tasks would be greatly appreciated? You have no idea how much work it was to get this far… so I would appreciate more than just pithy RTFM comments? Lol!
I’m not following this thread closely, but here’s what my debugging instinct tells me: it’s worth taking time to build a clear understanding of NAB. I think this would be a fruitful place to spend your time.
At some point, yes I agree - but right now I need to get to the bottom of this by taking the shortest route. I don’t need to understand the theory of combustion engines to find the leak in the fuel line?
<sidenote>
@cogmission I understand you were essentially roped into this task when we tried to get HTM.Java running in NAB, and I understand that this has been a thorn in your side for far too long now. I also know that you are doing a great deal of this during your non-paid time (having a full-time job as well).
You deserve to be commended for your diligence. The effort you’ve put into updating HTM.Java algorithms has been huge. You are doing your best to hit a moving target. Thank you for all your work on HTM.
</sidenote>
Anyway, back to the question at hand… Why are HTM.Java scores so low? There are a lot of moving parts from input data to NAB score.
- data file
- file parsing and formatting
- choosing model parameters
- input data I/O into HTM
- retrieval of anomaly score
- anomaly likelihood calculation (is this even being done?)
- communicating results to NAB
I’m thinking about other ways to compare HTM.Java and NuPIC to better identify where in the chain this problem is occurring. One way is to feed the same input data into NuPIC and HTM.Java models using the same model parameters for each (as much as possible*). By plotting predictions and/or anomaly scores for both, we should be able to learn more about the characteristics of any divergence. There is already an HTM.Java example for hotgym, but it is outdated. David is going to update it later tonight. I’ve already run this example on the exact same data set as NuPIC and compared the two.
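Once the example is updated, the comparison itself should be simple - roughly something like this sketch (file and column names are placeholders):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Sketch: overlay the anomaly scores from both implementations on the same data.
nupic = pd.read_csv("nupic_hotgym_output.csv")
htmjava = pd.read_csv("htmjava_hotgym_output.csv")

plt.plot(nupic["anomaly_score"], label="NuPIC")
plt.plot(htmjava["anomaly_score"], label="HTM.Java")
plt.xlabel("record")
plt.ylabel("anomaly score")
plt.legend()
plt.show()
```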
It is not fair to compare the two yet because HTM.Java needs to be updated first. But if there is a significant divergence between the plots of HTM.Java anomaly scores vs NuPIC anomaly scores, then it narrows the playing field:
1. data file
2. file parsing and formatting
3. choosing model parameters
4. input data I/O into HTM
5. retrieval of anomaly score
6. anomaly likelihood calculation (is this even being done?)
7. communicating results to NAB
I still need to ensure that the model parameters are as similar as possible to those used in NAB. We’ll see what I can find out tomorrow.
* We also need to identify how model parameter declaration is different between the two systems. @lscheinkman mentioned at some point that he did not see a complete parity between the two.
Thanks Matt! Also, Cortical.io (my bosses) should be commended for their support of HTM.Java, because they are allowing me to spend a lot of my “full-time-job time” on this NAB issue! So thanks, CIO! But yeah, I’m doing this on my own time too!
This is essentially what I am doing - using a raw assembly of the algorithms (the test harness you and @alavin helped prepare). So yes I am running the same data; using the same RNG, and isolating the code I am testing with extreme discrimination - each step of the way…
I’ll never argue with more testing help though! Thanks Matt! Whatever you can uncover will only be helpful to the process!
WTF!!!
@alavin, @mrcslws @rhyolight
numpy.in1d() has a bug!
CAUGHT IN THE ACT!
Here’s a photo of my debugging session for confirmation.
(btw, don’t be confused by the pic - line 47 hasn’t executed yet…)
Essentially, given the spatial pooler output (sparse form):
[‘624’, ‘626’, ‘657’, ‘699’, ‘708’, ‘711’, ‘726’, ‘731’, ‘741’, ‘753’, ‘756’, ‘763’, ‘770’, ‘772’, ‘789’, ‘799’, ‘811’, ‘814’, ‘843’, ‘846’, ‘1654’, ‘1657’, ‘1658’, ‘1673’, ‘1682’, ‘1691’, ‘1701’, ‘1704’, ‘1710’, ‘1713’, ‘1719’, ‘1724’, ‘1725’, ‘1726’, ‘1734’, ‘1749’, ‘1753’, ‘1768’, ‘1769’, ‘1827’]
…and the previous predicted columns:
[731, 753, 763, 777, 1657, 1662, 1673, 1691, 1713, 1719, 1750, 1786, 1827]
in1d().sum() should give you a count of “9” (731, 753, 763, 1657, 1673, 1691, 1713, 1719, 1827), but it is clearly showing a count of “0” in the debugger (when you look at the score that line 44 produced, it is zero).
the equation should be (40 - 9) / 40 == 0.775 !!!
Python gets 1.0 !!! ( 40 - 0) / 40 == 1.0 !!!
So again I say… WTF!!!
I think you’ll need to isolate this behavior in a short script and I’ll believe it then.
There’s no belief necessary - it’s right there plain as day! …and my file output corroborates it too! (see below)…
Look at record #2493 in both Python and Java…
Python:
--------------------------------------------
Record #: 2492
Raw Input: 2015-09-17 16:04:00, 194.0
TemporalMemory Input: [721, 731, 733, 741, 753, 756, 760, 763, 777, 799, 814, 846, 1657, 1658, 1673, 1675, 1682, 1691, 1701, 1704, 1710, 1713, 1719, 1720, 1724, 1725, 1726, 1729, 1734, 1737, 1739, 1740, 1749, 1750, 1753, 1769, 1783, 1788, 1810, 1827]
TemporalMemory prev. predicted: [650, 651, 702, 731, 741, 753, 755, 756, 761, 763, 766, 770, 777, 799, 814, 816, 846, 1394, 1441, 1445, 1464, 1485, 1516, 1520, 1542, 1574, 1599, 1613, 1623, 1634, 1638, 1639, 1641, 1645, 1648, 1649, 1652, 1657, 1658, 1667, 1669, 1675, 1678, 1681, 1691, 1701, 1704, 1705, 1713, 1720, 1726, 1740, 1741, 1749, 1767, 1769, 1798, 1818, 1827, 1872]
TemporalMemory active: [721, 731, 733, 741, 753, 756, 760, 763, 777, 799, 814, 846, 1657, 1658, 1673, 1675, 1682, 1691, 1701, 1704, 1710, 1713, 1719, 1720, 1724, 1725, 1726, 1729, 1734, 1737, 1739, 1740, 1749, 1750, 1753, 1769, 1783, 1788, 1810, 1827]
Anomaly Score: 0.45
--------------------------------------------
Record #: 2493
Raw Input: 2015-09-17 16:14:00, 245.0
TemporalMemory Input: [624, 626, 657, 699, 708, 711, 726, 731, 741, 753, 756, 763, 770, 772, 789, 799, 811, 814, 843, 846, 1654, 1657, 1658, 1673, 1682, 1691, 1701, 1704, 1710, 1713, 1719, 1724, 1725, 1726, 1734, 1749, 1753, 1768, 1769, 1827]
TemporalMemory prev. predicted: [731, 753, 763, 777, 1657, 1662, 1673, 1691, 1713, 1719, 1750, 1786, 1827]
TemporalMemory active: [624, 626, 657, 699, 708, 711, 726, 731, 741, 753, 756, 763, 770, 772, 789, 799, 811, 814, 843, 846, 1654, 1657, 1658, 1673, 1682, 1691, 1701, 1704, 1710, 1713, 1719, 1724, 1725, 1726, 1734, 1749, 1753, 1768, 1769, 1827]
Anomaly Score: 1.0
--------------------------------------------
Record #: 2494
Raw Input: 2015-09-17 16:24:00, 396.0
TemporalMemory Input: [658, 669, 670, 678, 692, 700, 702, 713, 715, 727, 731, 741, 753, 756, 762, 763, 775, 799, 814, 824, 846, 857, 1654, 1657, 1658, 1662, 1673, 1675, 1682, 1691, 1693, 1710, 1713, 1719, 1725, 1726, 1734, 1753, 1771, 1827]
TemporalMemory prev. predicted: [710, 731, 735, 741, 753, 756, 763, 766, 777, 797, 799, 814, 816, 1445, 1464, 1520, 1657, 1658, 1673, 1682, 1713, 1720, 1725, 1729, 1737, 1749, 1768, 1769, 1786, 1827, 1829, 1899, 1904]
TemporalMemory active: [658, 669, 670, 678, 692, 700, 702, 713, 715, 727, 731, 741, 753, 756, 762, 763, 775, 799, 814, 824, 846, 857, 1654, 1657, 1658, 1662, 1673, 1675, 1682, 1691, 1693, 1710, 1713, 1719, 1725, 1726, 1734, 1753, 1771, 1827]
Anomaly Score: 0.65
--------------------------------------------
Java:
--------------------------------------------
Record #: 2492
Raw Input: 2015-09-17 16:04:00,194
TemporalMemory Input: [721, 731, 733, 741, 753, 756, 760, 763, 777, 799, 814, 846, 1657, 1658, 1673, 1675, 1682, 1691, 1701, 1704, 1710, 1713, 1719, 1720, 1724, 1725, 1726, 1729, 1734, 1737, 1739, 1740, 1749, 1750, 1753, 1769, 1783, 1788, 1810, 1827]
TemporalMemory prev. predicted: [650, 651, 702, 731, 741, 753, 755, 756, 761, 763, 766, 770, 777, 799, 814, 816, 846, 1394, 1441, 1445, 1464, 1485, 1516, 1520, 1542, 1574, 1599, 1613, 1623, 1634, 1638, 1639, 1641, 1645, 1648, 1649, 1652, 1657, 1658, 1667, 1669, 1675, 1678, 1681, 1691, 1701, 1704, 1705, 1713, 1720, 1726, 1740, 1741, 1749, 1767, 1769, 1798, 1818, 1827, 1872]
TemporalMemory active: [721, 731, 733, 741, 753, 756, 760, 763, 777, 799, 814, 846, 1657, 1658, 1673, 1675, 1682, 1691, 1701, 1704, 1710, 1713, 1719, 1720, 1724, 1725, 1726, 1729, 1734, 1737, 1739, 1740, 1749, 1750, 1753, 1769, 1783, 1788, 1810, 1827]
Anomaly Score: 0.45
--------------------------------------------
Record #: 2493
Raw Input: 2015-09-17 16:14:00,245
TemporalMemory Input: [624, 626, 657, 699, 708, 711, 726, 731, 741, 753, 756, 763, 770, 772, 789, 799, 811, 814, 843, 846, 1654, 1657, 1658, 1673, 1682, 1691, 1701, 1704, 1710, 1713, 1719, 1724, 1725, 1726, 1734, 1749, 1753, 1768, 1769, 1827]
TemporalMemory prev. predicted: [731, 753, 763, 777, 1657, 1662, 1673, 1691, 1713, 1719, 1750, 1786, 1827]
TemporalMemory active: [624, 626, 657, 699, 708, 711, 726, 731, 741, 753, 756, 763, 770, 772, 789, 799, 811, 814, 843, 846, 1654, 1657, 1658, 1673, 1682, 1691, 1701, 1704, 1710, 1713, 1719, 1724, 1725, 1726, 1734, 1749, 1753, 1768, 1769, 1827]
Anomaly Score: 0.775
--------------------------------------------
Record #: 2494
Raw Input: 2015-09-17 16:24:00,396
TemporalMemory Input: [658, 669, 670, 678, 692, 700, 702, 713, 715, 727, 731, 741, 753, 756, 762, 763, 775, 799, 814, 824, 846, 857, 1654, 1657, 1658, 1662, 1673, 1675, 1682, 1691, 1693, 1710, 1713, 1719, 1725, 1726, 1734, 1753, 1771, 1827]
TemporalMemory prev. predicted: [710, 731, 735, 741, 753, 756, 763, 766, 777, 797, 799, 814, 816, 1445, 1464, 1520, 1657, 1658, 1673, 1682, 1713, 1720, 1725, 1729, 1737, 1749, 1768, 1769, 1786, 1827, 1829, 1899, 1904]
TemporalMemory active: [658, 669, 670, 678, 692, 700, 702, 713, 715, 727, 731, 741, 753, 756, 762, 763, 775, 799, 814, 824, 846, 857, 1654, 1657, 1658, 1662, 1673, 1675, 1682, 1691, 1693, 1710, 1713, 1719, 1725, 1726, 1734, 1753, 1771, 1827]
Anomaly Score: 0.65
--------------------------------------------
Your screenshot shows that activeColumns is a list of strings…
I was moving stuff around in my script and switched arguments to the anomaly function instead of giving it map(int, spOutput), which converts the string entries to ints.
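Boiled down to a couple of lines (a minimal repro with made-up column lists), the behavior was this - not a numpy bug after all:

```python
import numpy as np

# Minimal repro: the SP output had been read back as strings, so in1d found no
# matches against the int predicted columns (overlap 0 -> anomaly score 1.0).
active = ['731', '753', '763', '1657']          # SP output read back as strings
predicted = [731, 753, 763, 777, 1657, 1827]    # previously predicted columns (ints)

print(np.in1d(active, predicted).sum())                   # 0 (may warn, depending on numpy version)
print(np.in1d(list(map(int, active)), predicted).sum())   # 4 - the correct overlap
```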
So, yeah that did fix that one entry…
--------------------------------------------
Record #: 2492
Raw Input: 2015-09-17 16:04:00, 194.0
TemporalMemory Input: [721, 731, 733, 741, 753, 756, 760, 763, 777, 799, 814, 846, 1657, 1658, 1673, 1675, 1682, 1691, 1701, 1704, 1710, 1713, 1719, 1720, 1724, 1725, 1726, 1729, 1734, 1737, 1739, 1740, 1749, 1750, 1753, 1769, 1783, 1788, 1810, 1827]
TemporalMemory prev. predicted: [650, 651, 702, 731, 741, 753, 755, 756, 761, 763, 766, 770, 777, 799, 814, 816, 846, 1394, 1441, 1445, 1464, 1485, 1516, 1520, 1542, 1574, 1599, 1613, 1623, 1634, 1638, 1639, 1641, 1645, 1648, 1649, 1652, 1657, 1658, 1667, 1669, 1675, 1678, 1681, 1691, 1701, 1704, 1705, 1713, 1720, 1726, 1740, 1741, 1749, 1767, 1769, 1798, 1818, 1827, 1872]
TemporalMemory active: [721, 731, 733, 741, 753, 756, 760, 763, 777, 799, 814, 846, 1657, 1658, 1673, 1675, 1682, 1691, 1701, 1704, 1710, 1713, 1719, 1720, 1724, 1725, 1726, 1729, 1734, 1737, 1739, 1740, 1749, 1750, 1753, 1769, 1783, 1788, 1810, 1827]
Anomaly Score: 0.45
--------------------------------------------
Record #: 2493
Raw Input: 2015-09-17 16:14:00, 245.0
TemporalMemory Input: [624, 626, 657, 699, 708, 711, 726, 731, 741, 753, 756, 763, 770, 772, 789, 799, 811, 814, 843, 846, 1654, 1657, 1658, 1673, 1682, 1691, 1701, 1704, 1710, 1713, 1719, 1724, 1725, 1726, 1734, 1749, 1753, 1768, 1769, 1827]
TemporalMemory prev. predicted: [731, 753, 763, 777, 1657, 1662, 1673, 1691, 1713, 1719, 1750, 1786, 1827]
TemporalMemory active: [624, 626, 657, 699, 708, 711, 726, 731, 741, 753, 756, 763, 770, 772, 789, 799, 811, 814, 843, 846, 1654, 1657, 1658, 1673, 1682, 1691, 1701, 1704, 1710, 1713, 1719, 1724, 1725, 1726, 1734, 1749, 1753, 1768, 1769, 1827]
(corrected score here:)
Anomaly Score: 0.775
--------------------------------------------
Record #: 2494
Raw Input: 2015-09-17 16:24:00, 396.0
TemporalMemory Input: [658, 669, 670, 678, 692, 700, 702, 713, 715, 727, 731, 741, 753, 756, 762, 763, 775, 799, 814, 824, 846, 857, 1654, 1657, 1658, 1662, 1673, 1675, 1682, 1691, 1693, 1710, 1713, 1719, 1725, 1726, 1734, 1753, 1771, 1827]
TemporalMemory prev. predicted: [710, 731, 735, 741, 753, 756, 763, 766, 777, 797, 799, 814, 816, 1445, 1464, 1520, 1657, 1658, 1673, 1682, 1713, 1720, 1725, 1729, 1737, 1749, 1768, 1769, 1786, 1827, 1829, 1899, 1904]
TemporalMemory active: [658, 669, 670, 678, 692, 700, 702, 713, 715, 727, 731, 741, 753, 756, 762, 763, 775, 799, 814, 824, 846, 857, 1654, 1657, 1658, 1662, 1673, 1675, 1682, 1691, 1693, 1710, 1713, 1719, 1725, 1726, 1734, 1753, 1771, 1827]
Anomaly Score: 0.65
--------------------------------------------
…it gets the correct score, but the others are the same!
There is something fishy here. Until you can replicate this bug in one Python script, I doubt that you’ve found a bug in numpy. Sorry man, the burden of proof is on you, and screenshots and pasted numbers don’t cut it. If you can show this bug with code in one script, I’ll believe you.
Yeah, I’m not sure why in1d() would produce the same thing for a list of strings in some circumstances, but I have bigger fish to fry! Maybe laterz…