I have a question regarding the way HTM learns long-range dependencies, and in particular how it compares to the way LSTM does it.
It is my understanding that in order to learn long-range dependencies between two events A and B separated by x number of events, HTM would need to memorize the entire sequence of x events, even though those events may not be causally related to A or B. In contrast, LSTM does not need to learn the entire sequence, thanks to its gating mechanism. Rather, it learns a direct relationship between A and B. Am I understanding correctly that currently, in the particular scenario where the x events are not causally related to A or B, HTM cannot actually learn a direct connection between A and B?
I am aware of the good results achieved by HTM on long-sequence learning, as shown by Cui et al. (2016). However, the need for HTM to learn the entire sequence, instead of learning a direct connection between A and B, seems like an important limitation. Would you agree?
It seems you're right, especially if the sequences of x events between A and B are not reliably predictable. Even if they were, the longer they get, the more repetitions of the overall sequence the TM would need in order to learn that full context. I think connecting A to B in those cases calls for minimal noise and many repetitions, at the very least.
Definitely a valid criticism and an actual problem in many cases.
There is actually a way to do it with HTM, but I do not think it is practical for most use cases. Temporal Memory (TM) can be configured so that the currently active cells representing B at time t form connections not only to the cells that were active at time t-1 (the last of the x events) but also to cells active earlier than that (t-n). If you run it long enough, the x events would not be captured as causal, but A would. However, this is not how vanilla TM works, and if it is configured this way, the predictions caused by a single activation may not be useful because of the many false positives.
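To make the idea above concrete, here is a minimal toy sketch in Python. It is not the TM algorithm and does not use any HTM library: symbols stand in for cell activations, link counts stand in for synapses, and the names `learn_lagged_links`, `make_sequence`, and `max_lag` are hypothetical, made up for this illustration. It only shows the trade-off: looking back n steps lets B link to A across random filler events, but every filler event gets linked to B as well.

```python
import random
from collections import defaultdict, deque

def learn_lagged_links(sequences, max_lag):
    """Count how often `src` occurred 1..max_lag steps before `dst`.

    Toy stand-in (an assumption, not real TM) for letting the cells
    active at time t connect to cells active at t-1 ... t-max_lag.
    """
    counts = defaultdict(int)
    for seq in sequences:
        history = deque(maxlen=max_lag)  # the last max_lag symbols
        for sym in seq:
            for src in set(history):
                counts[(src, sym)] += 1
            history.append(sym)
    return counts

def make_sequence(rng, gap):
    """A1 ... B1 or A2 ... B2, with `gap` random filler events between."""
    k = rng.choice("12")
    filler = [rng.choice("wxyz") for _ in range(gap)]
    return ["A" + k] + filler + ["B" + k]

rng = random.Random(0)
gap = 5
data = [make_sequence(rng, gap) for _ in range(200)]

# max_lag=1 mimics looking only at t-1; max_lag=gap+1 reaches back to A.
short = learn_lagged_links(data, max_lag=1)
long_ = learn_lagged_links(data, max_lag=gap + 1)

print("lag 1, A1->B1:", short[("A1", "B1")])
print("lag n, A1->B1:", long_[("A1", "B1")])
print("lag n, A1->B2:", long_[("A1", "B2")])
print("lag n, w->B1 :", long_[("w", "B1")])
print("lag n, w->B2 :", long_[("w", "B2")])
```

With `max_lag=1` there is never a link from A1 to B1, because a filler event always sits between them. With the longer window the A1 to B1 link appears and A1 to B2 stays at zero, but the filler symbols end up linked to both B1 and B2, which is the false-positive problem described above.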
Yes, this is correct for the HTM Temporal Memory algorithm. We discussed this (and a couple of other limitations) in the Cui et al. (2016) paper (see the third limitation in Section 6.4). We used a variation of your example, the Reber grammar task, as an illustration of this.
It is quite possible that we could extend the algorithm to handle these cases, but we have not focused on this. It would be great if someone wanted to tackle it.
Could you please elaborate on what this configuration of TM might look like? I have found this discussion on using graded SDRs to improve random access of SDRs quite helpful. Does your answer relate to this in any way?
I would really appreciate any help on this matter, since I am studying the problem of learning long-range dependencies in HTM using the Extended and Continual Reber Grammars. Thanks.