I’m completely new to HTM. I’ve been searching this forum and the web for an answer to this question, but I couldn’t find anything of help. The question is:
Can I do anomaly detection on natural language data like this:
ip_1, comment_1
ip_1, comment_2
ip_2, comment_3
The comments are strings (natural language). I want to detect anomalous comments, for example advertisements, repeated comments (not necessarily exact matches), etc.
Can HTM help me solve such a problem? If so, is there an example of implementing it (or of a similar problem)?
Before solving this from scratch, it would be worth looking at cortical.io, which can help with detecting text similarity and building a semantic fingerprint.
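To give a rough idea of what flagging repeated comments that aren't exact matches could look like, here is a minimal sketch using plain word-overlap (Jaccard) similarity. To be clear, this is not cortical.io's API or its semantic fingerprint, just a simple stand-in. The function names, the sample comments and the 0.6 threshold are all my own assumptions for illustration.

```python
import re

def tokens(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def jaccard(a, b):
    """Word-set overlap between two comments, from 0.0 (disjoint) to 1.0 (same words)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty comments are treated as identical
    return len(ta & tb) / len(ta | tb)

def near_duplicates(comments, threshold=0.6):
    """Return index pairs (i, j), i < j, whose similarity meets the threshold.

    The threshold is an assumed value; it would need tuning on real data.
    """
    pairs = []
    for i in range(len(comments)):
        for j in range(i + 1, len(comments)):
            if jaccard(comments[i], comments[j]) >= threshold:
                pairs.append((i, j))
    return pairs

# Made-up sample comments, for illustration only.
comments = [
    "Buy cheap watches at our store today",
    "Buy cheap watches at our store now",
    "I think the spatial pooler explanation was helpful",
]
print(near_duplicates(comments))  # [(0, 1)]
```

Comparing every pair is quadratic in the number of comments, so this only works as-is for small batches. Word overlap also misses paraphrases, which is where a semantic representation would likely do better.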
As far as plain HTM goes, its current strength is more in sequence-based prediction. So it would be better suited if you were trying to predict the next word following a sequence of words (or to measure how anomalous a word is within the context of a sentence). But it looks like most of what you’re solving is more of a classification problem (e.g. “does this sentence constitute advertising?”), which is the domain where deep learning currently excels.
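To make "how anomalous a word is within the context of a sentence" a bit more concrete, here is a toy sketch. It is a simple bigram frequency model, not HTM and not any HTM library's API; it only illustrates the kind of per-word, in-sequence score I mean. The function names and training sentences are made up for the example.

```python
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count, for each word, which words follow it in the training sentences."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            following[prev][nxt] += 1
    return following

def anomaly_scores(following, sentence):
    """Score each word after the first as 1 - P(word | previous word).

    1.0 means the transition was never seen in training;
    0.0 means it was the only transition ever seen after the previous word.
    """
    words = sentence.lower().split()
    scores = []
    for prev, nxt in zip(words, words[1:]):
        total = sum(following[prev].values())
        prob = following[prev][nxt] / total if total else 0.0
        scores.append((nxt, 1.0 - prob))
    return scores

# Made-up training data, for illustration only.
model = train_bigrams([
    "the cat sat on the mat",
    "the cat sat on the rug",
])
for word, score in anomaly_scores(model, "the cat sat on the moon"):
    print(word, score)
```

Here "moon" gets a score of 1.0 because it never followed "the" in training, while "sat" and "on" score 0.0. Note that this scores words inside a sequence; it does not by itself answer "is this whole comment an advertisement?", which is the classification question.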
I would add that IP address might not be an ideal way of identifying users, as these days you can have a very large number of people behind a single IP address.
Hope this helps, and I’d encourage you to learn HTM regardless of whether it’s a good fit for your current scenario.