Anomaly detection on comments/chat data


#1

Hi,

I’m completely new to HTM. I’ve been searching this forum and the web for an answer to this question, but I couldn’t find anything of help. The question is:

Can I do anomaly detection on natural language data like this:

  1. ip_1, comment_1
  2. ip_1, comment_2
  3. ip_2, comment_3

The comments are strings (natural language). I want to detect anomalous comments, for example advertisements, repeated comments (not necessarily exact matches), etc.

Can HTM help me solve such a problem? If so, is there an example of implementing it (or of a similar problem)?

Thank you very much


#2

Welcome, @ghmerti!

Before solving this from scratch, it would be worth looking at cortical.io, which can help with detecting text similarity and building semantic fingerprints.
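To make the "repeated but not exact match" part concrete, here is a minimal sketch of near-duplicate detection using Jaccard similarity over word shingles. To be clear, this is not cortical.io's API or their semantic fingerprints, just a plain-Python stand-in for the general idea of comparing comments by overlap. The function names, the shingle size `k=2`, and the `0.5` threshold are all assumptions you would want to tune on your own data.

```python
import re


def word_shingles(text, k=2):
    """Return the set of k-word shingles in a lowercased comment."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        # Too short for a full shingle: fall back to one shingle of all words.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def near_duplicates(comments, threshold=0.5):
    """Return index pairs (i, j), i < j, whose shingle overlap is >= threshold."""
    shingles = [word_shingles(c) for c in comments]
    pairs = []
    for i in range(len(comments)):
        for j in range(i + 1, len(comments)):
            if jaccard(shingles[i], shingles[j]) >= threshold:
                pairs.append((i, j))
    return pairs


# Hypothetical sample comments, for illustration only.
comments = [
    "Buy cheap watches at our store today",
    "buy cheap watches at our store today!!!",
    "Does anyone know how the spatial pooler works?",
]
duplicate_pairs = near_duplicates(comments)
```

The pairwise loop is quadratic in the number of comments, so for a real chat log you would probably want something like MinHash or locality-sensitive hashing instead of comparing everything against everything.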

As far as plain HTM goes, its current strength is more around sequence-based prediction. So if you were trying to predict the next word following a sequence of words (or how anomalous a word is within the context of a sentence), it would be better suited. But it looks like most of what you’re solving is more of a classification problem (e.g. “does this sentence constitute advertising?”), which is the domain where deep learning currently excels.
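To illustrate what "how anomalous a word is within the context of a sentence" means, here is a rough sketch that scores each word by how surprising it is given the previous word. Note that this is not HTM and does not use any HTM library; it is an ordinary bigram counter with add-one smoothing, used only as a stand-in to show the shape of a sequence-based anomaly score. The function names and the tiny training sentences are made up for the example.

```python
import math
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count word-to-next-word transitions; '<s>' marks the sentence start."""
    counts = defaultdict(Counter)
    vocab = set()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split()
        vocab.update(words)
        for prev, cur in zip(words, words[1:]):
            counts[prev][cur] += 1
    return counts, vocab


def surprise(counts, vocab, prev, cur):
    """Return -log2 P(cur | prev) with add-one smoothing.

    The extra +1 in the denominator reserves probability mass for
    words that never appeared in training.
    """
    following = counts.get(prev, Counter())
    total = sum(following.values())
    prob = (following[cur] + 1) / (total + len(vocab) + 1)
    return -math.log2(prob)


def score_sentence(counts, vocab, sentence):
    """Return a list of (word, surprise in bits) for each word in the sentence."""
    words = ["<s>"] + sentence.lower().split()
    return [(cur, surprise(counts, vocab, prev, cur))
            for prev, cur in zip(words, words[1:])]


# Hypothetical training data, for illustration only.
training = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat lay on the mat",
]
counts, vocab = train_bigrams(training)
scores = score_sentence(counts, vocab, "the cat sat on the pizza")
most_anomalous = max(scores, key=lambda pair: pair[1])
```

Here `most_anomalous` picks out the word with the highest surprise, which in this toy example is the one that never followed its predecessor in training. A score like this tells you a word is unexpected in context, not that a comment is an advertisement, which is why the problem as you describe it looks more like classification.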

I would add that IP address might not be an ideal way of identifying users, as these days you can have a very large number of people behind a single IP address.

Hope this helps, and I’d encourage you to learn HTM regardless of whether it’s a good fit for your current scenario.