Encoding high-dimensional vector data / word embeddings

Hello everyone! I am currently trying to use an HTM model to detect anomalies in log event data (HDFS/BGL logs). My current approach converts the log events to high-dimensional vectors using a word2vec model with more than 30 dimensions.

Since the fields here are correlated with one another, building a separate model for each field is not an option. Are there any specific encoding methods for high-dimensional vectors? Or is the only option to reduce the dimensionality somehow and use scalar encoders?

Thank you in advance!

Hi, how many dimensions and data points do you actually have?

Hey! I currently have 32 dimensions and around 57000 rows.

Your best chance is an encoding that preserves similarity. I would try a fly hash encoder, which is roughly a simple random projection from a (relatively) low-dimensional space of floats to a (relatively) higher-dimensional space of sparse bits (== SDRs).

If you have no idea what I’m talking about, here’s an article to begin with. I found this image quite relevant.

My attempt at implementing a simple one gave quite satisfying results on MNIST. If you want to try it on your data too, I can help explain what it actually does if the source seems too cryptic.
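
For anyone who wants a concrete picture of the idea, here is a minimal sketch of a fly-hash-style encoder. It is not the implementation mentioned above, only an illustration under assumptions: the class name `FlyHashEncoder`, the output size (2048 bits), the number of active bits (40), and the per-unit fan-in (6) are all hypothetical choices, not values from this thread. Each output bit sums a small random subset of the input floats, and only the top-k sums become active bits.

```python
import numpy as np

class FlyHashEncoder:
    """Hypothetical sketch: sparse random projection followed by top-k winner-take-all."""

    def __init__(self, input_dim, output_dim=2048, active_bits=40, fan_in=6, seed=0):
        rng = np.random.default_rng(seed)
        # Each output unit listens to a small random subset of input dimensions.
        self.weights = np.zeros((output_dim, input_dim), dtype=np.float32)
        for row in self.weights:
            row[rng.choice(input_dim, size=fan_in, replace=False)] = 1.0
        self.active_bits = active_bits

    def encode(self, x):
        x = np.asarray(x, dtype=np.float32)
        x = x - x.mean()  # center so a constant offset does not decide the winners
        activations = self.weights @ x
        # Keep only the k most strongly activated units (winner-take-all).
        winners = np.argpartition(activations, -self.active_bits)[-self.active_bits:]
        sdr = np.zeros(self.weights.shape[0], dtype=np.uint8)
        sdr[winners] = 1
        return sdr

def overlap(a, b):
    """Number of bits that are active in both SDRs."""
    return int(np.sum(a & b))

# Usage with a 32-dimensional embedding, as in the question.
encoder = FlyHashEncoder(input_dim=32)
vector = np.random.default_rng(1).normal(size=32)
sdr = encoder.encode(vector)
```

Similar input vectors should share many active bits while unrelated ones share few, which is the property an HTM model relies on. The sparsity and fan-in would likely need tuning on the actual data.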


I assume each 32-float vector corresponds to a single log line.
What matters is to have the time of each event encoded within the vector itself, preferably with words that each capture a different “aspect” of time: year, month, day of week, day of month, hour, minute, second.
That would be more “insightful” than representing time as a single float, e.g.:

[1552622.974721] smpboot: Booting Node 0 Processor 3 APIC 0x3
[1552622.975300] CPU3 is up
[1552622.976639] ACPI: Waking up from system sleep state S3
[1552622.998637] sd 1:0:0:0: [sda] Starting disk
[1552623.200812] OOM killer enabled.
[1552623.200813] Restarting tasks … done.
[1552623.205668] video LNXVIDEO:00: Restoring backlight state
[1552623.205676] PM: suspend exit
[1552623.375128] e1000e 0000:00:19.0 enp0s25: NIC Link is Down
[1552623.436960] ata2.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
[1552623.436966] ata2.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
[1552623.436968] ata2.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
[1552626.354200] e1000e 0000:00:19.0 enp0s25: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[1552626.354257] IPv6: ADDRCONF(NETDEV_CHANGE): enp0s25: link becomes ready
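
One way to get such per-aspect words is to expand each timestamp into tokens before the log line goes to word2vec. The sketch below is only an assumption about how that could look: the function name `time_tokens` and the token format are hypothetical, and it assumes a wall-clock Unix timestamp is available (the bracketed values in the sample above look like seconds since boot, which would first need to be converted).

```python
from datetime import datetime, timezone

def time_tokens(epoch_seconds):
    """Split a Unix timestamp into one word per time aspect (hypothetical token format)."""
    t = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return [
        f"year_{t.year}",
        f"month_{t.month}",
        f"dow_{t.weekday()}",   # day of week, Monday = 0
        f"dom_{t.day}",         # day of month
        f"hour_{t.hour}",
        f"minute_{t.minute}",
        f"second_{t.second}",
    ]

# Prepend the time words to the words of the log message itself.
message_words = ["CPU3", "is", "up"]
sentence = time_tokens(0) + message_words
```

The resulting word list can then be embedded like any other sentence, so events that share an hour or a weekday also share tokens.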

Thank you so much for your detailed write-up. I am currently reading through the article and I believe the fly hash encoder is exactly what I am looking for.

Thank you again for sharing your implementation as well. I will try to post the results once I finish my project.