SLIDE : In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems

This might be interesting. It uses a form of “sparsity” to reduce the computational cost of deep learning.

More info on LSH in:

Can this be used for a better encoder?


I think they did use a 44-core CPU, which I would consider in the same bracket as a GPU. Though they didn’t use the CPU’s full SIMD (single instruction, multiple data) instruction set. Or maybe they did unknowingly, as their compiler may have autovectorized their code. Even the Java just-in-time HotSpot compiler will autovectorize these days.

Anyway, I am a big fan of random projection / locality-sensitive hashing.
You can just compute HDx, or HDHDx, etc., where H is a matrix multiply that in practice is replaced by the fast Walsh-Hadamard transform and D is a random diagonal matrix with +1, -1 entries. That gives a random projection by sequency (similar to frequency) scrambling.
If you binarize the output of HDx, you have a fast locality-sensitive hash.
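For the curious, here is a minimal JS sketch of that construction, assuming the input length is a power of two. The helper names (whtInPlace, randomSigns, lshHash) are mine, not from any library:

```javascript
// In-place fast Walsh-Hadamard transform: the "H" step, O(n log n).
function whtInPlace(x) {
  const n = x.length;
  for (let h = 1; h < n; h *= 2) {
    for (let i = 0; i < n; i += 2 * h) {
      for (let j = i; j < i + h; j++) {
        const a = x[j], b = x[j + h];
        x[j] = a + b;
        x[j + h] = a - b;
      }
    }
  }
}

// The "D" step: fixed random +1/-1 signs, chosen once and reused.
function randomSigns(n) {
  const d = new Float32Array(n);
  for (let i = 0; i < n; i++) d[i] = Math.random() < 0.5 ? 1 : -1;
  return d;
}

// HDx followed by binarization: sign-flip, transform, keep only the signs.
function lshHash(input, signs) {
  const x = Float32Array.from(input, (v, i) => v * signs[i]);
  whtInPlace(x);
  return x.map(v => (v >= 0 ? 1 : -1)); // the hash bits, as +1/-1
}
```

The whole thing costs O(n log n) per hash instead of the O(n*n) a dense random projection matrix would need, which is the entire appeal.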

Mild hashing (e.g. HD) followed by a nonlinearity (e.g. binarization) allows a weighted sum to act as a general associative memory.
https://ai462qqq.blogspot.com/2019/11/artificial-neural-networks.html
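As a toy illustration of that claim, here is one way a weighted sum can store and recall scalar values behind such a hash, reusing randomSigns and lshHash from the sketch above. The error-correcting write rule is my guess at the simplest workable scheme, not lifted from the linked post:

```javascript
// Toy associative memory: hash the key, binarize, read/write a weighted sum.
class AssociativeMemory {
  constructor(n) {
    this.n = n;                   // vector length, a power of two
    this.signs = randomSigns(n);  // the fixed D matrix
    this.w = new Float32Array(n); // weights of the weighted sum
  }
  recall(key) {
    const h = lshHash(key, this.signs);
    let sum = 0;
    for (let i = 0; i < this.n; i++) sum += this.w[i] * h[i];
    return sum / this.n; // weighted sum over the hash bits
  }
  store(key, value) {
    const h = lshHash(key, this.signs);
    const err = value - this.recall(key);
    // dot(h, h) = n for a +1/-1 hash, so this single write makes
    // recall(key) return value exactly.
    for (let i = 0; i < this.n; i++) this.w[i] += err * h[i];
  }
}
```

Distinct keys hash to nearly orthogonal +1/-1 patterns, so stored items interfere with each other only weakly.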


I did a JS version of a Fixed Filter Bank neural network. You keep the neural network weights fixed (by using a fast transform for them) and adjust the non-linear activation functions instead:
View: https://editor.p5js.org/siobhan.491/present/Bgk9KvmMn
Code: https://editor.p5js.org/siobhan.491/sketches/Bgk9KvmMn
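For anyone who doesn’t want to read through the sketch, here is roughly what one such layer looks like, reusing whtInPlace from the earlier post. The two-slope per-element activation is an assumption about the parameterization, so treat this as a sketch rather than the linked code:

```javascript
// One fixed-filter-bank layer: the fast Walsh-Hadamard transform supplies
// the fixed "weights"; only the per-element activation slopes are trained.
function ffbLayer(x, posSlope, negSlope) {
  const y = Float32Array.from(x);
  whtInPlace(y); // fixed mixing step; nothing here is learned
  // Per-element two-sided ReLU with separate learned slopes per sign.
  for (let i = 0; i < y.length; i++) {
    y[i] *= y[i] >= 0 ? posSlope[i] : negSlope[i];
  }
  return y;
}
```

Stacking such layers alternates fixed O(n log n) transforms with learned activations, so all the trainable parameters live in the slope arrays.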

All the hardware and libraries neural network researchers use these days are tuned and specialized for conventional O(n*n) nets. And all their high-level work, like GANs, ResNets, etc., is based on the characteristics of that sort of net.
They are 90% locked in already. It would be rather an upheaval to have to go down into the basement, replace the broken light bulb, look around and tidy up the mess.

Here’s an updated paper on SLIDE; they report further improvements and provide supporting code on GitHub.

Since the algorithm speculatively selects a sparse set of active neurons for each layer, I wonder if similar optimizations could be applied to HTM.
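To make that concrete, here is a rough single-hash-table sketch of the speculation step in JS, reusing lshHash and randomSigns from earlier in the thread. SLIDE itself uses multiple hash tables and more refined LSH families, so this is only an illustration:

```javascript
// Reduce a hash to a small bucket key using its first `bits` signs.
function bucketKey(vec, signs, bits) {
  const h = lshHash(vec, signs);
  let key = 0;
  for (let i = 0; i < bits; i++) key = (key << 1) | (h[i] > 0 ? 1 : 0);
  return key;
}

// Build once per layer: bucket key -> indices of neurons hashed there.
function buildTable(weights, signs, bits) {
  const table = new Map();
  weights.forEach((w, idx) => {
    const key = bucketKey(w, signs, bits);
    if (!table.has(key)) table.set(key, []);
    table.get(key).push(idx);
  });
  return table;
}

// Forward pass that only computes the neurons colliding with the input.
function sparseForward(input, weights, table, signs, bits) {
  const active = table.get(bucketKey(input, signs, bits)) || [];
  const out = new Map();
  for (const idx of active) {
    let dot = 0;
    for (let i = 0; i < input.length; i++) dot += weights[idx][i] * input[i];
    out.set(idx, Math.max(0, dot)); // ReLU on the few selected neurons
  }
  return out; // neurons in other buckets are treated as outputting zero
}
```

Since sign-random-projection hashes collide more often for vectors with high angular similarity, the neurons picked this way are the ones most likely to have large activations.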

If such techniques could scale across networked machines instead of only CPU cores, lots of possibilities would open up.
