SLIDE : In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems

This might be interesting. It uses a form of “sparsity” to reduce the computational cost of deep learning.

More info for LSH in

Can this be used for a better encoder?


I think they used a 44-core CPU, which I would consider in the same bracket as a GPU. They didn’t use the CPU’s full SIMD (single instruction, multiple data) instruction set, though. Or maybe they did unknowingly, because their compiler auto-vectorized the code. Even Java’s HotSpot just-in-time compiler auto-vectorizes these days.

Anyway, I am a big fan of random projection/locality-sensitive hashing.
You can just do HD or HDHD etc., where H is a matrix multiplication that in practice is replaced by the fast Walsh-Hadamard transform and D is a random diagonal matrix with +1/-1 entries. That gives a random projection by sequency (similar to frequency) scrambling.
If you binarize the output of HD, you have a fast locality-sensitive hash.
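
Here is a rough sketch of the HD idea in plain Python, just to make it concrete. The function names (fwht, random_signs, hd_project, hd_hash) are my own, the transform is left unnormalized, and the input length is assumed to be a power of two. It is an illustration, not tuned code.

```python
import random

def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = list(x)
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        # Butterfly step: combine pairs that are h apart.
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x

def random_signs(n, seed):
    """The diagonal of D: n random +1/-1 entries (the seed is an illustrative choice)."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]

def hd_project(x, signs):
    """Apply D (sign flips), then H (fast transform): a cheap random projection."""
    return fwht([s * v for s, v in zip(signs, x)])

def hd_hash(x, signs):
    """Binarize the HD output to get one hash bit per dimension."""
    return [1 if v >= 0 else 0 for v in hd_project(x, signs)]

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return sum(p != q for p, q in zip(a, b))
```

To compare two inputs, hash both with the same signs and take the Hamming distance. Nearby inputs should tend to disagree on fewer bits than unrelated ones. For HDHD, call hd_project twice with two independent sign vectors.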

Mild hashing (e.g. HD) followed by a nonlinearity (e.g. binarization) allows a weighted sum to act as a general associative memory.
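
A minimal sketch of what I mean, again in plain Python. The class name HDAssociativeMemory, the +1/-1 binarization and the error-correcting update rule are my own choices for illustration, and the input length is assumed to be a power of two.

```python
import random

def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = list(x)
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x

class HDAssociativeMemory:
    """Weighted sum over binarized HD features (hypothetical illustration)."""

    def __init__(self, n, seed=0):
        if n <= 0 or n & (n - 1):
            raise ValueError("n must be a power of two")
        rng = random.Random(seed)
        self.n = n
        self.signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]  # diagonal of D
        self.weights = [0.0] * n

    def features(self, x):
        # Mild hashing (HD) followed by the nonlinearity (binarize to +1/-1).
        projected = fwht([s * v for s, v in zip(self.signs, x)])
        return [1.0 if v >= 0 else -1.0 for v in projected]

    def recall(self, x):
        # The weighted sum is the memory readout.
        return sum(w * f for w, f in zip(self.weights, self.features(x)))

    def store(self, x, target):
        # Spread the recall error evenly over all n weights. Each feature is
        # +1 or -1, so this moves recall(x) to the target.
        f = self.features(x)
        error = target - sum(w * fi for w, fi in zip(self.weights, f))
        step = error / self.n
        self.weights = [w + step * fi for w, fi in zip(self.weights, f)]
```

Storing a single pair makes recall return its target. With several pairs, each store disturbs the earlier ones a little, so you would loop over the training pairs a few times until the errors settle.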