Fast feedforward (FFF) network speedups

Paper one, "Exponentially Faster Language Modelling", is about using FFF networks in transformer FF blocks for training & inference.

Paper two, "Fast Feedforward Networks", presents the actual algorithm.

It is a form of sparse execution (e.g. only 0.3% of neurons being activated) rather than sparse representation (HTM's key feature).

Also interesting: the GPU had to be "hacked" to perform conditional matrix multiplication, and there is potential for native support in future hardware.
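For intuition, here is a minimal sketch of this kind of conditional execution, assuming a balanced binary routing tree whose leaves are tiny FF blocks. All names, shapes, and sizes below are illustrative, not the papers' actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

D_MODEL = 64        # input/output width of the FF block
DEPTH = 8           # tree depth -> 2**8 = 256 leaves
LEAF_WIDTH = 2      # hidden neurons evaluated per leaf (gives ~0.4% activation here)

# One routing neuron per internal tree node (2**DEPTH - 1 of them).
node_w = rng.standard_normal((2**DEPTH - 1, D_MODEL)) * 0.1
# One tiny FF block per leaf: D_MODEL -> LEAF_WIDTH -> D_MODEL.
leaf_w1 = rng.standard_normal((2**DEPTH, LEAF_WIDTH, D_MODEL)) * 0.1
leaf_w2 = rng.standard_normal((2**DEPTH, D_MODEL, LEAF_WIDTH)) * 0.1

def fff_forward(x):
    """Single-sample inference: descend the tree, then evaluate one leaf only."""
    node = 0                                   # start at the root
    for _ in range(DEPTH):
        go_right = node_w[node] @ x > 0.0      # hard routing decision
        node = 2 * node + (2 if go_right else 1)
    leaf = node - (2**DEPTH - 1)               # heap index -> leaf index
    hidden = np.maximum(leaf_w1[leaf] @ x, 0.0)  # ReLU over LEAF_WIDTH neurons
    return leaf_w2[leaf] @ hidden

y = fff_forward(rng.standard_normal(D_MODEL))
print(y.shape)  # (64,) -- only LEAF_WIDTH of the 512 leaf neurons were evaluated
```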

3 Likes

Quite interesting that the inference pass works well with only 0.3% activation. It raises the question of how the learning phase could achieve a similarly low activation during the learning pass.

Curious that RTE performance is significantly worse and shows higher variability (53.8 at 1% activation vs. 59.9 at 0.5%, whilst 1.8% gives 56.2) compared to BERT base. Could this indicate that the 90+% of neurons left unused contain a high degree of subtle aggregate influences needed for inferred determination on RTE?

1 Like

You mean to accelerate training too? That would be a more difficult problem, since training is done in batches and routing here (picking which neurons to activate) is computed from the input at each FF block (a toy illustration follows the list below).
The only way I can think of is training with a batch size of 1, but that isn't going to accelerate things much. In that case it might not make much sense to even use a GPU.
However,

  • the gating algorithm might stabilize the slower-converging stochastic gradient descent.
  • parallel training on CPU would benefit from the fact that each training step only updates ~1% of the weights.
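To make the batching issue above concrete, here is a toy sketch (hypothetical code, not from the papers) showing that each sample in a batch routes to its own leaf, so the leaf matmuls can't be fused into one dense batched GEMM:

```python
import numpy as np

rng = np.random.default_rng(1)

D_MODEL, DEPTH = 64, 8
node_w = rng.standard_normal((2**DEPTH - 1, D_MODEL)) * 0.1

def route(x):
    """Return the leaf index selected by the hard routing tree for one sample."""
    node = 0
    for _ in range(DEPTH):
        node = 2 * node + (2 if node_w[node] @ x > 0.0 else 1)
    return node - (2**DEPTH - 1)

batch = rng.standard_normal((8, D_MODEL))
leaves = [route(x) for x in batch]
print(leaves)  # typically several different leaf indices -> different weight slices per row
# A single dense batched matmul would require every row to use the same leaf weights;
# instead you need a per-row gather, i.e. the conditional matrix multiplication
# that the paper emulates on current GPUs.
```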
1 Like