Quite interesting that the inference pass works well with only 0.3% activation. That raises the question of how the learning phase could achieve similarly low activation.
Curious that RTE performance is significantly worse and shows higher variability compared to BERT base (53.8 at 1% activation vs 59.9 at 0.5%, whilst 1.8% gives 56.2). Could this indicate that the 90+% scores on RTE depend on a high degree of subtle aggregate influences for the inferred determination?
You mean to accelerate training too? That would be a more difficult problem, since training is done in batches and the routing here (picking which neurons to activate) is computed from the input at each FF block.
The only way I can think of is training with a batch size of 1, but that isn't going to accelerate things much, in which case it might not make much sense to even use a GPU.
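To make the per-input routing point concrete, here is a minimal NumPy sketch under my own assumptions (a ReLU FF block, a learned gating projection `gate`, top-k selection; the names `W1`, `W2`, `sparse_ff` are illustrative, not the paper's implementation). Because each sample selects its own neuron subset, a batch can't be collapsed into one dense matmul over shared weights:

```python
# Minimal sketch of per-input routing in a sparse FF block (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 64, 4096
W1 = rng.standard_normal((d_hidden, d_model)) * 0.02    # FF up-projection
W2 = rng.standard_normal((d_model, d_hidden)) * 0.02    # FF down-projection
gate = rng.standard_normal((d_hidden, d_model)) * 0.02  # gating / routing projection

def sparse_ff(x, frac=0.01):
    """Run the FF block using only the top `frac` of hidden neurons for this input."""
    k = max(1, int(frac * d_hidden))
    scores = gate @ x                          # one routing score per hidden neuron
    idx = np.argpartition(scores, -k)[-k:]     # indices of the k activated neurons
    h = np.maximum(W1[idx] @ x, 0.0)           # compute only the selected rows
    return W2[:, idx] @ h                      # and only the matching columns

x_a, x_b = rng.standard_normal(d_model), rng.standard_normal(d_model)
# Two different inputs generally activate different neuron subsets, so a batch
# of them cannot be reduced to a single dense matmul over shared weights.
y_a, y_b = sparse_ff(x_a), sparse_ff(x_b)
```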
However, the gating algorithm might stabilize the slower-converging stochastic gradient descent, and parallel training on CPU would benefit from the fact that each training step only updates 1% of the weights.
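A rough sketch of that last point (again my own illustration, assuming a ReLU FF block and a router that picks ~1% of hidden neurons; `grad_h` stands in for the upstream gradient): the per-sample update only touches the weight rows that were active in the forward pass, so a CPU worker writes to a tiny slice of the parameters each step:

```python
# Illustrative per-sample sparse update: only the ~1% of active rows are touched.
import numpy as np

rng = np.random.default_rng(1)
d_model, d_hidden, k = 64, 4096, 40                # 40/4096 is roughly 1% of neurons
W1 = rng.standard_normal((d_hidden, d_model)) * 0.02
lr = 1e-2

x = rng.standard_normal(d_model)
idx = rng.choice(d_hidden, size=k, replace=False)  # neurons chosen by the router
h = np.maximum(W1[idx] @ x, 0.0)                   # forward pass uses only k rows

grad_h = rng.standard_normal(k)                    # placeholder upstream gradient
grad_active = np.outer(grad_h * (h > 0), x)        # ReLU-masked gradient, k rows only
W1[idx] -= lr * grad_active                        # update touches k rows, not d_hidden
```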