Hugely impressed with the results, particularly the performance. Some thoughts and observations.
Page 7, Figure 6: an 8-core CPU is used, with no detail on memory or the number of memory channels. That appears to mean AVX is totally irrelevant, as this is a memory-bound compute issue (ref. the 625x theoretical figure). Without that detail the correct scale is lost. Running on a laptop with a small CPU cache and potentially only a single memory channel populated is >10x different from a Threadripper/EPYC with 8 channels, depending also on how the code is made parallel.
Also, if the model fitted in one of the larger EPYC CPU caches, the increase in performance would be significant enough to change some of your conclusions. Try running the model so that it sits in CPU cache only, to quantify this difference, which is driven by the CPU strangler known as DIMM.
Switching the CPU to a Xeon 8275CL (also with no memory channel details) then eliminates any consistent comparison for Fig 8. Is it a single or dual CPU setup? Is there a NUMA performance hit?
Intel Xeon 8275CL 35MB cache / 6 memory channels / 24 core
EPYC 7763 256MB cache / 8 memory channels / 64 core
The cache can make a 10x difference on some code.
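To put a rough shape on the memory channel point, here is a minimal roofline-style sketch. It is not taken from the paper: the per-channel bandwidth (DDR4-3200), the arithmetic intensity and the compute peak are all assumed values chosen only for illustration, and `attainable_gflops` is a hypothetical helper name.

```python
# Roofline-style estimate: attainable rate = min(compute peak, bandwidth * intensity).
# Every number here is an illustrative assumption, not a measurement from the paper.

def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Upper bound on throughput for a kernel with the given arithmetic intensity."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)

PER_CHANNEL_GBS = 25.6   # assumed DDR4-3200: 3200 MT/s * 8 bytes per transfer
FLOPS_PER_BYTE = 0.25    # assumed low intensity, i.e. a memory-bound kernel
PEAK_GFLOPS = 1000.0     # hypothetical compute peak, held equal in both cases

one_channel = attainable_gflops(PEAK_GFLOPS, 1 * PER_CHANNEL_GBS, FLOPS_PER_BYTE)
eight_channels = attainable_gflops(PEAK_GFLOPS, 8 * PER_CHANNEL_GBS, FLOPS_PER_BYTE)
ratio = eight_channels / one_channel

print(f"1 channel : {one_channel:.1f} GFLOP/s")
print(f"8 channels: {eight_channels:.1f} GFLOP/s")
print(f"ratio     : {ratio:.1f}x")
```

Under these assumptions the compute peak never comes into play, so the bound scales directly with the number of populated channels. Cache effects would stack on top of that, which is why the missing memory details matter.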
Figure 13 (c) “CPUs”: is this implying more than one CPU, and therefore a third CPU hardware configuration?
If the FPGA could implement a fully pipelined model, the results would have been over 100x faster again for aggregate throughput, but single-response latency would be far slower than a GPU. The FPGAs are just not big enough yet…
The FPGA results give a speech recognition rate of 15.8 days of audio per second, which, if spread over say 12 hours of “awake” time per day, means 1 month per second. This would then imply that the FPGA can recognise a lifetime of spoken-word audio in under 16 minutes. Why do we think this is slow?
Also, the energy to recognise a lifetime’s audio is then 54Wh, which is less energy than is stored in my flashlight (based on the 215W spec sheet figure for the U250 and my flashlight with 3 x 26650 19.5Wh cells). The typical Alexa device consumes that in just over a day on standby (at 2W)… hmmm… so… the recognition compute energy is then 0.00392% of the device footprint.
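For anyone wanting to check the arithmetic, here is a rough sketch of the calculation. The 80-year lifetime is my assumption for the sketch (no lifetime is stated above), so the outputs land near the quoted figures rather than exactly on them; the other inputs are the numbers quoted above.

```python
# Back-of-envelope check of the lifetime-audio figures.
DAYS_AUDIO_PER_SECOND = 15.8   # FPGA recognition rate quoted above
AWAKE_HOURS_PER_DAY = 12       # "awake" time assumed above
LIFETIME_YEARS = 80            # ASSUMPTION: not stated in the comment
FPGA_WATTS = 215               # U250 spec sheet figure quoted above
STANDBY_WATTS = 2              # standby draw quoted above
DAYS_PER_YEAR = 365.25

# Days of audio in a lifetime, counting only awake hours.
awake_days = LIFETIME_YEARS * DAYS_PER_YEAR * AWAKE_HOURS_PER_DAY / 24

# Time and energy for the FPGA to process that audio.
seconds = awake_days / DAYS_AUDIO_PER_SECOND
minutes = seconds / 60
energy_wh = FPGA_WATTS * seconds / 3600

# Standby energy of an always-on device over the same lifetime.
standby_wh = STANDBY_WATTS * LIFETIME_YEARS * DAYS_PER_YEAR * 24
hours_of_standby = energy_wh / STANDBY_WATTS
share_percent = 100 * energy_wh / standby_wh

print(f"processing time : {minutes:.1f} minutes")
print(f"FPGA energy     : {energy_wh:.1f} Wh")
print(f"standby equiv.  : {hours_of_standby:.1f} hours at {STANDBY_WATTS} W")
print(f"share of standby: {share_percent:.5f} %")
```

Note that the lifetime cancels out of the final percentage, so that ratio depends only on the recognition rate, the awake fraction and the two power figures.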