Josh Millar just released our latest preprint on how to make sense of the growing number of dedicated, ultra-low-power 'neural network accelerators' found in many modern embedded chipsets. My interest here stems from wanting to decouple low-latency local applications from the cloud, which requires fast tensor operations in hardware. Josh found a huge number of interesting NPUs in modern low-cost chips, ranging from ESP32-based boards to ARM ones. These come with quite a variety of tradeoffs, from the operations supported (which determines which models can run on them) to the amount of memory and CPU power available.

This is the first comparative evaluation and independent benchmarking of several commercially-available micro-NPUs. We developed an open-source model compilation framework to enable consistent benchmarking across diverse hardware, measuring end-to-end performance including latency, power consumption, and memory overhead. The analysis uncovered surprising disparities between hardware specifications and actual performance, including unexpected scaling behaviours as model complexity increases.
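
To give a flavour of what "end-to-end latency" measurement looks like in practice, here is a minimal host-side sketch using the standard TFLite interpreter. This is illustrative only, not the framework from the paper (which targets on-device micro-NPU deployment); the model path, warmup/run counts, and reported percentiles are all placeholder choices.

```python
# Illustrative latency benchmark for a quantised .tflite model on the host.
# Not the paper's framework; just shows the general shape of the measurement.
import time
import statistics

import numpy as np
from tflite_runtime.interpreter import Interpreter  # or tf.lite.Interpreter

MODEL_PATH = "model_int8.tflite"  # placeholder path
WARMUP_RUNS = 10
TIMED_RUNS = 100

interpreter = Interpreter(model_path=MODEL_PATH)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]

# Random input matching the model's expected shape and dtype.
if np.issubdtype(inp["dtype"], np.integer):
    info = np.iinfo(inp["dtype"])
    dummy = np.random.randint(info.min, info.max,
                              size=inp["shape"], dtype=inp["dtype"])
else:
    dummy = np.random.rand(*inp["shape"]).astype(inp["dtype"])

# Warm up caches / lazy allocations before timing.
for _ in range(WARMUP_RUNS):
    interpreter.set_tensor(inp["index"], dummy)
    interpreter.invoke()

# Time individual invocations and report robust statistics.
latencies_ms = []
for _ in range(TIMED_RUNS):
    interpreter.set_tensor(inp["index"], dummy)
    start = time.perf_counter()
    interpreter.invoke()
    latencies_ms.append((time.perf_counter() - start) * 1000)

print(f"median latency: {statistics.median(latencies_ms):.2f} ms")
print(f"p95 latency:    {np.percentile(latencies_ms, 95):.2f} ms")
```

On an actual micro-NPU the same idea applies, but the timing loop runs on-device and power and memory have to be captured externally, which is exactly the kind of consistency the framework is meant to provide.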