Last time I investigated deep learning networks on a Raspberry Pi. They didn’t perform very well. The main reason is that deep learning networks are very resource hungry: they need a lot of memory to store their weights and a lot of computing power.

    Is there nothing we can do? Yes, there are some promising ideas for running deep learning networks at a good FPS, even on a Raspberry Pi.

    Let’s go back to basics: the neural node. It is at the very heart of any network.

    The output is the sum of all the inputs, each multiplied by its corresponding weight. The ReLU is a so-called activation function. Even a relatively small and simple network such as AlexNet has 60 million weights, consuming 0.5 Gbyte of memory. It needs roughly 727 MFLOPs to calculate the outcome of a single frame. No match for our Raspberry Pi.
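    In code, a single node is just a multiply-accumulate loop followed by the activation. A minimal sketch in C (a hypothetical function for one node only; real frameworks batch this into matrix operations):

```c
#include <stddef.h>

/* One neural node: the weighted sum of all inputs, passed through ReLU. */
double node(const double *input, const double *weight, size_t n, double bias)
{
    double sum = bias;
    for (size_t i = 0; i < n; i++)
        sum += input[i] * weight[i];   /* multiply-accumulate */
    return sum > 0.0 ? sum : 0.0;      /* ReLU: clamp negatives to zero */
}
```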

    Strategies to improve the performance are based on the idea of decreasing the memory load. As a result, the computational load usually drops too.

    The simplest method is replacing all double-precision floating-point numbers with 8-bit integers. TensorFlow Lite, for instance, uses this technique. The memory load of AlexNet then becomes 60 Mbyte, which a Raspberry Pi can easily allocate. At the same time, the network execution time is roughly halved, because a floating-point multiplication takes about twice as long as an integer one. However, the computational burden remains substantial.
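    A sketch of the underlying idea, assuming symmetric quantization with a single scale factor (TensorFlow Lite’s actual scheme is more elaborate, with per-tensor scales and zero points):

```c
#include <stdint.h>
#include <math.h>

/* Map a float weight onto an int8, with scale = 127.0f / max_abs_weight. */
int8_t quantize(float w, float scale)
{
    float q = roundf(w * scale);
    if (q >  127.0f) q =  127.0f;      /* clamp to the int8 range */
    if (q < -128.0f) q = -128.0f;
    return (int8_t)q;
}
```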

    The second method is pruning. All inputs with a negligible influence on the output are deleted. As expected, the more weights are pruned, the greater the overall error becomes. Below is a graph.
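    In its simplest form, magnitude pruning is nothing more than a thresholding pass over the weights. A sketch (the threshold is an arbitrary example value):

```c
#include <stddef.h>
#include <math.h>

/* Delete (zero out) every weight with a negligible influence. */
size_t prune(float *weight, size_t n, float threshold /* e.g. 0.01f */)
{
    size_t removed = 0;
    for (size_t i = 0; i < n; i++)
        if (fabsf(weight[i]) < threshold) { weight[i] = 0.0f; removed++; }
    return removed;
}
```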

    The big problem with pruning is the software implementation. All input-weight calculations are coded in for-loops. After pruning, the loop becomes gapped, with holes where inputs have been removed. One way or another, this always generates conditional jumps, slowing down the execution. Not to mention GPU acceleration: the architecture of a GPU is optimized for dense matrix operations and handles branching poorly. So this is not the solution for a Raspberry Pi either.
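    The sketch below shows the problem. The dense loop is a clean, vectorizable multiply-accumulate; the pruned version must test its way around the holes, adding a conditional jump on every iteration:

```c
#include <stddef.h>

/* Dense: predictable loop, easy for the compiler to vectorize. */
float dense(const float *x, const float *w, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += x[i] * w[i];
    return sum;
}

/* Pruned: the loop skips the gaps, branching on every iteration. */
float pruned(const float *x, const float *w, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        if (w[i] != 0.0f)              /* the branch that slows things down */
            sum += x[i] * w[i];
    return sum;
}
```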

    There are many other similar techniques, all trying to reduce the number of weights. Some work better than others. However, none of them reduces the execution time significantly, because they all still rely on time-consuming multiplications.

    A completely different approach is replacing each weight with a single bit, giving you a binary neural network (BNN). Below is a picture of such an element.

    Now only the sign of the weight is used. If the initial weight was greater than 0, it becomes +1; otherwise -1. Looking at the calculation, all multiplications are now replaced by additions and subtractions, which are far less time-consuming.
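    A sketch of such a binarized node, with the weight signs stored one per byte for clarity (a real BNN packs them into bits):

```c
#include <stddef.h>
#include <stdint.h>

/* Binary weights: the multiply degenerates into an add or a subtract.
   sign[i] holds 1 for a +1 weight and 0 for a -1 weight. */
float bnn_node(const float *x, const uint8_t *sign, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += sign[i] ? x[i] : -x[i];
    return sum;
}
```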

    A logical next step is to replace all inputs with a single bit as well. At first glance it seems a very radical method with dubious results. However, it appears to perform reasonably well, with only a little loss of accuracy.

    The multiplications are now replaced by the simple logical XNOR operator. Below is the truth table of an XNOR.
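    In text form:

    A   B   A XNOR B
    0   0      1
    0   1      0
    1   0      0
    1   1      1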

    When the -1s are represented by zeros in the software, the multiplication works out correctly. Hence the name of the network: an XNOR network.

    A single 64-bit XNOR on the Raspberry Pi now gives you 64 multiplications in one instruction. That speeds things up!

    Below is an overview of the three types.

    The XNOR operation between the inputs and their weights results in a binary word. The number of 1s in this word forms the output. Finally, a threshold is applied to get the new binary input for the next layer.
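    A sketch of one such node on a 64-bit machine, assuming 64 inputs packed into a single word with 0 encoding -1 and 1 encoding +1 (__builtin_popcountll is a GCC/Clang builtin):

```c
#include <stdint.h>

/* 64 binary multiplications in a few instructions: XNOR the packed
   inputs with the packed weights, then count the 1s. With n bits, the
   signed sum of the +1/-1 products equals 2 * popcount - n. */
int xnor_node(uint64_t x, uint64_t w)
{
    uint64_t prod = ~(x ^ w);               /* XNOR = NOT(XOR) */
    int ones = __builtin_popcountll(prod);  /* population count */
    return 2 * ones - 64;                   /* back to a signed sum */
}

/* Threshold to get the binary input for the next layer (0 assumed). */
int binarize(int sum) { return sum >= 0; }
```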

    XNOR multiplications at work

    This XNOR technique is very promising. It gives good results at a nice FPS, even on a Raspberry Pi.

    Much more information about this topic and initial software can be found here: https://qengineering.eu/deep-learning-with-fpga-aka-bnn.html

    See also part 1 on Hackster.io: https://www.hackster.io/tinus-treuzel/deep-learning-with-raspberry-pi-explored-5fa573

    This time the picture above is the Rainbow flat in Hong Kong (22° 20’ 6’’ N, 114° 12’ 24’’ E). It just looks like a floorplan of an FPGA with its LUTs and L2 cache in the middle.

    Enjoy!