Note: this tool performs post-training quantization. To do quantization-aware training, follow the instructions here.
Because the MNN training framework is still immature, if you cannot train your model with MNN and the offline quantization tool causes a large drop in model accuracy, you can use the MNNPythonOfflineQuant tool instead, which may solve your problem.
Advantages of quantization
Quantization can accelerate the forward speed of the model by converting the floating point computations in the original model into int8 computations. At the same time, it compresses the original model by approximately 4X by quantizing the float32 weights into int8 weights.
Compile
Compile macro
In order to build the quantization tool, set MNN_BUILD_QUANTOOLS=true when compiling.
Compile outputs
Quantization tool: quantized.out
Comparison tool (between the floating point model and the int8 quantized model): testQuanModel.out
Usage
Command
./quantized.out origin.mnn quan.mnn imageInputConfig.json
The first argument is the path of the floating point model to be quantized.
The second argument is the path where the quantized model will be saved.
The third argument is the path of the config json file. You can refer to the template json file.
JSON config file
format
Images are read in RGBA format, then converted to the target format specified by format.
Options: “RGB”, “BGR”, “RGBA”, “GRAY”
mean, normal
The same as the ImageProcess config:
dst = (src - mean) * normal
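As an illustration of this formula, here is a minimal sketch of the preprocessing that calibration images go through. It is not the tool's internal code; the mean/normal values, image file name, and input size are placeholder assumptions, so substitute the values your model was trained with.

```python
import numpy as np
from PIL import Image

# Placeholder values; use the mean/normal and size your model expects.
MEAN = np.array([127.5, 127.5, 127.5], dtype=np.float32)
NORMAL = np.array([1.0 / 127.5] * 3, dtype=np.float32)
WIDTH, HEIGHT = 224, 224

img = Image.open("example.jpg").convert("RGB").resize((WIDTH, HEIGHT))
src = np.asarray(img, dtype=np.float32)   # shape (H, W, 3), values 0 ~ 255
dst = (src - MEAN) * NORMAL               # dst = (src - mean) * normal
```

With these placeholder values, a pixel value of 255 maps to 1.0 and a value of 0 maps to -1.0.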
width, height
Input width and height of the floating point model
path
Path to images that are used for calibrating feature quantization scale factors.
used_image_num
Specify the number of images used for calibration.
Default: use all the images under path.
Note: please confirm that the data produced by transforming the images with the steps above is exactly the data that is fed to the model input.
feature_quantize_method
Specify method used to compute feature quantization scale factor.
Options:
- “KL”: use the KL divergence method; generally needs 100 ~ 1000 images.
- “ADMM”: use the ADMM (Alternating Direction Method of Multipliers) method to iteratively search for the optimal feature quantization scale factors; generally needs one batch of images.
- “EMA”: use an exponential moving average to calculate the quantization factors for features. This method uses asymmetric quantization and may give better accuracy; it is the underlying method of MNNPythonOfflineQuant. It is recommended to keep BatchNorm in your pb or onnx file, convert the model to MNN with the --forTraining option of MNNConvert, and then quantize this MNN model (with BatchNorm) using the EMA method. In addition, when using this method, batch_size in the json file should be set close to the batch size used in training.
Default: “KL”
weight_quantize_method
Specify weight quantization method
Options:
- “MAX_ABS”: use the max absolute value of the weights to do symmetric quantization.
- “ADMM”: use the ADMM method to iteratively find the optimal quantization of the weights.
Default: “MAX_ABS”
Users can experiment with the feature and weight quantization methods above and choose the combination that gives the best accuracy.
feature_clamp_value
Specify the max range of quantized features; the default is 127. Features are quantized symmetrically into [-feature_clamp_value, feature_clamp_value]. Sometimes there is overflow (many feature values quantized to -127 or 127), which can hurt model accuracy. You can decrease feature_clamp_value to a smaller number, such as 120, to reduce overflow and thus improve the accuracy of the quantized model. Note that reducing this clamp value too much will cause a large drop in model accuracy.
weight_clamp_value
Specify the max range of quantized weights; the default is 127. This option works similarly to feature_clamp_value. Note that the precision of weight quantization is more important to model accuracy than that of feature quantization, so if you want to reduce overflow, adjust feature_clamp_value first and check the result.
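For intuition, the sketch below illustrates symmetric max-abs quantization with a clamp value. It is only an illustration, not the tool's internal implementation; in particular, the feature scale factors are found by the calibration methods above (KL, ADMM, EMA) rather than by max-abs.

```python
import numpy as np

def quantize_symmetric(x, clamp_value=127):
    """Symmetric max-abs quantization of a float tensor into [-clamp_value, clamp_value]."""
    scale = np.abs(x).max() / clamp_value                 # MAX_ABS-style scale
    q = np.clip(np.round(x / scale), -clamp_value, clamp_value).astype(np.int8)
    return q, scale

weights = np.random.randn(32, 3, 3, 3).astype(np.float32)  # dummy conv weights
q_weights, scale = quantize_symmetric(weights, clamp_value=127)
# Dequantized approximation used at inference time: q_weights * scale
```

Lowering clamp_value shrinks the representable range, which reduces saturation at the extremes at the cost of coarser resolution, which is the trade-off described above.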
skip_quant_op_names
Specify the names of conv ops you do not want to quantize. The precision of some ops is very important to model accuracy, such as the first conv layer, which extracts the basic features. You can use this option to skip quantization of these ops. You can use netron to visualize the mnn model to get the op names.
batch_size
The batch size used by the EMA feature quantization method; it should be set close to the batch size used in training.
debug
Whether or not to show debug information; the default is false. The debug info includes the cosine distance between the tensors of the original model and the quantized model, and the overflow rate of each conv op.
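Putting the fields together, the sketch below writes a complete imageInputConfig.json. All values (the calibration image directory, input size, mean/normal, op names, and so on) are placeholder assumptions; refer to the template json file shipped with the tool for the authoritative example.

```python
import json

# Placeholder configuration; adjust every value to your own model and data.
config = {
    "format": "RGB",
    "mean": [127.5, 127.5, 127.5],
    "normal": [0.00784314, 0.00784314, 0.00784314],
    "width": 224,
    "height": 224,
    "path": "path/to/calibration/images/",
    "used_image_num": 500,
    "feature_quantize_method": "KL",
    "weight_quantize_method": "MAX_ABS",
    "feature_clamp_value": 127,
    "weight_clamp_value": 127,
    "skip_quant_op_names": ["conv1"],
    "batch_size": 32,          # only used by the EMA method
    "debug": False,
}

with open("imageInputConfig.json", "w") as f:
    json.dump(config, f, indent=4)
```

The resulting file is then passed as the third argument: ./quantized.out origin.mnn quan.mnn imageInputConfig.json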
Usage of quantized model
The same as the floating point model: the inputs and outputs of the quantized model are also floating point.
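For example, the quantized model can be run exactly like the original one with MNN's Python inference API. This is a minimal sketch: the model path and input shape are assumptions, and the interface names follow MNN's Python examples, so verify them against your installed version.

```python
import numpy as np
import MNN

interpreter = MNN.Interpreter("quan.mnn")          # quantized model from quantized.out
session = interpreter.createSession()
input_tensor = interpreter.getSessionInput(session)

# Inputs stay float32 even though weights/features are quantized internally.
data = np.random.rand(1, 3, 224, 224).astype(np.float32)
tmp_input = MNN.Tensor((1, 3, 224, 224), MNN.Halide_Type_Float,
                       data, MNN.Tensor_DimensionType_Caffe)
input_tensor.copyFrom(tmp_input)

interpreter.runSession(session)
output_tensor = interpreter.getSessionOutput(session)
print(output_tensor.getData())                     # float32 outputs
```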
References
Extremely Low Bit Neural Network: Squeeze the Last Bit Out with ADMM https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/viewFile/16767/16728