Note: this tool performs post-training quantization. To do quantization-aware training, follow the instructions here.
Because the MNN training framework is still immature, if you cannot train your model with MNN and the offline quantization tool causes a large drop in model accuracy, you can use the MNNPythonOfflineQuant tool, which may solve your problem.

Advantages of quantization

Quantization can accelerate the forward speed of a model by converting the floating point computations in the original model into int8 computations. At the same time, it compresses the original model by approximately 4X by quantizing the float32 weights into int8 weights.

Compile

Compile macro

In order to build the quantization tool, set MNN_BUILD_QUANTOOLS=true when compiling.

Compile outputs

Quantization tool: quantized.out
Comparison tool (between the floating point model and the int8 quantized model): testQuanModel.out

Usage

Command

  ./quantized.out origin.mnn quan.mnn imageInputConfig.json

The first argument is the path of the floating point model to be quantized.
The second argument is the path where the quantized model will be saved.
The third argument is the path of the config JSON file; you can refer to the template JSON file.
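
For reference, a minimal config might look like the sketch below. The field names match the options described in the next section; all values (input size, normalization constants, calibration image directory) are illustrative placeholders for your own model, not prescribed defaults.

  {
    "format": "RGB",
    "mean": [127.5, 127.5, 127.5],
    "normal": [0.00784314, 0.00784314, 0.00784314],
    "width": 224,
    "height": 224,
    "path": "./calibration_images/",
    "used_image_num": 500,
    "feature_quantize_method": "KL",
    "weight_quantize_method": "MAX_ABS"
  }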

Json config file

format

Images are read in RGBA format, then converted to the target format specified by format.
Options: “RGB”, “BGR”, “RGBA”, “GRAY”

mean normal

The same as the ImageProcess config:
dst = (src - mean) * normal
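
For example, with a per-channel mean of 127.5 and normal of 1/127.5 ≈ 0.00784 (illustrative values, not defaults), a pixel value of 255 becomes (255 - 127.5) * 0.00784 ≈ 1.0 and a pixel value of 0 becomes about -1.0, i.e. the input is normalized to roughly [-1, 1]:

  {
    "mean": [127.5, 127.5, 127.5],
    "normal": [0.00784314, 0.00784314, 0.00784314]
  }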

width, height

Input width and height of the floating point model

path

Path to images that are used for calibrating feature quantization scale factors.

used_image_num

Specify the number of images used for calibration.
Default: use all the images under path.

Note: please confirm that the data obtained after the images are transformed by the above preprocessing steps is exactly the data that is fed into the model input.

feature_quantize_method

Specify the method used to compute feature quantization scale factors.
Options:

  1. “KL”: use the KL divergence method; generally needs 100 ~ 1000 images.
  2. “ADMM”: use the ADMM (Alternating Direction Method of Multipliers) method to iteratively search for optimal feature quantization scale factors; generally needs one batch of images.
  3. “EMA”: use an exponential moving average to calculate the feature quantization scale factors. This method uses asymmetric quantization and may give better accuracy; it is the underlying method of MNNPythonOfflineQuant. It is recommended to keep BatchNorm in your pb or onnx file, convert your model to MNN with the --forTraining option of MNNConvert, and then quantize this MNN model (with BatchNorm) using the EMA method. In addition, when using this method, the batch_size in the JSON config should be set close to the batch size used during training (see the sketch below).

default: “KL”
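
For example, a sketch of the relevant fields for the EMA method (the batch size of 32 is an illustrative placeholder; use a value close to your training batch size):

  {
    "feature_quantize_method": "EMA",
    "batch_size": 32
  }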

weight_quantize_method

Specify the weight quantization method.
Options:

  1. “MAX_ABS”: use the max absolute value of the weights to do symmetric quantization.
  2. “ADMM”: use the ADMM method to iteratively search for the optimal quantization of the weights.

default: “MAX_ABS”

Users can experiment with the above feature and weight quantization methods and choose the combination that gives the best result.
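
For example, one combination worth trying (illustrative only) is ADMM for both features and weights:

  {
    "feature_quantize_method": "ADMM",
    "weight_quantize_method": "ADMM"
  }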

feature_clamp_value

Specify the max range of quantized features; the default is 127. Features are quantized symmetrically into [-feature_clamp_value, feature_clamp_value]. Sometimes there will be some overflow (many feature values quantized to -127 or 127), which can hurt model accuracy. You can decrease feature_clamp_value to a smaller number, such as 120, to reduce overflow and thus improve the quantized model's accuracy. Note that reducing this clamp value too much will cause a large drop in model accuracy.

weight_clamp_value

Specify the max range of quantized weights; the default is 127. This option works similarly to feature_clamp_value. Note that the precision of weight quantization is more important to model accuracy than that of feature quantization, so if you want to reduce overflow, adjust feature_clamp_value first and check the result.
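
For example, a sketch that tightens the feature range slightly to reduce overflow while leaving the weights at the full range (values are illustrative):

  {
    "feature_clamp_value": 120,
    "weight_clamp_value": 127
  }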

skip_quant_op_names

Specify the names of the conv ops you don't want to quantize. The precision of some ops is very important to the model's accuracy, such as the first conv layer, which acts as the basic feature extractor; you can use this option to skip quantization of these ops. You can use netron to visualize the MNN model and get the ops' names.
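
For example, to keep the first conv layer in floating point (the op name below is hypothetical; use the actual names shown by netron for your model):

  {
    "skip_quant_op_names": ["conv1"]
  }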

batch_size

The batch size used for the EMA feature quantization method; it should be set close to the batch size used during training.

debug

Whether or not to show debug information; the default is false. Debug info includes the cosine distance between corresponding tensors of the original and quantized models, and the overflow rate of each conv op.
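
For example, assuming the option takes a JSON boolean, debug output can be enabled with:

  {
    "debug": true
  }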

Usage of quantized model

The same as for the floating point model. The inputs and outputs of the quantized model are also floating point.

References

Extremely Low Bit Neural Network: Squeeze the Last Bit Out with ADMM https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/viewFile/16767/16728