Distillation training basics

The general idea of distillation is to transfer the knowledge learned by one model to another, much like a teacher teaching a student, so the former is called the teacher model and the latter the student model. When the student model is smaller than the teacher model, distillation also serves as a model compression method. Hinton proposed the idea of distillation in 2015; for the specific method, please refer to the paper:
Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. “Distilling the knowledge in a neural network.” arXiv preprint arXiv:1503.02531 (2015).

MNN distillation training example

Taking quantization-aware distillation training of MobilenetV2 as an example, let's look at how to do distillation training in MNN. The relevant code is in ``MNN_ROOT/tools/train/source/demo/distillTrainQuant.cpp``.
According to the distillation algorithm, we need to extract the logits that feed the Softmax node of the model, apply the temperature parameter, and then compute the distillation loss used for training.
Note that this demo requires the MNN model of MobilenetV2.
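
Concretely, writing the student logits as z_s, the teacher logits as z_t, the temperature as T, the one-hot true label as y, and the soft-target weight as alpha (these symbols are notation for this explanation only, not identifiers in the MNN code), the loss computed later in this demo is essentially the formulation from the Hinton paper, with the KL term written in the same argument order as the MNN call shown below:

    L = \alpha \, T^{2} \, \mathrm{KL}\left(\mathrm{softmax}(z_s / T) \,\|\, \mathrm{softmax}(z_t / T)\right) + (1 - \alpha) \, \mathrm{CE}\left(\mathrm{softmax}(z_s),\; y\right)

The T^2 factor compensates for the gradient scaling introduced by dividing the logits by the temperature, so the soft-target term stays on a comparable scale to the hard-label term.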

    // distillTrainQuant.cpp
    ......
    // Reads the teacher MNN model.
    auto varMap = Variable::loadMap(argv[1]);
    if (varMap.empty()) {
        MNN_ERROR("Can not load model %s\n", argv[1]);
        return 0;
    }
    ......
    // Gets the inputs and outputs of the teacher model.
    auto inputOutputs = Variable::getInputAndOutput(varMap);
    auto inputs = Variable::mapToSequence(inputOutputs.first);
    MNN_ASSERT(inputs.size() == 1);
    // Input node of the teacher model.
    auto input = inputs[0];
    std::string inputName = input->name();
    auto inputInfo = input->getInfo();
    MNN_ASSERT(nullptr != inputInfo && inputInfo->order == NC4HW4);
    // Output node of the teacher model.
    auto outputs = Variable::mapToSequence(inputOutputs.second);
    std::string originOutputName = outputs[0]->name();
    // The node before Softmax in the teacher model, i.e. the logits.
    std::string nodeBeforeSoftmax = "MobilenetV2/Predictions/Reshape";
    auto lastVar = varMap[nodeBeforeSoftmax];
    std::map<std::string, VARP> outputVarPair;
    outputVarPair[nodeBeforeSoftmax] = lastVar;
    // Extracts the part of the model from the input node to the logits output.
    auto logitsOutput = Variable::mapToSequence(outputVarPair);
    {
        auto exe = Executor::getGlobalExecutor();
        BackendConfig config;
        exe->setGlobalExecutorConfig(MNN_FORWARD_CPU, config, 4);
    }
    // Converts the original model (from the input to the logits) into a trainable float model.
    std::shared_ptr<Module> model(PipelineModule::extract(inputs, logitsOutput, true));
    // Converts the above model into a quantized model.
    PipelineModule::turnQuantize(model.get(), bits);
    // The original model is not trained; it only performs forward inference.
    std::shared_ptr<Module> originModel(PipelineModule::extract(inputs, logitsOutput, false));
    // Begin training.
    _train(originModel, model, inputName, originOutputName);
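
If you adapt this demo to another network and do not know the name of the node before Softmax, one simple way to find it is to print every variable name in the loaded graph and pick out the logits node (a model visualizer such as Netron also works). A minimal sketch, which only assumes varMap is the std::map<std::string, VARP> returned by Variable::loadMap above:

    // Illustrative only: lists all node names so the input of Softmax can be located.
    for (const auto& iter : varMap) {
        MNN_PRINT("%s\n", iter.first.c_str());
    }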

OK, the above demonstrates how to obtain the logits output and convert the model into a quantization-aware training model. Let's now take a look at the key part of the training code.

    // A forward pass during training.
    // Converts the input data into the NC4HW4 format used internally by MNN.
    auto nc4hw4example = _Convert(example, NC4HW4);
    // The forward pass of the teacher model, which gives the logits of the teacher.
    auto teacherLogits = origin->forward(nc4hw4example);
    // The forward pass of the student model, which gives the logits of the student.
    auto studentLogits = optmized->forward(nc4hw4example);
    // Computes the one-hot vector of the label.
    auto labels = trainData[0].second[0];
    const int addToLabel = 1;
    auto newTarget = _OneHot(_Cast<int32_t>(_Squeeze(labels + _Scalar<int32_t>(addToLabel), {})),
                             _Scalar<int>(1001), _Scalar<float>(1.0f),
                             _Scalar<float>(0.0f));
    // Uses the logits of the teacher and the student, together with the true label, to calculate the loss.
    // Temperature T = 20, softTargets loss coefficient alpha = 0.9.
    VARP loss = _DistillLoss(studentLogits, teacherLogits, newTarget, 20, 0.9);
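
A note on the one-hot target: the depth is 1001 and the label is shifted by addToLabel = 1, presumably because the MobilenetV2 checkpoint reserves output index 0 for a background class, so dataset labels 0-999 map to output indices 1-1000. A standalone sketch of the same construction on plain integers, independent of MNN and purely for illustration:

    #include <vector>

    // Builds the same one-hot target as the _OneHot call above:
    // onValue = 1.0f, offValue = 0.0f, label shifted by addToLabel.
    std::vector<float> makeOneHot(int label, int depth = 1001, int addToLabel = 1) {
        std::vector<float> target(depth, 0.0f);
        target[label + addToLabel] = 1.0f;
        return target;
    }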

Let's take a look at how the distillation loss is calculated. The code is in ``MNN_ROOT/tools/train/source/optimizer/Loss.cpp``.

    // Loss.cpp
    Express::VARP _DistillLoss(Express::VARP studentLogits, Express::VARP teacherLogits, Express::VARP oneHotTargets, const float temperature, const float alpha) {
        auto info = teacherLogits->getInfo();
        if (info->order == NC4HW4) {
            teacherLogits = _Convert(teacherLogits, NCHW);
            studentLogits = _Convert(studentLogits, NCHW);
        }
        MNN_ASSERT(studentLogits->getInfo()->dim.size() == 2);
        MNN_ASSERT(studentLogits->getInfo()->dim == teacherLogits->getInfo()->dim);
        MNN_ASSERT(studentLogits->getInfo()->dim == oneHotTargets->getInfo()->dim);
        MNN_ASSERT(alpha >= 0 && alpha <= 1);
        // Computes the soft targets of the teacher model at the given temperature.
        auto softTargets = _Softmax(teacherLogits * _Scalar(1 / temperature));
        // Computes the prediction of the student model at the given temperature.
        auto studentPredict = _Softmax(studentLogits * _Scalar(1 / temperature));
        // The loss on the soft targets.
        auto loss1 = _Scalar(temperature * temperature) * _KLDivergence(studentPredict, softTargets);
        // The loss on the true label.
        auto loss2 = _CrossEntropy(_Softmax(studentLogits), oneHotTargets);
        // The total loss is a weighted sum of the two losses above.
        auto loss = _Scalar(alpha) * loss1 + _Scalar(1 - alpha) * loss2;
        return loss;
    }
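
To check the arithmetic without building MNN, here is a minimal standalone C++ sketch of the same computation on a single example with plain float vectors. All names in it are illustrative; it mirrors the structure of _DistillLoss above (softmax with temperature, a KL term scaled by T^2, cross-entropy on the hard label, weighted by alpha). The KL direction simply follows the argument order of the call above; MNN's _KLDivergence may define its arguments differently, so treat this as an illustration rather than a bit-exact reproduction.

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Softmax of logits scaled by 1 / T (numerically stabilized by subtracting the max).
    static std::vector<float> softmaxT(const std::vector<float>& logits, float T) {
        std::vector<float> out(logits.size());
        float maxVal = *std::max_element(logits.begin(), logits.end());
        float sum = 0.0f;
        for (size_t i = 0; i < logits.size(); ++i) {
            out[i] = std::exp((logits[i] - maxVal) / T);
            sum += out[i];
        }
        for (auto& v : out) {
            v /= sum;
        }
        return out;
    }

    // alpha * T^2 * KL(studentPredict || softTargets) + (1 - alpha) * CE(studentProb, oneHot)
    static float distillLoss(const std::vector<float>& studentLogits,
                             const std::vector<float>& teacherLogits,
                             const std::vector<float>& oneHotTarget,
                             float T, float alpha) {
        auto softTargets    = softmaxT(teacherLogits, T);    // teacher soft targets
        auto studentPredict = softmaxT(studentLogits, T);    // student prediction at temperature T
        auto studentProb    = softmaxT(studentLogits, 1.0f); // student prediction at temperature 1
        const float eps = 1e-12f;
        float kl = 0.0f, ce = 0.0f;
        for (size_t i = 0; i < studentLogits.size(); ++i) {
            kl += studentPredict[i] * std::log((studentPredict[i] + eps) / (softTargets[i] + eps));
            ce -= oneHotTarget[i] * std::log(studentProb[i] + eps);
        }
        return alpha * T * T * kl + (1.0f - alpha) * ce;
    }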