0. Overview

Before adding an Op, please refer to the Op Manual to avoid unnecessary duplication.

In MNN, adding an Op consists of the following steps:

  1. Add model description
  2. Add model conversion
  3. Add shape calculation
  4. Add implementation

1. Add model description

After modifying the model description, you need to run the generate script to regenerate the model description header file.

Add Op Type

Append the operator name to the OpType list in schema/default/MNN.fbs, such as:

```
enum OpType : int {
    AbsVal,
    QuantizedAdd,
    ...
    MyCustomOp
}
```

Add Op Parameter

If the operator does not contain parameters, you can skip this step.

First, append the operator parameter name to the OpParameter list in schema/default/MNN.fbs, such as:

```
union OpParameter {
    QuantizedAdd,
    ArgMax,
    AsString,
    ...
    MyCustomOpParam
}
```

Then add a parameter description. If the operator is from Caffe, choose CaffeOps.fbs; if the operator is from TensorFlow, use TensorflowOp.fbs.

```
table MyCustomOpParam {
    padX:int;
    padY:int;
    kernelX:int;
    kernelY:int;
    strideX:int;
    strideY:int;
    dataType:DataType=DT_FLOAT;
}
```

2. Add model conversion

After adding the model conversion code, you need to re-run cmake.

Currently, MNN supports converting TensorFlow, TensorFlow Lite, Caffe, and ONNX models.

TensorFlow Model Conversion

  1. Add conversion class
    Add MyCustomOpTf.cpp under tools/converter/source/tensorflow. You can declare the conversion class directly, or you can use a macro definition to simplify the code.

Direct declaration example:

```cpp
class MyCustomOpTf : public tfOpConverter {
public:
    virtual void run(MNN::OpT *dstOp, TmpNode *srcNode, TmpGraph *tempGraph);
    MyCustomOpTf() {}
    virtual ~MyCustomOpTf() {}
    virtual MNN::OpType opType();
    virtual MNN::OpParameter type();
};
```

Equivalent macro definition example:

```cpp
DECLARE_OP_CONVERTER(MyCustomOpTf);
```

You need to implement the run function, the destructor, opType, and type. The run function parses the model's proto to obtain the operator's parameters and assigns them to the custom FlatBuffers parameters. The srcNode argument holds the input and output node information, and the corresponding TmpNode can be found in tempGraph from those input and output nodes. Call find_attr_value(const tensorflow::NodeDef&, const char*, tensorflow::AttrValue&) to read the value of a given attribute.
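
A minimal sketch of such a converter is shown below. It assumes that TmpNode exposes the underlying tensorflow::NodeDef as srcNode->tfNode and that the attribute name kernel_x exists on your custom operator; adjust both to your actual sources.

```cpp
// Hypothetical sketch: member/attribute names (srcNode->tfNode, "kernel_x") are assumptions.
void MyCustomOpTf::run(MNN::OpT *dstOp, TmpNode *srcNode, TmpGraph *tempGraph) {
    auto param = new MNN::MyCustomOpParamT; // FlatBuffers object-API struct generated from MyCustomOpParam

    tensorflow::AttrValue value;
    // read an integer attribute from the TensorFlow node
    if (find_attr_value(srcNode->tfNode, "kernel_x", value)) {
        param->kernelX = value.i();
    }

    dstOp->main.value = param; // attach the custom parameter to the Op
}

MNN::OpType MyCustomOpTf::opType() {
    return MNN::OpType_MyCustomOp;
}
MNN::OpParameter MyCustomOpTf::type() {
    return MNN::OpParameter_MyCustomOpParam;
}
```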

Register conversion class:

```cpp
REGISTER_CONVERTER(MyCustomOpTf, MyCustomOp);
```

  2. Add mapping
    Add the corresponding TensorFlow Op name to the MNN Op name mapping in OpMapper.hpp:

```cpp
{"OpName1", MNN::OpType_MyCustomOp},
{"OpName2", MNN::OpType_MyCustomOp},
```

  3. Handling Op with Const
    If Const is not treated as a parameter of the Op but as a separate Op, you can ignore this step. If Const is treated as a parameter of the Op, modify the function _genMinGraph() in TmpGraph.cpp and set the isCovered property of the corresponding Const node to true.

TensorFlow Lite Model Conversion

  1. Add conversion class
    Add MyCustomOpTflite.cpp under tools/converter/source/tflite.

Macro definition example:

```cpp
DECLARE_OP_COVERTER(MyCustomOpTflite);
```

The following functions need to be implemented:

```cpp
MyCustomOpTflite::opType(bool quantizedModel);
MyCustomOpTflite::type(bool quantizedModel);
MyCustomOpTflite::run(MNN::OpT *dstOp,
                      const std::unique_ptr<tflite::OperatorT> &tfliteOp,
                      const std::vector<std::unique_ptr<tflite::TensorT> > &tfliteTensors,
                      const std::vector<std::unique_ptr<tflite::BufferT> > &tfliteModelBuffer,
                      const std::vector<std::unique_ptr<tflite::OperatorCodeT> > &tfliteOpSet,
                      bool quantizedModel)
```

Compared with the TensorFlow version, the run function takes an extra quantizedModel parameter. If quantizedModel is true, the model is a quantized model and must be converted to the corresponding quantized Op; if it is false, it is converted to the floating-point Op. In the run function, you also need to set the indexes of the input and output tensors:

```cpp
// set input/output tensor indexes
dstOp->inputIndexes.resize(1);
dstOp->outputIndexes.resize(1);
dstOp->inputIndexes[0]  = tfliteOp->inputs[0];
dstOp->outputIndexes[0] = tfliteOp->outputs[0];
```

Register conversion class:

```cpp
using namespace tflite;
REGISTER_CONVERTER(MyCustomOpTflite, BuiltinOperator_OPName);
```

Caffe Model Conversion

  1. Add conversion class
    Add MyCustomOp.cpp under tools/converter/source/caffe.

Class declaration example:

```cpp
class MyCustomOp : public OpConverter {
public:
    virtual void run(MNN::OpT* dstOp,
                     const caffe::LayerParameter& parameters,
                     const caffe::LayerParameter& weight);
    MyCustomOp() {}
    virtual ~MyCustomOp() {}
    virtual MNN::OpType opType();
    virtual MNN::OpParameter type();
};
```

Implement the run, opType, and type functions. In run, parse the Caffe layer definition to obtain the concrete parameters: the parameters argument carries the Op's parameter information, while weight carries data such as convolution weights or BN statistics.
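
A minimal sketch of run is shown below. It assumes the custom layer reuses Caffe's convolution_param for its values; the field access is purely illustrative.

```cpp
// Hypothetical sketch: the source of each value (convolution_param, pad) is an assumption.
void MyCustomOp::run(MNN::OpT* dstOp,
                     const caffe::LayerParameter& parameters,
                     const caffe::LayerParameter& weight) {
    auto param = new MNN::MyCustomOpParamT;
    auto& conv  = parameters.convolution_param();         // illustrative source of the values
    param->padX = conv.pad_size() > 0 ? conv.pad(0) : 0;  // first pad entry, if present
    param->padY = param->padX;
    dstOp->main.value = param;                             // attach the custom parameter
}

MNN::OpType MyCustomOp::opType() {
    return MNN::OpType_MyCustomOp;
}
MNN::OpParameter MyCustomOp::type() {
    return MNN::OpParameter_MyCustomOpParam;
}
```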

Register conversion class:

```cpp
static OpConverterRegister<MyCustomOp> a("MyCustomOp");
```

ONNX Model Conversion

  1. Add conversion class
    Add MyCustomOpOnnx.cpp under tools/converter/source/onnx.

Macro definition example:

```cpp
DECLARE_OP_CONVERTER(MyCustomOpOnnx);
```

The following functions need to be implemented:

```cpp
MNN::OpType MyCustomOpOnnx::opType();
MNN::OpParameter MyCustomOpOnnx::type();
void MyCustomOpOnnx::run(MNN::OpT* dstOp,
                         const onnx::NodeProto* onnxNode,
                         std::vector<const onnx::TensorProto*> initializers);
```

In the run function, onnxNode carries the original ONNX node information; weights and other constant data must be taken from initializers.
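
A minimal sketch of run is shown below. The attribute name kernel_x is illustrative; read whatever attributes and initializers your operator actually defines.

```cpp
// Hypothetical sketch: attribute names and the use of initializers are assumptions.
void MyCustomOpOnnx::run(MNN::OpT* dstOp,
                         const onnx::NodeProto* onnxNode,
                         std::vector<const onnx::TensorProto*> initializers) {
    auto param = new MNN::MyCustomOpParamT;
    // read integer attributes from the ONNX node
    for (int i = 0; i < onnxNode->attribute_size(); ++i) {
        const auto& attr = onnxNode->attribute(i);
        if (attr.name() == "kernel_x") {
            param->kernelX = static_cast<int>(attr.i());
        }
    }
    // constant data (e.g. weights), if any, comes from the initializers vector
    dstOp->main.value = param;
}

MNN::OpType MyCustomOpOnnx::opType() {
    return MNN::OpType_MyCustomOp;
}
MNN::OpParameter MyCustomOpOnnx::type() {
    return MNN::OpParameter_MyCustomOpParam;
}
```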

Register conversion class:

```cpp
REGISTER_CONVERTER(MyCustomOpOnnx, MyCustomOp);
```

3. Add shape calculation

After adding the shape calculation code, you need to re-run cmake.

  1. Add calculation class
    Add ShapeMyCustomOp.cpp to source/shape.

```cpp
class MyCustomOpSizeComputer : public SizeComputer {
public:
    virtual bool onComputeSize(const MNN::Op* op, const std::vector<Tensor*>& inputs,
                               const std::vector<Tensor*>& outputs) const override {
        // set tensor->buffer.type
        //            .dimensions
        //            .dim[x].extent
        //            .dim[x].stride
        //            .dim[x].flag
        return true;
    }
    virtual float onComputeFlops(const MNN::Op* op,
                                 const std::vector<Tensor*>& inputs,
                                 const std::vector<Tensor*>& outputs) const {
        return flops_for_calc_output_from_input;
    }
};
```

    In onComputeSize, calculate the dimension information of the output tensors from that of the input tensors and set the output tensors' data type. Return true if the calculation succeeds; return false if the input dimension information is unknown (see the sketch below).
    In onComputeFlops, return the total amount of computation based on the input and output tensor dimensions.
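
For example, an element-wise custom Op whose output has the same shape and data type as its input could compute its size as follows; this is a minimal sketch, not taken from the MNN sources.

```cpp
virtual bool onComputeSize(const MNN::Op* op, const std::vector<Tensor*>& inputs,
                           const std::vector<Tensor*>& outputs) const override {
    auto& input  = inputs[0]->buffer();
    auto& output = outputs[0]->buffer();
    if (input.dimensions == 0) {
        return false; // input shape is not known yet
    }
    output.type       = input.type;        // same data type as the input
    output.dimensions = input.dimensions;  // same rank
    for (int i = 0; i < input.dimensions; ++i) {
        output.dim[i].extent = input.dim[i].extent; // copy each extent
    }
    return true;
}
```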

Register calculation class:

```cpp
REGISTER_SHAPE(MyCustomOpSizeComputer, OpType_MyCustomOp);
```

4. Add Implementation

Add CPU Implementation

Add CPUMyCustomOp.hpp and CPUMyCustomOp.cpp to source/backend/CPU.

  1. Implementation class declaration

```cpp
class CPUMyCustomOp : public Execution {
public:
    virtual ErrorCode onResize(const std::vector<Tensor *> &inputs,
                               const std::vector<Tensor *> &outputs) override;
    virtual ErrorCode onExecute(const std::vector<Tensor *> &inputs,
                                const std::vector<Tensor *> &outputs) override;
};
```

  2. Implement onResize and onExecute

In onResize, call backend()->onAcquireBuffer(&mCache, Backend::DYNAMIC) to allocate the cache and backend()->onReleaseBuffer(&mCache, Backend::DYNAMIC) to reclaim it; memory released this way can be reused by other Ops.
In onExecute, perform the necessary input checks, which helps catch problems early, and return NO_ERROR when execution completes successfully.
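
A minimal sketch of the two functions is shown below. It assumes a Tensor member mCache used as scratch memory and elides the actual computation; TensorUtils::copyShape is used here only to size the scratch tensor like the input.

```cpp
ErrorCode CPUMyCustomOp::onResize(const std::vector<Tensor *> &inputs,
                                  const std::vector<Tensor *> &outputs) {
    // size the scratch tensor like the input, then acquire dynamic memory for it
    TensorUtils::copyShape(inputs[0], &mCache);
    if (!backend()->onAcquireBuffer(&mCache, Backend::DYNAMIC)) {
        return OUT_OF_MEMORY;
    }
    // release it again so the memory can be reused by later Ops
    backend()->onReleaseBuffer(&mCache, Backend::DYNAMIC);
    return NO_ERROR;
}

ErrorCode CPUMyCustomOp::onExecute(const std::vector<Tensor *> &inputs,
                                   const std::vector<Tensor *> &outputs) {
    if (inputs.empty() || outputs.empty()) {
        return INPUT_DATA_ERROR; // basic input check
    }
    // ... actual computation, e.g. on inputs[0]->host<float>() ...
    return NO_ERROR;
}
```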

  3. Register implementation class

```cpp
class CPUMyCustomOpCreator : public CPUBackend::Creator {
public:
    virtual Execution *onCreate(const std::vector<Tensor *> &inputs,
                                const std::vector<Tensor *> &outputs,
                                const MNN::Op *op,
                                Backend *backend) const override {
        return new CPUMyCustomOp(backend);
    }
};
REGISTER_CPU_OP_CREATOR(CPUMyCustomOpCreator, OpType_MyCustomOp);
```

Add Metal Implementation

  1. Add Shader
    Add MetalMyCustomOp.metal to the source/backend/Metal directory and add it to the Xcode project. You can refer to the existing Metal shader implementations.

  2. Implementation class declaration
    Add MetalMyCustomOp.hpp and MetalMyCustomOp.cpp in the source/backend/Metal directory and add them to the Xcode project:

```cpp
class MetalMyCustomOp : public Execution {
public:
    virtual ErrorCode onResize(const std::vector<Tensor *> &inputs,
                               const std::vector<Tensor *> &outputs) override;
    virtual ErrorCode onExecute(const std::vector<Tensor *> &inputs,
                                const std::vector<Tensor *> &outputs) override;
};
```

  3. Implement onResize and onExecute
    Unlike a CPU tensor, which stores its data in the host pointer, a Metal tensor stores its data handle in deviceId, which holds an id<MTLBuffer>:

```cpp
auto buffer = (__bridge id<MTLBuffer>)(void *)tensor->deviceId();
```

The specific parameters of a Metal Op can also be stored in an id<MTLBuffer>. Unlike a tensor, such a buffer can mix multiple data types; just make sure the correct length is specified when it is created. For example:

```cpp
auto buffer = [context newDeviceBuffer:2 * sizeof(int) + 2 * sizeof(__fp16) access:CPUWriteOnly];
((__fp16 *)buffer.contents)[0] = mAlpha / mLocalSize;  // alpha
((__fp16 *)buffer.contents)[1] = mBeta;                // beta
((int *)buffer.contents)[1]    = mLocalSize;           // local size
((int *)buffer.contents)[2]    = inputs[0]->channel(); // channel
```

When creating a buffer, you need to specify its access control. There are currently three options:

  • CPUReadWrite: data is shared between CPU and GPU; generally used for device buffers;
  • CPUWriteOnly: data is not read back after being written by the CPU; generally used for parameter buffers;
  • CPUTransparent: data lives only on the GPU; generally used for heap buffers.

MNNMetalContext provides two similar sets of interfaces for creating buffers; they differ only in the life cycle of the data:

  • memory occupied by a device buffer is not reused within a single inference;
  • memory occupied by a heap buffer can be reused by other Ops after -[MNNMetalContext releaseHeapBuffer:] is called.

In general, the heap is only used together with CPUTransparent. The heap is only available on iOS 10+; on iOS 9 it falls back to device.

When using Metal, do not create the device or library yourself unless there is a special reason. Loading the library and compiling functions are time-consuming, and MNNMetalContext already does the necessary caching. An example of executing Metal through the context is as follows:

```cpp
auto context   = (__bridge MNNMetalContext *)backend->context();
auto kernel    = /* metal kernel name NSString */;
auto encoder   = [context encoder];
auto bandwidth = [context load:kernel encoder:encoder];
/* encoder set buffer(s)/sampler(s) */
[context dispatchEncoder:encoder
                 threads:{x, y, z}
      maxThreadsPerGroup:maxThreadsPerThreadgroup]; // recommended way to dispatch
[encoder endEncoding];
```

  4. Register implementation class

```cpp
class MetalMyCustomOpCreator : public MetalBackend::Creator {
public:
    virtual Execution *onCreate(const std::vector<Tensor *> &inputs,
                                const MNN::Op *op, Backend *backend) const {
        return new MetalMyCustomOp(backend);
    }
};
REGISTER_METAL_OP_CREATOR(MetalMyCustomOpCreator, OpType_MyCustomOp);
```

Add Vulkan Implementation

  1. Add Shader
    Add a shader (*.comp) in the source/backend/vulkan/execution/glsl directory. Use image as the data container if the input memory layout is NC4HW4; otherwise use buffer. You can refer to the existing implementations. Then execute the makeshader.py script to compile the shaders.

  2. Implementation class declaration
    Add VulkanMyCustomOp.hpp and VulkanMyCustomOp.cpp to source/backend/vulkan/execution/:

```cpp
class VulkanMyCustomOp : public VulkanBasicExecution {
public:
    VulkanMyCustomOp(const Op* op, Backend* bn);
    virtual ~VulkanMyCustomOp();
    ErrorCode onEncode(const std::vector<Tensor*>& inputs,
                       const std::vector<Tensor*>& outputs,
                       const VulkanCommandPool::Buffer* cmdBuffer) override;

private:
    // GPU shader parameters
    std::shared_ptr<VulkanBuffer> mConstBuffer;
    // pipeline
    const VulkanPipeline* mPipeline;
    // layout descriptor set
    std::shared_ptr<VulkanPipeline::DescriptorSet> mDescriptorSet;
};
```

  3. Implement onEncode
    In onEncode, first check the memory layout: if it is NC4HW4, use image as the data container; otherwise use buffer. Return NO_ERROR when encoding succeeds.

  4. Register implementation class

```cpp
class VulkanMyCustomOpCreator : public VulkanBackend::Creator {
public:
    virtual Execution* onCreate(const std::vector<Tensor*>& inputs,
                                const MNN::Op* op,
                                Backend* backend) const override {
        return new VulkanMyCustomOp(op, backend);
    }
};
static bool gResistor = []() {
    VulkanBackend::addCreator(OpType_MyCustomOp, new VulkanMyCustomOpCreator);
    return true;
}();
```

Add OpenCL Implementation

  1. Add Kernel
    Add a specific kernel (*.cl) to source/backend/opencl/execution/cl. Currently feature maps are implemented using image2d. You can refer to the existing implementations. Then execute opencl_codegen.py to generate the kernel map.

  2. Implementation class declaration
    Add MyCustomOp.h and MyCustomOp.cpp to source/backend/opencl/execution/:

```cpp
template <typename T>
class MyCustomOp : public Execution {
public:
    virtual ErrorCode onResize(const std::vector<Tensor *> &inputs,
                               const std::vector<Tensor *> &outputs) override;
    virtual ErrorCode onExecute(const std::vector<Tensor *> &inputs,
                                const std::vector<Tensor *> &outputs) override;
};
```

  3. Implement onResize and onExecute
    Implement onResize (optional) and onExecute. Return NO_ERROR when execution completes.

  4. Register implementation class

```cpp
OpenCLCreatorRegister<TypedCreator<MyCustomOp<cl_data_t>>> __my_custom_op(OpType_MyCustomOp);
```

Add OpenGL Implementation

  1. Add Shader
    Add a shader (*.glsl) under source/backend/opengl/glsl, no header file is needed. The feature map is represented by image3d. You can refer to the existing implementations. Then, execute makeshader.py under source/backend/opengl.

  2. Add Executor
    Add GLMyCustomOp.h and GLMyCustomOp.cpp to source/backend/opengl/execution/:

```cpp
class GLMyCustomOp : public Execution {
public:
    GLMyCustomOp(const std::vector<Tensor *> &inputs, const Op *op, Backend *bn);
    virtual ~GLMyCustomOp();
    virtual ErrorCode onExecute(const std::vector<Tensor *> &inputs,
                                const std::vector<Tensor *> &outputs) override;
    virtual ErrorCode onResize(const std::vector<Tensor *> &inputs,
                               const std::vector<Tensor *> &outputs) override;

private:
    std::shared_ptr<GLProgram> mProgram;
};
```

  3. Implement onResize and onExecute
    Implement onResize (optional) and onExecute. Return NO_ERROR when execution completes.

  4. Register implementation class

```cpp
GLCreatorRegister<TypedCreator<GLMyCustomOp>> __my_custom_op(OpType_MyCustomOp);
```