Building an application with the inference engine

Building

The generated inference engine library consists of four header files and a pre-compiled static library:

include/plumerai/inference_engine.h  # for the C++ API only
include/plumerai/inference_engine_c.h  # for the C API only
include/plumerai/tensorflow_compatibility.h
include/plumerai/model_defines.h
libplumerai.a

To build, make sure the header files can be found on the compiler include path, and link with libplumerai.a. Use -Wl,--gc-sections when linking to garbage-collect unused code from the binary. The library is compiled with -fdata-sections and -ffunction-sections to support this.

The exact details on how to compile and link depend on the target platform and compiler. Please refer to their respective documentation for detailed instructions.
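
As a sketch only, a build with a GCC-style cross toolchain might look as follows. The toolchain name, source file, and output name below are placeholders; substitute the ones for your target.

arm-none-eabi-g++ -Iinclude -ffunction-sections -fdata-sections \
    main.cc -L. -lplumerai -Wl,--gc-sections -o app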

Usage

The inference engine is built on top of TensorFlow Lite for Microcontrollers (TFLM), and usage is very similar. Instructions on how to use the API along with an example can be found here for the C++ API or here for the C API.

Log messages

Log messages work the same way as in TFLM: the user has to provide a function with C linkage called DebugLog that outputs strings, for example over UART. In C:

#include <stdio.h>

void DebugLog(const char* s) {
    // This defines how logging is done. To be adjusted depending on the target.
    printf("%s", s);
}

Or in C++, with explicit C linkage:

#include <cstdio>

extern "C" void DebugLog(const char* s) {
    // This defines how logging is done. To be adjusted depending on the target.
    printf("%s", s);
}

Tensor arena

The tensor arena is a chunk of memory that stores the tensor data during model inference. The user has to provide this buffer and make sure it is large enough. All tensors, including the model input and output, will point to a location within the tensor arena, overlapping each other when possible. For best performance, the tensor arena should be 16-byte aligned.

During the lifetime of the inference engine object, the tensor arena cannot be used by the user, except for setting inputs through the respective API functions. The advanced setup, described below, allows the user to use part of the tensor arena for their own application.

For convenience, the inference engine generator provides a TENSOR_ARENA_SIZE define (and TENSOR_ARENA_SIZE_REPORT_MODE for report mode) in the generated include/plumerai/model_defines.h file. This define can be used directly in the application after adding #include "plumerai/model_defines.h", and should in most cases be sufficient. In rare cases it is only a lower bound and might need to be increased slightly; the user is informed about these cases when the message 'Arena size estimation might be inaccurate' is printed in the offline report.
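
As a minimal sketch, the arena could then be declared as follows (the name tensor_arena is just an example):

#include <cstdint>
#include "plumerai/model_defines.h"

// 16-byte aligned tensor arena, sized using the generated define.
alignas(16) static std::uint8_t tensor_arena[TENSOR_ARENA_SIZE];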

When supplying multiple models to the Inference Engine Generator, the user has to provide a separate tensor arena for each model. In this case, plumerai/model_defines.h adds the defines TENSOR_ARENA_SIZE_MODEL_X (and TENSOR_ARENA_SIZE_MODEL_X_REPORT_MODE for report mode), where X is 0, 1, or higher depending on the number of models. The original defines TENSOR_ARENA_SIZE and TENSOR_ARENA_SIZE_REPORT_MODE still exist: they are the sum of all model-specific defines. It is possible to save space by re-using parts of the arena for different models; this is covered in the next section.
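
For two models, a sketch of the corresponding arena declarations could look like this (array names are illustrative):

#include <cstdint>
#include "plumerai/model_defines.h"

// One separate 16-byte aligned tensor arena per model.
alignas(16) static std::uint8_t arena_model_0[TENSOR_ARENA_SIZE_MODEL_0];
alignas(16) static std::uint8_t arena_model_1[TENSOR_ARENA_SIZE_MODEL_1];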

Advanced tensor arena

The tensor arena consists of two parts: the persistent and non-persistent part. The regular setup expects the user to provide a single tensor arena to cover both parts. The advanced setup gives the user more control over these parts.

  • The persistent arena stores persistent data such as tensor metadata and stateful LSTM variables. This data must not be overwritten by the user during the lifetime of the inference engine object. Different instances of the inference engine (e.g. for different models) need separate persistent tensor arenas.
  • The non-persistent arena stores the activation tensor data (including the model input and output) as well as scratch buffers needed for certain layers. It is only used while inference is performed (during the Invoke function); once an inference pass has completed, it can be re-used by another model or by the user's application for other purposes.

Note that the model input and output tensors are part of the non-persistent arena: after performing inference, the user application should first read out the results before re-using the non-persistent arena for other purposes.

When multiple models share the same non-persistent arena, the user has to ensure that the non-persistent arena is large enough for all models: its size should be the maximum over the non-persistent size requirements for each model.

The Inference Engine Generator generates the preprocessor defines TENSOR_ARENA_SIZE_MODEL_X_PERSISTENT, TENSOR_ARENA_SIZE_MODEL_X_PERSISTENT_REPORT_MODE, and TENSOR_ARENA_SIZE_MODEL_X_NON_PERSISTENT, where X is the model id. The define TENSOR_ARENA_SIZE_NON_PERSISTENT_MAX is the maximum over all non-persistent sizes, so that a single buffer of this size can be shared by all models.
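
As a sketch using the defines above, two models could share a single non-persistent arena like this (array names are illustrative):

#include <cstdint>
#include "plumerai/model_defines.h"

// Each model keeps its own persistent arena for the lifetime of its engine.
alignas(16) static std::uint8_t persistent_0[TENSOR_ARENA_SIZE_MODEL_0_PERSISTENT];
alignas(16) static std::uint8_t persistent_1[TENSOR_ARENA_SIZE_MODEL_1_PERSISTENT];

// A single non-persistent arena, sized for the largest requirement, shared by
// both models; valid as long as at most one model runs Invoke at a time and
// its results are read out before the other model re-uses the buffer.
alignas(16) static std::uint8_t non_persistent[TENSOR_ARENA_SIZE_NON_PERSISTENT_MAX];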

Examples

Example applications can be found here for C++ and here for C.