Reporting and analysis¶
When running the Inference Engine Generator, an offline report is generated automatically. A TFLite FlatBuffer file is also available to analyse the model flow. Furthermore, the inference engine runtime contains an online 'report mode' which can record per-layer latencies as well as RAM usage. The library itself does not need to be re-generated when enabling the online report mode. Note that the online report mode is not available when `release_mode` is enabled.
The offline report¶
The offline report can be found in a file in the output after running the Inference Engine Generator. It contains static information that is collected during the generation of the inference engine library. It does not contain run-time information such as timing and stack usage; such information is only available in the online report.
Here is an example offline report:
===============================
Offline report for 'LeNet'
===============================
Model ID: 47693512213640b3b2931ea18fa9bd49
Tool version: '2023-01-01-12-00'
Input shape: (1, 28, 28, 1)
Output shape: (1, 10)
Required tensor arena size: 5232
Required tensor arena size for report-mode: 5344
| | Layer name | #ops | #params | param size | activations |
|----|-----------------|---------|-----------|--------------|---------------|
| 0 | CONV_2D | 36.50K | 0.06K | 0.15 KiB | 4.22 KiB |
| 1 | AVERAGE_POOL_2D | 4.06K | 0.00K | 0.00 KiB | 4.15 KiB |
| 2 | CONV_2D | 104.54K | 0.88K | 1.09 KiB | 4.85 KiB |
| 3 | AVERAGE_POOL_2D | 1.60K | 0.00K | 0.00 KiB | 2.28 KiB |
| 4 | RESHAPE | 0.00K | 0.00K | 0.00 KiB | 0.39 KiB |
| 5 | FULLY_CONNECTED | 48.00K | 48.12K | 48.75 KiB | 0.73 KiB |
| 6 | FULLY_CONNECTED | 10.08K | 10.16K | 11.16 KiB | 0.42 KiB |
| 7 | FULLY_CONNECTED | 0.84K | 0.85K | 0.98 KiB | 0.31 KiB |
| 8 | SOFTMAX | 0.01K | 0.26K | 1.00 KiB | 0.02 KiB |
Total ops: 205.63K (205634)
Total parameters: 60.33K (60330)
Total parameter size: 63.12 KiB (64638)
Estimated ROM size including code and model weights: 77072 bytes
The model ID is a unique model identifier, which is generated as the MD5-sum of the TFLite FlatBuffer file. The model name is either derived from the TensorFlow Keras source network or from the TFLite FlatBuffer filename, depending on what input was given to the Inference Engine Generator.
Model flow analysis using Netron¶
The Inference Engine Generator does not include any model flow analysis tools itself. Instead, we recommend using the third-party Netron tool from Lutz Roeder with a TensorFlow Lite FlatBuffer. Such a file is either already available as the supplied input to the Inference Engine Generator, or - in case of other model formats - output by the tool in the `validation` sub-folder.
There are two main ways of using Netron:
- Use it directly from the web in the browser. Simply navigate to netron.app and click `Open model...` to select a TFLite FlatBuffer file. This is the easiest method.
- The alternative is to install the software on a local computer. Instructions for Linux, macOS, and Windows are available on the Netron GitHub page and are as simple as downloading a single file or running a single install command. Then, Netron can be opened locally from the command-line, after which an `Open model...` button will appear to upload a TFLite FlatBuffer file.
Once a model is opened in Netron, it will show the model flow, the sizes of activations and weights, layer properties such as padding or strides, and tensor properties such as quantization parameters. Clicking on an individual layer makes additional properties visible on the right side. Furthermore, through the top-left menu it is possible to export the model flow as an image file.
When viewing a network with multiple subgraphs in Netron, such as a GRU-based RNN, only the main graph is shown at start-up time. To view another subgraph, e.g. the GRU-cell itself, first click on the model input or output layer. Then, on the right side under `Model properties` there will be a `subgraph` menu option, where another subgraph can be selected from a dropdown menu.
The online report¶
To activate the online report mode from the C API, add `#define PLUMERAI_INFERENCE_REPORT_MODE` before including the inference engine header file. For the C++ API, simply set the template argument `report_mode` to `true` when calling the `Initialize` method. Furthermore, in both cases, make sure you use the `TENSOR_ARENA_SIZE_REPORT_MODE` define for the tensor arena size, see here. The functions `PlumeraiInferencePrintReport` (C API) and `PrintReport` (C++ API) can now be used to print a detailed report after inference.
Here is an example of API usage for report mode, using the C API:
#define PLUMERAI_INFERENCE_REPORT_MODE
#include "plumerai/inference_engine_c.h"
#include "plumerai/model_defines.h"
int main(void) {
// (...)
PlumeraiInference engine = PlumeraiInferenceInit(
tensor_arena, TENSOR_ARENA_SIZE_REPORT_MODE, 0
);
// (...)
TfLiteStatus allocate_status = PlumeraiInferenceAllocateTensors(&engine);
TfLiteStatus invoke_status = PlumeraiInferenceInvoke(&engine);
// (...)
PlumeraiInferencePrintReport(&engine);
}
#include "plumerai/inference_engine.h"
#include "plumerai/model_defines.h"
int main() {
// (...)
plumerai::InferenceEngine engine;
engine.Initialize<true>(tensor_arena, TENSOR_ARENA_SIZE_REPORT_MODE);
// (...)
TfLiteStatus allocate_status = engine.AllocateTensors();
TfLiteStatus invoke_status = engine.Invoke();
// (...)
engine.PrintReport();
}
Here is an example report for an example model and device:
=====================
Execution report
=====================
| Metric | Value |
|--------------------------------------|-------------------|
| Total runtime | 4 521 ticks |
| Tensor arena (RAM) | 5 300 bytes |
| -- Tensor arena activation tensors | 4 976 bytes |
| -- Tensor arena persistent data | 244 bytes |
| -- Profiler (RAM) | 80 bytes |
| Stack usage during tensor allocation | 16 bytes |
| Stack usage during invoke | 180 bytes |
Breakdown per layer:
| | Layer name | #ops | #params | param size | activations | Latency (ticks) |
|---|-----------------|---------|---------|------------|-------------|-----------------|
| 0 | CONV_2D | 36.50K | 0.06K | 0.15 KiB | 4.22 KiB | 1 241 (27.45%) |
| 1 | AVERAGE_POOL_2D | 4.06K | 0.00K | 0.00 KiB | 4.15 KiB | 170 ( 3.76%) |
| 2 | CONV_2D | 104.54K | 0.88K | 1.09 KiB | 4.85 KiB | 1 812 (40.08%) |
| 3 | AVERAGE_POOL_2D | 1.60K | 0.00K | 0.00 KiB | 2.28 KiB | 75 ( 1.66%) |
| 4 | RESHAPE | 0.00K | 0.00K | 0.00 KiB | 0.39 KiB | 1 ( 0.02%) |
| 5 | FULLY_CONNECTED | 48.00K | 48.12K | 48.75 KiB | 0.73 KiB | 963 (21.30%) |
| 6 | FULLY_CONNECTED | 10.08K | 10.16K | 11.16 KiB | 0.42 KiB | 221 ( 4.89%) |
| 7 | FULLY_CONNECTED | 0.84K | 0.85K | 0.98 KiB | 0.31 KiB | 25 ( 0.55%) |
| 8 | SOFTMAX | 0.01K | 0.26K | 1.00 KiB | 0.02 KiB | 7 ( 0.15%) |
Total ops: 205.63K (205634)
Total parameters: 60.33K (60330)
Total parameter size: 63.12 KiB (64638)
See below for more information about the `#ops` column.
Some extra space in the tensor arena is required in report mode to keep track of the per-layer latencies, but the report indicates this usage (under the 'Profiler' label, see the example above) so that it can be accounted for.
What to do if the stack usage could not be estimated¶
If the stack usage reported is `<see * below>`, a message with a short explanation shows up below the report. This section provides more details.
The stack usage measurement requires an upper-bound of the expected stack usage. If this upper-bound is set too high, it might not run on systems that have a small stack size. If the upper-bound is too low, the stack usage cannot be measured by the inference engine. To address this, the inference engine in report mode first assumes a very low stack usage upper-bound. Then, if that did not result in a proper measurement, it retries on the next `Invoke` run with a twice as large upper-bound. This continues until the stack usage can be measured properly.
Thus, if the stack usage can't be reported, the only thing the user needs to do is re-run `Invoke`. Typically `Invoke` is called within an endless while loop, which means that the first few reports might not contain stack usage information, but any subsequent ones will.
Stack usage with RTOS¶
When an RTOS is used, the reported stack usage can be inaccurate because of task switching and interrupts. When the task is swapped out, the OS stores extra information on the stack which influences the measurement. To get an accurate value, temporarily disable interrupts or set the task priority to the highest value.
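As a sketch of the second option, assuming FreeRTOS is used: inside the task that runs inference, temporarily raise the task to the highest priority around `Invoke` so that it is not swapped out during the measurement, then restore the old priority. The FreeRTOS calls below are standard; other RTOSes have their own equivalents.

```c
#include "FreeRTOS.h"
#include "task.h"

// Fragment for inside the inference task: avoid task switches during the
// stack measurement by running at the highest priority temporarily.
UBaseType_t old_priority = uxTaskPriorityGet(NULL);
vTaskPrioritySet(NULL, configMAX_PRIORITIES - 1);
TfLiteStatus invoke_status = PlumeraiInferenceInvoke(&engine);
vTaskPrioritySet(NULL, old_priority);
PlumeraiInferencePrintReport(&engine);
```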
Timer function¶
The inference engine runtime expects a C function `PlumeraiGetCurrentTime` to be provided by the user. It takes no arguments and should return a `uint32_t` with the current time. The result can be in any desired unit (e.g. ms); the report will print the values without any unit conversion.
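For illustration, a minimal implementation can simply forward a platform tick counter. The sketch below assumes an STM32-style HAL that provides `HAL_GetTick()` in milliseconds; any monotonic counter on your platform works equally well.

```c
#include <stdint.h>

// Assumed platform tick source (e.g. the STM32 HAL millisecond counter);
// replace this with whatever time source your board provides.
extern uint32_t HAL_GetTick(void);

// User-provided timer function expected by the inference engine runtime.
// Any unit is fine; the report prints the raw values without conversion.
uint32_t PlumeraiGetCurrentTime(void) {
  return HAL_GetTick();
}
```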
Number of ops and parameters¶
The column `#ops` in the model report represents the number of operations in a layer, but the type of operation (multiply, add, compare) can differ per layer.
In most cases, `#ops` is the total number of output elements. For example, for an elementwise `Tanh` layer, `ops` is the number of tanh functions that have to be computed. In some cases, like a convolution layer, this does not represent the number of operations or MACs (multiply-accumulates). The following definitions are used for those layers:
- Add: Two `ops` per added element. The implementation of `Add` in quantized format requires two MAC operations per added tensor element.
- AveragePool2D: The total number of averaged elements.
- Concatenation: Zero `ops` in case the layer can be optimized out, or one `op` per output element in case an explicit memory copy is needed.
- Conv2D, DepthwiseConv2D, FullyConnected: The number of `ops` is the number of MACs (see the sketch after this list). The number of parameters and parameter size includes the parameters needed for the quantization step of the output.
- MaxPool2D: The total number of elements over which the max is taken.
- Reshape: This layer does not involve any computations and will always report zero `ops`.
- Softmax: The number of `ops` is set to the number of softmax outputs. The parameter size includes the size of a look-up table that is used by the implementation. The size of this look-up table is dependent on the INT8 or INT16 quantization statistics and the `beta` parameter of the softmax layer. A larger `input_scale` and larger `beta` may decrease the size of the look-up table.
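As a small worked example of these definitions, the following sketch computes `#ops` for a convolution, an elementwise `Tanh`, and a quantized `Add`, using hypothetical layer shapes (not those of the report above):

```c
#include <stdio.h>

int main(void) {
  // Hypothetical output shape 24x24x4 produced by a 5x5 convolution over a
  // single-channel input: #ops for Conv2D is the number of MACs, i.e.
  // output_h * output_w * output_channels * kernel_h * kernel_w * input_channels.
  long conv_ops = 24L * 24 * 4 * 5 * 5 * 1;  // 57600 MACs
  // Elementwise Tanh over the same tensor: one op per output element.
  long tanh_ops = 24L * 24 * 4;              // 2304 ops
  // Quantized Add over the same tensor: two ops per added element.
  long add_ops = 2L * 24 * 24 * 4;           // 4608 ops

  printf("Conv2D: %ld, Tanh: %ld, Add: %ld\n", conv_ops, tanh_ops, add_ops);
  return 0;
}
```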
Activation size¶
The column `activations` in the model report represents the size of the activation tensors used by the layer, as well as any scratch buffers. This can help track down bottlenecks in the model to reduce RAM usage.
In some cases, the reported value can be smaller than the sum of the input and output sizes. This is because some layers support (partial) in-place computation.
The `activations` column reports activations and scratch buffers, but not persistent data such as look-up tables that could be placed in RAM. The total persistent data is shown under the label `Tensor arena persistent data` near the top of the report. The label `Tensor arena activation tensors` covers the total space needed for this `activations` column. It is possible that the total required activations size is larger than the largest per-layer value in the table. This can have several reasons:
- In a model with shortcut connections, the activations of a shortcut branch are not counted towards the `activations` of the layers in the other branch.
- If one layer requires a tensor from offset `0` to `100` and another layer requires a tensor from offset `50` to `150`, then both layers require `100` bytes by themselves, but the total model requires at least `150` bytes. Note that the memory planning is dependent on the global model structure in a complex way, and every layer can indirectly influence the planning of every other layer.