Model format support

The Inference Engine Generator natively supports three different model formats. To use the inference engine with other model formats, the models can first be converted to one of these formats using third-party tools such as the ONNX to TensorFlow converter.

Natively supported model formats

This section gives some more details about the natively supported formats. In all cases, the data-types should be in quantized 8-bit integer format (INT8) or quantized 16-bit integer format (INT16); see the supported data-types page for layer-by-layer requirements.

TensorFlow Lite FlatBuffer format

The recommended input format for the Inference Engine Generator is the TensorFlow Lite FlatBuffer format with the .tflite extension. Such a model can be converted using the TFLite converter from various input formats, including TensorFlow and Keras models. As part of the conversion process, the model can be quantized to 8-bit integer format. We refer to the official converter documentation and the tutorials on post-training quantization for information on how to do this.
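As a minimal sketch, a full-integer post-training quantization with the TFLite converter could look as follows. Here keras_model and representative_dataset are placeholders: an already trained tf.keras.Model and a generator function yielding lists of example inputs (similar to the dataset_example further down this page).

import tensorflow as tf

# `keras_model` and `representative_dataset` are placeholders: an already
# trained tf.keras.Model and a generator function yielding lists of example inputs.
converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("model.tflite", "wb") as f:
    f.write(converter.convert())

A complete, runnable version of this conversion is shown in the SavedModel example at the end of this page.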

A TFLite FlatBuffer can be visualized locally or in the browser using the Netron tool.
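Netron is also available as a Python package (e.g. pip install netron), in which case a model file can be opened from a script. The snippet below is a small example assuming a model.tflite file exists in the current directory.

import netron

netron.start("model.tflite")  # serves the model and opens it in the browser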

TensorFlow/Keras 'SavedModel' format

The TensorFlow/Keras SavedModel format is supported as input to the Inference Engine Generator. This storage format consists of an entire directory with multiple files. Inside the Inference Engine Generator this format is converted using the TFLite converter, which means the SavedModel should already be in INT8 or INT16 format, e.g. through quantization-aware training. To have more control over the converter options or to debug issues, it is recommended to run the converter manually and input a TFLite FlatBuffer to the Inference Engine Generator instead.

Keras H5 format

Similar to the SavedModel format described above, the Keras H5 format with the .h5 extension is also supported as input. Inside the Inference Engine Generator this format is converted using the TFLite converter, which means the Keras H5 model should already be in INT8 or INT16 format, e.g. through quantization-aware training. To have more control over the converter options or to debug issues, it is recommended to run the converter manually and input a TFLite FlatBuffer to the Inference Engine Generator instead.

Other formats through third-party tools

The Inference Engine Generator can be used with other input model formats as long as they are converted to one of the above supported formats and quantized to INT8 or INT16. For best performance, however, it is recommended to directly specify the model in TensorFlow/Keras format and convert to a TensorFlow Lite FlatBuffer manually.

Nevertheless, in some cases conversion tools can be convenient. Therefore, a list of such tools is shared below, but note that we do not provide support in case of issues with these tools. Some of the best-known and most useful conversion tools are:

  • PyTorch models can be exported to ONNX format directly from PyTorch, see here or here for examples. ONNX models can be converted further to TensorFlow using one of the tools below.
  • The official open-source onnx-tensorflow backend can convert ONNX models to TensorFlow.
  • The open-source openvino2tensorflow converter can convert ONNX models to OpenVINO format and subsequently into TensorFlow or TensorFlow Lite FlatBuffer format. The advantage of this method is that it can also perform data layout transformations from NCHW to NHWC (see also below).
  • The open-source onnx2tflite tool can also perform data layout transformations on ONNX models, and can output a TensorFlow/Keras SavedModel as well as a TensorFlow Lite FlatBuffer file.
  • The open-source onnx2keras tool can convert ONNX models to Keras models and supports data layout transformations as well, through the change_ordering option.

Note that with all of the above tools certain operations and layer configurations might not be supported. Furthermore, conversion bugs might result in incorrect results or sub-optimal TFLite models. Therefore, it is recommended to perform sanity checks after each conversion step, as well as a visual inspection using the Netron tool.
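As an illustration of such a sanity check, the hypothetical helper below (not part of the Inference Engine Generator) compares the outputs of the original and the converted model on the same example input. For quantized models, larger tolerances are to be expected.

import numpy as np

def check_outputs_match(reference_output: np.ndarray,
                        converted_output: np.ndarray,
                        rtol: float = 1e-3, atol: float = 1e-5) -> None:
    """Raise an AssertionError if the two model outputs differ significantly."""
    np.testing.assert_allclose(reference_output, converted_output, rtol=rtol, atol=atol)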

To make it easier to get started, a simple conversion example is included below, which converts in three steps: from PyTorch to ONNX, from ONNX to TensorFlow, and finally to the TensorFlow Lite FlatBuffer format.

Example: PyTorch to ONNX

First we define an example model in PyTorch. To run this example snippet it is required to install PyTorch first, e.g. pip install torch.

import torch

IMAGE_SHAPE = (1, 28, 28)  # Example 28x28 grayscale image (e.g. MNIST)

class LeNetModel(torch.nn.Module):
    def __init__(self) -> None:
        """Example convolutional neural network based on the original LeNet-5
        architecture from Yann LeCun, see https://en.wikipedia.org/wiki/LeNet."""
        super(LeNetModel, self).__init__()
        self.relu = torch.nn.ReLU(inplace=False)
        self.conv1 = torch.nn.Conv2d(1, 6, kernel_size=3, padding=(0, 0))
        self.pool1 = torch.nn.AvgPool2d(kernel_size=2, padding=(0, 0))
        self.conv2 = torch.nn.Conv2d(6, 16, kernel_size=3, padding=(0, 0))
        self.pool2 = torch.nn.AvgPool2d(kernel_size=2, padding=(0, 0))
        self.cnn_out_dim = 16 * 5 * 5
        self.fc1 = torch.nn.Linear(in_features=self.cnn_out_dim, out_features=120)
        self.fc2 = torch.nn.Linear(in_features=120, out_features=84)
        self.fc3 = torch.nn.Linear(in_features=84, out_features=10)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.relu(self.conv1(x))
        x = self.pool1(x)
        x = self.relu(self.conv2(x))
        x = self.pool2(x)
        x = torch.reshape(x, (1, self.cnn_out_dim))  # flatten (batch size 1) for the fully-connected layers
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = torch.softmax(self.fc3(x), -1)
        return x

torch_model = LeNetModel()
# (now train the model or load some weights)

The chosen model is similar to the LeNet TensorFlow example. After the model has been trained (e.g. using regular floating-point data-types), it can be exported to ONNX natively from PyTorch:

example_input = torch.randn(1, *IMAGE_SHAPE)
torch.onnx.export(
    torch_model,
    example_input,
    "model.onnx",
    export_params=True,
    opset_version=13,
    input_names=["input"],
    output_names=["output"],
)

The resulting model.onnx can now be inspected with Netron and verified with ONNX Runtime. For more information about the arguments of the torch.onnx.export function, see the official tutorial or documentation.
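As a quick sanity check, the exported model can for example be compared against the original PyTorch model on the same input. The snippet below is a sketch that continues from the example above and assumes ONNX Runtime is installed, e.g. pip install onnxruntime.

import numpy as np
import onnxruntime

# Run the exported ONNX model and the original PyTorch model on the same input.
session = onnxruntime.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
onnx_output = session.run(["output"], {"input": example_input.numpy()})[0]
torch_output = torch_model(example_input).detach().numpy()
np.testing.assert_allclose(torch_output, onnx_output, rtol=1e-3, atol=1e-5)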

Example: ONNX to TensorFlow SavedModel

Next, we illustrate the steps of going from an ONNX model to a TensorFlow SavedModel. This step assumes that a file model.onnx has already been created, for example from PyTorch using the example above, but it can also come from another source.

To run this example snippet it is required to install ONNX and ONNX TensorFlow, e.g. pip install onnx onnx-tf.

import onnx
import onnx_tf

onnx_model = onnx.load("model.onnx")
onnx.checker.check_model(onnx_model)

tf_rep = onnx_tf.backend.prepare(onnx_model, device="CPU")
output_path = "tf_saved_model"  # modify as desired
tf_rep.export_graph(output_path)

The result is a new folder tf_saved_model with a TensorFlow SavedModel inside. This model can be loaded with TensorFlow and evaluated in inference mode to make sure the conversion was done correctly.
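For example, a quick inference check could look like the sketch below; the exact signature key and input name depend on the converter version and can be inspected via loaded.signatures, so treat "serving_default" and "input" as assumptions here.

import numpy as np
import tensorflow as tf

loaded = tf.saved_model.load("tf_saved_model")
infer = loaded.signatures["serving_default"]
print(infer.structured_input_signature)  # shows the expected input name(s) and shapes

example_input = np.random.randn(1, 1, 28, 28).astype(np.float32)  # NCHW, as in the ONNX model
print(infer(input=tf.constant(example_input)))  # assumes the input is named "input"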

If the model is already INT8 or INT16 quantized, it can be fed to the Inference Engine Generator directly. If it still needs quantization (which is the case when starting from the PyTorch example above), it needs to be manually converted to a TensorFlow Lite FlatBuffer file. See below for this final step.

Example: SavedModel to INT8 TensorFlow Lite FlatBuffer

In this final step, the resulting TensorFlow SavedModel is converted to a TensorFlow Lite FlatBuffer and quantized to INT8. This step is similar to the steps taken for the example TensorFlow Keras models using post-training quantization (see also e.g. LeNet, MobileNetV2, and an LSTM). For this step TensorFlow needs to be installed, e.g. pip install tensorflow.

from pathlib import Path
import tensorflow as tf

IMAGE_SHAPE = (1, 28, 28)  # Example 28x28 grayscale image (e.g. MNIST)

def dataset_example(num_samples: int = 100):
    """Placeholder for a representative data-set. For best quantization
    performance, replace this with a few examples from your own data-set, the
    more, the better. This should include any pre-processing needed."""
    for _ in range(num_samples):
        yield [tf.random.uniform(shape=(1, *IMAGE_SHAPE), minval=-1, maxval=1)]

def convert_model(model_path: str, dataset_gen) -> bytes:
    converter = tf.lite.TFLiteConverter.from_saved_model(model_path)
    converter.experimental_new_converter = True
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = dataset_gen
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.int8
    converter.inference_output_type = tf.int8
    return converter.convert()

input_path = "tf_saved_model"  # modify as required
int8_tf_model = convert_model(input_path, dataset_example)
Path("model.tflite").write_bytes(int8_tf_model)

This creates a model.tflite which can be inspected with Netron. It is ready to be used with the Inference Engine Generator.
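To verify the quantized model, it can be run with the TFLite interpreter. The snippet below is a sketch assuming the model.tflite produced above, with INT8 inputs and outputs; it quantizes a float example input and dequantizes the output using the quantization parameters stored in the model.

import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]

# Quantize a float example input to INT8 using the input quantization parameters.
scale, zero_point = input_details["quantization"]
float_input = np.random.uniform(-1, 1, size=input_details["shape"]).astype(np.float32)
int8_input = np.round(float_input / scale + zero_point).astype(np.int8)

interpreter.set_tensor(input_details["index"], int8_input)
interpreter.invoke()
int8_output = interpreter.get_tensor(output_details["index"])

# Dequantize the output back to float, e.g. to compare with the original model.
out_scale, out_zero_point = output_details["quantization"]
float_output = (int8_output.astype(np.float32) - out_zero_point) * out_scale
print(float_output)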

Notes on data layouts

Note that there might be extra overhead when converting from PyTorch or ONNX to TFLite due to additional transpose or reshape layers. This is because PyTorch and ONNX use a different data layout (NCHW) compared to TensorFlow Lite (NHWC). Assuming 4D tensors, PyTorch and ONNX will lay them out in memory as NCHW: (batch, channels, height, width), also known as 'channels first'. However, TensorFlow Lite works with NHWC: (batch, height, width, channels), also known as 'channels last'.
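For illustration, switching between the two layouts is a single transpose of the dimensions:

import numpy as np

nchw = np.zeros((8, 3, 32, 32))          # channels first: 8 images, 3 channels, 32x32 pixels
nhwc = np.transpose(nchw, (0, 2, 3, 1))  # channels last, as expected by TensorFlow Lite
print(nhwc.shape)                        # (8, 32, 32, 3)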

When following the above tutorial there is no explicit data layout conversion done, and thus extra transpose ops will be inserted into the network where necessary. This can be observed by comparing the model.onnx and model.tflite files in Netron. Because this can cause significant run-time overhead in certain cases, it is advised to either re-design the neural network to minimize the number of inserted transpose ops, or to use one of the conversion tools listed above that support data layout conversion. Note that some of these tools are experimental and not well tested, so be sure to verify correctness after each conversion step. More information about this topic can be found on the internet, such as here, here, here, or here.