Tutorial: Plumerai People Detection library on ESP32-S3

This page explains how to run the Plumerai People Detection library on the Espressif ESP32-S3-EYE and similar boards. In particular, we demonstrate how to build a demo application with camera input and display output using Espressif's ESP-IDF and ESP-WHO software packages. Neither package is a requirement for the library itself: it can also run bare-metal, e.g. without FreeRTOS. The example on this page was tested with the Espressif ESP32-S3-EYE, the same board used in our ESP32-S3 demonstrator, but it should work in a similar way with other boards that are supported by ESP-WHO.

For the Seeed Studio XIAO ESP32S3 Sense, similar steps are required, but with EdgeLab instead of ESP-WHO. Please contact Plumerai if you would like a more detailed description of those steps.

From step 3 onwards we will modify existing code. If you checked out ESP-WHO as a git repository in step 1, you can also apply a git patch directly to the code to make all required changes. That patch can be found here without the optional improvements or here with the optional improvements. Both patches apply to the main repository; in addition, you need a small patch for the esp32-camera sub-repository. However, we recommend following the instructions below instead, to get a good understanding of what has to be done. The patch files can be consulted in case some of the instructions are unclear. Each step also has a smaller patch file specific to that step. Note that applying the patch file for a step assumes you have applied the patches of all previous steps in order; otherwise conflicts might show up.

Step 1: Install Espressif's ESP-IDF and ESP-WHO

We first have to install Espressif's ESP-IDF and ESP-WHO software packages.

The recommended way to install ESP-IDF is to follow the official instructions, which offer the choice between installing ESP-IDF as an IDE plugin and installing it manually. This tutorial assumes that version 4.4 of ESP-IDF is installed. Version 5 or newer will likely not work, because ESP-WHO requires version 4.4. The Plumerai People Detection library itself, however, should work with any version of ESP-IDF.

Once you have ESP-IDF installed, you can follow the instructions for installing ESP-WHO. Assuming you have git already installed, this can be as simple as:

git clone --recursive https://github.com/espressif/esp-who.git

For reference, this tutorial was tested using the latest commit in master at the time of writing, which was 5497ff27a27770eb06e9592eb7e135a2b3c4d0e1.

Step 2: Build an existing example application

In step 3 we will modify an existing Espressif example application to use the Plumerai People Detection library. But before we do, it is good practice to build and run the unmodified example, making sure everything works fine.

The following commands assume ESP-IDF is installed and available on the command line. Depending on your set-up, you might first have to run something like . /path/to/esp-idf/export.sh.

We will modify the human_face_detection example. To prepare the build, navigate to the ESP-WHO folder in a terminal and then run:

cd examples/human_face_detection/lcd
idf.py set-target esp32s3

To build, flash, and run the example on device, attach an ESP32-S3 device through USB and run:

idf.py flash monitor  # use Ctrl+] (control + closing square bracket) to exit

A standard face detection demo application should now run on device, and the log should print things like this as soon as a face is in view:

I (10059) detection_result: [ 0]: ( 51,  -9, 173, 159)
I (10059) detection_result:       left eye: ( 89,  54), right eye: (143,  53), nose: (119,  82), mouth left: ( 96, 111), mouth right: (138, 109)

If you see weird colors on the display this can be solved by reconnecting the USB cable.

If the above steps didn't work, please consult the generic ESP-IDF or ESP-WHO documentation, or the documentation specific to your device. Once this is working fine, you are ready to integrate the Plumerai People Detection software in the next steps.

Step 3: Prepare the build configuration

Before modifying the application, we'll change the build configuration to fit our needs. We need to make two changes in the build configuration, which can be opened by running idf.py menuconfig in the examples/human_face_detection/lcd sub-folder of your ESP-WHO installation:

  1. Navigate to Compiler options → Optimization Level (...) and select Optimize for performance (-O2). Use the escape key to navigate back to the main menu for the next step.
  2. Navigate to Component config → ESP System Settings and disable the Watch CPU0 Idle Task and Watch CPU1 Idle Task options: the box in front of each should read [ ] and not [*]. Now press q to exit and save your changes.

Step 4: Include the Plumerai People Detection files

For reference, the patch file for this step (which does not include unzipping the library) can be found here.

The next step is to extract the contents of the plumerai_people_detection_micro_esp32.zip file (if you don't have one, please contact Plumerai) to the examples/human_face_detection/lcd sub-folder of your ESP-WHO installation. There should now (additionally) be the following files in the lcd folder:

├── plumerai_people_detection_micro_esp32
│   ├── include
│   │   └── plumerai
│   │       ├── box_prediction.h
│   │       ├── model_defines.h
│   │       └── people_detection_micro.h
│   ├── lib
│   │   └── esp32-s3
│   │       └── libplumerai_people_detection_micro.a
│   └── VERSION

Take note of the values inside include/plumerai/model_defines.h, in particular the PLUMERAI_IMAGE_WIDTH and PLUMERAI_IMAGE_HEIGHT image dimensions and the image format mentioned there. In this tutorial we assume the image dimensions are 640x480 and the image format is RGB565 with 2 bytes per pixel. If this is different in your case, either contact Plumerai to request a new library with the right settings, or adjust the steps below to match your image dimensions and format.
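As a quick sanity check on these numbers, the sketch below computes the size of one frame and shows how three 8-bit channels are packed into one RGB565 value. The constants mirror the values assumed in this tutorial; the helper names frame_size_bytes, tutorial_frame_size and pack_rgb565 are hypothetical and not part of the Plumerai library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Values assumed in this tutorial; in a real project they come from
// plumerai/model_defines.h (PLUMERAI_IMAGE_WIDTH and PLUMERAI_IMAGE_HEIGHT).
constexpr int kImageWidth = 640;
constexpr int kImageHeight = 480;
constexpr int kBytesPerPixel = 2;  // RGB565 uses 16 bits per pixel

// Number of bytes needed to store one frame.
inline std::size_t frame_size_bytes(int width, int height, int bytes_per_pixel) {
    assert(width > 0 && height > 0 && bytes_per_pixel > 0);
    return static_cast<std::size_t>(width) * height * bytes_per_pixel;
}

// Frame size for the dimensions and format assumed in this tutorial.
inline std::size_t tutorial_frame_size() {
    return frame_size_bytes(kImageWidth, kImageHeight, kBytesPerPixel);
}

// Pack 8-bit red, green and blue channels into one RGB565 value:
// 5 bits of red, 6 bits of green, 5 bits of blue.
constexpr std::uint16_t pack_rgb565(std::uint8_t r, std::uint8_t g, std::uint8_t b) {
    return static_cast<std::uint16_t>(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

static_assert(pack_rgb565(255, 0, 0) == 63488, "pure red is 0xF800");
```

With the assumed settings, a single 640x480 RGB565 frame occupies 614400 bytes (600 KiB).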

First, modify examples/human_face_detection/lcd/CMakeLists.txt and add the following line just above the project(human_face_detection_lcd) line:

include_directories(${CMAKE_CURRENT_LIST_DIR}/plumerai_people_detection_micro_esp32/include)

And the following line just below project(human_face_detection_lcd) (i.e. at the end of the file):

target_link_libraries(${PROJECT_NAME}.elf ${CMAKE_CURRENT_LIST_DIR}/plumerai_people_detection_micro_esp32/lib/esp32-s3/libplumerai_people_detection_micro.a)

To verify that everything still works, you can run idf.py flash monitor again. The application itself hasn't changed yet, but the Plumerai People Detection API is now available to it.

Step 5: Prepare the display code to accept different camera resolutions

For reference, the patch file for this step can be found here.

For the best detection quality we want to increase the camera resolution from the default 240x240 to something larger, say 640x480. The example uses this low resolution because the LCD is also assumed to be 240x240, which is indeed the case for the Espressif ESP32-S3-EYE board. In this step we therefore modify the display code to prepare for the camera resolution change.

Most of the changes are in who_human_face_detection.cpp in the components/modules/ai sub-folder of ESP-WHO. First, we create some new globals (e.g. between the #define TWO_STAGE_ON 1 define and the static const char *TAG variable):

constexpr int lcd_width = 240;
constexpr int lcd_height = 240;
constexpr int lcd_display_height = 180;  // For 4:3 aspect ratio
camera_fb_t * lcd_buffer = nullptr;

Then, we initialize the LCD buffer at the very start of the static void task_process_handler(void *arg) function:

    lcd_buffer = reinterpret_cast<camera_fb_t *>(heap_caps_aligned_alloc(16, sizeof(camera_fb_t), MALLOC_CAP_8BIT));
    lcd_buffer->height = lcd_height;
    lcd_buffer->width = lcd_width;
    lcd_buffer->len = 2 * lcd_buffer->width * lcd_buffer->height;
    lcd_buffer->format = PIXFORMAT_RGB565;
    ESP_LOGI(TAG, "Allocating %d bytes LCD buffer in external PSRAM", lcd_buffer->len);
    lcd_buffer->buf = reinterpret_cast<uint8_t *>(heap_caps_aligned_alloc(16, lcd_buffer->len, MALLOC_CAP_8BIT | MALLOC_CAP_SPIRAM));
    for (int i = 0; i < lcd_buffer->len; ++i) { lcd_buffer->buf[i] = 0; }

Further down, at the start of the if (xQueueReceive(xQueueFrameI, &frame, portMAX_DELAY)) section (just before #if TWO_STAGE_ON and the detector.infer calls), we resize the camera frame to the dimensions of the LCD buffer:

                dl::image::resize_image_nearest(
                    reinterpret_cast<uint16_t *>(frame->buf), {(int)frame->height, (int)frame->width, 1},
                    reinterpret_cast<uint16_t *>(lcd_buffer->buf), {lcd_display_height, lcd_width, 1}
                );

Next, a little bit below, just before if (detect_results.size() > 0), we can release the camera buffer, since we have consumed it for inference and for the LCD:

                esp_camera_fb_return(frame);

And for the last change in this file, we modify xQueueSend(xQueueFrameO, &frame, portMAX_DELAY); to send the LCD buffer instead:

                xQueueSend(xQueueFrameO, &lcd_buffer, portMAX_DELAY);

Since we return the camera framebuffer above already, we also change components/modules/lcd/who_lcd.c by removing or commenting out line 31:

                esp_camera_fb_return(frame);    // remove this line

To verify that everything still works, you can run idf.py flash monitor again. The original human face detection algorithm is still running, but it no longer displays its results on screen. Furthermore, the 240x240 camera image is now squeezed into 240x180.
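To make the resizing step more concrete, here is a minimal host-side sketch of nearest-neighbour resizing for a 16-bit-per-pixel image such as RGB565. It only illustrates the idea; it is not the ESP-DL implementation of dl::image::resize_image_nearest, and the function name resize_nearest is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Nearest-neighbour resize of a 16-bit-per-pixel image stored row by row.
// Each output pixel copies the source pixel at the proportionally scaled
// position, so no new colors are created.
inline std::vector<std::uint16_t> resize_nearest(const std::vector<std::uint16_t> &src,
                                                 int src_w, int src_h,
                                                 int dst_w, int dst_h) {
    assert(src_w > 0 && src_h > 0 && dst_w > 0 && dst_h > 0);
    assert(src.size() == static_cast<std::size_t>(src_w) * src_h);

    std::vector<std::uint16_t> dst(static_cast<std::size_t>(dst_w) * dst_h);
    for (int y = 0; y < dst_h; ++y) {
        const int sy = y * src_h / dst_h;  // source row for this output row
        for (int x = 0; x < dst_w; ++x) {
            const int sx = x * src_w / dst_w;  // source column for this output column
            dst[static_cast<std::size_t>(y) * dst_w + x] =
                src[static_cast<std::size_t>(sy) * src_w + sx];
        }
    }
    return dst;
}
```

For example, resize_nearest(src, 640, 480, 240, 180) would map a VGA frame onto a 240x180 display area, which keeps the 4:3 aspect ratio.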

Step 6: Integrate Plumerai People Detection

For reference, the patch file for this step can be found here.

Now it is finally time to use the Plumerai People Detection API.

Most of the changes are in who_human_face_detection.cpp in the components/modules/ai sub-folder of ESP-WHO. First, at the top of the file (e.g. instead of the old setting #define TWO_STAGE_ON 1) we include the Plumerai library headers and define some utility functions:

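#include <algorithm>  // std::max, std::min
#include <cstdarg>    // va_list
#include <cstdio>     // vprintf

#include "plumerai/model_defines.h"
#include "plumerai/people_detection_micro.h"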

extern "C" void DebugLog(const char *format, va_list args) { vprintf(format, args); }

int clip(float value, int max) { return std::max(0, std::min(static_cast<int>(value), max)); }

void draw_detection(BoxPrediction &p, uint16_t * buffer, int width, int height) {
    if (p.confidence < 0.7) { return; }   // confidence threshold, adjust as needed
    int color = 63488;  // red in RGB565
    int x_min = clip(p.x_min * width, width - 1);
    int y_min = clip(p.y_min * height, height - 1);
    int x_max = clip(p.x_max * width, width - 1);
    int y_max = clip(p.y_max * height, height - 1);
    dl::image::draw_hollow_rectangle(buffer, height, width, x_min, y_min, x_max, y_max, color);
}
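The coordinate handling in draw_detection can be tried out on its own. The sketch below assumes, as the code above suggests, that box coordinates are normalized to the range 0 to 1; the PixelBox struct and the to_pixel_box function are hypothetical helpers for illustration and not part of the Plumerai API.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper type: a box in integer pixel coordinates.
struct PixelBox {
    int x_min, y_min, x_max, y_max;
};

// Same clipping as in the tutorial: truncate to int and clamp to [0, max].
inline int clip(float value, int max) {
    return std::max(0, std::min(static_cast<int>(value), max));
}

// Scale normalized coordinates (assumed to be in [0, 1]) to a width x height
// image and clamp them so that the box never leaves the image.
inline PixelBox to_pixel_box(float x_min, float y_min, float x_max, float y_max,
                             int width, int height) {
    assert(width > 0 && height > 0);
    return PixelBox{clip(x_min * width, width - 1), clip(y_min * height, height - 1),
                    clip(x_max * width, width - 1), clip(y_max * height, height - 1)};
}
```

A box that partly extends beyond the image, such as (0.25, -0.1, 1.2, 0.5) on a 240x180 area, ends up as (60, 0, 239, 90): the out-of-range edges are pulled back to the image border.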

Now we can replace the initialization of the human face detector with the Plumerai people detector. We replace the following lines:

    HumanFaceDetectMSR01 detector(0.3F, 0.3F, 10, 0.3F);
#if TWO_STAGE_ON
    HumanFaceDetectMNP01 detector2(0.4F, 0.3F, 10);
#endif

with the following (see the PeopleDetectionInit docs for more information):

    ESP_LOGI(TAG, "Reserving %d bytes of tensor arena in internal RAM", TENSOR_ARENA_SIZE);
    auto tensor_arena = reinterpret_cast<unsigned char *>(heap_caps_aligned_alloc(16, TENSOR_ARENA_SIZE, MALLOC_CAP_8BIT | MALLOC_CAP_INTERNAL));
    if (tensor_arena == nullptr) { ESP_LOGI(TAG, "Error: could not allocate tensor arena"); }

    ESP_LOGI(TAG, "Initializing the Plumerai People Detection");
    auto error_code = PeopleDetectionInit(tensor_arena);
    if (error_code != 0) { ESP_LOGI(TAG, "Error: could not initialize Plumerai People Detection"); }

    constexpr int max_detections = 20;
    BoxPrediction predictions[max_detections];

Next, in the while-loop we can replace these calls to the old detector:

#if TWO_STAGE_ON
                std::list<dl::detect::result_t> &detect_candidates = detector.infer((uint16_t *)frame->buf, {(int)frame->height, (int)frame->width, 3});
                std::list<dl::detect::result_t> &detect_results = detector2.infer((uint16_t *)frame->buf, {(int)frame->height, (int)frame->width, 3}, detect_candidates);
#else
                std::list<dl::detect::result_t> &detect_results = detector.infer((uint16_t *)frame->buf, {(int)frame->height, (int)frame->width, 3});
#endif

with the following (see the PeopleDetectionProcessFrame docs for more information):

                int num_results = 0;
                constexpr float framerate = 2.5f;  // in frames-per-second (FPS)
                error_code = PeopleDetectionProcessFrame(predictions, max_detections, &num_results, 1 / framerate, frame->buf);
                if (error_code != 0) { ESP_LOGI(TAG, "Error: could not process a frame using Plumerai People Detection"); }
                ESP_LOGI(TAG, "Detected %d person(s)", num_results);

Furthermore, still in this file, we replace the old box drawing code:

                if (detect_results.size() > 0)
                {
                    draw_detection_result((uint16_t *)frame->buf, frame->height, frame->width, detect_results);
                    print_detection_result(detect_results);
                    is_detected = true;
                }

with one suited for the Plumerai BoxPrediction format:

                for (int result_id = 0; result_id < num_results; ++result_id) {
                    draw_detection(predictions[result_id], reinterpret_cast<uint16_t *>(lcd_buffer->buf), lcd_width, lcd_display_height);
                    is_detected = true;
                }

And finally, at the very bottom of the file, we increase the stack size from 4KB to 8KB in the first call to xTaskCreatePinnedToCore:

    xTaskCreatePinnedToCore(task_process_handler, TAG, 8 * 1024, NULL, 5, NULL, 0);

We also have to request a different camera resolution from the application. We do that by modifying main/app_main.cpp in the examples/human_face_detection/lcd folder, changing FRAMESIZE_240X240 into FRAMESIZE_VGA, and reducing the number of frame buffers from 2 to 1 to obtain:

    register_camera(PIXFORMAT_RGB565, FRAMESIZE_VGA, 1, xQueueAIFrame);

Now we are ready to run idf.py flash monitor again to check that everything builds. However, when running it on device, you'll notice that the detection quality is very poor or that nothing is detected at all. On the terminal you'll likely see the following message appear many times:

human_face_detection: Detected 0 person(s)

This is because of the way the RGB565 format is handled inside the ESP-WHO code, which we'll fix in the next step.

Step 7: Fix the RGB565 camera and LCD settings

For reference, the patch file for this step can be found here for the main repository and here for the esp32-camera sub-repository.

ESP-WHO (and its esp32-camera dependency) uses an unconventional version of the RGB565 format, where the green values (which are split over two bytes) are partially in reversed order. This is an issue for the Plumerai People Detection algorithm, which expects regular little-endian RGB565.
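To see why byte order matters, consider what happens when the two bytes of an RGB565 pixel are swapped. The sketch below assumes that the mismatch amounts to a swap of the high and low byte, which is our reading of the description above rather than a statement about the exact ESP-WHO behavior; all function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Swap the high and low byte of a 16-bit pixel.
constexpr std::uint16_t swap_bytes(std::uint16_t p) {
    return static_cast<std::uint16_t>((p << 8) | (p >> 8));
}

// Extract the individual channels of a regular RGB565 value.
constexpr int red5(std::uint16_t p) { return p >> 11; }
constexpr int green6(std::uint16_t p) { return (p >> 5) & 0x3F; }
constexpr int blue5(std::uint16_t p) { return p & 0x1F; }

// Worked examples, written as assertions.
inline void check_byte_swap_examples() {
    const std::uint16_t green = 0x07E0;  // pure green in regular RGB565
    assert(green6(green) == 63);

    // The 6 green bits straddle the byte boundary, so after a byte swap
    // they land in the red and blue fields instead.
    const std::uint16_t swapped = swap_bytes(green);
    assert(swapped == 0xE007);
    assert(green6(swapped) == 0);

    // Swapping twice restores the original pixel.
    assert(swap_bytes(swapped) == green);
}
```

A pure green pixel (0x07E0) read with its bytes swapped decodes as a mix of red and blue with no green at all, so a model that expects regular RGB565 would see very different colors.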

First, we modify the camera code. In this example we assume the camera is an OV2640, which is the case on the Espressif ESP32-S3-EYE board for example. For the OV2640 camera we modify components/esp32-camera/sensors/private_include/ov2640_settings.h by appending | IMAGE_MODE_LBYTE_FIRST to the line with IMAGE_MODE_RGB565 (line 412 at time of writing), resulting in:

    {IMAGE_MODE, IMAGE_MODE_RGB565 | IMAGE_MODE_LBYTE_FIRST},

Next, we modify the LCD code. Here we assume the LCD controller is the ST7789, which is the case on the Espressif ESP32-S3-EYE. For the ST7789 controller, modify the static void lcd_st7789_init_reg(void) function in components/screen/controller_driver/st7789/st7789.c by adding the following four lines after the first two lines in that function (LCD_WRITE_CMD(0x3A); and LCD_WRITE_DATA(0x05);):

    // Set proper RGB565 little-endianness
    LCD_WRITE_CMD(0xB0);
    LCD_WRITE_DATA(0x00);
    LCD_WRITE_DATA(0xF8);

Now re-run idf.py flash monitor and point the camera at one or more people: you should see red bounding boxes drawn around them.

Step 8: Optional improvements

Now that a first version of the Plumerai People Detection software is running on your ESP32-S3, it is time to make some improvements.

8a. Rename some files and variables

For reference, the patch file for this step can be found here.

A trivial change is to rename occurrences of human_face_detection (the demo application that we modified) to plumerai_people_detection: e.g. the CMake project name, the .cpp and .hpp file names, the function names, and the TAG for logging.

8b. Draw more visible boxes

For reference, the patch file for this step can be found here.

You might have noticed that the red boxes are not very visible on a small display. One solution is to modify the draw_detection function to draw multiple boxes close to each other, which results in a wider box border. Here is an example of the modified function, which produces a border width of 4 pixels:

void draw_detection(BoxPrediction &p, uint16_t * buffer, int width, int height) {
    if (p.confidence < 0.7) { return; }   // confidence threshold, adjust as needed
    int color = 63488;  // red in RGB565
    for (int w = 0; w < 4; ++w) {
        int x_min = clip(p.x_min * width + w, width - 1);
        int y_min = clip(p.y_min * height + w, height - 1);
        int x_max = clip(p.x_max * width + w, width - 1);
        int y_max = clip(p.y_max * height + w, height - 1);
        dl::image::draw_hollow_rectangle(buffer, height, width, x_min, y_min, x_max, y_max, color);
    }
}

8c. Use the camera callback for better speed

For reference, the patch file for this step can be found here.

You might have noticed that the speed of the whole application is not as good as that of the pre-compiled Plumerai ESP32-S3 demo. The main cause is that we release the camera framebuffer towards the end of our code. This, in combination with using only one framebuffer, puts the camera capture, the resizing for display, and the Plumerai People Detection all in sequence. We can run the camera capture in parallel again by using the PeopleDetectionReadDataCallback function, although it does complicate the code a bit.

First, in components/modules/ai/who_plumerai_people_detection.cpp, we move the camera_fb_t *frame = NULL out of the task_process_handler function and make it a global, e.g. just below the camera_fb_t * lcd_buffer = nullptr definition. Then, right after, we declare a new function which we'll set as callback (make sure this code is placed after the LCD constants). Together this becomes:

camera_fb_t *frame = NULL;  // (don't forget to remove this one from the `task_process_handler` function!)

void resize_for_display_and_free_framebuffer(void *_) {
    dl::image::resize_image_nearest(
        reinterpret_cast<uint16_t *>(frame->buf), {(int)frame->height, (int)frame->width, 1},
        reinterpret_cast<uint16_t *>(lcd_buffer->buf), {lcd_display_height, lcd_width, 1}
    );

    // The input resizing is done (in network) and so is the LCD resizing (above),
    // now we can return the framebuffer such that the camera can get a new frame.
    esp_camera_fb_return(frame);
}

Then, before the while (true) loop starts but after PeopleDetectionInit is called we set the callback:

    PeopleDetectionReadDataCallback(resize_for_display_and_free_framebuffer);

In the same file, we should remove two pieces of code:

  1. Remove the call to resize_image_nearest that is a few lines before the PeopleDetectionProcessFrame call in the loop.
  2. Remove the call to esp_camera_fb_return that is a few lines after the PeopleDetectionProcessFrame call in the loop.

Finally, we can re-enable the second camera framebuffer. We do that by modifying main/app_main.cpp in the examples/human_face_detection/lcd folder, changing 1 back into 2, thus obtaining:

    register_camera(PIXFORMAT_RGB565, FRAMESIZE_VGA, 2, xQueueAIFrame);

If we re-run idf.py flash monitor we should see a better overall speed and lower latency.
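The effect of the callback on the order of operations can be illustrated with a small host-side mock. Everything in this sketch is hypothetical: mock_process_frame merely stands in for a detector that invokes a callback once it has finished reading its input, which is how we understand PeopleDetectionReadDataCallback from the description above. It is not the Plumerai implementation.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Callback type matching the shape used in this tutorial: void f(void *).
using ReadDoneCallback = void (*)(void *);

// Records the order in which things happen.
std::vector<std::string> g_events;

// Stand-in for resize_for_display_and_free_framebuffer.
void release_frame(void *) { g_events.push_back("frame released"); }

// Hypothetical detector: it first consumes the input, then notifies the
// caller that the input buffer is no longer needed, and only then runs
// the (long) inference.
void mock_process_frame(ReadDoneCallback on_input_read) {
    g_events.push_back("input read");
    if (on_input_read != nullptr) { on_input_read(nullptr); }
    g_events.push_back("inference done");
}
```

Calling mock_process_frame(release_frame) records the frame release before the inference completes, which is what allows the camera to capture the next frame while the current one is still being processed.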

8d. Measure delta-t instead of guessing

For reference, the patch file for this step can be found here.

The PeopleDetectionProcessFrame function has a float delta_t argument that we previously set to 1 / framerate where we guessed the framerate to be 2.5 FPS. To obtain better quality predictions, it is recommended to measure this instead.

We can measure this by first adding the following just before the while (true) loop starts in components/modules/ai/who_plumerai_people_detection.cpp:

    float delta_t_s = 1.0f;  // will be set below properly
    auto prev_end_time = esp_timer_get_time();

In the same file we then add the following after the loop with the call to draw_detection:

                auto time_since_last_update = esp_timer_get_time() - prev_end_time;
                delta_t_s = static_cast<float>(time_since_last_update) / 1000000.0f;
                prev_end_time = esp_timer_get_time();
                ESP_LOGI(TAG, "Total speed of demo: %.3fms (%.1f FPS)", delta_t_s * 1000.f, 1.0f / delta_t_s);

Finally, we substitute 1 / framerate with delta_t_s in the call to PeopleDetectionProcessFrame and remove the framerate variable.
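The conversion from timestamps to delta_t can be checked in isolation. The sketch below assumes timestamps in microseconds, as returned by esp_timer_get_time; the helper names elapsed_seconds and frames_per_second are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Elapsed time in seconds between two microsecond timestamps.
inline float elapsed_seconds(std::int64_t prev_us, std::int64_t now_us) {
    assert(now_us >= prev_us);
    return static_cast<float>(now_us - prev_us) / 1000000.0f;
}

// Frame rate that corresponds to a given time between frames.
inline float frames_per_second(float delta_t_s) {
    assert(delta_t_s > 0.0f);
    return 1.0f / delta_t_s;
}
```

An interval of 400000 microseconds corresponds to 0.4 s between frames, i.e. the 2.5 FPS that was guessed earlier.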

8e. Use different colors for different people

For reference, the patch file for this step can be found here.

In draw_detection we always use red as the color for boxes. To distinguish individuals, we can use the tracked identifier value id of the BoxPrediction class to set colors.

In components/modules/ai/who_plumerai_people_detection.cpp, we replace int color = 63488; in the draw_detection function with int color = colors[p.id % num_colors]; and add the following globals before the definition of that function:

constexpr int num_colors = 19;
const int colors[num_colors] = {15785, 57545, 65283, 1049,  62470, 37110, 18334, 61852, 55207, 65018, 1040,  56831, 43877, 65497, 32768, 45048, 33792, 65206, 1};
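The lookup can also be wrapped in a small helper. The sketch below assumes the identifier is a plain int; since we do not know whether ids can be negative, the hypothetical color_for_id helper normalizes the index so that it always falls inside the table.

```cpp
#include <cassert>

constexpr int num_colors = 19;
constexpr int colors[num_colors] = {15785, 57545, 65283, 1049,  62470, 37110, 18334,
                                    61852, 55207, 65018, 1040,  56831, 43877, 65497,
                                    32768, 45048, 33792, 65206, 1};

// Map a tracking id to an RGB565 color. The double modulo keeps the index
// non-negative even if a negative id were ever passed in (an assumption,
// the tutorial code uses p.id % num_colors directly).
inline int color_for_id(int id) {
    const int index = ((id % num_colors) + num_colors) % num_colors;
    assert(index >= 0 && index < num_colors);
    return colors[index];
}
```

Ids that differ by a multiple of 19 share a color, so two people can occasionally be drawn in the same color.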