Plumerai VLM Video API

This page documents the API for the VLM Video Collection and VLM Video Embedder components.

These components are used in Plumerai Video Search and Plumerai Custom AI Notifications. See those product pages for an overview of how these components are used in practice.

For context, the architecture diagram from the Video Search product page is included below:

[Figure: architecture overview diagram]

For the VLM Text Embedder and the Video and Text Matcher API documentation, visit this page.

See the minimal examples for typical usage of the VLM Video Collection and VLM Video Embedder APIs.

VLMVideoCollection

start_clip

VLMVideoCollection.start_clip(
    include_thumbnails: bool = False, include_captioning_data: bool = False
) -> ErrorCode

Start a new clip to collect data for the VLM Video Collection.

The user decides when data collection should start, e.g. when someone walks into view. From that point onwards, VLMVideoCollection collects the data needed by the VLM on every call to process_frame, until end_clip is called.

The results will be available when calling end_clip.

If include_thumbnails is True, the collected data can be used for thumbnail generation. If include_captioning_data is True, the collected data can be used for caption generation. If the data is only used for video search with the VLM Video Embedder, both can be set to False to keep the data size smaller. The accuracy of video search is not affected by these parameters.

Arguments:

include_thumbnails: If True, include data required for thumbnails.
include_captioning_data: If True, include data required for video captioning.

Returns:

Returns SUCCESS on success, or CLIP_ALREADY_STARTED when a clip was already started.
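As a minimal sketch of handling the two documented return values, the snippet below starts a search-only clip and reports whether the call actually opened a new one. The ErrorCode enum and FakeCollection class are hypothetical stand-ins written for this example so that it runs without the library; only the start_clip signature and the two error codes are taken from this page.

```python
from enum import Enum, auto


class ErrorCode(Enum):
    """Stand-in for the library's ErrorCode (assumption: only these values)."""
    SUCCESS = auto()
    CLIP_ALREADY_STARTED = auto()


class FakeCollection:
    """Hypothetical stub that mimics the documented start_clip behaviour."""

    def __init__(self):
        self.started = False

    def start_clip(self, include_thumbnails=False, include_captioning_data=False):
        if self.started:
            return ErrorCode.CLIP_ALREADY_STARTED
        self.started = True
        return ErrorCode.SUCCESS


def begin_search_clip(collection) -> bool:
    """Start a search-only clip; return True if this call opened a new clip."""
    code = collection.start_clip(
        include_thumbnails=False, include_captioning_data=False
    )
    return code == ErrorCode.SUCCESS


collection = FakeCollection()
first = begin_search_clip(collection)   # no clip was open, so one is started
second = begin_search_clip(collection)  # a clip is already open
```

With a real VLMVideoCollection instance, the same helper would be called at the moment the application decides a clip should begin.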

end_clip

VLMVideoCollection.end_clip() -> tuple[ErrorCode, bytes, list[tuple[float, Any]]]

Ends the data collection for the clip started with start_clip.

In addition to the resulting 'clip data' bytes, this method also returns a list of selected frames which can be used for thumbnail generation or other purposes. Each frame is represented as a tuple of a timestamp (in seconds) and the input image that was originally passed to process_frame (typically a numpy array).

Arguments:

None.

Returns:

Returns a tuple with an error code, the result data and a list of selected frames. The error code is SUCCESS on success or CLIP_NOT_YET_STARTED when a clip was not started previously or was already ended before.
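One possible use of the selected-frames list is picking a single thumbnail. The sketch below chooses the frame whose timestamp is closest to the midpoint of the clip. This selection rule and the helper name pick_middle_frame are assumptions made for illustration, not something the API prescribes; the sample data only imitates the documented (timestamp, image) shape, with strings in place of images.

```python
from typing import Any, Optional


def pick_middle_frame(
    frames: list[tuple[float, Any]],
) -> Optional[tuple[float, Any]]:
    """Return the selected frame whose timestamp is closest to the clip midpoint."""
    if not frames:
        return None
    timestamps = [timestamp for timestamp, _ in frames]
    midpoint = (min(timestamps) + max(timestamps)) / 2
    return min(frames, key=lambda frame: abs(frame[0] - midpoint))


# Illustrative data in the documented shape: (timestamp in seconds, image).
frames = [(0.0, "img_a"), (1.5, "img_b"), (4.0, "img_c")]
chosen = pick_middle_frame(frames)
```

In real use, frames would be the third element of the tuple returned by end_clip, and it should only be used when the returned error code is SUCCESS.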

VLMVideoEmbedder

compute_embeddings

VLMVideoEmbedder.compute_embeddings(
    clip_data: bytes, compute_single_unit_only: bool = False
) -> tuple[ErrorCode, bytes]

Compute video embeddings on data collected using VLMVideoCollection.

Depending on the size of the collected data, this can be compute-heavy. The user can optionally set compute_single_unit_only to compute only a single part of the video embeddings. When compute_single_unit_only is set, this needs to be called in a loop with the same arguments. The error code return value will inform the user whether all units were computed, and whether the results are valid.

Arguments:

clip_data: The data collected using VLMVideoCollection.start_clip and VLMVideoCollection.end_clip.
compute_single_unit_only: Can be set to do a partial computation of the embeddings. If set, this needs to be called in a loop, see above.

Returns:

A tuple with an error code and the resulting embeddings. The error code is SUCCESS when all embeddings have been computed, EMBEDDING_PART_COMPUTED when a single unit was successfully computed (but not everything), or INVALID_CLIP_DATA when the clip data is invalid. The resulting embeddings are only valid when the error code is SUCCESS.
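The incremental mode described above could be driven by a loop along these lines. This is a sketch under stated assumptions: ErrorCode and FakeEmbedder are hypothetical stand-ins so the example runs without the library, the helper name embed_incrementally and its max_calls safety limit are not part of the API, and only the compute_embeddings signature and the three error codes come from this page.

```python
from enum import Enum, auto


class ErrorCode(Enum):
    """Stand-in for the library's ErrorCode (assumption: only these values)."""
    SUCCESS = auto()
    EMBEDDING_PART_COMPUTED = auto()
    INVALID_CLIP_DATA = auto()


class FakeEmbedder:
    """Hypothetical stub: needs `units` single-unit calls before SUCCESS."""

    def __init__(self, units: int):
        self.remaining = units

    def compute_embeddings(self, clip_data: bytes, compute_single_unit_only: bool = False):
        if not clip_data:
            return ErrorCode.INVALID_CLIP_DATA, b""
        self.remaining = self.remaining - 1 if compute_single_unit_only else 0
        if self.remaining > 0:
            return ErrorCode.EMBEDDING_PART_COMPUTED, b""
        return ErrorCode.SUCCESS, b"embeddings"


def embed_incrementally(embedder, clip_data: bytes, max_calls: int = 1000):
    """Call compute_embeddings one unit at a time until it finishes or fails."""
    for _ in range(max_calls):
        code, embeddings = embedder.compute_embeddings(
            clip_data, compute_single_unit_only=True
        )
        if code == ErrorCode.EMBEDDING_PART_COMPUTED:
            continue  # more units remain; other work could be scheduled here
        if code == ErrorCode.SUCCESS:
            return code, embeddings
        return code, None  # e.g. INVALID_CLIP_DATA: the result is not valid
    raise RuntimeError("embeddings not finished within max_calls")


code, embeddings = embed_incrementally(FakeEmbedder(units=3), b"clip-data")
bad_code, bad_embeddings = embed_incrementally(FakeEmbedder(units=3), b"")
```

Splitting the work this way lets an application interleave other tasks between units; the embeddings are only used once SUCCESS is returned, matching the validity rule stated above.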