Multi-Modality

vLLM provides experimental support for multi-modal models through the vllm.multimodal package.

vllm.inputs.PromptStrictInputs accepts an additional attribute multi_modal_data which allows you to pass in multi-modal input alongside text and token prompts.
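As a rough illustration, a prompt carrying multi-modal data can be pictured as a plain dictionary. The sketch below does not import vLLM. The "prompt" key and the "image" key are assumptions about the built-in conventions, and build_multimodal_prompt is a hypothetical helper, not part of the library.

```python
from typing import Any, Dict


def build_multimodal_prompt(text: str, image: Any) -> Dict[str, Any]:
    """Bundle a text prompt and image data into one prompt dictionary.

    Hypothetical helper: the "prompt" and "image" keys are assumptions.
    """
    return {
        "prompt": text,
        # multi_modal_data maps a modality key to the raw data for it.
        "multi_modal_data": {"image": image},
    }


# Placeholder standing in for real image data (for example, a PIL image).
fake_image = object()
request = build_multimodal_prompt("Describe the picture.", fake_image)
```

A dictionary of this shape would then be handed to the engine in place of a bare string prompt.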

Note

multi_modal_data can accept keys and values beyond the built-in ones, as long as a customized plugin is registered through vllm.multimodal.MULTIMODAL_REGISTRY.

By default, vLLM models do not support multi-modal inputs. To enable multi-modal support for a model, please follow the guide for adding a new multimodal model.

# TODO: Add more instructions on how to do that once embeddings is in.

Guides

Module Contents

Registry

vllm.multimodal.MULTIMODAL_REGISTRY = <vllm.multimodal.registry.MultiModalRegistry object>

The global MultiModalRegistry is used by model runners to dispatch data processing according to its modality and the target model.

class vllm.multimodal.MultiModalRegistry(*, plugins: Sequence[MultiModalPlugin] = DEFAULT_PLUGINS)

A registry to dispatch data processing according to its modality and the target model.

The registry handles both external and internal data input.

create_input_mapper(model_config: ModelConfig)

Create an input mapper (see map_input()) for a specific model.

get_max_multimodal_tokens(model_config: ModelConfig) → int

Get the maximum number of multi-modal tokens for profiling the memory usage of a model.

See MultiModalPlugin.get_max_multimodal_tokens() for more details.

map_input(model_config: ModelConfig, data: MultiModalDataBuiltins | Dict[str, Any]) → MultiModalInputs

Apply an input mapper to the data passed to the model.

See MultiModalPlugin.map_input() for more details.

register_image_input_mapper(mapper: Callable[[InputContext, object], MultiModalInputs] | None = None)

Register an input mapper for image data to a model class.

See MultiModalPlugin.register_input_mapper() for more details.

register_input_mapper(data_type_key: str, mapper: Callable[[InputContext, object], MultiModalInputs] | None = None)

Register an input mapper for a specific modality to a model class.

See MultiModalPlugin.register_input_mapper() for more details.

register_max_image_tokens(max_mm_tokens: int | Callable[[InputContext], int] | None = None)

Register the maximum number of image tokens input to the language model for a model class.

register_max_multimodal_tokens(data_type_key: str, max_mm_tokens: int | Callable[[InputContext], int] | None = None)

Register the maximum number of tokens, belonging to a specific modality, input to the language model for a model class.
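The max_mm_tokens parameter above accepts either a fixed integer or a callable that computes the value from a context. A minimal sketch of how such a value might be resolved is shown here. FakeContext is a hypothetical stand-in for InputContext, and resolve_max_tokens and tokens_from_patches are illustrative names that do not exist in vLLM.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class FakeContext:
    """Hypothetical stand-in for InputContext with two made-up fields."""
    image_size: int
    patch_size: int


MaxTokens = Union[int, Callable[[FakeContext], int]]


def resolve_max_tokens(max_mm_tokens: MaxTokens, ctx: FakeContext) -> int:
    """Return the token count, calling max_mm_tokens if it is a function."""
    if callable(max_mm_tokens):
        return max_mm_tokens(ctx)
    return max_mm_tokens


def tokens_from_patches(ctx: FakeContext) -> int:
    """Illustrative rule: one token per square patch of the image."""
    per_side = ctx.image_size // ctx.patch_size
    return per_side * per_side


ctx = FakeContext(image_size=336, patch_size=14)
fixed = resolve_max_tokens(576, ctx)
computed = resolve_max_tokens(tokens_from_patches, ctx)
```

The callable form is useful when the count depends on model configuration, while the integer form suits models with a constant number of tokens per input.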

Base Classes

vllm.multimodal.MultiModalDataDict

alias of Union[MultiModalDataBuiltins, Dict[str, Any]]

class vllm.multimodal.MultiModalInputs(dict=None, /, **kwargs)

Bases: _MultiModalInputsBase

A dictionary that represents the keyword arguments to forward().

static batch(inputs_list: List[MultiModalInputs], device: torch.types.Device) → Dict[str, torch.Tensor | List[torch.Tensor]]

Batch multiple inputs together into a dictionary.
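The grouping step of batching can be sketched without torch. The function below only collects the values for each keyword across requests. It is an illustration under that assumption, not the library's implementation: it ignores the device argument and does not combine the collected values into a single tensor, which the return type above suggests can happen. group_by_key is a hypothetical name.

```python
from collections import defaultdict
from typing import Any, Dict, List


def group_by_key(inputs_list: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Collect the values of each keyword argument across all inputs."""
    grouped: Dict[str, List[Any]] = defaultdict(list)
    for inputs in inputs_list:
        for key, value in inputs.items():
            grouped[key].append(value)
    return dict(grouped)


# Plain lists stand in for tensors so the sketch has no dependencies.
batched = group_by_key([
    {"pixel_values": [0.1, 0.2]},
    {"pixel_values": [0.3, 0.4]},
])
```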

class vllm.multimodal.MultiModalPlugin

Bases: ABC

Base class that defines data processing logic for a specific modality.

In particular, we adopt a registry pattern to dispatch data processing according to the model being used (considering that different models may process the same data differently). This registry is in turn used by MultiModalRegistry which acts at a higher level (i.e., the modality of the data).

abstract get_data_key() → str

Get the data key corresponding to the modality.

get_max_multimodal_tokens(model_config: ModelConfig) → int

Get the maximum number of multi-modal tokens for profiling the memory usage of a model.

If this registry is not applicable to the model, 0 is returned.

The model is identified by model_config.

map_input(model_config: ModelConfig, data: object) → MultiModalInputs

Apply an input mapper to the data passed to the model, transforming the data into a dictionary of model inputs.

The model is identified by model_config.

Raises:

TypeError – If the data type is not supported.

register_input_mapper(mapper: Callable[[InputContext, object], MultiModalInputs] | None = None)

Register an input mapper to a model class.

When the model receives input data that matches the modality served by this plugin (see get_data_key()), the provided function is invoked to transform the data into a dictionary of model inputs.

If None is provided, then the default input mapper is used instead.

register_max_multimodal_tokens(max_mm_tokens: int | Callable[[InputContext], int] | None = None)

Register the maximum number of multi-modal tokens input to the language model for a model class.

If None is provided, then the default calculation is used instead.
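The two-level dispatch described for MultiModalPlugin (by modality in the registry, then by model class in the plugin) can be illustrated with a toy version. Everything below is a simplified sketch with hypothetical names (ToyPlugin, ToyImagePlugin, ToyRegistry). It passes the model class directly rather than a ModelConfig, its mappers take only the data rather than an InputContext, and it assumes registration is applied as a class decorator.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Dict, Optional, Sequence

Mapper = Callable[[Any], Dict[str, Any]]


class ToyPlugin(ABC):
    """Per-modality plugin that dispatches on the model class."""

    def __init__(self) -> None:
        self._mappers: Dict[type, Mapper] = {}

    @abstractmethod
    def get_data_key(self) -> str:
        """Return the key that identifies this modality."""

    def default_mapper(self, data: Any) -> Dict[str, Any]:
        # Fallback used when no custom mapper is given (an assumed default).
        return {self.get_data_key(): data}

    def register_input_mapper(self, mapper: Optional[Mapper] = None):
        def wrapper(model_cls: type) -> type:
            chosen = mapper if mapper is not None else self.default_mapper
            self._mappers[model_cls] = chosen
            return model_cls
        return wrapper

    def map_input(self, model_cls: type, data: Any) -> Dict[str, Any]:
        if model_cls not in self._mappers:
            raise KeyError(f"no mapper registered for {model_cls.__name__}")
        return self._mappers[model_cls](data)


class ToyImagePlugin(ToyPlugin):
    def get_data_key(self) -> str:
        return "image"


class ToyRegistry:
    """Higher-level registry that dispatches on the modality key."""

    def __init__(self, plugins: Sequence[ToyPlugin]) -> None:
        self._plugins = {p.get_data_key(): p for p in plugins}

    def map_input(self, model_cls: type, data: Dict[str, Any]) -> Dict[str, Any]:
        merged: Dict[str, Any] = {}
        for key, value in data.items():
            if key not in self._plugins:
                raise KeyError(f"no plugin for modality {key!r}")
            merged.update(self._plugins[key].map_input(model_cls, value))
        return merged


image_plugin = ToyImagePlugin()
registry = ToyRegistry([image_plugin])


@image_plugin.register_input_mapper(lambda data: {"pixel_values": data})
class ToyModel:
    pass


@image_plugin.register_input_mapper()  # None: fall back to the default mapper
class OtherModel:
    pass


custom = registry.map_input(ToyModel, {"image": [1, 2, 3]})
default = registry.map_input(OtherModel, {"image": [1, 2, 3]})
```

Keeping the per-model table inside each plugin lets two models handle the same modality differently, while the registry only needs to know which plugin owns which key.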

Image Classes

class vllm.multimodal.image.ImagePlugin

Bases: MultiModalPlugin

get_data_key() → str

Get the data key corresponding to the modality.