Multi-Modality#
vLLM provides experimental support for multi-modal models through the vllm.multimodal package.
vllm.inputs.PromptStrictInputs accepts an additional attribute multi_modal_data
which allows you to pass in multi-modal input alongside text and token prompts.
By default, vLLM models do not support multi-modal inputs. To enable multi-modal support for a model,
you must decorate the model class with MULTIMODAL_REGISTRY.register_dummy_data,
and with MULTIMODAL_REGISTRY.register_input once for each modality type the model should support.
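The decorator-based registration described above can be illustrated with a small, self-contained sketch. It does not import vLLM: ToyRegistry, ToyImageData, ToyVisionModel, and ToyTextModel are hypothetical stand-ins, and the simplified signatures (no ModelConfig or VisionLanguageConfig) are assumptions made for brevity. It only shows how stacking class decorators can opt a model class in to one modality, while undecorated classes remain unsupported.

```python
from typing import Any, Callable, Dict, Tuple


class ToyImageData:
    """Hypothetical stand-in for a multi-modal data class (not a vLLM class)."""

    def __init__(self, pixels: list) -> None:
        self.pixels = pixels


Processor = Callable[[Any], Dict[str, Any]]


class ToyRegistry:
    """Hypothetical registry that records, per model class, what it supports."""

    def __init__(self) -> None:
        # (model class, data type) -> input processor
        self._processors: Dict[Tuple[type, type], Processor] = {}
        # model class -> dummy data factory used for profiling
        self._dummy_factories: Dict[type, Callable[[int], Any]] = {}

    def register_input(self, data_type: type, processor: Processor):
        def wrapper(model_cls: type) -> type:
            self._processors[(model_cls, data_type)] = processor
            return model_cls  # the class itself is returned unchanged

        return wrapper

    def register_dummy_data(self, factory: Callable[[int], Any]):
        def wrapper(model_cls: type) -> type:
            self._dummy_factories[model_cls] = factory
            return model_cls

        return wrapper

    def process_input(self, data: Any, model_cls: type) -> Dict[str, Any]:
        key = (model_cls, type(data))
        if key not in self._processors:
            raise NotImplementedError(
                f"{model_cls.__name__} does not support {type(data).__name__}"
            )
        return self._processors[key](data)

    def dummy_data_for_profiling(self, seq_len: int, model_cls: type) -> Any:
        return self._dummy_factories[model_cls](seq_len)


TOY_REGISTRY = ToyRegistry()


# One decorator for the dummy data, one per supported modality.
@TOY_REGISTRY.register_input(ToyImageData, lambda d: {"pixel_values": d.pixels})
@TOY_REGISTRY.register_dummy_data(lambda seq_len: ToyImageData([0] * seq_len))
class ToyVisionModel:
    pass


class ToyTextModel:
    """Not decorated, so it accepts no multi-modal input."""
```

Calling `TOY_REGISTRY.process_input(ToyImageData([1, 2]), ToyVisionModel)` runs the registered processor, while the same call with `ToyTextModel` raises `NotImplementedError`, mirroring the "unsupported by default" behavior.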
Module Contents#
Registry#
- vllm.multimodal.MULTIMODAL_REGISTRY#
The global MultiModalRegistry, which is used by model runners.
- class vllm.multimodal.MultiModalRegistry(*, plugins: Sequence[MultiModalPlugin[Any]] = DEFAULT_PLUGINS)[source]#
This registry is used by model runners to dispatch data processing according to its modality and the target model.
- create_input_processor(model_config: ModelConfig, vlm_config: VisionLanguageConfig)[source]#
Create an input processor (see process_input()) for a specific model.
- dummy_data_for_profiling(seq_len: int, model_config: ModelConfig, vlm_config: VisionLanguageConfig)[source]#
Create dummy data for memory profiling.
- process_input(data: MultiModalData, model_config: ModelConfig, vlm_config: VisionLanguageConfig)[source]#
Apply an input processor to a MultiModalData instance passed to the model.
See MultiModalPlugin.process_input() for more details.
- register_dummy_data(factory: Callable[[int, ModelConfig, VisionLanguageConfig], Tuple[SequenceData, MultiModalData]])[source]#
Register a dummy data factory to a model class.
During memory profiling, the provided function is invoked to create dummy data that is fed into the model. The modality and shape of the dummy data should be an upper bound on what the model would receive at inference time.
- register_image_feature_input(processor: Callable[[ImageFeatureData, ModelConfig, VisionLanguageConfig], Dict[str, torch.Tensor]] | None = None)[source]#
Register an input processor for image feature data to a model class.
See MultiModalPlugin.register_input_processor() for more details.
- register_image_pixel_input(processor: Callable[[ImagePixelData, ModelConfig, VisionLanguageConfig], Dict[str, torch.Tensor]] | None = None)[source]#
Register an input processor for image pixel data to a model class.
See MultiModalPlugin.register_input_processor() for more details.
- register_input(data_type: Type[D], processor: Callable[[D, ModelConfig, VisionLanguageConfig], Dict[str, torch.Tensor]] | None = None)[source]#
Register an input processor for a specific modality to a model class.
See MultiModalPlugin.register_input_processor() for more details.
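To make the "upper bound" requirement of register_dummy_data() more concrete, here is a minimal sketch of a dummy data factory for an image model. It is not vLLM's implementation: the function name, the parameters image_token_id, image_feature_size, and max_image_shape, and the choice to place the image placeholder tokens at the start of the sequence are all assumptions made for illustration. The idea is simply to fill a full-length sequence with as many image placeholder tokens as the model could ever see, and to pair it with the largest image shape the model accepts.

```python
from typing import List, Tuple


def make_dummy_image_inputs(
    seq_len: int,
    image_token_id: int,               # hypothetical: ID of the image placeholder token
    image_feature_size: int,           # hypothetical: max placeholder tokens per image
    max_image_shape: Tuple[int, int, int],  # hypothetical: (channels, height, width)
) -> Tuple[List[int], Tuple[int, int, int]]:
    """Build worst-case inputs so memory profiling does not underestimate usage."""
    if seq_len < image_feature_size:
        raise ValueError("seq_len is too short to hold the image placeholder tokens")

    # Image placeholders first, then padding tokens (ID 0) up to the full length.
    token_ids = [image_token_id] * image_feature_size
    token_ids += [0] * (seq_len - image_feature_size)

    # The image side uses the largest shape the model accepts.
    return token_ids, max_image_shape
```

For example, `make_dummy_image_inputs(8, 32000, 3, (3, 336, 336))` yields a sequence of length 8 whose first three entries are the placeholder ID, together with the maximum image shape.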
Base Classes#
- class vllm.multimodal.MultiModalData[source]#
Base class that contains multi-modal data.
To add a new modality, add a new file under the multimodal directory.
In this new file, subclass MultiModalData and MultiModalPlugin.
Finally, register the new plugin to vllm.multimodal.MULTIMODAL_REGISTRY. This enables models to call MultiModalRegistry.register_input() for the new modality.
- class vllm.multimodal.MultiModalPlugin[source]#
Base class that defines data processing logic for a specific modality.
In particular, we adopt a registry pattern to dispatch data processing according to the model being used (considering that different models may process the same data differently). This registry is in turn used by MultiModalRegistry, which acts at a higher level (i.e., the modality of the data).
- abstract get_data_type() -> Type[D][source]#
Get the modality (subclass of MultiModalData) served by this plugin.
- process_input(data: D, model_config: ModelConfig, vlm_config: VisionLanguageConfig) -> Dict[str, torch.Tensor][source]#
Apply an input processor to a MultiModalData instance passed to the model.
The model is identified by model_config. vlm_config is for compatibility purposes and may be merged into model_config in the near future.
- register_input_processor(processor: Callable[[D, ModelConfig, VisionLanguageConfig], Dict[str, torch.Tensor]] | None = None)[source]#
Register an input processor to a model class.
When the model receives input data that matches the modality served by this plugin (see
get_data_type()), the provided input processor is applied to preprocess the data. If None is provided, then the default input processor is applied instead.
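The plugin behavior described in this section (one plugin per modality, per-model processors, and a default processor used when None is registered) can be sketched as follows. This is a simplified illustration and not the vLLM source: ToyData, ToyPlugin, AudioData, AudioPlugin, and the _default_input_processor hook are hypothetical names, and the config arguments are replaced by the model class itself as the lookup key.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Dict, Generic, Optional, Type, TypeVar


class ToyData:
    """Hypothetical base class for multi-modal data."""


D = TypeVar("D", bound=ToyData)


class ToyPlugin(ABC, Generic[D]):
    """Hypothetical plugin: handles one modality, dispatches by model class."""

    def __init__(self) -> None:
        self._processors: Dict[type, Callable[[D], Dict[str, Any]]] = {}

    @abstractmethod
    def get_data_type(self) -> Type[D]:
        """Return the data class (modality) served by this plugin."""

    @abstractmethod
    def _default_input_processor(self, data: D) -> Dict[str, Any]:
        """Fallback used when a model registers without its own processor."""

    def register_input_processor(
        self, processor: Optional[Callable[[D], Dict[str, Any]]] = None
    ):
        def wrapper(model_cls: type) -> type:
            # None means "use the plugin's default processor".
            chosen = self._default_input_processor if processor is None else processor
            self._processors[model_cls] = chosen
            return model_cls

        return wrapper

    def process_input(self, data: D, model_cls: type) -> Dict[str, Any]:
        if not isinstance(data, self.get_data_type()):
            raise TypeError("data does not match the modality of this plugin")
        if model_cls not in self._processors:
            raise KeyError(f"no input processor registered for {model_cls.__name__}")
        return self._processors[model_cls](data)


# A new (hypothetical) modality: subclass both the data class and the plugin.
class AudioData(ToyData):
    def __init__(self, samples: list) -> None:
        self.samples = samples


class AudioPlugin(ToyPlugin[AudioData]):
    def get_data_type(self) -> Type[AudioData]:
        return AudioData

    def _default_input_processor(self, data: AudioData) -> Dict[str, Any]:
        return {"audio": data.samples}


audio_plugin = AudioPlugin()


@audio_plugin.register_input_processor()  # no processor given: default applies
class ModelA:
    pass


@audio_plugin.register_input_processor(
    lambda d: {"audio": [s * 2 for s in d.samples]}  # model-specific processing
)
class ModelB:
    pass
```

Here the same AudioData instance is processed differently depending on the model class, which is the reason the text gives for dispatching on the model as well as on the modality.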
Image Classes#
- class vllm.multimodal.image.ImageFeatureData(image_features: torch.Tensor)[source]#
Bases: MultiModalData
The feature vector of an image, passed directly to the model.
This should be the output of the vision tower.
- class vllm.multimodal.image.ImageFeaturePlugin[source]#
Bases: MultiModalPlugin[ImageFeatureData]
- get_data_type() -> Type[ImageFeatureData][source]#
Get the modality (subclass of MultiModalData) served by this plugin.
- class vllm.multimodal.image.ImagePixelData(image: PIL.Image.Image | torch.Tensor)[source]#
Bases: MultiModalData
The pixel data of an image. Can be one of:
- PIL.Image: An image object. Requires that a HuggingFace processor is available to the model.
- torch.Tensor: The raw pixel data which is passed to the model without additional pre-processing.
- class vllm.multimodal.image.ImagePixelPlugin[source]#
Bases: MultiModalPlugin[ImagePixelData]
- get_data_type() -> Type[ImagePixelData][source]#
Get the modality (subclass of MultiModalData) served by this plugin.
- vllm.multimodal.image.get_dummy_image_data(seq_len: int, model_config: ModelConfig, vlm_config: VisionLanguageConfig) -> Tuple[SequenceData, MultiModalData][source]#
Standard dummy data factory for image data (to be used in vllm.multimodal.MultiModalRegistry.register_dummy_data()).