Welcome to vLLM!
Easy, fast, and cheap LLM serving for everyone
vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM is fast with:
State-of-the-art serving throughput
Efficient management of attention key and value memory with PagedAttention
Continuous batching of incoming requests
Fast model execution with CUDA/HIP graphs
Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache
Optimized CUDA kernels
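To make the paged memory idea above more concrete, here is a minimal sketch of block-based KV-cache bookkeeping. It is a toy illustration only, not vLLM's implementation: the class `ToyBlockAllocator` and all of its methods are hypothetical names, and it tracks block IDs rather than real GPU memory. The point it shows is that each sequence holds a table of fixed-size blocks drawn from a shared pool, so a new block is claimed only when the previous one fills up.

```python
class ToyBlockAllocator:
    """Hypothetical helper: hands out fixed-size KV-cache blocks from a shared pool."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids not in use
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.seq_lens = {}      # sequence id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve room for one more token, allocating a block only when needed."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:  # no block yet, or the last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = ToyBlockAllocator(num_blocks=8, block_size=4)
for _ in range(5):
    alloc.append_token("seq-a")  # 5 tokens need 2 blocks of size 4
for _ in range(3):
    alloc.append_token("seq-b")  # 3 tokens fit in 1 block
print(len(alloc.block_tables["seq-a"]), len(alloc.block_tables["seq-b"]))
alloc.free("seq-a")              # blocks become reusable by other sequences
print(len(alloc.free_blocks))
```

Because blocks are returned to the pool as soon as a sequence finishes, a scheme like this pairs naturally with continuous batching, where new requests join the running batch as capacity frees up.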
vLLM is flexible and easy to use with:
Seamless integration with popular HuggingFace models
High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
Tensor parallelism support for distributed inference
Streaming outputs
OpenAI-compatible API server
Support for NVIDIA and AMD GPUs
(Experimental) Prefix caching support
(Experimental) Multi-LoRA support
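As a rough illustration of what the OpenAI-compatible API server means for clients, the sketch below builds a completion request in the OpenAI wire format using only the Python standard library. Several details are assumptions rather than facts from this page: the base URL `http://localhost:8000`, the model name `facebook/opt-125m`, and the helper `build_completion_request` are placeholders, and a server must be started separately before anything is sent. The code only constructs the request; it does not contact a server.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=64, stream=False):
    """Hypothetical helper: build a POST request for an OpenAI-style /v1/completions endpoint."""
    payload = {
        "model": model,            # must match the model the server is serving
        "prompt": prompt,
        "max_tokens": max_tokens,
        "stream": stream,          # True asks the server to stream tokens back
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed address and model name; adjust both to your deployment.
req = build_completion_request(
    "http://localhost:8000", "facebook/opt-125m", "San Francisco is a"
)
print(req.get_method(), req.full_url)
# With a server running, the request could be sent with urllib.request.urlopen(req).
```

Since the request shape follows the OpenAI API, existing OpenAI client code can usually be pointed at the server by changing only the base URL and model name.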
For more information, check out the following:
vLLM announcing blog post (intro to PagedAttention)
vLLM paper (SOSP 2023)
How continuous batching enables 23x throughput in LLM inference while reducing p50 latency by Cade Daniel et al.
Documentation
Getting Started
Serving
Models
Quantization
Automatic Prefix Caching
Developer Documentation
- Sampling Parameters
- Offline Inference
- vLLM Engine
- LLMEngine
  - LLMEngine
  - LLMEngine.DO_VALIDATE_OUTPUT
  - LLMEngine.abort_request()
  - LLMEngine.add_request()
  - LLMEngine.do_log_stats()
  - LLMEngine.from_engine_args()
  - LLMEngine.get_decoding_config()
  - LLMEngine.get_model_config()
  - LLMEngine.get_num_unfinished_requests()
  - LLMEngine.has_unfinished_requests()
  - LLMEngine.has_unfinished_requests_for_virtual_engine()
  - LLMEngine.step()
- AsyncLLMEngine
- vLLM Paged Attention
- Input Processing
- Multi-Modality
- Guides
- Module Contents
- Registry
  - MULTIMODAL_REGISTRY
  - MultiModalRegistry
  - MultiModalRegistry.create_input_mapper()
  - MultiModalRegistry.get_max_multimodal_tokens()
  - MultiModalRegistry.map_input()
  - MultiModalRegistry.register_image_input_mapper()
  - MultiModalRegistry.register_input_mapper()
  - MultiModalRegistry.register_max_image_tokens()
  - MultiModalRegistry.register_max_multimodal_tokens()
- Base Classes
- Image Classes
- Dockerfile
Community