You are viewing the latest developer preview docs, not the docs for the latest stable release.

Offline Inference

  • LLM Class
  • LLM Inputs


By the vLLM Team

© Copyright 2024, vLLM Team.