vLLM is currently one of the fastest inference engines for large language models (LLMs). It supports a wide range of model architectures and quantization methods.
vLLM also supports vision-language models (VLMs) with multimodal inputs containing both images and text prompts. For instance, vLLM can now serve models like Phi-3.5 Vision and Pixtral, which excel at tasks such as image captioning, optical character recognition (OCR), and visual question answering (VQA).
In this article, I show how to run VLMs with vLLM, focusing on the key parameters that drive memory consumption. We will see why VLMs consume much more memory than standard LLMs, using Phi-3.5 Vision and Pixtral as case studies for a multimodal application that processes prompts combining text and images.
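To give a sense of what this looks like in practice, here is a minimal sketch of a multimodal request with vLLM's offline API. It assumes the microsoft/Phi-3.5-vision-instruct checkpoint from the Hugging Face Hub, the image-placeholder chat template from that model card, and a local image path; treat these specifics as illustrative placeholders rather than a definitive recipe.

```python
from vllm import LLM, SamplingParams
from PIL import Image

# Load Phi-3.5 Vision. trust_remote_code is needed because the model
# ships custom modeling code on the Hugging Face Hub.
llm = LLM(
    model="microsoft/Phi-3.5-vision-instruct",
    trust_remote_code=True,
    max_model_len=4096,                 # context length: a key memory knob
    limit_mm_per_prompt={"image": 1},   # cap the number of images per prompt
)

# Phi-3.5 Vision expects <|image_1|>-style placeholders in the prompt.
prompt = "<|user|>\n<|image_1|>\nDescribe this image.<|end|>\n<|assistant|>\n"
image = Image.open("example.jpg")  # placeholder path

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=128, temperature=0.0),
)
print(outputs[0].outputs[0].text)
```

The max_model_len and limit_mm_per_prompt arguments are exactly the kind of memory-related parameters this article focuses on.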
The code for running Phi-3.5 Vision and Pixtral with vLLM is provided in this notebook:
In transformer models, generating text token by token is slow because each prediction depends on all previous tokens…
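To make that dependency concrete, here is a toy greedy-decoding loop written with Hugging Face Transformers (GPT-2 is used only as a small stand-in model; this is an illustration of the principle, not vLLM's implementation). At every step the whole growing sequence is fed back to the model, so each new token's prediction attends over all previous tokens.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# A small causal LM, purely for illustration.
model_id = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

input_ids = tokenizer("vLLM is", return_tensors="pt").input_ids

# Naive token-by-token generation: each step re-feeds the entire sequence,
# so the model recomputes attention over all previous tokens every time.
for _ in range(10):
    with torch.no_grad():
        logits = model(input_ids, use_cache=False).logits
    next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
    input_ids = torch.cat([input_ids, next_token], dim=-1)

print(tokenizer.decode(input_ids[0]))
```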