Introduction

:::color5

vLLM: easy, fast, and cheap LLM serving with PagedAttention. vLLM is a Python library that also ships precompiled C++ and CUDA (12.1) binaries.

:::

Official website: https://vllm.ai
Official GitHub repository: https://github.com/vllm-project/vllm
Supported models: https://vllm.readthedocs.io/en/latest/models/supported_models.html
Quick start: https://docs.vllm.ai/en/latest/getting_started/installation.html

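vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast, with:

State-of-the-art serving throughput
Efficient management of attention key and value memory with PagedAttention
Continuous batching of incoming requests
Fast model execution with CUDA/HIP graphs
Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV cache
Optimized CUDA kernels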
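
To make the PagedAttention idea more concrete, here is a toy sketch of paged KV-cache bookkeeping: each sequence keeps a block table that maps its tokens onto fixed-size physical blocks drawn from a shared pool, so memory is claimed one block at a time instead of being reserved up front. The names `ToyBlockAllocator`, `ToySequence`, and `BLOCK_SIZE` are hypothetical and chosen for illustration only; this is not vLLM's implementation, and it models only the bookkeeping, not the attention computation.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per KV-cache block (hypothetical value for this sketch)


@dataclass
class ToyBlockAllocator:
    """Hands out fixed-size physical blocks from a shared pool."""
    num_blocks: int
    free: list = field(init=False)

    def __post_init__(self):
        self.free = list(range(self.num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV-cache blocks")
        return self.free.pop(0)

    def release(self, blocks):
        self.free.extend(blocks)


@dataclass
class ToySequence:
    """Tracks one request's logical-to-physical block table."""
    allocator: ToyBlockAllocator
    num_tokens: int = 0
    block_table: list = field(default_factory=list)

    def append_token(self):
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def finish(self):
        # Return every block to the pool so other requests can reuse them.
        self.allocator.release(self.block_table)
        self.block_table = []
        self.num_tokens = 0


pool = ToyBlockAllocator(num_blocks=8)
seq_a = ToySequence(pool)
seq_b = ToySequence(pool)
for _ in range(5):
    seq_a.append_token()  # 5 tokens need 2 blocks
for _ in range(3):
    seq_b.append_token()  # 3 tokens need 1 block

print(seq_a.block_table, seq_b.block_table, len(pool.free))
```

In this toy model a sequence wastes at most one partially filled block, which gives a rough intuition for why block-based allocation helps serving throughput.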

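vLLM is flexible and easy to use, with:

Seamless integration with popular Hugging Face models
High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
Tensor parallelism support for distributed inference
Streaming outputs
OpenAI-compatible API server
Support for NVIDIA GPUs and AMD GPUs
(Experimental) Prefix caching support
(Experimental) Multi-LoRA support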
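
Because the API server is OpenAI-compatible, a client can talk to it with an ordinary HTTP request. The sketch below only builds such a request with the Python standard library and does not send it. It assumes a vLLM server is already listening on `http://localhost:8000` and serving the model named in the payload; the helper name `build_completion_request` and the default values are illustrative assumptions.

```python
import json
import urllib.request


def build_completion_request(prompt, model,
                             base_url="http://localhost:8000",
                             max_tokens=64):
    """Build (but do not send) a POST request for the completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_completion_request(
    "San Francisco is a",
    model="mistralai/Mistral-7B-Instruct-v0.1",
)
print(req.full_url)

# To actually send it (requires a running server):
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp)["choices"][0]["text"])
```

Keeping request construction separate from sending makes the payload easy to inspect before pointing it at a real deployment.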

vLLM seamlessly supports many Hugging Face models, including the following architectures:

Aquila and Aquila2 (BAAI/AquilaChat2-7B, BAAI/AquilaChat2-34B, BAAI/Aquila-7B, BAAI/AquilaChat-7B, etc.)
Baichuan and Baichuan2 (baichuan-inc/Baichuan2-13B-Chat, baichuan-inc/Baichuan-7B, etc.)
BLOOM (bigscience/bloom, bigscience/bloomz, etc.)
ChatGLM (THUDM/chatglm2-6b, THUDM/chatglm3-6b, etc.)
Command-R (CohereForAI/c4ai-command-r-v01, etc.)
DBRX (databricks/dbrx-base, databricks/dbrx-instruct, etc.)
DeciLM (Deci/DeciLM-7B, Deci/DeciLM-7B-instruct, etc.)
Falcon (tiiuae/falcon-7b, tiiuae/falcon-40b, tiiuae/falcon-rw-7b, etc.)
Gemma (google/gemma-2b, google/gemma-7b, etc.)
GPT-2 (gpt2, gpt2-xl, etc.)
GPT BigCode (bigcode/starcoder, bigcode/gpt_bigcode-santacoder, etc.)
GPT-J (EleutherAI/gpt-j-6b, nomic-ai/gpt4all-j, etc.)
GPT-NeoX (EleutherAI/gpt-neox-20b, databricks/dolly-v2-12b, stabilityai/stablelm-tuned-alpha-7b, etc.)
InternLM (internlm/internlm-7b, internlm/internlm-chat-7b, etc.)
InternLM2 (internlm/internlm2-7b, internlm/internlm2-chat-7b, etc.)
Jais (core42/jais-13b, core42/jais-13b-chat, core42/jais-30b-v3, core42/jais-30b-chat-v3, etc.)
LLaMA, Llama 2, and Meta Llama 3 (meta-llama/Meta-Llama-3-8B-Instruct, meta-llama/Meta-Llama-3-70B-Instruct, meta-llama/Llama-2-70b-hf, lmsys/vicuna-13b-v1.3, young-geng/koala, openlm-research/open_llama_13b, etc.)
MiniCPM (openbmb/MiniCPM-2B-sft-bf16, openbmb/MiniCPM-2B-dpo-bf16, etc.)
Mistral (mistralai/Mistral-7B-v0.1, mistralai/Mistral-7B-Instruct-v0.1, etc.)
Mixtral (mistralai/Mixtral-8x7B-v0.1, mistralai/Mixtral-8x7B-Instruct-v0.1, mistral-community/Mixtral-8x22B-v0.1, etc.)
MPT (mosaicml/mpt-7b, mosaicml/mpt-30b, etc.)
OLMo (allenai/OLMo-1B-hf, allenai/OLMo-7B-hf, etc.)
OPT (facebook/opt-66b, facebook/opt-iml-max-30b, etc.)
Orion (OrionStarAI/Orion-14B-Base, OrionStarAI/Orion-14B-Chat, etc.)
Phi (microsoft/phi-1_5, microsoft/phi-2, etc.)
Phi-3 (microsoft/Phi-3-mini-4k-instruct, microsoft/Phi-3-mini-128k-instruct, etc.)
Qwen (Qwen/Qwen-7B, Qwen/Qwen-7B-Chat, etc.)
Qwen2 (Qwen/Qwen1.5-7B, Qwen/Qwen1.5-7B-Chat, etc.)
Qwen2MoE (Qwen/Qwen1.5-MoE-A2.7B, Qwen/Qwen1.5-MoE-A2.7B-Chat, etc.)
StableLM (stabilityai/stablelm-3b-4e1t, stabilityai/stablelm-base-alpha-7b-v2, etc.)
Starcoder2 (bigcode/starcoder2-3b, bigcode/starcoder2-7b, bigcode/starcoder2-15b, etc.)
Xverse (xverse/XVERSE-7B-Chat, xverse/XVERSE-13B-Chat, xverse/XVERSE-65B-Chat, etc.)
Yi (01-ai/Yi-6B, 01-ai/Yi-34B, etc.)
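
As a rough sketch of how one of the models listed above might be used for offline inference, the helper below wraps vLLM's `LLM` and `SamplingParams` classes. The function name `generate_with_vllm`, the sampling values, and the defaults are assumptions made for illustration. Calling it requires vLLM to be installed, a supported GPU, and access to the model weights, and some architectures that ship custom code may need `trust_remote_code=True`.

```python
def generate_with_vllm(model_id, prompts, max_tokens=64, trust_remote_code=False):
    """Generate one completion per prompt with a Hugging Face model ID."""
    # Imported lazily so this file can be loaded on a machine without vLLM or a GPU.
    from vllm import LLM, SamplingParams

    # Illustrative sampling settings; tune them for the task at hand.
    sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=max_tokens)
    llm = LLM(model=model_id, trust_remote_code=trust_remote_code)
    outputs = llm.generate(prompts, sampling)
    # Each result holds one or more candidates; keep the first one's text.
    return [out.outputs[0].text for out in outputs]


# Example call (needs vLLM, a GPU, and downloaded weights):
#   texts = generate_with_vllm("mistralai/Mistral-7B-Instruct-v0.1",
#                              ["The capital of France is"])
```

Passing the model as a plain Hugging Face ID means switching between the architectures above should, in most cases, only require changing that string.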