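Troubleshooting¶

This document outlines some troubleshooting strategies you can consider. If you think you have discovered a bug, first search the existing issues to see whether it has already been reported. If it has not, file a new issue and provide as much relevant information as possible.

Note

Once you have debugged a problem, remember to turn off any debugging environment variables you defined, or simply start a new shell, so that lingering debugging settings do not affect you. Otherwise, the system may be slow because debugging functionality is still active.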

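Hangs downloading a model¶

If the model is not already on disk, vLLM downloads it from the internet, which can take time and depends on your internet connection. It is better to download the model first with huggingface-cli and then pass the local path of the model to vLLM. This way, you can tell download problems apart from vLLM problems.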

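Hangs loading a model from disk¶

If the model is large, loading it from disk can take a long time. Pay attention to where you store the model. Some clusters share a filesystem across nodes, for example a distributed filesystem or a network filesystem, which can be slow. It is better to store the model on a local disk. Also keep an eye on CPU memory usage: when the model is too large, it can take up a lot of CPU memory and slow down the operating system, which then has to swap frequently between disk and memory.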
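To check whether a model directory sits on a shared filesystem, one option is to look up its mount point in `/proc/mounts` on Linux. The sketch below is a minimal illustration, not part of vLLM: the helper names and the `NETWORK_FS_TYPES` set are assumptions made for this example, and the set is not exhaustive.

```python
import posixpath

# Filesystem types treated as "network" here. This set is an illustrative
# assumption and is not exhaustive.
NETWORK_FS_TYPES = {"nfs", "nfs4", "cifs", "lustre", "ceph", "glusterfs", "beegfs", "gpfs"}


def filesystem_type(path, mounts_text):
    """Return (mount_point, fs_type) for the longest mount point containing `path`.

    `path` must be absolute; `mounts_text` is the content of /proc/mounts.
    """
    path = posixpath.normpath(path)
    best = ("", "unknown")
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        # Compare with a trailing slash so /mnt/models does not match /mnt/models2.
        prefix = mount_point.rstrip("/") + "/"
        if (path + "/").startswith(prefix) and len(mount_point) > len(best[0]):
            best = (mount_point, fs_type)
    return best


def is_network_path(path, mounts_text):
    """Return True if `path` lives on one of the NETWORK_FS_TYPES filesystems."""
    return filesystem_type(path, mounts_text)[1] in NETWORK_FS_TYPES
```

On a Linux machine you could call `is_network_path(model_dir, open("/proc/mounts").read())`, where `model_dir` is your own absolute model path. Mount points containing spaces are escaped in `/proc/mounts` and are not handled by this sketch.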

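Note

To isolate model downloading and loading issues, you can use the `--load-format dummy` argument to skip loading the model weights. This way, you can check whether model downloading and loading is the bottleneck.

Out of memory¶

If the model is too large to fit on a single GPU, you will get an out-of-memory (OOM) error. Consider adopting these options to reduce memory consumption.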

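Generation quality changed¶

In v0.8.0, the source of the default sampling parameters was changed in pull request #12622. Prior to v0.8.0, the default sampling parameters came from vLLM's set of neutral defaults. From v0.8.0 onwards, the default sampling parameters come from the `generation_config.json` provided by the model creator.

In most cases, this should lead to higher quality responses, because the model creator is likely to know which sampling parameters work best for their model. However, in some cases the defaults provided by the model creator can lead to degraded performance.

You can check whether this is happening by trying the old defaults: use `--generation-config vllm` for online serving and `generation_config="vllm"` for offline use. If your generation quality improves after trying this, we recommend continuing to use the vLLM defaults and asking the model creator on https://huggingface.co to update their default `generation_config.json` so that it produces better quality generations.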
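Before switching defaults, it can help to see which sampling values the model creator actually ships. The sketch below reads the text of a `generation_config.json` file and returns the sampling-related entries. It is an illustration only: the helper name and the choice of keys in `SAMPLING_KEYS` are assumptions for this example, and the set of keys vLLM actually reads may differ.

```python
import json

# Sampling-related keys that commonly appear in a Hugging Face
# generation_config.json. The selection is an assumption for illustration.
SAMPLING_KEYS = ("temperature", "top_p", "top_k", "repetition_penalty")


def sampling_defaults(config_text):
    """Return the sampling-related entries found in generation_config.json text."""
    config = json.loads(config_text)
    return {key: config[key] for key in SAMPLING_KEYS if key in config}
```

An empty result suggests the file does not override these parameters, in which case the quality change probably has another cause.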

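Enable more logging¶

If other strategies do not solve the problem, the vLLM instance is likely stuck somewhere. You can use the following environment variables to help debug the issue:

`export VLLM_LOGGING_LEVEL=DEBUG` to turn on more logging.
`export CUDA_LAUNCH_BLOCKING=1` to identify which CUDA kernel is causing the problem.
`export NCCL_DEBUG=TRACE` to turn on more logging for NCCL.
`export VLLM_TRACE_FUNCTION=1` to record all function calls for inspection in the log files, so you can tell which function crashes or hangs. Do not use this flag unless you absolutely need it for debugging, because it significantly delays startup.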
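One way to keep these settings from lingering in your shell is to apply them only to a child process. The sketch below builds such an environment from the variables listed above. The helper name `debug_environment` is hypothetical and exists only for this example.

```python
import os

# Debugging variables taken from the list above.
DEBUG_ENV = {
    "VLLM_LOGGING_LEVEL": "DEBUG",
    "CUDA_LAUNCH_BLOCKING": "1",
    "NCCL_DEBUG": "TRACE",
}


def debug_environment(base=None, trace_functions=False):
    """Return a copy of `base` (default: os.environ) with debugging variables added."""
    env = dict(os.environ if base is None else base)
    env.update(DEBUG_ENV)
    if trace_functions:
        # Slows startup considerably; enable only when really needed.
        env["VLLM_TRACE_FUNCTION"] = "1"
    return env
```

You could then launch your usual command with `subprocess.run(command, env=debug_environment())`, where `command` is whatever argument list you normally run. The parent shell stays clean because the variables exist only in the copied dictionary.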

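Incorrect network setup¶

If you have a complicated network configuration, the vLLM instance may fail to get the correct IP address. You can find a log line such as `DEBUG 06-10 21:32:17 parallel_state.py:88] world_size=8 rank=0 local_rank=0 distributed_init_method=tcp://xxx.xxx.xxx.xxx:54641 backend=nccl`, and the IP address in it should be the correct one. If it is not, override the IP address with the environment variable `export VLLM_HOST_IP=`.

You might also need to set `export NCCL_SOCKET_IFNAME=` and `export GLOO_SOCKET_IFNAME=` to specify the network interface for the IP address.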
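To compare the address vLLM picked against the one you expect, you can pull the endpoint out of the debug log line shown above. The following is a small sketch that assumes the `distributed_init_method=tcp://host:port` layout from that example line; the function name is hypothetical, and bracketed IPv6 addresses are not handled.

```python
import re

# Matches the endpoint in a line such as
# "... distributed_init_method=tcp://10.0.0.5:54641 backend=nccl".
INIT_METHOD = re.compile(r"distributed_init_method=tcp://(?P<host>[^:\s]+):(?P<port>\d+)")


def init_endpoint(log_line):
    """Return (host, port) from a vLLM debug log line, or None if absent."""
    match = INIT_METHOD.search(log_line)
    if match is None:
        return None
    return match.group("host"), int(match.group("port"))
```

If the returned host is not the address other nodes use to reach this machine, that points to the `VLLM_HOST_IP` override described above.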

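Error near `self.graph.replay()`¶

If vLLM crashes and the error trace points near `self.graph.replay()` in `vllm/worker/model_runner.py`, it is a CUDA error inside CUDAGraph. To identify the particular CUDA operation that causes the error, add `--enforce-eager` to the command line, or pass `enforce_eager=True` to the LLM class. This disables the CUDAGraph optimization and isolates the exact CUDA operation that causes the error.

Incorrect hardware/driver¶

If GPU/CPU communication cannot be established, you can use the following Python script and follow the instructions below to confirm whether GPU/CPU communication is working correctly.

Code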

# Test PyTorch NCCL
import torch
import torch.distributed as dist
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)
data = torch.FloatTensor([1,] * 128).to("cuda")
dist.all_reduce(data, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()
value = data.mean().item()
world_size = dist.get_world_size()
assert value == world_size, f"Expected {world_size}, got {value}"

print("PyTorch NCCL is successful!")

# Test PyTorch GLOO
gloo_group = dist.new_group(ranks=list(range(world_size)), backend="gloo")
cpu_data = torch.FloatTensor([1,] * 128)
dist.all_reduce(cpu_data, op=dist.ReduceOp.SUM, group=gloo_group)
value = cpu_data.mean().item()
assert value == world_size, f"Expected {world_size}, got {value}"

print("PyTorch GLOO is successful!")

if world_size <= 1:
    exit()

# Test vLLM NCCL, with cuda graph
from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator

pynccl = PyNcclCommunicator(group=gloo_group, device=local_rank)
# pynccl is enabled by default for 0.6.5+,
# but for 0.6.4 and below, we need to enable it manually.
# keep the code for backward compatibility because people
# prefer to read the latest documentation.
pynccl.disabled = False

s = torch.cuda.Stream()
with torch.cuda.stream(s):
    data.fill_(1)
    out = pynccl.all_reduce(data, stream=s)
    value = out.mean().item()
    assert value == world_size, f"Expected {world_size}, got {value}"

print("vLLM NCCL is successful!")

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(cuda_graph=g, stream=s):
    out = pynccl.all_reduce(data, stream=torch.cuda.current_stream())

data.fill_(1)
g.replay()
torch.cuda.current_stream().synchronize()
value = out.mean().item()
assert value == world_size, f"Expected {world_size}, got {value}"

print("vLLM NCCL with cuda graph is successful!")

dist.destroy_process_group(gloo_group)
dist.destroy_process_group()

If you are testing on a single node, adjust `--nproc-per-node` to the number of GPUs you want to use:

NCCL_DEBUG=TRACE torchrun --nproc-per-node=<number-of-GPUs> test.py

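If you are testing on multiple nodes, adjust `--nproc-per-node` and `--nnodes` according to your setup and set `MASTER_ADDR` to the correct IP address of the master node, which must be reachable from all nodes. Then run: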

NCCL_DEBUG=TRACE torchrun --nnodes 2 \
    --nproc-per-node=2 \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR test.py

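If the script runs successfully, each process prints `PyTorch NCCL is successful!`, `PyTorch GLOO is successful!`, `vLLM NCCL is successful!`, and `vLLM NCCL with cuda graph is successful!`, in that order. With a single process, only the first two messages appear, because the script exits before the vLLM NCCL tests.

If the test script hangs or crashes, it usually means the hardware or drivers are broken in some way. Contact your system administrator or hardware vendor for further assistance. As a common workaround, you can try tuning some NCCL environment variables, such as `export NCCL_P2P_DISABLE=1`, to see if that helps. Check their documentation for more information. Use these environment variables only as a temporary workaround, because they can affect system performance. The best solution is still to fix the hardware or drivers so that the test script runs successfully.

Note

A multi-node environment is more complicated than a single-node one. If you see errors such as `torch.distributed.DistNetworkError`, the network or DNS setup is likely incorrect. In that case, you can manually assign the node rank and specify the IP via command line arguments:

On the first node, run `NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 0 --master_addr $MASTER_ADDR test.py`.
On the second node, run `NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 1 --master_addr $MASTER_ADDR test.py`.

Adjust `--nproc-per-node`, `--nnodes`, and `--node-rank` according to your setup, and make sure to execute a different command (with a different `--node-rank`) on each node.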
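When there are more than two nodes, generating the per-node commands from one place helps avoid reusing a `--node-rank` by mistake. The sketch below assembles the same `torchrun` flags used in the commands above. The function name and the example address `10.0.0.5` are placeholders chosen for this illustration.

```python
import shlex


def torchrun_command(node_rank, master_addr, nnodes=2, nproc_per_node=2, script="test.py"):
    """Build the torchrun argument list for one node."""
    if not 0 <= node_rank < nnodes:
        raise ValueError(f"node_rank must be in [0, {nnodes}), got {node_rank}")
    return [
        "torchrun",
        "--nnodes", str(nnodes),
        f"--nproc-per-node={nproc_per_node}",
        "--node-rank", str(node_rank),
        "--master_addr", master_addr,
        script,
    ]


# Print one command line per node; 10.0.0.5 stands in for your MASTER_ADDR.
for rank in range(2):
    print(f"node {rank}: NCCL_DEBUG=TRACE " + shlex.join(torchrun_command(rank, "10.0.0.5")))
```

Each printed line differs only in `--node-rank`, which is the property the note above asks you to verify.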

Python multiprocessing¶

`RuntimeError` Exception¶

If you have seen a warning in your logs like this:

WARNING 12-11 14:50:37 multiproc_worker_utils.py:281] CUDA was previously
    initialized. We must use the `spawn` multiprocessing start method. Setting
    VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See
    https://docs.vllm.ai/en/latest/usage/troubleshooting.html#python-multiprocessing
    for more information.

or an error from Python that looks like this:

Logs

RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

        To fix this issue, refer to the "Safe importing of main module"
        section in https://docs.python.org/3/library/multiprocessing.html

then you must update your Python code to guard the usage of `vllm` behind an `if __name__ == '__main__':` block. For example, instead of this:

import vllm

llm = vllm.LLM(...)

try this instead:

if __name__ == '__main__':
    import vllm

    llm = vllm.LLM(...)

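`torch.compile` Error¶

vLLM relies heavily on `torch.compile` to optimize the model for better performance, which introduces a dependency on the `torch.compile` functionality and the `triton` library. By default, we use `torch.compile` to optimize some functions in the model. Before running vLLM, you can check whether `torch.compile` works as expected by running the following script:

Code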

import torch

@torch.compile
def f(x):
    # a simple function to test torch.compile
    x = x + 1
    x = x * 2
    x = x.sin()
    return x

x = torch.randn(4, 4).cuda()
print(f(x))

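If it raises errors from the `torch/_inductor` directory, it usually means you have a custom `triton` library that is not compatible with the version of PyTorch you are using. See issue #12219 for an example.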
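A quick first step when you suspect such a mismatch is to print the versions that are actually installed. The sketch below uses the standard library's `importlib.metadata`; the helper name is an assumption for this example. It only reports versions and does not decide whether they are compatible.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report both packages side by side for comparison with the PyTorch release notes.
for name in ("torch", "triton"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

A custom `triton` build may be registered under a different distribution name, so a "not installed" result does not prove the module is absent.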

Model failed to be inspected¶

If you see an error like this:

  File "vllm/model_executor/models/registry.py", line xxx, in _raise_for_unsupported
    raise ValueError(
ValueError: Model architectures ['<arch>'] failed to be inspected. Please check the logs for more details.

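It means that vLLM failed to import the model file. Usually, this is related to missing dependencies or outdated binaries in the vLLM build. Read the logs carefully to determine the root cause of the error.

Models not supported¶

If you see an error like this: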

Traceback (most recent call last):
...
  File "vllm/model_executor/models/registry.py", line xxx, in inspect_model_cls
    for arch in architectures:
TypeError: 'NoneType' object is not iterable

or:

  File "vllm/model_executor/models/registry.py", line xxx, in _raise_for_unsupported
    raise ValueError(
ValueError: Model architectures ['<arch>'] are not supported for now. Supported architectures: [...]

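but you are sure that the model is in the list of supported models, there may be an issue with vLLM's model resolution. In that case, follow these steps to explicitly specify the vLLM implementation for the model.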
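The `TypeError` in the first traceback is raised while iterating over `architectures`, so it can be worth checking what the model's `config.json` declares. The sketch below assumes the Hugging Face convention of an `architectures` list in `config.json`; the helper name is hypothetical.

```python
import json


def declared_architectures(config_text):
    """Return the `architectures` list from config.json text, or None if missing."""
    config = json.loads(config_text)
    architectures = config.get("architectures")
    if not architectures:
        # Covers a missing key, an explicit null, and an empty list.
        return None
    return list(architectures)
```

A `None` result would be consistent with the `'NoneType' object is not iterable` error, while a non-empty list gives you the exact name to compare against the supported architectures in the second error message.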

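Failed to infer device type¶

If you see an error like `RuntimeError: Failed to infer device type`, it means that vLLM failed to infer the device type of the runtime environment. You can check the code to see how vLLM infers the device type and why it is not working as expected. After this pull request, you can also set the environment variable `VLLM_LOGGING_LEVEL=DEBUG` to see more detailed logs to help debug the issue.

NCCL error: unhandled system error during `ncclCommInitRank`¶

If your serving workload uses GPUDirect RDMA for distributed serving across multiple nodes and you hit an error during `ncclCommInitRank` with no clear error message even with `NCCL_DEBUG=INFO` set, it might look like this: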

Error executing method 'init_device'. This might cause deadlock in distributed execution.
Traceback (most recent call last):
...
   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 99, in __init__
     self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 277, in ncclCommInitRank
     self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 256, in NCCL_CHECK
     raise RuntimeError(f"NCCL error: {error_str}")
 RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)
...

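This indicates that vLLM failed to initialize the NCCL communicator, possibly because the `IPC_LOCK` Linux capability is missing or `/dev/shm` is not mounted. Refer to Distributed Inference and Serving for guidance on properly configuring the environment for distributed serving.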
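Both conditions can be checked from inside the container on Linux. The sketch below parses the `CapEff` mask from `/proc/<pid>/status` and looks for `/dev/shm` in `/proc/mounts`. The function names are assumptions for this example; `CAP_IPC_LOCK` is bit 14 in the Linux capability mask.

```python
CAP_IPC_LOCK = 14  # bit index of CAP_IPC_LOCK in Linux capability masks


def has_ipc_lock(status_text):
    """Return True if the CapEff mask in /proc/<pid>/status includes CAP_IPC_LOCK."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            mask = int(line.split()[1], 16)
            return bool(mask & (1 << CAP_IPC_LOCK))
    return False


def shm_is_mounted(mounts_text):
    """Return True if /dev/shm appears as a mount point in /proc/mounts text."""
    return any(
        len(fields) >= 2 and fields[1] == "/dev/shm"
        for fields in (line.split() for line in mounts_text.splitlines())
    )
```

For the current process, pass the contents of `/proc/self/status` and `/proc/mounts`. If either check returns `False`, that matches one of the two causes named above.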

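Known Issues¶

In `v0.5.2`, `v0.5.3`, and `v0.5.3.post1`, there is a bug caused by zmq that can occasionally make vLLM hang, depending on the machine configuration. The solution is to upgrade to the latest version of `vllm` to include the fix.
To work around an NCCL bug, all vLLM processes set the environment variable `NCCL_CUMEM_ENABLE=0` to disable NCCL's `cuMem` allocator. It does not affect performance but only gives memory benefits. External processes that want to set up an NCCL connection with vLLM processes should also set this environment variable. Otherwise, the inconsistent environment setup causes NCCL to hang or crash, as observed in the RLHF integration and the discussion.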
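For an external Python process, the matching setting can be applied in code, as in the sketch below. The helper name is hypothetical. The sketch assumes it runs before the process initializes NCCL, on the general expectation that NCCL reads its environment variables at initialization.

```python
import os


def match_vllm_nccl_env(env=None):
    """Set NCCL_CUMEM_ENABLE=0 in `env` (default: os.environ) and return it."""
    env = os.environ if env is None else env
    # Mirror the value that vLLM processes set for themselves.
    env["NCCL_CUMEM_ENABLE"] = "0"
    return env
```

Calling `match_vllm_nccl_env()` at the top of the external program, before any distributed setup, keeps both sides of the connection consistent.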