Troubleshooting

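This document outlines some troubleshooting strategies you can consider. If you think you have found a bug, please search the existing issues first to see whether it has already been reported. If not, please file a new issue and provide as much relevant information as possible.

Note

Once you have debugged a problem, remember to turn off any debugging environment variables you defined, or simply start a new shell, so that lingering debug settings do not affect later runs. Otherwise, the system may be slow because debugging features remain active.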

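Hangs downloading a model

If the model is not already downloaded to disk, vLLM will download it from the internet, which can take time depending on your internet connection. It is better to download the model first with huggingface-cli and pass the local path of the model to vLLM. This way, you can isolate the issue.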
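One way to separate the download step from everything else is to time it on its own. The following is a minimal sketch, not part of vLLM: `run_timed` is a hypothetical helper, and the model ID and target directory in the comment are placeholders.

```python
import subprocess
import sys
import time


def run_timed(cmd):
    """Run a command and return (elapsed_seconds, return_code)."""
    start = time.perf_counter()
    completed = subprocess.run(cmd, check=False)
    return time.perf_counter() - start, completed.returncode


# Hypothetical usage; the model ID and directory are placeholders:
#   run_timed(["huggingface-cli", "download", "<model-id>",
#              "--local-dir", "/local/disk/model"])
# Afterwards, pass "/local/disk/model" to vLLM as the model path.

# Harmless self-check that times a no-op Python process.
elapsed, returncode = run_timed([sys.executable, "-c", "pass"])
print(f"finished in {elapsed:.2f}s with return code {returncode}")
```

If the download alone accounts for most of the wall-clock time, the hang is likely a network issue rather than a vLLM issue.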

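Hangs loading a model from disk

If the model is large, loading it from disk can take a long time. Pay attention to where you store the model. Some clusters share a filesystem across nodes, for example a distributed or network filesystem, which can be slow. It is better to store the model on a local disk. Additionally, watch the CPU memory usage: when the model is too large, it can take up a lot of CPU memory and slow down the operating system, because it has to swap frequently between disk and memory.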
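To check whether a model directory sits on a network filesystem, you can look up its mount point. The sketch below assumes a Linux host and the `/proc/mounts` layout (device, mount point, type, options). The helper names and the list of filesystem types are illustrative assumptions, not an exhaustive catalogue.

```python
import os

# Illustrative, non-exhaustive set of network/distributed filesystem types.
NETWORK_FS_TYPES = {"nfs", "nfs4", "cifs", "lustre", "ceph", "glusterfs", "beegfs", "gpfs"}


def filesystem_type(path, mounts_text):
    """Return the type of the most specific mount point that contains `path`."""
    path = os.path.abspath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        inside = path == mount_point or path.startswith(prefix)
        # ">=" lets a later mount on the same point shadow an earlier one.
        if inside and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


def looks_like_network_fs(fs_type):
    """True if the type (e.g. "nfs4" or "fuse.glusterfs") is in the set above."""
    return fs_type is not None and fs_type.split(".")[-1] in NETWORK_FS_TYPES


sample = "/dev/sda1 / ext4 rw 0 0\nserver:/export /mnt/models nfs4 rw 0 0\n"
print(filesystem_type("/mnt/models/my-model", sample))
print(filesystem_type("/home/user/my-model", sample))
```

On a real machine, pass the contents of `/proc/mounts` instead of the sample string.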

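Note

To isolate model downloading and loading issues, you can use the --load-format dummy argument to skip loading the model weights. This way, you can check whether model downloading and loading is the bottleneck.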

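Out of memory

If the model is too large to fit in a single GPU, you will get an out-of-memory (OOM) error. Consider using tensor parallelism to split the model across multiple GPUs. In that case, every process reads the whole model and splits it into chunks, which makes the disk reading time even longer (proportional to the tensor parallel size). You can convert the model checkpoint to a sharded checkpoint using examples/offline_inference/save_sharded_state.py. The conversion might take some time, but afterwards you can load the sharded checkpoint much faster. With a sharded checkpoint, the model loading time should stay roughly constant regardless of the tensor parallel size.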
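The arithmetic behind this can be sketched as a back-of-the-envelope estimate. It covers weights only and ignores the KV cache, activations, and framework overhead. The function names are hypothetical, and the sharded case assumes each process reads only its own shard.

```python
def weight_gib_per_gpu(num_params, bytes_per_param, tensor_parallel_size):
    """Approximate weight memory per GPU in GiB (weights only)."""
    return num_params * bytes_per_param / tensor_parallel_size / 2**30


def total_disk_read_gib(checkpoint_gib, tensor_parallel_size, sharded):
    """Total data read from disk across all processes, in GiB.

    Unsharded: every process reads the whole checkpoint.
    Sharded (assumption): every process reads only its own shard.
    """
    if sharded:
        return checkpoint_gib
    return checkpoint_gib * tensor_parallel_size


# Example: a 70-billion-parameter model in 16-bit precision on 4 GPUs.
print(f"{weight_gib_per_gpu(70e9, 2, 4):.1f} GiB of weights per GPU")
print(total_disk_read_gib(130, 4, sharded=False), "GiB read without sharding")
print(total_disk_read_gib(130, 4, sharded=True), "GiB read with sharding")
```

If the per-GPU estimate is already close to the device memory, a larger tensor parallel size is probably needed.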

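Generation quality changed

In v0.8.0, the source of the default sampling parameters was changed in Pull Request #12622. Prior to v0.8.0, the default sampling parameters came from vLLM's set of neutral defaults. From v0.8.0 onwards, the default sampling parameters come from the generation_config.json provided by the model creator.

In most cases, this should lead to higher quality responses, because the model creator is likely to know which sampling parameters work best for their model. However, in some cases the defaults provided by the model creator can lead to degraded performance.

You can check whether this is happening by trying the old defaults with --generation-config vllm for online serving and generation_config="vllm" for offline inference. If your generation quality improves after trying this, we recommend continuing to use the vLLM defaults and petitioning the model creator on https://hugging-face.cn to update their default generation_config.json so that it produces higher quality generations.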
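Before switching defaults, it can help to see which sampling settings a model's generation_config.json actually carries. The sketch below is a hypothetical helper, and the list of keys is an assumption about which entries are sampling-related; adjust it for your model. The demo writes a made-up file into a temporary directory.

```python
import json
import tempfile
from pathlib import Path

# Assumed set of sampling-related keys; not taken from vLLM itself.
SAMPLING_KEYS = ("temperature", "top_k", "top_p", "repetition_penalty")


def sampling_overrides(model_dir):
    """Return the sampling-related entries of <model_dir>/generation_config.json."""
    config_path = Path(model_dir) / "generation_config.json"
    if not config_path.is_file():
        return {}
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return {key: config[key] for key in SAMPLING_KEYS if key in config}


# Demo with made-up values in a temporary directory.
with tempfile.TemporaryDirectory() as model_dir:
    (Path(model_dir) / "generation_config.json").write_text(
        json.dumps({"temperature": 0.6, "top_p": 0.9, "eos_token_id": 2}),
        encoding="utf-8",
    )
    found = sampling_overrides(model_dir)
    empty = sampling_overrides(Path(model_dir) / "no-such-subdir")

print(found)
```

An empty result suggests the model creator does not override any of these settings, so the change in defaults is less likely to explain a quality difference.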

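Enable more logging

If other strategies do not solve the problem, it is likely that the vLLM instance is stuck somewhere. You can use the following environment variables to help debug the issue:

export VLLM_LOGGING_LEVEL=DEBUG to turn on more logging.
export CUDA_LAUNCH_BLOCKING=1 to identify which CUDA kernel is causing the problem.
export NCCL_DEBUG=TRACE to turn on more logging for NCCL.
export VLLM_TRACE_FUNCTION=1 to record all function calls in a log file for inspection, so you can tell which function crashes or hangs.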
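If you launch vLLM from a Python script, one way to keep these settings out of your interactive shell is to pass them only to the child process. This is a minimal sketch; `debug_env` is a hypothetical helper, and the command in the comment is a placeholder.

```python
import os

# The four debugging variables listed above.
DEBUG_VARS = {
    "VLLM_LOGGING_LEVEL": "DEBUG",
    "CUDA_LAUNCH_BLOCKING": "1",
    "NCCL_DEBUG": "TRACE",
    "VLLM_TRACE_FUNCTION": "1",
}


def debug_env(base=None):
    """Return a copy of the environment with the debug variables set."""
    env = dict(os.environ if base is None else base)
    env.update(DEBUG_VARS)
    return env


# Example (not executed here; "<model>" is a placeholder):
#   import subprocess
#   subprocess.run(["vllm", "serve", "<model>"], env=debug_env())

original = {"PATH": "/usr/bin"}
child_env = debug_env(original)
print(sorted(set(child_env) - set(original)))
```

Because the function copies the environment, the variables disappear as soon as the child process exits.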

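Incorrect network setup

If you have a complicated network configuration, the vLLM instance may fail to get the correct IP address. You can find a log line such as DEBUG 06-10 21:32:17 parallel_state.py:88] world_size=8 rank=0 local_rank=0 distributed_init_method=tcp://xxx.xxx.xxx.xxx:54641 backend=nccl, and the IP address in it should be the correct one. If it is not, override the IP address with the environment variable export VLLM_HOST_IP=<your_ip_address>.

You might also need to set export NCCL_SOCKET_IFNAME=<your_network_interface> and export GLOO_SOCKET_IFNAME=<your_network_interface> to specify the network interface for that IP address.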
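When scanning long logs, a small parser can pull out the address that vLLM picked. The sketch below assumes the log line format quoted above and a plain IPv4 address or host name; bracketed IPv6 addresses are not handled. The sample line uses a made-up address.

```python
import re

# Matches "distributed_init_method=tcp://<host>:<port>" in a log line.
INIT_METHOD = re.compile(
    r"distributed_init_method=tcp://(?P<host>[^\s:]+):(?P<port>\d+)"
)


def init_endpoint(log_line):
    """Return (host, port) from a log line, or None if it has no match."""
    match = INIT_METHOD.search(log_line)
    if match is None:
        return None
    return match.group("host"), int(match.group("port"))


line = (
    "DEBUG 06-10 21:32:17 parallel_state.py:88] world_size=8 rank=0 "
    "local_rank=0 distributed_init_method=tcp://10.0.0.5:54641 backend=nccl"
)
print(init_endpoint(line))
```

If the extracted host is not the address you expect other nodes to reach, that is the case in which setting VLLM_HOST_IP applies.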

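Error near `self.graph.replay()`

If vLLM crashes and the error trace points somewhere around self.graph.replay() in vllm/worker/model_runner.py, it is a CUDA error inside CUDAGraph. To identify the particular CUDA operation that causes the error, you can add --enforce-eager to the command line, or enforce_eager=True to the LLM class, to disable the CUDAGraph optimization and isolate the exact CUDA operation that causes the error.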

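Incorrect hardware/driver

If GPU/CPU communication cannot be established, you can use the following Python script and follow the instructions below to confirm whether GPU/CPU communication is working correctly.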

# Test PyTorch NCCL
import torch
import torch.distributed as dist
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)
data = torch.FloatTensor([1,] * 128).to("cuda")
dist.all_reduce(data, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()
value = data.mean().item()
world_size = dist.get_world_size()
assert value == world_size, f"Expected {world_size}, got {value}"

print("PyTorch NCCL is successful!")

# Test PyTorch GLOO
gloo_group = dist.new_group(ranks=list(range(world_size)), backend="gloo")
cpu_data = torch.FloatTensor([1,] * 128)
dist.all_reduce(cpu_data, op=dist.ReduceOp.SUM, group=gloo_group)
value = cpu_data.mean().item()
assert value == world_size, f"Expected {world_size}, got {value}"

print("PyTorch GLOO is successful!")

if world_size <= 1:
    exit()

# Test vLLM NCCL, with cuda graph
from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator

pynccl = PyNcclCommunicator(group=gloo_group, device=local_rank)
# pynccl is enabled by default for 0.6.5+,
# but for 0.6.4 and below, we need to enable it manually.
# keep the code for backward compatibility because people
# prefer to read the latest documentation.
pynccl.disabled = False

s = torch.cuda.Stream()
with torch.cuda.stream(s):
    data.fill_(1)
    out = pynccl.all_reduce(data, stream=s)
    value = out.mean().item()
    assert value == world_size, f"Expected {world_size}, got {value}"

print("vLLM NCCL is successful!")

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(cuda_graph=g, stream=s):
    out = pynccl.all_reduce(data, stream=torch.cuda.current_stream())

data.fill_(1)
g.replay()
torch.cuda.current_stream().synchronize()
value = out.mean().item()
assert value == world_size, f"Expected {world_size}, got {value}"

print("vLLM NCCL with cuda graph is successful!")

dist.destroy_process_group(gloo_group)
dist.destroy_process_group()

If you are testing on a single node, adjust --nproc-per-node to the number of GPUs you want to use:

NCCL_DEBUG=TRACE torchrun --nproc-per-node=<number-of-GPUs> test.py

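If you are testing with multiple nodes, adjust --nproc-per-node and --nnodes according to your setup, and set MASTER_ADDR to the correct IP address of the master node, reachable from all nodes. Then, run: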

NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR test.py

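If the script runs successfully, you should see the messages PyTorch NCCL is successful! and PyTorch GLOO is successful!, followed by vLLM NCCL is successful! and vLLM NCCL with cuda graph is successful! when more than one process is used.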

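If the test script hangs or crashes, it usually means the hardware or drivers are broken in some way. You should contact your system administrator or hardware vendor for further assistance. As a common workaround, you can try tuning some NCCL environment variables, such as export NCCL_P2P_DISABLE=1, to see if that helps. Please check the NCCL documentation for more information. Use these environment variables only as a temporary workaround, because they can affect system performance. The best solution is still to fix the hardware or drivers so that the test script runs successfully.

Note

A multi-node environment is more complicated than a single-node one. If you see errors such as torch.distributed.DistNetworkError, it is likely that the network/DNS setup is incorrect. In that case, you can manually assign the node rank and specify the IP via command line arguments:

On the first node, run NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 0 --master_addr $MASTER_ADDR test.py.
On the second node, run NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 1 --master_addr $MASTER_ADDR test.py.

Adjust --nproc-per-node, --nnodes, and --node-rank according to your setup, making sure to execute a different command (with a different --node-rank) on each node.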
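With more than two nodes, writing each command by hand becomes error-prone. The sketch below generates one command per node using only the flags shown above. The function name and the master address are hypothetical placeholders, and the commands are printed rather than executed.

```python
import shlex


def torchrun_commands(nnodes, nproc_per_node, master_addr, script="test.py"):
    """Return one torchrun argument list per node, each with its own rank."""
    return [
        [
            "torchrun",
            "--nnodes", str(nnodes),
            f"--nproc-per-node={nproc_per_node}",
            "--node-rank", str(rank),
            "--master_addr", master_addr,
            script,
        ]
        for rank in range(nnodes)
    ]


# Placeholder address; replace it with the IP of your master node.
commands = torchrun_commands(nnodes=2, nproc_per_node=2, master_addr="10.0.0.1")
for rank, cmd in enumerate(commands):
    print(f"node {rank}: NCCL_DEBUG=TRACE {shlex.join(cmd)}")
```

Each printed line is meant to be run on the node with the matching rank.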

Python multiprocessing

`RuntimeError` exception

If you have seen a warning in your logs like this:

WARNING 12-11 14:50:37 multiproc_worker_utils.py:281] CUDA was previously
    initialized. We must use the `spawn` multiprocessing start method. Setting
    VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See
    https://docs.vllm.com.cn/en/latest/getting_started/troubleshooting.html#python-multiprocessing
    for more information.

or an error from Python that looks like this:

RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

        To fix this issue, refer to the "Safe importing of main module"
        section in https://docs.pythonlang.cn/3/library/multiprocessing.html

then you must update your Python code to guard usage of vllm behind an if __name__ == '__main__': block. For example, instead of this:

import vllm

llm = vllm.LLM(...)

try this instead:

if __name__ == '__main__':
    import vllm

    llm = vllm.LLM(...)

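`torch.compile` error

vLLM relies heavily on torch.compile to optimize the model for better performance, which introduces a dependency on the torch.compile functionality and the triton library. By default, we use torch.compile to optimize some functions in the model. Before running vLLM, you can check whether torch.compile works as expected by running the following script: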

import torch

@torch.compile
def f(x):
    # a simple function to test torch.compile
    x = x + 1
    x = x * 2
    x = x.sin()
    return x

x = torch.randn(4, 4).cuda()
print(f(x))

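If it raises errors from the torch/_inductor directory, it usually means you have a custom triton library that is not compatible with the version of PyTorch you are using. See this issue for an example.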
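A quick first step is to list which versions of the relevant packages are installed. The sketch below only reports versions and does not decide whether they are compatible. The distribution names are assumptions, and `pytorch-triton` is included because some PyTorch builds may ship triton under that name.

```python
from importlib import metadata


def installed_versions(names=("torch", "triton", "pytorch-triton")):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

Including this output in a bug report makes a mismatch easier to spot.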

Model failed to be inspected

If you see an error like this:

  File "vllm/model_executor/models/registry.py", line xxx, in _raise_for_unsupported
    raise ValueError(
ValueError: Model architectures ['<arch>'] failed to be inspected. Please check the logs for more details.

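it means that vLLM failed to import the model file. Usually, this is related to missing dependencies or outdated binaries in the vLLM build. Please read the logs carefully to determine the root cause of the error.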
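If the logs are hard to read, importing the suspect module directly often surfaces the underlying error. The helper below is a hypothetical sketch, and the module path in the comment is a placeholder that you need to replace with the file of your model's architecture.

```python
import importlib
import traceback


def import_error_for(module_name):
    """Return None if the module imports cleanly, else the formatted traceback."""
    try:
        importlib.import_module(module_name)
    except Exception:
        return traceback.format_exc()
    return None


# Hypothetical usage; "<module>" is a placeholder:
#   print(import_error_for("vllm.model_executor.models.<module>"))

print(import_error_for("json"))
error_text = import_error_for("no_such_module_xyz_123")
print(error_text.splitlines()[-1])
```

The last line of the traceback usually names the missing dependency or the binary that failed to load.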

Model not supported

If you see an error like this:

Traceback (most recent call last):
...
  File "vllm/model_executor/models/registry.py", line xxx, in inspect_model_cls
    for arch in architectures:
TypeError: 'NoneType' object is not iterable

or

  File "vllm/model_executor/models/registry.py", line xxx, in _raise_for_unsupported
    raise ValueError(
ValueError: Model architectures ['<arch>'] are not supported for now. Supported architectures: [...]

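but you are sure that the model is in the list of supported models, there may be an issue with vLLM's model resolution. In that case, please follow these steps to explicitly specify the vLLM implementation for the model.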
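The first traceback above fails while iterating over `architectures`, which suggests that no architecture list was found for the model. One thing worth checking is whether the model's config.json declares one. This is a minimal sketch with a hypothetical helper name, and the demo uses made-up config contents in a temporary directory.

```python
import json
import tempfile
from pathlib import Path


def declared_architectures(model_dir):
    """Return the "architectures" entry of <model_dir>/config.json, or None."""
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return config.get("architectures")


# Demo with made-up configs.
with tempfile.TemporaryDirectory() as model_dir:
    config_path = Path(model_dir) / "config.json"

    config_path.write_text(json.dumps({"model_type": "example"}), encoding="utf-8")
    missing = declared_architectures(model_dir)

    config_path.write_text(
        json.dumps({"architectures": ["ExampleForCausalLM"]}), encoding="utf-8"
    )
    present = declared_architectures(model_dir)

print(missing, present)
```

A result of None points at the model's configuration rather than at vLLM's list of supported architectures.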

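Failed to infer device type

If you see an error like RuntimeError: Failed to infer device type, it means that vLLM failed to infer the device type of the runtime environment. You can check the code to see how vLLM infers the device type and why it is not working as expected. After this PR, you can also set the environment variable VLLM_LOGGING_LEVEL=DEBUG to see more detailed logs that help debug the issue.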

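Known issues

In v0.5.2, v0.5.3, and v0.5.3.post1, there is a bug caused by zmq that can occasionally cause vLLM to hang, depending on the machine configuration. The solution is to upgrade to the latest version of vllm, which includes the fix.
To circumvent an NCCL bug, all vLLM processes set the environment variable NCCL_CUMEM_ENABLE=0 to disable NCCL's cuMem allocator. Disabling it does not affect performance, since the allocator only provides memory benefits. When external processes want to set up an NCCL connection with vLLM's processes, they should also set this environment variable; otherwise, the inconsistent environment setup will cause NCCL to hang or crash, as observed in the RLHF integration and the related discussion.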
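For an external Python process, the variable can be set in code, provided this happens before NCCL is initialized in that process. The helper below is a hypothetical sketch; it returns the previous value so that a conflicting setting can be noticed.

```python
import os


def align_nccl_env(env=None):
    """Set NCCL_CUMEM_ENABLE=0 and return the value it had before (or None)."""
    env = os.environ if env is None else env
    previous = env.get("NCCL_CUMEM_ENABLE")
    env["NCCL_CUMEM_ENABLE"] = "0"
    return previous


# Demo on a plain dict so the real environment is left untouched.
fake_env = {"NCCL_CUMEM_ENABLE": "1"}
previous = align_nccl_env(fake_env)
if previous not in (None, "0"):
    print(f"NCCL_CUMEM_ENABLE was {previous!r}; now set to '0'")
```

In a real process, call `align_nccl_env()` without arguments at the top of the script, before creating any process group.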