STR_DTYPE_TO_TORCH_DTYPE  module-attribute
 STR_DTYPE_TO_TORCH_DTYPE = {
    "float32": float32,
    "half": half,
    "bfloat16": bfloat16,
    "float": float,
    "fp8": uint8,
    "fp8_e4m3": uint8,
    "fp8_e5m2": uint8,
    "int8": int8,
    "fp8_inc": float8_e4m3fn,
    "fp8_ds_mla": uint8,
}
TORCH_DTYPE_TO_NUMPY_DTYPE  module-attribute
 TORCH_DTYPE_TO_NUMPY_DTYPE = {
    float16: float16,
    float32: float32,
    float64: float64,
    uint8: uint8,
    int32: int32,
    int64: int64,
}
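A brief usage sketch, assuming both mappings are importable from vllm.utils.torch_utils (values follow the dictionaries shown above):

```python
import torch
from vllm.utils.torch_utils import (
    STR_DTYPE_TO_TORCH_DTYPE,
    TORCH_DTYPE_TO_NUMPY_DTYPE,
)

# Resolve a string cache dtype (e.g. from a config flag) to a torch dtype.
# The fp8 string variants map to uint8, i.e. the cache is handled as raw bytes.
assert STR_DTYPE_TO_TORCH_DTYPE["fp8_e4m3"] is torch.uint8

# Map a torch dtype to its numpy counterpart, e.g. when round-tripping through numpy.
np_dtype = TORCH_DTYPE_TO_NUMPY_DTYPE[torch.float16]
```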
 
 async_tensor_h2d(
    data: list,
    dtype: dtype,
    target_device: str | device,
    pin_memory: bool,
) -> Tensor
Asynchronously create a tensor and copy it from host to device.
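A minimal sketch (argument values are illustrative):

```python
import torch
from vllm.utils.torch_utils import async_tensor_h2d

slot_mapping = [0, 1, 2, 3]
slots_gpu = async_tensor_h2d(
    slot_mapping,
    dtype=torch.int64,
    target_device="cuda",
    pin_memory=True,  # a pinned host staging buffer allows a non-blocking copy
)
```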
  
 common_broadcastable_dtype(dtypes: Collection[dtype])
Get the common dtype to which all of the given dtypes can be cast without losing any information.
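For example, float16 and bfloat16 can each be widened to float32 without information loss, so float32 is the expected result here (illustrative):

```python
import torch
from vllm.utils.torch_utils import common_broadcastable_dtype

dtype = common_broadcastable_dtype([torch.float16, torch.bfloat16, torch.float32])
# expected: torch.float32
```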
  
 create_kv_caches_with_random(
    num_blocks: int,
    block_size: int,
    num_layers: int,
    num_heads: int,
    head_size: int,
    cache_dtype: str | dtype | None,
    model_dtype: str | dtype | None = None,
    seed: int | None = None,
    device: str | None = "cuda",
) -> tuple[list[Tensor], list[Tensor]]
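A benchmark-style sketch; the shapes are arbitrary, and passing "auto" as cache_dtype is assumed to fall back to model_dtype:

```python
from vllm.utils.torch_utils import create_kv_caches_with_random

key_caches, value_caches = create_kv_caches_with_random(
    num_blocks=128,
    block_size=16,
    num_layers=2,
    num_heads=8,
    head_size=64,
    cache_dtype="auto",  # assumption: resolves to model_dtype
    model_dtype="half",
    seed=0,
    device="cuda",
)
```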
  
 create_kv_caches_with_random_flash(
    num_blocks: int,
    block_size: int,
    num_layers: int,
    num_heads: int,
    head_size: int,
    cache_dtype: str | dtype | None,
    model_dtype: str | dtype | None = None,
    seed: int | None = None,
    device: str | None = "cuda",
    cache_layout: str | None = "NHD",
) -> tuple[list[Tensor], list[Tensor]]
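Same idea as the previous helper, with an additional cache_layout argument ("NHD" by default; supported layouts depend on the attention backend):

```python
from vllm.utils.torch_utils import create_kv_caches_with_random_flash

key_caches, value_caches = create_kv_caches_with_random_flash(
    num_blocks=128,
    block_size=16,
    num_layers=2,
    num_heads=8,
    head_size=64,
    cache_dtype="auto",  # assumption: resolves to model_dtype
    model_dtype="half",
    seed=0,
    device="cuda",
    cache_layout="NHD",
)
```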
  
 cuda_device_count_stateless() -> int
Get the number of CUDA devices, caching the result based on the value of CUDA_VISIBLE_DEVICES at the time of the call.
This should be used instead of torch.cuda.device_count() unless CUDA_VISIBLE_DEVICES has already been set to the desired value.
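Sketch of the difference in behavior, assuming a machine with at least two GPUs:

```python
import os
from vllm.utils.torch_utils import cuda_device_count_stateless

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
print(cuda_device_count_stateless())  # 2

# The cache is keyed on CUDA_VISIBLE_DEVICES, so a later change is picked up,
# unlike torch.cuda.device_count(), which may not reflect such changes.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
print(cuda_device_count_stateless())  # 1
```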
  
 current_stream() -> Stream
Replaces torch.cuda.current_stream() with vllm.utils.current_stream(). It turns out that torch.cuda.current_stream() is quite expensive, as it constructs a new stream object on every call. Here we patch torch.cuda.set_stream to keep track of the current stream directly, so that we can avoid calling torch.cuda.current_stream().
The underlying assumption is that we do not call torch._C._cuda_setStream from C/C++ code.
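Usage sketch; the intent is a cheap drop-in for torch.cuda.current_stream() on hot paths:

```python
import torch
from vllm.utils.torch_utils import current_stream

s = current_stream()   # cached; does not construct a new Stream object per call
t = torch.empty(4, device="cuda")
t.record_stream(s)     # usable anywhere a torch.cuda.Stream is expected
```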
  
 direct_register_custom_op(
    op_name: str,
    op_func: Callable,
    mutates_args: list[str] | None = None,
    fake_impl: Callable | None = None,
    target_lib: Library | None = None,
    dispatch_key: str | None = None,
    tags: tuple[Tag, ...] = (),
)
torch.library.custom_op can have significant overhead because it needs to consider complicated dispatching logic. This function directly registers a custom op and dispatches it to the CUDA backend. See https://gist.github.com/youkaichao/ecbea9ec9fc79a45d2adce1784d7a9a5 for more details.
By default, the custom op is registered to the vLLM library. If you want to register it to a different library, you can pass the library object to the target_lib argument.
IMPORTANT: the lifetime of the operator is tied to the lifetime of the library object. If you want to bind the operator to a different library, make sure the library object is alive when the operator is used.
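A hypothetical registration, just to show the shape of the call; the op name and both implementations are made up for illustration:

```python
import torch
from vllm.utils.torch_utils import direct_register_custom_op

def my_scale(x: torch.Tensor, factor: float) -> torch.Tensor:
    # real implementation, dispatched to the CUDA backend by default
    return x * factor

def my_scale_fake(x: torch.Tensor, factor: float) -> torch.Tensor:
    # shape/dtype-only implementation used during tracing and compilation
    return torch.empty_like(x)

direct_register_custom_op(
    op_name="my_scale",
    op_func=my_scale,
    mutates_args=[],        # this op does not mutate its inputs
    fake_impl=my_scale_fake,
)

# With the default vLLM library, the op should be reachable as
# torch.ops.vllm.my_scale(x, 2.0).
```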
  
  Get a CUDA view of a CPU tensor using Unified Virtual Addressing (UVA).
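A sketch, assuming the helper is importable as get_cuda_view_from_cpu_tensor and that the CPU tensor must live in pinned host memory for UVA to apply:

```python
import torch
from vllm.utils.torch_utils import get_cuda_view_from_cpu_tensor  # name assumed

cpu_buf = torch.zeros(1024, dtype=torch.float32, pin_memory=True)
cuda_view = get_cuda_view_from_cpu_tensor(cpu_buf)
# cuda_view shares storage with cpu_buf but is addressable from the GPU.
```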
  
    
 get_kv_cache_torch_dtype(
    cache_dtype: str | dtype | None,
    model_dtype: str | dtype | None = None,
) -> dtype
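Illustrative resolution of the KV-cache dtype:

```python
from vllm.utils.torch_utils import get_kv_cache_torch_dtype

dtype = get_kv_cache_torch_dtype("fp8_e5m2", model_dtype="half")
# expected: torch.uint8, per STR_DTYPE_TO_TORCH_DTYPE above
```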
  
  Test whether it is lossless to cast a tensor from src_dtype to tgt_dtype.
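Illustrative expectations, assuming the helper is importable as is_lossless_cast:

```python
import torch
from vllm.utils.torch_utils import is_lossless_cast  # name assumed

assert is_lossless_cast(torch.float16, torch.float32)      # widening is lossless
assert not is_lossless_cast(torch.float32, torch.float16)  # narrowing is not
```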
  
  Check if the installed torch version is == the target version.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| target | str | a version string, like "2.6.0". | required | 
Returns:
| Type | Description | 
|---|---|
| bool | Whether the condition is met. |
  
 is_torch_equal_or_newer(target: str) -> bool
Check if the installed torch version is >= the target version.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| target | str | a version string, like "2.6.0". | required | 
Returns:
| Type | Description | 
|---|---|
| bool | Whether the condition is met. |
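A typical guard for version-dependent behavior:

```python
from vllm.utils.torch_utils import is_torch_equal_or_newer

if is_torch_equal_or_newer("2.6.0"):
    # take the code path that relies on newer torch APIs
    ...
else:
    # fall back to the older behavior
    ...
```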
  
 kv_cache_dtype_str_to_dtype(
    kv_cache_dtype: str, model_config: ModelConfig
) -> dtype
  
 make_ndarray_with_pad(
    x: list[list[T]],
    pad: T,
    dtype: DTypeLike,
    *,
    max_len: int | None = None,
) -> NDArray
Make a padded array from 2D inputs.
The padding is applied to the end of each inner list until it reaches max_len.
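For example (illustrative values):

```python
import numpy as np
from vllm.utils.torch_utils import make_ndarray_with_pad

arr = make_ndarray_with_pad([[1, 2, 3], [4]], pad=0, dtype=np.int64)
# expected: array([[1, 2, 3], [4, 0, 0]])
```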
  
 make_tensor_with_pad(
    x: list[list[T]],
    pad: T,
    dtype: dtype,
    *,
    max_len: int | None = None,
    device: str | device | None = None,
    pin_memory: bool = False,
) -> Tensor
Make a padded tensor from 2D inputs.
The padding is applied to the end of each inner list until it reaches max_len.
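For example (illustrative values):

```python
import torch
from vllm.utils.torch_utils import make_tensor_with_pad

t = make_tensor_with_pad(
    [[1, 2, 3], [4]],
    pad=0,
    dtype=torch.int64,
    max_len=4,          # pad every row to length 4
    device="cuda",
    pin_memory=False,
)
# expected shape: (2, 4)
```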
  
 set_default_torch_dtype(dtype: dtype)
Sets the default torch dtype to the given dtype.
   
 set_default_torch_num_threads(num_threads: int)
Sets the default number of threads for PyTorch to the given value.
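A sketch covering both helpers, assuming they act as context managers that restore the previous setting on exit:

```python
import torch
from vllm.utils.torch_utils import (
    set_default_torch_dtype,
    set_default_torch_num_threads,
)

with set_default_torch_dtype(torch.float16):
    w = torch.empty(16, 16)   # created with the temporary default dtype

with set_default_torch_num_threads(1):
    ...                       # CPU-side work here runs single-threaded
```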
   
  Create a weak reference to a tensor. The new tensor will share the same data as the original tensor, but will not keep the original tensor alive.
  
 weak_ref_tensors(
    tensors: Tensor
    | list[Tensor]
    | tuple[Tensor]
    | IntermediateTensors,
) -> Tensor | list[Any] | tuple[Any] | Any
Convenience function to create weak references to tensors; accepts a single tensor, a list of tensors, or a tuple of tensors.
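Sketch of creating weak references that do not extend the lifetime of the originals:

```python
import torch
from vllm.utils.torch_utils import weak_ref_tensors

activations = [torch.randn(8, device="cuda") for _ in range(3)]
refs = weak_ref_tensors(activations)
# Each element of refs shares storage with the corresponding tensor in
# activations but does not keep it alive; once the originals are freed,
# the weak references must no longer be used.
```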