Bases: Attention
Cross-attention for encoder-decoder models. Handles attention between decoder queries and encoder keys/values.
Source code in vllm/attention/layers/cross_attention.py
 __init__(
    num_heads: int,
    head_size: int,
    scale: float,
    cache_config: CacheConfig | None = None,
    attn_type: str | None = None,
    **kwargs,
)
Source code in vllm/attention/layers/cross_attention.py
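As a rough construction sketch (the sizes below are illustrative, and in practice the layer is instantiated inside a model definition, where vLLM supplies the cache config and a unique layer prefix):

```python
# Illustrative sketch only: constructing a cross-attention layer the way a
# decoder block in an encoder-decoder model might. Sizes are made up.
from vllm.attention.layers.cross_attention import CrossAttention

num_heads = 16
head_size = 64

cross_attn = CrossAttention(
    num_heads=num_heads,
    head_size=head_size,
    scale=head_size**-0.5,   # standard 1/sqrt(head_size) attention scaling
    cache_config=None,       # None falls back to default KV-cache settings
)
```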
 get_kv_cache_spec(vllm_config: VllmConfig) -> KVCacheSpec
Source code in vllm/attention/layers/cross_attention.py
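Unlike decoder self-attention, the cross-attention keys/values are produced once from the encoder output and do not grow during decoding, so the per-request cache requirement is bounded by the maximum encoder length (see `_get_max_encoder_len` below). A back-of-the-envelope sketch of that sizing, with made-up numbers:

```python
# Illustrative sizing only; the real spec is the CrossAttentionSpec
# returned by get_kv_cache_spec, not this arithmetic.
import math

max_encoder_len = 1500   # e.g. a Whisper-style fixed encoder length (assumed)
block_size = 16          # paged KV-cache block size (assumed)

blocks_per_request = math.ceil(max_encoder_len / block_size)
print(blocks_per_request)  # 94 cross-attention KV blocks per request
```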
 _get_cross_slot_mapping(
    encoder_seq_lens: ndarray,
    block_table_tensor: Tensor,
    kv_cache_spec: CrossAttentionSpec,
    device: device,
) -> Tensor
Get cross-attention slot mappings.
Source code in vllm/attention/layers/cross_attention.py
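Conceptually, each encoder token position is translated into a flat slot index in the paged KV cache via the request's block table. A simplified re-implementation of that idea (not the actual vLLM code; the helper name and the lack of padding handling are illustrative):

```python
import numpy as np
import torch

def cross_slot_mapping_sketch(
    encoder_seq_lens: np.ndarray,   # (num_reqs,) encoder tokens per request
    block_table: torch.Tensor,      # (num_reqs, max_blocks) physical block ids
    block_size: int,
) -> torch.Tensor:
    slots = []
    for req_idx, seq_len in enumerate(encoder_seq_lens.tolist()):
        positions = torch.arange(int(seq_len))
        block_ids = block_table[req_idx, positions // block_size]
        slots.append(block_ids * block_size + positions % block_size)
    return torch.cat(slots)

# One request with 5 encoder tokens, block_size=4, physical blocks [7, 2]:
print(cross_slot_mapping_sketch(np.array([5]), torch.tensor([[7, 2]]), 4))
# tensor([28, 29, 30, 31,  8])
```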
 _get_max_encoder_len(vllm_config: VllmConfig) -> int
Gets the max number of encoder input tokens from the config.
Source code in vllm/attention/layers/cross_attention.py
cached
 create_cross_attention_backend(
    underlying_attn_backend: AttentionBackend,
) -> type[AttentionBackend]
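A hedged usage sketch: the factory takes an existing backend class and returns a cross-attention variant of it. The `FlashAttentionBackend` import path below is an assumption and may differ between vLLM versions.

```python
from vllm.attention.layers.cross_attention import (
    create_cross_attention_backend,
)
# Assumed backend location; adjust to your vLLM version.
from vllm.v1.attention.backends.flash_attn import FlashAttentionBackend

cross_backend_cls = create_cross_attention_backend(FlashAttentionBackend)

# The factory is cached, so wrapping the same underlying backend again
# returns the same generated class object.
assert create_cross_attention_backend(FlashAttentionBackend) is cross_backend_cls
```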