Shared resampler perceiver network used in multimodal models, plus related helpers for sincos positional embeddings. Source code: vllm/model_executor/layers/resampler.py
Example models: Qwen (Qwen-VL), MiniCPM-V 2.0
 
BaseResampler

  Bases: Module

A 2D perceiver-resampler network with one cross-attention layer over (grid_size**2) learnable queries and 2D sincos positional embeddings.

Outputs: a tensor of shape (grid_size**2, embed_dim).
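To make the mechanism concrete, here is a minimal, self-contained PyTorch sketch of the idea (not vLLM's implementation): a fixed set of learnable queries cross-attends to a variable-length sequence of image features, producing a fixed-size output. vLLM's version additionally applies 2D sincos positional embeddings and an optional key/value projection (kv_proj below).

 import torch
 import torch.nn as nn

 class ToyResampler(nn.Module):
     """Minimal perceiver-resampler sketch: learnable queries cross-attend
     to image features, compressing them to a fixed number of tokens."""

     def __init__(self, num_queries: int, embed_dim: int, num_heads: int):
         super().__init__()
         self.query = nn.Parameter(torch.randn(num_queries, embed_dim) * 0.02)
         self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

     def forward(self, x: torch.Tensor) -> torch.Tensor:
         # x: (batch, num_patches, embed_dim) features from a vision encoder
         q = self.query.unsqueeze(0).expand(x.shape[0], -1, -1)
         out, _ = self.attn(q, x, x)  # queries attend over all patches
         return out                   # (batch, num_queries, embed_dim)

 resampler = ToyResampler(num_queries=64, embed_dim=1024, num_heads=8)
 feats = torch.randn(2, 576, 1024)    # e.g. a 24x24 grid of ViT patch features
 print(resampler(feats).shape)        # torch.Size([2, 64, 1024])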
kv_proj instance-attribute
 kv_proj = ReplicatedLinear(
    kv_dim,
    embed_dim,
    bias=False,
    quant_config=quant_config,
    prefix=f"{prefix}.kv_proj",
)
 
 __init__(
    num_queries: int,
    embed_dim: int,
    num_heads: int,
    kv_dim: int | None = None,
    norm_layer: Callable[[int], LayerNorm] = DEFAULT_LN,
    do_post_projection: bool = True,
    quant_config: QuantizationConfig | None = None,
    prefix: str = "",
) -> None
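A hedged construction example, assuming the class is importable from the source path above and that the signature defaults apply; the sizes are illustrative, not taken from any real model config. In practice this construction happens inside a model's vision tower rather than standalone.

 from vllm.model_executor.layers.resampler import BaseResampler  # path assumed from this page

 # 64 queries in the LLM's 4096-d embedding space; the vision encoder
 # emits 1152-d features, so kv_proj maps 1152 -> 4096.
 resampler = BaseResampler(
     num_queries=64,
     embed_dim=4096,
     num_heads=32,
     kv_dim=1152,
 )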
  
Resampler2

  Bases: BaseResampler

Resampler-perceiver network used by a variety of model types, e.g., Qwen-VL / MiniCPM-V 2.0. The main difference is the addition of the do_post_projection arg, which indicates whether there should be a post layer normalization and projector after the attention. This is present in MiniCPM-V 2.0, but not Qwen-VL.
  
 __init__(
    grid_size: int,
    embed_dim: int,
    num_heads: int,
    kv_dim: int | None = None,
    norm_layer: Callable[[int], LayerNorm] = DEFAULT_LN,
    adaptive: bool = False,
    do_post_projection: bool = True,
    quant_config: QuantizationConfig | None = None,
    prefix: str = "",
) -> None
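As a sketch of how the two example models differ in the do_post_projection flag, the constructor can be parameterized either way; the import path is assumed from this page and the dimensions are illustrative only, not real model configs.

 from vllm.model_executor.layers.resampler import Resampler2  # path assumed from this page

 # Qwen-VL style: 16x16 = 256 queries, no post LayerNorm/projection.
 qwen_vl_style = Resampler2(
     grid_size=16,
     embed_dim=4096,
     num_heads=32,
     kv_dim=1664,
     adaptive=False,
     do_post_projection=False,
 )

 # MiniCPM-V 2.0 style: adaptive positional embeddings plus a post
 # LayerNorm + projection after the cross-attention.
 minicpm_style = Resampler2(
     grid_size=12,
     embed_dim=2304,
     num_heads=18,
     kv_dim=1152,
     adaptive=True,
     do_post_projection=True,
 )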
 get_1d_sincos_pos_embed_from_grid(
    embed_dim: int,
    pos: ndarray,
    version: tuple[int, int] = (2, 0),
) -> Tensor
embed_dim: output dimension for each position
pos: a list of positions to be encoded, size (M,) / (H, W)
out: (M, D) / (H, W, D)
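The underlying formula is the standard transformer sincos recipe: half the channels are sines and half are cosines of the position scaled by geometrically spaced frequencies. A NumPy sketch (not the vLLM code, and version behavior is not modeled):

 import numpy as np

 def sincos_1d(embed_dim: int, pos: np.ndarray) -> np.ndarray:
     # omega_i = 1 / 10000^(i / (D/2)),  i = 0 .. D/2 - 1
     assert embed_dim % 2 == 0
     omega = 1.0 / 10000 ** (np.arange(embed_dim // 2) / (embed_dim / 2))
     angles = np.outer(pos.reshape(-1).astype(np.float64), omega)  # (M, D/2)
     return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)  # (M, D)

 print(sincos_1d(8, np.arange(4)).shape)  # (4, 8)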
  
 get_2d_sincos_pos_embed(
    embed_dim: int,
    grid_size: int | tuple[int, int],
    cls_token: bool = False,
    version: tuple[int, int] = (2, 0),
) -> Tensor
grid_size: int of the grid height and width
return: pos_embed: [grid_size*grid_size, embed_dim] or [1 + grid_size*grid_size, embed_dim] (with or without cls_token)
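A usage sketch, assuming the import path from this page; the shapes follow directly from the return contract above:

 from vllm.model_executor.layers.resampler import get_2d_sincos_pos_embed  # path assumed

 pos_embed = get_2d_sincos_pos_embed(embed_dim=1024, grid_size=24)
 print(pos_embed.shape)  # (576, 1024): one row per grid cell

 with_cls = get_2d_sincos_pos_embed(embed_dim=1024, grid_size=24, cls_token=True)
 print(with_cls.shape)   # (577, 1024): an extra row prepended for the cls token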
  
 get_2d_sincos_pos_embed_from_grid(
    embed_dim: int,
    grid: ndarray,
    version: tuple[int, int] = (2, 0),
) -> Tensor
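This function carries no docstring here, but the construction it names is the conventional MAE-style recipe: build two 1D sincos embeddings, one over the height coordinates of the grid and one over the width coordinates, and concatenate them channel-wise. A self-contained NumPy sketch under that assumption (not necessarily version-exact):

 import numpy as np

 def _sincos_1d(embed_dim: int, pos: np.ndarray) -> np.ndarray:
     omega = 1.0 / 10000 ** (np.arange(embed_dim // 2) / (embed_dim / 2))
     angles = np.outer(pos, omega)
     return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

 # The first embed_dim/2 channels encode the row index, the last
 # embed_dim/2 channels encode the column index.
 def sincos_2d_from_grid(embed_dim: int, grid: np.ndarray) -> np.ndarray:
     emb_h = _sincos_1d(embed_dim // 2, grid[0])    # (H*W, D/2)
     emb_w = _sincos_1d(embed_dim // 2, grid[1])    # (H*W, D/2)
     return np.concatenate([emb_h, emb_w], axis=1)  # (H*W, D)

 h = w = 4
 grid = np.stack(np.meshgrid(np.arange(w), np.arange(h)), axis=0).reshape(2, -1)
 print(sincos_2d_from_grid(16, grid).shape)  # (16, 16)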