Bases: PretrainedConfig
This is the configuration class that stores the configuration of an `UltravoxForConditionalGeneration`. It is used to instantiate an Ultravox model according to the specified arguments, which define the model architecture.
Configuration objects inherit from `PretrainedConfig` and can be used to control the model outputs. Read the documentation for `PretrainedConfig` for more information.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| audio_config | `Union[AutoConfig, dict]`, *optional* | Custom audio config or dict. | None | 
| text_config | `Union[AutoConfig, dict]`, *optional* | The config object of the text backbone. | None | 
| audio_model_id | `str`, *optional* | The model ID of the audio backbone. | None | 
| text_model_id | `str`, *optional* | The model ID of the text backbone. | None | 
| ignore_index | `int`, *optional*, defaults to -100 | The ignore index for the loss function. | -100 | 
| audio_token_index | `int`, *optional*, defaults to 32000 | The audio token index to encode the audio prompt. | 32000 | 
| stack_factor | `int`, *optional*, defaults to 8 | Audio downsampling factor for the multimodal projector. | 8 | 
| norm_init | `float`, *optional*, defaults to 0.4 | The initialization value for the layer normalization. | 0.4 | 
| projector_act | `str`, *optional*, defaults to `"swiglu"` | The activation function used by the multimodal projector. | 'swiglu' | 
| projector_ln_mid | `bool`, *optional*, defaults to `False` | Whether to apply layer normalization at the middle of the projector or at the end. Versions v0.4.1 and below use  | False | 
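
As a rough illustration of what an audio downsampling factor such as `stack_factor` does, the sketch below groups consecutive encoder frames and concatenates their features, so the sequence gets shorter while each frame gets wider. The zero-padding of the tail and the helper name `stack_frames` are assumptions made for this example; the table above only states that `stack_factor` is the downsampling factor for the multimodal projector.

```python
def stack_frames(frames, stack_factor=8):
    """Concatenate every `stack_factor` consecutive frames into one frame.

    Hypothetical helper for illustration only, not part of vLLM.
    `frames` is a list of equal-length feature lists (time-major).
    The tail is zero-padded so the length is a multiple of `stack_factor`
    (an assumption of this sketch).
    """
    if stack_factor < 1:
        raise ValueError("stack_factor must be >= 1")
    if not frames:
        return []
    dim = len(frames[0])
    pad = (-len(frames)) % stack_factor
    padded = frames + [[0.0] * dim for _ in range(pad)]
    return [
        [x for frame in padded[i:i + stack_factor] for x in frame]
        for i in range(0, len(padded), stack_factor)
    ]


# 20 frames of 2 features each, downsampled with the default factor of 8.
frames = [[float(t), float(t) + 0.5] for t in range(20)]
stacked = stack_frames(frames, stack_factor=8)
print(len(stacked), len(stacked[0]))  # 3 16
```

With 20 input frames and a factor of 8, the sequence is padded to 24 frames and reduced to 3, each carrying 8 × 2 = 16 features.
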
Source code in vllm/transformers_utils/configs/ultravox.py
 __init__(
    audio_config: dict[str, Any] | None = None,
    text_config: dict[str, Any] | None = None,
    audio_model_id: str | None = None,
    text_model_id: str | None = None,
    ignore_index: int = -100,
    audio_token_index: int = 32000,
    hidden_size: int = 4096,
    stack_factor: int = 8,
    norm_init: float = 0.4,
    projector_act: str = "swiglu",
    projector_ln_mid: bool = False,
    **kwargs,
)
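
The signature above takes only keyword-style arguments with defaults, so a caller typically overrides just the fields that differ. The sketch below mirrors those defaults in a stand-in dataclass so it runs without vLLM installed; `UltravoxConfigSketch` and the model ID `"example/audio-encoder"` are hypothetical names for this example, and the real class additionally accepts `**kwargs` that are not modeled here.

```python
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass
class UltravoxConfigSketch:
    """Hypothetical stand-in mirroring the defaults in the signature above."""
    audio_config: Optional[dict[str, Any]] = None
    text_config: Optional[dict[str, Any]] = None
    audio_model_id: Optional[str] = None
    text_model_id: Optional[str] = None
    ignore_index: int = -100
    audio_token_index: int = 32000
    hidden_size: int = 4096
    stack_factor: int = 8
    norm_init: float = 0.4
    projector_act: str = "swiglu"
    projector_ln_mid: bool = False


# Override only what differs from the defaults (placeholder values).
cfg = UltravoxConfigSketch(
    audio_model_id="example/audio-encoder",
    stack_factor=4,
    projector_ln_mid=True,
)

# Collect the fields whose values differ from a default-constructed config.
defaults = asdict(UltravoxConfigSketch())
overrides = {k: v for k, v in asdict(cfg).items() if v != defaults[k]}
print(overrides)
```

Comparing against a default-constructed instance is a simple way to see which settings a given configuration actually changes.
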