Response generation settings

To access the parameters for generating responses to user queries, select Settings → Generation in the menu.

System prompt

If needed, you can edit the system prompt used by the LLM to generate responses to user queries.
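For illustration, here is a minimal sketch of where a system prompt sits in an OpenAI-style chat request; the client, model name, and prompt wording are assumptions, and the platform supplies its own system prompt automatically.

```python
# A minimal sketch (not the platform's actual internals) of where a system
# prompt sits in an OpenAI-style chat request. The client, model name, and
# prompt text are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "Answer the user's question using only the retrieved knowledge base "
    "chunks. If the answer is not in the chunks, say you don't know."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # hypothetical model choice
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "How do I reset my password?"},
    ],
)
print(response.choices[0].message.content)
```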

LLM settings

The LLM settings are applied:

  • to generate the response to a user query;
  • to retrieve chunks, if LLM-based retrieval is selected;
  • to rephrase the query with the conversation history taken into account, if retrieval by embedding similarity is selected (see the sketch after this list).
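As an illustration of the rephrasing step, the sketch below condenses the chat history and the latest question into a standalone query before embedding retrieval. The function name, prompt wording, and model are assumptions, not the platform's implementation.

```python
# A rough sketch of history-aware query rephrasing before embedding retrieval.
# The LLM rewrites the latest question as a standalone search query; that query
# is then embedded and matched against chunks. Names and prompts are assumed.
from openai import OpenAI

client = OpenAI()

def rephrase_query(history: list[dict], question: str) -> str:
    """Rewrite the latest question so it makes sense without the history."""
    prompt = (
        "Rewrite the last user question as a self-contained search query, "
        "resolving any references to the earlier conversation.\n\n"
        + "\n".join(f"{m['role']}: {m['content']}" for m in history)
        + f"\nuser: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        temperature=0,        # deterministic rewriting
        max_tokens=128,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

history = [
    {"role": "user", "content": "What plans do you offer?"},
    {"role": "assistant", "content": "Basic, Pro, and Enterprise."},
]
print(rephrase_query(history, "How much does the second one cost?"))
# Expected: something like "How much does the Pro plan cost?"
```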

Basic settings:

  • Model: select one of the available language models. For LLM-based chunk retrieval, only models that support function calling are available, as the model calls functions to request chunks.
  • Max tokens in request: limits the number of tokens that can be sent to the LLM.
  • Max tokens in response: limits the number of tokens that the LLM can generate in one iteration.
  • Temperature: adjusts the creativity level of responses. Higher temperature values produce more creative and less predictable results. We recommend adjusting either Temperature or Top P, but not both at once. The sketch after this list shows how the basic settings map onto a request.
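As a rough illustration (not the platform's actual internals), the sketch below shows how the basic settings might map onto an OpenAI-style request, including a function the model can call to request chunks when LLM-based retrieval is selected. The tool name, schema, and model are assumptions; the Max tokens in request limit is typically enforced by truncating the prompt before sending, so it does not appear as a request parameter.

```python
# Illustrative only: basic generation settings in an OpenAI-style request,
# plus a hypothetical function the model can call to request chunks.
from openai import OpenAI

client = OpenAI()

retrieval_tool = {
    "type": "function",
    "function": {
        "name": "retrieve_chunks",  # hypothetical retrieval function
        "description": "Fetch knowledge base chunks relevant to a search query.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query."},
            },
            "required": ["query"],
        },
    },
}

response = client.chat.completions.create(
    model="gpt-4o",     # must support function calling for LLM-based retrieval
    max_tokens=1024,    # "Max tokens in response": caps one generation iteration
    temperature=0.3,    # lower values give more predictable answers
    messages=[
        {"role": "system", "content": "Answer using the knowledge base."},
        {"role": "user", "content": "How do I reset my password?"},
    ],
    tools=[retrieval_tool],
)
# If the model decided to request chunks, its call shows up in tool_calls.
print(response.choices[0].message.tool_calls)
```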

Advanced settings:

  • Top P: adjusts the diversity of responses. At lower values, the LLM selects words from a smaller, more likely set. At higher values, the response becomes more diverse. We recommend adjusting either Top P or Temperature, but not both at once.

  • Presence penalty: reduces the likelihood of tokens that have already appeared in the response. Increasing the value makes repeated words or phrases less likely.

    All repetitions are penalised equally, no matter how frequently they occur. For example, the second appearance of a token is penalised the same as the tenth.

  • Frequency penalty: reduces the likelihood of tokens that occur frequently in the response. Increasing the value makes it less likely that words or phrases appear many times.

    The impact of Frequency penalty grows with the number of times a token appears in the text (the sketch after this list contrasts the two penalties).
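To make the difference concrete, the sketch below uses the logit adjustment commonly documented for OpenAI-style APIs (an assumption about this platform's backend): a token's logit is reduced by Presence penalty once it has appeared at all, and by Frequency penalty multiplied by its occurrence count.

```python
# Illustrative comparison of the two penalties, assuming the OpenAI-style rule:
#   logit -= presence_penalty * (count > 0) + frequency_penalty * count
def penalty(count: int, presence_penalty: float, frequency_penalty: float) -> float:
    """Total amount subtracted from a token's logit after `count` occurrences."""
    return presence_penalty * (1 if count > 0 else 0) + frequency_penalty * count

for count in (1, 2, 10):
    print(count, penalty(count, presence_penalty=0.5, frequency_penalty=0.5))
# 1 -> 1.0, 2 -> 1.5, 10 -> 5.5: the presence part stays flat at 0.5,
# while the frequency part keeps growing with each occurrence.
```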