Chunk retrieval settings
You can find settings for retrieving chunks in the Settings → Retrieval section.
The Chunk retrieval method setting determines how chunks are extracted from sources for further processing, such as re-ranking or response generation:
- By embedding similarity: the system compares vector representations (embeddings) of the user’s query and chunks.
- Using LLM: the language model initiates chunk retrieval, analyses the retrieved chunks, and requests more if needed. The retrieval process finishes when the model identifies the most relevant chunks.
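To make the first method concrete, here is a minimal sketch of embedding-similarity retrieval in Python. The embedding step and the in-memory chunk store are hypothetical stand-ins; the platform manages both internally:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieve_by_similarity(query_embedding: list[float],
                           chunk_embeddings: dict[str, list[float]],
                           top_k: int = 10) -> list[str]:
    """Rank chunks by similarity to the query embedding and keep the top_k."""
    scored = sorted(
        ((cosine_similarity(query_embedding, emb), chunk_id)
         for chunk_id, emb in chunk_embeddings.items()),
        reverse=True,
    )
    return [chunk_id for _, chunk_id in scored[:top_k]]
```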
The available settings depend on the selected chunk retrieval method.
By embedding similarity
Parameters of chunk retrieval from sources
- Number of retrieval results: specifies the number of found chunks to process.
  Tip: If you are going to use re-ranking, increase the number of retrieval results. This way, you can choose relevant chunks from a larger pool of candidates.
- Number of adjacent chunks: specifies the number of adjacent chunks included with each found chunk. Adjacent chunks help the model understand the query’s context better and generate more accurate responses.
  For example, if you set the Number of retrieval results to 10 and the Number of adjacent chunks to 3, then 3 preceding and 3 following chunks are included with each found chunk. As a result, the size of each found chunk increases, and 10 larger chunks are sent for further processing (see the first sketch after this list).
- Search by chunk metadata: if enabled, the system searches for relevant chunks based on their content and additional metadata fields populated during indexing.
- Rephrase user’s query: if enabled, the system attempts to reformulate the query to make it clearer. This can improve the quality of search results.
  You can edit the prompt for rephrasing the query if needed (see the rephrasing sketch after this list).
- Full-text search: if enabled, semantic search is complemented by full-text search.
  Semantic search selects chunks relevant to the query’s meaning, even if they do not contain exact words from the query. Full-text search selects chunks that match the exact words in the query.
  Select your full-text search strategy:
  - The Hybrid strategy combines the results of semantic and full-text search, returning chunks that both match the query’s meaning and contain words from the query. The total number of results is capped by the Number of retrieval results specified above.
  - The Weighted strategy enables you to combine the results of semantic and full-text search in the desired ratio. Specify the maximum number of results from each search type to include in the final list for further processing.
    For example, if the Number of retrieval results is 10, the Semantic portion is 8, and the Full-text search portion is 4, the top 10 results from each search type are selected first. The two lists are then merged, discarding duplicates (chunks found by both search types) and any results exceeding the portion sizes. The final number of results does not exceed 12 (8 + 4); see the merge sketch after this list.
  - The With threshold strategy runs semantic search first. If it does not return enough relevant results, full-text search is applied. Specify the relevance score threshold for semantic search results. If all the results fall below the threshold, the search switches to full-text mode.
- Consider chat history: if enabled, the system takes into account the user’s previous queries during chunk retrieval. This helps the model better understand the context and generate more accurate responses.
  To consider the history, adjust the following settings:
  - Prompt.
  - The maximum history size in tokens that can be sent to the LLM.
  - The minimum number of user queries in the chat history.
  Caution: If the history size exceeds the token limit, messages are deleted one at a time until the required size or the minimum number of user queries is reached. If the minimum number of queries is reached first, the history is sent to the LLM even if it exceeds the size limit, which might cause an error. If this error occurs, try reducing the minimum number of queries in the history.
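A sketch of the Number of adjacent chunks behaviour from the list above, assuming chunks are stored in document order (the index arithmetic is illustrative, not the platform’s actual implementation):

```python
def expand_with_neighbours(chunks: list[str], hit_index: int, n_adjacent: int) -> str:
    """Merge a found chunk with its n_adjacent preceding and following chunks."""
    start = max(0, hit_index - n_adjacent)
    end = min(len(chunks), hit_index + n_adjacent + 1)
    return " ".join(chunks[start:end])

# With n_adjacent=3, a hit at index 5 expands to chunks 2..8: one found chunk
# plus 3 preceding and 3 following chunks, sent onward as a single larger chunk.
```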
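The Rephrase user’s query step is conceptually a single LLM call made before retrieval. In this sketch, complete() is a hypothetical stand-in for whichever LLM the platform invokes, and the prompt text is only an example of the editable prompt:

```python
REPHRASE_PROMPT = (
    "Rewrite the user's query so that it is clear and self-contained. "
    "Return only the rewritten query.\n\nQuery: {query}"
)

def rephrase_query(query: str, complete) -> str:
    """Ask the LLM to reformulate the query; fall back to the original if empty."""
    rewritten = complete(REPHRASE_PROMPT.format(query=query))
    return rewritten.strip() or query
```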
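The Weighted strategy example (portions of 8 and 4) can be sketched as a merge with de-duplication and per-portion caps; the exact ordering of the merged list is an assumption:

```python
def weighted_merge(semantic_hits: list[str], fulltext_hits: list[str],
                   semantic_portion: int, fulltext_portion: int) -> list[str]:
    """Merge two ranked hit lists, dropping duplicates and capping each portion."""
    merged: list[str] = []
    seen: set[str] = set()
    for hit in semantic_hits[:semantic_portion] + fulltext_hits[:fulltext_portion]:
        if hit not in seen:
            seen.add(hit)
            merged.append(hit)
    return merged

# With portions of 8 and 4, the final list holds at most 12 chunks,
# fewer when the two search types find the same chunks.
```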
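The trimming behaviour described in the caution above can be sketched as follows; the message structure and count_tokens() are simplified stand-ins:

```python
def trim_history(messages: list[dict], max_tokens: int, min_user_queries: int,
                 count_tokens) -> list[dict]:
    """Drop the oldest messages one at a time until the history fits the token
    limit, but never go below the minimum number of user queries."""
    history = list(messages)
    while count_tokens(history) > max_tokens:
        user_queries = sum(1 for m in history if m["role"] == "user")
        if history[0]["role"] == "user" and user_queries <= min_user_queries:
            # Minimum reached: the oversized history is sent anyway,
            # which may cause an error on the LLM side.
            break
        history.pop(0)  # delete the oldest message
    return history
```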
Re-ranking
Re-ranking is performed after chunk retrieval from source documents to re-evaluate the relevance of chunks to the user’s query and to select the most relevant ones. If the retrieved chunks are sufficiently relevant to the query, you may disable re-ranking.
Select the re-ranking method:
- Empirical: an algorithm based on Tovie AI’s experience.
- Using a model: performed by a specialised re-ranking model.
Empirical ranking is faster than model-based ranking, but may be less accurate.
Re-ranking settings vary depending on the selected method.
Empirical
- Number of retrieval results: specifies the maximum number of results sent to the LLM along with the user’s query for response generation after re-ranking.
- Number of results from a single document: ensures that chunks from a single source document do not take up the entire response generation context.
- Threshold score: results with a relevance score below the threshold are excluded from response generation.
- Maximum deviation from the best score, %: specifies the allowed deviation from the score of the most relevant chunk. Results with a deviation above the allowed value are excluded from response generation.
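A sketch of how the four empirical settings might combine; the order in which the checks are applied is an assumption:

```python
def empirical_filter(hits: list[tuple[float, str, str]], top_n: int,
                     per_doc_limit: int, min_score: float,
                     max_deviation_pct: float) -> list[str]:
    """Apply the empirical re-ranking limits to hits sorted best-first.

    Each hit is a (score, doc_id, chunk) tuple.
    """
    best_score = hits[0][0] if hits else 0.0
    kept: list[str] = []
    per_doc: dict[str, int] = {}
    for score, doc_id, chunk in hits:
        if score < min_score:
            continue  # below the threshold score
        if best_score > 0 and (best_score - score) / best_score * 100 > max_deviation_pct:
            continue  # deviates too far from the best result
        if per_doc.get(doc_id, 0) >= per_doc_limit:
            continue  # this document already filled its per-document quota
        per_doc[doc_id] = per_doc.get(doc_id, 0) + 1
        kept.append(chunk)
        if len(kept) == top_n:
            break
    return kept
```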
Using a model
- Number of retrieval results: specifies the maximum number of results sent to the LLM along with the user’s query for response generation after re-ranking.
- Model: specifies the model used for re-ranking.
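Model-based re-ranking typically scores each (query, chunk) pair with a cross-encoder. The sketch below uses the sentence-transformers library and a public model as one possible example; the platform’s actual re-ranking model and API are not specified here:

```python
from sentence_transformers import CrossEncoder

def rerank(query: str, chunks: list[str], top_n: int) -> list[str]:
    """Score each (query, chunk) pair with a re-ranking model and keep the best."""
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example model
    scores = model.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_n]]
```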
Abbreviations
If your source documents contain abbreviations or contractions that differ from common usage, you can clarify them for the system. To do this, upload a JSON file containing these abbreviations with explanations. You can download a template file from the interface.
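The exact schema is defined by the template you download from the interface; the snippet below only illustrates the general idea of an abbreviation-to-explanation mapping with a hypothetical structure:

```python
import json

# Hypothetical structure; use the template from the interface for the real schema.
abbreviations = {
    "SLA": "service level agreement",
    "PoC": "proof of concept",
}

with open("abbreviations.json", "w", encoding="utf-8") as f:
    json.dump(abbreviations, f, ensure_ascii=False, indent=2)
```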
Using LLM
- Number of retrieval results: specifies the number of found chunks to process.
- Number of adjacent chunks: specifies the number of adjacent chunks included with each found chunk. Adjacent chunks help the model understand the query’s context better and generate more accurate responses.
  For example, if you set the Number of retrieval results to 10 and the Number of adjacent chunks to 3, then 3 preceding and 3 following chunks are included with each found chunk. As a result, the size of each found chunk increases, and 10 larger chunks are sent for further processing (see the adjacent-chunks sketch above).
LLM-based chunk retrieval also uses the LLM settings configured in the Settings → Generation section.
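Conceptually, LLM-based retrieval is an iterative loop: the model inspects what has been retrieved so far and either requests more chunks or stops. In this rough sketch, search() and decide() are hypothetical stand-ins for the platform’s retrieval call and the LLM decision step:

```python
def llm_retrieval(query: str, search, decide, max_rounds: int = 5) -> list[str]:
    """Iteratively retrieve chunks until the LLM deems them relevant enough.

    search(q) returns a list of chunks; decide(query, chunks) returns a
    follow-up query for more chunks, or None when retrieval should stop.
    """
    collected: list[str] = []
    next_query = query
    for _ in range(max_rounds):
        collected.extend(search(next_query))
        next_query = decide(query, collected)
        if next_query is None:  # the model identified the most relevant chunks
            break
    return collected
```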