Indexing settings
Indexing the data loaded into the knowledge base consists of several stages:
- Data processing: converting the source text into Markdown format.
- Chunking: dividing text into fragments (chunks).
- Vectorisation: converting the chunks into vector representations (embeddings).
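For orientation, here is a minimal sketch of these three stages in Python. Every name in it is a hypothetical stub for illustration, not Tovie Data Agent's internal API:

```python
# Hypothetical outline of the indexing pipeline; all functions are stubs.
def convert_to_markdown(raw_text: str) -> str:
    # Stage 1, data processing: the real pipeline converts source files
    # (PDF, DOCX, HTML, ...) into Markdown. Stubbed as a pass-through.
    return raw_text

def split_into_chunks(markdown: str, max_chars: int = 70) -> list[str]:
    # Stage 2, chunking: naive fixed-length split for illustration only.
    return [markdown[i:i + max_chars] for i in range(0, len(markdown), max_chars)]

def embed(chunk: str) -> list[float]:
    # Stage 3, vectorisation: a real embedding model returns a dense vector;
    # here a placeholder stands in for it.
    return [0.0] * 8

document = "Some source text loaded into the knowledge base."
index = [
    {"text": chunk, "embedding": embed(chunk)}
    for chunk in split_into_chunks(convert_to_markdown(document))
]
```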
In the Project settings → Indexing tab, you can change the parameters for chunking and vectorisation.
- To view project settings, you need at least the KHUB_EDITOR role.
- To edit project settings, you need the KHUB_OWNER or KHUB_ADMIN role.
After changing the indexing settings, you’ll need to re-index the knowledge base.
Vectorisation
The Vectoriser model parameter determines the language model for text vectorisation. This model will vectorise both your data and user queries:
- text-embedding-3-large: a model by OpenAI. When you use this model, your data is sent to OpenAI's servers.
- intfloat/multilingual-e5-large: a model hosted on Tovie AI’s servers.
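As a rough illustration of what vectorisation does (not the code Tovie Data Agent runs), the sketch below embeds a chunk and a user query with the public intfloat/multilingual-e5-large checkpoint and compares them. The `query:`/`passage:` prefixes are part of that model's published usage, not a Tovie setting:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-large")

# E5 models expect "passage: " on indexed text and "query: " on user queries.
chunk_vec = model.encode("passage: Indexing converts chunks into embeddings.")
query_vec = model.encode("query: How is the knowledge base indexed?")

# Cosine similarity: higher scores mean the chunk is more relevant to the query.
print(util.cos_sim(query_vec, chunk_vec))
```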
Chunking
Chunking method
The Chunking method parameter determines how the text will be split into chunks:
- By length: The text will be chunked by length, considering word boundaries.
- Using LLM: The text will be chunked using a language model. In this case, chunking is based on the text hierarchy, such as headings, paragraphs, and section or document titles.
The list of settings depends on the selected chunking method.
By length

- Max chunk size in characters.

How the text will be chunked
Suppose the *Max chunk size in characters* setting is 70.
You have a text consisting of 2 sentences, 100 characters each.
The text will be divided into 3 chunks:
1. 70 characters from the first sentence.
2. The remaining 30 characters from the first sentence and 40 from the second one.
3. The remaining 60 characters from the second sentence.
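Below is a simplified Python sketch of length-based chunking. It is not the product's actual implementation: the worked example above uses exact character counts for simplicity, while a word-boundary-aware split produces chunks of at most the configured size.

```python
# A simplified sketch of length-based chunking (hypothetical code, not the
# product's implementation). Words are never split, so each chunk is at most
# max_chars long; a single word longer than max_chars becomes its own chunk.
def chunk_by_length(text: str, max_chars: int = 70) -> list[str]:
    chunks: list[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks

sample = (
    "The quick brown fox jumps over the lazy dog near the quiet river. "
    "A second sentence of roughly the same length follows the first one here."
)
for i, chunk in enumerate(chunk_by_length(sample), start=1):
    print(i, len(chunk), chunk)
```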
Using LLM

- Language: The language of the source documents. This setting helps chunk the text correctly. If your sources are in several languages, select the one most used in queries to the knowledge base.
- Average chunk size in tokens: A text unit smaller than this value will not be split into smaller semantic parts (e.g., a document will not be split into chapters).
- Special chunking for large tables: If enabled, large tables that the model cannot process are split into parts. Each chunk includes the column headers to help the model better understand the data structure and generate a more accurate response.
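As an illustration of the idea, this hypothetical sketch splits a Markdown table into parts of a few rows each and repeats the header row in every part; the product's actual splitting logic is not documented here:

```python
# Split a Markdown table into parts, repeating the header and divider rows
# in every part so each chunk remains self-describing (hypothetical code).
def chunk_table(lines: list[str], rows_per_chunk: int) -> list[str]:
    header, divider, *rows = lines
    return [
        "\n".join([header, divider, *rows[i:i + rows_per_chunk]])
        for i in range(0, len(rows), rows_per_chunk)
    ]

table = [
    "| City | Population |",
    "| --- | --- |",
    "| Paris | 2.1M |",
    "| Berlin | 3.6M |",
    "| Madrid | 3.3M |",
    "| Rome | 2.8M |",
]
for part in chunk_table(table, rows_per_chunk=2):
    print(part, end="\n\n")
```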
Data preparation using LLM
Enabling these options may significantly increase your costs.
- Enrich chunks: Add additional information to chunks to improve search quality: title, summary, keywords, and questions that the chunk answers.
- Generate image descriptions: Add image description chunks for image search, see Images in response.
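To make the enrichment concrete, here is a hypothetical shape an enriched chunk could take. The field names mirror the list above; this is not Tovie Data Agent's documented storage format:

```python
# Hypothetical enriched chunk: the original text plus LLM-generated metadata.
enriched_chunk = {
    "text": "To re-index the knowledge base, open Project settings → Indexing ...",
    "title": "Re-indexing the knowledge base",
    "summary": "Explains when and how to re-index after changing settings.",
    "keywords": ["re-index", "indexing settings"],
    "questions": ["How do I re-index the knowledge base?"],
}
```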
LLM settings
These settings apply when an LLM is used:
- To form chunks, if LLM-based chunking is selected.
- To enrich chunks, regardless of the chunking method.
These settings do not affect the generation of image descriptions. For that, the cloud version of Tovie Data Agent uses GPT-4o mini; if Tovie Data Agent is installed in your company's infrastructure, it uses the model specified in its configuration.
Available settings:
- Model: Select one of the available language models.
- Max tokens in request: Limits the number of tokens that can be sent to the LLM.
- Max tokens in response: Limits the number of tokens that the LLM can generate in one iteration.
- Temperature: Adjusts the creativity level of responses. Higher temperature values produce more creative and less predictable results.
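For orientation, here is how these settings typically map onto an LLM API call. The sketch uses the OpenAI Python SDK's parameter names, which may differ from Tovie Data Agent's internal wiring:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# "Max tokens in request" has no single API parameter: it is enforced by
# trimming the prompt before sending, so only the response limit appears here.
completion = client.chat.completions.create(
    model="gpt-4o-mini",   # Model
    max_tokens=512,        # Max tokens in response
    temperature=0.2,       # Temperature: lower values give more predictable output
    messages=[{"role": "user", "content": "Summarise this chunk: ..."}],
)
print(completion.choices[0].message.content)
```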
To see how your source is chunked, download the archive with chunks:
- Go to the Sources section and hover over the desired source.
- Click → Chunk archive.
When testing the knowledge base, you can also see which chunks were selected to generate the response.