Indexing settings

Indexing the data loaded into the knowledge base consists of several stages:

  1. Data processing: converting the text into Markdown format, which is used for training.
  2. Chunking: dividing text into fragments (chunks).
  3. Vectorisation: converting the chunks into vector representations (embeddings).
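The three stages above can be sketched as a simple pipeline. This is an illustrative sketch only; the function names and placeholder implementations are assumptions, not the product's API.

```python
def convert_to_markdown(raw_text):
    # Stage 1, data processing: the real product converts source files
    # (HTML, PDF, etc.) to Markdown; here we just normalise whitespace.
    return raw_text.strip()

def split_into_chunks(text, max_chars=70):
    # Stage 2, chunking: naive fixed-length split for illustration only.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def embed(chunk):
    # Stage 3, vectorisation: stand-in "embedding" built from character
    # counts; a real vectoriser model returns a dense float vector.
    return [chunk.count(ch) for ch in "etaoin"]

def index_document(raw_text):
    markdown = convert_to_markdown(raw_text)
    chunks = split_into_chunks(markdown)
    return [(chunk, embed(chunk)) for chunk in chunks]
```

The output pairs each chunk with its vector, which is what the search index ultimately stores.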

In the Project settings → Indexing tab, you can change the parameters for chunking and vectorisation.

caution

After changing the indexing settings, you’ll need to re-index the knowledge base.

Vectorisation

The Vectoriser model parameter determines the language model for text vectorisation. This model will vectorise both your data and user queries:

  • text-embedding-3-large: a model by OpenAI. When using this model, your data is sent to OpenAI’s servers.
  • intfloat/multilingual-e5-large: a model hosted on Tovie AI’s servers.
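Because the same model vectorises both your data and user queries, search works by comparing the query vector against the chunk vectors, typically by cosine similarity. A minimal sketch with toy 3-dimensional vectors (real models such as text-embedding-3-large produce vectors with thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors for illustration; not real model output.
query_vec = [0.1, 0.9, 0.2]
chunk_vecs = {
    "chunk-1": [0.1, 0.8, 0.3],
    "chunk-2": [0.9, 0.1, 0.0],
}
best = max(chunk_vecs, key=lambda k: cosine_similarity(query_vec, chunk_vecs[k]))
print(best)  # chunk-1 is closest to the query
```

This is why re-indexing is required after changing the vectoriser: query vectors and chunk vectors are only comparable when produced by the same model.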

Chunking

Chunking method

The Chunking method parameter determines how the text will be split into chunks:

  • By length: the text will be chunked by length, considering word boundaries.
  • Using LLM: the text will be chunked using a language model. In this case, chunking respects the text hierarchy: headings, paragraphs, and section and document titles.

The list of settings depends on the selected chunking method.

  • Max chunk size in characters.

    How the text will be chunked:

    Suppose the Max chunk size in characters setting is 70, and you have a text consisting of 2 sentences, 100 characters each.

    The text will be divided into 3 chunks:

    1. The first 70 characters of the first sentence.
    2. The remaining 30 characters of the first sentence and the first 40 of the second.
    3. The remaining 60 characters of the second sentence.
  • Language: the language of the source documents. This setting helps chunk the text correctly. If your sources are in several languages, select the one most used in queries to the knowledge base.
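Length-based chunking with word boundaries can be sketched as a greedy word-packing loop. This is a minimal illustration of the idea, not the product's actual algorithm; because whole words are kept together, real chunk boundaries may land slightly before the maximum length:

```python
def chunk_by_length(text, max_chars):
    # Greedily pack whole words into each chunk so that no chunk
    # exceeds max_chars. Words longer than max_chars would need
    # extra handling, omitted here for brevity.
    chunks = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks

# 20 five-character words (~100 characters) with a 70-character limit.
chunks = chunk_by_length("word " * 20, 70)
```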

Data preparation using LLM

caution

Enabling these options may significantly increase your costs.

  • Enrich chunks: add additional information to chunks to improve search quality: title, summary, keywords, and questions the chunk answers.
  • Generate image descriptions: add image description chunks for image search, see Images in response.
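Conceptually, chunk enrichment attaches extra searchable fields to each chunk. The record below is purely illustrative; the field names are assumptions, not Tovie Data Agent's actual storage schema:

```python
# Hypothetical enriched chunk record; field names are assumptions.
enriched_chunk = {
    "text": "After changing the indexing settings, re-index the knowledge base.",
    "title": "Re-indexing after settings changes",
    "summary": "Changes to indexing settings require a re-index.",
    "keywords": ["re-index", "indexing settings"],
    "questions": ["Do I need to re-index after changing settings?"],
}
```

The added fields give the search more text to match user queries against, which is why enrichment can improve retrieval quality.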

LLM settings

The LLM settings are applied:

  • To form chunks if LLM-based chunking is selected.
  • To enrich chunks, regardless of the chunking method.
caution

These settings do not affect the generation of image descriptions. The cloud version of Tovie Data Agent uses GPT-4o mini. If Tovie Data Agent is installed in your company’s infrastructure, it uses the model specified in its configuration.

Available settings:

  • Model: select one of the available language models.
  • Max tokens in request: limits the number of tokens that can be sent to the LLM.
  • Max tokens in response: limits the number of tokens that the LLM can generate in one iteration.
  • Temperature: adjusts the creativity level of responses. Higher temperature values produce more creative and less predictable results.
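To illustrate how these settings typically map onto an LLM request, here is a hedged sketch; the function, the token-counting shortcut, and the payload field names are assumptions modelled on common chat-completion APIs, not the product's internals:

```python
def build_llm_request(prompt, model, max_request_tokens,
                      max_response_tokens, temperature):
    # Crude whitespace token count stands in for a real tokeniser.
    if len(prompt.split()) > max_request_tokens:
        raise ValueError("prompt exceeds 'Max tokens in request'")
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_response_tokens,  # 'Max tokens in response'
        "temperature": temperature,         # higher = more creative output
    }

request = build_llm_request(
    "Summarise this chunk.",
    model="gpt-4o-mini",
    max_request_tokens=1000,
    max_response_tokens=256,
    temperature=0.2,
)
```

Note that "Max tokens in response" limits a single generation step, so a very low value can truncate chunk summaries or descriptions mid-sentence.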
tip

To see how your source is chunked, download the archive with chunks:

  1. Go to the Sources section and hover over the desired source.
  2. Click Chunk archive.

When testing the knowledge base, you can also see which chunks are selected to generate the response.