Section and index settings

release/20240425

Available in Classic and VPC

Define the sections and indexes of a domain according to the type of documents to be searched and the purpose of the search. An index that matches the way you intend to serve search results yields faster responses.
<Example> When the target document for search consists of product name, price, product description, number of recommendations, and tags

Section: define one section for each constituent item of the document (product name, price, product description, number of recommendations, and tags)
Index: build the index by selecting the sections that suit your search purpose
- If you want to search by product name only: select only the section corresponding to the product name
- If you want to search by product name, price, and so on: select multiple sections corresponding to the items you want to search for

This guide explains how to add sections to your domain and how to add and edit indexes. It also describes the index setting items and how to index list-type data.

Add section

The following shows how to add sections to a domain.

Click the environment you are using in the Region menu and Platform menu of the NAVER Cloud Platform console.
Click the Services > Big Data & Analytics > Cloud Search menu, in that order.
Click the Domain menu.
Select the domain to add a section to from the domain list and click the [Index settings] button.
Click the [Section] tab.
Enter the section name and ranking variable (data type), and then click the [Add] button.
- You cannot edit or delete a section after it has been saved.
- Section names can contain only English letters, numbers, and underscores (_).
- Sections must match the schema of the documents that you upload to the domain for search.
- Delete icon: deletes the added section
Click the [Save] button after adding the section.
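
The naming rule in the steps above can be checked before you type a name into the console. The following is a minimal sketch, not part of Cloud Search: the helper name is_valid_name is hypothetical, and rejecting empty names is an assumption.

```python
import re

# Rule from the guide: English letters, numbers, and underscores (_) only.
# Assumption: an empty name is treated as invalid.
_NAME_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def is_valid_name(name: str) -> bool:
    """Return True if `name` uses only ASCII letters, digits, and underscores."""
    return _NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["product_name", "price", "product name", "tag-list"]:
        print(candidate, is_valid_name(candidate))
```

Because a section cannot be edited or deleted after it is saved, validating names up front is cheaper than discovering a rejected or mistyped name later.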

Note

After changing a section or an index, you must update the autocomplete index manually. (see Update autocomplete index)
A ranking variable is a variable that represents a property of a section, not a data type.
- When searching documents, you can use ranking variables (section attributes) to sort, summarize, and limit searches (scope, user_scope).
- Do not add a ranking variable unless you sort, summarize, or limit searches by the attributes of that section.
- Unnecessary ranking variables consume extra space and slow down document uploads and index edits.

Change index

The following shows how to add and edit an index of an existing domain.

Click the environment you are using in the Region menu and Platform menu of the NAVER Cloud Platform console.
Click the Services > Big Data & Analytics > Cloud Search menu, in that order.
Click the Domain menu.
Select the domain whose index you want to change from the domain list and click the [Index settings] button.
Click the [Index] tab.
Click the [Add] button or the [Edit] button, add or edit the index in the pop-up window that appears, and then click the [Save] button.
- Items to set index
  - Index name: can contain only letters, numbers, and underscores (_)
  - Enable term location: select whether to record the position information of indexed terms (see Enable term location)
  - Document weight function: a function to aggregate section weights calculated per build (see Document weight function)
    - Sum: select the sum_wgt function (calculate the sum of the weights)
    - Max: select the max_wgt function (calculate the maximum value of the weights)
  - Build information
    - Name: enter a build name
    - Section weight: enter the weight for the section to be applied when calculating the ranking score (see Section weight)
    - Creating target (section): select a section to apply the build to
    - Analysis options: select analyzers and synonym dictionaries for document indexing (see Analysis option)
- You cannot delete an index after it has been saved.
- [Delete] button: select an index you added and click the button to delete it
  - You cannot delete an existing index.
Click the [Save] button after you finish adding or editing the index.

Note

After changing a section or an index, you must update the autocomplete index manually. (see Update autocomplete index)

Items to set index

You can set the index according to your search purpose. You can perform detailed indexing with various analyzers, including a language analyzer, and add multiple builds to specify different section weights and indexing options for each section.

Enable term location

When the Enable term location option is applied, the position information of each indexed term is saved.

Caution

If the Enable term location option is applied, you cannot change the total number of analyzers created when you edit the index. (The array size of buildInfos must stay the same.)

By storing the location information of the term, you can perform an exact search by specifying the location of the indexed term.

Exact search: searches for documents where the position and distance of multiple index terms match the search term

<Example> Index applying the option to enable term location

{
  "indexes": [
    {
      "name": "atomic_idx",
      "createTermLoc": true,
      "documentTermWeight": "sum_wgt",
      "buildInfos": [
        {
          "name": "index_build",
          "sectionTermWeight": "1 * stw_2p(tf, 0.5, 0.25, 0., length / 128.0)",
          "indexProcessors": [
            {
              "type": "atomic",
              "method": "atomic",
              "option": ""
            }
          ],
          "sections": [
            "description"
          ]
        }
      ]
    }
  ]
}

<Example> Exact search query

{
    "search": {
        "atomic_idx": {
            "main": {
                "query": "Where are Hyundai's automobiles produced",
                "term_distance": 1,
                "option": "within"
            }
        }
    }
}
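
To illustrate why stored term positions make this kind of search possible, here is a toy positional index in Python. It is only a sketch under stated assumptions: whitespace tokenization, and a reading of "within" in which each query term must appear after the previous one by at most term_distance positions. The function names and the sample sentence are hypothetical, and the engine's actual matching rules may differ.

```python
from collections import defaultdict


def build_positions(text):
    """Map each whitespace-delimited term to the list of positions where it occurs."""
    positions = defaultdict(list)
    for pos, term in enumerate(text.split()):
        positions[term].append(pos)
    return dict(positions)


def matches_within(positions, query_terms, term_distance):
    """Return True if the query terms occur in order, each at most
    `term_distance` positions after the previous one (assumed semantics)."""

    def search(idx, prev_pos):
        if idx == len(query_terms):
            return True
        for pos in positions.get(query_terms[idx], []):
            if prev_pos is None or 0 < pos - prev_pos <= term_distance:
                if search(idx + 1, pos):
                    return True
        return False

    return search(0, None)


if __name__ == "__main__":
    doc = build_positions("hyundai automobiles are produced in ulsan")
    print(matches_within(doc, ["hyundai", "automobiles"], 1))  # adjacent terms
    print(matches_within(doc, ["hyundai", "produced"], 1))     # three positions apart
    print(matches_within(doc, ["hyundai", "produced"], 3))
```

Without position information, an index can only say that both terms occur somewhere in the section, not how far apart they are.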

Document weight function

The document weight function calculates the document weight (qds, the query-document similarity) of each search result. The document weight is computed from the section weights according to the selected document weight function, and search results are sorted in descending order of document weight unless otherwise specified. You can select one of the following document weight functions:

Sum: select the sum_wgt function (calculate the sum of the weights)
Max: select the max_wgt function (calculate the maximum value of the weights)
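
The difference between the two functions can be sketched in a few lines of Python. The helper document_weight and the per-build weights below are hypothetical values chosen for illustration. Only the names sum_wgt and max_wgt come from the guide.

```python
def document_weight(section_weights, function="sum_wgt"):
    """Aggregate per-build section weights into one document weight."""
    if function == "sum_wgt":
        return sum(section_weights)
    if function == "max_wgt":
        # Assumption: a document with no matching build gets weight 0.0.
        return max(section_weights, default=0.0)
    raise ValueError(f"unknown document weight function: {function}")


if __name__ == "__main__":
    weights = [0.4, 0.08, 0.04]  # hypothetical section weights from three builds
    print(round(document_weight(weights, "sum_wgt"), 2))
    print(document_weight(weights, "max_wgt"))
```

With Sum, every matching build adds to the score, so documents that match in many sections rise. With Max, only the best-matching build counts.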

Note

At search time, the ranking formula can weight the document weight once more to calculate the ranking score (_relevance) (see Ranking formula).

<Example> Calculate ranking score by weighting qds for each query (q1, q2)

_relevance = qry_qds("q1") * 0.7 + qry_qds("q2") * 0.3
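
As a worked instance of this formula, the following Python sketch plugs in hypothetical qds values for the two queries. The function relevance and the values 0.8 and 0.4 are assumptions for illustration. The weights 0.7 and 0.3 are the ones in the formula above.

```python
def relevance(qds_by_query, query_weights):
    """Weighted sum of per-query document weights (qds)."""
    return sum(qds_by_query[name] * weight for name, weight in query_weights.items())


if __name__ == "__main__":
    qds = {"q1": 0.8, "q2": 0.4}      # hypothetical document weights per query
    weights = {"q1": 0.7, "q2": 0.3}  # weights taken from the example formula
    # 0.8 * 0.7 + 0.4 * 0.3 = 0.56 + 0.12, which is approximately 0.68
    print(round(relevance(qds, weights), 2))
```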

Section weight

Section weight is a value that indicates the importance of the section information. The basic formula for section weight and descriptions of each function and variable are as follows.

imp * stw_2p(tf, A, B, C, dlen)

stw_2p: the weighting function of the retrieval model (2-Poisson model)
tf: the number of times the query term appears in the section (term frequency)
- Specifying min(tf, n) caps the counted occurrences of an overly repeated word at n.
A: a value between 0 and 1; the closer it is to 1, the greater the penalty when the section content is long
B: a value greater than 0; the larger it is, the greater the score difference between documents with low tf and documents with high tf
C: a value between 0 and 1; the closer it is to 1, the smaller the difference in section weight between documents
- If C is 0.0, section weights are expressed on a scale of 0 to 1.
- If C is 0.5, section weights are expressed on a scale of 0.5 to 1.
- If C is 1.0, every document has the same section weight of 1.
dlen: the length of the document
imp: the importance of the section
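
The exact definition of stw_2p is not given here. To build intuition for how the parameters interact, the following Python sketch uses a BM25-style approximation of a 2-Poisson weight. This is an assumption chosen because it reproduces the behavior described above for A, B, and C. It is not the documented formula, and the names stw_2p_sketch and section_weight are hypothetical.

```python
def stw_2p_sketch(tf, a, b, c, dlen):
    """ASSUMED BM25-style stand-in for stw_2p, not the engine's formula.

    a: length penalty (0..1), b: tf sensitivity, c: floor of the scale (0..1),
    dlen: normalized length, for example length / 128.0.
    """
    if tf <= 0:
        return c  # no occurrence: only the floor remains
    saturation = tf / (tf + b * ((1.0 - a) + a * dlen))
    return c + (1.0 - c) * saturation  # rescale the result to the range [c, 1]


def section_weight(imp, tf, a, b, c, length, tf_cap=None):
    """imp * stw_2p(min(tf, n), A, B, C, length / 128.0) under the sketch above."""
    if tf_cap is not None:
        tf = min(tf, tf_cap)
    return imp * stw_2p_sketch(tf, a, b, c, length / 128.0)


if __name__ == "__main__":
    # Small B: one occurrence already earns most of the weight.
    print(section_weight(1.0, 1, 0.5, 0.25, 0.0, 128))
    print(section_weight(1.0, 5, 0.5, 0.25, 0.0, 128))
    # Larger B: repeated occurrences matter more.
    print(section_weight(1.0, 1, 0.5, 2.0, 0.0, 128))
    print(section_weight(1.0, 5, 0.5, 2.0, 0.0, 128))
```

Under this sketch, raising A lowers the weight of sections longer than 128 characters, raising B widens the gap between low and high tf, and C sets the lower end of the scale.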

Note

The default section weight, 1.0 * stw_2p(tf, 0.5, 0.25, 0., length / 128.0), is tuned for shopping mall search.

Assuming that the product name does not exceed 128 characters, the effect of the A value is kept small.
The B value is set small so that the number of times the query appears in a section has little effect.
- A document that contains at least one term indexed by the analyzer ranks higher, but containing more than one does not raise the ranking significantly.
- For general post searches (title + content), use 2.0 as the value for B.
The dlen value is set to the document length divided by 128.0 (length / 128.0), which makes 128 characters the baseline for the penalty applied by the A value.

Analysis option

As analysis options, you can select an analyzer for indexing and a thesaurus. Cloud Search provides a morpheme analyzer called hanaterm, which supports the atomic and sgmt indexing methods.

Note

To select a thesaurus as an analysis option, you must first create one. (For how to create a thesaurus, see Create thesaurus)
You can set the atomic indexing method only through the API. (For indexing methods through the API, see the Cloud Search API guide)

The analysis options available for each indexing method are as follows.

atomic
- Option not set: indexes a space-delimited string
- oneterm: indexes the entire input string into one
sgmt
- Language options: selecting a language will also apply analysis options appropriate to that language
  - Korean: korean (+korea +josacat +eomicat)
  - English: english (+english +revert)
  - Japanese: japanese (+japan +josacat +eomicat)
  - Chinese (Simplified): chinese (+china_cn)
  - Chinese (Traditional): taiwanese (+china_tw)
  - Thai: thai (+thai)
  - Indonesian: indonesian (+indonesian)
- Morpheme option (Korean)
  - Postposition (josa): +josacat
  - Ending (eomi): +eomicat
  - Roots of derived nouns: +nounstem
  - Compound nouns: +compsub
  - Three-letter compound nouns: +compnoun3
- Common options for non-Korean languages
  - Base form of the verb: +revert
- Common options
  - Whole-word index: +word
  - Split by character type: +token-all
  - Split merged tokens (not available with +token-all): +alphanum
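
The option list above can be assembled into the "option" string of an indexProcessors entry. The following Python sketch is a hypothetical helper, not part of any Cloud Search SDK. The language table mirrors the list above, and the only validation it performs is the stated restriction that +alphanum is not available with +token-all.

```python
# Language name -> analysis options applied with it (copied from the list above).
LANGUAGE_OPTIONS = {
    "korean": ["+korea", "+josacat", "+eomicat"],
    "english": ["+english", "+revert"],
    "japanese": ["+japan", "+josacat", "+eomicat"],
    "chinese": ["+china_cn"],
    "taiwanese": ["+china_tw"],
    "thai": ["+thai"],
    "indonesian": ["+indonesian"],
}


def build_sgmt_processor(language, extra_options=()):
    """Build a hanaterm/sgmt processor entry for the given language (sketch)."""
    options = list(LANGUAGE_OPTIONS[language])
    for option in extra_options:
        if option not in options:  # skip duplicates, keep the order given
            options.append(option)
    if "+alphanum" in options and "+token-all" in options:
        raise ValueError("+alphanum is not available with +token-all")
    return {"type": "hanaterm", "method": "sgmt", "option": " ".join(options)}


if __name__ == "__main__":
    print(build_sgmt_processor("korean", ["+nounstem", "+word"]))
    print(build_sgmt_processor("english", ["+alphanum"]))
```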

Index example

The following example shows an index and its search results for a product search in a shopping mall. Assume that the product information of the shopping mall is as shown in the following table.

category	brand	item	description
TV	LG	55-inch TV	Description
TV	LG	65-inch	Description
TV	Samsung	65-inch	Description
TV	Samsung	75NTAWE3	Description
Peripherals for TV	NAVER	LG TV stand	Description
Peripherals for TV	NAVER	Stand	Description

In this case, you can index it as follows.

This is an example of adding thesaurus syno_dic to the analysis options.

{
  "indexes": [
    {
      "name": "shopping",
      "createTermLoc": true,
      "documentTermWeight": "sum_wgt",
      "buildInfos": [
        {
          "name": "exact_oneterm",
          "sectionTermWeight": "0.5 * stw_2p(min(tf, 1), 0.25, 0., 0., length / 128.0)",
          "indexProcessors": [
            {
              "type": "hanaterm",
              "method": "atomic",
              "option": "oneterm"
            },
            {
              "type": "add-normalized-synonym",
              "dictName": "syno_dic",
              "maxSynoNum": -1
            }
          ],
          "sections": [
            "category"
          ]
        },
        {
          "name": "category_sgmt",
          "sectionTermWeight": "0.10 * stw_2p(min(tf, 1), 0.25, 0., 0., length / 128.0)",
          "indexProcessors": [
            {
              "type": "hanaterm",
              "method": "sgmt",
              "option": "+korea +josacat +eomicat +syno=syno_dic"
            },
            {
              "type": "add-normalized-synonym",
              "dictName": "syno_dic",
              "maxSynoNum": -1
            }
          ],
          "sections": [
            "category"
          ]
        },
        {
          "name": "brand_item_atomic",
          "sectionTermWeight": "0.05 * stw_2p(min(tf, 1), 0.25, 0., 0., length / 128.0)",
          "indexProcessors": [
            {
              "type": "hanaterm",
              "method": "atomic",
              "option": ""
            },
            {
              "type": "add-normalized-synonym",
              "dictName": "syno_dic",
              "maxSynoNum": -1
            }
          ],
          "sections": [
            "brand",
            "item"
          ]
        },
        {
          "name": "brand_item_name_sgmt",
          "sectionTermWeight": "0.05 * stw_2p(min(tf, 1), 0.25, 0., 0., length / 128.0)",
          "indexProcessors": [
            {
              "type": "hanaterm",
              "method": "sgmt",
              "option": "+english +revert +korea +josacat +eomicat +syno=syno_dic"
            },
            {
              "type": "add-normalized-synonym",
              "dictName": "syno_dic",
              "maxSynoNum": -1
            }
          ],
          "sections": [
            "brand",
            "item"
          ]
        },
        {
          "name": "description_sgmt",
          "sectionTermWeight": "0.03 * stw_2p(tf, 0.5, 0.25, 0., length / 128.0)",
          "indexProcessors": [
            {
              "type": "hanaterm",
              "method": "sgmt",
              "option": "+english +revert +korea +josacat +eomicat +syno=syno_dic"
            },
            {
              "type": "add-normalized-synonym",
              "dictName": "syno_dic",
              "maxSynoNum": -1
            }
          ],
          "sections": [
            "description"
          ]
        }
      ]
    }
  ]
}

If you search for "LG TV", the search is processed as follows.

Documents whose category is TV have the highest weight because the category matches the search term atomically.
The LG 55-inch TV gains additional weight because the search terms also match in brand and item.
The LG 65-inch matches the search term in category, but ranks lower because it scores less in the brand_item_atomic build.
TV stands rank low because the search term does not match their category.
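
An index definition of this size is easy to mistype, for example by misspelling a section name in one build. The following Python sketch shows a pre-flight check you might run before sending a definition through the API. The function check_index is hypothetical. The field names documentTermWeight, buildInfos, and sections are taken from the example above, and the rule that every build must reference a section already defined in the domain is an assumption.

```python
WEIGHT_FUNCTIONS = {"sum_wgt", "max_wgt"}


def check_index(index, known_sections):
    """Return a list of human-readable problems found in one index definition."""
    problems = []
    if index.get("documentTermWeight") not in WEIGHT_FUNCTIONS:
        problems.append("documentTermWeight must be sum_wgt or max_wgt")
    builds = index.get("buildInfos", [])
    if not builds:
        problems.append("index has no builds")
    for build in builds:
        for section in build.get("sections", []):
            if section not in known_sections:
                problems.append(
                    f"build {build.get('name')!r} references unknown section {section!r}"
                )
    return problems


if __name__ == "__main__":
    index = {
        "name": "shopping",
        "documentTermWeight": "sum_wgt",
        "buildInfos": [
            {"name": "exact_oneterm", "sections": ["category"]},
            {"name": "brand_item_atomic", "sections": ["brand", "item"]},
            {"name": "description_sgmt", "sections": ["descripton"]},  # deliberate typo
        ],
    }
    for problem in check_index(index, {"category", "brand", "item", "description"}):
        print(problem)
```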

Data index in list type

The search engine provided by Cloud Search does not support the JSON list type. However, if you set the analysis options as follows, you can work around this limitation and use list-type data.

Note

The console does not support this setting; you can configure it only through the API. (For indexing methods through the API, see the Cloud Search API guide)

<Example> Specify analysis options for a section called tag

{
  "indexProcessors": [
    {
      "distance": 1000,
      "delimiter": "\u001f",
      "type": "tokenize"
    },
    {
      "type": "hanaterm",
      "method": "atomic",
      "option": "+atomic"
    }
  ]
}

With these analysis options, data entered as a JSON list is stored with a delimiter so that the search engine can use it.
- If you enter ["Information", "Search"] in the tag section, the analysis option converts it to "Information\u001fSearch", and it is indexed, tokenized as "Information" and "Search", and used for search.
Tokenized tags are placed an arbitrary distance apart (<example> 1000) so that a search does not match across different tags. If you perform a within search and specify the distance between terms as an option, you can search without including the tags that lie between tokenized tags. (For details on within search, see search)
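
The conversion described above can be sketched in Python as follows. The delimiter and distance values come from the example configuration. The function names, and the assumption that consecutive tokens are assigned positions exactly one distance apart, are illustrative only.

```python
DELIMITER = "\u001f"  # unit separator, as in the example configuration
DISTANCE = 1000       # gap between tokenized tags, as in the example configuration


def join_list(values):
    """Flatten a JSON list into the single delimited string that gets stored."""
    return DELIMITER.join(values)


def tokenize(joined, delimiter=DELIMITER, distance=DISTANCE):
    """Split on the delimiter and give each token a position.

    Assumption: token i is placed at position i * distance.
    """
    return [(token, i * distance) for i, token in enumerate(joined.split(delimiter))]


if __name__ == "__main__":
    stored = join_list(["Information", "Search"])
    print(repr(stored))
    print(tokenize(stored))
```

Because neighboring tags end up far apart, a within search with a small term distance would not treat terms from two different tags as adjacent under this model.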
