- Elasticsearch Server: Second Edition
- Rafał Kuć, Marek Rogoziński
Elasticsearch indexing
We have our Elasticsearch cluster up and running, and we also know how to use the Elasticsearch REST API to index our data, delete it, and retrieve it. We also know how to use search to get our documents. If you are used to SQL databases, you might know that before you can start putting the data there, you need to create a structure, which will describe what your data looks like. Although Elasticsearch is a schema-less search engine and can figure out the data structure on the fly, we think that controlling the structure and thus defining it ourselves is a better way. In the following few pages, you'll see how to create new indices (and how to delete them). Before we look closer at the available API methods, let's see what the indexing process looks like.
Shards and replicas
As you recall from the previous chapter, the Elasticsearch index is built of one or more shards, and each of them contains part of your document set. Each of these shards can also have replicas, which are exact copies of the shard. During index creation, we can specify how many shards and replicas should be created. We can also omit this information and use the default values, either defined in the global configuration file (elasticsearch.yml) or implemented in Elasticsearch internals. If we rely on the Elasticsearch defaults, our index will end up with five shards and one replica. What does that mean? To put it simply, we will end up with 10 Lucene indices distributed among the cluster.
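These defaults are easy to verify on a running cluster. The following sketch (assuming Elasticsearch listens on localhost:9200 and the illustrative index name test-defaults is free) uses the _cat API to list every shard of a freshly created index:

```shell
# Create an index using only the default settings
curl -XPUT 'http://localhost:9200/test-defaults/'

# List its shards; with the defaults described above you should see
# ten rows: five primaries (marked p) and five replicas (marked r)
curl -XGET 'http://localhost:9200/_cat/shards/test-defaults?v'
```

The exact columns of the _cat/shards output depend on the Elasticsearch version, but the primary/replica flag and the node assignment are always shown.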
Having a shard and its replica, in general, means that when we index a document, we will modify them both. That's because to have an exact copy of a shard, Elasticsearch needs to inform all the replicas about the change in shard contents. In the case of fetching a document, we can use either the shard or its copy. In a system with many physical nodes, we will be able to place the shards and their copies on different nodes and thus use more processing power (such as disk I/O or CPU). To sum up, the conclusions are as follows:
- More shards allow us to spread indices to more servers, which means we can handle more documents without losing performance.
- More shards mean that fewer resources are required to fetch a particular document, because each shard holds fewer documents than it would in a deployment with fewer shards.
- More shards mean more work when searching across the index, because results have to be merged from more shards, and thus the aggregation phase of the query can be more resource intensive.
- Having more replicas results in a fault-tolerant cluster, because when the original shard is not available, its copy will take over its role. With a single replica, the cluster can lose a shard without losing data. With two replicas, we can lose the primary shard and one of its copies and everything will still work well.
- The more replicas we have, the higher the query throughput, because a query can be executed on either the primary shard or any of its copies.
Of course, these are not the only relationships between the number of shards and replicas in Elasticsearch. We will talk about most of them later in the book.
So, how many shards and replicas should we have for our indices? That depends. We believe that the defaults are quite good but nothing can replace a good test. Note that the number of replicas is less important because you can adjust it on a live cluster after index creation. You can remove and add them if you want and have the resources to run them. Unfortunately, this is not true when it comes to the number of shards. Once you have your index created, the only way to change the number of shards is to create another index and reindex your data.
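Because the replica count can be adjusted on a live cluster, a change like the one described above is a single call to the index settings API. The following is a minimal sketch, assuming an existing blog index on localhost:9200:

```shell
# Raise the number of replicas of the blog index from the default of 1
# to 2 on a live cluster; no reindexing is required for this change
curl -XPUT 'http://localhost:9200/blog/_settings' -d '{
  "index" : {
    "number_of_replicas" : 2
  }
}'
```

Changing the number of shards, on the other hand, has no such API call; as noted above, it requires creating a new index and reindexing the data.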
Creating indices
When we created our first document in Elasticsearch, we didn't care about index creation at all. We just used the following command:
curl -XPUT http://localhost:9200/blog/article/1 -d '{"title": "New version of Elasticsearch released!", "content": "...", "tags": ["announce", "elasticsearch", "release"] }'
This is fine. If such an index does not exist, Elasticsearch automatically creates the index for us. We can also create the index ourselves by running the following command:
curl -XPUT http://localhost:9200/blog/
We just told Elasticsearch that we want to create the index named blog. If everything goes right, you will see the following response from Elasticsearch:
{"acknowledged":true}
When is manual index creation necessary? There are many situations. One of them can be the inclusion of additional settings such as the index structure or the number of shards.
Sometimes, you can come to the conclusion that automatic index creation is a bad thing. When you have a big system with many processes sending data into Elasticsearch, a simple typo in the index name can destroy hours of script work. You can turn off automatic index creation by adding the following line to the elasticsearch.yml configuration file:
action.auto_create_index: false
Note
Note that action.auto_create_index is more complex than it looks. The value can be set to more than just false or true. We can also use index name patterns to specify whether an index with a given name can be created automatically if it doesn't exist. For example, the following definition allows the automatic creation of indices with names beginning with a, but disallows the creation of indices starting with an. All other indices are disallowed as well and must be created manually (because of -*):
action.auto_create_index: -an*,+a*,-*
Note that the order of pattern definitions matters. Elasticsearch checks the patterns up to the first one that matches, so if you move -an* to the end, it won't be used, because +a* will be checked first.
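With the preceding pattern list in place, the behavior can be checked by indexing into indices that do not exist yet. The index names below are made up purely for illustration:

```shell
# "access-log" begins with "a" and matches +a*,
# so the index is created automatically
curl -XPUT 'http://localhost:9200/access-log/doc/1' -d '{"title": "test"}'

# "analytics" begins with "an" and matches -an* first,
# so automatic creation is rejected with an error
curl -XPUT 'http://localhost:9200/analytics/doc/1' -d '{"title": "test"}'

# "posts" matches neither -an* nor +a*, so it falls through
# to the final -* pattern and is rejected as well
curl -XPUT 'http://localhost:9200/posts/doc/1' -d '{"title": "test"}'
```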
The manual creation of an index is also necessary when you want to set some configuration options, such as the number of shards and replicas. Let's look at the following example:
curl -XPUT http://localhost:9200/blog/ -d '{"settings" : { "number_of_shards" : 1, "number_of_replicas" : 2 } }'
The preceding command will result in the creation of the blog index with one shard and two replicas, making a total of three physical Lucene indices. There are also other values that can be set in this way; we will talk about those later in the book.
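To confirm that the settings were applied, the index configuration can be read back with a GET request (a sketch assuming the blog index created by the preceding command):

```shell
# Read back the settings of the blog index; the response should include
# the configured number_of_shards and number_of_replicas values
curl -XGET 'http://localhost:9200/blog/_settings?pretty'
```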
So, we already have our new, shiny index. But there is a problem; we forgot to provide the mappings, which are responsible for describing the index structure. What can we do? Since we have no data at all, we'll go for the simplest approach – we will just delete the index. To do that, we will run a command similar to the preceding one, but instead of using the PUT HTTP method, we use DELETE. So the actual command is as follows:
curl -XDELETE http://localhost:9200/blog
And the response will be the same as the one we saw earlier, as follows:
{"acknowledged":true}
Now that we know what an index is, how to create it, and how to delete it, we are ready to create indices with the mappings we have defined. This is a very important part, because the way data is indexed affects the search process and the way in which documents are matched.
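As a preview of what that looks like, the following sketch combines settings and mappings in a single index creation request. The field names and types are illustrative only (string with index set to not_analyzed was the way to store unprocessed text in this generation of Elasticsearch); mappings themselves are covered in detail later:

```shell
# Create the blog index with explicit settings and a mapping
# for the article type in one request
curl -XPUT 'http://localhost:9200/blog/' -d '{
  "settings" : {
    "number_of_shards" : 1,
    "number_of_replicas" : 2
  },
  "mappings" : {
    "article" : {
      "properties" : {
        "title"   : { "type" : "string" },
        "content" : { "type" : "string" },
        "tags"    : { "type" : "string", "index" : "not_analyzed" }
      }
    }
  }
}'
```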