At work, I use Elasticsearch a lot. Elasticsearch is pretty famous by now, so I doubt it needs an introduction. But if you happen to not know what it is: it’s a document store with unique search capabilities and incredible scalability.
Despite its incredible features, though, it has its rough edges. And no, I don’t mean the horrific query language (honestly, who thought that was a good idea?). I mean the fact that without external tools it’s pretty much impossible to import, export, copy, move or re-shard an Elasticsearch index. Indices are very final, unfortunately.
This is often quite inconvenient if you have a growing index for which each shard is outgrowing its recommended size (shards max out at roughly 2 billion documents). It’s just as inconvenient if you have the opposite problem: an ES cluster with too many shards (~800 shards per host is the recommendation, I think), because you have too many indices.
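If you want to check where you stand, Elasticsearch’s cat APIs show shard counts and sizes per index. This is plain Elasticsearch, not elastictl, and the sketch below assumes a cluster listening on localhost:9200:

# Shard/replica count, document count and size per index (plain ES cat API, not elastictl)
curl -s 'localhost:9200/_cat/indices?v&h=index,pri,rep,docs.count,store.size'

# Per-shard breakdown, including the node each shard lives on
curl -s 'localhost:9200/_cat/shards?v&h=index,shard,prirep,docs,store,node'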
This is why I wrote elastictl: elastictl is a simple tool to import/export Elasticsearch indices into a file, and/or reshard an index. In this short post, I’ll show a few examples of how it can be used.
Usage
elastictl can be used for:
- Backup/restore of an Elasticsearch index
- Performance test an Elasticsearch cluster (import with high concurrency, see --workers)
- Change the shard/replica count of an index (see elastictl reshard command)
It’s a tiny utility, so don’t expect too much, but it’s helped our work quite a bit. It lets you easily copy or move an index, or test how much indexing concurrency your cluster can handle. In my local cluster, I was able to import ~10k documents per second.
Here’s a short usage overview:
$ elastictl
NAME:
   elastictl - Elasticsearch toolkit

USAGE:
   elastictl COMMAND [OPTION..] [ARG..]

COMMANDS:
   export, e   Export an entire index to STDOUT
   import, i   Write to ES index from STDIN
   reshard, r  Reshard index using different shard/replica counts

Try 'elastictl COMMAND --help' for more information.

elastictl 0.0.5 (e645803), runtime go1.16, built at 2021-04-14T15:05:42Z
Copyright (C) 2021 Philipp C. Heckel, distributed under the Apache License 2.0
Export/dump an index to a file
To back up an index into a file, including its mapping and all of its documents, you can use the elastictl export command. It will write JSON to STDOUT. The file format is pretty simple: the first line of the output is the index mapping, and every following line is a document. You can even export only a subset of an index using the --search/-q option (that is, if you can master the query language).
# Entire index (assumes that ES is running at localhost:9200)
elastictl export my-index | gzip > my-index.json.gz

# Only a subset of documents
elastictl export \
   --host 10.0.1.2:9200 \
   --search '{"query":{"bool":{"must_not":{"match":{"eventType":"Success"}}}}}' \
   my-index > my-index-no-successes.json
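To give you a rough idea of what that looks like, here’s a made-up sketch of a dump following the description above (the exact JSON layout elastictl emits may differ; the fields are just illustrative):

$ zcat my-index.json.gz | head -n 3
# (hypothetical output) first line: the mapping, every following line: one document
{"mappings":{"properties":{"eventType":{"type":"keyword"}}}}
{"_id":"1","_source":{"eventType":"Success"}}
{"_id":"2","_source":{"eventType":"Failure"}}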
If you’re wondering “isn’t this just like elasticdump?”: the answer is yes and no. I naturally tried elasticdump first, but it didn’t really work for me: I had issues installing it via npm, and it was quite frankly rather slow. elasticdump also doesn’t support resharding, though it has many other cool features.
Import to new index
The elastictl import command will read from STDIN and write a previously exported file to a new or existing index with configurable concurrency. Using a high number of --workers, you can really hammer the ES cluster. It’s actually quite easy to make even large clusters fall over like this (assuming of course that you’re pointing --host to a load balancer):
# With high concurrency
zcat my-index.json.gz | elastictl import --workers 100 my-index-copy

# Just copy an index
elastictl export my-index | elastictl import my-index-copy2
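If you want to double-check that such a copy is complete, you can compare document counts with plain Elasticsearch APIs (this is not part of elastictl; it assumes the cluster runs at localhost:9200):

# Compare document counts of the original index and the copy (plain ES API, not elastictl)
curl -s 'localhost:9200/my-index/_count'
curl -s 'localhost:9200/my-index-copy/_count'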
There are other options you can pass to the elastictl import command to modify the mapping slightly (mostly the number of replicas and the number of shards):
$ elastictl import --help
NAME:
   elastictl import - Write to ES index from STDIN

USAGE:
   elastictl import INDEX

OPTIONS:
   --host value, -H value      override default host (default: localhost:9200)
   --workers value, -w value   number of concurrent workers (default: 50)
   --shards value, -s value    override the number of shards on index creation (default: no change)
   --replicas value, -r value  override the number of replicas on index creation (default: no change)
   --no-create, -N             do not create index (default: false)
   --help, -h                  show help (default: false)
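Based on these options, you could, for instance, copy an index and give the copy a different shard/replica count in one go. The index names and counts below are just placeholders:

# Copy an index, creating the target with 5 shards and 1 replica (example values)
elastictl export my-index | elastictl import --shards 5 --replicas 1 my-index-copy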
Re-shard an index
The elastictl reshard command is a combination of the two commands above: it first exports an index to a file and then re-imports it with a different number of shards and/or replicas.
# Set number of shards of the "my-index" index to 10 and the number of replicas to 1
elastictl reshard \
   --shards 10 \
   --replicas 1 \
   my-index

# Export a subset of the "my-index" index and re-import it with a smaller number of shards/replicas
elastictl reshard \
   --search '{"query":{"bool":{"must_not":{"match":{"eventType":"Success"}}}}}' \
   --shards 1 \
   --replicas 1 \
   my-index
Note: Similar to the _reindex API in Elasticsearch, this command should only be used while the index is not being written to, because documents that come in after the command has been kicked off will otherwise be lost. Please also note that the command DELETEs the index after exporting it. A copy will be available on disk, though.
Feedback is welcome
elastictl is a tiny little tool, and I’m sure there are others that do a similar job. The tool is open source and available under the Apache 2.0 license, so please feel free to send contributions via a pull request on GitHub.
Good post
After this, I got stuck with a FileBeat 401 Unauthorized error with AWS Elasticsearch.
Got help from
https://learningsubway.com/filebeat-401-unauthorized-error-with-aws-elasticsearch/
As far as I can see, elastictl splits the docs for re-import and creates a single HTTP request per document.
Will you implement this with bulk requests, too? I’d guess that if you do, you’ll easily exceed 10k docs/s.
With the lines below, I’m able to index ca. 10k docs/s, but now the single Python process is the bottleneck, running at 100% CPU. I’m feeding it a raw dump, which gets decompressed on the fly by zstd.
But I think with Go it would be no problem at all to make zstd run at 100% CPU ;-)
#!/usr/bin/env python3
from collections import deque
from elasticsearch import Elasticsearch
from elasticsearch import helpers
import jsonlines
import sys

# Elasticsearch client (host string left empty here)
es = Elasticsearch("")

# Stream JSON lines from STDIN and index them with 8 parallel bulk threads,
# 5000 documents per bulk request
pb = helpers.parallel_bulk(es, actions=jsonlines.Reader(sys.stdin), chunk_size=5000, thread_count=8)

# Drain the generator so the bulk requests are actually executed
deque(pb, maxlen=0)
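(For context, the script above would presumably be fed something like this, assuming a zstd-compressed JSON-lines dump called my-index.jsonl.zst and the script saved as bulk-import.py:)

# hypothetical invocation: decompress the dump on the fly and pipe it into the script
zstdcat my-index.jsonl.zst | ./bulk-import.py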