elastictl: Import, export, re-shard and performance-test Elasticsearch indices



I work a lot with Elasticsearch in my day job. Elasticsearch is pretty famous by now, so I doubt it needs an introduction. But if you happen to not know what it is: it’s a document store with unique search capabilities and incredible scalability.

Despite its incredible features, though, it has its rough edges. And no, I don’t mean the horrific query language (honestly, who thought that was a good idea?). I mean the fact that without external tools it’s practically impossible to import, export, copy, move or re-shard an Elasticsearch index. Indices are very final, unfortunately.

This is often quite inconvenient if you have a growing index in which each shard is approaching its maximum size (a single shard cannot hold more than ~2 billion documents, a hard Lucene limit). Or if you have the opposite problem: an ES cluster with too many shards because you have too many indices (~800 shards per host is the recommendation, I think).

This is why I wrote elastictl: a simple tool to export Elasticsearch indices to a file, import them back, and/or re-shard an index. In this short post, I’ll show a few examples of how it can be used.

Usage

elastictl can be used for:

  • Backing up and restoring an Elasticsearch index
  • Performance-testing an Elasticsearch cluster (import with high concurrency, see --workers)
  • Changing the shard/replica count of an index (see the elastictl reshard command)

It’s a tiny utility, so don’t expect too much, but it has helped our work quite a bit. It lets you easily copy or move an index, or test how much indexing concurrency your cluster can handle. On my local cluster, I was able to import ~10k documents per second.

Here’s a short usage overview:

Export/dump an index to a file

To back up an index into a file, including its mapping and all of its documents, you can use the elastictl export command. It writes JSON to STDOUT. The file format is pretty simple: the first line of the output is the mapping; every following line is a document. You can even export only a subset of an index using the --search/-q option (that is, if you can master the query language).
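Here’s roughly what that can look like. Consider this a sketch: --search/-q is described above, but the --host flag, the argument order and the trailing index name are my assumptions, so check elastictl --help for the real syntax.

    # Dump the mapping and all documents of "myindex" to a file;
    # the first line of the dump is the mapping, the rest are documents
    elastictl export --host localhost:9200 myindex > myindex.json

    # Dump only a subset of the index using a query (--search/-q)
    elastictl export --host localhost:9200 -q 'status:active' myindex > subset.json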

If you’re wondering “isn’t this just like elasticdump?”: the answer is yes and no. I naturally tried elasticdump first, but it didn’t really work for me: I had issues installing it via npm, and it was quite frankly rather slow. elasticdump also doesn’t support resharding, though it has many other cool features.

Import to a new index

The elastictl import command reads a previously exported file from STDIN and writes it to a new or existing index with configurable concurrency. Using a high number of --workers, you can really hammer the ES cluster. It’s actually quite easy to make even large clusters fall over like this (assuming, of course, that you’re pointing --host to a load balancer):
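Something along these lines should do it. Again a sketch: --host and --workers are real options mentioned above, but the exact argument order and the target index name are my assumptions.

    # Re-import the dump with 50 concurrent workers; pointing --host at a
    # load balancer spreads the write load across the whole cluster
    cat myindex.json | elastictl import --host es-lb.example.com:9200 --workers 50 myindex-copy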

There are other options you can pass to the elastictl import command to modify the mapping slightly (mostly the number of replicas and the number of shards):
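For instance, something like this (the --shards/--replicas flag names are assumptions on my part; check elastictl import --help for the actual option names):

    # Import the dump, but override the shard/replica counts in the mapping
    # (flag names assumed -- see `elastictl import --help`)
    cat myindex.json | elastictl import --host localhost:9200 --shards 10 --replicas 1 myindex-copy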

Re-shard an index

The elastictl reshard command is a combination of the two commands above: it first exports an index into a file and then re-imports it with a different number of shards and/or replicas.
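In other words, something like this (again a sketch; the flag names are my assumptions):

    # Export "myindex" to disk, DELETE it, then re-import it
    # with a new shard/replica layout (flag names assumed)
    elastictl reshard --host localhost:9200 --shards 10 --replicas 1 myindex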

Note: Similar to the _reindex API in Elasticsearch, this command should only be used while the index is not being written to: documents that come in after the command has been kicked off will otherwise be lost. Please also note that the command DELETEs the index after exporting it. A copy will remain on disk, though.

Feedback is welcome

elastictl is a tiny little tool, and I’m sure there are others that do a similar job. The tool is open source and available under the Apache 2.0 license, so please feel free to send contributions via a pull request on GitHub.

3 Comments

  1. Coder

    Good post. After this, I got stuck on a Filebeat 401 Unauthorized error with AWS Elasticsearch.
    Got help from
    https://learningsubway.com/filebeat-401-unauthorized-error-with-aws-elasticsearch/


  2. Bene

    As far as I can see, elastictl splits the docs for re-import and creates a single HTTP request per document.

    Will you implement this as a bulk import, too? I’d guess that if you do, you’ll easily exceed 10k docs/s.


  3. Bene

    With these lines, I’m able to index ca. 10k docs/s, but now the single Python process is the bottleneck, running at 100% CPU. I’m feeding it with a raw dump, which gets decompressed on the fly by zstd.

    But I think with Go, it would be no issue to make zstd run at 100% CPU ;-)


    #!/usr/bin/env python3

    from collections import deque
    from elasticsearch import Elasticsearch
    from elasticsearch import helpers
    import jsonlines
    import sys

    # Connect to the cluster (URL left blank here -- fill in something
    # like "http://localhost:9200")
    es = Elasticsearch("")

    # Read newline-delimited JSON bulk actions from STDIN and index them
    # in parallel: 8 threads, 5000 documents per bulk request
    pb = helpers.parallel_bulk(es, actions=jsonlines.Reader(sys.stdin),
                               chunk_size=5000, thread_count=8)

    # parallel_bulk is a lazy generator; drain it without storing results
    deque(pb, maxlen=0)