# An Elasticsearch Migration Tool

Elasticsearch cross-version data migration.

Links:

- [Dec 3rd, 2020: [EN] Cross version Elasticsearch data migration with ESM](https://discuss.elastic.co/t/dec-3rd-2020-en-cross-version-elasticsearch-data-migration-with-esm/256516)
- [Use INFINI Gateway to check the Document-Level differences between two clusters or indices after the migration](https://gateway.infinilabs.com/docs/tutorial/index_diff/)

## Features:

* Cross-version migration supported
* Overwrite index name
* Copy index settings and mapping
* Support HTTP basic auth
* Support dumping an index to a local file
* Support loading an index from a local file
* Support HTTP proxy
* Support sliced scroll (Elasticsearch 5.0+)
* Support running in the background
* Generate testing data by randomizing the source document id
* Support renaming field names
* Support unifying document type names
* Support specifying which `_source` fields to return from the source
* Support a query string query to filter the data source
* Support renaming source fields while doing bulk indexing
* Support incremental updates (add/update/delete changed records) with `--sync`. Note: this uses a different implementation that handles only the ***changed*** records, so it is not as fast as the regular mode
* Load generation with `--repeat_times`

## ESM is fast!

A 3-node cluster (3 × c5d.4xlarge: 16 cores, 32 GB RAM, 10 Gbps each):

```
root@ip-172-31-13-181:/tmp# ./esm -s https://localhost:8000 -d https://localhost:8000 -x logs1kw -y logs122 -m elastic:medcl123 -n elastic:medcl123 -w 40 --sliced_scroll_size=60 -b 5 --buffer_count=2000000 --regenerate_id
[12-19 06:31:20] [INF] [main.go:506,main] start data migration..
Scroll 10064570 / 10064570 [=================================================] 100.00% 55s
Bulk 10062602 / 10064570 [==================================================] 99.98% 55s
[12-19 06:32:15] [INF] [main.go:537,main] data migration finished.
```

10,000,000 documents (Nginx logs generated from `kibana_sample_data_logs`) migrated within a minute.
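A quick sanity check after a run like the one above is to compare document counts on the source and destination indices. This is not part of ESM itself; it is Elasticsearch's standard `_count` API (host, credentials, and index names below are taken from the benchmark; `-k` is only needed for self-signed certificates):

```
# total documents in the source index
curl -sk -u elastic:medcl123 "https://localhost:8000/logs1kw/_count"
# total documents in the destination index -- the two counts should match
curl -sk -u elastic:medcl123 "https://localhost:8000/logs122/_count"
```

If the counts differ, the gateway-based index diff linked above can pinpoint which documents are missing or changed.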
## Before ESM

Before running esm, please manually prepare the target index with the mapping and optimized settings to improve speed, for example:

```
PUT your-new-index
{
  "settings": {
    "index.translog.durability": "async",
    "refresh_interval": "-1",
    "number_of_shards": 10,
    "number_of_replicas": 0
  }
}
```

## Example:

Copy index `index_name` from `192.168.1.x` to `192.168.1.y:9200`:

```
./bin/esm -s http://192.168.1.x:9200 -d http://192.168.1.y:9200 -x index_name -w=5 -b=10 -c 10000
```

Copy index `src_index` from `192.168.1.x` to `192.168.1.y:9200` and save it as `dest_index`:

```
./bin/esm -s http://localhost:9200 -d http://localhost:9200 -x src_index -y dest_index -w=5 -b=100
```

Use the sync feature to incrementally update index `src_index` from `192.168.1.x` to `192.168.1.y:9200`:

```
./bin/esm --sync -s http://localhost:9200 -d http://localhost:9200 -x src_index -y dest_index
```

Use basic auth:

```
./bin/esm -s http://localhost:9200 -x "src_index" -y "dest_index" -d http://localhost:9201 -n admin:111111
```

Copy settings and override the shard count:

```
./bin/esm -s http://localhost:9200 -x "src_index" -y "dest_index" -d http://localhost:9201 -m admin:111111 -c 10000 --shards=50 --copy_settings
```

Copy settings and mappings, recreate the target index, add a query to the source fetch, and refresh after migration:

```
./bin/esm -s http://localhost:9200 -x "src_index" -q=query:phone -y "dest_index" -d http://localhost:9201 -c 10000 --shards=5 --copy_settings --copy_mappings --force --refresh
```

Dump Elasticsearch documents into a local file:

```
./bin/esm -s http://localhost:9200 -x "src_index" -m admin:111111 -c 5000 -q=query:mixer --refresh -o=dump.bin
```

Dump the source and target indices to local files and compare them, to find the differences quickly:

```
./bin/esm --sort=_id -s http://localhost:9200 -x "src_index" --truncate_output --skip=_index -o=src.json
./bin/esm --sort=_id -s http://localhost:9200 -x "dst_index" --truncate_output --skip=_index -o=dst.json
diff -W 200 -ry --suppress-common-lines src.json dst.json
```

Load data from a dump file and bulk insert it into another Elasticsearch instance:

```
./bin/esm -d http://localhost:9200 -y "dest_index" -n admin:111111 -c 5000 -b 5 --refresh -i=dump.bin
```

Use a proxy:

```
./bin/esm -d http://123345.ap-northeast-1.aws.found.io:9200 -y "dest_index" -n admin:111111 -c 5000 -b 1 --refresh -i dump.bin --dest_proxy=http://127.0.0.1:9743
```

Use sliced scroll (only available in Elasticsearch 5.0+) to speed up scrolling, and update the shard count:

```
./bin/esm -s=http://192.168.3.206:9200 -d=http://localhost:9200 -n=elastic:changeme -f --copy_settings --copy_mappings -x=bestbuykaggle --sliced_scroll_size=5 --shards=50 --refresh
```

Migrate from 5.x to 6.x and unify all the types to `doc`:

```
./esm -s http://source_es:9200 -x "source_index*" -u "doc" -w 10 -b 10 -t "10m" -d https://target_es:9200 -m elastic:passwd -n elastic:passwd -c 5000
```

To migrate to version 7.x you may need to rename `_type` to `_doc`:

```
./esm -s http://localhost:9201 -x "source" -y "target" -d https://localhost:9200 --rename="_type:type,age:myage" -u "_doc"
```

Filter the migration with a range query:

```
./esm -s https://192.168.3.98:9200 -m elastic:password -o json.out -x kibana_sample_data_ecommerce -q "order_date:[2020-02-01T21:59:02+00:00 TO 2020-03-01T21:59:02+00:00]"
```

Range query on a keyword field, with escaping:

```
./esm -s https://192.168.3.98:9200 -m test:123 -o 1.txt -x test1 -q "@timestamp.keyword:[\"2021-01-17 03:41:20\" TO \"2021-03-17 03:41:20\"]"
```
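Before kicking off a long migration with `-q`, it can help to validate the query string directly against the source cluster. The same Lucene query-string syntax is accepted by the `q` parameter of Elasticsearch's `_search` API (host, credentials, and index are taken from the range-query example above; `-k` is only needed for self-signed certificates):

```
# size=0 returns only the hit count, not the documents
curl -Gsk -u elastic:password "https://192.168.3.98:9200/kibana_sample_data_ecommerce/_search" \
  --data-urlencode "q=order_date:[2020-02-01T21:59:02+00:00 TO 2020-03-01T21:59:02+00:00]" \
  --data-urlencode "size=0"
```

`hits.total` in the response is the number of documents the migration would read.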
Generate testing data: if `input.json` contains 10 documents, the following command will ingest 100 documents, which is useful for testing:

```
./bin/esm -i input.json -d http://localhost:9201 -y target-index1 --regenerate_id --repeat_times=10
```

Select source fields:

```
./bin/esm -s http://localhost:9201 -x my_index -o dump.json --fields=author,title
```

Rename fields while doing bulk indexing:

```
./bin/esm -i dump.json -d http://localhost:9201 -y target-index41 --rename=title:newtitle
```

Use `buffer_count` to control the memory used by ESM, and use gzip to compress network traffic:

```
./esm -s https://localhost:8000 -d https://localhost:8000 -x logs1kw -y logs122 -m elastic:medcl123 -n elastic:medcl123 --regenerate_id -w 20 --sliced_scroll_size=60 -b 5 --buffer_count=1000000 --compress false
```

## Download

https://github.com/medcl/esm/releases

## Compile:

If the downloaded release does not fit your environment, you can try compiling it yourself. `go` (version >= 1.7) is required:

`make build`
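If you prefer building from a clone, a minimal sequence looks like this (the clone URL is inferred from the release page above; adjust it if your fork differs):

```
git clone https://github.com/medcl/esm.git
cd esm
make build   # the build target mentioned above; requires Go >= 1.7
```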
## Options

```
Usage:
  esm [OPTIONS]

Application Options:
  -s, --source=                    source elasticsearch instance, ie: http://localhost:9200
  -q, --query=                     query against source elasticsearch instance, filter data before migrate, ie: name:medcl
      --sort=                      sort field when scroll, ie: _id (default: _id)
  -d, --dest=                      destination elasticsearch instance, ie: http://localhost:9201
  -m, --source_auth=               basic auth of source elasticsearch instance, ie: user:pass
  -n, --dest_auth=                 basic auth of target elasticsearch instance, ie: user:pass
  -c, --count=                     number of documents at a time: ie "size" in the scroll request (10000)
      --buffer_count=              number of buffered documents in memory (100000)
  -w, --workers=                   concurrency number for bulk workers (1)
  -b, --bulk_size=                 bulk size in MB (5)
  -t, --time=                      scroll time (1m)
      --sliced_scroll_size=        size of sliced scroll, to make it work, the size should be > 1 (1)
  -f, --force                      delete destination index before copying
  -a, --all                        copy indexes starting with . and _
      --copy_settings              copy index settings from source
      --copy_mappings              copy index mappings from source
      --shards=                    set a number of shards on newly created indexes
  -x, --src_indexes=               indexes name to copy, support regex and comma separated list (_all)
  -y, --dest_index=                index name to save to, allow only one index name, original index name will be used if not specified
  -u, --type_override=             override type name
      --green                      wait for both hosts' cluster status to be green before dump, otherwise yellow is okay
  -v, --log=                       set log level, options: trace, debug, info, warn, error (INFO)
  -o, --output_file=               output documents of source index into local file
      --truncate_output=           truncate output file before dumping
  -i, --input_file=                indexing from local dump file
      --input_file_type=           the data type of input file, options: dump, json_line, json_array, log_line (dump)
      --source_proxy=              set proxy for source http connections, ie: http://127.0.0.1:8080
      --dest_proxy=                set proxy for target http connections, ie: http://127.0.0.1:8080
      --refresh                    refresh after migration finished
      --sync=                      sync will use scroll for both source and target index, compare the data and sync (index/update/delete)
      --fields=                    filter source fields (white list), comma separated, ie: col1,col2,col3,...
      --skip=                      skip source fields (black list), comma separated, ie: col1,col2,col3,...
      --rename=                    rename source fields, comma separated, ie: _type:type, name:myname
  -l, --logstash_endpoint=         target logstash tcp endpoint, ie: 127.0.0.1:5055
      --secured_logstash_endpoint  target logstash tcp endpoint secured by TLS
      --repeat_times=              repeat the data from source N times to dest output, use together with regenerate_id to amplify the data size
  -r, --regenerate_id              regenerate id for documents, this will override the existing document id in the data source
      --compress                   use gzip to compress traffic
  -p, --sleep=                     sleep N seconds after finishing a bulk request (-1)

Help Options:
  -h, --help                       Show this help message
```

## FAQ

- Scroll ID too long? Update `elasticsearch.yml` on the source cluster:

```
http.max_header_size: 16k
http.max_initial_line_length: 8k
```

## Versions

| From | To  |
|------|-----|
| 1.x  | 1.x |
| 1.x  | 2.x |
| 1.x  | 5.x |
| 1.x  | 6.x |
| 1.x  | 7.x |
| 2.x  | 1.x |
| 2.x  | 2.x |
| 2.x  | 5.x |
| 2.x  | 6.x |
| 2.x  | 7.x |
| 5.x  | 1.x |
| 5.x  | 2.x |
| 5.x  | 5.x |
| 5.x  | 6.x |
| 5.x  | 7.x |
| 6.x  | 1.x |
| 6.x  | 2.x |
| 6.x  | 5.0 |
| 6.x  | 6.x |
| 6.x  | 7.x |
| 7.x  | 1.x |
| 7.x  | 2.x |
| 7.x  | 5.x |
| 7.x  | 6.x |
| 7.x  | 7.x |
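To confirm which row of the matrix applies to your clusters, the root endpoint of each Elasticsearch instance reports its version (hosts and credentials below are placeholders reused from the earlier examples):

```
# version.number in the response identifies the major version of each cluster
curl -s -u admin:111111 http://localhost:9200 | grep '"number"'
curl -s -u admin:111111 http://localhost:9201 | grep '"number"'
```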