Script to replicate a single Hive database from one HDP cluster to another. It uses the Hive REPL features that DLM uses under the hood, and supports replication of managed tables, external tables, and views. Before running the script, try the manual steps in the ManualSteps.md document; for instructions, see the section below. Tested on HDP 3.1.4, but HDP 3.1.5 or CDP versions are recommended for a more mature Hive REPL implementation.
- env.sh: Configure environment variables here before running the script.
- hive3repl.sh: Main script to invoke. Run as shown later in the doc.
- beeline_functions.sh: Functions that make beeline calls are defined here.
- init_variables.sh: All global variables are initialized here. Do not change anything here.
- misc_functions.sh: Miscellaneous functions for logging etc. are defined here.
| Parameter | Description |
|---|---|
| TARGET_JDBC_URL | JDBC URL for target cluster. Copy this value from the Ambari UI. |
| SOURCE_JDBC_URL | JDBC URL for source cluster. Copy this value from the Ambari UI. |
| INCLUDE_EXTERNAL_TABLES | true/false |
| REPL_ROOT | Location in source HDFS where dump data will be written. Used only to verify that the REPL DUMP output path starts with this prefix. |
| SOURCE_HDFS_PREFIX | Prefix for accessing source-cluster HDFS locations from the target. Can be the NameNode IP:port or the cluster nameservice id, e.g. hdfs://c2186-node2.coelab.cloudera.com:8020 or hdfs://c2186 |
| BEELINE_USER | User running beeline. In kerberized environments this may be ignored. |
| INCR_RERUN | Number of attempts to retry a failed incremental repl load |
| RERUN_SLEEP | Sleep interval in seconds between retry attempts |
| TMP_DIR | Directory to store temporary files used for parsing beeline output. Default: ./tmp |
| LOG_DIR | Directory to write script logs. Default: ./logs |
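A minimal env.sh covering the parameters above might look like the following sketch. All hostnames, URLs, and values are illustrative placeholders, not values from any real cluster; substitute your own.

```shell
# env.sh -- example values only; replace with your own cluster details.

# JDBC URLs as shown in the Ambari UI (Hive > Summary)
export SOURCE_JDBC_URL="jdbc:hive2://source-zk1:2181,source-zk2:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2"
export TARGET_JDBC_URL="jdbc:hive2://target-zk1:2181,target-zk2:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2"

# Whether external tables are included in the dump/load
export INCLUDE_EXTERNAL_TABLES=false

# HDFS root at source where REPL DUMP writes its output
export REPL_ROOT=/apps/hive/repl

# Prefix used by the target to read source HDFS (NameNode address or nameservice id)
export SOURCE_HDFS_PREFIX="hdfs://source-nn:8020"

export BEELINE_USER=hive

# Retry behaviour for a failed incremental REPL LOAD
export INCR_RERUN=3        # retry attempts
export RERUN_SLEEP=30      # seconds between attempts

# Working directories
export TMP_DIR=./tmp
export LOG_DIR=./logs
```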
The workflow is as follows:
- Takes in a database name alongside other env variables.
- Env variables are picked up from the configuration file.
- Looks up the current replication status of that database on the DR Hive instance.
- If the database is not present there, it connects to the Production Hive instance and generates a full dump of the database.
- The dump returns the HDFS path of the dump and the latest transaction id at Production.
- Once the Production dump is generated, it connects back to the DR Hive instance and passes the HDFS location to the DR instance to replicate the database.
- The events are then replayed on the DR replica database instance.
- Post replay, the replication status is looked up again in DR.
- Based on a comparison of the Production transaction id and the post-replay transaction id at DR, a success or failure message is displayed.
- If an existing version of the database exists in DR, it connects to the Production Hive instance and passes the last replicated id from DR to Production.
- The Production Hive instance then generates an incremental dump of the database and returns the HDFS path of the dump and the latest transaction id at Production.
- Once the Production dump is generated, it connects back to the DR Hive instance and the events are replayed on the DR replica database instance.
- If the replay fails, it is retried up to ${INCR_RERUN} times.
- Post replay, the replication status is looked up again in DR.
- Based on a comparison of the Production transaction id and the post-replay transaction id at DR, a success or failure message is displayed.
- All messages are printed both to console and a log file.
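The workflow above is driven by a handful of Hive REPL statements (REPL STATUS, REPL DUMP, REPL LOAD). The helper functions below are a hypothetical sketch of how those statements are assembled, not code taken from beeline_functions.sh:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: build the HiveQL REPL statements the workflow issues.
# The real logic lives in beeline_functions.sh.

# Full dump if no previous replicated id exists at DR, incremental otherwise.
build_dump_stmt() {
  local db="$1" last_repl_id="$2"
  if [ -z "$last_repl_id" ]; then
    echo "REPL DUMP ${db};"                       # first-time full dump
  else
    echo "REPL DUMP ${db} FROM ${last_repl_id};"  # incremental dump
  fi
}

# REPL STATUS at the target returns the last replicated transaction id.
build_status_stmt() { echo "REPL STATUS $1;"; }

# REPL LOAD replays the dump (at the HDFS path returned by REPL DUMP).
build_load_stmt() { echo "REPL LOAD $1 FROM '$2';"; }

# Example usage -- these strings would be passed to beeline via -e:
build_dump_stmt repltest ""       # full dump statement
build_dump_stmt repltest 1029     # incremental dump from id 1029
build_status_stmt repltest
```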
- Update configurations in env.sh
- Ensure the source database has the repl.source.for property set. See ManualSteps.md for more details.
- Ensure the database is created on the target cluster.
- If the cluster is kerberized, obtain the Kerberos tickets.
- Launch the script with the database name as the argument.
- Track the script's progress in the log file whose name is printed to the console for each run.
- Upon successful completion, the updated transaction id at the target database will be displayed.
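The retry behaviour for a failed incremental load (governed by INCR_RERUN and RERUN_SLEEP) can be sketched as a generic wrapper. run_with_retries is a hypothetical helper for illustration, not the actual function in hive3repl.sh:

```shell
#!/usr/bin/env bash
# Hypothetical retry wrapper mirroring the INCR_RERUN / RERUN_SLEEP behaviour.

INCR_RERUN=${INCR_RERUN:-3}    # attempts before giving up
RERUN_SLEEP=${RERUN_SLEEP:-1}  # seconds between attempts

run_with_retries() {
  local attempt=1
  while [ "$attempt" -le "$INCR_RERUN" ]; do
    if "$@"; then
      return 0                 # load succeeded
    fi
    echo "Attempt ${attempt}/${INCR_RERUN} failed; retrying in ${RERUN_SLEEP}s" >&2
    sleep "$RERUN_SLEEP"
    attempt=$((attempt + 1))
  done
  return 1                     # all retries exhausted
}

# Example: a command that only succeeds on its third invocation.
tries=0
flaky() { tries=$((tries + 1)); [ "$tries" -ge 3 ]; }
run_with_retries flaky && echo "load replayed after ${tries} attempts"
```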
First time - FULL DUMP

```
[hive@c4186-node3 Hive_acid_table_replication]$ bash hive3repl.sh repltest
2020-12-28 08:28:00.678 Initiating run to replicate repltest to repltest
2020-12-28 08:28:05.852 No replication id detected at target. Full data dump dump needs to be initiated.
2020-12-28 08:28:05.857 Database repltest is being synced for the first time. Initiating full dump.
2020-12-28 08:28:05.861 Skipping external tables in full dump
2020-12-28 08:28:11.831 Source transaction id: |1029|
2020-12-28 08:28:11.837 Database repltest full dump has been generated at |hdfs://c2186-node2.coelab.cloudera.com:8020/apps/hive/repl/5d117227-38c4-4b1c-826b-e3222b9dfbc3|.
2020-12-28 08:28:11.843 The current transaction ID at source is |1029|
2020-12-28 08:28:11.847 There are 1029 transactions to be synced in this run.
2020-12-28 08:28:11.851 Initiating data load at target cluster on database repltest.
2020-12-28 08:28:11.856 External tables not included.
2020-12-28 08:28:31.926 Data load at target cluster completed. Verifying....
2020-12-28 08:28:37.368 Database replication completed SUCCESSFULLY. Last transaction id at target is |1029|
```
INCREMENTAL DUMP

```
[hive@c4186-node3 Hive_acid_table_replication]$ bash hive3repl.sh repltest
2020-12-28 08:29:01.297 Initiating run to replicate repltest to repltest
2020-12-28 08:29:06.402 Database repltest transaction ID at target is currently |1029|
2020-12-28 08:29:06.407 Skipping external tables in incremental dump
2020-12-28 08:29:22.185 The current transaction ID at source is |1034|
2020-12-28 08:29:22.190 Database repltest incremental dump has been generated at |hdfs://c2186-node2.coelab.cloudera.com:8020/apps/hive/repl/276b8090-7ddf-43ed-929b-6adc40f42249|.
2020-12-28 08:29:22.196 There are 5 transactions to be synced in this run.
2020-12-28 08:29:22.201 External tables not included.
2020-12-28 08:29:41.424 Data load at target cluster completed. Verifying....
2020-12-28 08:29:46.790 Database replication completed SUCCESSFULLY. Last transaction id at target is |1034|
```