In this blog, we explore how you can seamlessly upgrade your Hive metastore* schemas and external tables to the Unity Catalog metastore using the new SYNC command. The SYNC command can also be used to push updates from the source schemas and external tables in Hive metastore to the Unity Catalog metastore schemas and tables that were previously upgraded. The SYNC command is currently in public preview on AWS and Azure.
*Note: Hive metastore could be your default or external metastore, or even an AWS Glue metastore. For simplicity, we have used the term “Hive metastore” throughout this document.
Common use cases for upgrading and syncing your Hive metastore to Unity Catalog
Unity Catalog, now generally available on AWS and Azure, offers a multitude of out-of-the-box centralized governance features such as unified access and audit controls for all data assets in your data Lakehouse, automated data lineage for all workloads, built-in data search and discovery, privilege inheritance for simplified access management, open cross-platform data sharing, and many more.
One of the common questions that comes to mind is how you can easily upgrade or migrate the tables and schemas registered in your existing Hive metastore to the Unity Catalog metastore and keep Unity Catalog in sync with the Hive metastore. While you would want to take advantage of all the rich features Unity Catalog has to offer, there can be various scenarios where you need the Hive metastore objects to coexist even after migrating the objects to the Unity Catalog metastore. For example, you might have an ETL pipeline that writes data to tables stored in Hive metastore, and you need to perform a detailed impact analysis before gradually migrating the tables to the Unity Catalog metastore. Until then, you need to keep your Hive metastore and the Unity Catalog metastore in sync.
Here are the common questions we have heard from our customers:
- How do we migrate our data workloads from two-level namespaces (schema and tables/views) to Unity Catalog’s three-level namespaces (catalog, schema, tables/views)?
- Do we need to copy data from the existing location to a new location for the table in the Unity Catalog metastore, or do we just need to create a new schema and table in the Unity Catalog metastore and point to the existing location?
- How can we maintain access to Hive metastore tables while beginning to leverage Unity Catalog, and keep changes to the schema in sync?
- Can we have an assessment of what steps would be required to move our Hive metastore (HMS) objects to the Unity Catalog metastore?
Introducing the SYNC command in Unity Catalog
To facilitate the seamless migration of your schemas and external tables from your existing Hive metastore to the Unity Catalog metastore, we have introduced a utility called SYNC. The SYNC command helps you migrate your existing Hive metastore to the Unity Catalog metastore and also helps keep both metastores in sync on an ongoing basis until you completely migrate all your dependent applications from Hive metastore to the Unity Catalog metastore. Instead of allocating resources to build a custom solution, SYNC provides you with a simple out-of-the-box solution to keep your existing Hive metastore and the Unity Catalog metastore in sync.
Key features of SYNC
- Ability to upgrade an external table from Hive metastore to the Unity Catalog metastore and keep the metadata of the two tables in sync.
- Ability to upgrade all eligible tables in a Hive metastore schema to the Unity Catalog metastore and keep the metadata in sync. It uses multithreading to upgrade multiple tables in parallel.
- Dry run mode to show the result of the SYNC command without creating or updating the target tables.
- Ability to run SYNC multiple times on the same schema or tables to keep the source and target metastores in sync.
How does it work
The SYNC command abstracts all the complexities of migrating a schema and external tables from the Hive metastore to the Unity Catalog metastore and keeping them in sync. Once executed, it analyzes the source and target tables or schemas and performs the following operations:
- If the target table does not exist, the sync operation creates a target table with the same name as the source table in the provided target schema. The owner of the target table defaults to the user who is running the SYNC command.
- If the target table exists, and the table is determined to have been created by a previous SYNC command or upgraded via the web interface, the sync operation updates the table so that its schema matches the schema of the source table.
The command outputs one row per table that is upgraded, including a status_code column and a description column. The status_code column indicates the status of the upgrade for that table, and the description column provides a descriptive message for each table.
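As a rough illustration of working with that output, the Python sketch below tallies status codes and collects the rows that need attention. It assumes the result rows have already been fetched into plain dictionaries keyed by the status_code and description columns. The function name, the sample rows, and the EXAMPLE_FAILURE_CODE value are hypothetical and are not part of SYNC itself; only SUCCESS and DRY_RUN_SUCCESS are status codes named in this post.

```python
from collections import Counter

# The two success codes named in this post; anything else is flagged for review.
OK_CODES = {"SUCCESS", "DRY_RUN_SUCCESS"}


def summarize_sync_results(rows):
    """Return (count per status_code, rows whose status is not a success code)."""
    counts = Counter(row["status_code"] for row in rows)
    problems = [row for row in rows if row["status_code"] not in OK_CODES]
    return counts, problems


# Hypothetical rows shaped like the two columns described above.
sample_rows = [
    {"status_code": "DRY_RUN_SUCCESS", "description": ""},
    {"status_code": "EXAMPLE_FAILURE_CODE", "description": "placeholder message"},
]

counts, problems = summarize_sync_results(sample_rows)
for row in problems:
    print(row["status_code"], "-", row["description"])
```

In a notebook, the rows would typically come from running the SYNC statement and collecting its result rather than from a hand-written list.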
Getting started with the SYNC command
The user running the SYNC command should:
- Be the owner of the source table when using “SYNC TABLE”
- Be the owner of the source schema when using “SYNC SCHEMA”
Note: The current version of SYNC only supports upgrades of external tables. Please refer to the documentation for upgrading your Hive metastore managed tables and views to the Unity Catalog metastore. You can also use the table clone command to create a copy of an existing Hive metastore managed table at a specific version in the Unity Catalog metastore. Read this blog to learn more about table clones in Databricks.
There are two options for the upgrade using SYNC:
- SYNC TABLE: upgrades a table from Hive metastore to the Unity Catalog metastore
- SYNC SCHEMA: upgrades all eligible tables in a schema from Hive metastore to the Unity Catalog metastore
The SYNC command upgrades tables or schemas from Hive metastore to the Unity Catalog metastore. It can be used to create new tables in the Unity Catalog metastore from existing tables in Hive metastore. It can also be used to push updates from the source tables in Hive metastore to the Unity Catalog metastore tables that were previously upgraded using the SYNC command or via the web UI.
An optional DRY RUN clause can be used to evaluate whether a table can be upgraded to Unity Catalog. In DRY RUN mode, the command checks whether the given source table can be upgraded to the Unity Catalog metastore and provides a status_code and a descriptive error message if it cannot be upgraded. If the table can be upgraded, the status code shows DRY_RUN_SUCCESS in DRY RUN mode and SUCCESS when the table is successfully synced.
SYNC TABLE target_tbl FROM source_table [DRY RUN]
Please visit our documentation to look up details on the parameters of the SYNC command.
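For scripted upgrades, it can help to generate the statement text rather than type it by hand. The following Python sketch builds a statement in the shape shown above. The helper name, the conservative identifier check, and the uc_catalog.sales.orders names are illustrative assumptions; only the TABLE, SCHEMA, and DRY RUN keywords described in this post are used.

```python
import re

# Conservative check (an assumption of this sketch): dot-separated parts made of
# letters, digits, and underscores. Names needing backticks are rejected.
_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)*")


def build_sync_statement(kind, target, source, dry_run=False):
    """Build a SYNC TABLE or SYNC SCHEMA statement as a string."""
    kind = kind.upper()
    if kind not in ("TABLE", "SCHEMA"):
        raise ValueError("kind must be TABLE or SCHEMA")
    for name in (target, source):
        if not _NAME.fullmatch(name):
            raise ValueError(f"unexpected identifier: {name!r}")
    statement = f"SYNC {kind} {target} FROM {source}"
    if dry_run:
        statement += " DRY RUN"
    return statement


# Hypothetical object names, used only for illustration.
dry_run_sql = build_sync_statement(
    "table", "uc_catalog.sales.orders", "hive_metastore.sales.orders", dry_run=True
)
print(dry_run_sql)
```

Generating the dry-run and the real statement from the same arguments keeps the two in step, so the table that was checked is the table that gets upgraded.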
Note: The user who runs the SYNC command will be the owner of the newly created tables.
Note: We are using sample data for this example. Databricks also provides a variety of datasets that are already mounted to DBFS in your Databricks workspace. You can find more details here.
Upgrade an external table to Unity Catalog
Create a Hive metastore schema
use catalog hive_metastore;
drop database if exists hmsdb_sync cascade;
create database hmsdb_sync;
Create a Unity Catalog schema
use catalog main;
drop database if exists main.ucdb_sync cascade;
create database main.ucdb_sync;
Create an external table in Hive metastore
-- create an external delta table in Hive metastore
drop table if exists hive_metastore.hmsdb_sync.people_delta;
create table hive_metastore.hmsdb_sync.people_delta
location "<<Your Object Storage Location>>"
as select * from delta.`dbfs:/databricks-datasets/learning-spark-v2/people/people-10m.delta` limit 100000;
Select from the table to verify
select * from hive_metastore.hmsdb_sync.people_delta;
Execute Dry Run
sync table main.ucdb_sync.people_delta from hive_metastore.hmsdb_sync.people_delta DRY RUN;
Observe the results of the dry run
Upgrade the table and observe the result
sync table main.ucdb_sync.people_delta from hive_metastore.hmsdb_sync.people_delta;
Describe both the source and target tables and compare
describe extended hive_metastore.hmsdb_sync.people_delta;
describe extended main.ucdb_sync.people_delta;
Describe the Hive metastore table and the Unity Catalog table
Select from the target table to verify the data
select * from main.ucdb_sync.people_delta;
Upgrade the schema and all eligible tables in one go
sync schema main.ucdb_schema_sync from hive_metastore.hmsdb_schema_sync DRY RUN;
sync schema main.ucdb_schema_sync from hive_metastore.hmsdb_schema_sync;
In this blog, we have shown how you can use the SYNC command to abstract the complexity of upgrading your Hive metastore objects to the Unity Catalog metastore. To learn more about the SYNC command and how to get started, please visit the guides (AWS, Azure). Please refer to the notebook to try different options with SYNC and keep your Hive metastore schemas and external tables and your Unity Catalog metastore in sync.
SYNC can be run multiple times to ensure that Hive metastore objects and Unity Catalog metastore objects are in sync. SYNC makes it seamless and easy for customers to adopt Unity Catalog and benefit from unified governance features. If you no longer need your Hive metastore schemas and tables, you can drop them. Dropping an external table does not modify the data files in your cloud tenant.