Tuesday, February 16, 2021

Spark - register table in the Hive metastore


Per best practice, we have created a partitioned table. However, if you create a partitioned table from existing data, Spark SQL does not automatically discover the partitions and register them in the Hive metastore.

MSCK REPAIR TABLE recovers all the partitions in the directory of a table and updates the Hive metastore. When you create a table with the PARTITIONED BY clause, partitions are generated and registered in the Hive metastore. However, if the partitioned table is created from existing data, the partitions are not registered automatically, and you need to run MSCK REPAIR TABLE to register them. Running MSCK REPAIR TABLE on a non-existent table or on a table without partitions throws an exception. Another way to recover partitions is ALTER TABLE table_name RECOVER PARTITIONS.
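To make the directory scan concrete, here is a small pure-Python sketch of the idea behind partition recovery: walk a table directory and collect the key=value subdirectories. It is only an illustration, not how Spark implements MSCK REPAIR TABLE. The function name discover_partitions and the device_id partition column are made up for this example.

```python
import os
import tempfile


def discover_partitions(table_dir):
    """Return one dict per leaf partition directory (key=value layout)."""
    partitions = []
    for root, dirs, _files in os.walk(table_dir):
        dirs.sort()  # sorting in place makes the traversal order deterministic
        rel = os.path.relpath(root, table_dir)
        if rel == ".":
            continue  # the table root itself is not a partition
        parts = rel.split(os.sep)
        is_partition_path = all("=" in p for p in parts)
        has_nested_partitions = any("=" in d for d in dirs)
        if is_partition_path and not has_nested_partitions:
            partitions.append(dict(p.split("=", 1) for p in parts))
    return partitions


# Hypothetical layout: a table directory partitioned by device_id.
with tempfile.TemporaryDirectory() as table_dir:
    for device in ("0", "1", "2"):
        os.makedirs(os.path.join(table_dir, f"device_id={device}"))
    found = discover_partitions(table_dir)

print(found)  # [{'device_id': '0'}, {'device_id': '1'}, {'device_id': '2'}]
```

The directories exist on disk either way. What is missing for a table created over existing data is the matching list of partitions in the metastore, and that is what the repair command fills in.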

health_tracker_processed = spark.read.table("health_tracker_processed")
health_tracker_processed.count()

The count above returns 0 because the partitions are not yet registered in the metastore. Run the following to register them:

%sql

MSCK REPAIR TABLE health_tracker_processed

After that, count() returns the actual number of rows.
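If you would rather stay in a Python cell, the same two steps can be wrapped in a small helper. This is a minimal sketch: repair_and_count is a made-up name, and it assumes `spark` is the SparkSession that the notebook already provides.

```python
def repair_and_count(spark, table_name):
    """Register on-disk partitions in the metastore, then count the rows."""
    # spark.sql() runs commands such as MSCK REPAIR TABLE eagerly.
    spark.sql(f"MSCK REPAIR TABLE {table_name}")
    # Read the table again now that its partitions are registered.
    return spark.read.table(table_name).count()
```

In a notebook you would call it as repair_and_count(spark, "health_tracker_processed").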
