Monday, February 22, 2021

Databricks CLI fundamentals

 

  1. pip3 install databricks-cli
  2. Check if installed: which databricks
  3. Check version: databricks --version
  4. databricks configure --token
  5. databricks clusters list
  6. To edit the default configuration: vi ~/.databrickscfg
  7. Create a scope: databricks secrets create-scope --scope demo
  8. Put APP_key into the scope: databricks secrets put --scope demo --key APP_key --string-value some-value
  9. To store a password from a file:
    • vim password.txt (to strip the trailing newline before saving, run :set noendofline binary); use :wq to save and quit
    • databricks secrets put --scope demo --key password --binary-file password.txt
  10. To delete a scope: databricks secrets delete-scope --scope demo
  11. To push the project to the Databricks workspace and load the .whl file to DBFS (see the command sketch after this list):
  1. To install the .whl file from CLI:
    • Get the cluster-id: databricks clusters get --cluster-name demo
    • Install the library: databricks libraries install --cluster-id your-cluster-id --whl dbfs:/tmp/whl-name.whl
  2. To export changes made in Databricks and sync them with the local copy (see the export sketch after this list), then use git diff weather-wheel.py to see the differences:
  1. To import local changes back into Databricks (completely overwrites the target): databricks workspace import -o -l PYTHON weather-notebook.py /cli-demo/weather-notebook
  2. Some other interactions with Databricks CLI:
    • Start a cluster: databricks clusters start --cluster-id your-cluster-id
    • List jobs: databricks jobs list
    • Get job details: databricks jobs get --job-id job-id-number
    • Run a job: databricks jobs run-now --job-id job-id-number
    • Get the output of a run: databricks runs get-output --run-id id-from-last-step
    • To terminate (not permanently delete) a cluster: databricks clusters delete --cluster-id your-cluster-id
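
For step 11 above, a minimal sketch of the push/upload commands (the paths and file names are placeholders, assuming the legacy databricks-cli):
    • Push a local directory of notebooks to the workspace: databricks workspace import_dir -o ./notebooks /cli-demo
    • Upload the built wheel to DBFS: databricks fs cp --overwrite dist/whl-name.whl dbfs:/tmp/whl-name.whl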
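For the export step above, a sketch of the export command (the workspace and local paths are placeholders, again assuming the legacy databricks-cli):
    • Export a notebook from the workspace to a local file: databricks workspace export -o /cli-demo/weather-notebook weather-notebook.py
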
To create secrets using Databricks CLI:
- databricks secrets create-scope --scope your-scope-name
- databricks secrets put --scope your-scope-name --key username --string-value blabla
- databricks secrets list --scope your-scope-name

To check secrets in Databricks:
- dbutils.secrets.listScopes()
- dbutils.secrets.list('demo')
- dbutils.secrets.get(scope="demo", key="app_key")

- trick to see the key (Databricks redacts secret values printed in notebook output, so print the characters separated by spaces to get around the redaction):
# dbutils.secrets.get returns the secret as a plain string; joining its
# characters with spaces keeps the printed value from being redacted
app_key = dbutils.secrets.get(scope="demo", key="app_key")
print("app_key:", ' '.join(app_key))

To create a .whl file:
- python -m build
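
A minimal sketch of the full build flow (this assumes the project root already contains a pyproject.toml or setup.py; build is the PyPA build package from PyPI):
- pip install build
- python -m build   # the built wheel ends up in dist/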

Tuesday, February 16, 2021

Spark - register table in the Hive metastore


Per best practice, we have created a partitioned table. However, if you create a partitioned table from existing data, Spark SQL does not automatically discover the partitions and register them in the Metastore.

MSCK REPAIR TABLE recovers all the partitions in the directory of a table and updates the Hive metastore. When a table is created with the PARTITIONED BY clause, partitions are generated and registered in the Hive metastore. However, if the partitioned table is created from existing data, partitions are not registered automatically; the user needs to run MSCK REPAIR TABLE to register them. Running MSCK REPAIR TABLE on a non-existent table, or on a table without partitions, throws an exception. Another way to recover partitions is ALTER TABLE ... RECOVER PARTITIONS.

health_tracker_processed = spark.read.table("health_tracker_processed")
health_tracker_processed.count()

The count above returns 0 because the partitions are not yet registered. Run the following to register them:

%sql

MSCK REPAIR TABLE health_tracker_processed

Then count() returns the expected result.
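
As mentioned above, an alternative to MSCK REPAIR TABLE is ALTER TABLE ... RECOVER PARTITIONS; for the same table it would look like this (a sketch):

%sql

ALTER TABLE health_tracker_processed RECOVER PARTITIONS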

Thursday, February 4, 2021

GitLab - Fork using CLI

 

Fork using CLI:

  1. Clone our own project repo to local: git clone https://gitlab.com/XXX.git
  2. Add upstream repo: git remote add upstream https://gitlab.com/XXXX
  3. To verify the remote has been added: git remote -v
  4. Pull changes from upstream to local: git pull upstream master
  5. Push to origin (Fork: our own repo): git push origin master
Reference: https://www.sitepoint.com/quick-tip-synch-a-github-fork-via-the-command-line/


But note that only a fork created by clicking the 'Fork' button in the GitLab UI can open a merge request against the upstream repo, so use the CLI only to keep the fork in sync with upstream.