Monday, February 22, 2021

Databricks CLI fundamentals

 

  1. pip3 install databricks-cli
  2. Check if installed: which databricks
  3. Check version: databricks --version
  4. databricks configure --token
  5. databricks clusters list
  6. To edit the default configuration: vi ~/.databrickscfg
  7. Create a scope: databricks secrets create-scope --scope demo
  8. Put APP_key into the scope: databricks secrets put --scope demo --key APP_key --string-value some-value
  9. To store a password from a file:
    • vim password.txt (to strip the trailing newline before saving, run :set noendofline binary); use :wq to save and quit
    • databricks secrets put --scope demo --key password --binary-file password.txt
  10. To delete a scope: databricks secrets delete-scope --scope demo
  11. To push the project to the Databricks workspace and load the .whl file to DBFS (see the command sketch after this list):
  1. To install the .whl file from CLI:
    • Get the cluster-id: databricks clusters get --cluster-name demo
    • Install the library: databricks libraries install --cluster-id your-cluster-id --whl dbfs:/tmp/whl-name.whl
  2. To export changes made in Databricks and sync them with the local copy (see the export sketch after this list), then use git diff weather-wheel.py to see the differences:
  1. To import local changes back into Databricks (completely overwrites the target): databricks workspace import -o -l PYTHON weather-notebook.py /cli-demo/weather-notebook
  2. Some other interactions with Databricks CLI:
    • Start a cluster: databricks clusters start --cluster-id your-cluster-id
    • List jobs: databricks jobs list
    • Get job details: databricks jobs get --job-id job-id-number
    • Run a job: databricks jobs run-now --job-id job-id-number
    • Get the output of a run: databricks runs get-output --run-id id-from-last-step
    • To terminate (not permanently delete) a cluster: databricks clusters delete --cluster-id your-cluster-id
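
For step 11 above, a minimal sketch of the push/upload commands (the paths and file names are placeholders, assuming the legacy databricks-cli):
    • Push a local directory of notebooks to the workspace: databricks workspace import_dir -o ./notebooks /cli-demo
    • Upload the built wheel to DBFS: databricks fs cp --overwrite dist/whl-name.whl dbfs:/tmp/whl-name.whl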
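For the export step above, a sketch of the export command (the workspace and local paths are placeholders, again assuming the legacy databricks-cli):
    • Export a notebook from the workspace to a local file: databricks workspace export -o /cli-demo/weather-notebook weather-notebook.py
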
To create secrets using Databricks CLI:
- databricks secrets create-scope --scope your-scope-name
- databricks secrets put --scope your-scope-name --key username --string-value blabla
- databricks secrets list --scope your-scope-name

To check secrets in Databricks:
- dbutils.secrets.listScopes()
- dbutils.secrets.list('demo')
- dbutils.secrets.get(scope="demo", key="app_key")

- trick to see the key (Databricks redacts secret values printed in notebook output, so print the characters separated by spaces to get around the redaction):
# dbutils.secrets.get returns the secret as a plain string; joining its
# characters with spaces keeps the printed value from being redacted
app_key = dbutils.secrets.get(scope="demo", key="app_key")
print("app_key:", ' '.join(app_key))

To create a .whl file:
- python -m build
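
A minimal sketch of the full build flow (this assumes the project root already contains a pyproject.toml or setup.py; build is the PyPA build package from PyPI):
- pip install build
- python -m build   # the built wheel ends up in dist/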

Tuesday, February 16, 2021

Spark - register table in the Hive metastore


Per best practice, we have created a partitioned table. However, if you create a partitioned table from existing data, Spark SQL does not automatically discover the partitions and register them in the Metastore.

MSCK REPAIR TABLE recovers all the partitions in the directory of a table and updates the Hive metastore. When a table is created with the PARTITIONED BY clause, partitions are generated and registered in the Hive metastore. However, if the partitioned table is created from existing data, partitions are not registered automatically; the user needs to run MSCK REPAIR TABLE to register them. Running MSCK REPAIR TABLE on a non-existent table, or on a table without partitions, throws an exception. Another way to recover partitions is ALTER TABLE ... RECOVER PARTITIONS.

health_tracker_processed = spark.read.table("health_tracker_processed")
health_tracker_processed.count()

The count above returns 0 because the partitions are not yet registered. Run the following to register them:

%sql

MSCK REPAIR TABLE health_tracker_processed

Then count() returns the expected result.
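
As mentioned above, an alternative to MSCK REPAIR TABLE is ALTER TABLE ... RECOVER PARTITIONS; for the same table it would look like this (a sketch):

%sql

ALTER TABLE health_tracker_processed RECOVER PARTITIONS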

Thursday, February 4, 2021

GitLab - Fork using CLI

 

Fork using CLI:

  1. Clone our own project repo to local: git clone https://gitlab.com/XXX.git
  2. Add upstream repo: git remote add upstream https://gitlab.com/XXXX
  3. To verify the remote has been added: git remote -v
  4. Pull changes from upstream to local: git pull upstream master
  5. Push to origin (Fork: our own repo): git push origin master
Reference: https://www.sitepoint.com/quick-tip-synch-a-github-fork-via-the-command-line/


But note that only a fork created by clicking the 'Fork' button in the GitLab UI can open a merge request against the upstream repo, so use the CLI only to keep the fork in sync with upstream.