I’ve been working a lot with Azure Databricks, and one of the things I’m really impressed by is how straightforward it has been to use source control to store not just notebooks, but other artifacts such as jobs and clusters. This is because the API is well documented, and it’s a case of pushing either a file (in the case of notebooks) or a bit of JSON config (in the case of everything else). Compared to SQL Agent Jobs for example, or SSIS or SSAS, or Azure SQL Data Warehouse, or ADF, it is amazing.
What has also helped me greatly is that there is a PowerShell module called azure.databricks.cicd.tools. Now I’ve already mentioned it in previous posts, but it’s definitely worth mentioning again. Basically everything I’ve wanted to do I’ve been able to do using this PowerShell module, which is a bit of a relief quite frankly because I didn’t want to have to write this stuff myself! By creating a git repo locally, you can push branches into a workspace and make changes there, before pulling back locally and then pushing to Azure DevOps or GitLab or whatever, creating pull requests and merging. This is something you cannot do with the integrated git in a workspace. Handy!
It’s easy to publish a folder to a workspace: you can run this -
Import-DatabricksFolder -BearerToken $BearerToken -Region $Region -LocalPath $localPath -DatabricksPath $dataBricksPath -Verbose
And boom! It’s right there in your workspace. And if you want to pull from the workspace, it is a case of running this -
Export-DatabricksFolder -BearerToken $BearerToken -Region $Region -LocalOutputPath $localOutputPath -ExportPath $exportPath -Verbose
And bosh! It’s pulled back to the local folder. And what makes it really neat is that (for .py files at least, and I’m sure most other file types) git is smart enough to pick up where there have been modifications in notebooks. So you don’t have to concern yourself with pulling specific notebooks down from a workspace; you can pull the root folder of a project and only the notebooks that you have changed will appear as modified in your git repo.
Except, of course, when notebooks you haven’t changed appear as modified. This is slightly concerning, and it happened to me recently. Looking at the change that was downloaded, a single line had been removed:
# COMMAND ----------
Cells in a .py file are separated by a # COMMAND ---------- line, and when a notebook is published it appears that the API will remove a blank cell at the end of the notebook. So when the notebook is downloaded again, git will mark this removal of a blank cell as a change. Alright, simple enough: remove the blank cell at the end and job done. However, if there are multiple blank cells at the end, it will only remove the last blank cell. This means that when the notebook is published again, the next last blank cell will also be removed, and will be marked as a change again. Chances are those cells are not required, so delete them from the notebook or the repo and push, and the problem will go away.
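If you’d rather not hunt for those trailing cells by hand, something like the following could tidy the files in the repo before publishing. It’s a minimal sketch in Python and not part of azure.databricks.cicd.tools: the function name strip_trailing_blank_cells is made up for illustration, and it assumes a blank cell shows up in the .py file as a # COMMAND ---------- line followed by nothing but whitespace. It also ends the file with a single newline, which is my assumption rather than something the export guarantees, so adjust that if it creates diffs of its own.

```python
# Separator line the exported .py notebook uses between cells.
CELL_SEPARATOR = "# COMMAND ----------"


def strip_trailing_blank_cells(source: str) -> str:
    """Remove every empty cell from the end of a notebook's .py source.

    Hypothetical helper: assumes an empty trailing cell is a separator
    line with only whitespace after it.
    """
    lines = source.splitlines()
    # Walk backwards, dropping blank lines and separators that have no
    # content after them; stop at the first line of real content.
    while lines and lines[-1].strip() in ("", CELL_SEPARATOR):
        lines.pop()
    # Assumption: finish the file with exactly one newline.
    return "\n".join(lines) + "\n" if lines else ""


if __name__ == "__main__":
    notebook = (
        "# Databricks notebook source\n"
        "print(1)\n"
        "\n"
        "# COMMAND ----------\n"
        "\n"
        "print(2)\n"
        "\n"
        "# COMMAND ----------\n"
        "\n"
        "# COMMAND ----------\n"
    )
    # Both trailing blank cells are removed in one pass.
    print(strip_trailing_blank_cells(notebook))
```

Running this over each .py file in the local folder before publishing should remove all of the trailing blank cells in one go, rather than one per round trip.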