Import-DatabricksFolder - Now With Threading
One of the big challenges of running a release pipeline is the duration of a release and when is a good time for doing it. OK, that’s two challenges really, but it’s the first point about duration that I want to talk about. For example, in the world of SSDT and publishing changes it can take a very long time indeed if any of the changes result in a diff script that creates a new table, moves the data over, drops the old table and then renames the new table. But I’m not here to talk about how to fix that either (hint: use pre-model scripts to make the changes first, then the diff script won’t need to do large and slow movements of data).
What I want to talk about today is importing a Databricks folder into a workspace. Because notebooks are stateless you have two options: replace the entire contents of a folder and accept that anything already there gets overwritten, or figure out which notebooks have changed and replace only those. The first option means you have total confidence that what is in source is what has been released, but it can be slow because the notebooks are published one at a time. The second option means you only deploy what has changed. So it would certainly be faster, but to determine what has changed requires a level of sophistication in your deployment code, and by “sophistication” I mean “complicated”. And I am not a sophisticated person.
And so what to do? Well, you could just ignore it until your deployments get unreasonably long, which is what most people do. And this is exactly what I did until I started working on a project where releases were taking 5 minutes to deploy the notebooks. 5 minutes may not sound like a long time, but if you have a pipeline that releases the notebooks prior to a PR being merged, then another pipeline with a release after it has been merged to master, and then the artifact has to be deployed to one other environment, that’s three deployments and 15 minutes. If each of those deployments took 30 seconds, that would be a huge time saving.
So having pondered how to implement “deploy only what has changed” for notebook deployments, I noticed that it is the API interaction that takes the longest, and that getting the contents of each notebook to see if it has changed and then deploying only what has changed would take longer than the original method. So what I have done instead is run each call to the API as a thread job and then check the status of each of the jobs at the end. Because I am using Start-ThreadJob over Start-Job, a few other changes were required: the MinimumPowerShellVersion was bumped up to 5.1, and if the version of PowerShell is less than 7 then the ThreadJob module, which provides Start-ThreadJob, is installed if it is not already present.
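To give a feel for the pattern, here is a minimal sketch of "one thread job per API call, check them all at the end". It is not the module's actual code: the `$importOne` script block and the notebook names are made up for illustration, and the sleep simply stands in for the latency of a real call to the Databricks API.

```powershell
# Hypothetical stand-in for a single notebook import; the sleep simulates API latency.
$importOne = {
    param($Path)
    Start-Sleep -Milliseconds 200
    "Imported $Path"
}

# Made-up notebook names, purely for illustration.
$notebooks = 'a.py', 'b.py', 'c.py', 'd.py'

# Queue one thread job per notebook rather than calling the API serially.
$jobs = foreach ($notebook in $notebooks) {
    Start-ThreadJob -ScriptBlock $importOne -ArgumentList $notebook
}

# Wait for everything to finish, then check the status of each job at the end.
$jobs | Wait-Job | Out-Null
$failed  = @($jobs | Where-Object State -eq 'Failed')
$results = @($jobs | Receive-Job)
$jobs | Remove-Job

if ($failed.Count -gt 0) {
    throw "$($failed.Count) import(s) failed"
}
$results
```

Run serially, the four simulated calls would take roughly four times the latency of one call; as thread jobs they overlap, which is where the time saving comes from.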
You might well be thinking “Why Start-ThreadJob?” Start-Job is fine, but it is absolutely awful in VS Code, which I spend a lot of time working in, so Start-ThreadJob was required. However, it is only installed by default in PowerShell 6.2 and above, but can be installed in all versions from 5.1 upwards, hence the version bump and the check to see if it is installed.
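A check along these lines would do the job. This is a sketch rather than a copy of what the module does: it tests for the cmdlet itself instead of the PowerShell version, and it assumes the PowerShell Gallery is reachable and that installing to the current user's scope is acceptable on your build agent.

```powershell
# Only install if the Start-ThreadJob cmdlet is not already available in this session.
if (-not (Get-Command -Name Start-ThreadJob -ErrorAction SilentlyContinue)) {
    # ThreadJob is the PowerShell Gallery module that provides Start-ThreadJob.
    # CurrentUser scope is an assumption here; it avoids needing admin rights.
    Install-Module -Name ThreadJob -Scope CurrentUser -Force
    Import-Module -Name ThreadJob
}
```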
I set the ThrottleLimit to a conservative twice the number of cores so as not to wear out meagre build agents. The work was merged some time ago and the release was published last week, so if you are using azure.databricks.cicd.tools to publish notebooks you should see it running much faster.
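For the curious, a throttle of twice the core count can be worked out something like this. It is a sketch under assumptions: I am using `[Environment]::ProcessorCount`, which reports logical processors rather than physical cores, and the script block is a trivial placeholder rather than a real API call.

```powershell
# Twice the number of logical processors reported by .NET (an assumption for "cores").
$throttle = 2 * [Environment]::ProcessorCount

# Placeholder work: each job just doubles its input.
$jobs = 1..10 | ForEach-Object {
    Start-ThreadJob -ThrottleLimit $throttle -ScriptBlock { param($n) $n * 2 } -ArgumentList $_
}

# Jobs beyond the throttle limit queue up and start as running ones finish.
$doubled = @($jobs | Wait-Job | Receive-Job)
$jobs | Remove-Job
```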