Hello!

Last month I posted about how long-running releases get tedious and what you can do about that. On a very similar theme, running a process like a build can also get tedious, so it pays to look for improvements before the process really does start stopping people from developing at a reasonable pace. I’m not going to repeat what I mentioned about three weeks ago, so go and have a read if you like; what motivated me to make this change was simply getting bored of waiting for something to complete.

The point being, exporting notebooks from Databricks using azure.databricks.cicd.tools can take a long time if you have many notebooks. Each notebook takes about half a second, so once you get into dozens of notebooks and run the export multiple times a day it adds up, and I wanted to make it faster. Once again I decided to use Start-ThreadJob. My initial starting point was to thread only the API call, since that is the longest task in the export (about 500 ms), but every activity after it relies on the content the call returns, so threading the call alone was not enough. And because a thread job does not have access to the context of the main PowerShell session, I could not merely wrap the private function in Set-LocalNotebook.ps1 in a script block either.
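To illustrate the session-context problem, here is a minimal sketch, assuming PowerShell 7 (or Windows PowerShell with the ThreadJob module installed). Get-Greeting is a made-up helper standing in for a private module function; it is not part of azure.databricks.cicd.tools.

```powershell
# Hypothetical helper, standing in for a private function in the main session.
function Get-Greeting { param($Name) "Hello, $Name" }

# The thread job runs in its own runspace, so it cannot see Get-Greeting.
$visible = Start-ThreadJob -ScriptBlock {
    [bool](Get-Command Get-Greeting -ErrorAction SilentlyContinue)
} | Receive-Job -Wait -AutoRemoveJob

# Workaround: call the helper in the main session and pass the result in.
$greeting = Get-Greeting -Name 'notebook'
$passed = Start-ThreadJob -ArgumentList $greeting -ScriptBlock {
    param($Text)
    $Text.ToUpper()
} | Receive-Job -Wait -AutoRemoveJob
```

The first job reports that the function is not visible inside the thread, while the second works because everything it needs arrives through -ArgumentList.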

What I ended up doing was moving the body of Set-LocalNotebook into a script block in Get-Notebooks, pulling the bits that need the context of the main PowerShell session out into Get-Notebooks itself, and then passing their results as arguments to the script block when running Start-ThreadJob.

So, unlike the old Set-LocalNotebook, the script block can no longer call the private functions GetHeaders and Format-DataBricksFileName. Instead the results of these need to be passed in to the script block, as well as the paths of the two files that get created. This is the old function:

function Set-LocalNotebook ($DatabricksFile, $Language, $LocalOutputPath, $Format="SOURCE"){
    $DatabricksFileForUrl = Format-DataBricksFileName -DataBricksFile $DatabricksFile
    $uri = "$global:DatabricksURI/api/2.0/workspace/export?path=" + $DatabricksFileForUrl + "&format=$Format&direct_download=true"
    
    switch ($Format){
        "SOURCE" {
            $FileExtentions = @{"PYTHON"=".py"; "SCALA"=".scala"; "SQL"=".sql"; "R"=".r"}
            $FileExt = $FileExtentions[$Language]
        }
        "HTML"{
            $FileExt = ".html"
        }
        "JUPYTER"{
            $FileExt = ".ipynb"
        }
        "DBC"{
            $FileExt = ".dbc"
        }
    }
        
    $LocalExportPath = $DatabricksFile.Replace($ExportPath + "/","") + $FileExt
    $tempLocalExportPath = $DatabricksFile.Replace($ExportPath + "/", "") + ".temp" + $FileExt
    $LocalExportPath = Join-Path $LocalOutputPath $LocalExportPath
    $tempLocalExportPath = Join-Path $LocalOutputPath $tempLocalExportPath
    New-Item -Force -path $tempLocalExportPath -Type File | Out-Null
    $Headers = GetHeaders $null
    
    Try
    {
        # Databricks exports with a comment line in the header, remove this and ensure we have Windows line endings
        Invoke-RestMethod -Method Get -Uri $uri -Headers $Headers -OutFile $tempLocalExportPath
        $Response = Get-Content $tempLocalExportPath -Encoding UTF8
        $Response = $response.Replace("# Databricks notebook source", " ")
        Remove-Item $tempLocalExportPath
        if ($Format -eq "SOURCE"){
            $Response = ($Response.replace("[^`r]`n", "`r`n") -Join "`r`n")
        }

        Write-Verbose "Creating file $LocalExportPath"
        New-Item -force -path $LocalExportPath -value $Response -type file | out-null
    }
    Catch
    {
        Write-Error $_.ErrorDetails.Message
        Throw
    }
}

With those removed, the script block is much more condensed and no longer needs any session context or knowledge of how the file paths are built. It merely executes PowerShell cmdlets that are available in any context.

 $scriptBlock = { param($DatabricksFile, $Format = "SOURCE", $Headers, $uri, $LocalExportPath, $tempLocalExportPath)         
        Try {
            New-Item -Force -path $tempLocalExportPath -Type File | Out-Null
            Invoke-RestMethod -Method Get -Uri $uri -Headers $Headers -OutFile $tempLocalExportPath 
            $Response = Get-Content $tempLocalExportPath -Encoding UTF8 
            $Response = $response.Replace("# Databricks notebook source", " ") 
            Remove-Item $tempLocalExportPath 
            if ($Format -eq "SOURCE") { 
                $Response = ($Response.replace("[^`r]`n", "`r`n") -Join "`r`n") 
            } 
            New-Item -force -path $LocalExportPath -value $Response -type file | out-null 
        }
        Catch {
            Write-Error $_.ErrorDetails.Message
            Throw
        }
    }

...and the bits that were removed now run before the script block is started, with their results passed in as arguments:

            $Notebook = $Object.path
            $NotebookLanguage = $Object.language
            Write-Verbose "Calling Writing of $Notebook ($NotebookLanguage)"
            $DatabricksFileForUrl = Format-DataBricksFileName -DataBricksFile $Notebook 
            $uri = "$global:DatabricksURI/api/2.0/workspace/export?path=" + $DatabricksFileForUrl + "&format=$Format&direct_download=true"
            switch ($Format) {
                "SOURCE" {
                    $FileExtentions = @{"PYTHON" = ".py"; "SCALA" = ".scala"; "SQL" = ".sql"; "R" = ".r" }
                    $FileExt = $FileExtentions[$NotebookLanguage]
                }
                "HTML" {
                    $FileExt = ".html"
                }
                "JUPYTER" {
                    $FileExt = ".ipynb"
                }
                "DBC" {
                    $FileExt = ".dbc"
                }
            }
            $LocalExportPath = $Notebook.Replace($ExportPath + "/", "") + $FileExt
            $tempLocalExportPath = $Notebook.Replace($ExportPath + "/", "") + ".temp" + $FileExt
            $LocalExportPath = Join-Path $LocalOutputPath $LocalExportPath
            $tempLocalExportPath = Join-Path $LocalOutputPath $tempLocalExportPath
            $threadJobs += Start-ThreadJob -Name $Notebook -ScriptBlock $ScriptBlock -ThrottleLimit $throttleLimit -ArgumentList @($Notebook, $Format, $Headers, $uri, $LocalExportPath, $tempLocalExportPath)
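The excerpt above only shows the jobs being started. As a rough sketch of how they might be collected afterwards (an assumption on my part, not necessarily what the module does), the pattern below starts a few stand-in jobs, waits for all of them, and then gathers output and failures. The item names and the Start-Sleep call are placeholders for the notebook paths and the export API call.

```powershell
# Placeholder work items; in the post these would be notebook paths.
$items = 'a', 'b', 'c'
$threadJobs = @()

foreach ($item in $items) {
    # -ThrottleLimit caps how many thread jobs run at the same time.
    $threadJobs += Start-ThreadJob -Name $item -ThrottleLimit 2 -ArgumentList $item -ScriptBlock {
        param($Name)
        Start-Sleep -Milliseconds 100   # stands in for the export API call
        "exported $Name"
    }
}

# Block until every job has finished, then collect whatever each one wrote.
$results = $threadJobs | Wait-Job | Receive-Job

# Jobs whose script block threw end up in the Failed state.
$failed = @($threadJobs | Where-Object State -eq 'Failed')

# Clean up the finished job objects.
$threadJobs | Remove-Job
```

Checking the Failed state matters here because the script block rethrows on error, and a failed job would otherwise go unnoticed by the calling session.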

And now the time taken to download notebooks has reduced by about 50% for one project. Not bad going. This is the link for the whole script, and the PR was merged not long after it was opened. The latest changes are in 2.0.66-preview.