
Category: Notebooks

Parallel Loading in Spark Notebooks

Dustin Vannoy answers some questions:

I received many questions on my tutorial Ingest tables in parallel with an Apache Spark notebook using multithreading. In this video and post, I address some of the questions that I couldn’t just answer in the YouTube comments. Watch the video for more complete answers, but here are quick responses with links to examples where appropriate.

Click through for the video and some text versions. Dustin includes examples for Synapse and Databricks.
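For a sense of the pattern, here is a minimal sketch of multithreaded ingestion in a Spark notebook, not Dustin's exact code: the table list, JDBC connection string, and output paths are hypothetical placeholders, and `spark` is the session the notebook provides.

```python
# Minimal sketch: load several tables concurrently from one notebook.
from concurrent.futures import ThreadPoolExecutor

tables = ["dbo.Customers", "dbo.Orders", "dbo.Products"]  # hypothetical list

def ingest(table_name):
    # Each thread submits its own job to the shared SparkSession;
    # Spark schedules the jobs concurrently across the cluster.
    df = (spark.read
          .format("jdbc")
          .option("url", "jdbc:sqlserver://myserver;databaseName=mydb")  # placeholder
          .option("dbtable", table_name)
          .load())
    df.write.mode("overwrite").format("delta").save(f"/lake/raw/{table_name}")
    return table_name

with ThreadPoolExecutor(max_workers=4) as pool:
    for done in pool.map(ingest, tables):
        print(f"loaded {done}")
```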


Executing Multiple Notebooks in one Spark Pool with Genie

Shalu Ganotra Chadra, et al, explain what Synapse Genie is:

The Genie framework is a metadata-driven utility written in Python. It is implemented using threading (the ThreadPoolExecutor module) and a directed acyclic graph (the networkx library). It consists of a wrapper notebook that reads metadata of notebooks and executes them within a single Spark session. Each notebook is invoked on a thread with the mssparkutils.notebook.run() command, based on the available resources in the Spark pool. The dependencies between notebooks are understood and tracked through a directed acyclic graph.

Read on for more information about how you can use it and what the setup process looks like.
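A rough sketch of the core idea described above, not Genie's actual code: model notebook dependencies as a networkx DAG and run each independent group of notebooks concurrently with mssparkutils.notebook.run(). The notebook names, timeout, and worker count here are invented.

```python
# Sketch: DAG-ordered, threaded notebook execution in a Synapse Spark pool.
from concurrent.futures import ThreadPoolExecutor
import networkx as nx
from notebookutils import mssparkutils  # available in Synapse notebooks

# Edges point from a notebook to the notebook that depends on it.
dag = nx.DiGraph([("nb_extract", "nb_transform"), ("nb_transform", "nb_load")])

def run_notebook(name):
    # Runs the child notebook in the current Spark session (300s timeout).
    return mssparkutils.notebook.run(name, 300)

with ThreadPoolExecutor(max_workers=4) as pool:
    # topological_generations (networkx 2.6+) yields groups of notebooks
    # with no mutual dependencies, so each group can run concurrently.
    for level in nx.topological_generations(dag):
        list(pool.map(run_notebook, level))
```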


Sharing Results between Notebooks with MSSparkUtils

Liliam Leme provides an answer to a common Synapse Spark pool question:

I’ve been reviewing customer questions centered around “Have I tried using MSSparkUtils to solve the problem?”

One of the questions asked was how to share results between notebooks. Every time you hit “run” in a notebook, it starts a new Spark cluster, which means each notebook uses a different session, making it impossible to share results between executions of notebooks. MSSparkUtils offers a solution to handle this exact scenario.

Read on to see what MSSparkUtils is and how it helps in this case.
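To illustrate the mechanism as a sketch (not Liliam's exact code): mssparkutils.notebook.run() executes a child notebook inside the caller's Spark session, and the child hands a value back with mssparkutils.notebook.exit(). The notebook path, table name, and returned value below are illustrative.

```python
# --- child notebook ("ChildNotebook") ---
# mssparkutils is available by default in Synapse notebooks.
result = spark.sql("SELECT COUNT(*) AS n FROM my_table").first()["n"]
mssparkutils.notebook.exit(str(result))  # hands a value back to the caller

# --- parent notebook ---
# notebook.run() executes the child in the *same* Spark session, so temp
# views created there are also visible here, and the exit value comes back.
returned = mssparkutils.notebook.run("ChildNotebook", 300)
print(f"child returned: {returned}")
```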


Choosing between Synapse Spark Notebooks or Job Definitions

Arun Sethia and Arshad Ali explain when you might use a Spark notebook versus a job definition:

Synapse Spark Notebook is a web-based (HTTP/HTTPS) interactive interface for creating files that contain live code, narrative text, and visualized output, with rich libraries for Spark-based applications. Data engineers can collaborate, schedule, run, and test their Spark application code using Notebooks. Notebooks are a good place to validate ideas and do quick experiments to get insight into the data. You can integrate a Synapse Notebook into a Synapse pipeline.

The Notebook allows you to combine programming code with markdown text and perform simple visualizations (using Synapse Notebook chart options and open-source libraries). In addition, running code supplies immediate feedback, output, and progress tracking within the Notebook.

Click through for the comparison.


Running Diagnostic Notebooks via Powershell

Tracy Boggiano kicks off a notebook:

As part of starting a new job, you need a way to get a good inventory of basic information about your SQL Server instances. Once you have done what I outlined in this blog post, I find it helpful to run Glenn Alan Berry’s Diagnostic Notebooks against all the instances to get a static point-in-time snapshot of all the properties and some performance information. While dbatools has commands under the Community Tools section for running the data into spreadsheets and creating notebooks for the newest queries, I like to go get Glenn’s because he has all the comments in there of what they mean and links to resources about things. So you can explore that route if you like, but I’ll be manually downloading them from Glenn’s site for that reason. To be able to open the notebooks successfully in ADS, look for the tip in my blog post on Tools I Use on My Jumpbox for opening large notebooks.

Click through for a script Tracy uses to kick off the notebook regardless of the SQL Server version.


Software Engineering Practices for Notebooks

Rafi Kurlansik and Austin Ford explain how to get the most out of notebooks, using Databricks as an example:

Notebooks are a popular way to start working with data quickly without configuring a complicated environment. Notebook authors can quickly go from interactive analysis to sharing a collaborative workflow, mixing explanatory text with code. Often, notebooks that begin as exploration evolve into production artifacts. For example:

1. A report that runs regularly based on newer data and evolving business logic.

2. An ETL pipeline that needs to run on a regular schedule, or continuously.

3. A machine learning model that must be re-trained when new data arrives.

Perhaps surprisingly, many Databricks customers find that with small adjustments, notebooks can be packaged into production assets, and integrated with best practices such as code review, testing, modularity, continuous integration, and versioned deployment.

Read on for several tips and recommendations.
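As a small illustration of the modularity and testing points (my sketch, not code from the post): keep transformation logic in a plain, importable function so both the notebook and a CI test suite can call it. The function and test names here are hypothetical.

```python
# transforms.py -- imported by the notebook and by the test suite alike.
def add_line_total(df):
    """Pure function of its input DataFrame; easy to unit test."""
    return df.withColumn("line_total", df.quantity * df.unit_price)

# test_transforms.py -- runs under pytest in CI, no notebook required.
from pyspark.sql import SparkSession

def test_add_line_total():
    spark = SparkSession.builder.master("local[1]").getOrCreate()
    df = spark.createDataFrame([(2, 5.0)], ["quantity", "unit_price"])
    assert add_line_total(df).first().line_total == 10.0
```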


Separating Code from Presentation with Jupyter

John Mount disaggregates Jupyter notebook results:

As I switch back and forth between R and Python projects for various clients and partners, I got to thinking: “is there an easy way to separate code from presentations in Jupyter notebooks?”

The answer turns out to be yes. Jupyter itself exposes a rich application programming interface in Python. So it is very easy to organize Jupyter’s power into tools that give me a great data science and analysis workflow in Python.

Read on to see how.
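As one concrete way to do this kind of separation (the linked post may take a different route), Jupyter's Python APIs let you execute a notebook as a batch job and then export a code-free HTML presentation of the results. The notebook filename below is hypothetical.

```python
import nbformat
from nbconvert.preprocessors import ExecutePreprocessor
from nbconvert import HTMLExporter

nb = nbformat.read("analysis.ipynb", as_version=4)  # hypothetical notebook

# Run every cell, top to bottom, as a batch job.
ExecutePreprocessor(timeout=600).preprocess(nb, {"metadata": {"path": "."}})

# Export the executed notebook with outputs included but code hidden:
# a presentation separated from the code that produced it.
body, _ = HTMLExporter(exclude_input=True).from_notebook_node(nb)
with open("analysis_report.html", "w") as f:
    f.write(body)
```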


SQL Tools Updates

Timi Oshin has updates on SSMS and Azure Data Studio:

Azure Data Studio 1.35 now supports easier keyboard navigation in notebooks without mouse clicking. This is done by hitting the Esc key and navigating between cell rows using the Up and Down arrow keys. To enter edit mode, hit the Enter key on the keyboard. The new Table Designer preview feature supports creating new tables and editing existing tables on a connected SQL Server instance. This is a highly requested product enhancement and enables more productive schema management with a modern, streamlined UX.

Haha! It only took several years, but my hectoring finally paid off. Now for the full set of Jupyter keyboard shortcuts…


Working with Notebooks in Azure ML

I have started a new series:

In the prior series, Low-Code Machine Learning with Azure ML, we saw how to get started with Azure Machine Learning in a fairly pain-free way, especially for developers getting started with machine learning. In this series, I will assume that you already know all of those details and instead, we’re going to go full-code.

There are a few different ways in which we can go full-code with Azure ML. Today, we’re going to look at the easiest of those methods: using Jupyter notebooks within Azure ML Studio.

Read on for the first post in the series.
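For flavor, here is roughly what the first cell of such a notebook tends to look like, assuming the v1 azureml-core SDK; the series itself may use different calls.

```python
from azureml.core import Workspace, Dataset

# On an Azure ML compute instance, a config.json for the workspace is
# already present, so from_config() connects without extra arguments.
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location)

# List the datasets registered in the workspace (names are whatever
# you registered; nothing here is specific to the series).
for name in Dataset.get_all(ws).keys():
    print(name)
```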
