IDEs and Cloudera Data Science Workbench

Bethann Noble walks us through some of the options available for IDEs operating against Cloudera Data Science Workbench:

Other coders on the team including ML and DevOps engineers often work in local IDEs such as PyCharm.  These applications run locally on the user’s computer and connect to CDSW remotely over SSH for code completion and execution.  They must be configured per user and are not associated at the project level in CDSW. The documentation provides sample instructions for the Professional Edition of PyCharm v2019.1.

They support both browser-based and local IDEs.

The SQL Notebook Experience, Featuring Powershell

Rob Sewell takes a break from book-writing and talks about using Powershell in SQL Notebooks:

Yes, it’s funny but also it carries a serious warning. Without understanding what it is doing, please don’t enable PowerShell to be run in a SQL Notebook that someone sent you in an email or you find on a GitHub. In the same way as you don’t open the word document attachment which will get a thousand million trillion pounddollars into your bank account or run code you copy from the internet on production without understanding what it does, this could be a very dangerous thing to do.

With that warning out of the way, there are loads of really useful and fantastic use cases for this. SQL Notebooks make great run-books or incident response recorders and PowerShell is an obvious tool for this. (If only we could save the PowerShell output in a SQL Notebook, this would be even better)

“It’s a bit hacky” is a generous statement, but it’s really cool that Rob figured out a way to do this. There is a Powershell kernel for Jupyter, but I’ve not had the best experience adding new kernels to Azure Data Studio (at least not F#’s kernel, which I tried).

Notebooks in Azure Databricks

Brad Llewellyn takes us through Azure Databricks notebooks:

Azure Databricks Notebooks support four programming languages, Python, Scala, SQL and R.  However, selecting a language in this drop-down doesn’t limit us to only using that language.  Instead, it makes the default language of the notebook.  Every code block in the notebook is run independently and we can manually specify the language for each code block.

Before we get to the actually coding, we need to attach our new notebook to an existing cluster.  As we said, Notebooks are nothing more than an interface for interactive code.  The processing is all done on the underlying cluster.

Read on to learn how Databricks uses the notebook metaphor heavily in how you interact with it.

July Azure Data Studio Update

Alan Yu announces some great things in the July update to Azure Data Studio:

One of the most requested features from customers around the world is enhanced execution plan support. Although we have basic query plan support in Azure Data Studio, it’s not as robust as similar functionality built into SQL Server Management Studio and what other vendors provide.

Today, we’re pleased to announce that one of our valued Microsoft partners, SentryOne is shipping their SentryOne Plan Explorer extension for Azure Data Studio. This is a free extension that provides enhanced plan diagrams for queries that are run in Azure Data Studio, with optimized layout algorithms and intuitive color-coding to help quickly identify the most expensive operators affecting query performance.

The other big thing I like is that notebooks have keyboard shortcuts. These were two of the things keeping me from using ADS as much as I’d wanted. Now I’m that much closer to full-on migration.

Using Notebooks with ElasticMapReduce

Vignesh Rajamani and Nikki Rouda show off ElasticMapReduce Notebooks:

One of the useful features of EMR Notebooks is the separation of the notebook environment from your underlying cluster infrastructure. The separation makes it easy for you to execute notebook code against transient clusters without worrying about deploying or configuring your notebook infrastructure every time you bring up a new cluster. You can create multiple serverless notebooks from the AWS Management Console for EMR and access the notebook UI without spending time setting up SSH access or configuring your browser for port-forwarding. Each notebook you create is launched instantly with its own Spark context. This capability enables you to attach multiple notebooks to a single shared cluster and submit parallel jobs without fear of job conflicts in a multi-tenant environment. This way you make efficient use of your clusters.

You can also connect EMR Notebooks to an EMR cluster as small as a one node. This gives you a budget-friendly sandbox environment to develop your Spark application.

Notebooks are everywhere. And for good reason.

Embedding Notebooks on a Website

Eduardo Pivaral shows how to embed the results of a Jupyter notebook created in Azure Data Studio on a website:

Notebooks are a functionality available in Azure Data Studio, that allows you to create and share documents that may contain text, code, images, and query results. These documents are helpful to be able to share database insights and create runbooks that you can share easily.

Are you new to notebooks? don’t know what are the uses for it? want to know how to create your first notebook? then you can get started in ADS notebooks checking my article for MSSQLTips.com here.

Once you have created your first notebooks and share them among your team, maybe you want to share it on your website or blog for public access.
even when you can share the file for download, you can also embed it on the HTML code.

Be sure to read the comments too. Rendering notebooks is…an imperfect operation.

Azure Data Studio May Release

Alan Yu announces the May release of Azure Data Studio:

Since its release two months ago, the community continues to love SQL Notebooks. This month, we had a laser-eyed focus on quality of life bug fixes instead of new features. These improvements include:

– Markdown rendering improvements, including better support for notes and tables
– Usability improvements to the toolbar
– Markdown links for trusted notebooks no longer requires Command/Ctrl + click and can be clicked directly
– Improvements in cleaning up Jupyter processes after closing notebooks and reducing errors when starting multiple notebooks concurrently
– Improvements to SQL Notebook connections to ensure errors don’t occur when running two notebooks against the same database
– Improvements to notebook auto-scrolling to the currently executing cell when clicking the run cells button from the toolbar
– General stability and performance improvements

And based on some of the GitHub comments, I’m going to really like the June release if those changes all make it in.

Automating Jupyter Notebooks

I have some early thoughts on automating Jupyter notebooks:

In the command above, I included the date of execution. That way, I can script this to run once a day, storing results in an HTML file in some directory. Then, I can compare results over time and see when issues popped up.

I can also parse the resultant HTML if need be. Note that this won’t be trivial: even though the output looks like a simple [1] "PROBLEM ALERT", there’s a more complicated HTML blob. 

At some point I’ll probably have follow-up thoughts on the topic. Probably.

Azure Open Datasets

Jen Stirrup is pleased that Microsoft is bringing back open datasets:

Nearly three years ago, I complained bitterly about the demise of Windows Datamarket, which aimed to provide free, stock datasets for any and every purpose. I was a huge fan of the date dimension and  the geography dimension, since they really helped me to get started with data warehousing.

So I’m glad to say that the concept is back, revamped and rebuilt for the data scientists today. Azure Open Datasets will be useful to anyone who wants data for any reason: perhaps for learning, for demos, for improving machine learning accuracy, perhaps.

Go check it out.

Thoughts on SQL Notebooks

Emanuele Meazzo takes a look at the current state of SQL Notebooks in Azure Data Studio:

I’ve personally used SQL Notebook in my day-to-day work for Data Analysis, as the possibility to tweak the code and run it in the notebook greatly enhances the presentation of the data as oppose to a commented SQL Script ,as you cannot see all the query results in the same page too as opposed to a notebook; Moreover, a notebook (with or without results) can be exported in a read-only format like html or pdf to share the info with third parties, i.e. you can automate an analysis process that include code to be shared, cool stuff.

I think there are still a few (dozen) things to iron out before it’s a great experience, but they’re on the right path with it. If you haven’t checked out Azure Data Studio and its SQL Notebooks, give it a try sometime.

Categories

August 2019
MTWTFSS
« Jul  
 1234
567891011
12131415161718
19202122232425
262728293031