Using Notebooks with Elastic MapReduce

Vignesh Rajamani and Nikki Rouda show off Elastic MapReduce Notebooks:

One of the useful features of EMR Notebooks is the separation of the notebook environment from your underlying cluster infrastructure. The separation makes it easy for you to execute notebook code against transient clusters without worrying about deploying or configuring your notebook infrastructure every time you bring up a new cluster. You can create multiple serverless notebooks from the AWS Management Console for EMR and access the notebook UI without spending time setting up SSH access or configuring your browser for port-forwarding. Each notebook you create is launched instantly with its own Spark context. This capability enables you to attach multiple notebooks to a single shared cluster and submit parallel jobs without fear of job conflicts in a multi-tenant environment. This way you make efficient use of your clusters.

You can also connect EMR Notebooks to an EMR cluster as small as one node. This gives you a budget-friendly sandbox environment to develop your Spark application.

Notebooks are everywhere. And for good reason.

Embedding Notebooks on a Website

Eduardo Pivaral shows how to embed the results of a Jupyter notebook created in Azure Data Studio on a website:

Notebooks are a feature of Azure Data Studio that lets you create and share documents that may contain text, code, images, and query results. These documents are helpful for sharing database insights and for creating runbooks that you can share easily.

Are you new to notebooks? Don’t know what they are used for? Want to know how to create your first notebook? Then you can get started with ADS notebooks by checking my article here.

Once you have created your first notebooks and shared them among your team, you may want to share them on your website or blog for public access. Even though you can share the file for download, you can also embed it in the HTML code.

Be sure to read the comments too. Rendering notebooks is…an imperfect operation.

Azure Data Studio May Release

Alan Yu announces the May release of Azure Data Studio:

Since its release two months ago, the community continues to love SQL Notebooks. This month, we had a laser-eyed focus on quality of life bug fixes instead of new features. These improvements include:

– Markdown rendering improvements, including better support for notes and tables
– Usability improvements to the toolbar
– Markdown links for trusted notebooks no longer require Command/Ctrl + click and can be clicked directly
– Improvements in cleaning up Jupyter processes after closing notebooks and reducing errors when starting multiple notebooks concurrently
– Improvements to SQL Notebook connections to ensure errors don’t occur when running two notebooks against the same database
– Improvements to notebook auto-scrolling to the currently executing cell when clicking the run cells button from the toolbar
– General stability and performance improvements

And based on some of the GitHub comments, I’m going to really like the June release if those changes all make it in.

Automating Jupyter Notebooks

I have some early thoughts on automating Jupyter notebooks:

In the command above, I included the date of execution. That way, I can script this to run once a day, storing results in an HTML file in some directory. Then, I can compare results over time and see when issues popped up.
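The command itself is not reproduced in this excerpt. As a rough sketch of the idea only, assuming the notebook is executed with `jupyter nbconvert` (the `--to html`, `--execute`, `--output`, and `--output-dir` options), a dated HTML file could be produced along these lines. The notebook name `checks.ipynb`, the `reports` directory, and the `-results` suffix are hypothetical.

```python
import datetime
import shlex


def build_nbconvert_command(notebook, output_dir, run_date):
    """Return an argument list that executes `notebook` and writes dated HTML.

    The output file name embeds the run date, so each daily run keeps its
    own HTML file instead of overwriting the previous one.
    """
    output_name = f"{run_date.isoformat()}-results"  # hypothetical naming scheme
    return [
        "jupyter", "nbconvert",
        "--to", "html",
        "--execute",
        "--output", output_name,
        "--output-dir", output_dir,
        notebook,
    ]


if __name__ == "__main__":
    cmd = build_nbconvert_command("checks.ipynb", "reports", datetime.date.today())
    # Print the command for inspection; a scheduler could instead call
    # subprocess.run(cmd, check=True) once a day.
    print(shlex.join(cmd))
```

Because the date sits at the front of the file name, a plain directory listing sorts the results chronologically.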

I can also parse the resultant HTML if need be. Note that this won’t be trivial: even though the output looks like a simple [1] "PROBLEM ALERT", there’s a more complicated HTML blob. 
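One possible way to approach that parsing, sketched with only the Python standard library: collect the text inside every `<pre>` element and ignore the surrounding markup. This assumes the rendered output of interest lives in `<pre>` tags, which should be checked against the actual HTML file; the sample markup in the usage note is made up for illustration.

```python
from html.parser import HTMLParser


class PreTextExtractor(HTMLParser):
    """Collect the text content of each <pre> element in an HTML document."""

    def __init__(self):
        super().__init__()
        self._depth = 0      # how many <pre> elements we are currently inside
        self._buffer = []    # text fragments of the <pre> being read
        self.blocks = []     # finished <pre> contents, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.blocks.append("".join(self._buffer).strip())
                self._buffer = []

    def handle_data(self, data):
        # Text inside nested tags (for example <span>) is still captured.
        if self._depth > 0:
            self._buffer.append(data)


def extract_pre_text(html):
    """Return a list with the text of every <pre> block in `html`."""
    parser = PreTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

For example, `extract_pre_text('<div><pre>[1] &quot;PROBLEM ALERT&quot;</pre></div>')` yields a one-element list holding the plain alert text, which can then be compared across daily runs.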

At some point I’ll probably have follow-up thoughts on the topic. Probably.

Azure Open Datasets

Jen Stirrup is pleased that Microsoft is bringing back open datasets:

Nearly three years ago, I complained bitterly about the demise of Windows Datamarket, which aimed to provide free, stock datasets for any and every purpose. I was a huge fan of the date dimension and the geography dimension, since they really helped me to get started with data warehousing.

So I’m glad to say that the concept is back, revamped and rebuilt for the data scientists of today. Azure Open Datasets will be useful to anyone who wants data for any reason: perhaps for learning, for demos, or for improving machine learning accuracy.

Go check it out.

Thoughts on SQL Notebooks

Emanuele Meazzo takes a look at the current state of SQL Notebooks in Azure Data Studio:

I’ve personally used SQL Notebooks in my day-to-day work for data analysis, as the ability to tweak the code and run it in the notebook greatly enhances the presentation of the data as opposed to a commented SQL script, where you cannot see all the query results on the same page. Moreover, a notebook (with or without results) can be exported in a read-only format like HTML or PDF to share the info with third parties, i.e. you can automate an analysis process that includes code to be shared. Cool stuff.

I think there are still a few (dozen) things to iron out before it’s a great experience, but they’re on the right path with it. If you haven’t checked out Azure Data Studio and its SQL Notebooks, give it a try sometime.

Sentiment Analysis with Spark on Qubole

Jonathan Day, et al., have a tutorial on using Qubole to build a sentiment analysis model:

This post covers the use of Qubole, Zeppelin, PySpark, and H2O PySparkling to develop a sentiment analysis model capable of providing real-time alerts on customer product reviews. In particular, this model allows users to monitor any natural language text (such as social media posts or Amazon reviews) and receive alerts when customers post extremely nice (high sentiment) or extremely negative (low sentiment) comments about their products.

In addition to introducing the frameworks used, we will also discuss the concepts of embedding spaces, sentiment analysis, deep neural networks, grid search, stop words, data visualization, and data preparation.

Click through for the demo.

Databricks Dashboards

Megan Quinn takes us through building dashboards from Databricks notebooks:

The first step in any type of analysis is to understand the dataset itself. A Databricks dashboard can provide a concise format in which to present relevant information about the data to clients, as well as a quick reference for analysts when returning to a project.

To create this dashboard, a user can simply switch to Dashboard view instead of Code view under the View tab. The user can either click on an existing dashboard or create a new one. Creating a new dashboard will automatically display any of the visualizations present in the notebook. Customization of the dashboard is easily achieved by clicking on the chart icon in the top right corner of the desired command cells to add new elements.

This isn’t quite a step-by-step guide but does spur on ideas.

Building a DMV Diagnostic Queries Notebook

Gianluca Sartori shows how you can use dbatools and PowerShell to build a Jupyter notebook in Azure Data Studio for Glenn Berry’s DMV scripts:

For presentations, it is fairly obvious what the use case is: you can prepare notebooks to show in your presentations, with code and results combined in a convenient way. It helps when you have to establish a workflow in your demos that the attendees can repeat at home when they download the demos for your presentation.

For troubleshooting scenarios, the interesting feature is the ability to include results inside a Notebook file, so that you can create an empty Notebook, send it to your client and make them run the queries and send it back to you with the results populated. For this particular usage scenario, the first thing that came to my mind is running the diagnostic queries by Glenn Berry in a Notebook.

Obviously, I don’t want to create such a Notebook manually by adding all the code cells one by one. Fortunately, PowerShell is my friend and can do the heavy lifting for me.
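The author's PowerShell script is behind the link and not shown in this excerpt. Purely to illustrate the underlying idea in another language, here is a Python sketch that turns a list of titled queries into a notebook document following the Jupyter nbformat 4 JSON layout. The function name and the sample query are hypothetical, and the `SQL` kernelspec values are an assumption about what Azure Data Studio expects, so treat them as placeholders to verify.

```python
import json


def build_sql_notebook(queries):
    """Build an nbformat-4 notebook dict from (title, sql) pairs.

    Each query becomes a markdown heading cell followed by an empty,
    unexecuted code cell, so the recipient can run it and send back
    the file with results populated.
    """
    cells = []
    for title, sql in queries:
        cells.append({
            "cell_type": "markdown",
            "metadata": {},
            "source": [f"## {title}"],
        })
        cells.append({
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": sql.splitlines(keepends=True),
        })
    return {
        # Assumed kernelspec for a SQL notebook; verify against a notebook
        # saved from Azure Data Studio before relying on it.
        "metadata": {
            "kernelspec": {"name": "SQL", "display_name": "SQL", "language": "sql"}
        },
        "nbformat": 4,
        "nbformat_minor": 2,
        "cells": cells,
    }


if __name__ == "__main__":
    notebook = build_sql_notebook([("Version Info", "SELECT @@VERSION;")])
    print(json.dumps(notebook, indent=2))
```

Writing the resulting JSON to a file with an `.ipynb` extension gives a notebook that can be opened, run, and returned with its outputs filled in.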

This type of scenario is one of the best ones I see for database administrators: consistent, documented troubleshooting guides. Oh, and you can save results off if you need to review them later. This has the potential to be a killer feature for Azure Data Studio.

Diving Into SQL Notebooks

Rob Sewell tries out Azure Data Studio’s SQL notebooks, currently in preview:

OK, so now that we have the dependencies installed we can create a notebook. I decided to use the ValidationResults database that I use for my dbachecks demos and describe here. I need to restore it from my local folder that I have mapped as a volume to my container. Of course, I use dbatools for this 

Click through to see how to install and use SQL notebooks.

