Press "Enter" to skip to content

Category: Notebooks

Azure Open Datasets

Jen Stirrup is pleased that Microsoft is bringing back open datasets:

Nearly three years ago, I complained bitterly about the demise of Windows Datamarket, which aimed to provide free, stock datasets for any and every purpose. I was a huge fan of the date dimension and  the geography dimension, since they really helped me to get started with data warehousing.

So I’m glad to say that the concept is back, revamped and rebuilt for the data scientists today. Azure Open Datasets will be useful to anyone who wants data for any reason: perhaps for learning, for demos, for improving machine learning accuracy, perhaps.

Go check it out.

Comments closed

Thoughts on SQL Notebooks

Emanuele Meazzo takes a look at the current state of SQL Notebooks in Azure Data Studio:

I’ve personally used SQL Notebook in my day-to-day work for Data Analysis, as the possibility to tweak the code and run it in the notebook greatly enhances the presentation of the data as oppose to a commented SQL Script ,as you cannot see all the query results in the same page too as opposed to a notebook; Moreover, a notebook (with or without results) can be exported in a read-only format like html or pdf to share the info with third parties, i.e. you can automate an analysis process that include code to be shared, cool stuff.

I think there are still a few (dozen) things to iron out before it’s a great experience, but they’re on the right path with it. If you haven’t checked out Azure Data Studio and its SQL Notebooks, give it a try sometime.

Comments closed

Sentiment Analysis with Spark on Qubole

Jonathan Day, et al, have a tutorial on using Qubole to build a sentiment analysis model:

This post covers the use of Qubole, Zeppelin, PySpark, and H2O PySparkling to develop a sentiment analysis model capable of providing real-time alerts on customer product reviews. In particular, this model allows users to monitor any natural language text (such as social media posts or Amazon reviews) and receive alerts when customers post extremely nice (high sentiment) or extremely negative (low sentiment) comments about their products.

In addition to introducing the frameworks used, we will also discuss the concepts of embedding spaces, sentiment analysis, deep neural networks, grid search, stop words, data visualization, and data preparation.

Click through for the demo.

Comments closed

Databricks Dashboards

Megan Quinn takes us through building dashboards with Apache Zeppelin on Databricks:

The first step in any type of analysis is to understand the dataset itself. A Databricks dashboard can provide a concise format in which to present relevant information about the data to clients, as well as a quick reference for analysts when returning to a project.

To create this dashboard, a user can simply switch to Dashboard view instead of Code view under the View tab. The user can either click on an existing dashboard or create a new one. Creating a new dashboard will automatically display any of the visualizations present in the notebook. Customization of the dashboard is easily achieved by clicking on the chart icon in the top right corner of the desired command cells to add new elements.

This isn’t quite a step-by-step guide but does spur on ideas.

Comments closed

Building a DMV Diagnostic Queries Notebook

Gianluca Sartori shows how you can use dbatools and Powershell to build a Jupyter notebook in Azure Data Studio for Glenn Berry’s DMV scripts:

For presentations, it is fairly obvious what the use case is: you can prepare notebooks to show in your presentations, with code and results combined in a convenient way. It helps when you have to establish a workflow in your demos that the attendees can repeat at home when they download the demos for your presentation.

For troubleshooting scenarios, the interesting feature is the ability to include results inside a Notebook file, so that you can create an empty Notebook, send it to your client and make them run the queries and send it back to you with the results populated. For this particular usage scenario, the first thing that came to my mind is running the diagnostic queries by Glenn Berry in a Notebook.

Obviously, I don’t want to create such a Notebook manually by adding all the code cells one by one. Fortunately, PowerShell is my friend and can do the heavy lifting for me.

This type of scenario is one of the best ones I see for database administrators: consistent, documented troubleshooting guides. Oh, and you can save results off if you need to review them later. This has the potential to be a killer feature for Azure Data Studio.

Comments closed

Diving Into SQL Notebooks

Rob Sewell tries out Azure Data Studio’s SQL notebooks, currently in preview:

OK, so now that we have the dependencies installed we can create a notebook. I decided to use the ValidationResults database that I use for my dbachecks demos and describe here. I need to restore it from my local folder that I have mapped as a volume to my container. Of course, I use dbatools for this 

Click through to see how to install and use SQL notebooks.

Comments closed

Azure Data Studio and T-SQL Notebooks

Constantine Kokkinos takes us through the preview of T-SQL notebooks in Azure Data Studio:

I have been waiting for word about the new Notebook functionality in Azure Data Studio, and when I heard it was available in the insider build, I jumped in to take a look.

Jupyter Notebook is a web application that allows you to host programming languages, run code (often with different programming languages), return results, annotate your data, and importantly, share the source controlled results with your colleagues.

This is an exciting addition; SQL is a great language to combine with notebooks given the exploratory nature of the language. I’m going to wait until it’s officially out before diving too far into it, though.

Comments closed

Parameters in Rmarkdown Files

Neil Saunders shows how you can parameterize Rmarkdown files, consequently making changes easy later:

The reports follow a common template where the major difference is simply the hashtag. So one way to create these reports is to use the previous one, edit to find/replace the old hashtag with the new one, and save a new file.

That works…but what if we could define the hashtag once, then reuse it programmatically anywhere in the document? Enter Rmarkdown parameters.

The example is small but important.

Comments closed

Building Credit Scorecards

Andre Violante uses SAS to build credit scorecards and analyze credit data:

For this analysis I’m using the SAS Open Source library called SWAT (Scripting Wrapper for Analytics Transfer) to code in Python and execute SAS CAS Action Sets. SWAT acts as a bridge between the python language to CAS Action Sets. CAS Action Sets are synonymous to libraries in Python or packages in R. The one main difference and benefit is that the algorithms within these action sets have been highly parallelized to run on a CAS (Cloud Analytic Services) server. The CAS server is a distributed in-memory engine where I can do all my heavy lifting or computations. The code and Jupyter Notebook are available on GitHub.

Click through for the analysis.

Comments closed

A Functional Approach To PySpark

Tristan Robinson shows us how we can implement a transform function which makes Python code look a little bit more functional:

After a small bit of research I discovered the concept of monkey patching (modifying a program to extend its local execution) the DataFrame object to include a transform function. This function is missing from PySpark but does exist as part of the Scala language already.

The following code can be used to achieve this, and can be stored in a generic wrapper functions notebook to separate it out from your main code. This can then be called to import the functions whenever you need them.

Things which make Python more of a functional language are fine by me. Even though I’d rather use Scala.

Comments closed