2019-02-12 – Curated SQL

Using Cloudera Data Science Workbench with Apache NiFi, we can easily call functions within our deployed models from Apache NiFi as part of flows. I am working against CDSW on HDP (https://www.cloudera.com/documentation/data-science-workbench/latest/topics/cdsw_hdp.html), but it will work for all CDSW regardless of install type.
In my simple example, I built a Python model that uses TextBlob to run sentiment analysis against a passed-in sentence. It returns Sentiment Polarity and Subjectivity, which we can immediately act upon in our flow.
CDSW is extremely easy to work with and I was up and running in a few minutes. For my model, I created a python 3 script and a shell script for install details. Both of these artifacts are available here: https://github.com/tspannhw/nifi-cdsw.

The “no code” portion was less interesting to me than the scalable ML portion, as “no code” either drops into tedium or ends up being replaced by code.

Comments closed

Generating Plots Like The BBC

Published 2019-02-12 by Kevin Feasel

David Smith has some notes on bbplot, a ggplot2 extension the BBC uses for its graphics:

If you’re looking a guide to making publication-ready data visualizations in R, check out the BBC Visual and Data Journalism cookbook for R graphics. Announced in a BBC blog post this week, it provides scripts for making line charts, bar charts, and other visualizations like those below used in the BBC’s data journalism.

I’m still reading through the linked cookbook but it’s a good one.

Comments closed

Generating SSRS Subscription Agent Job Commands

Published 2019-02-12 by Kevin Feasel

Craig Porteous has a quick script to generate T-SQL commands to start and stop SQL Agent jobs tied to Reporting Services subscriptions:

This is a query I would run when I needed to quickly make bulk changes to Reporting Services subscriptions. It’s part of an “emergency fix” toolkit.
Maybe a DB has went down and I have to quickly suspend specific subscriptions or locate Agent jobs for subscriptions. This was always a quick starting point.
I could take the generated Start, Enable and Disable commands and record these in tickets or email threads to demonstrate actions taken. There are other ways to make bulk changes to SSRS subscriptions involving custom queries but this can be run immediately, I don’t have to tailor a WHERE clause first. I also wrote previously on managing subscription failures.

Click through for the script.

Comments closed

Generating Fake Data

Published 2019-02-12 by Kevin Feasel

Rich Benner shows us how to use the Faker library in Python to generate test data:

There are far more options when using Faker. Looking at the official documentation you’ll see the list of different data types you can generate as well as options such as region specific data.
Go have fun trying this, it’s a small setup for a large amount of time saved.

These types of tools can be great for generating a bunch of data but come with a couple of risks. One is that in the fake addresses Rich shows, ZIP codes don’t match their states at all, so if your application needs valid combos, it can cause issues. The other problem comes from distributions: generated data often gets created off of a uniform distribution, so you might not find skewness-related problems (e.g., parameter sniffing issues) strictly in your test data.

That said, easily generating test data is powerful and I don’t want to let the good be the enemy of the great.

Comments closed

Improving The Azure Automated AG Experience

Published 2019-02-12 by Kevin Feasel

Allan Hirt would like to see a few improvements to the experience when creating availability groups on Azure VMs:

What does this mean? To have a supported WSFC-based configuration (doesn’t matter what you are running on it – could be something non-SQL Server), you need to pass validation. xFailOverCluster does not allow this to be run. You can create the WSFC, you just can’t validate it. The point from a support view is that the WSFC has to be vetted before you create it. Could you run it after? Sure, but you still have no proof you had a valid configuration to start with which is what matters. This is a crucial step for all AGs and FCIs, especially since AGs do not check this whereas the installation process for FCIs does.
If you look at MSFT_xCluster, you’ll see what I am saying is true. It builds the WSFC without a whiff of Test-Cluster. To be fair, this can be done in non-Azure environments, too, but Microsoft givs you warnings not to do that for good reason. I understand why Microsoft did it this way. There is currently no tool, parser, or cmdlet to examine the output of Test-Cluster results. This goes back to why building WSFCs is *very* hard to automate.

I’m not sure how easy some of these fixes would be, but they’d definitely be nice.

Comments closed

Installing SQLCMD On Linux

Published 2019-02-12 by Kevin Feasel

James Livingston shows us how to install SQLCMD on Red Hat Enterprise Linux:

Microsoft has decent instructions on installing SQLCMD on linux. My current company blocks access to external yum repositories, so I followed the “Offline” installation. There’s two packages to install:
1) msodbcsql – aka Microsoft’s ODBC driver for SQL Server
2) mssql-tools – SQLCMD

Read on for the step-by-step instructions.

Comments closed

Using Profiler To Get Power Query Timings

Published 2019-02-12 by Kevin Feasel

Chris Webb shows us how we can combine DAX Studio with Profiler in order to time our Power Query operations:

And there you have it, exact timings for each of the Power Query M queries associated with each of the tables in your dataset. Remember that the time taken by each Power Query M query will include the time taken by any other queries that it references, and it does not seem to be possible to find out the amount of time taken by any individual referenced query in Profiler.
There is a lot more interesting information that can be found in this way: for example, dataset refresh performance is not just related to the performance of the Power Query M queries that are used to load data; time is also needed to build all of the structures inside the dataset by the Vertipaq engine once the data has been returned, and Profiler gives you a lot of information on these operations too.

Check it out if you do any work with Power BI.

Comments closed

Implicit Parent Reference On Foreign Keys

Published 2019-02-12 by Kevin Feasel

Deborah Melkin shows us an interesting way of creating foreign keys:

No matter how long you work with something, you can always find something that you never knew before. I found one about foreign keys this week.
I was reviewing SQL scripts for coworkers and I noticed that the foreign keys were written without referencing the parent table’s column. But the script didn’t fail and created the foreign keys correctly. So how did this work?

I don’t think I’ve ever seen this syntax either. I’m not a big fan of it for the same reason that Deborah isn’t a big fan of it: adding a couple more words does clarify your intent, and so add the words.

Comments closed

M	T	W	T	F	S	S
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28

Day: February 12, 2019

No-Code ML On Cloudera Data Science Workbench

Generating Plots Like The BBC

Generating SSRS Subscription Agent Job Commands

Generating Fake Data

Improving The Azure Automated AG Experience

Installing SQLCMD On Linux

Using Profiler To Get Power Query Timings

Implicit Parent Reference On Foreign Keys