Curated SQL Posts

Creating Multiple Output Files per Spark Task

Dmitry Tolpeko has a quick but helpful post:

It is highly recommended that you try to evenly distribute the work among multiple tasks so every task produces a single output file and the job is completed in parallel.

But sometimes it still may be useful when a task generates multiple output files with the limited number of records in each file […]

I had to cut it off right there to keep from spilling the beans here. Click through for Dmitry’s post to see what setting controls records per file, allowing you to keep opening those Spark output files in Excel.
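As a rough illustration of the idea only (this is not Spark's implementation, and it does not name the setting Dmitry covers), here is a minimal Python sketch of one task rolling over to a new output file whenever the current file reaches a record cap. The function `split_into_files` and the parameter `max_records_per_file` are hypothetical names made up for this example.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def split_into_files(records: Iterable[T], max_records_per_file: int) -> Iterator[List[T]]:
    """Yield one list per simulated output file, each with at most
    max_records_per_file records (hypothetical helper, not a Spark API)."""
    if max_records_per_file <= 0:
        raise ValueError("max_records_per_file must be positive")
    it = iter(records)
    while True:
        # Take the next batch of records; an empty batch means the task is done.
        chunk = list(islice(it, max_records_per_file))
        if not chunk:
            return
        yield chunk


# One task holding 10 records, capped at 4 records per file.
files = list(split_into_files(range(10), 4))
file_sizes = [len(f) for f in files]
```

With a cap at or below Excel's worksheet limit of 1,048,576 rows, every file a task writes stays small enough to open.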

The Importance of Data Shaping

Paul Turley shapes the youth of data:

Power BI is a new tool and dimensional modeling is an old idea. One of the challenges is that, like other modern self-service analytics products on the market, Power BI doesn’t force self-service data jockeys to transform their data before reporting with it. If you want to import a big, wide spreadsheet full of numbers and create charts in a Power BI report, knock yourself out. But, the solution won’t scale and you will inevitably run into walls when you try to make future enhancements. Similar problems arise from importing many tables from different sources and transactional systems. Several tables all chained together with creative mashups and relationships present their own set of problems. The first iteration of such an effort is usually a valuable discovery method and learning experience. Great… treat it as such; take notes, make note of the good parts and then throw it away and start over! In Frederick Brooks’ “The Mythical Man-Month”, he cites that for most engineering projects, the first six attempts should be abandoned before the team will be prepared to start over and complete the work successfully. He was a chemical engineer before working for IBM; and hopefully, our methods in the data engineering business are more effective than his 6-to-1 rule. But, this makes the case that prototypes and proof-of-concept projects are a critical part of the learning path.

The tools don’t make the rules.

Unless you’re talking about the lambda architecture, in which case that’s kind of accurate. But we’re not talking about that here.
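Returning to Paul's point about shaping data before reporting on it, here is a minimal sketch of what that shaping can look like: splitting a wide, spreadsheet-style table into a dimension table and a fact table. The sample rows, column names, and the `shape` function are all assumptions invented for illustration; they do not come from Paul's post.

```python
from typing import Dict, List, Tuple

# Hypothetical wide rows: descriptive attributes repeat on every line.
wide_rows = [
    {"order_id": 1, "customer": "Contoso", "region": "West", "amount": 120.0},
    {"order_id": 2, "customer": "Fabrikam", "region": "East", "amount": 75.5},
    {"order_id": 3, "customer": "Contoso", "region": "West", "amount": 42.0},
]


def shape(rows: List[dict]) -> Tuple[List[dict], List[dict]]:
    """Split wide rows into a customer dimension and a sales fact table."""
    surrogate_keys: Dict[Tuple[str, str], int] = {}
    dim_customer: List[dict] = []
    fact_sales: List[dict] = []
    for row in rows:
        natural_key = (row["customer"], row["region"])
        if natural_key not in surrogate_keys:
            # First time we see this customer: assign the next surrogate key.
            surrogate_keys[natural_key] = len(surrogate_keys) + 1
            dim_customer.append({
                "customer_key": surrogate_keys[natural_key],
                "customer": row["customer"],
                "region": row["region"],
            })
        # The fact row keeps only keys and measures.
        fact_sales.append({
            "order_id": row["order_id"],
            "customer_key": surrogate_keys[natural_key],
            "amount": row["amount"],
        })
    return dim_customer, fact_sales


dim_customer, fact_sales = shape(wide_rows)
```

The descriptive columns now live once per customer in the dimension, and the fact table carries only keys and numbers, which is the shape dimensional models are built around.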

PolyBase and Cosmos DB’s Core API

I have some fun integrating the Cosmos DB Core API with PolyBase:

PolyBase comes with a few built-in drivers, including Oracle, Teradata, MongoDB, and SQL Server. For everything else in the 2019 “style” of things, there is a generic ODBC route. In this route, you need to obtain a valid ODBC driver, configure it, and let PolyBase know how to access data from that remote source.

Cosmos DB’s Core API just happens to have a working ODBC driver, so the first step is to grab the relevant version of that driver and install it on the machine running SQL Server.

Read on to see how it works and how you can get around some initial pain points. As a quick note, this only works with SQL Server on Windows, as SQL Server on Linux does not support generic ODBC drivers with PolyBase.

Azure Data Explorer UI Updates

Michal Bar has a couple of posts for us. First, updates to the desktop app Kusto Explorer:

Query Automation allows you to define a workflow that contains a series of queries with rules and logic that govern the order in which they are executed. Automations can be reused, and users can re-run the workflow, to get updated results. Upon completion, the saved Automation produces an analysis report, summarizing all queries results with additional insights.

Then, updates to the ADX web explorer:

It is now possible to embed Azure Data Explorer dashboards in 3rd party apps. This comes on top of allowing embedding of the Monaco editor in 3rd party apps.

Dashboard embedding allows you to easily share data with your customers in a way that allows them to interact and explore it.

Using the various feature flags, you can control the exact controls that will be part of the embedded dashboard experience. For example, you can decide to remove the share, and add connection menu items or others.

To learn more about dashboard embedding, please read the “Embed dashboards” doc.

Read on for the full changelog.

The Ins and Outs of Contained Availability Groups

Eitan Blumin does some digging:

Notice that all of the highlighted databases and server objects belong to the contained availability group, and all other databases and objects are not visible anymore. This is because our “master” and “msdb” databases are now the contained system databases which are separate from the actual instance system databases.

For more details about contained availability groups, such as interoperability support with other SQL Server features and more, check out the official Microsoft documentation at:

https://docs.microsoft.com/sql/database-engine/availability-groups/windows/contained-availability-groups-overview?view=sql-server-ver16

But there are several things which are not included with contained Availability Groups. Click through for that list.

Checking Power BI Licensing Costs

Gilbert Quevauvilliers doesn’t want to waste money:

I recently was assisting a customer with their Power BI licensing and what I found is that in some instances they were having licenses for Power BI Pro and Power BI Premium Per User.

By going through their licenses and assigning the correct license I was able to save the customer approximately 20% on their Power BI licensing costs per month. And over a year this adds up to quite a bit!

This does look to be more confusing than it really ought to be. I’m not sure of any reason why you would want to have Pro + Premium at the same time, so that state should be unrepresentable.
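A minimal sketch of the kind of check Gilbert describes, assuming you already have an export that maps each user to the set of license names assigned to them. The input shape, the sample users, and the function name are hypothetical; the sketch also assumes, as the post suggests, that holding both licenses is redundant.

```python
from typing import Dict, List, Set

PRO = "Power BI Pro"
PPU = "Power BI Premium Per User"


def users_with_overlapping_licenses(assignments: Dict[str, Set[str]]) -> List[str]:
    """Return users who hold both a Pro and a Premium Per User license."""
    return sorted(
        user
        for user, licenses in assignments.items()
        if {PRO, PPU} <= licenses  # both licenses are present
    )


# Hypothetical export of license assignments.
assignments = {
    "alice@example.com": {PRO},
    "bob@example.com": {PRO, PPU},
    "carol@example.com": {PPU},
}

overlapping = users_with_overlapping_licenses(assignments)
```

Each user in `overlapping` is a candidate for dropping one of the two licenses.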

Extracting Multiple Pages from a Website in Power Query

Matt Allington has a new project:

Every now and then when I have a Power BI project of interest to me, I like to create a video of the end to end process of building a new report. This allows me to share some “warts and all” real-world examples of how to go about building a Power BI report. It gives me a chance to show some concepts (such as creating functions and extracting multiple pages from websites) but also to show that these things are seldom smooth and error free.

Click through for a video demonstration of website data extraction and combination in a Power BI report.
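The pattern Matt mentions (write a function for one page, then invoke it once per page and combine the results) is not specific to Power Query. Here is a small Python sketch of the same shape. The URL scheme, the `fetch_page` callable, and the fake data are assumptions for illustration; a real fetcher would download and parse the page.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, str]


def page_url(base_url: str, page: int) -> str:
    """Build the URL for one page (hypothetical ?page=N scheme)."""
    return f"{base_url}?page={page}"


def extract_pages(
    fetch_page: Callable[[str], List[Row]],
    base_url: str,
    pages: Iterable[int],
) -> List[Row]:
    """Call the per-page function for every page and append the rows."""
    combined: List[Row] = []
    for page in pages:
        for row in fetch_page(page_url(base_url, page)):
            # Tag each row with its page so the combined table stays traceable.
            combined.append({**row, "source_page": str(page)})
    return combined


def fake_fetch(url: str) -> List[Row]:
    """Stand-in for a real download-and-parse step; returns two rows per page."""
    page = url.rsplit("=", 1)[1]
    return [{"title": f"item {page}-1"}, {"title": f"item {page}-2"}]


rows = extract_pages(fake_fetch, "https://example.com/listing", range(1, 4))
```

Passing the fetcher in as an argument keeps the combining logic testable without touching the network.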

Network Analysis in R via netUtils

David Schoch has an R package for us:

During the last 5 years, I have accumulated various scripts with (personal) convenience functions for network analysis and I also implemented new methods from time to time which I could not find in any other package in R. The package netUtils gathers all these functions and makes them available for anyone who may also need to apply “non-standard” network analytic tools. In this post, I will briefly highlight some of the most prominent functions of the package. All available functions are listed in the README on GitHub.

Click through to see what’s available in the package. H/T R-Bloggers.

Developing a Flask App with RStudio Connect

Parisa Gregg crosses the language barrier:

One of the Python applications you can deploy to RStudio Connect is Flask. Flask is a WSGI (Web Server Gateway Interface) web application framework and provides a Python interface to enable the building of web APIs. It is useful to data scientists, for example for building interactive web dashboards and visualisations of data, as well as APIs for machine learning models. Deploying a Flask app to a publishing platform such as RStudio Connect means it can then be used from anywhere and can be easily shared with clients.

This blog post focuses on how to deploy a Flask app to RStudio Connect. We will use a simple example but won’t go into detail on how to create Flask apps. If you are getting started in Flask you may find this tutorial useful.
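For readers unfamiliar with the WSGI interface mentioned above, here is a minimal sketch using only Python's standard library. It is not a Flask app and not Parisa's example; it shows the bare callable contract that WSGI frameworks such as Flask build on. The JSON body and the `captured` dictionary are made up for illustration.

```python
from wsgiref.util import setup_testing_defaults


def app(environ, start_response):
    """A bare WSGI application: receive the request environ, report the
    status and headers, and return an iterable of bytes as the body."""
    body = b'{"status": "ok"}'
    headers = [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


# Call the app directly, the way a WSGI server would, without opening a socket.
environ = {}
setup_testing_defaults(environ)  # fills in a minimal, valid request environ
captured = {}


def start_response(status, headers, exc_info=None):
    # Record what the app reports so we can inspect it afterwards.
    captured["status"] = status
    captured["headers"] = dict(headers)


response_body = b"".join(app(environ, start_response))
```

A framework like Flask wraps this contract with routing, request objects, and response helpers, so you rarely write the callable by hand.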

Read on for a demo.

PolyBase in SQL Server 2022: Cosmos DB via MongoDB API

I have gotten back on the data virtualization wagon:

Back in the 2019 days, I noted a problem when CU2 of SQL Server 2019 came out. This is because the Cosmos DB collection I was using reported a wire version of 2 rather than the minimum version of 3. The official fix at that time was to create a new collection using the then-latest version of 3.6 but that didn’t work for me. My workaround was to use the old MongoDB drivers that shipped with SQL Server 2019 RTM.

Well, as of 2022, that solution won’t work anymore. The original MongoDB drivers don’t ship with SQL Server 2022, so we can’t use that workaround. I had a Cosmos DB account that was originally built on version 3.6. Even after upgrading to server version 4.2, it still reported wire version 2 when I connected to the endpoint that was relevant 3 years ago. Therein lies the solution to the problem.

It turns out there are two viable solutions now and I show both of them.
