Press "Enter" to skip to content

Day: September 1, 2022

Working with List Data in R using rrapply

Joris Chau pivots, unpivots, and unnests data in R:

The previous post showcases the rrapply() function in the minimal rrapply-package as a revised and extended version of base rapply() in the context of nested list recursion in R. For quick data exploration of a nested list it can make sense to keep the list in its original nested format to reduce the number of processing steps and minimize code complexity. As part of a more elaborate data analysis, if there is no specific reason to keep the nested data structure, it is often more practical to transform the nested list into a more convenient rectangular format and work with the unnested object (e.g. a data.frame) instead. In this follow-up post, we review the available (how) options in rrapply() to unnest or melt nested lists into a rectangular format in more detail and highlight the similarities and differences with respect to several common alternatives in R.

There’s a lot going on in this demo-packed post. H/T R-Bloggers.

Comments closed

Creating Multiple Output Files per Spark Task

Dmitry Tolpeko has a quick but helpful post:

It is highly recommended that you try to evenly distribute the work among multiple tasks so every task produces a single output file and job is completed in parallel.

But sometimes it still may be useful when a task generates multiple output files with the limited number of records in each file […]

I had to cut it off right there to keep from spilling the beans here. Click through for Dmitry’s post to see what setting controls records per file, allowing you to keep opening those Spark output files in Excel.

Comments closed

The Importance of Data Shaping

Paul Turley shapes the youth of data:

Power BI is a new tool and dimensional modeling is an old idea. One of the challenges is that, like other modern self-service analytics products on the market, Power BI doesn’t force self-service data jockeys to transform their data before reporting with it. If you want to import a big, wide spreadsheet full of numbers and create charts in a Power BI report, knock yourself out. But, the solution won’t scale and you will inevitably run into walls when you try to make future enhancements. Similar problems arise from importing many tables from different sources and transactional systems. Several tables all chained together with creative mashups and relationships present their own set of problems. The first iteration of such an effort is usually a valuable discovery method and learning experience. Great… treat it as such; take notes, make note of the good parts and then throw it away and start over! In Fredrick Brooks’ “The Mythical Man Month“, he cites that for most engineering projects, the first six attempts should be abandoned before the team will be prepared to start over and complete the work successfully. He was a chemical engineer before working for IBM; and hopefully, our methods in the data engineering business are more effective then his 6-to-1 rule. But, this makes the case the prototypes and proof-of-concept projects are a critical part of the learning path.

The tools don’t make the rules.

Unless you’re talking about the lambda architecture, in which case that’s kind of accurate. But we’re not talking about that here.

Comments closed

PolyBase and Cosmos DB’s Core API

I have some fun integrating the Cosmos DB Core API with PolyBase:

PolyBase comes with a few built-in drivers, including Oracle, Teradata, MongoDB, and SQL Server. For everything else in the 2019 “style” of things, there is a generic ODBC route. In this route, you need to obtain a valid ODBC driver, configure it, and let PolyBase know how to access data from that remote source.

Cosmos DB’s Core API just happens to have a working ODBC driver, so the first step is to grab the relevant version of that driver and install it on the machine running SQL Server.

Read on to see how it works and how you can get around some initial pain points. As a quick note, this only works with SQL Server on Windows, as SQL Server on Linux does not support generic ODBC drivers with PolyBase.

Comments closed

Azure Data Explorer UI Updates

Michal Bar has a couple of posts for us. First, updates to the desktop app Kusto Explorer:

Query Automation allows you to define a workflow that contains a series of queries with rules and logic that govern the order in which they are executed. Automations can be reused, and users can re-run the workflow, to get updated results. Upon completion, the saved Automation produces an analysis report, summarizing all queries results with additional insights.

Then, updates to the ADX web explorer:

It is now possible to embed Azure Data Explorer dashboards in 3rd party apps. This comes on top of allowing embedding of the Monaco editor in 3rd party apps.

Dashboard embedding allows you to easily share data with your customers in a way that allows them to interact and explore it.

Using the various feature flags, you can control the exact controls that will be part of the embedded dashboard experience. For example, you can decide to remove the share, and add connection menu items or others.

To learn more about dashboard embedding, please read this doc Embed dashboards 

Read on for the full changelog.

Comments closed

The Ins and Outs of Contained Availability Groups

Eitan Blumin does some digging:

Notice that all of the highlighted databases and server objects belong to the contained availability group, and all other databases and objects are not visible anymore. This is because our “master” and “msdb” databases are now the contained system databases which are separate from the actual instance system databases.

For more details about contained availability groups, such as interoperability support with other SQL Server features and more, check out the official Microsoft documentation at:

https://docs.microsoft.com/sql/database-engine/availability-groups/windows/contained-availability-groups-overview?view=sql-server-ver16

But there are several things which are not included with contained Availability Groups. click through for that list.

Comments closed

Checking Power BI Licensing Costs

Gilbert Quevauvilliers doesn’t want to waste money:

I recently was assisting a customer with their Power BI licensing and what I found is that in some instances they were having licenses for Power BI Pro and Power BI Premium Per User.

By going through their licenses and assigning the correct license I was able to save the customer approximately 20% on their Power BI licensing costs per month. And over a year this adds up to quite a bit!

This does look to be more confusing than it really ought to be. I’m not sure of any reason why you would want to have Pro + Premium at the same time, so that state should be unrepresentable.

Comments closed

Extracting Multiple Pages from a Website in Power Query

Matt Allington has a new project:

Every now and then when I have a Power BI project of interest to me, I like to create a video of the end to end process of building a new report. This allows me to share some “warts and all” real-world examples of how to go about building a Power BI report. It gives me a chance to show some concepts (such as creating functions and extracting multiple pages from websites) but also to show that these things are seldom smooth and error free.

Click through for a video demonstration of website data extraction and combination in a Power BI report.

Comments closed