Author: Kevin Feasel

The problem with self-service BI is that you never quite know what your users are up to. For example, what data sources are they using? Are there hundreds of Excel files being used as data sources for reports that you don’t know about? If so, where are they? Could they and should they be replaced by a database or something else more robust? In this post I’ll show you how you can use Microsoft Flow and the Power BI REST API (see part 1 to find out how to create a Flow custom connector to call the Power BI API) to get the details of all the data sources used in all of the workspaces of your Power BI tenant.

I’ll admit that doing this turned out to be a bit trickier than I had expected. My plan was to use the GetDatasetsAsAdmin endpoint to get a list of all datasets, loop over each one and then call the (undocumented, but in the REST API’s Swagger file and therefore in my custom connector) GetDatsourcesAsAdmin endpoint to get the datasources in each dataset. Both these endpoints require administrative permissions to call, so I made sure my custom connector had the correct permissions (at least Tenant.Read.All – you can check this in the Azure Portal on the app you registered in Azure Active Directory) and I ran the Flow as a user with Power BI Admin permissions. But I kept getting 404 errors when requesting the data sources for certain datasets .

Chris explains why those 404s appear and what you can do about them.

Comments closed

Investigating UK Traffic With Principal Component Analysis

Published 2018-11-01 by Kevin Feasel

Michael Grogan shows us how to use Principal Component Analysis (PCA) to classify route segments in UK transportation data:

Specifically, let us assume that we wish to analyze traffic density for buses and coaches. The main thing we are interested in is the frequency of traffic across a particular route.

Let’s take an example. If buses cover 100 miles on a route that is 5 miles long within a certain timeframe, then the frequency will be greater than 100 miles covered on a route that is 10 miles long over the same time period.

Read on for an interesting example.

Comments closed

Checking Functional Dependencies In R Data Frames

Published 2018-11-01 by Kevin Feasel

John Mount shows us how to use the psagg function in wrapr to ensure that functional dependencies are valid:

Notice only grouping columns and columns passed through an aggregating calculation (such as max()) are passed through (the column zis not in the result). Now because y is a function of x no substantial aggregation is going on, we call this situation a “pseudo aggregation” and we have taught this before. This is also why we made the seemingly strange choice of keeping the variable name y (instead of picking a new name such as max_y), we expect the y values coming out to be the same as the one coming in- just with changes of length. Pseudo aggregation (using the projection y[[1]]) was also used in the solutions of the column indexing problem.

Our wrapr package now supplies a special case pseudo-aggregator (or in a mathematical sense: projection): psagg(). It works as follows.

In this post, John calls the act of grouping functional dependencies (where we can determine the value of y based on the value of x, for any number of columns in y or x) pseudo-aggregation.

Comments closed

Provisioning An Azure SQL Managed Instance

Published 2018-11-01 by Kevin Feasel

Frank Gill walks us through the process of provisioning an Azure SQL Managed Instance:

Once you have created the prerequisites, you are ready to create your first Managed Instance. As of now, Managed Instance is only available in the following subscription types:

Pay-As-You-Go

Enterprise Agreement

Cloud Service Provider

Information about subscription and resource limitations can be found here. I will update this with any changes.

Frank has a series of screenshots to show you the way.

Comments closed

Using datapasta To Paste Spreadsheet Data In R

Published 2018-11-01 by Kevin Feasel

Mara Averick shows us how we can use datapasta with RStudio to create good representative examples when asking questions:

So, you’ve been asked to make a reprex and you want to include a bit of data that you have in a spreadsheet. Meet {datapasta}, a package by Miles McBain that can make your life a whole lot easier. Once you’ve installed datapasta, you simply copy a selected number of rows from your spreadsheet (remember, this is a minimal reproducible example), and click the Paste as tribble option from the DATAPASTA section of the Addins dropdown

Click through for a demo.

Comments closed

Using Polybase External Tables To Connect To Oracle

Published 2018-11-01 by Kevin Feasel

Rajendra Gupta continues his Polybase series:

In part 2 of the series, we saw that the external table could be accessed similarly to a relational database table. One more advantage is that we can join them with any relational tables.

Let us see how we can join the external table with the relational DB tables. I have saved the data into a CSV file so we will import the table using my earlier article, SQL Server Data Import using SQL Operations Studio. Therefore, you can follow the article in the same way in the Azure Data Studio also. I will just give high-level steps to import data from flat file into Azure Data Studio in this article.

Click through for more.

Comments closed

Editing ArcGIS Maps In Power BI

Published 2018-11-01 by Kevin Feasel

Jason Bonello shows us the types of changes we can make to ArcGIS maps in Power BI:

Map themes – This allows a change in the style for the map and once can choose from location only, heatmaps or clustering (the last two are only available for point layers, that is when you select Points in the Location Type). Through the clustering option, one could group individual location points into larger circular clusters that fall within a cluster radius – giving a high level view and then the ability to drill down into each region. If heatmaps are chosen any values in the Size or Color will be ignored and the tooltips will not be available.

Read the whole thing.

Comments closed

“String Or Binary Data Would Be Truncated” Update In SQL Server 2017

Published 2018-11-01 by Kevin Feasel

Randolph West shows us how, in SQL Server 2017 CU 12, we can remove the scourge of “String or binary data would be truncated”:

This is how the error message looks now:

1

2

3

Msg 2628, Level 16, State 6, Procedure ProcedureName, Line Linenumber

String or binary data would be truncated in table ‘%.*ls’, column ‘%.*ls’.

Truncated value: ‘%.*ls’.

Notice how the table, column and value are all mentioned in the error message now, which makes debugging and troubleshooting much easier. Thank you Microsoft!

As of 24 October 2018, we can now get the full picture in SQL Server 2017 as well, provided we install Cumulative Update 12. I’d say this is worth the update in and of itself!

There is a trace flag involved, so check it out.

Comments closed

More Tabular Best Practices

Published 2018-11-01 by Kevin Feasel

Ginger Grant has a few more best practices for working with Analysis Services tabular models:

Modify Timestamps to Split Date and Time

When there is a field where the date and time are both needed, the values should be separated so that there is both a date field and a time field. Having date time in two fields assists in the dictionary encoding as the date and time fields can be separately sorted into columns where the values are the same, decreasing the number of dictionary entries. To further improve compression, only include the seconds if absolutely necessary, as add decreasing the cardinality will increase compression.

Click through for more tips.

Comments closed

Change Data Capture With Databricks Delta

Published 2018-10-31 by Kevin Feasel

Ameet Kini and Denny Lee show how to use Databricks Delta to handle change data capture from different processes:

With Databricks Delta, the CDC pipeline is now streamlined and can be refreshed more frequently: Informatica => S3 => Spark Hourly Batch Job => Delta. In this scenario, Informatica writes change sets directly to S3 using Informatica’s Parquet writer. Databricks jobs run at the desired sub-nightly refresh rate (e.g., every 15 min, hourly, every 3 hours, etc.) to read these change sets and update the target Databricks Delta table.

With minor changes, this pipeline has also been adapted to read CDC records from Kafka, so the pipeline there would look like Kafka => Spark => Delta. In the rest of this section, we elaborate on this process, and how we use Databricks Delta as a sink for their CDC workflows.

With one of our customers, we implemented these CDC techniques on their largest and most frequently refreshed ETL pipeline. In this customer scenario, Informatica writes a change set to S3 for each of its 65 tables that have any changes every 15 minutes. While the change sets themselves are fairly small (< 1000 records), their target tables can become quite large. Out of the 65 tables, roughly half a dozen are in the 50m-100m row range, and the rest are smaller than 50m. In Oracle, this pipeline would have run every 15 minutes, keeping in sync with Informatica. In Databricks Delta, we thought this would take close to an hour due to S3 latencies but ended up being pleasantly surprised with a 30 and even 15-minute refresh period depending on cluster size.

Click through for the rest of the story.

Comments closed