Press "Enter" to skip to content

Author: Kevin Feasel

Understanding Data Gateways

James Serra walks us through the different data gateways available in Azure:

On-premises data gateway: Formerly called the enterprise version.  Multiple users can share and reuse a gateway in this mode.  This gateway can be used by Power BI, PowerApps, Microsoft Flow, or Azure Logic Apps.  For Power BI, this includes support for both scheduled refresh and DirectQuery.  To add a data source such as SQL Server that can be used by the gateway, check out Manage your data source – SQL Server.  To connect the gateway to Power BI, you will sign in to Power BI after you install it (see On-premises data gateway in-depth).

Click through for more details on additional gateways.


Pivoting Data

Jana Sattainathan explains the PIVOT operator:

The results are so much easier to look at and comprehend, aren’t they? All object types for a schema are on a single line and it is easy for us to do impact analysis visually.

Sometimes doing it in T-SQL is the best approach, but pivoting is generally cheaper in the application tier, whether you’re building a report, a dashboard, or a web app.
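
As a rough sketch of the idea (not Jana’s exact query), this pivots a per-schema count of a few object types onto a single row per schema:

-- Sketch only: one row per schema, one column per object type.
SELECT SchemaName,
       [USER_TABLE]           AS Tables,
       [VIEW]                 AS Views,
       [SQL_STORED_PROCEDURE] AS Procedures
FROM
(
    SELECT SCHEMA_NAME(o.schema_id) AS SchemaName,
           o.type_desc,
           o.[object_id]
    FROM sys.objects AS o
) AS src
PIVOT
(
    COUNT([object_id])
    FOR type_desc IN ([USER_TABLE], [VIEW], [SQL_STORED_PROCEDURE])
) AS p
ORDER BY SchemaName;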


Checking Last CHECKDB Date Using DBCC PAGE

Wayne Sheffield shows how to get the last time DBCC CHECKDB ran on each database:

The “trick” to making this work is to encapsulate the DBCC command as a string, and to call it with the EXECUTE () function. This is used as part of an INSERT INTO / EXECUTE statement, so that the results from DBCC PAGE are inserted into a table (in this case a temporary table is used, although a table variable or permanent table can also be used). There are three simple steps to this process:

  1. Create a table (permanent / temporary) or table variable to hold the output.

  2. Insert into this table the results of the DBCC PAGE statement by using INSERT INTO / EXECUTE.

  3. Select the data that you are looking for from the table.

Read on for his code as well as important caveats.
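
In the meantime, here is a minimal sketch of the pattern (not Wayne’s exact code; the database name is a placeholder). The boot page of a database is page 9 of file 1, and its dbi_dbccLastKnownGood field holds the last clean CHECKDB date:

-- Step 1: a table to hold the DBCC PAGE ... WITH TABLERESULTS output.
CREATE TABLE #BootPage
(
    ParentObject nvarchar(255),
    [Object]     nvarchar(4000),
    Field        nvarchar(255),
    [VALUE]      nvarchar(4000)
);

-- Step 2: wrap the DBCC command in a string and capture it with INSERT INTO / EXECUTE.
INSERT INTO #BootPage (ParentObject, [Object], Field, [VALUE])
EXECUTE ('DBCC PAGE (''YourDatabase'', 1, 9, 3) WITH TABLERESULTS;');

-- Step 3: select the field you are looking for.
SELECT [VALUE] AS LastKnownGoodCheckDb
FROM #BootPage
WHERE Field = 'dbi_dbccLastKnownGood';

DROP TABLE #BootPage;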


Spark Versus Flink

Sibanjan Das compares Apache Flink to Apache Spark:

The core concept of Apache Flink is a high-throughput, low-latency stream processing framework which also supports batch processing. The architecture is a flip of other Big Data processing architectures, where the primary notion was the batch processing framework. This is something that organizations have been looking for over the last decade. There is a need for platforms supporting low-latency data movement for applications where even a millisecond delay can lead to severe consequences. The prospect of Apache Flink seems significant, and it looks like the goal for stream processing.

While comparing these two, don’t forget about Kafka Streams.  We’ve entered the streaming era for Hadoop & friends, and it’s an exciting time.


Mixed Integer Optimization

David Smith discusses the ompr package in R:

Counterintuitively, numerical optimizations are easiest (though rarely actually easy) when all of the variables are continuous and can take any value. When integer variables enter the mix, optimization becomes much, much harder. This typically happens when the optimization is constrained by a limited selection of objects, for example packages in a weight-limited cargo shipment, or stocks in a portfolio constrained by sector weightings and transaction costs. For tasks like these, you often need an algorithm for a specialized type of optimization: Mixed Integer Programming.

For problems like these, Dirk Schumacher has created the ompr package for R. This package provides a convenient syntax for describing the variables and constraints in an optimization problem. For example, take the classic “knapsack” problem of maximizing the total value of objects in a container subject to its maximum weight limit.
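
For reference, the knapsack problem he mentions is the canonical mixed integer formulation: maximize the total value of the chosen objects subject to the weight cap, with one binary decision variable per object:

\[ \max \sum_{i} v_i x_i \quad \text{subject to} \quad \sum_{i} w_i x_i \le W, \qquad x_i \in \{0, 1\} \]

Here v_i and w_i are the value and weight of object i, and W is the container’s weight limit.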

Read the whole thing.


Understanding HDFS Disk Checks

Xiao Chen explains how the HDFS Disk Checker works for data nodes:

The function of block scanner is to scan block data to detect possible corruptions. Since data corruption may happen at any time on any block on any DataNode, it is important to identify those errors in a timely manner. This way, the NameNode can remove the corrupted blocks and re-replicate accordingly, to maintain data integrity and reduce client errors. On the other hand, we don’t want to utilize too many resources, so that disk I/O can still serve actual requests.

Therefore, the block scanner needs to make sure that suspicious blocks are scanned relatively quickly, and other blocks are scanned every once in a while, at a relatively lower frequency, without significant I/O usage.

This is a nice article for operations folks who own Hadoop clusters.


Using Polybase To Insert Into HDFS

I have a post on writing to HDFS using Polybase:

What’s interesting is that the error message itself is correct, but it could be confusing.  Note that it’s looking for a path with this name, but it isn’t seeing a path; it’s seeing a file with that name.  Therefore, it throws an error.

This proves that you cannot control insertion into a single file by specifying the file at create time.  If you do want to keep the files nicely packed (which is a good thing for Hadoop!), you could run a job on the Hadoop cluster to concatenate all of the results of the various files into one big file and delete the other files.  You might do this as part of a staging process, where Polybase inserts into a staging table and then something kicks off an append process to put the data into the real tables.

Sometime in the future, I plan to see how it scales:  with multiple files writing to a multi-node Hadoop cluster, do I get better write performance with a Polybase scaleout cluster?  And if so, how close to linear scale can I get?
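
To make the setup concrete, here is a hypothetical sketch of a Polybase export; the external data source, file format, table names, and paths are made up for illustration and would need to exist already (aside from the export setting):

-- Export to Hadoop must be enabled once per instance.
EXEC sp_configure 'allow polybase export', 1;
RECONFIGURE;
GO

-- The LOCATION is a folder, not a file: Polybase writes its own files underneath it,
-- which is why you cannot pin the insert to a single named file.
CREATE EXTERNAL TABLE dbo.SalesStaging
(
    SaleID     INT,
    SaleAmount DECIMAL(10, 2)
)
WITH
(
    LOCATION = N'/staging/sales/',
    DATA_SOURCE = HdfsDataSource,      -- hypothetical, pre-existing external data source
    FILE_FORMAT = DelimitedFileFormat  -- hypothetical, pre-existing external file format
);
GO

INSERT INTO dbo.SalesStaging
SELECT SaleID, SaleAmount
FROM dbo.Sales;  -- hypothetical local table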


Operator Elapsed Time

Kendra Little shows off a really cool feature in SQL Server 2016 & 2014 SP2:

SQL Server now shows Actual Elapsed CPU Time and Actual Elapsed Time (duration) for each operator in an Actual Execution Plan

For SQL Server 2016 and 2014 SP2 and higher, actual execution plans contain a bunch of new information on each operator, including how much CPU they burn, how long it takes, and how much IO is done by that operator. This was a little hard to use for a while because the information was only visible in the XML of the execution plan.
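
If you would rather read the numbers straight from the plan rather than clicking through operator properties, one way to capture an actual plan as XML (a trivial example; any query will do) is:

-- Returns the actual execution plan as XML alongside the results.
-- In that XML, each operator's RunTimeCountersPerThread element carries
-- ActualElapsedms and ActualCPUms attributes (SQL Server 2016 / 2014 SP2 and up).
SET STATISTICS XML ON;

SELECT COUNT(*)
FROM sys.objects;

SET STATISTICS XML OFF;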

Check out Kendra’s post for more details, including a couple caveats.


New T-SQL Features

Slava Murygin looks at some new functions in the vNext CTP 1.1:

Since Microsoft introduced XML support in SQL Server, the most common string concatenation technique was the use of FOR XML PATH(''), like this:

SELECT SUBSTRING(
(SELECT ', ' + name FROM master.sys.tables
FOR XML PATH(''))
,3,8000);
GO

Now you can aggregate your strings by using the STRING_AGG function:

SELECT STRING_AGG(name, ', ') FROM master.sys.tables;

Read on for the other three.  This aggregation function, however, would make some of my code a lot simpler and easier to explain to junior database developers.  I just want it to perform well is all.
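
As a hypothetical illustration of the kind of simplification, here is a grouped version that builds one ordered list of tables per schema (the WITHIN GROUP clause handles the ordering):

-- One comma-separated, alphabetized list of table names per schema.
SELECT s.name AS schema_name,
       STRING_AGG(t.name, ', ') WITHIN GROUP (ORDER BY t.name) AS table_list
FROM sys.tables AS t
    INNER JOIN sys.schemas AS s
        ON t.schema_id = s.schema_id
GROUP BY s.name;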
