Press "Enter" to skip to content

Curated SQL Posts

PolyBase and External Column Names

I have another post looking at external columns on PolyBase V2 data sources:

I’m going to use two external tables in this experiment. In the left corner, we have some ORC files stored in Azure Blob Storage which we’ll represent as FireIncidents2017. In the right corner, we have data stored in a remote SQL Server instance which we’ll call LineItem. The data doesn’t really matter that much, but to give you an idea of where we’re going, I’ll show each table.
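For a sense of what the setup behind that looks like, here’s a rough sketch of defining the two external tables from Python with pyodbc. The connection string, external data source names, file format, and column lists are placeholders rather than the definitions from the post.

import pyodbc

# Placeholder connection string; adjust to your environment.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=Scratch;Trusted_Connection=yes;",
    autocommit=True,
)
cursor = conn.cursor()

# ORC files in Azure Blob Storage. Assumes an external data source named
# AzureBlob and an ORC file format named OrcFormat already exist; the
# column list is invented for illustration.
cursor.execute("""
CREATE EXTERNAL TABLE dbo.FireIncidents2017
(
    IncidentNumber VARCHAR(50),
    IncidentDate   DATE
)
WITH
(
    LOCATION = '/FireIncidents2017/',
    DATA_SOURCE = AzureBlob,
    FILE_FORMAT = OrcFormat
);
""")

# A table on a remote SQL Server instance (PolyBase V2): no file format,
# just a data source pointing at the remote instance and a three-part name.
cursor.execute("""
CREATE EXTERNAL TABLE dbo.LineItem
(
    L_OrderKey BIGINT,
    L_Quantity DECIMAL(12, 2)
)
WITH
(
    LOCATION = 'tpch.dbo.LINEITEM',
    DATA_SOURCE = RemoteSqlServer
);
""")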

There’s quite a bit you can do here.

Comments closed

Validating Errors in A/B Testing

Roland Stevenson shows us how to validate Type I and Type II errors when performing A/B tests in R:

In this post, we seek to develop an intuitive sense of what type I (false-positive) and type II (false-negative) errors represent when comparing metrics in A/B tests, in order to gain an appreciation for “peeking”, one of the major problems plaguing the analysis of A/B tests today.

To better understand what “peeking” is, it helps to first understand how to properly run a test. We will focus on the case of testing whether there is a difference between the conversion rates cr_a and cr_b for groups A and B. We define conversion rate as the total number of conversions in a group divided by the total number of subjects. The basic idea is that we create two experiences, A and B, and give half of the randomly-selected subjects experience A and half B. Then, after some number of users have gone through our test, we measure how many conversions happened in each group. The important question is: how many users do we need to have in groups A and B in order to measure a difference in conversion rates of a particular size?
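The post works through this in R; as a back-of-the-envelope illustration of that last question, here’s the standard two-proportion sample size formula in Python (the rates, alpha, and power below are made-up values, not Roland’s):

# Rough sample-size calculation for detecting a difference between two
# conversion rates (two-sided z-test for proportions). Illustrative only;
# the rates, alpha, and power are made-up values.
from math import ceil
from scipy.stats import norm

cr_a = 0.10    # assumed baseline conversion rate for group A
cr_b = 0.12    # smallest conversion rate for B we want to detect
alpha = 0.05   # type I (false-positive) rate
power = 0.80   # 1 - type II (false-negative) rate

z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(power)

variance = cr_a * (1 - cr_a) + cr_b * (1 - cr_b)
n_per_group = ceil((z_alpha + z_beta) ** 2 * variance / (cr_a - cr_b) ** 2)

print(f"Subjects needed per group: {n_per_group}")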

Read the whole thing. H/T R-Bloggers

Comments closed

Biml Support in Visual Studio Code

Cathrine Wilhelmsen takes us through Biml support in Visual Studio Code:

Please note that you only get syntax highlighting with this extension. You do not get the full Biml or .NET intellisense, the BimlScript preview pane, or the ability to generate SSIS packages from Biml. For those things, you will still need BimlExpress for Visual Studio.

However! If you simply want to view your Biml files in a lightweight editor, the Biml Support extension works beautifully.

It’s not full support, but it’s something.

Comments closed

Database Page Allocations Function

Max Vernon takes us through the sys.dm_db_database_page_allocations Dynamic Management Function:

sys.dm_db_database_page_allocations is an undocumented SQL Server T-SQL Dynamic Management Function. This DMF provides details about allocated pages, allocation units, and allocation extents.

Read on for additional details. Because the function is undocumented, it might change between versions, but it will give you an idea of how things work under the covers.
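For a sense of the call shape, here’s a quick sketch (not from Max’s post; the table name is a placeholder, and since the DMF is undocumented, the exact column list can vary by version):

# Sketch: inspecting page allocations for one table with the undocumented
# sys.dm_db_database_page_allocations DMF. The table name is a placeholder,
# and the available columns may differ between SQL Server versions.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=Scratch;Trusted_Connection=yes;"
)

sql = """
SELECT  dpa.allocated_page_file_id,
        dpa.allocated_page_page_id,
        dpa.page_type_desc,
        dpa.is_allocated
FROM sys.dm_db_database_page_allocations(DB_ID(), OBJECT_ID('dbo.SomeTable'),
                                         NULL, NULL, 'DETAILED') AS dpa
ORDER BY dpa.allocated_page_page_id;
"""

for row in conn.cursor().execute(sql):
    print(row)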

Comments closed

Using Bookmarks for Power BI Filters

Marc Lelijveld continues a series on storytelling with Power BI:

As said, being dynamic is a broad concept. Let’s use the example shown above. As a report author, we can define that the end user should be looking at a top 10 ranking of countries (right side of the report). Since the difference between numbers 9 and 10 in the ranking is so small, you might want to know the difference from number 11 as well. Right now, we can’t see that. We need to change the filter context to see the rest of the ranking.

Click through for a step by step example of what to do.

Comments closed

Managed Instance Challenges

Joey D’Antoni has a few real-world challenges with migrating to Azure SQL Managed Instances:

While DMS is pretty interesting tooling, I had mostly ignored it until recently. Functionally, the tool works pretty well. The problem is it requires a lot of privileges–you have to have someone who can create a service principal and you need to have the following ports open between your source machine and your managed instance:

– 443
– 53
– 9354
– 445
– 12000

While the scope of those firewall rules is limited, in a larger enterprise, explaining why you need port 445 open to anything is going to be challenging.
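If you’re negotiating those rules, a quick check that the ports are actually reachable from the source machine can save some back-and-forth. A minimal sketch (the hostname is a placeholder, and port 53 is normally UDP, so a TCP probe is only approximate):

# Quick reachability check for the ports DMS needs between the source
# machine and the managed instance. The hostname is a placeholder.
import socket

TARGET = "your-managed-instance.database.windows.net"
PORTS = [443, 53, 9354, 445, 12000]

for port in PORTS:
    try:
        with socket.create_connection((TARGET, port), timeout=5):
            print(f"{TARGET}:{port} reachable")
    except OSError as err:
        print(f"{TARGET}:{port} blocked or unreachable ({err})")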

The technology is intriguing, though it does seem like there are still some kinks to work out.

Comments closed

Confluent Platform 5.3

Gaetan Castelein announces Confluent Platform 5.3:

Introducing Confluent Operator for Kubernetes
Kubernetes has become the open source standard for orchestrating containerized applications, but running stateful applications such as Kafka can be difficult and requires a specialized skill set. Thus, we decided to automate the process for you.

For the past few months, we have been working closely with a set of customers and partners as part of a preview program to gather their early feedback. We are now ready to release Confluent Operator, our enterprise-ready implementation of the Kubernetes Operator API to automate deployment and key lifecycle operations of Confluent Platform on Kubernetes.

Looks like they’ve been busy lately.

Comments closed

Stream Processing with Kafka

Satish Sharma has a four-part series covering stream processing with Apache Kafka. Part 1 gives us an overview of Kafka:

Apache Kafka is an open-source distributed stream processing platform originally developed by LinkedIn and later donated to Apache in 2011.

We can describe Kafka as a collection of files, filled with messages that are distributed across multiple machines. Most of what Kafka does revolves around tying these various individual logs together, routing messages from producers to consumers reliably, replicating for fault tolerance, and handling failure gracefully. Its architecture inherits more from storage systems like HDFS, HBase, or Cassandra than it does from traditional messaging systems that implement JMS or AMQP. The underlying abstraction is a partitioned log, essentially a set of append-only files spread over several machines. This encourages sequential access patterns. A Kafka cluster is a distributed system that spreads data over many machines both for fault tolerance and for linear scale-out.
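To see the producer and consumer roles in action against a local broker, here’s a tiny sketch using the kafka-python package rather than the console tools from the series; the broker address and topic name are assumptions:

# Minimal produce/consume round trip against a local Kafka broker, using the
# kafka-python package. Broker address and topic name are assumptions.
from kafka import KafkaConsumer, KafkaProducer

BROKER = "localhost:9092"
TOPIC = "tutorial-topic"

# Producer: append a few messages to the partitioned log.
producer = KafkaProducer(bootstrap_servers=BROKER)
for i in range(3):
    producer.send(TOPIC, key=str(i).encode(), value=f"message {i}".encode())
producer.flush()

# Consumer: read the log back from the beginning.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKER,
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating once the log is drained
)
for record in consumer:
    print(record.partition, record.offset, record.value.decode())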

Part 2 covers terminology and concepts:

Kafka Streams API
Kafka Streams API is a Java library that allows you to build real-time applications. These applications can be packaged, deployed, and monitored like any other Java application — there is no need to install separate processing clusters or similar special-purpose and expensive infrastructures!

The Streams API is scalable, lightweight, and fault-tolerant; it supports both stateless and stateful processing.
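The Streams API itself is Java-only, but the stateless-versus-stateful distinction is easy to picture: a stateless step transforms each record on its own, while a stateful one (such as a running count) has to remember something across records. Here’s a rough Python analog using kafka-python, explicitly not the Streams API:

# Rough analog of a stateful stream-processing step (a running count per
# word) in plain Python with kafka-python. This is NOT the Kafka Streams
# API, which is a Java library; it only illustrates stateless vs. stateful.
from collections import Counter
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "tutorial-topic",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,
)

counts = Counter()  # the "state" a stateful operation has to maintain

for record in consumer:
    word = record.value.decode().strip().lower()  # stateless: per-record transform
    counts[word] += 1                             # stateful: depends on history
    print(word, counts[word])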

Part 3 has you install and configure Kafka:

For quick testing, let’s start a handy console consumer, which reads messages from a specified topic and displays them back on the console. We will use the same consumer to read all of our messages from this point forward. Use the following command:

Linux -> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic tutorial-topic --from-beginning

Windows -> bin\windows\kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic tutorial-topic --from-beginning

Part 4 is forthcoming.

Comments closed

Learning Confluent Kafka

Stephane Maarek has some pointers on where to go to learn Confluent Kafka:

I teach online about Apache Kafka, and a very frequent and recurring question I get is:

How can I learn Confluent Kafka?

Let’s get right to it!

I’ve gone through a couple of Stephane’s Kafka courses and they’re excellent. There’s still a lot more for me to go through, but if you’re interested in learning Kafka, this is a great method.

Comments closed

PolyBase and Azul Zulu OpenJDK

I have a post looking at one of the more interesting changes in SQL Server 2019 CTP 3.2:

One of the more interesting parts of SQL Server 2019 CTP 3.2’s release notes is the relationship between Microsoft and Azul Systems. Travis Wright covers it in some detail, as well as what it means for customers.

Prior to SQL Server 2019 CTP 3.2, installing PolyBase required an installation of Oracle’s Java Runtime Environment 7 Update 51 or higher, either directly from Oracle or through OpenJDK.

Java is still required if you want to read from or write to Hadoop or Azure Blob Storage. Oracle’s flavor of Java is no longer required, however.

Comments closed