Press "Enter" to skip to content

Curated SQL Posts

Spark Streaming DStreams

Manish Mishra explains the fundamental abstraction of Spark Streaming:

Before going into details of the operations available on the DStream API, let us look at the input sources from which we can start a stream. There are multiple ways in which we can get the inputs, e.g. from Kafka, Flume, etc., or simple idle files. To get the details on the available input sources supported by Spark, you can refer to this section. As part of this blog, we will take the example of Kafka.

Read on to see an example of pulling data from Kafka and converting inputs into microbatches.


Why Root Containers are Troublesome

Andrew Pruski explains to us why it can be bad to have a container user running as root:

Recently I noticed that Microsoft uploaded a new dockerfile to the mssql-docker repository on GitHub. This dockerfile was under the mssql-server-linux-non-root directory and (you guessed it) allows SQL Server containers to run as non-root.

But why is running a container as root bad? Let’s run through an example.

Just as with physical devices and VMs before them, Docker containers can do a lot of damage if you’re logged in as root.


VM Storage Performance in the Cloud

Joey D’Antoni explains how storage architecture has changed from on-prem to the cloud:

This architecture design dates back to when a storage LUN was literally built from a few disks, and we wanted to ensure that there were enough I/O operations per second to service the needs of the SQL Server, because we only had the available I/O of those few disks.

As virtualization became popular, storage architectures changed, and a SAN LUN was carved out into many small extents (typically 512 KB-1 MB, depending on the vendor) across the entire array. What this meant was that with modern storage there was no need to separate logs and data files. Some DBAs still did, but in an on-premises world there was no penalty for it.

It’s important to keep up on these changes.


Converting JSON to Result Sets

Jack Vamvas shows how you can import data in JSON format and get tabular data in SQL Server:

It is possible to read a JSON file using T-SQL. There are a number of different methods. By using the OPENROWSET functionality and the ISJSON and OPENJSON functions, you can quickly read the file, check whether the JSON is valid, and then unpack the JSON into a SQL table.

Read on for an example. This also performs reasonably well in practice, at least in my experience.
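
As a rough sketch of that pattern (the file path, JSON shape, and column names below are hypothetical, and the BULK option of OPENROWSET needs a path the SQL Server service account can read):

-- Pull the whole file into a variable; SINGLE_CLOB returns one column named BulkColumn.
DECLARE @json NVARCHAR(MAX);

SELECT @json = BulkColumn
FROM OPENROWSET(BULK 'C:\data\orders.json', SINGLE_CLOB) AS j;

-- Validate before shredding.
IF ISJSON(@json) = 0
BEGIN
    RAISERROR('File does not contain valid JSON.', 16, 1);
    RETURN;
END;

-- Shred the JSON array into a typed result set.
-- OPENJSON requires database compatibility level 130 or higher.
SELECT OrderID, CustomerName, OrderDate
FROM OPENJSON(@json)
WITH
(
    OrderID      INT           '$.OrderID',
    CustomerName NVARCHAR(100) '$.CustomerName',
    OrderDate    DATE          '$.OrderDate'
);

From there, an INSERT ... SELECT lands the shredded rows in a permanent table.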


Non-Root SQL Server 2019 Containers

Vin Yu announces a change to Microsoft’s container configuration for SQL Server 2019:

The application process within most Docker containers is running as a root user, meaning the process has root privileges within the container user space. The root user within the container is also the same root (uid 0) on the host machine, and if the user can break out of the container, they would have root permissions on the host. Running as root is convenient for development, testing, and CI/CD use cases, but for production use cases, it is safest to run SQL Server as a non-root process within the container. In this blog, we’re going to share with you how you can preview this upcoming improvement by creating your own non-root SQL Server container.

Vin has a quick demonstration of how it works.


Removing Duplicates with Window Functions

Anvesh Patel shows how to find duplicate rows using ROW_NUMBER() and then delete them in a single statement:

Recently, I got one request for one script to delete duplicate records in PostgreSQL.

Most database developers have run into the requirement to delete duplicate records from a database.

As in SQL Server, ROW_NUMBER() with PARTITION BY is also available in PostgreSQL.

Click through for separate solutions for MySQL, SQL Server, and Postgres.
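
For the SQL Server flavor, a minimal sketch looks something like this (the table and columns are made up for illustration; SQL Server allows deleting through a CTE that references a single table):

-- Number each row within its duplicate group, then delete every copy after the first.
WITH NumberedRows AS
(
    SELECT ROW_NUMBER() OVER
           (
               PARTITION BY Email     -- the column(s) that define a duplicate
               ORDER BY CustomerID    -- which copy survives (rn = 1)
           ) AS rn
    FROM dbo.Customers
)
DELETE FROM NumberedRows
WHERE rn > 1;

PostgreSQL does not support deleting through a CTE like this, so the Postgres version typically wraps the same ROW_NUMBER() logic in a subquery and deletes by joining back on ctid.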


SSIS Project Connections

Tim Mitchell shows how we can use project connections in SQL Server Integration Services:

In most use cases, the same connection will be used across multiple packages in the same project. In early versions of SSIS (pre-2012), each package would have its own connection manager for every connection used in that package. Creating and maintaining all those connection managers could be time-consuming as the number of packages grows. Starting with SQL Server 2012, Microsoft added project connections to SSIS, allowing for the creation of connections that were accessible across all packages in a project. Instead of having to create a copy of each connection manager in every package, developers can now simply create the connection at the project level. Project connections will automatically show up in the connection manager tray for all packages in that project.

Click through to see how you can create one and get rid of per-package connections.


Multi-Region Replication with Confluent Platform

David Arthur walks us through multi-region replication of Kafka clusters in the Confluent Platform 5.4 preview:

Running a single Apache Kafka® cluster across multiple datacenters (DCs) is a common, yet somewhat taboo architecture. This architecture, referred to as a stretch cluster, provides several operational benefits and unlocks the door to many use cases. Stretch clusters provide better durability guarantees and make disaster recovery much easier by avoiding the problem of offset translation and restarting clients. However, in order to operate a reliable stretch cluster, datacenters must be relatively close to each other and have a very stable, low-latency, and high-bandwidth connection among the DCs.

This changes with the preview release of Confluent Platform 5.4, which includes multi-region replication built directly into Confluent Server. Now operators can choose to replicate data on a per-region basis, synchronously or asynchronously, per topic. This functionality allows operators to increase data durability and automate client failover in the event of a disaster.

And of course all of those rules about RPO, RTO, etc. apply to this.


Diagnosing TCP SACKs-Related Slowdown in Databricks

Chris Stevens, et al., walk us through troubleshooting a slowdown after using Linux images which have been patched for the TCP SACKs vulnerabilities:

In order to figure out why the straggler task took 15 minutes, we needed to catch it in the act. We reran the benchmark while monitoring the Spark UI, knowing that all but one of the tasks for the save operation would complete within a few minutes. Sorting the tasks in that stage by the Status column, it did not take long for there to be only one task in the RUNNING state. We had found our skewed task and the IP address in the Host column pointed us at the executor experiencing the regression.

This is a nice case study of network troubleshooting, so of course there are Wireshark screenshots in it.


Accessing Data in Azure Data Lake Storage Gen 2

James Serra gives us several methods to access data in Azure Data Lake Storage Gen 2:

With data lakes becoming popular, and Azure Data Lake Store (ADLS) Gen2 being used for many of them, a common question I am asked about is “How can I access data in ADLS Gen2 instead of a copy of the data in another product (i.e. Azure SQL Data Warehouse)?”. The benefits of accessing ADLS Gen2 directly are less ETL, less cost, seeing whether the data in the data lake has value before making it part of ETL, supporting a one-time report, letting a data scientist use the data to train a model, or using a compute solution that points to ADLS Gen2 to clean your data. While these are all valid reasons, you still want to have a relational database (see Is the traditional data warehouse dead?). The trade-off in accessing data directly in ADLS Gen2 is slower performance, limited concurrency, limited data security (no row-level, column-level, dynamic data masking, etc.), and the difficulty in accessing it compared to accessing a relational database.

Since ADLS Gen2 is just storage, you need other technologies to copy data to it or to read data in it.

Read on for the solution.
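
As one illustration of the direct-access approach, here is a sketch of a PolyBase external table over ADLS Gen2 from Azure SQL Data Warehouse; the storage account, container, and table definitions are hypothetical, and the database needs a master key before the credential can be created:

-- Authenticate with the storage account key (for ADLS Gen2, IDENTITY can be any string here).
CREATE DATABASE SCOPED CREDENTIAL AdlsCredential
WITH IDENTITY = 'user', SECRET = '<storage-account-key>';

CREATE EXTERNAL DATA SOURCE AdlsGen2
WITH
(
    TYPE = HADOOP,
    LOCATION = 'abfss://datalake@mystorageaccount.dfs.core.windows.net',
    CREDENTIAL = AdlsCredential
);

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2)
);

-- The external table points at a folder in the lake; no data is copied into the warehouse.
CREATE EXTERNAL TABLE dbo.SalesExternal
(
    SaleID   INT,
    Amount   DECIMAL(18, 2),
    SaleDate DATE
)
WITH
(
    LOCATION = '/raw/sales/',
    DATA_SOURCE = AdlsGen2,
    FILE_FORMAT = CsvFormat
);

-- Query lake data in place, keeping Serra's caveats about performance and concurrency in mind.
SELECT SaleDate, SUM(Amount) AS TotalSales
FROM dbo.SalesExternal
GROUP BY SaleDate;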
