
Curated SQL Posts

Why Transaction Logs are Zero-Initialized

Paul Randal explains why the transaction log needs to be zero-initialized before SQL Server starts up:

It’s all to do with crash recovery. SQL Server knows where crash recovery has to start for a database, but not where it ends – i.e. SQL Server does not persist the ‘most recent LSN’ for a database anywhere. This means it has to work out where the end of the log is (and by end, I mean the most recent log record persisted on disk, not the physical end of the log file).

Read on for the detailed explanation.
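
If you want to watch the zeroing happen yourself, here is a minimal sketch (assuming a disposable test instance and a throwaway database name). Trace flag 3004 reports file initialization operations and trace flag 3605 routes that output to the error log:

-- Surface file-initialization messages in the error log.
DBCC TRACEON (3004, 3605, -1);

CREATE DATABASE ZeroInitDemo;   -- hypothetical throwaway database

-- If instant file initialization is enabled, only the log file reports zeroing;
-- the log file is always zero-initialized.
EXEC xp_readerrorlog 0, 1, N'Zeroing';

DBCC TRACEOFF (3004, 3605, -1);
DROP DATABASE ZeroInitDemo;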


Test Those Restores

Bob Pusateri wants you to test your restores, probably right now:

I had a client that was hit by a ransomware virus, encrypting several of their systems including the database server. Not to worry, though, they had “full backups” of all the affected machines, done by a third-party backup utility. After taking a day to cleanse their network, they restored these backups onto their servers. Now it was just a simple matter of bringing all the applications back online, right? Well, not exactly…

Just because a backup completes with no error code doesn’t mean it’s really a successful backup.
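
As a starting point, here is a hedged sketch of an actual restore test (the backup path, database name, and logical file names are all made up; RESTORE FILELISTONLY will tell you the real logical names):

-- Verifying the media is necessary but not sufficient.
RESTORE VERIFYONLY FROM DISK = N'C:\Backups\Sales.bak';

-- The real test: restore the backup somewhere else and check it.
RESTORE DATABASE Sales_RestoreTest
FROM DISK = N'C:\Backups\Sales.bak'
WITH MOVE N'Sales'     TO N'C:\RestoreTest\Sales_RestoreTest.mdf',
     MOVE N'Sales_log' TO N'C:\RestoreTest\Sales_RestoreTest.ldf',
     RECOVERY;

DBCC CHECKDB (Sales_RestoreTest) WITH NO_INFOMSGS;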


Getting the Largest Value Per Group

Erik Darling has a series on getting the highest value using CROSS APPLY. Part 1 covers the no-index route:

Let’s say you wanna get the highest thing. That’s easy enough as a concept.

Now let’s say you need to get the highest thing per user. That’s also easy enough to visualize.

There are a bunch of different ways to choose from to write it.

Part 2 covers the yes-index route:

In this round, row number had a tougher time than other ways to express the logic.

It just goes to show you, not every query is created equal in the eyes of the optimizer.

I don’t think I’m spoiling too much by saying that you really want a good index in place when using CROSS APPLY in this manner.
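
For reference, a rough sketch of the pattern Erik is benchmarking (the Users/Posts tables and columns here are stand-ins): get each user's highest-scoring post with CROSS APPLY and TOP (1), backed by an index that supports the inner ORDER BY.

-- Highest value per group via CROSS APPLY + TOP (1).
SELECT u.UserId, ca.PostId, ca.Score
FROM dbo.Users AS u
CROSS APPLY
(
    SELECT TOP (1) p.PostId, p.Score
    FROM dbo.Posts AS p
    WHERE p.UserId = u.UserId
    ORDER BY p.Score DESC
) AS ca;

-- The sort of supporting index you really want in place:
CREATE INDEX IX_Posts_UserId_Score
    ON dbo.Posts (UserId, Score DESC)
    INCLUDE (PostId);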


SQL Server 2019 CTP 3.2

Asad Khan announces the release of SQL Server 2019 CTP 3.2:

With this release of SQL Server 2019 community technology preview 3.2, we are announcing the public preview of Big Data Clusters for SQL Server 2019. Big Data Clusters for SQL Server enables big data analytics within SQL Server. It brings HDFS and Apache Spark™ into SQL Server for scale-out compute and storage.

Big data clusters allow you to deploy scalable clusters of SQL Server, Apache Spark™, and HDFS running on Kubernetes. It provides all the tools and systems to ingest, store, and prepare data for analysis as well as to train and operationalize machine learning models. It allows you to query external data sources through data virtualization and combine and analyze your high-value relational data with high-volume big data. You will also be able to build and deploy scalable and productive data-driven applications in big data clusters.

There’s a good bit in this release, and because we’re in the 3.x range, you should be able to upgrade directly.
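
To give a flavor of the data virtualization piece, here is a hedged sketch (assuming PolyBase is installed and enabled; the server, credential, and table names are all hypothetical) of querying a remote SQL Server source as if it were a local table:

CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'StrongPasswordHere1!';

CREATE DATABASE SCOPED CREDENTIAL RemoteCred
    WITH IDENTITY = 'remote_user', SECRET = 'RemotePasswordHere1!';

CREATE EXTERNAL DATA SOURCE RemoteSql
    WITH (LOCATION = 'sqlserver://remotehost:1433', CREDENTIAL = RemoteCred);

-- The external table maps to a remote three-part name and queries like a local table.
CREATE EXTERNAL TABLE dbo.RemoteOrders
(
    OrderId INT,
    Amount  DECIMAL(18, 2)
)
WITH (LOCATION = 'SalesDb.dbo.Orders', DATA_SOURCE = RemoteSql);

SELECT TOP (10) * FROM dbo.RemoteOrders ORDER BY Amount DESC;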


Azul Java in SQL Server 2019

Travis Wright announces support for Azul Systems’ Java distribution in SQL Server 2019:

In September 2018, Microsoft announced a new partnership with Azul Systems, a leading Java open source contributor and distributor. This partnership allows for all Azure customers to use Azul’s Zulu for Azure – Enterprise distribution of Java for free with support jointly provided by Microsoft and Azul. That’s right – supported for free.

Today, we are announcing that we have extended that partnership to cover SQL Server. Starting in the SQL Server 2019 community technology preview (CTP) 3.2 that was released today, we are including Azul Systems’ Zulu Embedded right out of the box for all scenarios where Java is used in SQL Server – in PolyBase, Apache Spark, Java extensibility, and more. There is no additional cost beyond what you pay for SQL Server.

This is interesting. We’ll have to see whether the CTP 3.2 installation stops asking for JDK 1.8 and just installs the Azul Systems version.
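
For the Java extensibility piece, the T-SQL calling surface is the familiar external script interface. A rough sketch, with the caveats that the class name pkg.SampleJavaClass and its result shape are hypothetical, the class has to be built against Microsoft’s Java extensibility SDK and deployed where SQL Server can load it, and ‘external scripts enabled’ must be turned on:

EXEC sp_execute_external_script
    @language = N'Java',
    @script = N'pkg.SampleJavaClass',   -- hypothetical class built with the Java SDK
    @input_data_1 = N'SELECT TOP (10) SomeColumn FROM dbo.SomeTable'
WITH RESULT SETS ((SomeColumn NVARCHAR(100)));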


MLflow 1.1 Released

Max Allen, et al., announce the release of MLflow 1.1:

We’re excited to announce today the release of MLflow 1.1. In this release, we’ve focused on fleshing out the tracking component of MLflow and improving visualization components in the UI.

Some of the major features include:
– Automatic logging from TensorFlow and Keras
– Parallel coordinate plots in the tracking UI
– Pandas DataFrame-based search API
– Java Fluent API
– Kubernetes execution backend for MLflow projects
– Search Pagination

Looks like they’re putting in a lot of work on this.

Comments closed

Monitoring Backpressure in Apache Flink

Nico Kruber and Piotr Nowojski explain how you can monitor the flow of your Apache Flink processes:

Probably the most important part of network monitoring is monitoring backpressure, a situation where a system is receiving data at a higher rate than it can process. Such behaviour will result in the sender being backpressured and may be caused by two things:

– The receiver is slow.
This can happen because the receiver is backpressured itself, is unable to keep processing at the same rate as the sender, or is temporarily blocked by garbage collection, lack of system resources, or I/O.

– The network channel is slow.
Even though in such a case the receiver is not (directly) involved, we call the sender backpressured due to a potential oversubscription on network bandwidth shared by all subtasks running on the same machine. Beware that, in addition to Flink’s network stack, there may be more network users, such as sources and sinks, distributed file systems (checkpointing, network-attached storage), logging, and metrics. A previous capacity planning blog post provides some more insights.

Read the whole thing. Backpressure is not unique to Flink; it affects any ETL or streaming operation.


Apache Phoenix on Cloudera

Krishna Maheshwari announces that Cloudera will officially support Apache Phoenix on CDH and its upcoming Cloudera Data Platform:

Cloudera’s CDH releases have included Apache HBase, which provides a resilient, NoSQL DBMS for customers’ operational applications that want to leverage the power of big data. These applications have grown into mission-important and mission-critical applications that drive top-line revenue and bottom-line profitability. These applications include customer-facing applications, e-commerce platforms, risk & fraud detection used behind the scenes at banks, and serving AI/ML models for applications and enabling further reinforcement training of the same based on actual outcomes.

However, for many customers, HBase has been too daunting a journey…

Phoenix is one of my favorite examples of Feasel’s Law in action.
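
To make that point concrete, here is a small sketch of what Phoenix looks like over HBase (the table and columns are invented): ordinary SQL DDL and queries, with UPSERT standing in for INSERT/UPDATE, all backed by HBase storage underneath.

CREATE TABLE IF NOT EXISTS web_events (
    user_id    VARCHAR NOT NULL,
    event_time TIMESTAMP NOT NULL,
    page       VARCHAR,
    CONSTRAINT pk PRIMARY KEY (user_id, event_time)
);

UPSERT INTO web_events (user_id, event_time, page)
VALUES ('u123', NOW(), '/home');

SELECT page, COUNT(*) AS hits
FROM web_events
GROUP BY page;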


Median Calculation with T-SQL

Nisarg Upadhyay shows three ways to calculate the median in T-SQL:

To calculate the median of any dataset, we first need to arrange all values from the dataset in a specific order. After arranging the data, we must determine the middle value of the specified dataset. If the dataset contains an odd number of values, then the middle value of the entire dataset will be considered the median. Following is the formula:

Median (M) = value of the ((X + 1)/2)th item, where X is the number of values in the dataset.

Honestly, CLR’s probably the best approach here if you want a fast calculation for a reasonably large number of rows. Using ML Services and R/Python is another alternative, though the launchpad spinup time will probably make it slower than CLR.
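
If you do stay in T-SQL, a minimal sketch (the table and column names are made up) is PERCENTILE_CONT, which returns the textbook median and averages the two middle values when the row count is even:

-- Median of a single column over the whole table.
SELECT DISTINCT
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY s.Amount)
        OVER () AS MedianAmount
FROM dbo.Sales AS s;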
