Day: August 26, 2019

Configuring an Azure Databricks Cluster

Jon Gurgul explains cluster settings in Azure Databricks:

We need compute to run our notebooks and this is achieved by creating a cluster. A cluster is merely a number of Virtual Machines behind the scenes used to form this compute resource. The benefit of Azure Databricks is that compute is only chargeable when on.

Let’s now click the Clusters icon and set up a simple cluster. Once you have loaded the page you can use the “Create Cluster” button.

Click through for an explanation of what each of the settings means.
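
As a rough illustration of the kind of settings involved, here is a minimal sketch of a cluster definition written as a Python dictionary, using field names from the Databricks Clusters REST API. The helper name `build_cluster_spec` and the sample values (cluster name, runtime version, VM size, worker counts, timeout) are assumptions for illustration, not taken from Jon's post; check what your own workspace offers.

```python
import json


def build_cluster_spec(name, spark_version, node_type_id,
                       min_workers=1, max_workers=2,
                       autotermination_minutes=30):
    """Build a cluster definition dict (hypothetical helper, sample values)."""
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    return {
        "cluster_name": name,
        "spark_version": spark_version,      # runtime version key
        "node_type_id": node_type_id,        # VM size for the nodes
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        # Shut the cluster down after this many idle minutes so it stops billing.
        "autotermination_minutes": autotermination_minutes,
    }


# Sample values only; these are assumptions, not recommendations.
spec = build_cluster_spec("demo-cluster", "5.5.x-scala2.11", "Standard_DS3_v2")
print(json.dumps(spec, indent=2))
```

The auto-termination setting is the one that ties back to the point about compute only being chargeable while it is on.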

Comments closed

The Basics of Apache Airflow

Divyansh Jain explains what Apache Airflow is and takes us through a sample solution:

Airflow is a platform to programmatically author, schedule, and monitor workflows or data pipelines. These functions are achieved with Directed Acyclic Graphs (DAGs) of tasks. It is open source and still in the incubator stage. It was started in 2014 under the umbrella of Airbnb, and since then it has earned an excellent reputation, with approximately 800 contributors on GitHub and 13,000 stars. The main functions of Apache Airflow are to author, schedule, and monitor workflows.

It’s another interesting product in the Hadoop ecosystem and has additional appeal outside of that space.
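
To make the DAG idea concrete, here is a minimal sketch in plain Python. It does not use Airflow's API; it relies on the standard library's `graphlib` module (Python 3.9+) to work out a valid run order for a set of tasks with dependencies. The task names and the `run_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}


def run_order(deps):
    """Return one valid execution order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)
```

A scheduler such as Airflow does much more than this (retries, scheduling intervals, monitoring), but the core constraint is the same: a task runs only after everything it depends on has finished, and cycles are not allowed.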

Common VM Configuration Issues

Erik Darling takes us through some of the most common issues when setting up VMs to run SQL Server:

Everyone’s on VMs these days. That’s cool. Nothing against computers.

But here’s how people screw up SQL Server on VMs bigtime:

– Fewer than 4 cores (this is a licensing minimum)
– Multiple single-core sockets
– Not fully reserving memory
– Oversubscribing hosts

All four of those are bad and Erik explains why.
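
The four items above can be sketched as a simple checklist in code. This is only an illustration: the `VmConfig` fields and the `find_issues` helper are hypothetical, and treating any allocation of more vCPUs than physical cores as oversubscription is an assumption, not a threshold from Erik's post.

```python
from dataclasses import dataclass


@dataclass
class VmConfig:
    """Hypothetical description of a VM and its host."""
    sockets: int
    cores_per_socket: int
    memory_gb: int
    reserved_memory_gb: int
    host_physical_cores: int
    host_allocated_vcpus: int  # vCPUs handed out across all guests on the host


def find_issues(vm):
    """Return the list of checklist items this configuration trips."""
    issues = []
    if vm.sockets * vm.cores_per_socket < 4:
        issues.append("fewer than 4 cores")
    if vm.sockets > 1 and vm.cores_per_socket == 1:
        issues.append("multiple single-core sockets")
    if vm.reserved_memory_gb < vm.memory_gb:
        issues.append("memory not fully reserved")
    # Assumption: more allocated vCPUs than physical cores counts as oversubscribed.
    if vm.host_allocated_vcpus > vm.host_physical_cores:
        issues.append("host oversubscribed")
    return issues


bad = VmConfig(sockets=2, cores_per_socket=1, memory_gb=64,
               reserved_memory_gb=32, host_physical_cores=16,
               host_allocated_vcpus=40)
good = VmConfig(sockets=1, cores_per_socket=8, memory_gb=64,
                reserved_memory_gb=64, host_physical_cores=16,
                host_allocated_vcpus=16)
print(find_issues(bad))
print(find_issues(good))
```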

Azure SQL DB Serverless

John Morehouse takes us through the current public preview of Azure SQL Database Serverless:

Microsoft continues to expand the Azure ecosystem rapidly. One of the most recently delivered improvements is the Azure SQL Database Serverless option. The new serverless model, which is currently in public preview, provides a compute tier for a single database that scales dynamically with the workload. This new compute tier is optimal for single databases that have unpredictable usage patterns. Previously, you would provision the compute tier for your Azure SQL Database, which gave you more granular control over scalability. Scaling in the serverless tier is effectively controlled by the service itself, which scales when needed.

This looks like it’d be good for a dev server, where the occasional startup cost is fine and there can be significant usage requirement differences based on time—after the devs go home, you’re not likely to need much compute.

Secure Strings in PowerShell

Greg Moore shows how to build out secure strings in PowerShell:

One of the more common problems I’ve faced with a particular client is setting up data extractions from outside sources such as SFTP servers. Since this client deals with PII data, it’s essential that this is done in as secure a manner as possible. For example, all connections need to be encrypted using protocols such as SFTP instead of FTP.

Securely connecting is particularly easy to accomplish if one uses a third-party module such as Posh-SSH. However, you are still stuck with the problem of logging into the remote SFTP server securely.

Keeping credentials secured is something which PowerShell makes pretty easy. I can’t think of a reason why you’d ever need to keep credentials in plaintext here.

Estimated Execution Plans in Azure Data Studio

Dave Bland walks us through the “Explain” button in Azure Data Studio:

At first I was thinking maybe it is a shortcut to help. Nope, it is actually an easy way to get the estimated execution plan. The key word is "estimated": in SQL Server Management Studio, by contrast, we have been able to easily get either the estimated or the actual execution plan.

Just as a reminder, the actual execution plan requires the query to actually run, while the estimated plan uses statistics to generate the plan. Generally, if the actual and estimated plans are captured very close to each other, you are more than likely not going to see a difference between the two.

Now let’s talk about the Explain button. It will return two things: the XML of the execution plan and a graphical representation of the execution plan.

There are a few differences between Azure Data Studio’s implementation of execution plans and SQL Server Management Studio’s.

The State of DevOps for Data Platform Professionals

Kendra Little summarizes the Accelerate: State of DevOps Report 2019 with a focus on what this means for data platform professionals:

While there are a ton of valuable insights in the report, in this post I will focus in on the findings which I believe are most relevant to those of us who work “close to a database.” There are three very interesting aspects of the research which hit close to home:

1. Speed and stability are not tradeoffs
2. Heavy change processes negatively impact speed and stability
3. Communities of practice are a common and successful tool to transform culture

Read on for Kendra’s detailed notes.

DBCC CHECKDB Error on Azure SQL Database

Arun Sirpal explains an error message on Azure SQL Database:

Msg 7928, Level 16, State 1, Line 3
The database snapshot for online checks could not be created. Either the reason is given in a previous error or one of the underlying volumes does not support sparse files or alternate streams. Attempting to get exclusive access to run checks offline.
Msg 8921, Level 16, State 3, Line 3
Check terminated. A failure was detected while collecting facts. Possibly tempdb out of space or a system table is inconsistent. Check previous errors.

Read on to see what this means, as well as what it means for you.