Day: December 22, 2021

DevOps for Databricks

Anna Wykes starts off with bad news:

In this blog series I explore a variety of options available for DevOps for Databricks. This blog will focus on working with the Databricks REST API & Python. Why, you ask? Well, a large percentage of Databricks/Spark users are Python coders. In fact, in 2021 it was reported that 45% of Databricks users use Python as their language of choice. This is a stark contrast to 2013, in which 92% of users were Scala coders:

What is wrong with the world today?

Semi-seriously, though, do read Anna’s post, as it covers a variety of things you can do with the Databricks REST API, including cluster management and monitoring. I might be jumping the gun a bit, but I am a big fan of Gerhard Brueckl’s PowerShell module for Databricks for this kind of work.
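
To give a flavor of the cluster management side, here is a minimal sketch of my own (not code from Anna’s post) that lists clusters through the Clusters API 2.0 `clusters/list` endpoint using only the standard library. The host and token values are placeholders you would supply yourself, and the helper names are made up for illustration.

```python
import json
import urllib.request


def build_clusters_list_request(host: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for the clusters/list endpoint."""
    url = host.rstrip("/") + "/api/2.0/clusters/list"
    return urllib.request.Request(
        url,
        # Personal access tokens are passed as a bearer token.
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def summarize_clusters(payload: dict) -> list:
    """Reduce a clusters/list response body to (cluster_name, state) pairs."""
    return [
        (cluster.get("cluster_name", "<unnamed>"), cluster.get("state", "UNKNOWN"))
        for cluster in payload.get("clusters", [])
    ]


def list_clusters(host: str, token: str) -> list:
    """Call the workspace and return the summarized cluster list."""
    request = build_clusters_list_request(host, token)
    with urllib.request.urlopen(request) as response:
        return summarize_clusters(json.load(response))
```

Calling `list_clusters("https://<your-workspace-host>", "<your-token>")` would hit the workspace. The other two functions are pure, so you can check them without a network connection.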

Working with GraphX in Spark

Tomaz Kastrun continues a series on Spark with a look at GraphX. Part 20 gives an overview of GraphX:

GraphX is Spark’s API component for graph and graph-parallel computations. GraphX uses Spark RDD and builds a graph abstraction on top of RDD. Graph abstraction is a directed multigraph with properties of edges and vertices.

Part 21 shows off the operators available:

Property graphs have a collection of operators that can take user-defined functions and produce new graphs with transformed properties and structure. Core operators are defined in Graph and compositions of core operators are defined as GraphOps, and are automatically available as members of Graph. Each graph representation must provide implementations of the core operations and reuse many of the useful operations that are defined in GraphOps.

Click through for more information on graphs in the Spark ecosystem.
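
If the idea of a property graph with operators feels abstract, here is a small plain-Python sketch of my own. It is not the GraphX API (which is Scala and distributed over RDDs). It just mimics the shape of a directed multigraph with vertex and edge properties, plus simplified stand-ins for three of the operators GraphX calls `mapVertices`, `subgraph`, and `reverse`. The class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class Edge:
    src: int   # source vertex id
    dst: int   # destination vertex id
    attr: Any  # edge property


@dataclass
class PropertyGraph:
    vertices: Dict[int, Any]  # vertex id -> vertex property
    edges: List[Edge]         # a list, so parallel edges are allowed

    def map_vertices(self, f: Callable[[int, Any], Any]) -> "PropertyGraph":
        """Return a new graph with each vertex property transformed by f."""
        new_vertices = {vid: f(vid, attr) for vid, attr in self.vertices.items()}
        return PropertyGraph(new_vertices, list(self.edges))

    def subgraph(self, vpred: Callable[[int, Any], bool]) -> "PropertyGraph":
        """Keep vertices passing vpred and edges whose endpoints both survive."""
        kept = {vid: attr for vid, attr in self.vertices.items() if vpred(vid, attr)}
        edges = [e for e in self.edges if e.src in kept and e.dst in kept]
        return PropertyGraph(kept, edges)

    def reverse(self) -> "PropertyGraph":
        """Return a new graph with every edge direction flipped."""
        flipped = [Edge(e.dst, e.src, e.attr) for e in self.edges]
        return PropertyGraph(dict(self.vertices), flipped)
```

Each operator returns a new graph rather than mutating the old one, which is the same design GraphX follows, since its graphs are immutable.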

Dedicated SQL Pool Index, Distribution, and Partition Guidance

I have a write-up on the specific value of distributions, indexes, and partitions in Azure Synapse Analytics dedicated SQL pools:

Not too long ago, I ended up taking the DP-203 certification exam for sundry reasons. On that exam, they ask a lot about Azure Synapse Analytics, including indexing, distribution, and partitioning strategies. Because these can be a bit different from on-premises SQL Server, I wanted to cover what options are available and when you might choose them. Let’s start with distributions, as that’s the biggest change in thought process.

Read on for the guidance.
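
As a quick illustration of why distribution choice matters, here is a toy Python sketch. A dedicated SQL pool spreads table data across 60 distributions. The hash function below is a SHA-256 stand-in I picked for the demo, not the one the service actually uses, and the round-robin function is a simplification that only captures the "spread evenly regardless of value" idea.

```python
import hashlib
from collections import Counter

DISTRIBUTION_COUNT = 60  # dedicated SQL pools always use 60 distributions


def hash_distribution(key) -> int:
    """Map a distribution-column value to a distribution (illustrative hash only)."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % DISTRIBUTION_COUNT


def round_robin(rows) -> list:
    """Assign rows to distributions in turn, ignoring their values."""
    return [(i % DISTRIBUTION_COUNT, row) for i, row in enumerate(rows)]


def rows_per_distribution(keys) -> Counter:
    """Count how many rows land on each distribution under hash distribution."""
    return Counter(hash_distribution(k) for k in keys)
```

Feeding `rows_per_distribution` a column where every row has the same value puts all of the rows on one distribution, which is the skew problem you try to avoid when choosing a hash column.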

Deploying SQL Server to Azure Container Instance via ARM

Rajendra Gupta builds an ARM template:

The Azure Resource Manager (ARM) template is a JavaScript Object Notation (JSON) file for deploying Azure resources automatically. You can use a declarative syntax to specify the resources and their configurations. Usually, if you need to deploy Azure resources, it might be a tiring experience of navigating through different services and their configurations. With the ARM templates, you no longer need to click and navigate around the portal. For example, you can configure the template for Azure VM or Azure SQL Database deployment.

Click through for a step-by-step walkthrough. I will say, though, that I tend to heavily revise the ARM templates the Azure Portal creates. They tend to make everything a parameter, to the point where you get inundated with context-free decisions.
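
To show what I mean by a pared-down template, here is a rough sketch of my own (not Rajendra’s template) that builds one in Python and leaves only the SA password as a parameter. The container group name, API version, image tag, and sizing are assumptions I chose for the example, so adjust them before using anything like this.

```python
import json


def sql_server_aci_template() -> dict:
    """Build a small ARM template for SQL Server on Azure Container Instances."""
    container = {
        "name": "sqlserver",  # hypothetical container name
        "properties": {
            "image": "mcr.microsoft.com/mssql/server:2019-latest",
            "ports": [{"port": 1433}],
            "environmentVariables": [
                {"name": "ACCEPT_EULA", "value": "Y"},
                # secureValue keeps the password out of the container's visible properties
                {"name": "MSSQL_SA_PASSWORD", "secureValue": "[parameters('saPassword')]"},
            ],
            "resources": {"requests": {"cpu": 2, "memoryInGB": 4}},
        },
    }
    return {
        "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
        "contentVersion": "1.0.0.0",
        # The only decision left to whoever deploys this
        "parameters": {"saPassword": {"type": "securestring"}},
        "resources": [
            {
                "type": "Microsoft.ContainerInstance/containerGroups",
                "apiVersion": "2021-09-01",  # assumed; use a version your subscription supports
                "name": "sql-aci-demo",      # hypothetical container group name
                "location": "[resourceGroup().location]",
                "properties": {
                    "osType": "Linux",
                    "containers": [container],
                    "ipAddress": {
                        "type": "Public",
                        "ports": [{"protocol": "TCP", "port": 1433}],
                    },
                },
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(sql_server_aci_template(), indent=2))
```

Running the script prints JSON you could save as a template file. Everything other than the password is hard-coded on purpose, so the person deploying it faces one question instead of twenty.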

Drawing a Christmas Tree with KQL

Guy Reginiano has a task:

KQL isn’t just super-powerful, it’s also fun!
See how you can draw a tree using KQL and learn some of the functions and operators available.
Inspired by Feel free to design and share your own trees!

I kind of want to make this a Hello World type of exercise, ranking languages by their Christmas Tree Generation Capability Score, or CTGC. Maybe I’ll shorten it to TGC to make it a TLA.
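
In that spirit, here is a Python entry for the hypothetical leaderboard. It is my own sketch and not a port of Guy’s KQL: it builds each row of stars, centers it, and adds a one-character trunk.

```python
def christmas_tree(height: int) -> str:
    """Return an ASCII tree with `height` rows of stars and a single trunk row."""
    width = 2 * height - 1  # width of the widest (bottom) row of stars
    rows = [("*" * (2 * i + 1)).center(width) for i in range(height)]
    rows.append("|".center(width))
    # Drop trailing padding so the output has no trailing whitespace
    return "\n".join(row.rstrip() for row in rows)


if __name__ == "__main__":
    print(christmas_tree(5))
```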

Marking Replication Transactions Complete

Andrea Allred spams the “burn it down” button:

Replication is not my favorite, it is kind of far from my favorite. No further than that. Little further.

When it breaks, it can cause havoc and it always seems to break at the worst time. Recently we noticed that our logfile was massive (like 3 times the size of the database) and that was making many of the other processes painful. We didn’t know how long the log hadn’t been clearing so we got to burn it all (kind of).

The first thing I did was tell replication that we were done with all the transactions that had been committed.

I’d say about 40-50% of the pain of replication is in how difficult it is to troubleshoot. Transactional replication is an order of magnitude easier than merge replication, too, especially on systems of non-trivial size and scale. The single most common question I get is “When will this row be replicated to the other side?” I can’t answer that with merge replication. The second-most common question is, “Why are things slower right now than before?” Can’t answer that either…