Press "Enter" to skip to content

Author: Kevin Feasel

Automating Azure Storage To Move Between Tiers

Ryan Adams built a process to save money on storage costs for a customer’s test environment:

One of the best things about Azure, and the cloud in general, is we can automate most anything, and we are going to look at how to automate Azure VM Storage.  This allows us to come up with some outside-of-the-box solutions.  I had a customer with a road block that we were able to work around by automating some things with their Azure Virtual Machines.

Their challenge was that they wanted to move their test and development environments to Azure, but the storage cost was prohibitive.  They needed premium storage to mimic their production environment, but it was not financially viable for test and development so they were going to keep it all on premises.  During our conversations I learned that they only test between 8am and 5pm, Monday through Friday.  My suggestion was that we put their databases on cheaper storage during off times and only premium when they are actively using it.

This doesn’t look like a one-hour task, but if you’re in need of some cost savings on storage in non-production environments, check out Ryan’s scripts.

Microsoft Data Platform Bug Reporting Links

Brent Ozar has put together a compendium of where you should go if you want to file bug reports or feature requests for different products in the Microsoft data platform space:

Azure Data Studio – open an issue in the Github repo. While you open an issue, Github helps by searching the existing issues as you’re typing, so you’ll find out if there’s already a similar existing issue.

Click through for all of the links. I personally just yell skyward in the hopes that they hear me and fix my problems. It doesn’t work very often so I don’t recommend it as a strategy.

Issues From Using gMSA Accounts with Docker

Michal Poreba shares some lessons from trying to set up Docker and SQL Server to use gMSA accounts:

While in the end I was able to make it work on Windows Server 2016, 1803, 2019, and 1809, I wasted some time trying to make it work with Docker 17.06. Unsuccessfully. Docker 18.09.1 and 18.09.2 worked every time.
There are some reports of intermittent problems with specific OS updates breaking stuff, like the one here, but I wasn’t able to reproduce it. I wonder if the update changes something else that is causing the problems; in other words, is it a problem with the update itself or with the update process?

Read on for several helpful tips, as well as dead ends to avoid.

Script Update Mode Should Be Parallel

Andy Levy explains why he wants script update mode to run in parallel:

When you have 8000+ databases on an instance, this is a huge deal. You’re looking at over two and a half hours just to bring SQL Server online after installing an SP or CU. While the instance is in script update mode, incoming connections are locked down and the service remains in the Starting status. Only the Dedicated Administrator Connection can be used to connect to the instance remotely.

Taking advantage of having a Failover Cluster Instance to patch the passive node in advance is great for minimizing downtime for Windows updates. But whether you have an FCI or not, SQL Server will remain in the “Starting” state until all of your databases have been through this process after installing an update. What was once a 10-minute failover is now a multi-hour ordeal, and maintenance windows become a lot harder to negotiate.

Andy’s pretty far over on the right-hand side of that bell curve, but I like his SQL Server suggestion because even with just a few hundred or a couple thousand databases, you’re still talking about real time savings.
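
If you do find yourself waiting out one of these multi-hour starts, one small thing you can do over the Dedicated Administrator Connection is poll the error log for the upgrade-script messages. This is just a rough sketch (not something from Andy’s post), and the search term is a guess at the wording SQL Server uses, so adjust it to whatever your error log actually shows:

-- Connect over the DAC (admin:YourServer) and read the current error log
-- (log 0, type 1 = SQL Server), filtering for upgrade-related messages.
-- The N'upgrad' search term is an assumption about the message text.
EXEC sp_readerrorlog 0, 1, N'upgrad';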

Improving Spark Auto-Scaling On ElasticMapReduce

Udit Mehrotra explains some of the ways Amazon ElasticMapReduce reduces the pain of node loss in Spark jobs:

The Automatic Scaling feature in Amazon EMR lets customers dynamically scale clusters in and out, based on cluster usage or other job-related metrics. These features help you use resources efficiently, but they can also cause EC2 instances to shut down in the middle of a running job. This could result in the loss of computation and data, which can affect the stability of the job or result in duplicate work through recomputing.

To gracefully shut down nodes without affecting running jobs, Amazon EMR uses Apache Hadoop’s decommissioning mechanism, which the Amazon EMR team developed and contributed back to the community. This works well for most Hadoop workloads, but not so much for Apache Spark. Spark currently faces various shortcomings while dealing with node loss. This can cause jobs to get stuck trying to recover and recompute lost tasks and data, and in some cases can eventually crash the job.

Auto-scaling doesn’t always mean scaling up.

SparkSession and its Component Contexts

The folks at Hadoop in Real World explain the difference between SparkSession, SparkContext, SQLContext, and HiveContext:

SQLContext is your gateway to SparkSQL. Here is how you create a SQLContext using the SparkContext.
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

Once you have the SQLContext, you can start working with DataFrames, Datasets, etc.

Knowing the right entry point is important.

Making Near-Zero Downtime Deployments Easier

I continue my series on developing for near-zero downtime deployments:

By default, SQL Server uses pessimistic locking, meaning that readers can block writers, writers can block readers, and writers can block writers. In most circumstances, you can switch from Read Committed to Read Committed Snapshot Isolation and gain several benefits. RCSI has certainly been in the product long enough to vet the code and Oracle has defaulted to an optimistic concurrency level for as long as I can remember.

The downtime-reducing benefit to using RCSI is that if you have big operations which write to tables, your inserts, updates, and deletes won’t affect end users. End users will see the old data until your transactions commit, so your updates will not block readers. You can still block writers, so you will want to batch your operations—that is, open a transaction, perform a relatively small operation, and commit that transaction. I will go into batching in some detail in a later post in the series, so my intent here is just to prime you for it and emphasize that Read Committed Snapshot Isolation is great.

Now that I have the core concepts taken care of, the next posts in the series move into practical implementation examples with a lot of code.
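
To make that concrete ahead of those posts, here is a minimal sketch of the two ideas above: flipping a database over to Read Committed Snapshot Isolation and batching a large delete into small transactions. The names here (TestDB, dbo.OrderStaging, LoadDate) are placeholders for illustration, not anything from a real system:

-- Turn on RCSI for the database (placeholder name TestDB). By default this
-- waits until no other sessions are using the database, so pick a quiet window.
ALTER DATABASE TestDB SET READ_COMMITTED_SNAPSHOT ON;

-- Batch a big delete into small transactions so each one commits quickly
-- and any writers we do block are only blocked for a moment.
DECLARE @rows INT = 1;

WHILE @rows > 0
BEGIN
    BEGIN TRANSACTION;

    DELETE TOP (5000)
    FROM dbo.OrderStaging
    WHERE LoadDate < DATEADD(DAY, -30, GETUTCDATE());

    SET @rows = @@ROWCOUNT;

    COMMIT TRANSACTION;
END;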

Spark RDDs and DataFrames

Ayush Hooda explains the difference between RDDs and DataFrames:

Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations.

One use of Spark SQL is to execute SQL queries. When running SQL from within another programming language the results will be returned as a Dataset/DataFrame.

Before exploring these APIs, let’s understand the need for these APIs.

I like the piece about RDDs being better at explaining the how than the what.

Searching Complex JSON With SQL Server

Bert Wagner gives us a way that you can quickly search through complicated JSON:

Computed column indexes make querying JSON data fast and efficient, especially when the schema of the JSON data is the same throughout a table.

It’s also possible to break out a well-known complex JSON structure into multiple SQL Server tables.

However, what happens if you have different JSON structures being stored in each row of your database and you want to write efficient search queries against all of the rows of your complex JSON strings?

Bert’s solution is an example of a phenomenon I’ve noticed in relational databases: sometimes, the best solution is not the most straightforward. The most straightforward solution is to take the JSON as-is, but that hits a wall as Bert shows. Reshaping the data leads to much better performance…as long as you’re able to afford the time needed to reshape it and your JSON doesn’t change that frequently.
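
As a quick illustration of the computed column approach Bert mentions up front (the baseline technique, not his solution for rows with differing JSON shapes), here is a minimal sketch; the table and JSON path are made up for the example:

-- A hypothetical table holding one JSON document per row.
CREATE TABLE dbo.Events
(
    EventId   INT IDENTITY(1, 1) PRIMARY KEY,
    EventData NVARCHAR(MAX) NOT NULL
);

-- Promote a frequently-searched property to a computed column and index it,
-- so searches on that property no longer have to parse every row's JSON.
ALTER TABLE dbo.Events
    ADD UserName AS JSON_VALUE(EventData, '$.user.name');

CREATE INDEX IX_Events_UserName
    ON dbo.Events (UserName);

-- Queries can then seek on the indexed computed column instead of scanning.
SELECT EventId, EventData
FROM dbo.Events
WHERE UserName = N'Bert';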

Postgres Tooling: Rant and Recommendations

Ryan Booz is not pleased with the current state of tooling for Postgres:

/* Begin Brief Soapbox*/
Honestly, this is by far one of my biggest gripes about Open Source software now that I’m older, busier, and don’t want to spin my wheels trying to make something simple work. When the tools make it hard to dig in and work effectively with the database, most developers and shops will default to code-first/ORM-only development. In nearly 20 years of software development and leading multiple teams, I’m still surprised how little most developers really care about effectively using a database of any kind. During most interviews only about 30% of applicants can ever answer a few basic SQL questions. And now I think I’m starting to understand why. Most of them have been relegated to an Open Source world with Open Source tooling when it comes to SQL. Yes, it’s cheap and allows projects to spin up quickly, but once those students get past their little pizza ordering app from CompSci 402, they’ll be lost in the real world.
/* End Brief Soapbox */

I completely agree with the tooling point. Having worked with Postgres and MySQL a little bit makes me appreciate Management Studio (for all its flaws) all the more. If you want Azure Data Studio to support Postgres, there’s a GitHub issue that you can vote up.
