Month: January 2024

Fixing Eager Spooling

Erik Darling sends it to the moon:

Probably the most fascinating thing about Eager Index Spools to me is how often the optimizer will insert them into execution plans, often to a query’s detriment.

In a sane world, a non-loop join plan would be chosen, a missing index request would be registered that matches whatever would have been spooled into an index, and we’d all have an easier time.

Read on for a few examples of the problem and two separate ways you can fix it. Remember, kids: friends don’t let friends eagerly spool.

Ensembling Churn Prediction Techniques

Salman Khan gloms together multiple trained models to solve a churn prediction problem:

Historically, this domain has leaned on traditional statistical models, including logistic regression and decision trees. These methodologies sift through historical customer data to identify indicators predictive of future service discontinuation. Although these methods have demonstrated resilience over time, their adequacy is increasingly being questioned. In this regard, ensemble learning emerges as a sophisticated alternative, offering enhanced precision and reliability in identifying potential customer attrition.

Ensemble learning, in turn, distinguishes itself by simultaneously employing multiple predictive models to refine accuracy. This article, thus, aims to elucidate how ensemble learning can revolutionize the approach to churn prediction: we will explore various techniques such as Random Forest, Gradient Boosting, and Stacking, illustrating their efficacy in predicting customer churn through pragmatic examples.

Read on for an introduction to ensemble learning and some high-level tips to keep in mind when ensembling.
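
To make the core idea concrete, here is a minimal sketch of the simplest form of ensembling, hard majority voting. It is not Salman's code: the three "models" are hypothetical hand-written rules standing in for trained classifiers, and the feature names (`tenure_months`, `support_tickets`, `monthly_logins`) are assumptions made up for illustration.

```python
from typing import Callable, Dict, List

Customer = Dict[str, float]
Model = Callable[[Customer], int]  # 1 = predicts churn, 0 = predicts stay


# Hypothetical stand-ins for trained models; each looks at one signal only.
def low_tenure_rule(c: Customer) -> int:
    return 1 if c["tenure_months"] < 6 else 0


def high_support_rule(c: Customer) -> int:
    return 1 if c["support_tickets"] >= 3 else 0


def low_usage_rule(c: Customer) -> int:
    return 1 if c["monthly_logins"] < 4 else 0


def majority_vote(models: List[Model], customer: Customer) -> int:
    """Predict churn only when more than half of the models vote for it."""
    votes = [model(customer) for model in models]
    return 1 if sum(votes) * 2 > len(votes) else 0


MODELS = [low_tenure_rule, high_support_rule, low_usage_rule]

# Illustrative customers, not real data.
new_and_quiet = {"tenure_months": 3, "support_tickets": 0, "monthly_logins": 2}
loyal_but_noisy = {"tenure_months": 24, "support_tickets": 5, "monthly_logins": 20}

print(majority_vote(MODELS, new_and_quiet))    # two of three rules vote churn
print(majority_vote(MODELS, loyal_but_noisy))  # only one rule votes churn
```

A single noisy signal (lots of support tickets) does not flip the ensemble's answer on its own, which is the basic appeal. Random Forest, Gradient Boosting, and Stacking are more sophisticated ways of combining models than this plain vote.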

Running Spark Jobs on Databricks with Spark Connect and .NET

Ed Elliott runs a Databricks job:

This post aims to show how we can create a .NET application, deploy it to Databricks, and then run a Databricks job that calls our .NET code, which uses Spark Connect to run a Spark job on the Databricks job cluster to write some data out to Azure storage.

In the previous post, I showed how to use the Range command to create a Spark DataFrame and then save it locally as a parquet file. In this post, we will use the Sql command, which will return a DataFrame or, in our world, a Relation. We will then pass that relation to a WriteOperation command, which will write the results of the Sql out to Azure storage.

The code is available HERE

Read on for the description of how everything works.

Firewalls and TLS in SQL Server on Linux

I have a new video out:

In this video, we harden our SQL Server instance in two ways: by using a firewall to limit inbound traffic, and by using a certificate to force encrypted connections to SQL Server.

This was a video I enjoyed creating. It also shows the progress of SQL Server security: go back to 2005 (pre-SP1) and even SQL authentication over TDS was unencrypted by default. They fixed it so that the authentication would use a self-signed cert but the data you’d get back from query results was unencrypted. Nowadays, encryption is easy (if you’re okay with a self-signed cert) and some future version of SQL Server will make it mandatory.

Architecting a Public-Facing Azure Container Registry

Kumar Ashwin Hubert and Rajesh Singh share an architecture with us:

This reference architecture describes the deployment of secured Azure Container Registry for consuming docker images and artifacts by customer applications over external (public internet) network.

This architecture builds on Microsoft’s recommended security best practices to expose private applications for external access. It utilizes the ACR’s token and scope map feature to provide granular access control to ACR’s repositories. Also, ACR internally uses the Docker APIs, and it is recommended to be familiar with these concepts before deploying this architecture.

I think this is a great example of the good and the bad of Azure architectures. The good is that you get a thoughtful, well-explained, thorough description of the services you need and how they fit together, and there are a lot of those in the Azure Architecture Center. The bad is that, if I want to secure one container registry, I need a dozen different services. If we didn’t have this particular architecture diagram, I doubt 1 in 50 cloud specialists would come up with all of these services.

Redgate State of the Database Landscape Results

Louis Davidson reviews the results:

Every year, Redgate surveys technologists to ask a big question (through lots of little questions, naturally.) This year’s question was about their current data platform configuration and usage. Just before it was released, I read the results, and I have to say, some of the things I learned amazed me…until I thought a bit more about it.

Read on for what amazed Louis and then check out the survey results yourself.

Databases with Transaction Logs Larger than Data

Jess Pomfret checks database sizes:

This week I needed a query to find any databases where the transaction log is bigger than the total size of the data files. This is a red flag, and can happen for a few reasons that would need further investigation. However, this post is just to share the query, partly for you, and partly for future Jess.

If you do want to read more about why this could happen and how to fix it, Brent has a good post and some queries here: Brent Ozar – Transaction Log Larger than Data File.

Click through for the script and a quick example.
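
For a feel of the comparison involved, here is a rough sketch of the logic in Python. It is not Jess's script. It assumes you have already pulled one row per database file in the shape of `sys.master_files` (database name, `type_desc`, and `size` in 8 KB pages); the function name and sample rows are hypothetical.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

PAGE_KB = 8  # sys.master_files reports file size in 8 KB pages

# One row per file: (database_name, type_desc, size_in_pages), e.g. from
#   SELECT DB_NAME(database_id), type_desc, size FROM sys.master_files;
FileRow = Tuple[str, str, int]


def logs_larger_than_data(rows: Iterable[FileRow]) -> List[Tuple[str, float, float]]:
    """Return (database, data_mb, log_mb) where total log size exceeds total data size."""
    data_pages = defaultdict(int)
    log_pages = defaultdict(int)
    for database, type_desc, pages in rows:
        if type_desc == "LOG":
            log_pages[database] += pages
        elif type_desc == "ROWS":
            data_pages[database] += pages
        # Other file types (e.g. FILESTREAM) are ignored in this sketch.

    flagged = []
    for database in sorted(log_pages):
        if log_pages[database] > data_pages[database]:
            data_mb = data_pages[database] * PAGE_KB / 1024
            log_mb = log_pages[database] * PAGE_KB / 1024
            flagged.append((database, data_mb, log_mb))
    return flagged


# Illustrative rows only, not real measurements.
sample_rows = [
    ("Sales", "ROWS", 1280),
    ("Sales", "LOG", 2560),
    ("HR", "ROWS", 5120),
    ("HR", "LOG", 640),
]

for database, data_mb, log_mb in logs_larger_than_data(sample_rows):
    print(f"{database}: data {data_mb:.0f} MB, log {log_mb:.0f} MB")
```

Summing per database before comparing matters because a database can have several data files; comparing the log against any single data file would over-report.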

Fixing Low-Contrast Gradient Bar Charts in Power BI

Meagan Longoria looks at contrast:

Since conditional formatting was released for Power BI, I have seen countless examples of bar charts that have a gradient color fill. If you aren’t careful about the gradient colors (maybe you just used the default colors), you will end up with poor color contrast. Luckily there are a couple of quick (less than 30 seconds for most people to implement) fixes that can improve your color contrast.

Click through for a video demonstration and two tips from Meagan.
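
If you want a number to go with "poor contrast," one common yardstick is the WCAG contrast ratio. The sketch below is my own illustration, not something taken from Meagan's post; the hex colors are made-up examples, and the 3:1 figure in the comment is the WCAG 2.1 minimum for non-text graphical elements.

```python
def _linearize(channel: int) -> float:
    """Convert an 8-bit sRGB channel to linear light, per the WCAG definition."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a color written as '#RRGGBB'."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(color_a: str, color_b: str) -> float:
    """WCAG contrast ratio, from 1 (identical) to 21 (black on white)."""
    lum_a, lum_b = relative_luminance(color_a), relative_luminance(color_b)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)


# Hypothetical gradient endpoints checked against a white report background.
background = "#FFFFFF"
for bar_color in ("#DDEBF7", "#1F4E79"):
    ratio = contrast_ratio(bar_color, background)
    verdict = "ok" if ratio >= 3.0 else "too low"  # 3:1 is the WCAG 2.1 non-text minimum
    print(f"{bar_color}: {ratio:.2f}:1 ({verdict})")
```

Checking the lightest end of a gradient against the background is usually enough to tell whether the smallest bars will wash out.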

Bootstrapping in TidyDensity

Steven Sanderson pulls us up by the bootstraps:

Imagine this: You have a dataset, say, car mileage (MPG) from the classic mtcars dataset. You want to understand the average MPG, but what if that average is just a mirage? What if it’s skewed by a few outliers or doesn’t capture the full story?

Enter bootstrapping, a statistical technique that’s like taking your data on a wild ride. It creates multiple copies of your data, each with a slight twist, and then calculates the statistic you’re interested in (e.g., average MPG) for each copy. This gives you a distribution of possible averages, revealing the variability and potential biases lurking beneath the surface.

Read on to learn more about bootstrapping in general and how to use the bootstrap_stat_plot() function in TidyDensity.
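
As a language-neutral illustration of the idea (Steven's post uses R and TidyDensity; this is not that API), here is a small Python sketch that resamples the mtcars MPG column with replacement and summarizes the resulting means. The function names, the resample count, and the seed are my own choices.

```python
import random
import statistics
from typing import List, Sequence, Tuple

# The 32 MPG values from the mtcars dataset.
MPG = [
    21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,
    16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4, 30.4, 33.9, 21.5, 15.5,
    15.2, 13.3, 19.2, 27.3, 26.0, 30.4, 15.8, 19.7, 15.0, 21.4,
]


def bootstrap_means(data: Sequence[float], n_resamples: int = 2000, seed: int = 42) -> List[float]:
    """Mean of each resample, where a resample draws len(data) values with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(data)
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)]


def percentile_interval(values: Sequence[float], lower: float = 0.025, upper: float = 0.975) -> Tuple[float, float]:
    """Simple percentile interval read off the sorted bootstrap statistics."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return ordered[int(lower * last)], ordered[int(upper * last)]


means = bootstrap_means(MPG)
low, high = percentile_interval(means)
print(f"sample mean: {statistics.fmean(MPG):.2f}")
print(f"95% bootstrap interval for the mean: {low:.2f} to {high:.2f}")
```

The spread of `means` is the "distribution of possible averages" described above: a wide interval says the sample mean is less stable than a single number suggests.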
