Press "Enter" to skip to content

Day: May 4, 2022

Comparing Databricks to Synapse Spark Pools

Corrinna Peters makes comparisons:

There are different cases for using both depending on the specific needs and requirements, Synapse and Databricks are similar, but both have their own areas of specialities or rather areas where they are above the other.

Data Lake – they both allow you to query the data from the data lake, Synapse uses either the SQL on demand pool or Spark and Databricks uses the Databricks workspace once you have mounted the data lake. If you are predominately a SQL user and prefer the code and the BI developer feel then Synapse would be the correct choice whereas if you are a Data Scientist and prefer to code in Python or R then Databricks would feel more at home.

Read on for a nuanced take. My less nuanced take is, Databricks beats the pants off of Synapse Spark pools in terms of performance. Synapse has a much better overall ecosystem, expanding beyond Spark and into T-SQL (in two flavors) and log/event analytics with KQL. If you’re spending 100% of your time in Spark and don’t care about the rest, use Databricks; if Spark is a relatively small part of your warehousing work, use Synapse.

1 Comment

Trying out AutoML in R

JLaw calls a timeout:

In this fourth (and hopefully final) entry in my “Icing the Kicker” series of posts, I’m going to jump back to the first post where I used tidymodels to predict whether or not a kick attempt would be iced. However, this time I see if using the h2o AutoML feature and the SuperLearner package can improve the predictive performance of my initial model.

The results are just about what I would have expected: they provide a good floor but a human with knowledge of the data and skill with techniques can still beat out-of-the-box AutoML processes. Still, knowing what that floor is can help a lot: run some AutoML tool for a few minutes/hours/days and you have an easy way of letting the business side know the expected model quality. If AutoML already exceeds expectations, you’re golden. If AutoML is close to expectations (on either end, just above or just below), you as a skilled human should be able to improve things a bit more, especially once you have a chance to analyze what the AutoML processes did. If AutoML is way below business expectations of quality, perhaps this isn’t the best project to spend time on. H/T R-Bloggers.

Comments closed

Finding the Busiest Database

David Fowler looks for the biggest, meanest database on a SQL Server instance:

I’m sure we’ve all been there at some point, sometimes it’s easy to tell. We might only have a handful of databases on the server with one known to be the heaviest utilised. But sometimes things might not be so obvious, there could be a large numbers of databases or no obvious resource hog.

In those instances we need some way to figure out what how much time each database is spending on the CPU if that’s what we’re interested in or perhaps the total number of page reads or writes if IO is our problem.

Read on for one way to do this, assuming that the instance has been up long enough to give you reliable results.

Comments closed

Transparent Data Encryption in RDS

Tom Collins enables Transparent Data Encryption in AWS’s RDS:

Does AWS RDS SQL Server support built-in SQL Server Transparent Data Encryption (TDE)?  Yes , is the short answer , but there are some limitations  . 

Before I dive into the AWS SQL Server RDS TDW limitations , in this post  we’re discussing SQL Server TDE as opposed to RDS Encryption covering Aurora, MySQL, MariaDB, PostgreSQL, Oracle, SQL Server

Read on for the limitations Tom mentions.

Comments closed

Mistakes to Avoid when Creating Power BI Reports

Reza Rad shares some advice:

A relationship in Power BI is often for when we want a table to filter another table (based on the selection of items in the visuals). This filtering happens in the direction of the relationship. Meaning that if DimCustomer has a one-to-many relationship with the FactSales, and the relationship direction is single directional from the DimCustomer to FactSales, then DimCustomer can filter the FactSales, but not the other way around.

What we see a lot in the Power BI models is that the developer changes the relationship to both-directional so that each of the tables can filter the other table. This looks like a nice feature, but it will come at a big cost.

Click through for five good tips in blog format as well as written form.

Comments closed

Automated Partitioned Table Management

Eitan Blumin automates creation and deletion of partitions in SQL Server:

Before we begin, there are a few “ground rules” we should understand first:

1 – Partition Functions define the partition ranges

This means that whenever we want to eliminate an old partition range or add a new partition range, the PARTITION FUNCTION is the object that we actually need to modify.

Click through for Eitan’s entire process and a couple of scripts. This is an area that SQL Server could have made a lot easier, especially for periodic processes, by including options like “Daily” or “Monthly” or “Weekly(start on Monday)” for intervals rather than making people specify every partition separately.

Comments closed

Things Not to Include in Data Warehouses

Erik Darling compiles a list:

This is a list of things I see in data warehouses that make me physically ill:

– Unique constraints of any kind: Primary Keys, Indexes, etc. Make things unique during your staging process. Don’t make your indexes do that work.

Read on for the full list. I agree with everything except clustered row-store indexes. Those make a lot of sense on dimension tables, tied to the Kimball-style surrogate keys you create in the warehouse itself.

The other part I disagree with is non-clustered columnstore indexes, which I’ve rarely found good use for. Clustrered columnstore indexes are outstanding but the non-clustered variety…meh at best. This answer comes primarily because the pattern I tend to use for warehouse queries is to drive from the fact table, aggregate as much as I can there, and connect to the dimensions for further information at the end. If your warehouse access patterns differ radically from this, you might get more out of non-clustered columnstore indexes. Maybe.

Comments closed