
Day: December 22, 2020

Working with Serverless and Dedicated SQL Pools in Azure Synapse Analytics

Igor Stanko takes us through both dedicated and serverless SQL Pools in Azure Synapse Analytics:

Both serverless and dedicated SQL pools can be used within the same Synapse workspace, providing the flexibility to choose one or both options to cost-effectively manage your SQL analytics workloads. With Azure Synapse, you can use T-SQL to directly query data within a data lake for rapid data exploration and take advantage of the full capabilities of a data warehouse for more predictable and mission-critical workloads. With both query options available, you can choose the most cost-effective option for each of your use cases, resulting in cost savings across your business.

This post explores 2 consumption choices when exercising analytics using Synapse SQL (serverless and dedicated SQL pools) and examines the power and flexibility provided by Azure Synapse when both are used to execute T-SQL workloads. In addition, we will explore options to control cost when using both models.

Click through for details, including hints on minimizing costs.
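To make the serverless side concrete, here is a minimal sketch of querying Parquet files in a data lake directly from a serverless SQL pool with OPENROWSET. The storage path and column names are hypothetical; with serverless you pay per data processed, so filtering and projecting early is part of the cost story.

```sql
-- Sketch only: ad hoc exploration of lake data from a serverless SQL pool.
-- The account, container, and columns below are made up for illustration.
SELECT TOP 100
    s.ProductId,
    SUM(s.Quantity) AS TotalQuantity
FROM OPENROWSET(
        BULK 'https://mydatalake.dfs.core.windows.net/sales/year=2020/*.parquet',
        FORMAT = 'PARQUET'
     ) AS s
GROUP BY s.ProductId;
```

The same T-SQL skills carry over to a dedicated SQL pool once a workload becomes predictable enough to justify provisioned capacity.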

Comments closed

Multiple Slicers and AND Logic

Stephanie Bruno embraces the healing power of AND:

When using slicers in Power BI reports, multiple selections filter data with OR logic. For example, if you have a slicer with products and your visuals are displaying total number of invoices, then when “bicycles” and “helmets” are selected in the products slicer your visual will show the number of invoices that include bicycles OR helmets. But what if you need to have it instead only show the number of invoices that include bicycles AND helmets? Read on to find out how you can do just that with DAX.

Read on for the solution.
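One common shape for the AND-logic measure, sketched here with hypothetical Sales and Products tables (Stephanie's post has the full treatment), is to keep only the invoices whose distinct product count matches the number of products selected in the slicer:

```dax
-- Sketch, assuming Sales (InvoiceID, ProductKey) and a Products table
-- feeding the slicer; all table and column names are hypothetical.
Invoices With All Selected Products =
VAR SelectedCount = COUNTROWS ( VALUES ( Products[ProductName] ) )
RETURN
    COUNTROWS (
        FILTER (
            VALUES ( Sales[InvoiceID] ),
            CALCULATE ( DISTINCTCOUNT ( Sales[ProductKey] ) ) = SelectedCount
        )
    )
```

An invoice with only bicycles matches one of two selected products, fails the equality test, and drops out, which is exactly the AND behavior the default slicer logic lacks.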


Using Scala in a Databricks Notebook

Tomaz Kastrun takes a look at the original Spark language:

Let us start with the Databricks datasets that are available within every workspace, mainly for test purposes. This is nothing new; both Python and R come with sample datasets, for example the Iris dataset available with the base R engine and the Seaborn Python package. The same goes for Databricks: sample datasets can be found in the /databricks-datasets folder.

Click through for the walkthrough and introduction to Scala as it relates to Apache Spark.
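As a taste of what a Scala notebook cell looks like, here is a minimal sketch: `spark`, `display`, and `dbutils` are all provided by the Databricks notebook environment, and the airlines path is one of the stock samples (the exact file layout may differ in your workspace).

```scala
// List the sample datasets that ship with every Databricks workspace
display(dbutils.fs.ls("/databricks-datasets"))

// Read one of them into a DataFrame (path and options are illustrative)
val flights = spark.read
  .option("header", "true")
  .csv("/databricks-datasets/airlines/part-00000")

flights.show(5)
```

The DataFrame API is nearly identical across Scala, Python, and R, so the jump from one notebook language to another is smaller than it first appears.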


More ETL Antipatterns

Tim Mitchell continues a series on ETL anti-patterns. First up is lazy metadata:

Metadata management in ETL processes can be challenging. When exchanging data between systems and across different mediums (unstructured data, flat files, XML, or relational data), the definition of data types can get blurry. Some sources – including unstructured data and flat files – have little if any metadata configuration. On the other end of the spectrum is relational (RDBMS) data, which strictly defines data types and constraints.

Because of the challenges of this exchange process, many developers opt for an approach that I refer to as lazy metadata. In this design pattern, the ETL pipeline is configured very generically, using little if any metadata definition beyond that which is absolutely necessary for data movement.

Read on to see if you suffer from lazy metadata. If so, ask your doctor if Metatol is right for you.
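The contrast is easy to see in staging-table DDL. A hedged illustration (table and column names invented for the example): the lazy pattern lands everything as wide strings, while explicit metadata makes types and nullability surface bad data at load time rather than downstream.

```sql
-- The "lazy metadata" pattern: generic columns, no types, no constraints
CREATE TABLE stg.CustomerLazy (
    Col1 NVARCHAR(MAX),
    Col2 NVARCHAR(MAX),
    Col3 NVARCHAR(MAX)
);

-- Explicit metadata: bad dates and missing IDs fail here, not in reports
CREATE TABLE stg.Customer (
    CustomerId   INT           NOT NULL,
    CustomerName NVARCHAR(200) NOT NULL,
    SignupDate   DATE          NULL
);
```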

Next, the lack of error handling logic:

Even the most robust extract-transform-load process will fail at some point. Even if there are no flaws in the ETL code, there are factors beyond the control of that process – network, authentication, and DNS, to name a few – that could break a load. When building ETL load logic, one must think beyond simply what the successful payload should look like, but what will happen if any component of that load fails.

This is infuriating in how common it is, even among people who know better. Err, please don’t look at my ETL processes…
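A minimal sketch of what that failure-aware load logic can look like in T-SQL (procedure, table, and log names are hypothetical): wrap the load in TRY/CATCH, roll back, record the error, and re-raise so the orchestrator knows the load failed rather than silently succeeding with partial data.

```sql
-- Sketch of a load step that logs failures instead of swallowing them
BEGIN TRY
    BEGIN TRANSACTION;

    INSERT INTO dw.FactSales (InvoiceID, ProductKey, Quantity)
    SELECT InvoiceID, ProductKey, Quantity
    FROM   stg.Sales;

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;

    INSERT INTO etl.LoadErrorLog (LoadName, ErrorMessage, ErrorTime)
    VALUES ('FactSales', ERROR_MESSAGE(), SYSUTCDATETIME());

    THROW;  -- re-raise so the scheduler/orchestrator sees the failure
END CATCH;
```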

Finally, Tim reminds us that ETL logic is source code:

In most data projects, building the extract-transform-load (ETL) logic takes a significant amount of time. Enterprise ETL processes must do several things well: retrieve enough data to satisfy the business needs, apply any needed transformations to that data, and load it to the destination(s) without interruption to any other business processes. The work that goes into building and validating that ETL logic can be significant, making the resulting code a very valuable asset to the enterprise.

However, in my travels I’ve discovered that there’s a lot of ETL code that doesn’t get the kind of care it deserves. Failing to treat ETL logic as source code can be a costly and time-consuming mistake.

I think much of this comes from SSIS building out giant XML files which were essentially binaries. BIML went a long way toward allowing us to treat ETL with SSIS as a proper language and store code in source control, but even if you’re hand-editing SSIS packages, I agree with Tim completely.
