
Author: Kevin Feasel

Modern Data Warehousing with Data Lake Storage and Azure Data Factory

Josephine Bush continues a series on modern data warehousing:

In today’s data-driven world, having the right tools to manage and process large datasets is crucial. That’s where Azure Data Lake Storage (ADLS) and Azure Data Factory (ADF) come in handy, making it easier than ever to store and transform your data. In this post, I’ll show you how to set up ADLS to store your Parquet files and configure ADF to manage your data flows efficiently.

Read on for an overview of both technologies.
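If you prefer scripting the storage side, here is a minimal sketch in R using the arrow and AzureStor packages: write a Parquet file locally, then push it into an ADLS Gen2 filesystem. The account URL, access key, and filesystem name are placeholders of mine, not anything from Josephine's post.

```r
# Minimal sketch: write a data frame to Parquet and land it in ADLS Gen2.
# The account URL, access key, and filesystem name below are hypothetical.
library(arrow)
library(AzureStor)

# Write a local Parquet file from a sample data frame
write_parquet(mtcars, "mtcars.parquet")

# Connect to the ADLS Gen2 (dfs) endpoint and pick a filesystem (container)
endp <- adls_endpoint("https://mystorageacct.dfs.core.windows.net",
                      key = "<storage-account-key>")
fs   <- adls_filesystem(endp, "raw-data")

# Upload into a folder that ADF can later use as a source
upload_adls_file(fs, src = "mtcars.parquet",
                 dest = "landing/mtcars.parquet")
```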


R’s Global Regular Expression Function

Steven Sanderson has me wondering who Greg is and why he gets an expression of his own:

If you’ve ever worked with text data in R, you know how important it is to have powerful tools for pattern matching. One such tool is the gregexpr() function. This function is incredibly useful when you need to find all occurrences of a pattern within a string. Today, we’ll dig into how gregexpr() works, explore its syntax, and walk through several examples to make things clear.

Read on to learn more about the global regular expression function and how it works.
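To make the "global" part concrete, here is a quick example of my own (not from Steven's post): gregexpr() returns every match position in each string, and regmatches() extracts the matched text.

```r
# gregexpr() finds ALL matches in each string; regexpr() stops at the first
x <- c("cats and dogs and birds", "nothing to match here")

# Find words ending in "s"
m <- gregexpr("[a-z]+s\\b", x)

m[[1]]           # start positions for string 1, with a match.length attribute

regmatches(x, m) # pull out the matched substrings
#> [[1]] "cats"  "dogs"  "birds"
#> [[2]] character(0)
```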


Preventing Passwords from Getting into GitHub

Eduardo Pivaral does some work:

Healthy code should not include passwords, keys, or secrets in the source code. Sometimes, developers hard-code sensitive information while testing new features but forget to remove it afterward.

How can we validate code without including sensitive information so we can take action before we publish or share code?

Click through for a couple of options. If you do have GitHub Advanced Security (part of GitHub Enterprise Cloud), you can also create a custom pattern for secret scanning that can include passwords, database connection strings, and the like.
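As a toy illustration of the pattern-scanning idea (my own sketch, not one of Eduardo's options, and the regexes are deliberately simplistic), you can sweep a source tree for obvious offenders before committing:

```r
# A rough, illustrative secret scan -- real tools (gitleaks, GitHub secret
# scanning, pre-commit hooks) are far more thorough. Patterns are examples only.
patterns <- c(
  "password\\s*=\\s*['\"].+['\"]",   # hard-coded passwords
  "Server=.*;.*Password=.*",         # connection strings with credentials
  "AKIA[0-9A-Z]{16}"                 # AWS access key ID shape
)

files <- list.files(".", pattern = "\\.(R|r|sql|cs|py)$", recursive = TRUE)

for (f in files) {
  lines <- readLines(f, warn = FALSE)
  for (p in patterns) {
    hits <- grep(p, lines, ignore.case = TRUE)
    if (length(hits) > 0)
      cat(sprintf("%s: possible secret on line(s) %s (pattern: %s)\n",
                  f, paste(hits, collapse = ", "), p))
  }
}
```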


Synchronous and Asynchronous Replication in Postgres

Semab Tariq takes us through a pair of replication options:

In the world of database replication, choosing between synchronous and asynchronous methods can have a big impact on how reliable, consistent, and fast your data is.

This blog dives into what these methods are, how they work, and when you might want to use one over the other. Whether you’re trying to keep your data super safe or just want it to move quickly, we’ll break down everything you need to know about synchronous and asynchronous replication in PostgreSQL.

Read on for a quick overview of streaming replication and the differences between asynchronous and synchronous options.
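If you want to check which mode each standby is actually using, the primary's pg_stat_replication view reports it. Here is a minimal sketch from R, assuming a DBI connection via the RPostgres driver with sufficient privileges; the connection details are placeholders.

```r
# Query the primary for its standbys; sync_state shows 'sync', 'async',
# 'potential', or 'quorum'. Connection details below are hypothetical.
library(DBI)

con <- dbConnect(RPostgres::Postgres(),
                 host     = "primary.example.com",
                 dbname   = "postgres",
                 user     = "monitor",
                 password = "<password>")

dbGetQuery(con, "
  SELECT application_name, client_addr, state, sync_state
  FROM pg_stat_replication;
")

dbDisconnect(con)
```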


Backup Storage Redundancy in Cosmos DB

Manvendra Singh talks about backups:

This article will explain backup storage redundancy for Azure Cosmos DB. Backups are a critical feature for keeping copies of our data, ensuring protection and recoverability in case of accidental deletion, unwanted updates, or any kind of disaster. But simply running backups is not enough. We must also protect those backup copies from accidental deletion or corruption and ensure proper resiliency is in place to keep them safe from unforeseen circumstances. Depending on the criticality of your data, you may want to keep backups locally or replicate them to other locations or regions.

The backup process isn’t the same as with a relational database, but backing up your data is still critical, for the same reasons you’d back up relational data.


Counting Words in a String in R

Steven Sanderson counts the ways:

Counting words in a string is a common task in data manipulation and text analysis. Whether you’re parsing tweets, analyzing survey responses, or processing any textual data, knowing how to count words is crucial. In this post, we’ll explore three ways to achieve this in R: using base R’s strsplit(), the stringr package, and the stringi package. We’ll provide clear examples and explanations to help you get started.

I, of course, would commission a 128-node Hadoop cluster and write a few dozen pages of Java code to get the answer.
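Short of commissioning the Hadoop cluster, here is roughly what the three approaches look like, in a quick sketch along the lines of the post:

```r
library(stringr)
library(stringi)

text <- "The quick brown fox jumps over the lazy dog"

# Base R: split on runs of whitespace and count the pieces
length(strsplit(text, "\\s+")[[1]])
#> 9

# stringr: count runs of non-whitespace characters
str_count(text, "\\S+")
#> 9

# stringi: purpose-built, locale-aware word counter
stri_count_words(text)
#> 9
```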
