2023-03-20 – Curated SQL

Applying Quality Assurance Practices to Data Science

Published 2023-03-20 by Kevin Feasel

The world runs on data. Data scientists organize and make sense of a barrage of information, synthesizing and translating it so people can understand it. They drive the innovation and decision-making process for many organizations. But the quality of the data they use can greatly influence the accuracy of their findings, which directly impacts business outcomes and operations. That’s why data scientists must follow strong quality assurance practices.

Read on for seven practices which can help data scientists achieve better outcomes.

The Legacy of Big Data

Published 2023-03-20 by Kevin Feasel

Adam Bellemare looks back:

Big Data was going to change the way everything worked. We were about to solve every financial, medical, scientific, and social problem known to humankind. All it would take was a great big pile of data and some way to process it all.

But somewhere along the line, the big data revolution just sort of petered out, and today you barely hear anything about big data.

Click through for Adam’s explanation, which is a more detailed form of “Some stuff worked out and became ubiquitous in other ways; others fell off the map.”

But I’m going to snag one more quotation here from Adam:

And finally, big data has shown us that no matter how hard we try, there’s simply no escaping from the inevitable convergence to a full SQL API.

Me: Laughs in Feasel’s Law.

Feasel’s Law – Any sufficiently advanced data retrieval process will eventually have a SQL interface.

Estimating and Managing Pod Spread in AKS

Published 2023-03-20 by Kevin Feasel

Joji Varghese talks pod distribution in Azure Kubernetes Service:

In Azure Kubernetes Service (AKS), the concept of pod spread is important to ensure that pods are distributed efficiently across nodes in a cluster. This helps to optimize resource utilization, increase application performance, and maintain high availability.

This article outlines a decision-making process for estimating the number of Pods running on an AKS cluster. We will look at pod distribution across designated node pools, distribution based on pod-to-pod dependencies and distribution where pod or node affinities are not specified. Finally, we explore the impact of pod spread on scaling using replicas and the role of the Horizontal Pod Autoscaler (HPA). We will close with a test run of all the above scenarios.

Read on for tips, as well as a few web tools, which you can use to estimate and control pod spread in AKS.
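As a rough illustration of the HPA piece, here is a minimal sketch of the replica calculation described in the Kubernetes documentation: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance of 1. The function name, parameter names, and default bounds are my own assumptions for illustration and do not come from Joji's article.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the Horizontal Pod Autoscaler replica calculation.

    Hypothetical helper: the names and the min/max defaults are
    illustrative assumptions, not values taken from the article.
    """
    ratio = current_metric / target_metric
    # Within the tolerance band, the autoscaler leaves the count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured minimum and maximum replica counts.
    return max(min_replicas, min(max_replicas, desired))


# 4 replicas averaging 90% CPU against a 60% target scale out to 6.
print(desired_replicas(4, 90, 60))
```

The clamp is the part that matters for pod spread: however high the metric climbs, the replica count stops at the maximum, so that maximum is the number to plan node capacity around.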

Tips for Using a Data Lakehouse

Published 2023-03-20 by Kevin Feasel

James Serra shares some advice:

As I mentioned in my Data Mesh, Data Fabric, Data Lakehouse presentation, the data lakehouse architecture, where you use a data lake with delta lake as a software layer and skip using a relational data warehouse, is becoming more and more popular. For some customers, I will recommend “Use a data lake until you can’t”. What I mean by this is to take the following steps when building a new data architecture in Azure with Azure Synapse Analytics:

Click through for six notes.

Parallelization in DirectQuery

Published 2023-03-20 by Kevin Feasel

Chris Webb shares some insight:

Recently we announced an important new optimisation for DirectQuery datasets: the ability to run (some) of the queries generated by a single DAX query in parallel. You can read the blog post here:

https://powerbi.microsoft.com/en-za/blog/query-parallelization-helps-to-boost-power-bi-dataset-performance-in-directquery-mode/

A few of us on the Power BI CAT team have tested this out with customers and seen some great results, so I thought I’d write a post illustrating the effect this optimisation can have and explaining when it can and can’t help.

Chris has examples of great success, as well as not-so-great success and utter failure, and explains the why behind each outcome.
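To make the general idea concrete, here is a small sketch of running several independent source queries concurrently instead of one after another. This is not how Power BI implements the feature; `run_queries`, `fake_source`, and `max_workers` are hypothetical names, and the sleep stands in for a round trip to the data source.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_queries(queries, run_query, max_workers=4):
    """Run independent queries concurrently, returning results in input order.

    Hypothetical helper for illustration only.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves the order of the input queries in its results.
        return list(pool.map(run_query, queries))


def fake_source(query):
    # Stand-in for a data source round trip (an assumption for the demo).
    time.sleep(0.1)
    return f"result of {query}"


results = run_queries(["q1", "q2", "q3"], fake_source, max_workers=3)
print(results)
```

The catch is that this only helps when the queries are independent of each other and the source can handle the extra concurrent load, which lines up with the mixed outcomes Chris describes.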

Using Security Groups with Power BI Row-Level Security

Published 2023-03-20 by Kevin Feasel

Soheil Bakhshi has a recommendation for us:

However, managing RLS roles can be challenging if you have a large number of users or if your user base changes frequently. You need to manually assign each user account to one or more roles, which can be time-consuming and error-prone. Moreover, if a user changes their position or leaves the organisation, you must update their role membership accordingly.

This is where Security Groups become handy.

Soheil explains why and then gives us a step-by-step guide on what we can do to use security groups instead.
