Press "Enter" to skip to content

Curated SQL Posts

Snowflake Purchases Streamlit

Alex Woodie reports on a purchase:

Cloud data warehousing giant Snowflake showed it’s serious about Python and data science this week when it announced that it plans to spend $800 million to buy Streamlit, a provider of Python-based tools for rapidly developing interactive data applications on the Web.

Co-founded in San Francisco in 2018 by Adrien Treuille, Amanda Kelly, and Thiago Teixeira, Streamlit develops an open source framework of the same name that allows data scientists and machine learning engineers to create and deploy data applications. The software is compatible with other Python-based frameworks, such as NumPy, Pandas, Matplotlib, and Scikit-learn, and uses React to render screens on the front-end.

Streamlit is nice. $800 million nice? That’s a good question.

Comments closed

Azure Data Studio Execution Plans

Hugo Kornelis is happy (for now):

But I am not writing this post to moan about past issues. I am writing this post because Microsoft has made huge improvements to execution plan support in ADS. These are officially still in preview, but they are already available. However, you will need to take a few steps to see these improvements in action.

Read on to see what you need to do and to get Hugo’s initial thoughts.

Comments closed

Power BI Decomposition Trees

Gauri Mahajan shows off decomposition trees in Power BI:

A large volume and variety of data generally need data profiling to understand the nature of data. One of the aspects of data is hierarchy and inter-relationships within different attributes in data. Hierarchical data is often nested at multiple levels. To analyze the relationship between different attributes in a data that is hierarchical, drill-down and drill-through are two of the most common techniques that are employed for data exploration as well as use-cases like root cause analysis. While these techniques are standard and have been in the industry for quite a long time, figuring out these relationships and navigating hierarchical data can be a challenging task. Data Analysts or Business Analysts typically perform this analysis on the data before presenting it to the end-users. In certain cases, some domain or business users may be required to perform such analysis on the report itself. In that case, the task becomes even more challenging considering the limited data analysis capabilities offered by a reporting tool compared to a database and query languages like SQL. To help power users perform such analysis on a reporting tool, visualizations like decomposition trees can be used to decompose hierarchical data that is presented in an aggregated manner. The Decomposition tree can support both drill-down as well as drill-through use-cases when the user is provided the flexibility to choose the hierarchy or dimensions on-demand. In the Microsoft technology stack, Power BI is the key reporting tool for authoring reports and supports a wide variety of data sources. Power BI offers a category of visuals which are known as AI visuals. One such visual in this category is the Decomposition Tree.

Read on to see how you can create a decomposition tree, what kind of information it shows, and how you can interact with it to learn more about correlations and causes.

Comments closed

Query Performance Insights on the Serverless SQL Pool

Jovan Popovic shows how you can use the QPI library on an Azure Synapse Analytics serverless SQL pool:

You can find more of the best practices here. These best practices are very important because some issues might cause performance degradation. You might be surprised how applying some of these best practices might improve the performance of your workload.

The last item that is related to schema optimization is sometimes hard to check. You would need to look at your schema, inspect all columns and find what to optimize. If you have a large schema, this might not be an easy task. But you can make your life easier if you use the QPI helper library that can detect schema issues for you.

Read on to see what it can find.

Comments closed

SQLBits Keynote Notes

Brent Ozar shares some thoughts from the first day’s keynote from SQLBits:

Pedro Lopes took the stage to talk about parameter-sensitive plan optimization, aka PSP Optimization. He demoed it with SQL Server 2022 CTP 1.3. I’ve written about this feature before, and there wasn’t anything new here in the demos. My opinion on this feature remains the same: I think it sounds like a phenomenal down payment. It won’t fix parameter sniffing, but I don’t think it’s going to backfire.

Read on for Brent’s thoughts around what Microsoft is doing for SQL Server 2022.

Comments closed

Trials and Tribulations of Maintaining CRAN Packages

John Mount explains the downside cost of CRAN being so useful:

If this automated email from a bulk sender bounces, goes to SPAM, or isn’t responded to quickly: your package will be archived or removed from CRAN. We’ve received these emails, and always acted on them quickly, out of fear.

The referred to check results are often not reproducible. For example, our most recent scare (that hasn’t yet triggered the email, and we have submitted a work-around before complaining here) was just “SUMMARY: processing the following file failed”, without details beyond the name of the failing file.

This is a tricky problem. On the one hand, as an end user of packages, I want packages playing nicely with each other. This is a lot better than Pip’s “Oh, sorry, you need version X but to install version X, it’ll break package Y as it needs < X” nightmare.

On the other hand, as a maintainer of a package, there’s a lot of added effort on a tight timeline for what is usually a volunteer effort.

I don’t have any CRAN packages I maintain and so I tend to be on the beneficiary side of things. But it’s important to keep those package maintainers in mind and one of the easiest ways to do that is to make explicit, reproducable bug reports. It may not make the deadlines more lax but at least that makes maintainers’ lives easier.

Comments closed

Apply Functions in R

Selina Cheng explains how the various apply() functions work:

Today I’m going to talk about a useful family of functions that allows you to repetitively perform a specified function (e.g., sum()mean()) across a vector, list, matrix, or data frame. For those of you familiar with ‘for’ loops, the apply() family often allows you to avoid constructing those and instead wrap the loop into one simple function.

I’m going to discuss the functions apply()lapply()sapply(), and tapply() in this blog post (as well as using the dplyr library for similar tasks). These functions all end in apply() because you apply the function you want across all the specified elements.

Read on to see how these functions work. H/T R-Bloggers.

Comments closed

Summarizing Data Mesh in Azure

Paul Andrew wraps up a series:

When we consider this in the context of what I’ve already established in part 1 of the series, I focused on our data products and ownership. Now I want to re-introduce our data domains as a level above our data products. We can even consider this a hierarchy.

– Data Domains

– Data Products

Why?

Read on for that answer.

Comments closed

Spools, Plus Memory Grants

Erik Darling continues looking at plan operators. Erik starts with spools:

Spools are temporary structures that get stuck over in tempdb. They’re a bit like temp tables, though they don’t have any of the optimizations and enhancements. For example, loading data into a spool is a row-by-row operation.

The structure that spools use varies a bit. Table spools use a “clustered index”, but it’s not built on any of the columns in your data. Index spools use the same thing, but it’s defined on columns in your data that the optimizer thinks would make some facet of the query faster.

Definitely a must-read and a good way of explaining things. In my words, spools aren’t necessarily a problem but if you have a problem, spools are often at the root.

Erik Darling is also Overdrawn at the Memory Bank:

Whoever called memory a “bank” was a smart cookie. Everything you get from RAM is a loan.

In SQL Server, queries can get memory loaned to them while they execute. The most common reasons for memory grants are Sorts and Hashes. You may also see them for an Optimized Nested Loops Join, but whatever.

Memory is such an important aspect of query and overall server performance that it really helps to understand when there’s pressure on it, and where it’s coming from.

Check out both.

Comments closed