Press "Enter" to skip to content


Microsoft Fabric vs Synapse

Nikola Ilic shares some thoughts:

I’ve already introduced Microsoft Fabric in the previous article, so if you’re still not sure what it is all about and why you can think of Fabric as your “data football team”, I strongly encourage you to check that article. Additionally, there are many great articles and videos, both from Microsoft and the community, where you can find out more about Fabric and its various scenarios and components.

In the above-mentioned article, I scratched the surface of the inevitable topic that now comes into focus: “What now for Azure Synapse Analytics?” Since I’ve been asked this exact question multiple times over the past few days, I’ve decided to put down my thoughts and share them in this article.

Read the whole thing. My thoughts, which are generally similar to Nikola’s:

  • There are no plans (at this time) to remove Synapse, and even if there were, prior history—like with Azure SQL DW—says that the deprecation timeframe is something we can measure in years rather than months
  • Fabric is intended to replace Synapse one of these days, and new customers should start with Fabric
  • Current Synapse customers should stay on Synapse for now, especially given that there is currently no easy migration plan. Give partners and Microsoft some time to sort that out, though, and I expect you’ll see tools and products for this by the time Fabric goes GA
  • PaaS and SaaS are quite different and that can be an influential factor. My personal preference is for SaaS, especially knowing how difficult it can be to secure Synapse while still enabling developer functionality
  • We’re on day 4 of Fabric being a thing (at least in public), and it’ll probably be in a public preview for a while, so there’s still plenty of baking left to do

Azure Data Studio 1.44 Released

Erin Stellato has an update for us:

It seems like just last week we were releasing Azure Data Studio 1.43 (oh wait, that was last month) and here we are announcing Azure Data Studio 1.44!

This release comes in the midst of Microsoft Build, where numerous Copilot-adjacent announcements have been shared for different applications. We are excited to announce that the GitHub Copilot extension is now available in the extension gallery with the Azure Data Studio 1.44 release! For complete details and some examples, please see Drew’s post, Introducing GitHub Copilot for Azure Data Studio.

Click through for what’s new.


Private Preview of Native JSON Type for Azure SQL DB

Umachandar Jayachandran (or, as we all know him, UC) makes an announcement:

We are excited to announce the private preview of the native JSON type and the JSON_OBJECTAGG & JSON_ARRAYAGG aggregates in Azure SQL Database. The JSON type will allow you to store JSON documents in a native binary format that is optimized for storage and query performance. The ANSI SQL-compatible JSON aggregates – JSON_OBJECTAGG & JSON_ARRAYAGG – will allow you to aggregate relational data and transform the data into JSON documents in a query.

I do have to admit that the native JSON type came as a bit of a surprise, given that Microsoft had assiduously rejected the notion of introducing one for years, yet here we are. But if there are significant enough performance gains (and there can be, when moving from text to binary JSON), it can be worth it. The XML type also allowed you to create indexes, and that is probably easier to support with a native type.
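
To make the announcement a bit more concrete, here is a speculative sketch of the preview features in use, based on the ANSI forms UC describes. The table and column names are my own invention, and the exact syntax may still shift before general availability:

    -- Native JSON type: stores documents in an optimized binary format
    -- (previously you'd have used nvarchar(max))
    CREATE TABLE dbo.OrderEvents
    (
        OrderID   int  NOT NULL,
        EventData json NOT NULL
    );

    -- JSON_OBJECTAGG: builds one JSON object from key : value pairs
    SELECT JSON_OBJECTAGG(c.CustomerName : c.CustomerID) AS CustomerMap
    FROM dbo.Customers AS c;

    -- JSON_ARRAYAGG: builds one JSON array per group from a value expression
    SELECT o.OrderID, JSON_ARRAYAGG(o.EventData) AS Events
    FROM dbo.OrderEvents AS o
    GROUP BY o.OrderID;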


Adding Count to a Grouped DataFrame in Spark

The Big Data in Real World team does some counting:

We want to group the dataset by Name and get a count to see the employee and the number of projects they are assigned to. In addition to that sub count, we also want to add a column with a total count like below.

One important thing to remember about Spark transformations is that they’re lazy: running df.groupBy(...).agg(...) doesn’t mean the new DataFrame exists yet. Until you call an action such as show(), the original data is still there for the taking, which is how you can reference the original DataFrame again later in the chained statement.
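
For a sense of the target output, here is a minimal sketch of the same pattern expressed in SQL (which also runs as-is in Spark SQL) rather than the DataFrame API the post uses; the employee_projects table, with one row per employee/project assignment and a name column, is a hypothetical stand-in for their dataset:

    -- Per-employee project counts, plus a grand-total column on every row;
    -- the windowed SUM over the grouped COUNT produces both in one pass
    SELECT
        name,
        COUNT(*)              AS project_count,
        SUM(COUNT(*)) OVER () AS total_count
    FROM employee_projects
    GROUP BY name;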


Contrasting Kafka and Pulsar

Tessa Burk performs a comparison:

Apache Kafka® and Apache Pulsar™ are 2 popular message broker software options. Although they share certain similarities, there are big differences between them that impact their suitability for various projects.  

In this comparison guide, we will explore the functionality of Kafka and Pulsar, explain the differences between the software, who would use them, and why.  

Click through for that comparison. I haven’t used Pulsar before, so it’s interesting to get this sort of functionality and community comparison.


Holding and Pruning WhoIsActive Results

Andrea Allred has a script for us:

Last month I talked about how I started collecting data from sp_whoisactive. The long term goal was to analyze and tune those long running processes that are sometimes hard to quantify in Query Store. We had started collecting the data in a table (whoisactiveLog), but wanted to make a simple table that our Engineers could refer to and find the long running processes. We also wanted to archive off the whoisactiveLog into another table and save that for 70 days in case it was needed for further research on the tuning of processes. Each night, we have a break in our processes where we can do this maintenance so we decided we would empty the whoisactiveLog table completely.

There’s also a check for session length in there.
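
As a rough sketch of the maintenance pattern Andrea describes (archive the log, prune the archive to 70 days, then empty the log during the nightly break), something like this would do it. The archive table name and the collection_time column are assumptions on my part, not her actual schema:

    -- Copy the captured activity into the long-term archive
    INSERT INTO dbo.whoisactiveArchive
    SELECT *
    FROM dbo.whoisactiveLog;

    -- Keep 70 days of history for follow-up tuning research
    DELETE FROM dbo.whoisactiveArchive
    WHERE collection_time < DATEADD(DAY, -70, GETDATE());

    -- Nightly processing break: empty the active log completely
    TRUNCATE TABLE dbo.whoisactiveLog;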


Microsoft Fabric for the Power BI Practitioner

Kurt Buhler provides a nice graphic:

I’m just writing this quick article to share a visual overview I made of the newly announced Power BI-related features. I hope it helps you get an at-a-glance overview of some of the big changes relevant to Power BI. More importantly, I hope it doesn’t make you feel overwhelmed! For more information, check out the documentation and learning paths.

There is a lot in store for the platform, but you can already see a slew of new changes and opportunities for Power BI developers.


Trying out Parameter Sensitive Plan Optimization

Jared Poche gives it a try:

Parameter Sensitive Plan Optimization attempts to solve the parameter sniffing issue by allowing a query to have different plans that are used based on the cardinality of one important parameter. The parent query has a plan that is really just a stub; there can be up to three variant queries that have full execution plans. When the query is executed, the optimizer chooses which variant query and plan to use based on the cardinality of the parameter’s value.

Click through for examples. I share Jared’s thoughts about row counts, though I’m not a huge fan of just using those. The rule of thumb I like to apply (though I certainly didn’t make it up myself!) is that, if you’re retrieving 0.5% or less of a table, a seek is the best option, assuming there is an appropriate index. If you’re retrieving 20% or more of a table, a scan is the best option, regardless of index quality.

Going a little further, somewhere in that 0.5–20% range there’s an in-between zone where you should be indifferent between seek and scan, as they’ll both perform approximately as well. But suppose the optimizer chooses a seek in that in-between zone and you nudge the returned row counts higher and higher: the seek becomes less and less viable, and there may be a stretch somewhere between X% (that mid-point of indifference) and 20% where you haven’t yet crossed the row count threshold for another plan but should already have switched over to the scan.

Coming up with the right solution to this problem would be pretty hard, and I’m not paid to solve problems. I’m not-paid to come up with problems, however.
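
If you want to try this yourself, here is a minimal setup sketch: Parameter Sensitive Plan Optimization requires compatibility level 160 (SQL Server 2022) and is on by default there, with a database-scoped configuration to control it. The procedure below is a hypothetical parameter-sniffing candidate of my own, not one of Jared’s examples:

    -- PSPO is available at compatibility level 160 and enabled by default;
    -- the scoped configuration is shown for completeness
    ALTER DATABASE CURRENT SET COMPATIBILITY_LEVEL = 160;
    ALTER DATABASE SCOPED CONFIGURATION
        SET PARAMETER_SENSITIVE_PLAN_OPTIMIZATION = ON;

    -- A classic sniffing candidate: cardinality varies wildly by customer,
    -- so the optimizer can compile variant plans keyed to @CustomerID
    CREATE OR ALTER PROCEDURE dbo.GetOrdersByCustomer
        @CustomerID int
    AS
    BEGIN
        SELECT OrderID, OrderDate, TotalDue
        FROM dbo.Orders
        WHERE CustomerID = @CustomerID;
    END;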


Building a Lakehouse in Microsoft Fabric

Reza Rad builds a warehouse down by the river:

The term Lakehouse is derived from two other words: Data Lake and Data Warehouse. A Lakehouse is a place to store structured data (as in a data warehouse) and unstructured data (as in a data lake) in a single location. A Lakehouse is capable of scaling up to handle large amounts of data. Other tools and services can be used to interact with the lakehouse, for example, to load data into it or read data from it.

Click through for instructions on how to build one and how to access it from SSMS and Power BI.
