Press "Enter" to skip to content

Month: March 2023

DirectQuery Support for ApproximateDistinctCount DAX Function

Chris Webb has an update for us:

Some good news for those of you using DirectQuery mode in Power BI: the ApproximateDistinctCount DAX function, which returns an estimate of the number of distinct values in a column and which can be a lot faster than a true distinct count as returned by the DistinctCount function, is now available to use with BigQuery, Databricks and Snowflake sources. It only worked with Azure SQL DB and Synapse before; Redshift is coming soon. You can use it in exactly the same way that you would with the DistinctCount function except that it only works in DirectQuery mode.

As always, there’s an example. I do wonder if the DAX function uses the same HyperLogLog algorithm that SQL Server uses for its approximate count distinct.
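
For reference, the SQL Server function in question is APPROX_COUNT_DISTINCT, available since SQL Server 2019 and documented with an error bound of roughly 2% at 97% probability. Here is a minimal comparison of the exact and approximate counts, assuming a hypothetical dbo.SalesOrderDetail table with a CustomerID column:

    -- Table and column names are hypothetical; APPROX_COUNT_DISTINCT trades a small
    -- amount of accuracy for much lower memory usage on large distinct counts.
    SELECT
        COUNT(DISTINCT CustomerID)        AS ExactDistinctCount,
        APPROX_COUNT_DISTINCT(CustomerID) AS ApproximateDistinctCount
    FROM dbo.SalesOrderDetail;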

Power BI Scanner API Updates

Matthew Roche has an update for us:

Power BI includes capabilities to enable users to understand the content they own, and how different items relate to each other. Sometimes you may need a custom “big picture” view that built-in features don’t deliver, and this is where the Scanner API comes in.

Read on to learn what the Power BI Scanner API is and some of the most interesting updates. Matthew also has a link to the announcement with a full set of updates.

A Critique of XML

Andy Leonard isn’t XML’s biggest fan:

If you are sending me (or some other hapless victim data engineer) lots of data that resides in a stable schema – one in which the number, order, data type, etc. of the columns never change – using XML, I have a question:

Why?

Why are you using XML to transmit this data?

Read the whole thing. My approximate thoughts (because it is fairly early when I’m writing this, so I might have missed something) are:

  1. XML is most useful with an XSD (XML Schema Definition), a document describing the shape and rules of the XML data. This is a big advantage over CSV, as it helps you retain information on data types, data lengths, and other details which get lost among the commas.
  2. Speaking of which, CSVs run a high risk of the separator character also showing up as a natural part of the data. The problem is that there is no single right way to indicate that "that comma is a separator, but this comma is just a comma." Different parsers work differently, and one of my lengthy rants about PolyBase is that it lets you declare a quoted delimiter and then helpfully strips the quotes before barfing on the commas inside them. There is actually an ANSI-standard character for field separation (the ASCII unit separator) which is not supposed to occur in the wild…but how many people actually use it? Especially considering that most tools don't display or interpret it well, so you lose some of the human readability of CSVs in the process. (There's a short sketch of it after this list.)
  3. That said, for stable schemas with a known separator (or at least a known mechanism for differentiating separators from naturally occurring characters), a separated-values format works well.
  4. And that said, Parquet works better, assuming you don’t have a lot of long string columns. If you’re dealing mostly with numeric data, Parquet will compress much better, will retain data types and lengths, and won’t be a repetitious blob of angle brackets. But a lot of tools still don’t support Parquet, which is a downside.
  5. Basically, this is why we can’t have nice things.
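
As a quick aside on point 2, here is a sketch of that "official" separator: the ASCII unit separator, CHAR(31) in T-SQL, which should never appear in real data. The string below is hypothetical, and STRING_SPLIT requires SQL Server 2016 or later (compatibility level 130+), but it shows the embedded comma surviving intact:

    -- Build one "row" whose fields are delimited by the unit separator.
    DECLARE @row nvarchar(200) =
        CONCAT(N'Smith, John', CHAR(31), N'Sales', CHAR(31), N'2023-03-01');

    -- Returns three rows: "Smith, John" (comma intact), "Sales", and "2023-03-01".
    SELECT value
    FROM STRING_SPLIT(@row, CHAR(31));

Of course, the readability complaint above still applies: open that string in a text editor and the separators are effectively invisible.
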
Spark Application Dependency Caching

Shu Wang, Biao He, and Minchu Yang talk turkey about dependencies:

In this blog post, we will share our analysis of Spark Dependency Management at LinkedIn, highlight interesting findings, and present our design choice of using a simple user-level cache over more complex alternatives. We will also discuss our rollout experience and lessons learned. Finally, we will demonstrate the impact of accelerating all Spark applications at LinkedIn at the cluster level. Faster Spark jobs translate to increased data freshness, leading to an enhanced member experience by way of more relevant recommendations, timely insights, effective abuse protection, and other similar improvements.

If you work with Spark to any serious extent, you’ll want to read this post.

Type 2 Dimension Loading in Redshift

Vaidy Kalpathy takes us through a bit of dimensional modeling in AWS Redshift:

Populating an SCD dimension table involves merging data from multiple source tables, which are usually normalized. SCD tables contain a pair of date columns (effective and expiry dates) that represent the record’s validity date range. Changes are inserted as new active records effective from the date of data loading, while simultaneously expiring the current active record on a previous day. During each data load, incoming change records are matched against existing active records, comparing each attribute value to determine whether existing records have changed or were deleted or are new records coming in.

Click through for the article.
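
The excerpt describes the classic expire-and-insert pattern for Type 2 dimensions. As a rough sketch only (this is not Vaidy's implementation, which handles many attributes, deletions, and staging details), with hypothetical stg_customer and dim_customer tables and Redshift-flavored SQL:

    -- Step 1: expire the currently active row for any customer whose attributes
    -- changed, setting the expiry date to the day before this load.
    UPDATE dim_customer
    SET expiry_date = DATEADD(day, -1, CURRENT_DATE),
        is_active   = FALSE
    FROM stg_customer s
    WHERE dim_customer.customer_id   = s.customer_id
      AND dim_customer.is_active     = TRUE
      AND dim_customer.customer_name <> s.customer_name;

    -- Step 2: insert a new active row for the customers just expired above
    -- as well as customers we have never seen before.
    INSERT INTO dim_customer
        (customer_id, customer_name, effective_date, expiry_date, is_active)
    SELECT s.customer_id, s.customer_name, CURRENT_DATE, DATE '9999-12-31', TRUE
    FROM stg_customer s
    LEFT JOIN dim_customer d
        ON  d.customer_id = s.customer_id
        AND d.is_active   = TRUE
    WHERE d.customer_id IS NULL;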

Architectural Erosion and Technical Debt

Uli Homann and Eric Charran (via Ben Brauer) talk about the concept of architectural erosion:

The way Eric thinks about architectural erosion is when architects and engineers work together, they construct a system or solution. They have launched that solution into production. It's performing well for a period of time and then change happens. Whether it's a change in requirements, a change in infrastructure, a change in customer habits, or DevOps signals saying that people are using a certain feature versus not using a certain feature. What ends up happening is there's a moment in time in which we look at velocity: how do we implement this change and make the application's experience do what the customer or the end user wants as quickly as possible? Then there's the strategic picture of managing things like technical debt, which is if I do something tactical, I'm probably going to do it fast and cheap and not necessarily the "right way." This then accrues to the architectural patterns, longevity and scalability and all those other types of things, and then that goes into my pile of technical debt.

Read on to learn more about the topic and what we, as technical professionals, can do to mitigate this risk.

Try Purview (Almost) for Free

Wolfgang Strasser wants to try Microsoft Purview but doesn’t want to break the bank:

And my reaction was: Nice, very nice… I can try and create Microsoft Purview instances for free and test new features…

BUT: I wanted to be sure and check how much metadata (sources, scan results, data assets, classifications) can fit into 1 MB of metadata.

Read on for Wolfgang’s test, as well as the full set of costs around trying out Purview.

Building Your Own TRY_PARSE Function

Aaron Bertrand gives it a go:

A while back, I wrote a couple of tips about simulating TRY_CONVERT – a function added in SQL Server 2012 – in earlier versions (see Part 1 and Part 2). Recently, someone in the community had a similar requirement: they wanted to make TRY_PARSE work for dates in SQL Server 2008. Now, I’m not one to help people stay on versions of SQL Server that are now 15 years old, but I do acknowledge that hands may be tied, and they can’t always control the version they’re stuck with.

Aaron does this community member a solid, though it’s probably something you’d never want to use if you’re on SQL Server 2012 or later.
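
To give a flavor of the problem (this is not Aaron's solution; click through for that), a bare-bones stand-in on SQL Server 2008 could lean on ISDATE, with all of that function's well-known SET LANGUAGE and SET DATEFORMAT sensitivities:

    -- dbo.TryParseDate is a hypothetical name. ISDATE's answer depends on session
    -- language and date format settings, which is part of why TRY_PARSE exists at all.
    CREATE FUNCTION dbo.TryParseDate (@value varchar(50))
    RETURNS datetime
    AS
    BEGIN
        RETURN CASE WHEN ISDATE(@value) = 1
                    THEN CONVERT(datetime, @value)
                    ELSE NULL
               END;
    END;
    GO

    SELECT dbo.TryParseDate('2023-03-31') AS GoodDate,  -- a datetime value
           dbo.TryParseDate('not a date') AS BadDate;   -- NULL instead of an error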

The Importance of Trace Flag 460

Jonathan Kehayias helps me understand that Trace Flag 460 is one of the best trace flags ever:

For most people that are reading this post, I’d venture to guess that you have no idea what Trace Flag 460 is or when you would use it. Well first off, let me tell you it’s a fully documented and supported trace flag, and it’s totally safe. In fact, on SQL Server 2019 and higher it is the default behavior when you create a new database. What does it do? It makes troubleshooting string or binary truncation issues easier by changing the error message that is returned from message ID 8152 and replaces it instead with message ID 2628. The trace flag is also available in SQL Server 2016 SP2 CU6+ and SQL Server 2017 CU12.

Read on to learn more about it and to remove a potential bit of confusion in the documentation.
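
If you want to see the difference for yourself, a tiny repro (table name hypothetical) looks something like this. On SQL Server 2019 and later, the same behavior is governed by the VERBOSE_TRUNCATION_WARNINGS database-scoped configuration, which is on by default:

    CREATE TABLE dbo.TruncationDemo (ShortColumn varchar(5));

    -- Without the trace flag: Msg 8152, "String or binary data would be truncated."
    -- With it: Msg 2628, which names the table, the column, and the offending value.
    DBCC TRACEON (460);  -- session scope; add -1 to enable it globally

    INSERT INTO dbo.TruncationDemo (ShortColumn)
    VALUES ('This value is far too long');

    -- SQL Server 2019+ equivalent of the trace flag:
    -- ALTER DATABASE SCOPED CONFIGURATION SET VERBOSE_TRUNCATION_WARNINGS = ON;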

Working with IP Addresses in PowerShell

Bill Kindle takes us through several PowerShell cmdlets:

A common SysAdmin task involves managing a PowerShell IP configuration. Typically, one performs this task using the GUI, which is OK. But PowerShell can do the same job and more, which you can put into scripting.

In this tutorial, you will learn how to use a few network management cmdlets in PowerShell to help manage a Windows host’s IP address, gateway, and DNS settings.

Understanding how to do this becomes even more important if you're running Windows Server Core, where you don't have too many choices other than rolling with PowerShell.
