Day: March 13, 2023

The apply() Family in R

Steven Sanderson operates over a list of operators over lists:

In this post I will talk about the use of the R functions apply(), lapply(), sapply(), tapply(), and vapply() with examples.

These functions are all designed to help users apply a function to a set of data in R, but they differ in their input and output types, as well as in the way they handle missing values and other complexities. By using the right function for your particular problem, you can make your code more efficient and easier to read.

I do prefer the purrr syntax because its function names are a little easier to remember than the variants of apply(). Even so, there’s a lot you can do with judicious use of apply().

The Story behind Benford’s Law

John Cook gives us a dose of history and math:

In 1881, astronomer Simon Newcomb noticed something curious. The first pages in books of logarithms were dirty on the edge, while the pages became progressively cleaner in later pages. He inferred from this that people more often looked up the logarithms of numbers with small leading digits than with large leading digits.

Why might this be? One might reasonably expect the numbers that came up in work to be uniformly distributed. But as is often the case, it helps to ask “Uniform on what scale?”

Read on for a bit more of the story behind Newcomb’s Benford’s law and a just-so story about differing bases.
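
To make the “Uniform on what scale?” question concrete, here is a minimal Python sketch. It assumes the usual statement of Benford’s law, P(d) = log10(1 + 1/d), and uses the first 1,000 powers of 2 as a stand-in for data spread evenly on a logarithmic scale. The function names are mine, not anything from John’s post.

```python
import math
from collections import Counter


def benford_probability(d: int) -> float:
    """Probability that the leading digit is d under Benford's law."""
    return math.log10(1 + 1 / d)


def leading_digit(n: int) -> int:
    """First decimal digit of a positive integer."""
    return int(str(n)[0])


# Powers of 2 are evenly spaced on a log scale, so their leading
# digits should land close to the Benford frequencies.
sample_size = 1000
counts = Counter(leading_digit(2 ** k) for k in range(sample_size))
observed = {d: counts[d] / sample_size for d in range(1, 10)}

for d in range(1, 10):
    print(f"{d}: observed {observed[d]:.3f}, Benford {benford_probability(d):.3f}")
```

Numbers that are uniform on a linear scale would give each leading digit roughly 1/9 of the total instead, which is the distinction the quote is pointing at.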

Tips for AKS Storage Provisioning

Joji Varghese gives us a hand:

In an Azure Kubernetes Service (AKS) cluster, Pods can access physical storage resources such as disks or volumes using Persistent Volumes (PV). To use these resources, Pods need to make a Persistent Volume Claim (PVC), which requests a specific amount of storage from a storage class. This claim can then be matched to an available Persistent Volume. Azure offers several storage solutions that can be used to provision Persistent Volumes in an AKS cluster.

This article will provide real-world guidance on securely using Container Storage Interface (CSI) drivers to provision Azure File Shares and Azure Blob storage in an AKS cluster.

If you’re looking at setting up Azure Kubernetes Service, give this a review.
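
For a rough picture of what a claim looks like, here is a small Python sketch that builds a PersistentVolumeClaim manifest as a dictionary and prints it as JSON, which kubectl accepts alongside YAML. The claim name and size are hypothetical, and I am assuming a storage class named azurefile-csi exists in the cluster. This is not taken from Joji’s article, which covers the secure setup in far more detail.

```python
import json


def build_pvc(name: str, storage_class: str, size: str) -> dict:
    """Build a minimal PersistentVolumeClaim manifest.

    All argument values are supplied by the caller; nothing here
    talks to a cluster.
    """
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            # ReadWriteMany lets several Pods mount the same share.
            "accessModes": ["ReadWriteMany"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


# Hypothetical values: claim name, storage class, and size are assumptions.
pvc = build_pvc("demo-files-claim", "azurefile-csi", "5Gi")
print(json.dumps(pvc, indent=2))
```

The point is only the shape: the claim names a storage class and an amount of storage, and the cluster matches it to a volume.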

DirectQuery Support for ApproximateDistinctCount DAX Function

Chris Webb has an update for us:

Some good news for those of you using DirectQuery mode in Power BI: the ApproximateDistinctCount DAX function, which returns an estimate of the number of distinct values in a column and which can be a lot faster than a true distinct count as returned by the DistinctCount function, is now available to use with BigQuery, Databricks and Snowflake sources. It only worked with Azure SQL DB and Synapse before; Redshift is coming soon. You can use it in exactly the same way that you would with the DistinctCount function except that it only works in DirectQuery mode.

As always, there’s an example. I do wonder if the DAX function uses the same HyperLogLog algorithm that SQL Server uses for its approximate count distinct.
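
For anyone curious what a HyperLogLog-style estimate looks like, here is a minimal Python sketch of the general technique. It is a teaching version built on my own assumptions (SHA-1 as the hash, 4,096 registers, a simple small-range correction) and says nothing about how DAX or any database engine actually implements its approximate count.

```python
import hashlib
import math


def hll_estimate(values, p: int = 12) -> float:
    """Estimate the number of distinct values with a basic HyperLogLog.

    p is the number of hash bits used to pick a register, giving 2**p registers.
    """
    m = 1 << p
    registers = [0] * m
    for value in values:
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # 64-bit hash
        index = h >> (64 - p)                  # first p bits pick the register
        rest = h & ((1 << (64 - p)) - 1)       # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1 bit in the remaining bits.
        rank = (64 - p) - rest.bit_length() + 1
        registers[index] = max(registers[index], rank)

    alpha = 0.7213 / (1 + 1.079 / m)
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)

    # Small-range correction: fall back to linear counting.
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)
    return raw


# 150,000 rows but only 50,000 distinct values.
values = [f"user-{i % 50_000}" for i in range(150_000)]
estimate = hll_estimate(values)
print(f"true distinct: 50000, estimated: {estimate:.0f}")
```

The appeal is that memory stays fixed at the register array no matter how many rows flow through, at the cost of an error of a couple of percent with this many registers.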

Power BI Scanner API Updates

Matthew Roche has an update for us:

Power BI includes capabilities to enable users to understand the content they own, and how different items relate to each other. Sometimes you may need a custom “big picture” view that built-in features don’t deliver, and this is where the Scanner API comes in.

Read on to learn what the Power BI Scanner API is and some of the most interesting updates. Matthew also has a link to the announcement with a full set of updates.

A Critique of XML

Andy Leonard isn’t XML’s biggest fan:

If you are sending me (or some other hapless victim data engineer) lots of data that resides in a stable schema – one in which the number, order, data type, etc. of the columns never change – using XML, I have a question:

Why?

Why are you using XML to transmit this data?

Read the whole thing. My approximate thoughts (because it is fairly early when I’m writing this, so I might have missed something) are:

  1. XML is most useful with an XSD (XML Schema Definition), a document describing the shape and rules of the XML data. This is a big advantage over CSV, as it helps you retain information on data types, data lengths, and other details which get lost in the comma.
  2. Speaking of which, CSVs run a high risk of needing to use the separator as a native character. The problem is that there is no single right way to indicate that “That comma is a separator, but this comma is just a comma.” Different parsers work differently, and one of my lengthy rants about PolyBase is that it helpfully indicates that you have a quoted delimiter here and helpfully removes it before barfing on the commas inside quotations. There is actually an ANSI standard character for separator which is not supposed to occur in the wild…but how many people actually use it? Especially considering that most tools don’t interpret it correctly, so you lose some of the readability of CSVs in the process.
  3. That said, for stable schemas with a known separator (or at least a known mechanism for differentiating separators from naturally occurring characters), separated values work well.
  4. And that said, Parquet works better, assuming you don’t have a lot of long string columns. If you’re dealing mostly with numeric data, Parquet will compress much better, will retain data types and lengths, and won’t be a repetitious blob of angle brackets. But a lot of tools still don’t support Parquet, which is a downside.
  5. Basically, this is why we can’t have nice things.
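
To illustrate the separator problem from point 2, here is a small Python sketch using the standard csv module. The sample row is made up, and I am assuming the standard separator character in question is the ASCII unit separator (0x1F).

```python
import csv
import io

# Hypothetical row with commas inside two of the fields.
row = ["42", "Smith, Jane", "Portland, OR"]

# Write the row as CSV; the writer quotes fields that contain commas.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
csv_line = buffer.getvalue()

# A naive split treats every comma as a separator and breaks the fields.
naive_fields = csv_line.strip().split(",")

# A quote-aware parser recovers the original three fields.
parsed_fields = next(csv.reader(io.StringIO(csv_line)))

# With the ASCII unit separator there is nothing to quote,
# but the line is no longer pleasant to read in a text editor.
UNIT_SEPARATOR = "\x1f"
us_line = UNIT_SEPARATOR.join(row)
us_fields = us_line.split(UNIT_SEPARATOR)

print(len(naive_fields), len(parsed_fields), len(us_fields))
```

The naive split and the quote-aware reader disagree on the same line, which is exactly what happens when two tools follow different conventions for embedded delimiters.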