Press "Enter" to skip to content

Day: October 10, 2024

Reading Parquet Files in R with nanoparquet

Stephen Turner reads some data:

In these slides I also learned about the nanoparquet package — a zero dependency package for reading and writing parquet files in R. Besides all the benefits noted above, parquet is much faster to read and write. And, as opposed to saving as .rds, parquet can easily be passed back and forth between R, Python, and other frameworks.

Let’s take a look at how reading and writing parquet files compares with CSV, either with base R or readr.

Stephen shows one of the best-case scenarios for Parquet: lots of data (100 million rows), relatively few columns, no long strings, etc. That leads to a massive improvement over using CSVs, even if you ignore the metadata and formatting benefits. I wouldn’t expect the benefits to be nearly as significant with wide text columns and very little value overlap, but that’s also pretty uncommon for the type of dataset we’re analyzing in R.

Comments closed

Adding Row Numbers to a SQL Query

Steve Jones enumerates a result set:

I’m going to use some fun data for me. I’ve been tracking my travels, since I’m on the road a lot. I’m a data person and part of tracking is trying to ensure I’m not doing too much. Just looking at the data helps me keep perspective and sometimes cancel (or decline) a trip.

In any case, you don’t care, but I essentially have this data in a table. As you can see, I have the date of travel, the city, area, etc. I also have a few flags as to whether I was traveling that day, if I spent a night away from home, and how far I was.

Read on for a few trials with ROW_NUMBER().

Comments closed

Cloud Connections in Microsoft Fabric

Dennes Torres makes a connection:

wrote about cloud connections when they were in a very early stage.

Cloud connections evolved and are now sharable. We call the “regular” connection as “personal connection”.

The problem with the “personal connections” is the difficult to make teamwork. The personal connections belong to you and different developers can’t use them. When a different developer needs to work with the same objects, they are required to create their own connection.

Using cloud connections, we can create a single, reusable connection to the data source and share it with all the developers in the team.

Read on to learn more about how they work now that the feature is a bit more mature.

Comments closed

Online DR from SQL Server 2022 and Azure SQL MI Now Available

Djordje Jeremic announces general availability of one of the key selling points from SQL Server 2022:

Today, we are announcing the general availability of the following two major capabilities of the Managed Instance link feature with SQL Server 2022:

  • Two-way failover between SQL Server 2022 and SQL Managed Instance through the link to unlock true disaster recovery (DR) with Azure
  • Creating a link from SQL Managed Instance to SQL Server 2022 to unlock off-PaaS data mobility for regulatory and dev/test scenarios 

Click through for more detail.

Comments closed

Permissions and BACPAC Files in Azure SQL DB

Roberto Yonekawa diagnoses an error:

We had a support request where the customer was getting an error when trying to export his Azure SQL Database to a bacpac file using the SqlPackage command-line utility.

Error message:

Microsoft.Data.Tools.Diagnostics.Tracer Error: 19 : 2024-08-21T16:10:56 : Microsoft.SqlServer.Dac.DacServicesException: One or more unsupported elements were found in the schema used as part of a data package.

Error SQL71627: The element Permission has property Permission set to a value that is not supported in Microsoft Azure SQL Database v12.

Click through for the specific issue Roberto found. I’d imagine that there are other permission sets that are incompatible with Azure SQL Database and would cause this error message to pop up as well.

Comments closed

Client Diagnostics in Cosmos DB

Arthur Daniels digs into diagnostic data:


This topic keeps coming up with my customers so the purpose of this blog post is to keep all the useful information in one post. Think of this post as a consolidation of different pieces of information. If you’re stuck on what “pipelined”, “created”, or “transit time” means, you’re in the right spot.

We are not talking about Diagnostic Settings/Log Analytics in this post, I’d describe those as server-side rather than client-side diagnostics. They are useful in different ways, our client diagnostics will help us understand the path that a request takes from the application to Cosmos DB and back, along with any latency on that path.

Note: before we get started, check to see if you’re using Direct or Gateway mode. Ideally sending your requests Direct mode in the .NET or Java SDKs will usually result in faster requests.

Read on to see how to retrieve and interpret this diagnostic data.

Comments closed