Day: September 16, 2024

Building a GitHub Codespace Configuration for Polyglot Notebooks

Matt Eland makes some recommendations:

In order to get Polyglot Notebooks to work with GitHub Codespaces, you’ll need to match the current requirements of the Polyglot Notebooks extension and its underlying .NET Interactive kernels.

This relies on two files in your .devcontainer directory:

  • Dockerfile, which describes the Docker container the Codespace will run in
  • devcontainer.json, which describes how the dev container is configured in terms of extensions and ports

Read on to learn more. Also, Matt has a brand new book available on the topic of polyglot notebooks, so check that out.

Schema Validation in MongoDB

Robert Sheldon makes me bite my tongue to prevent making schema quality jokes:

In the previous article in this series, I introduced you to schema validation in MongoDB. I described how you can define validation rules on a collection and how those rules validate document inserts and updates. In this article, I continue the discussion by explaining how to apply validation rules to documents that already exist in a collection. Before you start in on this article, however, I recommend that you first review the previous article for an introduction to schema validation.

The examples in this article demonstrate various concepts for working with schema validation, as it applies to existing documents. I show you how to find documents that conform and don’t conform to the validation rules, as well as how to bypass schema validation when inserting or updating a document. I also show you how to update and delete invalid documents in a collection. Finally, I explain how you can use validation options to override the default schema validation behavior when inserting and updating documents.

Read on to learn more about how you can perform some after-the-fact schema validation.
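
As a rough illustration of the ideas in that excerpt (not Robert's own code), here is a minimal Python sketch that only builds the documents you would hand to MongoDB. The schema, field names, and helper function names are all hypothetical; `$jsonSchema`, `$nor`, `collMod`, `validationLevel`, and `validationAction` are standard MongoDB names.

```python
# Hypothetical schema for illustration only; it is not taken from the article.
EXAMPLE_SCHEMA = {
    "bsonType": "object",
    "required": ["name", "year"],
    "properties": {
        "name": {"bsonType": "string"},
        "year": {"bsonType": "int", "minimum": 1900},
    },
}


def conforming_filter(schema):
    """Query filter matching documents that satisfy the schema."""
    return {"$jsonSchema": schema}


def nonconforming_filter(schema):
    """Query filter matching documents that do NOT satisfy the schema."""
    return {"$nor": [{"$jsonSchema": schema}]}


def validation_command(collection_name, schema, level="moderate", action="warn"):
    """Build a collMod command that sets the validator and its options.

    Only the long-standing option values are accepted in this sketch.
    """
    if level not in {"off", "strict", "moderate"}:
        raise ValueError(f"unknown validationLevel: {level}")
    if action not in {"error", "warn"}:
        raise ValueError(f"unknown validationAction: {action}")
    return {
        "collMod": collection_name,  # must stay the first key of the command
        "validator": {"$jsonSchema": schema},
        "validationLevel": level,
        "validationAction": action,
    }
```

Assuming a PyMongo connection, these would be used along the lines of `collection.find(nonconforming_filter(EXAMPLE_SCHEMA))` and `db.command(validation_command("films", EXAMPLE_SCHEMA))`, where `films` is a made-up collection name. Bypassing validation on a single write is a separate option, `bypass_document_validation=True`, on calls such as `insert_one`.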

The State of the ANY Aggregate Transformation

Paul White covers an aggregate operator:

SQL Server provides a way to select any one row from a group of rows, provided you write the statement using a specific syntax. This method returns any one row from each group, not the minimum, maximum or anything else. In principle, the one row chosen from each group is unpredictable.

The general idea of the required syntax is to logically number rows starting with 1 in each group in no particular order, then return only the rows numbered 1. The outer statement must not select the numbering column for this query optimizer transformation (SelSeqPrjToAnyAgg) to work.

Read on for information about this internal operator, a bug that existed in it for a long time, and the current state of fixes.
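
To make the query shape concrete, here is a small sketch of the logical pattern Paul describes: number the rows within each group, keep the rows numbered 1, and leave the numbering column out of the outer select list. It runs against an in-memory SQLite database from Python, so it shows only the syntax idea and says nothing about SQL Server's optimizer or the SelSeqPrjToAnyAgg transformation. The table, column, and function names are made up, and SQL Server would additionally require an ORDER BY inside the OVER clause.

```python
import sqlite3


def any_row_per_group(rows):
    """Return one arbitrary (grp, val) row per group.

    Uses SQLite (3.25 or later for window functions) purely to show the
    query shape; which row comes back for each group is not guaranteed.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE t (grp TEXT, val INTEGER)")
        conn.executemany("INSERT INTO t (grp, val) VALUES (?, ?)", rows)
        query = """
            SELECT grp, val              -- the numbering column rn is not selected
            FROM (
                SELECT grp, val,
                       ROW_NUMBER() OVER (PARTITION BY grp) AS rn
                FROM t
            ) AS numbered
            WHERE rn = 1                 -- keep only the first-numbered row per group
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()
```

Calling `any_row_per_group([("a", 1), ("a", 2), ("b", 3)])` returns two rows, one for group `a` and one for group `b`. Which of the two `a` rows you get is up to the engine, which is the whole point of the pattern.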

Working with Excel Files in Databricks

Chen Hirsh deals with truly big data:

Excel is one of the most common data file formats, and, as data engineers, we are required to read data from it on almost every project. Excel is easy to use, and you can customize it quickly, like adding a column and changing data. But the same things that make it the go-to format for users make it hard for data platforms to read. Adding a column might break a pipeline, and changing datatypes, for example, adding text to a column that only held numeric data before, might cause a nasty error downstream.

Working in Databricks, you can read and write Excel files, but you need to pay attention to some pitfalls. So let’s get started, working with Excel files on Databricks!

Click through for a way to do this using PySpark. H/T Madeira Data Solutions blog.

Finding the SQL Power BI DirectQuery Mode Generates

Chris Webb finds a way:

If you’re performance tuning a DirectQuery mode semantic model in Power BI, one of the first things you’ll want to do is look at the SQL that Power BI is generating. That’s easy if you have permissions to monitor your source database but if you don’t, it can be quite difficult to do so from Power BI. I explained the options for getting the SQL generated in DirectQuery mode and why it’s so complicated in a presentation here, but I’ve recently found a new way of doing this in Power BI Desktop (but not the Service) that works for some M-based connectors, for example Snowflake.

Click through for the solution.

External References in Data-Tier Applications

Andy Brownsword needs to make a call out:

One method for transferring a database to a different environment is using a Data-Tier Application – in the form of a DACPAC (for schema) or BACPAC (for schema and data).

Trying to use this approach with multi-database solutions is a challenge, though, as Data-Tier Applications don’t play nicely with cross-database objects.

Let’s look at how we can ease that pain.

Read on for the solution.
