Press "Enter" to skip to content

Day: June 27, 2022

Software Engineering Practices for Notebooks

Rafi Kurlansik and Austin Ford explain how to get the most out of notebooks, using Databricks as an example:

Notebooks are a popular way to start working with data quickly without configuring a complicated environment. Notebook authors can quickly go from interactive analysis to sharing a collaborative workflow, mixing explanatory text with code. Often, notebooks that begin as exploration evolve into production artifacts. For example,

1. A report that runs regularly based on newer data and evolving business logic.

2. An ETL pipeline that needs to run on a regular schedule, or continuously.

3. A machine learning model that must be re-trained when new data arrives.

Perhaps surprisingly, many Databricks customers find that with small adjustments, notebooks can be packaged into production assets, and integrated with best practices such as code review, testing, modularity, continuous integration, and versioned deployment.

Read on for several tips and recommendations.

Comments closed

Semantics Layers for Data Lakehouses

Jans Aasman explains why semantic modeling is so important for a data lakehouse:

Data lakehouses would not exist — especially not at enterprise scale — without semantic consistency. The provisioning of a universal semantic layer is not only one of the key attributes of this emergent data architecture, but also one of its cardinal enablers.

In fact, the critical distinction between a data lake and a data lakehouse is that the latter supplies a vital semantic understanding of data so users can view and comprehend these enterprise assets. It paves the way for data governance, metadata management, role-based access, and data quality.

For a deeper dive into the topic, Kyle Hale has a post covering this with Databricks and Power BI as examples.

Comments closed

The First 30 Days as a DBA

Tracy Boggiano has some experience with new jobs:

Over the last four years, ok it seems longer than that, I’ve started four jobs. A couple just weren’t good fits. One I was at for three years. I currently just finished my first 30 days at my fourth one. Having done the first 30 days several times over the last few years, I’ve searched each time what you would do when you start that new position to take over the environment. What would you evaluate, where to start with everything, what to do first? With no luck mind you. So, I’m going to blog about my journey through this as I’ve done it several times over my career and believe it can help others as they start new positions know where to begin. Coincidentally, Aaron Bertrand (t) just blogged about his first month at Stack Overflow as a DBRE.

I can’t believe it’s been four years since Tracy and I worked together. Somebody’s been messing with my time machine, right?

Comments closed

Cumulative Updates and GDRs

Aaron Bertrand clarifies two concepts:

The underlying problem is that servicing complex software is, well, complex. Microsoft simplified this for our little corner of the world when they announced that SQL Server 2016 would be the last release to get service packs. We still have Cumulative Updates (CUs) and General Distribution Releases (GDRs) to deal with, but they tend to only cause confusion around Patch Tuesday (or the – cough – odd time a CU breaks things). Before I explain, let’s define these:

Read on for the definitions and why the GDR path exists.

Wait, I thought the German Democratic Republic (GDR / DDR) re-unified with the Federal Republic of Germany (FRG / BRD) in 1990… Ah, the lengths I go to for an awful joke.

Comments closed

Power BI Aggregations from Azure Data Explorer Data

Dany Hoter has some recommendations if you’re aggregating data from Azure Data Explorer into Power BI:

Every visual shown in a report in PBI, contains some form of aggregation

The question is how the aggregations are calculated and at which step in the pipe of bringing the data from the data source to the report.

In this article, I’ll be using data coming from Azure Data Explorer aka Kusto aka ADX.

Most of the content is relevant for other sources as well.

Read on for the advice, which I’d call fairly unexpected—I actually expected the recommendation to go the other way for performance reasons.

Comments closed

Building a Reproducable Example with SQL Server

Erik Darling wants to help you ask questions more effectively:

What no one wants to get into when it comes to performance questions is a giant wall-of-text word problem.

You may be the most eloquent question-asker in the known universe, but having the above items is worth hundreds of millions of words.

Click through to see what Erik would like to see. This is a really good post to read if you ever use Stack Overflow (or DBA Stack Exchange) or any other method of asking a bunch of randos how to solve a problem.

Comments closed

Power Query Online Memory and CPU Usage

Chris Webb wants to see how things are going:

Power Query Online is, as the name suggests, the online version of Power Query – it’s what you use when you’re developing Power BI Dataflows for example. Sometimes when you’re building a complex, slow query in the Query Editor you’ll notice a message in the status bar at the bottom of the page telling you how long the query has been running for and how much memory and CPU it’s using:

Read on to understand what the memory value indicates and for a few tips on the topic.

Comments closed

Stopping Azure Kubernetes Service Nodes

Andrew Pruski wants to shut the whole thing down:

A while back I wrote a post on Adjusting Pod Eviction Timings in Kubernetes. To test the changes made in that post I had to shut down nodes in an Azure Kubernetes Service cluster.

This can be done easily in the Azure portal: –

However I did a presentation recently and didn’t want to have to keep jumping into the portal from VS Code…so I wanted to be able to shut down the nodes in code.

So here’s how to use the azure-cli to shut down a node in an Azure Kubernetes Service cluster.

Read on to see how but also read Andrew’s warning / disclaimer so you don’t mess anything up in a production environment.

Comments closed