Day: December 7, 2021

Using Scala at Databricks

Li Haoyi gives us a peek behind the curtain:

With hundreds of developers and millions of lines of code, Databricks is one of the largest Scala shops around. This post will be a broad tour of Scala at Databricks, from its inception to usage, style, tooling and challenges. We will cover topics ranging from cloud infrastructure and bespoke language tooling to the human processes around managing our large Scala codebase. From this post, you’ll learn about everything big and small that goes into making Scala at Databricks work, a useful case study for anyone supporting the use of Scala in a growing organization.

It’s always interesting to see how the largest companies handle certain classes of problems. From this post, we can get an idea of the high-level requirements and usage, making it worth the read.

What Good Data Governance Means

Paul Andrew shares some thoughts on the true meaning of data governance:

… Someone asked me what I thought ‘good’ Data Governance might look like for a given data platform – warehouse or analytics solution. This included all aspects of data governance, so it was a very broad question.

To add some actual context, not related to Star Wars, this was during the time when Microsoft started talking about version 2 of the Azure Data Catalogue offering and what else could/should be included in a suite of governance tools. Certainly, long before the days of what we now call Azure Purview. It was also a time when GDPR had a lot of focus for the data community as we battled with the principles and technical implications.

Anyway, with a recent fresh perspective on things, I’ve decided to dust off my original notes and attempt to distill the answer of good data governance into the following areas. What I also find with data governance is that the more I experience in the industry, the deeper my perspective on the subject goes.

Paul has put together a lengthy answer on the topic, well worth the read.

Batch Mode and Window Functions

I wind down a series on window functions:

SQL Server typically operates in row mode, which means that an operator processes one row at a time. This sounds inefficient, but tends to work out pretty well in practice. However, something which may work out even better is to process more than one row at a time, especially when the number of rows gets to be fairly large. Enter batch mode.

Batch mode was introduced in SQL Server 2012 alongside non-clustered columnstore indexes. It became interesting in SQL Server 2016 and very interesting in SQL Server 2019. That’s because 2016 introduced writable clustered columnstore indexes and 2019 gives us batch mode outside of columnstore indexes.

There are some nice potential performance gains for queries involving window functions.
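
To make the row mode versus batch mode distinction a bit more concrete, here is a conceptual Python sketch. It is not how SQL Server implements either mode; the function names and the `batch_size` parameter are hypothetical and exist only for illustration. Both functions compute a running total, the sort of result a windowed `SUM` produces, and the difference lies in how many hand-offs the operator makes to its caller.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def running_total_row_mode(rows: Iterable[int]) -> Iterator[int]:
    """Row-at-a-time: the operator hands back one result per input row."""
    total = 0
    for value in rows:
        total += value
        yield total


def running_total_batch_mode(
    rows: Iterable[int], batch_size: int = 4
) -> Iterator[List[int]]:
    """Batch-at-a-time: the operator hands back a list of results per call."""
    iterator = iter(rows)
    total = 0
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        results = []
        for value in batch:
            total += value
            results.append(total)
        yield results


values = list(range(1, 11))  # illustrative data, not from the post

row_results = list(running_total_row_mode(values))
batches = list(running_total_batch_mode(values, batch_size=4))
batch_results = [total for batch in batches for total in batch]

# Both approaches produce the same running totals; only the number of
# hand-offs between operator and caller differs.
print(len(row_results), "hand-offs in row mode")
print(len(batches), "hand-offs in batch mode")
```

The per-call overhead in this toy is trivial, but the shape of the idea carries over: fewer, larger hand-offs amortize fixed costs across many rows.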

tempdb Usage and Recommendations

Chad Callihan has a two-parter for us. First up is a look at tempdb usage:

The tempdb database in SQL Server holds a little bit of everything over its existence. There can be the temporary tables that you created, work tables like spools that SQL Server creates, version store data, and much more in tempdb at any time.

Is your tempdb having a hard time keeping up with the workload on your SQL Server? How can you find out what is using tempdb the most?

Chad also recommends picking the right number of files for tempdb:

We just made it through Thanksgiving dinner and there’s another big Christmas dinner coming up in a few weeks. Sometimes you might find that one plate is not enough to hold all of that food. Just as multiple plates can make big dinners easier to manage, multiple tempdb data files can make the SQL Server workload easier to manage.

Read on to see how to track and modify this setting.
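
As a rough companion to the plate analogy, here is a small Python sketch of one widely cited starting point for the number of tempdb data files. It follows Microsoft's general guidance rather than anything specific to Chad's post: one data file per logical processor up to eight, with more added in multiples of four only if allocation contention persists. The function name is hypothetical, and the result is a starting point to test against your own workload, not a definitive answer.

```python
def starting_tempdb_data_file_count(logical_processors: int) -> int:
    """Return a starting number of tempdb data files.

    Assumption: the general rule of thumb of one file per logical
    processor, capped at eight. Adding more files beyond this is a
    follow-up step taken only if contention continues.
    """
    if logical_processors < 1:
        raise ValueError("logical_processors must be at least 1")
    return min(logical_processors, 8)


for cores in (2, 4, 8, 16, 64):
    print(cores, "logical processors ->", starting_tempdb_data_file_count(cores), "files")
```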

Modeling Many-to-Many Relationships in Power BI

Marco Russo and Alberto Ferrari show off two techniques:

Our readers know SQLBI’s position regarding bidirectional relationships: they are a powerful tool that should be used with great care and avoided in most scenarios. There actually is one scenario where bidirectional relationships are useful: when you need to create a model involving a many-to-many relationship between dimensions. In this scenario, using a bidirectional filter relationship is the suggested solution. Nonetheless, there may be reasons why the bidirectional relationship cannot be created, because of ambiguity. If you face this situation, you can use a different modeling technique based on a limited many-to-many cardinality relationship, which would work even when it is set as a unidirectional relationship. The choice between the two models is not an easy one. Both come with advantages and disadvantages that need to be deeply understood in order to make the right choice.

In this article, we first provide a description of the two techniques, and then we proceed with the performance analysis of both solutions, so as to provide information about which technique to use and when.

Read on for the analysis.
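
For anyone less familiar with the underlying problem, here is a small Python sketch of a many-to-many relationship between two dimensions. It does not model DAX or either of the two techniques in the article; the customers, accounts, and balances are made-up data, and the helper names are hypothetical. It only shows why a filter on one dimension has to travel through a bridge table to reach the other, and why the resulting totals are not additive.

```python
# Hypothetical bridge table: a customer can hold several accounts,
# and an account can be shared by several customers.
bridge = [
    ("Alice", "A1"),
    ("Bob", "A1"),
    ("Bob", "A2"),
]

# Hypothetical fact data: one balance per account.
balances = {"A1": 100, "A2": 50}


def accounts_for(customers: set) -> set:
    """Propagate a customer filter through the bridge to the accounts."""
    return {account for customer, account in bridge if customer in customers}


def balance_for(customers: set) -> int:
    """Sum each reachable account once, however many customers share it."""
    return sum(balances[account] for account in accounts_for(customers))


per_customer = {name: balance_for({name}) for name in ("Alice", "Bob")}
grand_total = balance_for({"Alice", "Bob"})

print(per_customer)  # each customer sees every account they hold
print(grand_total)   # shared account A1 is counted once, not twice
```

The per-customer figures add up to more than the grand total because account A1 is shared, which is the behavior a correct many-to-many model needs to preserve.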
