Controlling Partition and File Counts in Spark

Landon Robinson shows how we can control the number of partitions (and therefore the number of output files) on reduce-style jobs in Spark:

Whatever the case may be, the desire to control the number of files for a job or query is reasonable – within, ahem, reason – and in general is not too complicated. And, it’s often a very beneficial idea.

However, a thorough understanding of distributed computing paradigms like Map-Reduce (a paradigm Apache Spark follows and builds upon) can help understand how files are created by parallelized processes. More importantly, one can learn the benefits and consequences of manipulating that behavior, and how to do so properly – or at least without degrading performance.

There’s good advice in here, so check it out.

Using the Cosmos DB Data Migration Tool

Kevin Feasel

2019-06-25

Cloud

Hasan Savran shows how you can use the Cosmos DB Data Migration Tool to move data from various sources into Cosmos DB:

All you need is the connection string of your database or location of your source files and your CosmosDB keys. If you are using a database as source, you can format the data model pretty easy. You might have issues if the source you use does not have a JSON data type. JSON Array might look like string in CosmosDB because of data type mapping problems.

I am going to use CSV file as source in the following example. You can’t define a column as JSON array in excel. I have the following two columns in my CSV file. If I import these values into CosmosDB as they are, I see coordinates field’s data type will turn into a text field in CosmosDB. You can go back and try to update them in CosmosDB but that will be an expensive solution.

The one downside to this tool is that it doesn’t work with collections defined using the Mongo API.

Creating an Azure Databricks Cluster

Brad Llewellyn shows how you can create an Azure Databricks cluster:

There are three major concepts for us to understand about Azure Databricks, Clusters, Code and Data.  We will dig into each of these in due time.  For this post, we’re going to talk about Clusters.  Clusters are where the work is done.  Clusters themselves do not store any code or data.  Instead, they operate the physical resources that are used to perform the computations.  So, it’s possible (and even advised) to develop code against small development clusters, then leverage the same code against larger production-grade clusters for deployment.  Let’s start by creating a small cluster.

Read on for an example.

Using APPLY to Reduce Function Calls

Kevin Feasel

2019-06-25

T-SQL

Erik Darling shows a clever use of the APPLY operator:

A while back, Jonathan Kehayias blogged about a way to speed up UDFs that might see NULL input.

Which is great, if your functions see NULL inputs.

But what if… What if they don’t?

And what if they’re in your WHERE clause?

And what if they’re in your WHERE clause multiple times?

Oh my.

But fear not—Erik’s got you covered.

SQL Server 2019 CTP3 T-Log Writers Increased

Lonny Niederstadt observes a change in SQL Server 2019 CTP 3.0:

In SQL Server 2016, transaction log writing was enhanced to support multiple transaction log writers.  If the instance had more than one non-DAC node in [sys].[dm_os_nodes], there would be one transaction log writer per node, to a maximum of 4.

In SQL Server 2019, it seems the maximum number of transaction log writers has been increased.  The system below with 4 vNUMA nodes (and autosoftNUMA disabled) has eight transaction log writer sessions, each on their own hidden online scheduler, all on parent_node_id = 3/memory_node_id = 3 on processor group 1.

Click through for the proof.

Self-Documenting Power BI Apps

Matthew Roche wants to build self-documenting Power BI applications:

Power BI is constantly evolving – there’s a new version of Power BI Desktop every month, and the Power BI service is updated every week. Many of the new capabilities in Power BI represent gradual refinements, but some are significant enough to make you rethink how you your organization uses Power BI.

The new app navigation capabilities introduced last month to Power BI probably fall into the former category. But even though they’re a refinement of what the Power BI service has always had, they can still make your apps significantly better. Specifically, these new capabilities can be used to add documentation and training materials directly to the app experience, while keeping that content in its current location.

Click through for an explanation.

SSMS Query Plans and Arrow Sizes

Brent Ozar clarifies what arrow sizes actually mean in execution plans:

That means the entire concept of the arrow is made up by the rendering application – like SQL Server Management Studio, Azure Data Studio, SentryOne Plan Explorer, and all the third party plan-rendering tools. They get to decide arrow sizes – there’s no standard.

SSMS’s arrow size algorithm changed back in SQL Server Management Studio 17, but most folks never took notice. These days, it’s not based on rows read, columns read, total data size, or anything else about the data moving from one operator to the next.

There’s an answer, but it’s not particularly intuitive. I think SentryOne Plan Explorer has the upper hand on this one.

Categories

June 2019
MTWTFSS
« May Jul »
 12
3456789
10111213141516
17181920212223
24252627282930