
Month: September 2020

Renaming Cached DataFrames in Spark

Landon Robinson works around an annoyance:

But DataFrames have not been given the same, clear route to convenient renaming of cached data. It has, however, been attempted and requested by the community:

https://forums.databricks.com/questions/6525/how-to-setname-on-a-dataframe.html
https://issues.apache.org/jira/browse/SPARK-8480

However, with the below approach, you can start naming your DataFrames all you want. It’s very handy.

Read on to see the solution in action.
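
To give a sense of the shape of the workaround, here is a minimal PySpark sketch; it is my own illustration rather than Landon’s exact code, and the session setup and names are assumptions. Registering the DataFrame as a temporary view and caching it through the catalog gives the cached entry a recognizable name in the Spark UI’s Storage tab.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("named-cache-demo").getOrCreate()

# A throwaway DataFrame standing in for something expensive to compute.
orders = spark.range(1_000_000).withColumnRenamed("id", "order_id")

# Register a temp view and cache it through the catalog; the Storage tab
# then shows "In-memory table orders" instead of an anonymous plan.
orders.createOrReplaceTempView("orders")
spark.catalog.cacheTable("orders")

orders.count()  # materialize the cache so the entry appears in the UI

# Later, release it by the same name.
spark.catalog.uncacheTable("orders")
```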

Comments closed

A Kafka Alternative for Message Prioritization

Ricardo Ferreira explains why Apache Kafka doesn’t support message prioritization and gives a second-best solution when you do need it:

Users of messaging technologies such as JMS and AMQP often use message prioritization so that messages can be processed in a different order based on their importance. It doesn’t take much imagination to see where this can be useful—call center companies will want to handle the most severe cases first, then the others; airline companies give preference in their service to customers with higher status; and telecommunication companies would probably give their most loyal customers a better renewal promotion.

One of the misconceptions that developers have when they first encounter Apache Kafka® is that it is just another messaging system, just like the ones they’re familiar with. By extension from that, they get disappointed when they find that Kafka doesn’t offer message prioritization.

Read on for the answers.
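
The usual second-best answer is to model priority with separate topics (or partitions) and have consumers drain the more important ones first. Below is a rough Python sketch using confluent-kafka; the broker address and topic names are invented, and this is one simple reading of the pattern rather than the specific design Ricardo describes. The trade-off is that ordering now only holds within each priority level, which is part of why Kafka doesn’t build this in.

```python
from confluent_kafka import Consumer, Producer

BOOTSTRAP = "localhost:9092"  # assumed local broker

# Route messages to a topic per priority level instead of one shared topic.
producer = Producer({"bootstrap.servers": BOOTSTRAP})
producer.produce("orders-high", value=b"VIP order")
producer.produce("orders-low", value=b"regular order")
producer.flush()

def make_consumer(topic: str) -> Consumer:
    consumer = Consumer({
        "bootstrap.servers": BOOTSTRAP,
        "group.id": f"priority-demo-{topic}",
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe([topic])
    return consumer

high = make_consumer("orders-high")
low = make_consumer("orders-low")

while True:
    # Check the high-priority topic first; only fall back to the low one
    # when nothing urgent is waiting.
    msg = high.poll(0.1) or low.poll(0.1)
    if msg is None or msg.error():
        continue
    print(msg.topic(), msg.value().decode())
```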

Comments closed

Thinking about Temporary Stats on Snapshots

Lonny Niederstadt shares an interesting scenario:

Consider a snapshot database which is created daily.  The purpose is to provide analytics reporting access while maintenance or data loads take place in the source database.  In the snapshot database, analytics reports have no locking concerns from the activity in the underlying source database.  And the temporary statistics provided by SQL Server, combined with the statistics inherited from the source database, provide a lot of information to the optimizer for query plan selection.
But what if significant fact tables are queried in the snapshot and leave a situation like col2 in stats_test?  A column which generates an auto-created stat in the snapshot, but never gets a statistic created in the source database.  Each day, the cost of creating that statistic and every statistic like it will be paid as part of the workload.  Even if the underlying table is a now-stable dimension. 

Click through for the demonstration.
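
If you want to check whether your own snapshots are quietly rebuilding statistics every day, the is_temporary flag in sys.stats is the place to look. Here is a small sketch, assuming pyodbc and a snapshot database named SalesDW_Snapshot (the names are illustrative):

```python
import pyodbc

# Assumptions: ODBC Driver 17, a source database "SalesDW" and its daily
# snapshot "SalesDW_Snapshot"; adjust names for your environment.
conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=localhost;Database=SalesDW_Snapshot;Trusted_Connection=yes;"
)

# Temporary statistics live only for the lifetime of the snapshot and are
# rebuilt (and paid for) every day; is_temporary = 1 flags them.
sql = """
SELECT OBJECT_NAME(s.object_id) AS table_name,
       s.name                   AS stat_name,
       s.auto_created,
       s.is_temporary
FROM sys.stats AS s
WHERE s.is_temporary = 1
ORDER BY table_name, stat_name;
"""

for row in conn.cursor().execute(sql):
    # Anything that shows up here repeatedly is a candidate for a permanent
    # CREATE STATISTICS (or an index) in the source database, so the next
    # snapshot inherits it instead of re-deriving it.
    print(row.table_name, row.stat_name)
```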

Comments closed

The Importance of Specifying Nullability

Josh Darnell points out a case where not specifying nullability on columns can lead to confusion:

I always indicate whether a column allows NULL or not in CREATE TABLE and ALTER TABLE statements.

Or, you know, I’m in the habit of doing that.

Like, I try to be in the habit.

I’m doing my best, okay?

Anyway, it’s good to be clear about nullability, because you can run into surprises when you aren’t. 

Click through for a surprise.
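
The surprise usually comes down to session settings: when you leave nullability off, SET ANSI_NULL_DFLT_ON decides it for you. Here is a quick sketch of that general gotcha, assuming pyodbc and a scratch database named Sandbox; it is my illustration, not necessarily the exact scenario Josh ran into.

```python
import pyodbc

# Assumption: a local SQL Server 2016+ instance and a scratch database Sandbox.
conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=localhost;Database=Sandbox;Trusted_Connection=yes;",
    autocommit=True,
)
cur = conn.cursor()

# The same CREATE TABLE, run under different session settings, yields
# different nullability for the column that omits NULL / NOT NULL.
for setting in ("ON", "OFF"):
    cur.execute(f"SET ANSI_NULL_DFLT_ON {setting};")
    cur.execute("DROP TABLE IF EXISTS dbo.NullDemo;")
    cur.execute("CREATE TABLE dbo.NullDemo (SomeValue int);")  # nullability unspecified
    row = cur.execute(
        "SELECT is_nullable FROM sys.columns "
        "WHERE object_id = OBJECT_ID('dbo.NullDemo') AND name = 'SomeValue';"
    ).fetchone()
    print(f"ANSI_NULL_DFLT_ON {setting}: is_nullable = {row.is_nullable}")
```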

Comments closed

Managing the SQL Server Error Log

Guy Glantser has some tips for managing the SQL Server error log:

SQL Server maintains its own log, also called “SQL Server Error Log”. This log contains messages describing informational and error events, similar to messages that you can find in Windows logs. In fact, many of the messages found in the SQL Server Error Log can also be found in the Windows Application Log. The SQL Server Error Log is a great place to find information about what’s happening on your database server.

SQL Server uses 7 log files to store these messages. One file serves as the current log file, and every new message is written to that file. The other 6 files are archived files, and they contain previous messages. Each time SQL Server is restarted, it recycles the files. What does that mean? First, it means that a new log file is created and becomes the new current log file. Second, the oldest log file (“Archive #6”) is deleted. And third, all the other log files are pushed back one step. The previous current log file becomes “Archive #1”, the previous “Archive #1” log file becomes “Archive #2”, and so on.

Read on to see how you can change this, manage the size of log files, and retain data for a longer time.
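
As one illustration of the management side, you can cycle the log yourself (say, from a nightly Agent job) so no single file grows enormous between restarts, and xp_readerrorlog lets you search the files programmatically. The number of archived files can also be raised from the default of six (up to 99) in SSMS or the registry. Here is a sketch, assuming pyodbc and a sysadmin connection; the search string is just an example.

```python
import pyodbc

# Assumed: ODBC Driver 17 and a sysadmin connection to a local instance.
conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=localhost;Database=master;Trusted_Connection=yes;",
    autocommit=True,
)
cur = conn.cursor()

# Close the current error log and start a new one (what a restart would
# otherwise do), so individual files stay a manageable size.
cur.execute("EXEC sp_cycle_errorlog;")

# Read archive #1 (the log we just closed); 1 = error log (2 would be the
# Agent log), filtered to lines containing a search string.
for row in cur.execute("EXEC xp_readerrorlog 1, 1, N'Error';"):
    print(row[0], row[-1])
```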

Comments closed

Performance Impacts of Computed Columns

Robert Sheldon takes us through a few scenarios:

In this article, I walk you through the process of applying these strategies so you have a better sense of the available options. For the examples, I created four similar tables and populated them with identical data, which comes from the WideWorldImporters sample database. Each table includes the same computed column, with the column persisted in two tables and indexed in two tables, resulting in the following mix:

– The Orders1 table includes a non-persisted computed column.
– The Orders2 table includes a persisted computed column.
– The Orders3 table includes an indexed, non-persisted computed column.
– The Orders4 table includes an indexed, persisted computed column.

For each table, I show you the execution plan that’s generated when querying the computed column. The column’s expression is a relatively simple one, and the data set is very small. Even so, this should be enough to demonstrate the principles of creating persisted and indexed computed columns and how they can help address performance-related issues.

Click through for Robert’s results.
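
If you want to recreate the four-way comparison against your own data, the DDL pattern looks roughly like the sketch below. The LineTotal expression and column names are stand-ins rather than Robert’s exact WideWorldImporters-based definitions, and pyodbc against a scratch database is assumed.

```python
import pyodbc

# Assumption: a local SQL Server 2016+ instance and a scratch database Sandbox.
conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=localhost;Database=Sandbox;Trusted_Connection=yes;",
    autocommit=True,
)
cur = conn.cursor()

ddl = {
    # Orders1: computed on the fly at query time.
    "Orders1": "LineTotal AS (Quantity * UnitPrice)",
    # Orders2: computed at write time and stored with the row.
    "Orders2": "LineTotal AS (Quantity * UnitPrice) PERSISTED",
    # Orders3: not persisted, but deterministic and precise, so indexable.
    "Orders3": "LineTotal AS (Quantity * UnitPrice)",
    # Orders4: persisted and indexed.
    "Orders4": "LineTotal AS (Quantity * UnitPrice) PERSISTED",
}

for table, computed in ddl.items():
    cur.execute(f"DROP TABLE IF EXISTS dbo.{table};")
    cur.execute(
        f"CREATE TABLE dbo.{table} ("
        " OrderID int NOT NULL PRIMARY KEY,"
        " Quantity int NOT NULL,"
        " UnitPrice decimal(18,2) NOT NULL,"
        f" {computed});"
    )

# Index the computed column on the last two variants only.
for table in ("Orders3", "Orders4"):
    cur.execute(f"CREATE INDEX IX_{table}_LineTotal ON dbo.{table} (LineTotal);")
```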

Comments closed

Finding the Earliest and Latest Date across All Power BI Tables

Soheil Bakhshi has an interesting challenge:

Many of you may have already thought that we can use CALENDARAUTO() in DAX and we are good to go. Well, that’s not quite right. In many cases there are some Date or DateTime columns that must not be considered in our Date dimension, like Birth Date or Deceased Date. More on this later in this post.

In this post I share a piece of code I wrote for myself. I have been in the situation of needing to identify the Start Date and the End Date of the date dimension many times, so I thought it might help you as well.

Read on for a solution when not all date columns are relevant.
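
Soheil’s solution is written in DAX, so click through for the real thing. Purely to illustrate the idea in a language-agnostic way, here is a small pandas sketch that scans every datetime column across a set of tables, skips excluded columns such as birth dates, and reports the range the date dimension should cover; the table and column names are invented.

```python
import pandas as pd

# Hypothetical tables standing in for a Power BI model; the real post
# does this in DAX rather than Python.
tables = {
    "Sales": pd.DataFrame({"OrderDate": pd.to_datetime(["2019-01-03", "2020-08-15"])}),
    "Customer": pd.DataFrame({"BirthDate": pd.to_datetime(["1960-05-01", "1985-11-20"])}),
}

# Columns that should not drive the date dimension's range.
excluded = {("Customer", "BirthDate")}

candidates = []
for name, df in tables.items():
    for col in df.select_dtypes(include=["datetime64[ns]"]).columns:
        if (name, col) not in excluded:
            candidates.append(df[col])

all_dates = pd.concat(candidates)
print("Date dimension should span", all_dates.min(), "to", all_dates.max())
```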

Comments closed

Passing an Array of Arrays as a Parameter in Azure Data Factory

Rayis Imayev has a list for us:

In my previous blog post – Setting default values for Array parameters/variables in Azure Data Factory, I had helped myself to remember that arrays could be passed as parameters to my Azure Data Factory (ADF) pipelines. This time I’m helping myself to remember that an array of other arrays can also exist as ADF pipeline parameters’ values.

Read on for the example.
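
For reference, the pipeline JSON involved is unremarkable: the parameter’s type is simply Array, and its default value happens to contain arrays as elements. Here is a sketch of that shape with invented parameter and table names; Rayis’s post walks through the actual ADF screens and expressions.

```python
import json

# A fragment of an ADF pipeline definition: the parameter type is "Array",
# and its default value holds arrays as elements. Names are illustrative.
pipeline_fragment = {
    "parameters": {
        "CopyJobs": {
            "type": "Array",
            "defaultValue": [
                ["srcTableA", "dstTableA"],
                ["srcTableB", "dstTableB"],
            ],
        }
    }
}

print(json.dumps(pipeline_fragment, indent=2))

# When triggering a run (REST API or SDK), the same nested structure can be
# supplied as the parameter value; inside a ForEach over the parameter, each
# item() is then itself an array.
run_parameters = {"CopyJobs": [["srcTableC", "dstTableC"]]}
print(json.dumps(run_parameters))
```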

Comments closed

Indexing S3 Data with CDP Data Hub

Eva Nahari, et al., show how to perform indexing and serving of S3 data in Cloudera Data Platform:

This blog post will present a simple “hello world” kind of example on how to get data that is stored in S3 indexed and served by an Apache Solr service hosted in a Data Discovery and Exploration cluster in CDP. For the curious: DDE is a pre-templated, Solr-optimized cluster deployment option in CDP, recently released in tech preview. We will only cover AWS and S3 environments in this blog. Azure and ADLS deployment options are also available in tech preview, but will be covered in a future blog post.

We will depict the simplest scenario to make it easy to get started. There are of course more advanced data pipeline setups and richer schemas possible, but this is a good starting point for a beginner.

Read on for the instructions.
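
The article uses the DDE cluster’s own tooling, so treat the following only as a toy illustration of the overall shape of the task (read JSON out of S3, push documents into a Solr collection). The bucket, key, and Solr endpoint are invented, and boto3 plus pysolr are my stand-ins rather than anything from the CDP walkthrough.

```python
import json

import boto3
import pysolr

# Hypothetical bucket, key, and Solr endpoint.
s3 = boto3.client("s3")
body = s3.get_object(Bucket="my-data-bucket", Key="events/2020-09-01.json")["Body"]

# One JSON object per line becomes one Solr document.
docs = [json.loads(line) for line in body.iter_lines()]

solr = pysolr.Solr("http://dde-host:8983/solr/events", always_commit=True)
solr.add(docs)

print(solr.search("*:*").hits, "documents indexed")
```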

Comments closed

A Review of KaDeck: Kafka Management Tool

Guy Shilo reviews KaDeck:

If you want the real web version, then you need to register on their website and create a team first. Yes, KaDeck is aimed at teams of people working together. You create a team online and shortly afterward you receive an email with a team ID and a secret key that you will use when running KaDeck web.

It seems that there is no host installation of KaDeck right now (an rpm package, binaries, or the like), only a Docker image. This forces the user to run it in Docker or a Kubernetes cluster (or one of its commercial distributions, such as OpenShift). Another thing is that it checks your license online each time you start the container. If you want to use it offline, you have to go through an offline activation process, which you can reach from the administration menu.

Click through for Guy’s thoughts on the product.

Comments closed