Metacat: Federated Metadata Discovery

Ajoy Majumdar and Zhen Li walk us through Metacat:

The core architecture of the big data platform at Netflix involves three key services. These are the execution service (Genie), the metadata service, and the event service. These ideas are not unique to Netflix, but rather a reflection of the architecture that we felt would be necessary to build a system not only for the present, but for the future scale of our data infrastructure.

Many years back, when we started building the platform, we adopted Pig as our ETL language and Hive as our ad-hoc querying language. Since Pig did not natively have a metadata system, it seemed ideal for us to build one that could interoperate between both.

Thus Metacat was born, a system that acts as a federated metadata access layer for all data stores we support. A centralized service that our various compute engines could use to access the different data sets. In general, Metacat serves three main objectives:

  • Federated views of metadata systems
  • Unified API for metadata about datasets
  • Arbitrary business and user metadata storage of datasets

It is worth noting that other companies that have large and distributed data sets also have similar challenges. Apache Atlas, Twitter’s Data Abstraction Layer and Linkedin’s WhereHows (Data Discovery at Linkedin), to name a few, are built to tackle similar problems, but in the context of the respective architectural choices of the companies.

If you’re interested, also check out their GitHub repo.

Understanding A Spark Streaming Workflow

Himanshu Gupta continues a series on structured streaming using Spark Streaming:

Here we can clearly see that if new data is pushed to the source, Spark will run the “incremental” query that combines the previous running counts with the new data to compute updated counts. The “Input Table” here is the lines DataFrame which acts as a streaming input for wordCounts DataFrame.

Now, the only unknown thing in the above diagram is “Complete Mode“. It is nothing but one of the 3 output modes available in Structured Streaming. Since they are an important part of Structured Streaming, so, let’s read about them in detail:

  1. Complete Mode – This mode updates the entire Result Table which is eventually written to the sink.

  2. Append Mode – In this mode, only the new rows are appended in the Result Table and eventually sent to the sink.

  3. Update Mode – At last, this mode updates only the rows that are changed in the Result Table since the last trigger. Also, only the new rows are sent to the sink. There is one peculiar thing to note about this mode, i.e., it is different from the Complete Mode in the way that this mode only outputs the rows that have changed since the last trigger. If the query doesn’t contain any aggregations, it is equivalent to the Append mode.

Check it out.

Calculating TF-IDF Using Apache Spark

Arseniy Tashoyan shows us how to calculate Term Frequency-Inverse Document Frequency using Apache Spark:

TF-IDF is used in a large variety of applications. Typical use cases include:

  • Document search.
  • Document tagging.
  • Text preprocessing and feature vector engineering for Machine Learning algorithms.

There is a vast number of resources on the web explaining the concept itself and the calculation algorithm. This article does not repeat the information in these other Internet resources, it just illustrates TF-IDF calculation with help of Apache Spark. Emml Asimadi, in his excellent article Understanding TF-IDF, shares an approach based on the old Spark RDD and the Python language. This article, on the other hand, uses the modern Spark SQL API and Scala language.

Although Spark MLlib has an API to calculate TF-IDF, this API is not convenient to learn the concept. MLlib tools are intended to generate feature vectors for ML algorithms. There is no way to figure out the weight for a particular term in a particular document. Well, let’s make it from scratch, this will sharpen our skills.

Read on for the solution.  It seems that there tend to be better options today than TF-IDF for natural language problems, but it’s an easy algorithm to understand, so it’s useful as a first go.

File Growth Rate Under 1MB

John Morehouse shows that the file growth rate GUI for Management Studio doesn’t report values under 1MB correctly:

While I was recently,  doing a review of a client’s environment I discovered that the GUI can lie to you when it comes to the database file growth rates.  By default, the data file is set to a 1MB growth rate and the log file is configured for a 10% growth rate.  Both are horrible settings for most OLTP environments.  However, starting with SQL Server 2016, the default growth rates are configured for 64MB, which in my opinion is better than the previous defaults.

Using the GUI to look at a 2017 Scratch database I have, we can see that the data file is configured for 1MB and the log file is set for 64MB growth.

I don’t think there’s a good reason for a file growth rate under 1 MB at this point.  That could have made sense in the late ’90s, but the idea of growing 128KB at a time is funny.

Running ssisUnit In MSTest

Bartosz Ratajczyk shows how to convert ssisUnit tests to work with the NUnit or MSTest frameworks:

MSTest v2 is the open source test framework by Microsoft. I will not write a lot about it. If you want to learn more – read the excellent blog posts by Gérald Barré.

I took the idea and parts of the code from Ravi Palihena’s blog post about ssisUnit testing and his GitHub repository. Then I read the source code of the SsisUnitTestRunnerSsisUnitTestRunnerUI and posts by Gérald and changed the tests a bit.

I will use MSTest to execute ssisUnit tests from the file 20_DataFlow.ssisUnit. For that, I created a new Visual C# > Test > Unit Test Project (.NET Framework) – ssisUnitLearning.MSTest – within the solution. I also set the reference to the SsisUnit2017.dll and SsisUnitBase.dll libraries and loaded required namespaces

Bartosz gives us the initial walkthrough, and then builds a T4 template to automate the task.  You can grab that template on his GitHub repo, and hopefully something makes its way into ssisUnit to make integration with NUnit / MSTest official.

Disambiguating “App” In Power BI

Melissa Coates gives us the different forms of what an “app” is in the Power BI world:

Let’s say you just heard someone mention a Power BI app. What exactly do they mean by that? Well, it depends. The term “app” is used kind of a lot in the Power BI world. So, here’s a quick reference to help you decode the conversation. I’m going to start with the most likely options, working down to other options. Which one someone is referring to really depends on their role and their level of familiarity with the Power BI ecosystem.

Power BI App

Power BI App is a packaged up set of content in the web-based Power BI Service. Related reports, workbooks, dashboards, and datasets are published from an App Workspace into an App for users to consume.

Power BI App Workspace

An App Workspace in the Power BI Service is where reports, workbooks, dashboards, and datasets are saved, and where data refresh schedules and other settings are defined. An App Workspace is suited to development & collaboration with coworkers (whereas My Workspace is a private area). Smaller teams might do everything they need to do within an App Workspace, whereas larger teams use an App Workspace as the collaboration area for content before it gets published to a Power BI App for consumption. You can have quite a few App Workspaces, depending on how you organize content (for instance, by subject area, by project, by department, or by type of analysis).

Click through for several other potential answers for what that user means by “app.”

Why Your Transaction Log Is Full: LOG_BACKUP

Jen McCown explains why you might get the error message “The transaction log for database ‘<your database>’ is full due to ‘LOG_BACKUP'”:

Your transaction log is full. Both Microsoft, and about 100 articles and blogs have covered this topic, but let’s take a quick look anyway. Because, you know, it comes up all the time.


  1. This error message points to a lack of log backups.

  2. Make sure using sys.databases.

  3. Start backing up the log.

  4. You can shrink the log if necessary.

  5. A note on SIMPLE mode, and why it’s often a terrible idea.

This is a good summary of the problem and various solutions.

Randomization With NEWID()

Michael J. Swart tests whether ORDER BY NEWID() produces a biased result:

One of his articles, Visualizing Algorithms has some thoughts on shuffling at

He says that sorting using a random comparator is a rotten way to shuffle things. Not only is it inefficient, but the resulting shuffle is really really biased. He goes on to visualize that bias (again, I really encourage you to go see his stuff).

Ordering by random reminded me of the common technique in SQL Server of ORDER BY newid(). So I wondered whether an obvious bias was present there. So I shuffled 100 items thousands of times and recreated the visualization of bias in a heat map (just like Mike did).

I’ve used this pattern to good effect, but definitely pay attention to Michael’s warning at the end.

Query Store And Plan Cache Flushes

Erin Stellato explains why enabling Query Store requires the plan cache be flushed:

I’ve had two comments recently on my blog about Query Store causing the plan cache to be flushed. There was a known issue related to the plan cache flushing after Query Store was enabled, but this was fixed in CU2 for SQL Server 2016 SP1. So I did some testing and here is what I think is causing the confusion:

When you enable Query Store, which is done with an ALTER DATABASE SET statement, the plan cache for the database is flushed.

Now, before anyone writes up a UserVoice item, understand that there are several ALTER DATABASE SET commands that cause the plan cache for a database to be flushed. For example, taking a database OFFLINE causes the database plan cache to be flushed. That one seems intuitive, right?  So why is the plan cache cleared when you enable Query Store, or change one of the settings?  To ensure that new Query Store data is not lost.  This relates to the internals of how Query Store works, which aren’t essential to dig into, the point is that this behavior is known by Microsoft and expected.

There is what I’d consider a documentation bug around describing what happens when you enable Query Store, but the fact that the plan cache gets cleared makes sense.

wrapr 1.5.0 Now On CRAN

John Mount announces wrapr 1.5.0:

wrapr includes a lot of tools for writing better R code:

John also includes an example using the coalesce operator %?%.


June 2018
« May