Author: Kevin Feasel

Apache Avro Now Supported In Spark 2.4

Gengliang Wang, et al., announce built-in support for Apache Avro in Spark 2.4:

Apache Avro is a popular data serialization format. It is widely used in the Apache Spark and Apache Hadoop ecosystem, especially for Kafka-based data pipelines. Starting with the Apache Spark 2.4 release, Spark provides built-in support for reading and writing Avro data. The new built-in spark-avro module originally comes from Databricks’ open source project Avro Data Source for Apache Spark (referred to as spark-avro from now on). In addition, it provides:

  • New functions from_avro() and to_avro() to read and write Avro data within a DataFrame instead of just files.
  • Avro logical types support, including Decimal, Timestamp, and Date types. See the related schema conversions for details.
  • 2X read throughput improvement and 10% write throughput improvement.

In this blog, we examine each of the above features through examples, giving you a flavor of its easy API usage, performance improvements, and merits.

Avro is one of the better rowstore data formats in the Hadoop world, so it’s good to see built-in support here.

Comments closed

Improvements To The SQL Server Availability Group Failover Detection Utility

Rob Sewell has a few improvements to the SQL Server Availability Group Failover Detection PowerShell function:

Archive the data for historical analysis

One of the production DBAs pointed out that having gathered the information, it would be useful to hold it for better analysis of repeated issues. I have added an archiving step so that when the tool runs, if there is already data in the data gathering folder, it will copy that to an archive folder and name it with the date and time that the cluster log was created, as this is a good estimation of when the analysis was performed. If an archive folder location is not provided, it will create an archive folder in the data folder. This is not an ideal solution though, as the utility will copy all of the files and folders from there to its own location, so it is better to define an archive folder in the parameters.

There are several improvements in here, so check them out.
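
The archiving step described in the quote can be sketched roughly as follows. This is only an illustration in Python, not Rob's PowerShell code: the function name `archive_previous_run`, the `cluster.log` file name, the `Archive` default folder name, and the use of the file's modification time as a stand-in for its creation time are all assumptions made for the example.

```python
import shutil
from datetime import datetime
from pathlib import Path


def archive_previous_run(data_dir, archive_root=None, log_name="cluster.log"):
    """Copy existing data-gathering output into a timestamped archive folder.

    All names here are hypothetical. Returns the archive folder that was
    created, or None when there was nothing to archive.
    """
    data_dir = Path(data_dir).resolve()
    # Fall back to an archive folder inside the data folder when none is given.
    if archive_root:
        archive_root = Path(archive_root).resolve()
    else:
        archive_root = data_dir / "Archive"

    if not data_dir.is_dir():
        return None
    # Skip the archive folder itself so it is never copied into its own tree.
    items = [p for p in data_dir.iterdir() if p != archive_root]
    if not items:
        return None

    # Assumption: the cluster log's modification time approximates when the
    # previous analysis ran; fall back to the first item if no log exists.
    log_path = data_dir / log_name
    stamp_source = log_path if log_path.exists() else items[0]
    stamp = datetime.fromtimestamp(stamp_source.stat().st_mtime)
    target = archive_root / stamp.strftime("%Y%m%d_%H%M%S")
    target.mkdir(parents=True, exist_ok=True)

    for item in items:
        if item.is_dir():
            shutil.copytree(item, target / item.name, dirs_exist_ok=True)
        else:
            shutil.copy2(item, target / item.name)
    return target
```

Passing an explicit `archive_root` outside the data folder matches the recommendation in the quote, since the default location sits inside the folder that the utility itself copies.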

Building Hello World With Java In SQL Server 2019

Niels Berglund shows how you can use sp_execute_external_script to run Java code in SQL Server 2019:

In SQL Server 2019, Microsoft added the ability to execute custom Java code along the same lines we execute R and Python, and this blog post intends to give an introduction to how to install and enable the Java extension, as well as execute some very basic Java code. In future posts, I will drill down into how to pass data back and forth between SQL Server and Java.

There may very well be future posts discussing how the internals differ between Java and R/Python, but I want to talk about that a little bit in this post as well, as it has an impact on how we write and call Java code.

The not-so-secret secret here is that Java itself is a less interesting language than, say, Scala. And the reason you’d support Scala? To interact with an Apache Spark cluster. I think that’s a big part of why you’d want the installer to load Java 1.8 instead of 1.9 or later (which contain API changes that break Spark). Definitely give this a careful read, as there are more moving parts and more gotchas than with R or Python support.

What’s New In SQL Server 2019 CTP 2.1 Graph Support

Niko Neugebauer looks at a few additions to SQL Server graph support:

Now, in the next step we shall create a derived view, which shall contain the list with all Persons and Businesses, joining them together:

CREATE OR ALTER VIEW dbo.Followers AS 
	SELECT PersonId as Id, FullName
		FROM dbo.Person
	UNION ALL
	SELECT BusinessId, BusinessName
		FROM dbo.Business;

Now, the real new thing is that we can use such derived tables in SQL Server 2019 CTP 2.1 and Azure SQL Database together with the MATCH clause, in the statements such as the one below where we list all the followers of the “Real Stuff” company:

SELECT Followers.ID, Followers.FullName
	FROM Followers, Follows, Company
	WHERE MATCH(Followers-(Follows)->Company)
		AND CompanyName = 'Real Stuff'

This query works fine, delivering us the expected results while generating a pretty complex execution plan in the background.

Niko focuses on heterogeneous nodes and edges, as well as derived views.

Running SQL Server 2019 In A Docker Container

Cathrine Wilhelmsen shows us how to set up a Docker container running SQL Server 2019 on Linux:

In this post, I share my approach and code snippets for:

  1. Installing Docker
  2. Getting SQL Server 2019
  3. Running SQL Server 2019 in a Docker Container
  4. Restoring Demo Databases (AdventureWorks and WideWorldImporters)

If your hardware supports Docker, this is a great way of getting some experience with a new version of SQL Server without the mess of cleaning up after a CTP or affecting your current dev environment.

Understanding Power BI Service Administrator Permissions

Melissa Coates walks us through Power BI permissions:

Based on the tests I’ve been doing, I’ve observed that users with membership to the Power BI administrator role have two sets of permissions apply:

  • Activities which are scoped across the entire organization
  • Activities for which normal user permissions apply

Within the above 2 categories, I’m thinking there are 4 main types of activities:

  1. Manage tenant settings (always scoped to the organization)
  2. Compile inventory and metadata (can be scoped to the organization)
  3. Manage workspace users (can be scoped to the organization)
  4. Export content from a workspace (relies on user permissions)

There’s a fair amount to digest, but Melissa does a good job explaining the implications of specific permissions.

Gradient Boosting And XGBoost

Shirin Glander has another English-language transcript from a German video, this time covering gradient boosting techniques:

Let’s look at how Gradient Boosting works. Most of the magic is described in the name: “Gradient” plus “Boosting”.

Boosting builds models from individual so-called “weak learners” in an iterative way. In the Random Forests part, I had already discussed the differences between Bagging and Boosting as tree ensemble methods. In boosting, the individual models are not built on completely random subsets of data and features but sequentially by putting more weight on instances with wrong predictions and high errors. The general idea behind this is that instances which are hard to predict correctly (“difficult” cases) will be focused on during learning, so that the model learns from past mistakes. When we train each ensemble on a subset of the training set, we also call this Stochastic Gradient Boosting, which can help improve generalizability of our model.

The gradient is used to minimize a loss function, similar to how Neural Nets utilize gradient descent to optimize (“learn”) weights. In each round of training, the weak learner is built and its predictions are compared to the correct outcome that we expect. The distance between prediction and truth represents the error rate of our model. These errors can now be used to calculate the gradient. The gradient is nothing fancy, it is basically the partial derivative of our loss function – so it describes the steepness of our error function. The gradient can be used to find the direction in which to change the model parameters in order to (maximally) reduce the error in the next round of training by “descending the gradient”.

Along with neural networks, gradient boosting has become one of the dominant algorithms for machine learning, and is well worth learning about.
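
To make the "fit a weak learner to the gradient, then repeat" loop from the quote concrete, here is a minimal sketch in plain Python. It is not Shirin's code and not how XGBoost is implemented: it assumes squared-error loss (where the negative gradient is simply the residual), one-dimensional inputs, and one-split regression stumps as the weak learners. The names `fit_stump`, `gradient_boost`, and `training_mse` are made up for the example.

```python
from statistics import mean


def fit_stump(xs, targets):
    """Fit a one-split regression stump (the weak learner) by least squares.

    Assumes xs contains at least two distinct values.
    """
    best = None
    # Try every split point except the maximum, so both sides are non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_value, right_value = mean(left), mean(right)
        sse = sum((t - left_value) ** 2 for t in left)
        sse += sum((t - right_value) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    _, threshold, left_value, right_value = best
    return lambda x: left_value if x <= threshold else right_value


def gradient_boost(xs, ys, rounds=50, learning_rate=0.1):
    """Build an additive model by repeatedly fitting stumps to residuals."""
    base = mean(ys)  # start from a constant prediction
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(rounds):
        # For loss (y - F)^2 / 2, the negative gradient with respect to the
        # current prediction F is the residual y - F.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Take a small step in the direction that reduces the loss.
        predictions = [p + learning_rate * stump(x)
                       for x, p in zip(xs, predictions)]

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict


def training_mse(model, xs, ys):
    """Mean squared error of a model on the given data."""
    return mean((y - model(x)) ** 2 for x, y in zip(xs, ys))
```

Calling `gradient_boost(xs, ys)` on a small numeric data set returns a `predict` function. With `rounds=0` the model is just the mean of `ys`, and each additional round lowers the training error a little, which is the "learning from past mistakes" behavior the quote describes.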

Visualizing Traditional Japanese Color Palettes

Chisato den Engelsen looks at 465 traditional color palettes used in Japan:

Since each of the colours has a name, I was also curious whether some characters are used more often than others. Each colour name is written in two ways on this website: one in Kanji and the other in Hiragana.

I love wordcloud2 for visualizing word clouds, so I can see which characters appear more often than the others.

It’s an interesting exercise and all of the R code is included.  Be sure to check out the list of colors with a character representing “rat” or “mouse” in the name.  H/T R-Bloggers

Testing Scalar UDF Performance In SQL Server 2019 CTP 2.1

Brian Hansen takes a look at the scalar UDF performance improvements in SQL Server 2019:

In one of my sessions, Set Me Up: How to Think in Sets, I discuss a variety of performance-inhibiting query constructs, including scalar UDFs. I thought it would be interesting to take the simple scalar function that I use in the demo and see what kind of difference scalar inlining might make.

First, I restored the CorpDB database that I use in the session to my SQL Server 2019 CTP 2.1 instance and initially set the compatibility level to 140. I also ran script 001 from the demo to create the needed database tables (no need to create the CLR objects for this test). I then ran script 030 to execute the scalar UDF test. In a nutshell, this script

  • creates a UDF
  • runs a query that calls the UDF about 13,000 times, capturing the time required to do so
  • repeats this test five times
  • discards the fastest and slowest tests
  • reports the average time for the remaining three tests

If I’m reading Brian’s notes right, it’s still slower than writing the set-based solution yourself, but a huge improvement over the prior scalar function performance.
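
The measurement approach in Brian's list (five runs, drop the fastest and slowest, average the remaining three) is easy to reuse for other comparisons. Below is a small sketch of that idea in Python rather than T-SQL. The names `trimmed_mean` and `time_trimmed` are made up for illustration, and `run_once` stands in for whatever executes the query under test.

```python
import time
from statistics import mean


def trimmed_mean(durations):
    """Average the durations after discarding the fastest and slowest run."""
    if len(durations) < 3:
        raise ValueError("need at least three runs to discard two of them")
    ordered = sorted(durations)
    return mean(ordered[1:-1])


def time_trimmed(run_once, repetitions=5):
    """Time run_once() several times and return the trimmed mean in seconds.

    run_once is a hypothetical zero-argument callable that performs the
    work being measured, for example executing a query through a driver.
    """
    durations = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_once()
        durations.append(time.perf_counter() - start)
    return trimmed_mean(durations)
```

Dropping the extremes keeps a single cold-cache or unusually lucky run from skewing the reported figure, which matters when the comparison is between two compatibility levels on the same instance.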

Monitoring At Stack Overflow

Nick Craver has been driven around the bend by monitoring and we get to enjoy the fruits of it:

…but evidently some people think of other things. Those people are obviously wrong, but let’s continue. When I’m not a walking zombie after reading a 10,000 word blog post some idiot wrote, I see monitoring as the process of keeping an eye on your stuff, like a security guard sitting at a desk full of cameras somewhere. Sometimes they fall asleep–that’s monitoring going down. Sometimes they’re distracted with a doughnut delivery–that’s an upgrade outage. Sometimes the camera is on a loop–I don’t know where I was going with that one, but someone’s probably robbing you. And then you have the fire alarm. You don’t need a human to trigger that. The same applies when a door gets opened, maybe that’s wired to a siren. Or maybe it’s not. Or maybe the siren broke in 1984.

I know what you’re thinking: Nick, what the hell? My point is only that monitoring any application isn’t that much different from monitoring anything else. Some things you can automate. Some things you can’t. Some things have thresholds for which alarms are valid. Sometimes you’ll get those thresholds wrong (especially on holidays). And sometimes, when setting up further automation isn’t quite worth it, you just make using human eyes easier.

This is a really good post covering monitoring techniques at a high level and getting into specific implementations at Stack Overflow.
