Day: September 7, 2018

Integrating Kafka Into A Data Scientist’s Workflow

Liz Bennett from Stitch Fix has a guest post on the Confluent blog:

Our main requirement for this new project was to build infrastructure that would be 100 percent self-service for our Data Scientists. In other words, my teammates and I would never be directly involved in the discovery, creation, configuration and management of the event data. Self-service would fix the primary shortcoming of our legacy event delivery system: manual administration that was performed by my team whenever a new dataset was born. This manual process hindered the productivity and access to event data for our Data Scientists. Meanwhile, fulfilling the requests of the Data Scientists hindered our own ability to improve the infrastructure. This scenario is exactly what the Data Platform Team strives to avoid. Building self-service tooling is the number one tenet of the Data Platform Team at Stitch Fix, so whatever we built to replace the old event infrastructure needed to be self-service for our Data Scientists. You can learn more about our philosophy in Jeff Magnusson’s post Engineers Shouldn’t Write ETL.

This is an architectural overview and a good read.

Databricks UDF Performance Testing

Tristan Robinson shares some performance comparisons for different Azure Databricks scenarios:

I’ve recently been spending quite a bit of time on the Azure Databricks platform, and while learning decided it was worth using it to experiment with some common data warehousing tasks in the form of data cleansing. As Databricks provides us with a platform to run a Spark environment on, it offers options to use cross-platform APIs that allow us to write code in Scala, Python, R, and SQL within the same notebook. As with most things in life, not everything is equal and there are potential differences in performance between them. In this blog, I will explain the tests I produced with the aim of outlining best practice for Databricks implementations for UDFs of this nature.

Scala is the native language for Spark – and without going into too much detail here, it will compile down faster to the JVM for processing. Under the hood, Python on the other hand provides a wrapper around the code but in reality is a Scala program telling the cluster what to do, and being transformed by Scala code. Converting these objects into a form Python can read is called serialisation / deserialisation, and it's expensive, especially over time and across a distributed dataset. The most expensive scenario occurs through UDFs (functions) – the runtime process for which can be seen below. The overhead here is in (4) and (5) to read the data and write into JVM memory.

Click through for the results.  Looks like Python barely beat out Scala for the #1 position, but Scala was a little faster than Python in-class (e.g., the Scala program with a Scala SQL UDF was a little bit faster than the Python equivalent).

Working With Dates And Times In T-SQL

Tomaz Kastrun walks us through various functions to work with dates and times in T-SQL:

Manipulating date and time in T-SQL is a daily and very common task that every DBA, SQL Developer, BI Developer and data scientist will come across. Over the years, I have accumulated so many simple date and/or time manipulation combinations of different functions that it is time to put them together.

Don’t expect to find here anything you haven’t used or seen, especially if you are a long-time T-SQL developer. The point is to have a post with a lot of examples of date and time manipulation in one place. This is by no means the definitive list, but it should be quite substantial, and the code in the GitHub repository will be updated.

The list will be updated on my GitHub, and therefore this blog post might not include everything. In all of the following examples I will be using the GETDATE() function to get the current datetime, unless the examples use stored dates. Therefore, some of the examples or screen-prints will be different from yours.

This mostly focuses on the DATETIME type rather than DATETIME2 or DATE, but there are a few TIME uses.  Check out Tomaz’s repo for more.

SSMS 17.9 Released

Alan Yu announces a new version of SQL Server Management Studio:

SSMS 17.9 provides support for almost all feature areas on SQL Server 2008 through the latest SQL Server 2017, which is now generally available.

In addition to enhancements and bug fixes, SSMS 17.9 comes with several new features:

  • ShowPlan improvements
  • Azure SQL support for vCore SKUs
  • Bug Fixes

View the Release Notes for more information.

It looks like the big push for this release was bug fixes, and there are quite a few of them.

Quartiles In DAX

Dustin Ryan shows us how to calculate quartiles using DAX:

To calculate the quartile, we’re going to use the PERCENTILEX.INC DAX function. The PERCENTILEX.INC function returns the number at the specified percentile. So for example, if I had numbers 0 and 100 in my data set, the 25th percentile value would be 25. The 50th percentile value would be 50 and the 75th percentile value would be 75, and you can figure out what the 100th percentile value would be.

Dustin shares an example with his NFL data set and also walks us through a couple of tricky situations.
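
The arithmetic in the quoted example can be checked with a small sketch. This is not Dustin's DAX; it is a plain Python illustration that assumes PERCENTILEX.INC uses the same inclusive linear-interpolation rule as Excel's PERCENTILE.INC. The function name `percentile_inc` is hypothetical and exists only for this example.

```python
def percentile_inc(values, k):
    """Inclusive percentile with linear interpolation, for k in [0, 1].

    Hypothetical helper: an assumed stand-in for the rule behind
    PERCENTILEX.INC, not the DAX engine's actual implementation.
    """
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= k <= 1:
        raise ValueError("k must be between 0 and 1")

    ordered = sorted(values)
    # Zero-based fractional rank into the sorted list: k * (n - 1).
    rank = k * (len(ordered) - 1)
    lower = int(rank)
    frac = rank - lower

    # At the top of the range there is nothing left to interpolate toward.
    if lower + 1 >= len(ordered):
        return float(ordered[-1])

    # Interpolate between the two neighbouring values.
    return ordered[lower] + frac * (ordered[lower + 1] - ordered[lower])


data = [0, 100]
quartiles = {k: percentile_inc(data, k) for k in (0.25, 0.50, 0.75, 1.00)}
print(quartiles)
```

With only the values 0 and 100 in the data set, the interpolation alone produces the 25, 50, and 75 figures from the quote.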

SQL On Linux AD: Group Membership Issues

Dylan Gray and Tejas Shah continue their series on troubleshooting issues when connecting to SQL Server on Linux via an Active Directory account:

Imagine a scenario where after a successful AD login and running a couple queries, some users may see the error “Could not obtain information about Windows NT group/user ‘CONTOSO\user’.” This is due to a failure when searching for group memberships for the logged in user and can be easily fixed.

When a user logs in, their group memberships are looked up and used to determine if they have the privileges to log in. Once the user is connected, SQL Server must validate their group memberships in many scenarios, to make sure their effective access permissions have not changed. For example, if user CONTOSO\user1 was a member of CONTOSO\group1, and CONTOSO\group1 has login permission for the SQL Server instance, then CONTOSO\user1 can log in successfully. However, if after CONTOSO\user1 logs in, they are removed from CONTOSO\group1 by a domain admin, then their access to SQL Server should be revoked.

Click through to see what SQL Server on Linux uses to check AD group information and what you can do if there’s a problem.

New dbatools Release

Chrissy LeMaire announces new changes to dbatools:

Marching onward to dbatools 1.0, a ton of commands have been renamed to align with our now mature naming scheme. These changes were made in today’s release, version 0.9.410 aka regularlegs, now available on GitHub and the PowerShell Gallery.

Here’s the general idea:

  • DbaDatabase has mostly been renamed to DbaDb with a couple exceptions

  • DbaSql has been changed to just Dba, no exceptions. Think of Dba as “Sql”.

  • DbaConfig has been renamed to DbatoolsConfig

  • TempDbConfiguration has been renamed to TempdbConfig

  • All Configuration commands are Config except SpConfigure

  • DbaDacpac has been renamed to DbaDacPackage. Dac is the prefix for our data-tier application commands.

  • DbaDbQueryStoreOptions has been renamed to DbaDbQueryStoreOption

Some of this is in preparation for breaking changes in dbatools 1.0.  There’s a lot of stuff in this release, so check it out.
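
For anyone with older scripts to update, the renaming rules in the list above could be applied mechanically. The following is a rough Python sketch, not a dbatools tool: the rule table is only a reading of the bullets, the helper `rename_command` is hypothetical, and the sample command names are illustrative inputs that have not been checked against the actual release. The post does not name the DbaDatabase exceptions, so they are left as a caller-supplied set.

```python
import re

# Ordered (pattern, replacement) pairs derived from the bullet list.
# More specific renames come first so broader ones cannot pre-empt them.
RENAME_RULES = [
    (re.compile(r"DbaDbQueryStoreOptions\b"), "DbaDbQueryStoreOption"),
    (re.compile(r"DbaDacpac"), "DbaDacPackage"),
    (re.compile(r"TempDbConfiguration"), "TempdbConfig"),
    (re.compile(r"DbaConfig(?!uration)"), "DbatoolsConfig"),
    (re.compile(r"DbaDatabase"), "DbaDb"),
    (re.compile(r"DbaSql"), "Dba"),
    # "SpConfigure" does not contain "Configuration", so it is untouched.
    (re.compile(r"Configuration"), "Config"),
]


def rename_command(name, exceptions=frozenset()):
    """Return the assumed new name for an old command name.

    `exceptions` holds names that keep their old form; which names
    belong there is an assumption the caller has to supply.
    """
    if name in exceptions:
        return name
    for pattern, replacement in RENAME_RULES:
        name = pattern.sub(replacement, name)
    return name


# Illustrative inputs only; treat the exception below as hypothetical.
samples = [
    "Invoke-DbaSqlQuery",
    "Remove-DbaDatabaseSnapshot",
    "Get-DbaConfig",
    "Get-DbaTempDbConfiguration",
    "Export-DbaDacpac",
    "Get-DbaDbQueryStoreOptions",
    "Get-DbaSpConfigure",
    "Get-DbaDatabase",
]
for old in samples:
    print(old, "->", rename_command(old, exceptions={"Get-DbaDatabase"}))
```

Any output from a sketch like this should be checked against the real command list in the installed module before scripts are changed.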

Migrating Azure SQL Databases Between Resource Groups

Arun Sirpal shows us a method for moving Azure SQL Databases between resource groups:

In my mind there are a couple of ways to move a database across resource groups. They vary from scripting to just using the Azure portal. I am going to use the Azure portal and do the following.

  1. Export a database in resource group X to a storage account Z.
  2. Import the file from the storage account Z into a database that is in resource group Y.

It’s just like a “backup and restore” strategy, all with the assumption that you are working within the same subscription ID.

Read on for a step-by-step demonstration of how to do this.
