Press "Enter" to skip to content

Author: Kevin Feasel

Data Professional Salary Survey Now Open

Brent Ozar announces the 2020 edition of the Data Professional Salary Survey:

It’s time for our annual salary survey to find out what data professionals make. You fill out the data, we open source the whole thing, and you can analyze the data to spot trends and do a better job of negotiating your own salary:

Take the Data Professional Salary Survey now.

The anonymous survey closes Sunday, January 5, 2020. The results will be completely open source, and shared with the community for your analysis.

Please take a few minutes and fill this out before January 5th. I’d really love to see a lot of non-SQL Server professionals take the survey, as every year I end up having to ignore the database platform question because the responses are 95% SQL Server and 5% everything else.

Change Data Capture and Replication on Linux

Tejas Shah announces transactional replication and change data capture for SQL Server 2017 on Linux:

With SQL Server 2019, we introduced support for replication and CDC features for SQL Server on Linux by bringing the relevant components and subsystems into the SQL Server core engine package. This support includes snapshot replication, transactional replication, and CDC. Peer-to-peer transactional replication, merge replication, and Oracle publishing are not supported.

Today we are glad to announce that we have brought the replication and CDC features to SQL Server 2017, starting with Cumulative update 18 (CU18).

Tejas has a few links, but the key is just to update your server (or spin up a new Docker container with the latest CU and swap external database files over to it).
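
Once you’re on CU18 or later, enabling CDC itself is the same T-SQL as on Windows. Here’s a minimal sketch, assuming a hypothetical Sales database with a dbo.Orders table:

-- Enable CDC at the database level (run inside the target database).
USE Sales;
EXEC sys.sp_cdc_enable_db;

-- Then enable CDC per table. SQL Server Agent must be running, as CDC
-- depends on its capture and cleanup jobs.
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'Orders',
    @role_name     = NULL;  -- NULL skips the gating role for reading changes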

T-SQL Tuesday Roundup

Mala Mahadevan rounds up T-SQL Tuesday #121:

Seven of the 18 people talk of their new dream jobs. This should give all of us hope that there are good gigs out there if we are stuck in a bad place. No gig is perfect, of course, but it helps to have a dream and a direction to get to a better place. Almost everyone mentions their involvement with the community/#sqlfamily as a huge reward. If you are reading this and are not an active part of the community yet, please consider joining in. It helps to be among supportive, uplifting people who care for you and have goals similar to yours.

To conclude, I really liked these lines from Shane O Neill’s post: ‘I believe the best gift is the gifts that you can give back. So, here’s hoping that the next year sees us all help more than hinder, learn more than laze, and teach more than troll.’

Click through for posts from the 18 respondents.

Slow Merge Replication Initialization

Gonzalo Bissio covers one reason why initializing merge replication might be slow:

For one week (yes, one week) they had been waiting for the replication re-initialization to finish; they had introduced some changes to the system and needed to reinitialize the replication. The rate of records moving between the publisher and the subscribers was far too low (20 records per second). Since this application is used globally, they write records in all of the regions (and merge replication then manages them).

Read on to see what Gonzalo’s company had going on and how they fixed it.

YARN Container Sizing

Dmitry Tolpeko explains why large YARN containers aren’t always a great idea:

First, I noticed that the job used only 100 containers, i.e., just one container per cluster node. This was very suspicious, as Hive uses the Apache Tez execution engine, which can run only one task concurrently per container.

Looking at the Hive script I found:

set hive.tez.container.size = 10240; -- 10 GB

Looks like someone had a memory problem with this query before and wanted to solve it once and for all!

Read on to see why this was not a great idea.
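
The arithmetic is the giveaway: Tez runs one task per container, so concurrency per node is roughly the node’s YARN memory divided by the container size, and a 10 GB container on this cluster left room for just one task per node. A hedged sketch of a more modest setting (the numbers are illustrative, not from Dmitry’s post):

-- Size the container for what a task actually needs, not for the
-- worst memory incident anyone remembers.
set hive.tez.container.size = 2048;    -- 2 GB per Tez task
-- Keep the task JVM heap below the container size (~80% is conventional).
set hive.tez.java.opts = -Xmx1638m;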

Geospatial Data Processing with Databricks

Nima Razavi and Michael Johns walk us through examples of processing geospatial data with Databricks:

Earlier, we loaded our base data into a DataFrame. Now we need to turn the latitude/longitude attributes into point geometries. To accomplish this, we will use UDFs to perform operations on DataFrames in a distributed fashion. Please refer to the provided notebooks at the end of the blog for details on adding these frameworks to a cluster and the initialization calls to register UDFs and UDTs. For starters, we have added GeoMesa to our cluster, a framework especially adept at handling vector data. For ingestion, we are mainly leveraging its integration of JTS with Spark SQL, which allows us to easily convert to and use registered JTS geometry classes. We will be using the function st_makePoint which, given a latitude and a longitude, creates a Point geometry object. Since the function is a UDF, we can apply it to columns directly.

Looks like they have some pretty good functionality here, and they have shared the demos in notebook form.
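
For a flavor of the ingestion step, here is a hedged Spark SQL sketch; the table and column names are invented, and GeoMesa’s st_makePoint follows the x/y (longitude-first) convention, so check argument order against their notebooks:

-- Assumes the GeoMesa Spark SQL extensions are registered on the
-- cluster, which exposes st_makePoint to SQL queries.
SELECT trip_id,
       st_makePoint(pickup_longitude, pickup_latitude) AS pickup_geom
FROM   trips;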

Fun with NULL

Itzik Ben-Gan takes us through some of the complexities of NULL:

NULL handling is one of the trickier aspects of data modeling and data manipulation with SQL. Let’s start with the fact that an attempt to explain exactly what a NULL is is not trivial in and of itself. Even among people who do have a good grasp of relational theory and SQL, you will hear very strong opinions both in favor and against using NULLs in your database. Like them or not, as a database practitioner you often have to deal with them, and given that NULLs do add complexity to your SQL code writing, it’s a good idea to make it a priority to understand them well. This way you can avoid unnecessary bugs and pitfalls.

This article is the first in a series about NULL complexities. I start with coverage of what NULLs are and how they behave in comparisons. I then cover NULL treatment inconsistencies in different language elements. Finally, I cover missing standard features related to NULL handling in T-SQL and suggest alternatives that are available in T-SQL.

This is definitely worth the read.
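
As a quick taste of the comparison behavior Itzik covers: a comparison against NULL evaluates to UNKNOWN rather than TRUE or FALSE, and a WHERE clause keeps only rows that evaluate to TRUE.

-- Three-valued logic in action.
DECLARE @T TABLE (col1 INT NULL);
INSERT INTO @T (col1) VALUES (1), (NULL);

SELECT col1 FROM @T WHERE col1 = NULL;    -- no rows: NULL = NULL is UNKNOWN
SELECT col1 FROM @T WHERE col1 <> 1;      -- no rows: the NULL row yields UNKNOWN
SELECT col1 FROM @T WHERE col1 IS NULL;   -- one row: IS NULL is the right test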

Power BI and Azure Synapse Analytics

James Serra gives us some insight into the future of Power BI and how it relates to Azure Synapse Analytics today:

As an example of the speed of each layer, during an Ignite session (view here), a Power BI query was run against 26 billion rows, returning a sum of store sales by year. The same query was run three times, each using a different layer:

1. Using a DirectQuery against tables in SQL DW took 8 seconds.
2. Using a DirectQuery against a materialized view in SQL DW took 2.4 seconds. Note that you don’t have to specify that you are using a materialized view in the query, as the SQL DW optimizer will know whether it can use it.
3. Using an aggregation table imported into Power BI took 0 milliseconds.

Keep in mind this is all hidden from the user – they just create the report. If they run a query against a table that isn’t in memory in Power BI, it will do a DirectQuery against the data source, which could take a while. However, due to SQL DW result-set caching, repeat DirectQueries can be very fast (in the Ignite session, they demoed a DirectQuery that took 42 seconds the first time the query was run and just 154 milliseconds the second time, when it used result-set caching).

There’s some interesting information in here, especially around Power BI eventually taking over Azure Analysis Services’ space in the market.
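
On the SQL DW side, the features behind the second result and the result-set caching note are each a single statement. A sketch with invented object names (the distribution choice is illustrative):

-- A materialized view pre-aggregates sales by year; the optimizer can
-- rewrite a matching query to use it without the query mentioning it.
CREATE MATERIALIZED VIEW dbo.mvStoreSalesByYear
WITH (DISTRIBUTION = HASH(SalesYear))
AS
SELECT SalesYear, SUM(SalesAmount) AS TotalSales
FROM dbo.FactStoreSales
GROUP BY SalesYear;

-- Result-set caching is a per-database setting (run from master).
ALTER DATABASE MyDataWarehouse SET RESULT_SET_CACHING ON;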

Azure Data Factory Triggers

Cathrine Wilhelmsen continues a series on Azure Data Factory by looking at triggers:

One important thing to note is that all times are in UTC. And since UTC does not observe daylight saving time… Well, let’s just say that if you need to execute pipelines during the workday and you have business users waiting for data, you may want to plan some trigger maintenance on the days when you fall back or spring forward. I know. Ugh 🙂 I’m hoping for better timezone support in the future 🤞🏻

Schedule triggers and pipelines have a many-to-many relationship. That means that one schedule trigger can execute many pipelines, and one pipeline can be executed by many schedule triggers.

Time-based triggers aren’t the only options, however—Cathrine also looks at the other three possibilities.

Posting SQL Query Results to Teams with PowerShell

Michael Bourgon shows how we can post SQL Server result sets to Microsoft Teams channels:

So…. you want to post to a Teams channel automagically.  Should be simple, and it is!  Alas, it means you have to ignore most of the documentation.  Let’s do this!

1) Here’s how to hook it up with your channel.  Note that when I created a brand new “Team”, it took about 5-10 minutes before I was able to add the webhook connector – prior to that, I got a “channel does not exist or has been deleted” error.

Michael takes us through it step by step and also includes things you should avoid, including misleading documentation.
