Kevin Feasel – Page 1050

Why Disabling the Clustered Index is a Bad Idea

Published 2019-12-17 by Kevin Feasel

Kenneth Fisher has an experiment in mind:

You are probably already aware that you can disable an index. This can be handy when you have a large load and the load + re-enabling the indexes (you have to completely rebuild them) is faster than leaving the indexes in place. I’ve had pretty limited occasions where this has helped but it can be a handy trick at times. That said, this is only true for non-clustered indexes. What happens when you disable the clustered index?

Nothing good, that’s what.

Comments closed

A Preview of R 4.0

Published 2019-12-16 by Kevin Feasel

David Smith takes a look at what’s coming in R 4.0:

Normalization of matrix and array types. Conceptually, a matrix is just a 2-dimensional array. But current versions of R handle matrix and 2-D array objects differently in some cases. In R 4.0.0, matrix objects will formally inherit from the array class, eliminating such inconsistencies.
A refreshed color palette for charts. The base graphics palette for current versions of R (shown as R3 below) features saturated colors that vary considerably in brightness (for example, yellow doesn’t display as prominently as red). In R 4.0.0, the palette R4 below will be used, with colors of consistent luminance that are easier to distinguish, especially for viewers with color deficiencies. Additional palettes will make it easy to make base graphics charts that match the color scheme of ggplot2 and other graphics systems.

Read on for more, as well as a link to the rest of the changes.

Comments closed

More with TOP and Blocking Operators

Published 2019-12-16 by Kevin Feasel

Jared Poche continues an investigation into the TOP operator:

I’ve explained what a blocking operator is and provided a few examples, but maybe this doesn’t seem important. It’s affecting the TOP operator, sure, but don’t people just use this to look at the TOP 1000 rows of their tables in SSMS?
The TOP operator is useful for many operations, especially in a large environment. Large operation can timeout or fail for a variety of reasons, consuming resources without providing the results you need. A small, batch-sized operation is more likely to succeed and tends to perform more consistently. Many maintenance operations make sense to run with a TOP operator, so we should make sure those operations aren’t stymied by blocking operators.

Read on for several examples.

Comments closed

More with Azure Data Factory

Published 2019-12-16 by Kevin Feasel

Cathrine Wilhelmsen continues a series on Azure Data Factory. Since we left off, Cathrine has three new posts. First, a look at how we monitor Azure Data Factory pipelines:

In the previous post, we looked at the three different trigger types, as well as how to trigger pipelines on-demand. In this post, we will look at what happens after that. How does monitoring work in Azure Data Factory?
Now, if we want to look at monitoring, we probably need something to monitor first. I mean, I could show you a blank dashboard, but I kind of already did that, and that wasn’t really interesting at all 🤔 So! In the previous post, I created a schedule trigger that runs hourly, added it to my orchestration pipeline, and published it.

Second, using annotations to customize views:

Annotations are additional, informative tags that you can add to specific factory resources: pipelines, datasets, linked services, and triggers. By adding annotations, you can easily filter and search for specific factory resources.
You need to figure out what kind of annotations make sense to you. Maybe you want to filter on the different logical steps of your solution, so you add the tags extract and transform? Perhaps ingest and prepare? Or maybe you want to tag resources with a business unit or a project name? It’s entirely up to you. All I recommend is that you’re consistent 🙂

That’s a problem for me—the only thing I’m consistent about is inconsistency. Third, Cathrine introduces the different runtimes available to us:

An integration runtime (IR) specifies the compute infrastructure an activity runs on or gets dispatched from. It has access to resources in either public networks, or in public and private networks.
Or, in Cathrine-speak, using less precise words: An integration runtime specifies what kind of hardware is used to execute activities, where this hardware is physically located, who owns and maintains the hardware, and which data stores and services the hardware can connect to.

There’s a lot of good material in each of these three posts.

Comments closed

Change Data Capture and Replication on Linux

Published 2019-12-16 by Kevin Feasel

Tejas Shah announces transactional replication and change data capture for SQL Server 2017 on Linux:

With SQL Server 2019, we introduced support for replication and CDC features for SQL Server on Linux by bringing in relevant components and subsystems within SQL Server core engine package. This support includes Snapshot replication, Transactional replication and CDC. Peer-to-peer transactional replication, merge replication and Oracle publishing are not supported.
Today we are glad to announce that we have brought the replication and CDC features to SQL Server 2017, starting with Cumulative update 18 (CU18).

Tejas has a few links, but the key is just to update your server (or spin up a new Docker container with the latest CU and swap external database files over to it).

Comments closed

Data Professional Salary Survey Now Open

Published 2019-12-16 by Kevin Feasel

Brent Ozar announces the 2020 edition of the Data Professional Salary Survey:

It’s time for our annual salary survey to find out what data professionals make. You fill out the data, we open source the whole thing, and you can analyze the data to spot trends and do a better job of negotiating your own salary:
Take the Data Professional Salary Survey now.
The anonymous survey closes Sunday, January 5, 2020. The results will be completely open source, and shared with the community for your analysis.

Please take a few minutes and fill this out before January 5th. I’d really love to see a lot of non-SQL Server professionals fill out the survey, as every year, I end up having to ignore database platform because it’s 95% SQL Server and 5% everything else.

Comments closed

T-SQL Tuesday Roundup

Published 2019-12-16 by Kevin Feasel

Mala Mahadevan rounds up T-SQL Tuesday #121:

7 people out of 18 people talk of their new dream jobs. This should give all of us hope that there are good gigs out there, if we are stuck in a bad place. No gig is perfect, of course, but it helps to have a dream and a direction to get to a better place. Almost everyone mentions their involvement with community/#sqlfamily as a huge reward. If you are reading this and not an active part of community yet – please consider doing so. It helps to be among supportive, uplifting people who care for you and have similar goals as you do.
To conclude – I really liked these lines from Shane O Neill’s post – ‘ I believe the best gift is the gifts that you can give back. So, here’s hoping that the next year sees us all help more than hinder, learn more than laze, and teach more than troll.’

Click through for posts from the 18 respondents.

Comments closed

Slow Merge Replication Initialization

Published 2019-12-16 by Kevin Feasel

Gonzalo Bissio covers one reason why initializing merge replication might be slow:

Since 1 week (Yes… 1 week) they were waiting for the replication re-initialization to finish since they introduced some changes on the system and they needed to reinitialize the replication again. The rate of the records between the publisher and the subscribers were too low (20 records per second). Since this application is used globally they write records on all of the regions (then the merge replication manage them).

Read on to see what Gonzalo’s company had going on and how they fixed it.

Comments closed

YARN Container Sizing

Published 2019-12-13 by Kevin Feasel

Dmitry Tolpeko explains why large YARN containers aren’t always a great idea:

First I noticed that the job used only 100 containers i.e. just one container per cluster node. This was very suspicious as Hive uses the Apache Tez execution engine that can run concurrently only one task in a container.
Looking at the Hive script I found:
set hive.tez.container.size = 10240; -- 10 GB
Looks like someone had a memory problem with this query before and wanted to solve it once and forever!

Read on to see why this was not a great idea.

Comments closed

Geospatial Data Processing with Databricks

Published 2019-12-13 by Kevin Feasel

Razavi and Michael Johns walk us through examples of processing geospatial data with Databricks:

Earlier, we loaded our base data into a DataFrame. Now we need to turn the latitude/longitude attributes into point geometries. To accomplish this, we will use UDFs to perform operations on DataFrames in a distributed fashion. Please refer to the provided notebooks at the end of the blog for details on adding these frameworks to a cluster and the initialization calls to register UDFs and UDTs. For starters, we have added GeoMesa to our cluster, a framework especially adept at handling vector data. For ingestion, we are mainly leveraging its integration of JTS with Spark SQL which allows us to easily convert to and use registered JTS geometry classes. We will be using the function st_makePoint that given a latitude and longitude create a Point geometry object. Since the function is a UDF, we can apply it to columns directly.

Looks like they have some pretty good functionality here, and they have shared the demos in notebook form.

Comments closed

M	T	W	T	F	S	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30

Author: Kevin Feasel