2019-05-21 – Curated SQL

Overriding Spark Dependencies

Published 2019-05-21 by Kevin Feasel

Landon Robinson shows how to override a Spark dependency located on the classpath:

This doesn’t draw the line exactly where the method changed from private to public, but generally speaking:
– gson-2.2.4.jar: the method is private, and therefore too old for use here
– gson-2.6.1: the method is public, and works fine.
– Somewhere between the two, the method’s status changed.
So, because I had some functionality that required the method be public and accessible, it was important I specify the right version in my dependency manager (SBT). “That’s easy,” I thought. “No problem.”

Spoilers: there was a problem.

Kafka and MirrorMaker

Published 2019-05-21 by Kevin Feasel

Renu Tewari describes what MirrorMaker does for Kafka today and what is coming with version 2:

Apache Kafka has become an essential component of enterprise data pipelines and is used for tracking clickstream event data, collecting logs, gathering metrics, and serving as the enterprise data bus in microservices-based architectures. Kafka is essentially a highly available and highly scalable distributed log of all the messages flowing in an enterprise data pipeline. Kafka supports internal replication to support data availability within a cluster. However, enterprises require that the data availability and durability guarantees span entire cluster and site failures.
The solution, thus far, in the Apache Kafka community was to use MirrorMaker, an external utility, that helped replicate the data between two Kafka clusters within or across data centers. MirrorMaker is essentially a Kafka high-level consumer and producer pair, efficiently moving data from the source cluster to the destination cluster and not offering much else. The initial use case that MirrorMaker was designed for was to move data from clusters to an aggregate cluster within a data center or to another data center to feed batch or streaming analytics pipelines. Enterprises have a much broader set of use cases and requirements on replication guarantees.

Read on for the list of benefits and upcoming features.

Collecting Hadoop Metrics from Multiple Clusters

Published 2019-05-21 by Kevin Feasel

Dmitry Tolpeko shows how you can collate Hadoop metrics from several Elastic MapReduce clusters:

The first step is to dynamically get the list of clusters and their IPs. Hadoop clusters are often reprovisioned, added and terminated, so you cannot rely on a static list of clusters and addresses. In the case of Amazon EMR, you can use the following Linux shell command to get the list of active clusters:
aws emr list-clusters --active
From its output you can get the cluster IDs and names. While a cluster's ID and IP can change over time, its name is usually permanent (like DEV or Adhoc-Analytics cluster), so it can be useful for various aggregation reports.
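As a rough illustration of pulling IDs and names out of that output, here is a minimal Python sketch. It assumes the command was run with JSON output and that the result has a top-level `Clusters` array whose entries carry `Id` and `Name` fields. The function name and the sample cluster IDs are made up for the example, and looking up IPs is not covered.

```python
import json


def active_clusters_by_name(list_clusters_json):
    """Map cluster name -> cluster ID from `aws emr list-clusters --active` JSON text."""
    payload = json.loads(list_clusters_json)
    # A missing "Clusters" key is treated as "no active clusters".
    return {cluster["Name"]: cluster["Id"] for cluster in payload.get("Clusters", [])}


# Hypothetical output, trimmed to the fields used here; the IDs are invented.
SAMPLE_OUTPUT = """
{
  "Clusters": [
    {"Id": "j-EXAMPLE00001", "Name": "DEV", "Status": {"State": "WAITING"}},
    {"Id": "j-EXAMPLE00002", "Name": "Adhoc-Analytics", "Status": {"State": "RUNNING"}}
  ]
}
"""

if __name__ == "__main__":
    for name, cluster_id in active_clusters_by_name(SAMPLE_OUTPUT).items():
        print(name, cluster_id)
```

Keying the result by name rather than ID follows the point above: the name tends to survive reprovisioning, so it is the more stable key for reports.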

Read on to see what you can do with this list of clusters.

Undercover Inspector 1.4

Published 2019-05-21 by Kevin Feasel

Adrian Buckman takes us through recent changes in Undercover Inspector:

#119 When the backups check module reports backup issues for a database, but the issue is with a FULL or DIFF and the LOG is ok, we now show just the primary server in the Preferred replicas column, because FULL and DIFF backups only apply to the Primary. This reduces the number of warnings raised within the report, as it will no longer report for all replica nodes if the AG backup preference is set to Prefer secondary or Secondary Only. See Git issue for more details.

Click through for the full change set.

Distributed Computing Fallacies

Published 2019-05-21 by Kevin Feasel

Samir Behara takes us through a few fallacies with distributed computing:

The network is reliable
Service calls made over the network might fail. There can be network congestion or a power failure impacting your systems. The request might reach the destination service, but the service might fail to send the response back to the primary service. The data might get corrupted or lost during transmission over the wire. While architecting distributed cloud applications, you should assume that these types of network failures will happen and design your applications for resiliency.
To handle this scenario, you should implement automatic retries in your code when such a network error occurs. For example, if one of your services is not able to establish a connection because of a network issue, you can implement retry logic to automatically re-establish the connection.
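One way to picture the retry logic described above is the following minimal Python sketch. The helper name `call_with_retries`, its parameters, and the choice of exponential backoff between attempts are assumptions made for illustration; the excerpt only asks for automatic retries and does not prescribe a strategy.

```python
import time


def call_with_retries(operation, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call operation(); on ConnectionError, wait and try again.

    Makes at most `attempts` calls in total. The wait doubles after each
    failure (base_delay, 2 * base_delay, ...). The last error is re-raised
    if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay * 2 ** (attempt - 1))


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_connect():
        # Hypothetical operation that fails twice before succeeding.
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionError("simulated network failure")
        return "connected"

    print(call_with_retries(flaky_connect, base_delay=0.01))
```

Only `ConnectionError` is retried here, so errors that a retry cannot fix are not masked. The `sleep` parameter is injectable mainly so the waits can be skipped in tests.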

There are some very good points in here.

Finding Three-Part and Four-Part Names

Published 2019-05-21 by Kevin Feasel

Pamela Mooney shows how you can find three-part or four-part naming on a SQL Server instance:

The script below searches the metadata for views, sprocs and functions for occurrences of three- and four-part names. Three-part names consist of databasename.schemaname.objectname, and four-part names consist of servername.databasename.schemaname.objectname. Because the code searches metadata, it isn’t always perfect. If your comments mention a servername followed by a period, for example, it will be caught. Nevertheless, it’s a great place to begin looking, and a real help in getting rid of problems before they really bite you.
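To make the idea concrete, here is a small Python sketch of the same kind of text search. It is not the linked T-SQL script: it assumes you already have a module's definition text in a string, and the pattern and the name `find_multipart_names` are illustrative. Like the metadata search, it matches text rather than parsing SQL, so dotted names inside comments or string literals are reported too.

```python
import re

# One name part: either [bracketed] or a bare run of word characters.
IDENT = r"(?:\[[^\]]+\]|\w+)"

# Three or four dot-separated parts, not embedded in a longer dotted name.
MULTIPART = re.compile(
    r"(?<![\w\].])" + IDENT + r"(?:\." + IDENT + r"){2,3}(?![\w.\[])"
)


def find_multipart_names(sql_text):
    """Return (name, part_count) for each three- or four-part name found."""
    results = []
    for match in MULTIPART.finditer(sql_text):
        name = match.group(0)
        # Count parts with the same identifier pattern used for matching.
        results.append((name, len(re.findall(IDENT, name))))
    return results


if __name__ == "__main__":
    # Hypothetical view definition used only for demonstration.
    definition = (
        "CREATE VIEW v AS SELECT o.Id FROM Sales.dbo.Orders AS o "
        "JOIN [Srv1].[Sales].dbo.Customers AS c ON o.Id = c.Id"
    )
    for name, parts in find_multipart_names(definition):
        print(parts, name)
```

Two-part references such as `o.Id` are skipped because the pattern requires at least three parts.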

Click through for the script.

Modifying XML in T-SQL

Published 2019-05-21 by Kevin Feasel

Max Vernon takes us through the .modify function:

Determining the property syntax when modifying XML values in SQL Server can be time consuming if you don’t work with XML regularly. SQL Server includes a very flexible XML subsystem, called XML_DML, or XML Data Manipulation Language. XML_DML can be used to easily and effectively update XML values in an xml-typed column or variable. This question on dba.stackexchange.com asked about using the .modify function to change the value of an element, which in turn prompted this post.

Read on for a number of examples.

Azure SQL Database Serverless

Published 2019-05-21 by Kevin Feasel

Arun Sirpal takes us through Azure SQL Database Serverless:

This is best used for those single databases that are ever changing with unpredictable patterns. Being billed per second (based on the vcores used) rather than per hour means that pricing can become more granular, especially now that auto-pause is possible. The auto-pause delay defines the period of time the database must be inactive before it is automatically paused (while paused, you are only charged for storage). You should only use this if you can afford some delay in compute warm-up after idle usage periods; otherwise it is best to stick with provisioned compute tiers (classic tiers).

I could see this being useful for dev or test databases, or maybe a personal site with heavy external caching.

dbatools 1.0 Forthcoming

Published 2019-05-21 by Kevin Feasel

Chrissy LeMaire announces that dbatools 1.0 will be out on June 19th, by my count:

We’ve got about 30 issues left to resolve which you can see and follow on our GitHub Projects page. If you’ve ever been interested in helping, now is the perfect time as we only have 30 more days left to reach our goal.
If you’re a current or past dbatools developer, we’d love any help we can get. Just hit up the GitHub Projects page to see what issues are left to resolve. If someone is already assigned, please reach out to them on Slack in the #dbatools-dev channel and see if they can use your help.

Read the whole thing and see if there’s anything you can do to help.

Day: May 21, 2019