2018-10-19 – Curated SQL

Mounting HDFS As A Local Filesystem

Published 2018-10-19 by Kevin Feasel

Guy Shilo looks at two techniques for mounting HDFS as a local filesystem:

NFS Gateway is an HDFS component that enables the user to expose HDFS through an NFSv3 interface so that Linux machines can mount it and access it just like a local filesystem.

The manual installation is quite cumbersome and is covered here.

Cloudera Manager automates the process, so we will use it. If you do not already have NFS Gateway installed in your Cloudera cluster, go to HDFS -> Instances -> Add role instances and choose a host for NFS Gateway:

Guy also looks at FUSE and runs a quick test to see which is faster.

How Humio Uses Kafka

Published 2018-10-19 by Kevin Feasel

Kresten Krab describes ways that Humio uses Apache Kafka for their product:

Humio is a log analytics system built to run both on-prem and as a hosted offering. It is designed for “on-prem first” because, in many logging use cases, you need the privacy and security of managing your own logging solution, and because volume limitations can often be a problem in hosted scenarios.

From a software provider’s point of view, fixing issues in an on-prem solution is inherently problematic, and so we have strived to make the solution simple. To realize this goal, a Humio installation consists only of a single process per node running Humio itself, being dependent on Kafka running nearby (we recommend deploying one Humio node per physical CPU so a dual-socket machine typically runs two Humio nodes).

We use Kafka for two things: buffering ingest and as a sequencer of events among the nodes of a Humio cluster.

Read on for more details and a few tips on using Kafka to its fullest.
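To make the two roles concrete, here is a minimal sketch in plain Python. It does not use Kafka or any Humio code: `AppendOnlyLog` and `Node` are hypothetical names standing in for a single Kafka partition and a cluster node. The idea it illustrates is that an append-only log buffers writes until consumers catch up, and the offset assigned on append gives every node the same event order.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple


@dataclass
class AppendOnlyLog:
    """In-memory stand-in for one Kafka partition (illustrative only)."""
    _records: List[Any] = field(default_factory=list)

    def append(self, record: Any) -> int:
        """Store a record and return its offset, which fixes its place in the order."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset: int) -> List[Tuple[int, Any]]:
        """Return (offset, record) pairs starting at the given offset."""
        return list(enumerate(self._records[offset:], start=offset))


@dataclass
class Node:
    """Hypothetical cluster node that consumes the log at its own pace."""
    name: str
    next_offset: int = 0
    seen: List[Any] = field(default_factory=list)

    def poll(self, log: AppendOnlyLog) -> None:
        """Consume everything appended since this node's last poll."""
        for offset, record in log.read_from(self.next_offset):
            self.seen.append(record)
            self.next_offset = offset + 1


log = AppendOnlyLog()
for event in ["login", "query", "logout"]:
    log.append(event)      # producers write without waiting for consumers

fast, slow = Node("fast"), Node("slow")
fast.poll(log)             # reads everything buffered so far
log.append("login-2")
slow.poll(log)             # arrives late, still sees the same order
fast.poll(log)             # picks up only the new record
```

Both nodes end up with an identical sequence of events even though they polled at different times, which is the sequencing property the excerpt describes.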

Oddity With User Write Count In dm_db_index_usage_stats

Published 2018-10-19 by Kevin Feasel

Shaun J. Stuart looks at an oddity with the user_updates column on sys.dm_db_index_usage_stats:

This pulls both reads and writes from the sys.dm_db_index_usage_stats dynamic management view. A read is defined as either a seek, scan, or lookup and a write is defined as an update. All seemed good until I noticed something strange. One of the most written-to tables was, based on our naming convention, a lookup table. That seemed odd. A lookup table should have lots of reads, but only a few writes. The query above showed my lookup table had almost twice as many writes as reads!

I dug around a bit and found two stored procedures that referenced that particular table. I checked them out, but nothing seemed out of the ordinary to me, so I dug a little deeper and discovered something strange: the user_updates value of sys.dm_db_index_usage_stats can get incremented even when there is no actual update to the table!

Read on for the explanation.
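As a rough illustration of the read and write definitions quoted above, the following Python sketch totals them per table. The column names `user_seeks`, `user_scans`, `user_lookups`, and `user_updates` come from the DMV; the `table_name` key, the function names, and all sample values are assumptions made up for this example, not Shaun's query or data.

```python
from typing import Dict, List


def summarize_usage(rows: List[dict]) -> Dict[str, Dict[str, int]]:
    """Total reads and writes per table across all of its indexes.

    reads  = user_seeks + user_scans + user_lookups
    writes = user_updates
    """
    totals: Dict[str, Dict[str, int]] = {}
    for row in rows:
        table = totals.setdefault(row["table_name"], {"reads": 0, "writes": 0})
        table["reads"] += row["user_seeks"] + row["user_scans"] + row["user_lookups"]
        table["writes"] += row["user_updates"]
    return totals


def write_heavy_tables(totals: Dict[str, Dict[str, int]]) -> List[str]:
    """Names of tables whose write count exceeds their read count."""
    return sorted(name for name, t in totals.items() if t["writes"] > t["reads"])


# Hypothetical rows, shaped like one result row per index.
sample_rows = [
    {"table_name": "LookupStatus", "user_seeks": 100, "user_scans": 5,
     "user_lookups": 0, "user_updates": 200},
    {"table_name": "LookupStatus", "user_seeks": 10, "user_scans": 0,
     "user_lookups": 0, "user_updates": 20},
    {"table_name": "Orders", "user_seeks": 500, "user_scans": 20,
     "user_lookups": 30, "user_updates": 100},
]

totals = summarize_usage(sample_rows)
suspicious = write_heavy_tables(totals)
```

A table that shows up in `suspicious` is a candidate for investigation rather than proof of heavy writing, since the counter can be incremented without an actual update.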

Comparing TensorFlow Versus PyTorch

Published 2018-10-19 by Kevin Feasel

Anirudh Rao compares PyTorch to TensorFlow:

For small-scale server-side deployments both frameworks are easy to wrap in e.g. a Flask web server.

For mobile and embedded deployments, TensorFlow works really well. This is more than what can be said of most other deep learning frameworks including PyTorch.

Deploying to Android or iOS does require a non-trivial amount of work in TensorFlow.

You don’t have to rewrite the entire inference portion of your model in Java or C++.

Other than performance, one of the noticeable features of TensorFlow Serving is that models can be hot-swapped easily without bringing the service down.

Read on for the full comparison.

Master Data Services No Longer Uses Silverlight

Published 2018-10-19 by Kevin Feasel

Niko Neugebauer is happy about an update to Master Data Services in SQL Server 2019:

Before we continue, let me ask you one question: have you heard about Silverlight?
Or in other words, and with a kind of evil voice: “DID YOU EVER INSTALL SILVERLIGHT ON A PRODUCTION SERVER?”
If you have worked with MDS, oh yes, you did! At least in order to check that everything is configured and upgraded correctly and nothing is broken, I will take a wild guess and claim that you did! So did I … :s

Because in order to make things work in MDS correctly, one needs Silverlight, an old and long-deprecated framework that is supported only in a deprecated browser, Internet Explorer 11. If you dare to work with any SQL Server version before SQL Server 2019, the picture on the left will appear in front of you the moment you try to explore the master data in the MDS Explorer, insisting that installing a totally abandoned (and obviously unnecessary) product, one that represents another risk on your server, is a necessary thing. That alone is the reason some people use a development VM to work with MDS, but that is not a good excuse for including that product in SQL Server 2016 or SQL Server 2017.

The interface still has problems, as Niko points out, but hopefully this is the first step and not the last one.

Looking At Databricks Cluster Pricing

Published 2018-10-19 by Kevin Feasel

Tristan Robinson takes a look at Azure Databricks pricing:

The use of Databricks for data engineering or data analytics workloads is becoming more prevalent as the platform grows, and has made its way into most of our recent modern data architecture proposals, whether that be PaaS warehouses or data science platforms.

To run any type of workload on the platform, you will need to set up a cluster to do the processing for you. The Azure-based platform has made this relatively simple for development purposes: give it a name, select a runtime, select the type of VMs you want, and away you go. For production workloads, a bit more thought needs to go into the configuration and cost. In the following blog I’ll start by looking at the pricing in a bit more detail, which aims to add a cost element to the cluster configuration process.

There are a few complicating factors in figuring out cluster price but rest assured that it will be costly.

Sorting And Aggregating Extended Events Results

Published 2018-10-19 by Kevin Feasel

Matthew McGiffen shows off some of the things you can do easily with Extended Events Profiler:

When I’m using Profiler to analyse performance issues I often save the results to a table, or upload a trace file into a table, so that I can analyse the data. Often this involves aggregating the values for particular queries so that I can see the most resource hungry.

This is by no means a difficult process, but with Extended Events (XE) it’s arguably even easier.

Click through for a demonstration.
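For readers who want a feel for the aggregation step outside of either tool, here is a small Python sketch that groups captured events by query text and ranks them by total CPU. It is not Matthew's method: the function name, the dictionary keys, and the sample events are assumptions invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def aggregate_by_query(events: List[dict]) -> List[dict]:
    """Sum CPU time and logical reads per query text, highest total CPU first."""
    totals: Dict[str, Dict[str, int]] = defaultdict(
        lambda: {"executions": 0, "cpu_time": 0, "logical_reads": 0}
    )
    for event in events:
        entry = totals[event["query"]]
        entry["executions"] += 1
        entry["cpu_time"] += event["cpu_time"]
        entry["logical_reads"] += event["logical_reads"]

    ranked = [{"query": query, **stats} for query, stats in totals.items()]
    ranked.sort(key=lambda row: row["cpu_time"], reverse=True)
    return ranked


# Hypothetical captured events; the values are made up.
sample_events = [
    {"query": "SELECT * FROM dbo.Orders", "cpu_time": 120, "logical_reads": 900},
    {"query": "SELECT * FROM dbo.Customers", "cpu_time": 40, "logical_reads": 100},
    {"query": "SELECT * FROM dbo.Orders", "cpu_time": 200, "logical_reads": 1100},
]

ranked = aggregate_by_query(sample_events)
for row in ranked:
    print(row["query"], row["executions"], row["cpu_time"], row["logical_reads"])
```

Sorting by a different key, such as `logical_reads`, is a one-line change, which mirrors the kind of re-sorting the post demonstrates interactively.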

Validating SSIS Packages Using T-SQL

Published 2018-10-19 by Kevin Feasel

Annie Xu shows us how to validate SSIS packages in the SSISDB catalog using T-SQL:

Recently, I needed to do a data warehouse migration for a client. Since there might be some differences between the Dev environment source databases and the Prod environment source databases, the migrated SSIS packages for building the data warehouse might have some failures because of the changes. So the challenge is: how can I validate all my DW packages (100+) at once?

Click through for the script.

Day: October 19, 2018