
Author: Kevin Feasel

Sqoop Scheduling and Security

Jon Moirsi continues a series on Sqoop:

In previous articles, I’ve walked through using Sqoop to import data to HDFS. I’ve also detailed how to perform full and incremental imports to Hive external and Hive managed tables.

In this article I’m going to show you how to automate execution of Sqoop jobs via Cron.

However, before we get to scheduling we need to address security. In prior examples I’ve used -P to prompt the user for login credentials interactively. With a scheduled job, this isn’t going to work. Fortunately, Sqoop provides us with the “password-alias” argument, which allows us to pass in passwords stored in a protected keystore.

That particular keystore tie-in works quite smoothly in my experience.
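
To give a feel for how the pieces fit together, here is a rough sketch of the keystore-plus-alias pattern driven from a script you could schedule with cron. The keystore path, alias name, connection string, and table below are placeholders, not Jon's actual setup:

```python
# Sketch: store the password once in a protected keystore, then run Sqoop jobs
# that reference it by alias instead of prompting with -P.
# The keystore location, alias, connection string, and table are placeholders.
import subprocess

KEYSTORE = "jceks://hdfs/user/etl/sqoop_passwords.jceks"  # hypothetical keystore location
ALIAS = "sales_db.password"                               # hypothetical alias name

# One-time setup (run interactively): prompts for the password and stores it
# encrypted in the keystore.
subprocess.run(["hadoop", "credential", "create", ALIAS, "-provider", KEYSTORE], check=True)

# Scheduled job: no interactive prompt needed, so cron can run it unattended.
subprocess.run([
    "sqoop", "import",
    f"-Dhadoop.security.credential.provider.path={KEYSTORE}",
    "--connect", "jdbc:sqlserver://dbhost:1433;databaseName=Sales",
    "--username", "etl_user",
    "--password-alias", ALIAS,
    "--table", "dbo.Orders",
    "--target-dir", "/data/raw/orders",
], check=True)
```

A cron entry along the lines of `0 2 * * * /usr/bin/python3 /opt/etl/sqoop_orders.py` (again, a hypothetical path) would then run the import nightly without anyone typing a password.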


Two Query Store Stories

Mark Wilkinson gives us two separate takes on Query Store:

When the Query Data Store (QDS) feature was announced for SQL Server 2016, we were excited about the prospect of being able to have deep insight on any query running in our environment. I work for a company that deals heavily in the e-commerce space, and we have a large SQL Server footprint. Our environment is unique in that it is essentially a multi-tenant system, but all the tenants could have wildly different workloads. It’s really the kind of query execution scenario QDS was built for. We had the pleasure of working with the Microsoft SQLCAT team to get 2016 and QDS up and running in our production environment before it was GA.

In this post I’m going to share two stories about our QDS experience (from pre and post GA of the feature). One from the perspective of the Database Developer, and one from the Database Administrator. For the most part this is not a technical post full of queries and code samples. It’s just me talking about some things I have experienced using QDS in production.

CentralQDS, by the way, is really cool. Hopefully we’re able to show that to the world someday (and note how I say “we” even though I did absolutely nothing with it except for being in the same company as the people who developed it).


Azure Data Studio and Query Store

Rob Farley explains why Azure Data Studio can’t be a 100% thing right now:

But the big thing with this new laptop is that I’ve made a conscious decision about what to install on it. And particularly, what things to NOT install.

For example, I’ve only installed SQL inside docker, not directly on Windows. I’m running the Linux version just because I can.

But today I feel like I’ve compromised.

Today I’ve installed SSMS, instead of persisting with just Azure Data Studio. It only took a week for me to cave, and the reason is Query Store.

Read on to understand why. I ended up having to cave on Docker-only for SQL Server because of PolyBase components that I needed and which aren’t (yet?) supported in the Linux version.
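
If you do stick with Azure Data Studio, you can still get at Query Store data by querying its catalog views directly rather than relying on the SSMS reports. A minimal sketch, with placeholder connection details and a rough stand-in for the "Top Resource Consuming Queries" report:

```python
# Sketch: approximate the "Top Resource Consuming Queries" report with a plain
# query against the Query Store catalog views, runnable from any tool.
# Connection details are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=YourDatabase;Trusted_Connection=yes;"
)

sql = """
SELECT TOP (10)
       qt.query_sql_text,
       q.query_id,
       SUM(rs.count_executions)      AS executions,
       AVG(rs.avg_duration) / 1000.0 AS avg_duration_ms,
       AVG(rs.avg_cpu_time) / 1000.0 AS avg_cpu_ms
FROM sys.query_store_query_text AS qt
JOIN sys.query_store_query AS q ON q.query_text_id = qt.query_text_id
JOIN sys.query_store_plan AS p ON p.query_id = q.query_id
JOIN sys.query_store_runtime_stats AS rs ON rs.plan_id = p.plan_id
GROUP BY qt.query_sql_text, q.query_id
ORDER BY avg_cpu_ms DESC;
"""

for row in conn.cursor().execute(sql):
    print(f"{row.query_id}: {row.avg_cpu_ms:.1f} ms CPU avg over {row.executions} executions")
```

It works, but as Rob points out, it's a lot more friction than clicking the built-in reports.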


Loading Data from CSVs with Inconsistent Quoted Identifiers

Dave Mason has some fun with loading data from files:

BCP and OPENROWSET are long-lived SQL Server options for working with data in external files. I’ve blogged about OPENROWSET, including a recent article showing a way to deal with quoted data. One of the shortcomings I’ve never been able to overcome is an inconsistent data file with data fields in some rows enclosed in double quotes, but not all. I’ve never found a way around this limitation.

Let’s demonstrate with BCP. Below is a sample data file I’ll attempt to load into a SQL Server table. Note the data fields highlighted in yellow, which are enclosed in double quotes and contain the field terminator (comma) character. For reference, the file is also available on GitHub.

I get unduly frustrated with the implementations of various data loaders around SQL Server and how they handle quoted identifiers differently. And don’t get me started on PolyBase.
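
One workaround outside the SQL Server toolchain: Python's csv module is tolerant of files where only some fields are quoted, so you can normalize the file into consistent quoting before handing it to BCP or OPENROWSET. A minimal sketch with placeholder file names:

```python
# Sketch: normalize a file where only some fields are quoted. csv.reader accepts
# fields quoted only when they need to be; csv.writer with QUOTE_ALL re-emits a
# consistent file that any loader can treat uniformly. File names are placeholders.
import csv

with open("messy_input.csv", newline="", encoding="utf-8") as src, \
     open("clean_output.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.reader(src)                         # handles mixed quoted/unquoted fields
    writer = csv.writer(dst, quoting=csv.QUOTE_ALL)  # every field quoted on output
    for row in reader:
        writer.writerow(row)
```

With every field consistently quoted, a single format file or FIELDQUOTE setting covers the whole load.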


Improving vCenter Performance Metric Logging

David Klee has some recommendations on settings for vCenter performance metric collection:

The default performance metric statistics collection interval within vCenter is to start rolling up data in an aggregation method starting at just one hour. Much of the data necessary for troubleshooting performance challenges reported either the same day or on the previous day is lost from the vCenter data and forces the administrator to revert to cumbersome and/or time-consuming tooling, such as vRealize Operations Manager. DBAs might not have access to such tools. Hopefully by now they have read-only access to vCenter!

The vCenter performance statistics collection and rollup settings can be customized to provide a longer window of time for critical metrics to be available to the administrator for management.

Click through for some recommendations of aggregation intervals and collection durations to help with virtual machine troubleshooting.
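
For anyone who prefers to script the change rather than click through the vSphere Client, here is a rough pyVmomi sketch. The host, credentials, and the interval and level values are illustrative only; David's post has the actual recommended settings:

```python
# Sketch: inspect and adjust a vCenter historical statistics interval via pyVmomi.
# Host, credentials, and the level value chosen are illustrative placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect

ctx = ssl._create_unverified_context()  # lab use only; validate certificates in production
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="********", sslContext=ctx)
try:
    perf_mgr = si.content.perfManager

    # List the current rollup intervals (sampling period, retention length, stats level).
    for interval in perf_mgr.historicalInterval:
        print(interval.key, interval.name, interval.samplingPeriod,
              interval.length, interval.level)

    # Example tweak: raise the statistics level of the 5-minute ("Past day") interval.
    past_day = next(i for i in perf_mgr.historicalInterval if i.samplingPeriod == 300)
    past_day.level = 2                    # illustrative value, not David's recommendation
    perf_mgr.UpdatePerfInterval(past_day)
finally:
    Disconnect(si)
```

Note that changing these settings requires more than the read-only vCenter access mentioned above.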


Dynamic Binning with Power BI

Reza Rad has fun with dynamic binning in Power BI:

In the previous article/video, I explained how to create dynamic bins by choosing the count of bins in a slicer in the Power BI report. In this article, I’ll explain how you can do it the other way around: by selecting the size of the bin, you will have bins and buckets dynamically generated.

I like this for its ability to let you select the proper number and size of bins when Power BI is being particularly obstinate about something. In an ideal world, I don’t like this so much as a user-facing feature because we as designers should know the proper number and size of bins.
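
The underlying idea is easy to see outside of Power BI as well. Here is a small pandas sketch contrasting bin-by-count with bin-by-size; the column name and sample data are made up:

```python
# Sketch of the two approaches: bin by a chosen *count* of bins versus a chosen
# bin *size*. The column name and sample values are made up.
import pandas as pd

sales = pd.DataFrame({"amount": [12, 47, 55, 61, 89, 120, 133, 178, 199, 240]})

# Approach 1: the user picks the number of bins (the previous article's approach).
bin_count = 5
sales["bin_by_count"] = pd.cut(sales["amount"], bins=bin_count)

# Approach 2: the user picks the bin size; the bin edges are derived from it.
bin_size = 50
edges = list(range(0, int(sales["amount"].max()) + bin_size, bin_size))
sales["bin_by_size"] = pd.cut(sales["amount"], bins=edges, right=False)

print(sales)
```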


Finding Long-Running Queries with system_health

Grant Fritchey shows us where we can find long-running queries easily:

Wouldn’t it be great to just quickly and easily take a look at your system to see if you had any queries that ran for a long time, but, without actually doing any work to capture query metrics?

Oh, yeah, I can do that right now, and so can you.

All we need is something that is built into every single server you currently have under management (sorry, not Azure SQL Database) that is SQL Server 2008 or better: system_health

Grant then ties this into general benefits of Extended Events and shows how you can query and view the results.
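
If you want to poke at the raw events yourself, the system_health session writes to .xel files you can read with sys.fn_xe_file_target_read_file. A minimal sketch with placeholder connection details; Grant's post covers which events actually surface the long-running queries:

```python
# Sketch: read raw system_health events and show any captured statement text.
# Connection details are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=master;Trusted_Connection=yes;"
)

# If the files aren't found, pass the full path to the instance's LOG directory instead.
sql = """
WITH events AS (
    SELECT CAST(event_data AS XML) AS x
    FROM sys.fn_xe_file_target_read_file('system_health*.xel', NULL, NULL, NULL)
)
SELECT x.value('(event/@name)[1]', 'nvarchar(128)')                          AS event_name,
       x.value('(event/@timestamp)[1]', 'nvarchar(30)')                      AS event_time,
       x.value('(event/action[@name="sql_text"]/value)[1]', 'nvarchar(max)') AS sql_text
FROM events;
"""

for event_name, event_time, sql_text in conn.cursor().execute(sql):
    if sql_text:  # only show events that captured a statement
        print(event_time, event_name, sql_text[:120])
```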


Visualizing Rating Data

Stephanie Evergreen shows various ways to visualize ratings data:

Stacked Bars *seem* like a good idea – we show 100%, we can fit more questions and data into a similar amount of space – advantages, right? Except that stacked bars are difficult for people to read. How well can you compare the values of the orange segments? Not so much.

If you are going to use stacked bars, make some helpful formatting tweaks, like smarter color coding and an order from greatest to least.

A lot of this comes down to simplification and reduction of possibilities. Read the whole thing.
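
To make the formatting tweaks concrete, here is a small matplotlib sketch with made-up survey data: rows ordered from greatest to least agreement and a color scheme that emphasizes the "Agree" segment:

```python
# Sketch: stacked bars for rating data with the tweaks mentioned above -- items
# ordered by the "Agree" share and muted colors for the other segments.
# The survey items and percentages are made up.
import matplotlib.pyplot as plt

items    = ["Item A", "Item B", "Item C", "Item D"]
agree    = [72, 55, 81, 40]
neutral  = [18, 25, 10, 30]
disagree = [10, 20,  9, 30]

# Sort so the highest-agreement item ends up at the top of the chart.
order    = sorted(range(len(items)), key=lambda i: agree[i])
items    = [items[i] for i in order]
agree    = [agree[i] for i in order]
neutral  = [neutral[i] for i in order]
disagree = [disagree[i] for i in order]

fig, ax = plt.subplots()
ax.barh(items, agree, color="#2c7fb8", label="Agree")
ax.barh(items, neutral, left=agree, color="#cccccc", label="Neutral")
ax.barh(items, disagree, left=[a + n for a, n in zip(agree, neutral)],
        color="#e0e0e0", label="Disagree")
ax.legend(loc="lower right")
ax.set_xlabel("Percent of respondents")
plt.tight_layout()
plt.show()
```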


Connecting Kafka to Elasticsearch

Danny Kay and Liz Bennett build an example of writing Kafka topic data to Elasticsearch:

The Elasticsearch sink connector helps you integrate Apache Kafka® and Elasticsearch with minimum effort. You can take data you’ve stored in Kafka and stream it into Elasticsearch to then be used for log analysis or full-text search. Alternatively, you can perform real-time analytics on this data or use it with other applications like Kibana.

For some background on what Elasticsearch is, you can read this blog post by Sarwar Bhuiyan. You can also learn more about Kafka Connect in this blog post by Tiffany Chang and in this presentation from Robin Moffatt.

This is a demo-heavy walkthrough, so check it out.
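
As a taste of what the setup involves, here is a hedged sketch of registering the Elasticsearch sink connector through the Kafka Connect REST API. The URLs and topic name are placeholders; the post has a complete working configuration:

```python
# Sketch: register an Elasticsearch sink connector with a Kafka Connect worker
# via its REST API. The Connect/Elasticsearch URLs and topic name are placeholders.
import requests

connector = {
    "name": "orders-elasticsearch-sink",
    "config": {
        "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
        "topics": "orders",
        "connection.url": "http://elasticsearch:9200",
        "key.ignore": "true",     # let the connector derive document IDs
        "schema.ignore": "true",  # don't require a registered schema for mappings
        "tasks.max": "1",
    },
}

resp = requests.post("http://localhost:8083/connectors", json=connector, timeout=30)
resp.raise_for_status()
print(resp.json())
```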


Metadata-Driven ADF Pipelines

Paul Andrew wraps up a series on metadata-driven processing of Azure Data Factory pipelines. Part 3 covers the ADF wrapper necessary for our custom pipelines:

Firstly, to help guide this post, below is a mock-up of our Data Factory pipelines and activities to show the end goal. Hopefully this view informs how things are going to be connected using what I call a pipeline hierarchy system and how they will work in the overall framework. For our metadata processing framework we can make the following category distinctions about the activities represented:

Grandparent – This is the top level orchestration of our wider data platform solution. Here a scheduled trigger could be connected, or processing in our solution grouped into natural areas. Technically this level isn’t required for our processing framework, but I’ve included it as good practice.
Parent – Our parent pipeline’s primary purpose (try saying that fast 3 times 🙂 ) is to handle the stages of our processing framework. The stages will then be passed off sequentially to our child pipeline using another execute pipeline activity.
Child – At this level in the framework the child is hitting the Azure Function to call the lowest level executors, or the pipelines that we want to actually do the work in our data platform solution. In my previous post I added some example metadata to call pipelines named Stage X-X. These in turn relate to a set of empty placeholder pipelines that contained only Wait activities.

Part 4 puts it all together:

For the end to end run of the framework we have a few options to see progress once it’s been triggered. Before that, it’s worth pointing out that in the below I’ve used the sample metadata provided with the database scripts in GitHub. Then for each execution pipeline I’ve added a single Wait activity with a random time delay of a few seconds. The point here is to test the framework execution, not the pipelines being called. To further clarify, the Stage X-X pipeline names should be replaced with your actual pipeline names in your data platform solution.

Definitely worth the read.
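
The grandparent/parent/child pattern itself is easy to sketch outside of Data Factory. The following is a plain-Python illustration of the control flow only: stages run in order, the workers within a stage run together. The metadata and pipeline names are made up, and in the real framework an Azure Function call triggers each worker pipeline:

```python
# Sketch of the control flow only (not ADF itself): stages run sequentially,
# the worker pipelines inside a stage run together, mirroring the parent/child
# split described above. The metadata and pipeline names are made up.
from concurrent.futures import ThreadPoolExecutor
import time

metadata = {                      # hypothetical stand-in for the framework's metadata tables
    "Stage 1": ["Stage 1-1", "Stage 1-2"],
    "Stage 2": ["Stage 2-1"],
}

def execute_pipeline(name: str) -> None:
    # Placeholder for the real work (in the framework, an Azure Function
    # triggers the named Data Factory pipeline here).
    print(f"running {name}")
    time.sleep(1)                 # mimic the Wait-activity placeholder pipelines

for stage, pipelines in metadata.items():   # parent level: stages in order
    print(f"--- {stage} ---")
    with ThreadPoolExecutor() as pool:      # child level: workers within a stage
        list(pool.map(execute_pipeline, pipelines))
```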
