Press "Enter" to skip to content

Day: March 10, 2020

Sqoop Scheduling and Security

Jon Morisi continues a series on Sqoop:

In previous articles, I’ve walked through using Sqoop to import data to HDFS. I’ve also detailed how to perform full and incremental imports to Hive external and Hive managed tables.

In this article I’m going to show you how to automate execution of Sqoop jobs via Cron.

However, before we get to scheduling, we need to address security. In prior examples I’ve used -P to prompt the user for login credentials interactively. With a scheduled job, this isn’t going to work. Fortunately, Sqoop provides us with the “password-alias” argument, which allows us to pass in passwords stored in a protected keystore.

That particular keystore tie-in works quite smoothly in my experience.
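
If you want to see the shape of it before reading Jon’s post, here’s a minimal sketch (not his exact commands; the alias, keystore path, connection string, and schedule are all made up): store the password in a JCEKS keystore with the Hadoop credential command, point the Sqoop import at the alias, then let cron call a wrapper script.

# Store the database password in a protected JCEKS keystore (prompts for it).
hadoop credential create mydb.password.alias \
  -provider jceks://hdfs/user/sqoop/mydb.password.jceks

# Reference the alias from the import; the -D generic argument has to come
# before the tool-specific arguments.
sqoop import \
  -Dhadoop.security.credential.provider.path=jceks://hdfs/user/sqoop/mydb.password.jceks \
  --connect jdbc:mysql://dbserver/sales \
  --username sqoop_user \
  --password-alias mydb.password.alias \
  --table orders \
  --target-dir /data/orders

# Crontab entry: run the wrapper script nightly at 01:30 and keep a log.
30 1 * * * /usr/local/bin/sqoop_orders_import.sh >> /var/log/sqoop/orders_import.log 2>&1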

Monitoring Data Quality on Streaming Data

Abraham Pabbathi and Greg Wood want to check data quality on Spark Streaming data:

While the emergence of streaming in the mainstream is a net positive, there is some baggage that comes along with this architecture. In particular, there has historically been a tradeoff: high-quality data, or high-velocity data? In reality, this is not a valid question; quality must be coupled to velocity for all practical means — to achieve high velocity, we need high quality data. After all, low quality at high velocity will require reprocessing, often in batch; low velocity at high quality, on the other hand, fails to meet the needs of many modern problems. As more companies adopt streaming as a lynchpin for their processing architectures, both velocity and quality must improve.

In this blog post, we’ll dive into one data management architecture that can be used to combat corrupt or bad data in streams by proactively monitoring and analyzing data as it arrives without causing bottlenecks.

This was one of the sticking points of the lambda architecture: new data could still be incomplete and possibly wrong, but until it reached the batch layer, you wouldn’t know that.
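
To make the idea concrete, here is a minimal PySpark Structured Streaming sketch, not the authors’ pipeline; the schema, paths, and rules are placeholders. It tags each incoming record with simple quality checks and routes failures to a quarantine sink, so bad data gets surfaced without stopping the stream.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-quality-sketch").getOrCreate()

# Read the incoming stream with an explicit schema (placeholder source path).
raw = (spark.readStream
       .format("json")
       .schema("order_id STRING, amount DOUBLE, order_ts TIMESTAMP")
       .load("/landing/orders/"))

# Simple expectations: required key, non-negative amount, timestamp present.
checked = raw.withColumn(
    "is_valid",
    F.col("order_id").isNotNull()
    & (F.col("amount") >= 0)
    & F.col("order_ts").isNotNull()
)

good = checked.filter("is_valid").drop("is_valid")
bad = checked.filter("NOT is_valid").drop("is_valid")

# Good rows go to the curated area; bad rows go to quarantine for monitoring,
# so a handful of corrupt records never blocks the stream.
(good.writeStream.format("parquet")
     .option("checkpointLocation", "/chk/orders_good")
     .start("/curated/orders/"))

(bad.writeStream.format("parquet")
    .option("checkpointLocation", "/chk/orders_bad")
    .start("/quarantine/orders/"))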

Two Query Store Stories

Mark Wilkinson gives us two separate takes on Query Store:

When the Query Data Store (QDS) feature was announced for SQL Server 2016, we were excited about the prospect of being able to have deep insight on any query running in our environment. I work for a company that deals heavily in the e-commerce space, and we have a large SQL Server footprint. Our environment is unique in that it is essentially a multi-tenant system, but all the tenants could have wildly different workloads. It’s really the kind of query execution scenario QDS was built for. We had the pleasure of working with the Microsoft SQLCAT team to get 2016 and QDS up and running in our production environment before it was GA.

In this post I’m going to share two stories about our QDS experience (from pre and post GA of the feature). One from the perspective of the Database Developer, and one from the Database Administrator. For the most part this is not a technical post full of queries and code samples. It’s just me talking about some things I have experienced using QDS in production.

CentralQDS, by the way, is really cool. Hopefully we’re able to show that to the world someday (and note how I say “we” even though I did absolutely nothing with it except for being in the same company as the people who developed it).

Improving vCenter Performance Metric Logging

David Klee has some recommendations on settings for vCenter performance metric collection:

The default performance metric statistics collection interval within vCenter starts rolling up data into aggregates after just one hour. Much of the data necessary for troubleshooting performance challenges reported either the same day or on the previous day is lost from the vCenter data, forcing the administrator to revert to cumbersome and/or time-consuming tooling, such as vRealize Operations Manager. DBAs might not have access to such tools. Hopefully by now they have read-only access to vCenter!

The vCenter performance statistics collection and rollup settings can be customized to provide a longer window of time for critical metrics to be available to the administrator for management.

Click through for some recommendations of aggregation intervals and collection durations to help with virtual machine troubleshooting.
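
If you’d like to check where your environment currently sits before changing anything, a read-only PowerCLI peek looks something like this (the vCenter name is a placeholder, and you’ll need the VMware PowerCLI module installed):

# Read-only look at the current statistics intervals and their retention.
Connect-VIServer -Server vcenter.mydomain.local
Get-StatInterval | Select-Object Name, SamplingPeriodSecs, StorageTimeSecs
# Set-StatInterval can then apply whatever intervals and retention you settle on.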

Loading Data from CSVs with Inconsistent Quoted Identifiers

Dave Mason has some fun with loading data from files:

BCP and OPENROWSET are long-lived SQL Server options for working with data in external files. I’ve blogged about OPENROWSET, including a recent article showing a way to deal with quoted data. One of the shortcomings I’ve never been able to overcome is an inconsistent data file with data fields in some rows enclosed in double quotes, but not all. I’ve never found a way around this limitation.

Let’s demonstrate with BCP. Below is a sample data file I’ll attempt to load into a SQL Server table. Note the data fields highlighted in yellow, which are enclosed in double quotes and contain the field terminator , (comma) character. For reference, the file is also available on Github.

I get unduly frustrated with the implementations of various data loaders around SQL Server and how they handle quoted identifiers differently. And don’t get me started on PolyBase.
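
One route worth testing against files like Dave’s, if you’re on SQL Server 2017 or later, is the engine’s built-in RFC 4180 CSV parser, which tolerates fields that are quoted in some rows and unquoted in others. This is a sketch rather than Dave’s example, it uses BULK INSERT instead of bcp, and the table and file names are made up:

-- Destination table for the sample file (hypothetical structure).
CREATE TABLE dbo.ImportTest
(
    Id       INT,
    FullName VARCHAR(100),
    Notes    VARCHAR(500)
);

-- FORMAT = 'CSV' (SQL Server 2017+) parses per RFC 4180, so optionally
-- quoted fields and embedded commas inside quotes are handled.
BULK INSERT dbo.ImportTest
FROM 'C:\Data\sample.csv'
WITH (
    FORMAT = 'CSV',
    FIELDQUOTE = '"',
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '0x0a',
    FIRSTROW = 2   -- skip the header row
);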

Azure Data Studio and Query Store

Rob Farley explains why Azure Data Studio can’t be a 100% thing right now:

But the big thing with this new laptop is that I’ve made a conscious decision about what to install on it. And particularly, what things to NOT install.

For example, I’ve only installed SQL inside docker, not directly on Windows. I’m running the Linux version just because I can.

But today I feel like I’ve compromised.

Today I’ve installed SSMS, instead of persisting with just Azure Data Studio. It only took a week for me to cave, and the reason is Query Store.

Read on to understand why. I ended up having to cave on Docker-only for SQL Server because of PolyBase components that I needed and which aren’t (yet?) supported in the Linux version.

Finding Long-Running Queries with system_health

Grant Fritchey shows us where we can find long-running queries easily:

Wouldn’t it be great to just quickly and easily take a look at your system to see if you had any queries that ran for a long time, but, without actually doing any work to capture query metrics?

Oh, yeah, I can do that right now, and so can you.

All we need is something that is built into every single server you currently have under management (sorry, not Azure SQL Database) that is SQL Server 2008 or better: system_health

Grant then ties this into general benefits of Extended Events and shows how you can query and view the results.
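
If you want a generic starting point for poking at system_health yourself, the standard pattern for reading the session’s file target looks like this; it isn’t Grant’s query, just a way to see what has been captured before you filter on event names and durations:

-- Read the system_health file target and shred each event's XML.
-- The NULL metadata-file parameter is fine on SQL Server 2012+; 2008 wants the .xem path.
SELECT
    x.event_xml.value('(/event/@name)[1]', 'varchar(100)')      AS event_name,
    x.event_xml.value('(/event/@timestamp)[1]', 'datetime2(3)') AS event_time_utc,
    x.event_xml                                                  AS event_data
FROM sys.fn_xe_file_target_read_file('system_health*.xel', NULL, NULL, NULL) AS f
CROSS APPLY (SELECT CAST(f.event_data AS XML) AS event_xml) AS x
ORDER BY event_time_utc DESC;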

Dynamic Binning with Power BI

Reza Rad has fun with dynamic binning in Power BI:

In the previous article/video, I explained how to create dynamic bins by choosing the count of bins in a slicer in the Power BI report. In this article, I’ll explain how you can do it the other way around: by selecting the size of the bin, you will have bins and buckets dynamically generated.

I like this for its ability to let you select the proper number and size of bins when Power BI is being particularly obstinate about something. In an ideal world, I don’t like this so much as a user-facing feature because we as designers should know the proper number and size of bins.
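
If you haven’t seen the pattern, the core of it is small. Here’s a hedged DAX sketch rather than Reza’s exact measures, with invented table and column names: a disconnected bin-size table feeds a slicer, and a measure floors each value to the start of its bin.

-- Disconnected table of candidate bin sizes, shown in a slicer.
Bin Size = GENERATESERIES ( 10, 100, 10 )

-- The size the user picked, with a default when nothing is selected.
Selected Bin Size = SELECTEDVALUE ( 'Bin Size'[Value], 10 )

-- Floor the amount (evaluated per customer, product, etc. in the visual) to its bin.
Sales Bin =
VAR SalesAmt = SUM ( Sales[Amount] )
RETURN
    INT ( DIVIDE ( SalesAmt, [Selected Bin Size] ) ) * [Selected Bin Size]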
