Using Hive: Tiered Or Decoupled Storage?

Brandon Wilson and Gopal Vijayaraghavan compare a series of Hive queries against EC2 instances with persistent storage and S3:

There are advantages and disadvantages to each approach. The tiered approach has the most flexibility for an operator to tune the performance of the cluster while trading off size of the hot data zone for better performance or smaller resource footprint. The downside of this approach is that, having data on HDFS, resizing the cluster is a slow and tedious process due to HDFS needing to be rebalanced to achieve performance and fault-tolerance expectations. Thus this architecture is generally only used for statically sized clusters with steady, well-known workloads.

The decoupled architecture, on the other hand, enables maximum flexibility for cluster growth and reduction. For example, a cluster could run at 100 nodes during the day to support analytics and reporting and then shrink to 24 nodes overnight to support smaller ETL workloads. Historically, the disadvantage to decoupling is that cloud storage is not local and therefore could drastically affect the runtime of analytical workloads (hence the hybrid approach of tiered storage). However, the advent of LLAP in Hive 2.0 has limited this overhead, making the approach far more attractive. The dynamic cache within LLAP also means that we do not need to statically define what data is hot. It can be inferred at query time (i.e., as users access the data, that data will become hot). We will look closer at how LLAP closes the runtime gap in the next section.

Historically, the argument was that you should avoid S3 in part because it’s relatively flaky compared to disks (in terms of performance and in its eventual consistency model).  It looks like that’s no longer a pressing concern.

Filtering On Kafka Streams

Robin Moffatt has a new series showing how to use KSQL to process syslog data streaming through Kafka:

syslog is one of those ubiquitous standards on which much of modern computing runs. Built into operating systems such as Linux, it’s also commonplace in networking and IoT devices like IP cameras. It provides a way for streaming log messages, along with metadata such as the source host, severity of the message, and so on. Sometimes the target is simply a local logfile, but more often it’s a centralised syslog server which in turn may log or process the messages further.

As a high-performance, distributed streaming platform, Apache Kafka® is a great tool for centralised ingestion of syslog data. Since Apache Kafka also persists data and supports native stream processing we don’t need to land it elsewhere before we can utilise the data. You can stream syslog data into Kafka in a variety of ways, including through Kafka Connect for which there is a dedicated syslog plugin.

In this post, we’re going to see how KSQL can be used to process syslog messages as they arrive in realtime.

Check it out.

Configuring Azure SQL Analytics

Esat Erkec has a guide showing how to configure and use Azure SQL Analytics on Azure SQL Database:

The most important and challenging responsibility of a database administrator is monitoring performance metrics, because monitoring performance and troubleshooting performance issues are considered difficult. For this reason, we need diagnostic and monitoring tools to measure performance counters and metrics. For Azure SQL, there is a tool named SQL Analytics. With this tool, we can measure and monitor Azure SQL databases and elastic pools. At the same time, we can create alerts for notifications. SQL Analytics offers performance metrics in graphical form. In this article, we will learn how to enable Azure SQL Analytics.

This is a long and screenshot-filled post, which is helpful if you’re getting started.

PowerShell Splats And Quotes

Shane O’Neill shows when you need to use single quotes when splatting in PowerShell:

If you’ve checked the examples in that post – and I recommend that you do – then you’ll see that it takes the syntax of Parameter = 'Value'.

Notice the Parameter portion is not in quotes? It also works perfectly well if you have the Parameter name in quotes, e.g. 'Parameter' = 'Value' (double quotes work too).

Why would you use one instead of the other?

There is a special circle in the Inferno for people who put spaces in their parameter names.

Managing Central Management Server

Warren Estes explains how he keeps track of his servers using a Central Management Server:

We use a CMS server for each domain and I can’t imagine life without it. Kind of like when I discovered Amazon Prime, or bought my first memory foam mattress.

The real magic of a CMS comes from being able to push jobs, or evaluate policies, on any server (targets) you want.

You can also execute T-SQL against all, or a subset of, servers with either registered servers or CMS.

There are some caveats to look out for, like collation differences and version-specific DMVs, when running queries across instances. Also, security needs to be addressed. However, that is outside the scope of this post. You can find that information in the links in the first section.

I liked CMS when I had to deal with a dozen instances.  With hundreds of instances, I wouldn’t want to administer anything without one.

PowerShell: CSVs And NoTypeInformation

Dan Blank lays down some information about saving headerless CSV files in PowerShell:

Anyway, a question was posted recently using the #pshelp hashtag on Twitter.  How do you replace the header line of multiple csv files at once?

I saw this and maybe cockily thought to myself, “that’s got to be an easy 2 liner, bet I can boost my ego and quickly write out a solution”.

Spoilers:  it turned out to be slightly more complex.

Early Stats From SQLConstantCare Customers

Brent Ozar has some interesting and mostly disheartening statistics:

One of my favorite things about my job is being able to take the pulse of how real companies are managing their databases. I don’t wanna be locked in an ivory tower, preaching down to you, dear reader, about how you should be doing five million things a day – when in reality, you’re struggling to get an hour of real work done because you’ve got so many meetings.

But sometimes I wonder – am I out of touch? When I was a DBA, I remember struggling with backups and corruption checking – has that gotten easier? Have today’s DBAs started using more automation and tools to protect their estate? Is Transparent Data Encryption catching on? Did DBAs start using Extended Events for monitoring when I wasn’t looking?

And it’s important because I wanna build the right training material and scripts for our customers. I see a problem trending, I want to be able to give people the right information to fix the problem, fast.

Standard disclaimers about potential bias in samples apply, but it’s an interesting slice of the population.

Bucketing Tables By Size

Kevin Feasel

2018-04-06

T-SQL

Bill Fellows has an interesting approach to bucketing tables into groups of similar size:

You need to do something to all of the tables in SQL Server. That something can be anything: reindex/reorg, export the data, perform some other maintenance – it really doesn’t matter. What does matter is that you’d like to get it done sooner rather than later. If time is no consideration, then you’d likely just do one table at a time until you’ve done them all. Sometimes, a maximum degree of parallelization of one is less than ideal. You’re paying for more than one processor core, so you might as well use them. The devil in splitting a workload out can be ensuring the tasks are well balanced. When I’m staging data in SSIS, I often use a row count as an approximation for a time cost. It’s not perfect – a million-row table 430 columns wide might actually take longer than the 250-million-row key-value table.

Click through for the script.  For the R version, this Stack Overflow post shows how to do it with cumulative sums and the cut function.
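
To make the balancing idea concrete, here is a minimal Python sketch. It is not the linked T-SQL script and not the cumulative-sum-and-cut approach from the Stack Overflow post; it swaps in a simple greedy technique (largest table first, always into the currently lightest bucket). The function names, table names, and row counts are all hypothetical, and row count stands in for time cost as the excerpt describes.

```python
import heapq


def bucket_tables(row_counts, bucket_count):
    """Greedily assign tables to buckets so total row counts stay roughly even.

    row_counts: dict mapping a (hypothetical) table name to its row count.
    bucket_count: number of parallel workers / buckets to fill.
    Returns a list of buckets, each a list of table names.
    """
    if bucket_count < 1:
        raise ValueError("bucket_count must be at least 1")

    # Min-heap of (rows assigned so far, bucket index): the lightest bucket
    # is always on top.
    heap = [(0, i) for i in range(bucket_count)]
    heapq.heapify(heap)
    buckets = [[] for _ in range(bucket_count)]

    # Place the largest tables first; ties are broken by name so the result
    # is deterministic.
    ordered = sorted(row_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, rows in ordered:
        total, idx = heapq.heappop(heap)
        buckets[idx].append(name)
        heapq.heappush(heap, (total + rows, idx))
    return buckets


def bucket_totals(row_counts, buckets):
    """Return the summed row count of each bucket."""
    return [sum(row_counts[name] for name in bucket) for bucket in buckets]


if __name__ == "__main__":
    # Made-up row counts, purely for illustration.
    sample = {"dbo.Orders": 100, "dbo.Lines": 60, "dbo.Events": 50, "dbo.Lookup": 10}
    result = bucket_tables(sample, 2)
    print(result, bucket_totals(sample, result))
```

As the excerpt warns, row count is only a proxy: a very wide table can cost more than a much longer narrow one, so a better weight (for example, reserved pages) may be worth substituting if it is available.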
