Press "Enter" to skip to content

Author: Kevin Feasel

Running A Model On Separate Groups Of Data

Simon Jackson shows how to run the same model against separate groups of data in R:

Now that we can separate data for each group(s), we can fit a model to each tibble in data using map() from the purrr package (also tidyverse). We’re going to add the results to our existing tibble using mutate() from the dplyr package (again, tidyverse). Here’s a generic version of our pipe with adjustable parts in caps:

Read the whole thing.  Hat tip, R-Bloggers.

Comments closed

Uncontrolled Environments

Ed Elliott discusses database deployments in uncontrolled environments:

There have been a few discussions on stack overflow recently about how to manage deployments in uncontrolled environments, specifically data migrations. The questions were from an SSDT perspective, I don’t think that SSDT is a great choice for these uncontrolled environments and there are some additional requirements for these uncontrolled environments that need some additional thought and care when creating release scripts (whether manually or using a tool).

Ed has some interesting thoughts here, and I agree with the idea that SQL Server Data Tools deployment scripts are not the best choice when you have people changing schema all around you in unexpected ways.

Comments closed

Why Force Query Store Plans

Grant Fritchey explains the wherefore behind query store plan forcing:

But, what else does Force Plan do for you? What if you never experience bad parameter sniffing (you do, but I’m not going to argue the point)? Is there something else that Force Plan can do for you? Heck yes! The whole point of creating the Query Store was in order to address Plan Regression. What the heck is plan regression? When Microsoft makes any change to the Query Optimizer, and those changes come all the time, it’s possible that you might see a change in your execution plans. Most of the time, it’s going to be a positive change. That’s why they’re changing the Optimizer after all, to improve it. However, sometimes, you’re benefiting from the old behavior of the Optimizer and that new plan doesn’t work as well as the old plan. This is plan regression. When Microsoft changed the Cardinality Estimation engine in SQL Server 2014, more than a few people experienced the new estimator giving row estimates that resulted in a different execution plan that didn’t perform as well as the old plan. This is plan regression. What to do?

This is a good read.

Comments closed

Error Handling With Extended Events, Part 2

Dave Mason continues his discussion of using Extended Events to handle errors:

In the last post, we explored a couple of examples of using Extended Events to enhance T-SQL error handling. There was some potential there. But a hard-coded SPID was necessary: we couldn’t use the code examples for anything automated. It was cumbersome, too. Let’s change that, shall we?

To make the code easier to work with, I moved most of it into three stored procs: one each to create an XEvent session, get the XEvent session data, and drop the XEvent session. There’s also a table type. This will negate the need to declare a temp table over and over. The four objects can be created in any database you choose. I opted to create them in [tempdb]. The code for each is below in the four tabs.

This is a very interesting solution.

Comments closed

Kinesis Analytics

Ryan Nienhuis shows how to implement Amazon Kinesis Analytics:

As I covered in the first post, streaming data is continuously generated; therefore, you need to specify bounds when processing data to make your result set deterministic. Some SQL statements operate on individual rows and have natural bounds, such as a continuous filter that evaluates each row based upon a defined SQL WHERE clause. However, SQL statements that process data across rows need to have set bounds, such as calculating the average of particular column. The mechanism that provides these bounds is a window.

Windows are important because they define the bounds for which you want your query to operate. The starting bound is usually the current row that Amazon Kinesis Analytics is processing, and the window defines the ending bound.

Windows are required with any query that works across rows, because the in-application stream is unbounded and windows provide a mechanism to bind the result set and make the query deterministic. Analytics supports three types of windows: tumbling, sliding, and custom.

The concepts here are very similar to Azure’s Stream Analytics.

Comments closed

Improving HBase Cluster Restart Time

Nitin Verma explains how to re-create an HBase cluster a bit faster:

When flush ‘table’ operation is triggered, all the regions belonging to that table will flush independently. Once the HFile corresponding to a region is flushed, it records the max sequence id in metadata and notifies the WAL corresponding to the regionserver. WAL maintains a mapping table for regions and their corresponding flushed sequence id’s. When the HBase cluster restarts, the hMaster will distribute flushed sequence id’s per region to the recovery threads splitting the WAL, so that they can skip the edits which have already been persisted in HFiles.

This is particularly important for clusters which frequently spin up and down, a feature of Platform-as-a-Service solutions like HDInsight.

Comments closed

S3 Or EBS?

Devadutta Ghat, et al, compare Amazon S3 versus Elastic Block Storage (EBS) on the basis of cost and Apache Impala performance:

EBS is attached to the AWS compute node as a fully-functional filesystem (similar to an attached SSD on an on-premise node), and Impala makes use of several filesystem features to deliver higher throughput and lower latency. These features include:

  • HDFS short-circuit reads to bypass HDFS and read files directly from the filesystem
  • OS buffer cache to read frequently accessed files directly from the cache instead of fetching it again
  • Fixed-cost file renames through metadata operations

In contrast, S3 is an object store that is accessed over the network. However, with S3, throughput is better than simple network-attached storage because of its dedicated, high-performance networks. In Cloudera’s internal benchmark testing (detailed below), on an r3.2xlarge, we saw a consistent throughput of about 100MB/s. Furthermore, in S3, there is currently no equivalent to HDFS short-circuit reads. Move/rename operations for data stored in S3 is a copy followed by a delete, while a file move on HDFS is a metadata operation—which is usually problematic for ETL workloads, as they create large number of small files that are typically moved.

It looks like EBS is a solid choice for many workloads.

Comments closed

Azure Automation

Melissa Coates explains Azure Automation:

Azure Automation is a cloud service in Microsoft Azure which let you schedule execution of PowerShell cmdlets and PowerShell workflows. Azure Automation uses the concept of runbooks to execute a set of repeatable, repetitive tasks via PowerShell. Consistency in execution, reduction of errors, and of course saving time, are all key objectives – which makes DBAs and system admins happy, eh?

This is a higher-level discussion including some good tips on the product.

Comments closed