Press "Enter" to skip to content

Month: March 2018

Dropping Columns With Logstash

Mike Hillwig shows how to ignore columns with Logstash:

Like I said earlier, we have some data that I know I’ll never use. This is flight performance data. The dataset contains diversion information. If a flight gets diverted more than once, it’s tracked here. I don’t care about that, so I’m dropping the diversion information for the second through fifth diversions. I’m also dropping some information about the airports that I believe I won’t need. This is the tricky part. Somewhere down the road, I’m going to need to enhance this data by converting all of the times to UTC.

Mike’s slowly building up to a complete, working example and it’s interesting to watch the progress along the way.

Comments closed

Columnstore And Merge Replication

Niko Neugebauer tests whether merge replicated tables can use columnstore indexes:

Adding this table to the publication will end up with the following, self-explaining error message, being very clear that the Clustered Columnstore Indexes are not supported for the Merge Replication[.]

There is no surprise here, as the same Clustered Columnstore Indexes are not supported for the Transactional Replication, but I feel that a great opportunity is lost and the Replication technology are being quite ignored by the emerged technologies, such as In-Memory & Columnstore, where the scenarios of replicating the Data Warehousing data is something that a lot of people can find very useful.

I wish it would be otherwise, and this would allow to bring more customers to use Columnstore Indexes.

Clustered columnstore indexes aren’t possible, but read on to learn whether non-clustered columnstore indexes are supported.

Comments closed

Trace Flag 834 And Columnstore Tables

Joe Obbish shows how trace flag 834 can solve a bottleneck when inserting into tables with clustered columnstore indexes:

In my experience, when we get into a situation with high memory waits caused by too much concurrent CCI activity all queries on the server that use a memory grant can be affected. For example, I’ve seen sp_whoisactive run for longer than 90 seconds.

It needs to be stated that not all CCIs will suffer from this scalability problem. I was able to achieve good scalability with some artificial tables, but all of the real target tables that I tested have excessive memory waits at high concurrency. Perhaps tables which require more CPU to compress naturally spread out their memory requests and the underlying OS is better able to keep up.

Read the whole thing, and also check out Lonny Niederstadt’s comment as it adds pertinent information about TF834.

Comments closed

Avoding Direct View() Calls In R

John Mount notes that you should not assume that the View() function in R will work:

R tip: get out of the habit of calling View() directly.

View() only works correctly in interactive environments, not currently in RMarkdown contexts. It is better to call something else that safely dispatches to View(), or to something else depending if you are in an interactive or non-interactive session.

Click through for a script which is safe to run whether you’re in R Studio or using knitr to build a document.

Comments closed

Recommendations For Running Kafka On AWS

Prasad Alle has some recommendations if you decide to run Apache Kafka on AWS:

The network plays a very important role in a distributed system like Kafka. A fast and reliable network ensures that nodes can communicate with each other easily. The available network throughput controls the maximum amount of traffic that Kafka can handle. Network throughput, combined with disk storage, is often the governing factor for cluster sizing.

If you expect your cluster to receive high read/write traffic, select an instance type that offers 10-Gb/s performance.

In addition, choose an option that keeps interbroker network traffic on the private subnet, because this approach allows clients to connect to the brokers. Communication between brokers and clients uses the same network interface and port. For more details, see the documentation about IP addressing for EC2 instances.

If you are deploying in more than one AWS Region, you can connect the two VPCs in the two AWS Regions using cross-region VPC peering. However, be aware of the networking costs associated with cross-AZ deployments.

There’s some good advice here, as well as acknowledgement of various tradeoffs involved in architecting a solution.

Comments closed

Module Signing For Database Rights

Solomon Rutzky shows how to use module signing to grant granular permissions to users:

Scenario: We want to allow one or more Users and/or Database Roles to be able to truncate certain Tables, but not all Tables. We certainly do not want to allow anyone the ability to make structural changes to the Table.

Also, it is likely that, over time, at least one more Tables will be added that the User(s) and/or Role(s) should be able to truncate, and less likely, though not impossible, that one or more tables that they should be able to truncate now might be removed.

Truncation is a great example of the kind of right you’d want behind a signed stored procedure, as the level of right necessary to truncate a table is absurd:  practically full control of the table.  Module signing is something that I wish more DBAs knew and implemented.

Comments closed

Executing R Scripts In SSRS

Tomaz Kastrun shows how to include R scripts (and visuals) in SQL Server Reporting Services:

Using the privileges of R language to enrich your data, your statistical analysis or visualization is a simple task to get more out of your reports.

The best practice to embed R code into SSRS report is to create stored procedure and output the results to report. To demonstrate this, we will create two reports; one that will take two input parameters and the second one to demonstrate the usage of R visualization.

It’s nice to be able to use R to create nice visuals and then import them in your SSRS report, and Tomaz shows how.

Comments closed

Explaning RANKX In DAX

Philip Seamark explains how the RANKX function works:

One of the first traps to encounter when using this function is the function can be used in calculations for calculated columns as well as calculated measures.   The RANKX function will still do what it is asked.  The trick is how you use the function in each scenario – and more importantly, what filters are going to be implicitly applied to the data the RANKX function actually uses.

This is a helpful post for explaining the ranking function.

Comments closed

All Execution Plans Are Estimates

Grant Fritchey drops a bomb on us:

All these resources, yet, for any given query, all the plans will be identical (assuming no recompile at work). Why? Because they’re all the same plan. Each and every one of them is an estimated plan. Only an estimated plan. This is why the estimated costs stay the same between an estimated and actual plan, this despite any disparity between estimated and actual row counts.

I’ve blogged about this before, but it’s worth mentioning again. There are a only a few minor differences between an estimated plan and an actual plan. It’s all about the data set. What’s going on is that an actual plan can capture query metrics, which are then appended to the estimated plan. At no point is any different plan generated during this process. It’s just a plan, an estimated plan, or, it’s a plan plus query metrics.

Read the whole thing.

Comments closed

Getting A Random Row

Brent Ozar shares four methods for getting a random row from a table:

Method 1, Bad: ORDER BY NEWID()

Easy to write, but it performs like hot, hot garbage because it scans the entire clustered index, calculating NEWID() on every row:

That took 6 seconds on my machine, going parallel across multiple threads, using tens of seconds of CPU for all that computing and sorting. (And the Users table isn’t even 1GB.)

Click through for the other three methods.  The really tricky part is when you want to get a random sample from the table, as TABLESAMPLE is an awful choice for that.

Comments closed