Author: Kevin Feasel

Alerting On Azure Data Lake Store Data Usage

Published 2018-06-29 by Kevin Feasel

Jose Lara shows off an interesting feature in Azure Data Lake Store:

The massive scale and capabilities of Azure Data Lake Store are regularly used by companies for big data storage. As the number of files, file types, and folders grow, things get harder to manage and staying compliant becomes a greater challenge for companies. Regulations such as GDPR (General Data Protection Regulation) have heightened requirements for control and supervision of files that contain sensitive data.

In this blog post, I’ll show you how to set up alerts in your Azure Data Lake Store to make managing your data easier. We will create a log analytics query and an alert that monitors a specific path and file type and sends a notification whenever the path or file is created, accessed, modified, or deleted.

Auditing access has historically been tricky, so it’s nice that they were able to get that in.

Comments closed

Updating SQL Agent Job Owners With dbatools

Published 2018-06-29 by Kevin Feasel

Stuart Moore gives us two methods of updating SQL Agent job owners, one using T-SQL and the other with dbatools:

Now we all know that having SQL Server Agent jobs owned by ‘Real’ users isn’t a good idea. But I don’t keep that close an eye on some of our test instances, so wasn’t surprised when I spotted this showing up in the monitoring:

The job failed. Unable to determine if the owner (OldDeveloper) of job important_server_job has server access (reason: Could not obtain information about Windows NT group/user 'OldDeveloper', error code 0x534. [SQLSTATE 42000] (Error 15404)).

Wanting to fix this as quickly and simply as possible I just wanted to bulk move them to our job owning account (let’s use the imaginative name of ‘JobOwner’).

Click through for both scripts.

Comments closed

FCB_REPLICA_SYNC Spinlock Explanation

Published 2018-06-29 by Kevin Feasel

Paul Randal explains what the FCB_REPLICA_SYNC spinlock is and what it does:

In a nutshell, this spinlock is used to synchronize access to the list of pages that are present in a database snapshot, as follows:

If a page in a database with one or more database snapshots is being updated, check each snapshot’s list to see if the page is already in the snapshot. If yes, nothing to do. If no, copy the pre-change image of the page into the snapshot.

If a query is reading a page in the context of a database snapshot, check the list of pages to see whether to read from the snapshot or the source database.

This synchronization ensures that the correct copy of a page is read by a query using the snapshot, and that updated pages aren’t copied to the snapshot more than once.

The original question was because the person was seeing trillions of spins for the FCB_REPLICA_SYNC spinlock. That’s perfectly normal if there’s at least one database snapshot, a read workload on the snapshot, and a concurrent heavy update workload on the source database.

Great information. And a good reminder that if you are using database snapshots in SQL Server, you generally don’t want to have more than one on the same database.

Comments closed

When Rowstore Compression Beats Columnstore

Published 2018-06-29 by Kevin Feasel

Joe Obbish looks at scenarios where page-level compression on rowstore tables can beat columnstore compression in terms of resultant table size:

It’s certainly more difficult to come up with a demo that works without string columns, but consider how the page compression algorithm works. Data can be compressed on page basis, which includes both multiple rows and multiple columns. That means that page compression can achieve a higher compression ratio when a row has identical values in different columns. Columnstore is only able to compress on an individual column basis and you won’t directly see better compression with repeated values in different columns for a single row (as far as I know).

Interestingly, Joe also comes up with a scenario where row-level compression can beat columnstore even without string values. All this said, the normal case when dealing with non-string data is that columnstore tends to compress a lot better.

Comments closed

Performance Tuning Window Functions

Published 2018-06-29 by Kevin Feasel

Kathi Kellenberger gives us some hints on tuning queries using window functions:

The only way to overcome the performance impact of sorting is to create an index specifically for the OVER clause. In his book Microsoft SQL Server 2012 High-Performance T-SQL Using Window Functions, Itzik Ben-Gan recommends the POC index. POC stands for (P)ARTITION BY, (O)RDER BY, and (c)overing. He recommends adding any columns used for filtering before the PARTITION BY and ORDER BY columns in the key. Then add any additional columns needed to create a covering index as included columns. Just like anything else, you will need to test to see how such an index impacts your query and overall workload. Of course, you cannot add an index for every query that you write, but if the performance of a particular query that uses a window function is important, you can try out this advice.

There are some good insights here.

Comments closed

Enriching Syslog Data In A Kafka Pipeline

Published 2018-06-28 by Kevin Feasel

Robin Moffatt continues his syslog processing series with Kafka and KSQL:

In this article we’re going to conclude our fun with syslog data by looking at how we can enrich inbound streams of syslog data with reference information from elsewhere to produce a real-time enriched data stream. The syslog data in this example comes from various servers and network devices, and the additional information with which we’re going to enrich it is from MongoDB, which happens to be the datastore used by Ubiquiti network devices. With the enriched data we’re going to drive some real-time analytics through Elasticsearch and Kibana, as well as trigger push notifications based on activity of certain devices on the network.

I’ve enjoyed this series—it was a full, end-to-end look at a realistic business problem in Kafka Streams. If you want to get started with Kafka Streams, I’d be hard-pressed to find a better example.

Comments closed

Mapping In Scala When Dealing With Futures & Options

Published 2018-06-28 by Kevin Feasel

Shubham Verma shows us what happens when we use the map() and flatMap() functions in Scala on fields which are marked as Options or Futures:

now let’s move towards the interesting part flatMap(), what it is supposed to do in case of Option so the flatMap() gives us the liberty to return the type of value that we want to return after the transformation, unlike map() in wherein when a parameter has the value Some the value would be of type Some what so ever, but its not with the case with flatMap()
scala> option.flatMap(x => None)
res13: Option[Nothing] = None
scala>
scala> option.map(x => None)
res14: Option[None.type] = Some(None)
the code snippet above clearly shows it, so is that it ? No not yet lets look on to one more feature of flatMap() on Option[+A]that comes to be real handy when we need to extract value out of options, supposedly we have list of type List[Option[Int]] now I am only interested in values that have some value which seems to be an obvious usecase in most of the times, we can simply do it using a flatMap() operation

In short, it’s a little more complex, but you can still get useful information.

Comments closed

Taxi Cab Data On sqlite And Parquet

Published 2018-06-28 by Kevin Feasel

Mark Litwintschik loads the 1.1 billion rows of New York City taxi data into a SQLite database using data stored on Parquet-formatted files living on HDFS:

The dataset used in this benchmark has 1.1 billion records, 51 columns and is 500 GB in size when in uncompressed CSV format. Instructions on producing the dataset can be found in my Billion Taxi Rides in Redshift blog post. The CSV files were converted into Parquet format using Hive and Snappy compression on an AWS EMR cluster. The conversion resulted in 56 Parquet files which take up 105 GB of space.

Where decompression is I/O or network bound it makes sense to keep the compressed data as compact as possible. That being said, there are cases where decompression is compute bound and compression schemes like Snappy play a useful role in lowering the overhead.

I’ve downloaded the Parquet files to my local file system and imported them onto HDFS. Since this is all running on a single SSD drive I’ve set the HDFS replication factor to 1.

It’s not the fastest result I’ve seen from Mark’s work, but I was impressed that SQLite could take that abuse.

Comments closed

Studying Performance Problems With Web Applications

Published 2018-06-28 by Kevin Feasel

Adrian Colyer reviews a paper on fixing performance problems with ORMs:

This is a fascinating study of the problems people get into when using ORMs to handle persistence concerns in their web applications. The authors study real-world applications and distil a catalogue of common performance anti-patterns. There are a bunch of familiar things in the list, and a few that surprised me with the amount of difference they can make. By fixing many of the issues that they find, Yang et al., are able to quantify how many lines of code it takes to address the issue, and what performance improvement the fix delivers.

Much of this is straightforward, but worth the read.

Comments closed

Graphics In R

Published 2018-06-28 by Kevin Feasel

David Smith is following the kerfuffle that Edward Tufte unleashed on Twitter recently:

While graphics guru Edward Tufte recently claimed that “R coders and users just can’t do words on graphics and typography” and need additonal tools to make graphics that aren’t “clunky”, data journalists at major publications beg to differ. The BBC has been creating graphics “purely in R” for some time, with a typography style matching that of the BBC website. Senior BBC Data Journalist Christine Jeavans offers several examples, including this chart of life expectancy differences between men and women:

I think Tufte’s off base here.

Comments closed

M	T	W	T	F	S	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30