Press "Enter" to skip to content

Month: October 2016

Azure Data Lake Analytics Units

Yan Li explains the Azure Data Lake Analytics Unit:

An Azure Data Lake Analytics Unit, or AU, is a unit of computation resources made available to your U-SQL job. Each AU  gives your job access to a set of underlying resources like CPU and memory. Currently, an AU is the equivalent of 2 CPU cores and 6 GB of RAM. As we see how people want to use the service, we may change the definition of an AU or more options for controlling CPU and memory usage.

How AUs are used during U-SQL Query Execution

When you submit a U-SQL script for execution, the U-SQL compiler parallelizes the U-SQL script into hundreds or even thousands of tasks called vertices. Each vertex is allocated to one AU. The AU is dynamically allocated to the task and released once that particular task is completed.

I appreciate the ADL team’s transparency in how they define a unit.  It’s much nicer to be able to tell someone that an AU is 2 CPU cores + 6 GB of RAM, rather than saying it’s some fuzzy measure of CPU + memory + I/O which has no direct bearing on your operations.

Comments closed

SSISDB Maintenance

Jesse Seymour shows how to trim the SSIS catalog size:

The options we are interested in are OPERATION_CLEANUP_ENABLED and RETENTION_WINDOW.  By default, RETENTION_WINDOW is 365. and OPERATION_CLEANUP_ENABLED is TRUE.

Since we want to set our retention window to 10 days, we need to update RETENTION_WINDOW to 10.  We could do this with a simple update statement, but Microsoft provides us with a stored procedure that will do that for us.  The benefit of the stored procedure over the UPDATE statement is that a vendor-provided stored procedure will typically encapsulate any additional steps required.

I do not at all like the idea of running SHRINKDATABASE and definitely wouldn’t have that plus a backup in the deletion loop, but if you get caught in a nasty situation with SSISDB, this can serve as the starting point for digging yourself out.

Comments closed

Linked Server To Access

Jana Sattainathan walks through issues with setting up a linked server connection to an Access database:

Normally, it is easy enough to setup a Linked Server on SQL Server to other data sources. Problems are usually caused by one of the usual culprits that have to be addressed

  • SQL Logins simply do not work well when trying to do this type of setup

  • The Windows login has to have permissions to the file (on a drive or network share)

  • The appropriate drivers have to be setup (64 bit / 32 bit)

Read on for a few different errors and their solutions.

Comments closed

Database Throughput Units

Randolph West looks at the Azure Database Throughput Unit Calculator:

The DTU Calculator, a third-party service created by Justin Henriksen (a Microsoft employee), will calculate the DTU requirements for our on-premises database that we want to migrate to Azure, by firstly capturing a few performance monitor counters, and then performing a calculation on those results, to provide the recommended service tier for our database.

Justin provides a command-line application or PowerShell script to capture these performance counters:

  • Processor – % Processor Time

  • Logical Disk – Disk Reads/sec

  • Logical Disk – Disk Writes/sec

  • Database – Log Bytes Flushed/sec

For more details on DTUs, John Sterrett looks at the math.

Comments closed

Don’t Over-Index

Kenneth Fisher shows the downside cost of non-clustered indexes:

I want to make a couple of final points. I realize 99 indexes is a lot. It’s to emphasize the differences. However they were also fairly small indexes and this is a single table where a normal database might easily have hundreds. So take these results as an example. They aren’t going to match real life but will hopefully show you how all of this can play out.

Indexes are awesome but you want to be smart about adding them. My personal rule of thumb, with no scientific evidence behind it, is 5 indexes or less and I’m pretty easy. 5-10 indexes and you’ll have to convince me. I’m going to be reviewing the existing indexes and see what I can get rid of, or maybe I can combine something. Past 10 indexes and it had best be for a query that’s running a 100+ times a minute or something for the CEO.

Read on for demo code and specific results.

Comments closed

SQL Server R Services Memory Usage

Ginger Grant looks at how SQL Server R Services handles memory allocation:

While R is an open source language, there are a number of different versions of R and each handles memory a little differently. Knowing which version is being used is important, especially when the code is going to be migrated to a server. As part of a SQL Server implementation, there are three different versions of R which come into play. The first is standard open source R, commonly known as CRAN R. This is the standard open source version of R which runs code in memory and is single threaded. The next version which will be installed as part of a SQL Server Installation is Microsoft R Open. This version of R was written to take advantage of the Intel Math Kernel Libraries [MLK]. Using the libraries speeds up many statistical calculations which use matrix operations. It also adds multi-threading capability to R as the rewrite provides the ability to use all available cores and processors and process in parallel. More information on how it works and how much faster Microsoft R Open is compared to standard R is available here. To use Microsoft R Open, once it is installed, in Rstudio should automatically start using it. To check out what version of R that is in use, within R Studio, go to Tools->Global Options and look at the R version.

If you’re concerned about R Services taking up too much server memory, you should look at Resource Governor.

Comments closed

Securing Solr Collections

Jan Kunigk and Paul Wilkinson show how to secure Solr collections:

The policy shown below establishes four Sentry roles based on the admin, operators, users, and techusersgroups.

  • Administrators are entitled to all actions.

  • Operators are granted update and query privileges.

  • Users are granted query privileges.

  • Tech users are granted update privileges.

These are pretty straightforward role-based access controls.  The authors also look at accessing the data via Flume and a couple other technologies.

Comments closed

Data Curation

Christina Prevalsky makes the case for data curation:

The gaining popularity of self-service analytical tools such as Tableau increases the necessity of having curated data in your database. These tools aim to allow the end users to intuitively query data “at the speed of thought” from the data warehouse and visualize the results quickly. That type of capability allows users to go through several different iterations of the data to really explore the data and generate unique insights. These tools do not work well when the underlying database tables have not been curated properly.

This is a difficult and lengthy process, but it’s vital; data minus context is a lot less relevant than you’d hope.

Comments closed

Optimizing S3 For High Concurrency

Aaron Friedman looks at how to optimize highly-concurrent, distributed workloads writing data to S3 buckets:

S3 is a massively scalable key-based object store that is well-suited for storing and retrieving large datasets. Due to its underlying infrastructure, S3 is excellent for retrieving objects with known keys. S3 maintains an index of object keys in each region and partitions the index based on the key name. For best performance, keys that are often read together should not have sequential prefixes. Keys should be distributed across many partitions rather than on the same partition.

For large datasets like genomics, population-level analyses of these data can require many concurrent S3 reads by many Spark executors. To maximize performance of high-concurrency operations on S3, we need to introduce randomness into each of the Parquet object keys to increase the likelihood that the keys are distributed across many partitions.

Reading the title, I wanted it to be an article on knobs to turn in S3 to maximize read performance.  It’s still an article well worth reading, but focuses from the other side:  how to write to S3 without stepping on your own toes.

Comments closed

Working With Topics In Kafka

I show how to do the basics of creating, deleting, and pushing messages on topics in Apache Kafka:

There are three important things here:  first, our Zookeeper port is 2181.  Zookeeper is great for centralized configuration and coordination; if you want to learn more, check out this Sean Mackrory post.

The second bit of important information is how long our retention period is.  Right now, it’s set to 7 days, and that’s our default.  Remember that messages in a Kafka topic don’t go away simply because some consumer somewhere accessed them; they stay in the log until we say they can go.

Finally, we have a set of listeners.  For the sandbox, the only listener is on port 6667.  We connect to listeners from our outside applications, so knowing those addresses and ports is vital.

This is still quick-start level stuff, but I’m building up to custom development, honest!

Comments closed