Press "Enter" to skip to content

Month: January 2018

Storing Credentials For Containers

Andrew Pruski shows how to store a credential using Powershell and pass it into a Docker container:

I work with SQL Server in containers pretty much exclusively when testing code and one of my real bug bears is that SQL Server in containers does not support Windows authentication (unless you’re using Windocks).

So when I’m working I find it quite annoying to have to specify a SA username & password when I want to connect.

OK, I can use Get-Credential, assign to a variable, and then reference that in a connection string but I want something a bit more permanent especially as I always use the same password for all my containers

Read on for Andrew’s method, and check out Rob Sewell’s method in the comments.

Comments closed

Using Powershell To Deploy Perfmon Collectors

Raul Gonzalez has a bonus post in his Perfmon data series:

As I said, when it’s time to deploy the solution explained in my previous posts to a number of servers it might get very tedious, specially if we have servers running multiple instances, since each have different counter names because the instance name is part of that name, and if we create one template, that won’t apply to all cases, so a lot of manual intervention.

So I decided to do what I like the most and got to write some queries that combined with some powershell will do the job for me.

Read on for the script and more information.

Comments closed

Setting Up SparklyR In Azure

David Smith shows how you can spin up a Spark cluster in Azure and install SparklyR on top of it:

The SparklyR package from RStudio provides a high-level interface to Spark from R. This means you can create R objects that point to data frames stored in the Spark cluster and apply some familiar R paradigms (like dplyr) to the data, all the while leveraging Spark’s distributed architecture without having to worry about memory limitations in R. You can also access the distributed machine-learning algorithms included in Spark directly from R functions.

If you don’t happen to have a cluster of Spark-enabled machines set up in a nearby well-ventilated closet, you can easily set one up in your favorite cloud service. For Azure, one option is to launch a Spark cluster in HDInsight, which also includes the extensions of Microsoft ML Server. While this service recently had a significant price reduction, it’s still more expensive than running a “vanilla” Spark-and-R cluster. If you’d like to take the vanilla route, a new guide details how to set up Spark cluster on Azure for use with SparklyR.

Read on for more details.

Comments closed

Apache NiFi 1.5 Updates

Tim Spann shows off some nice additions to Apache NiFi:

Another cool processor that I will talk about in greater detail in future articles is the much-requested Spark Processor. The ExecuteSparkInteractive processor with its Livy Controller Service gives you a much better alternative to my hacky REST integration to calling Apache Spark batch and machine learning jobs.

There are a number of enhancements, new processors, and upgrades I’m excited about, but the main reason I am writing today is because of a new feature that allows for having an Agile SDLC with Apache NiFi. This is now enabled by Apache NiFi Registry. It’s as simple as a quick git clone or download and then, you’ll use Apache Maven to install Apache NiFi Registry and start it. This process will become even easier with future Ambari integration for a CLI-free install.

To integrate the Registry with Apache NiFi, you need to add a Registry Client. It’s very simple to add the default local one — see below.

There are several new features in the latest release.

Comments closed

How Meltdown And Spectre Have Affected Spark Performance

Chris Stevens, et al, show how DAtabricks customers have fared in a post-Meltdown+Spectre world:

On AWS, we have observed a small performance degradation up to 5% since January 4th. On i3-series instance types, where we cache data on the local NVMe SSDs (Databricks Cache), we have observed a degradation up to 5%. On r3-series instance types, in which the benchmark jobs read data exclusively from remote storage (S3), we have observed a smaller increase of up to 3%. The greater percentage slowdown for the i3 instance type is explained by the larger number of syscalls performed when reading from the local SSD cache.

The chart below shows before and after January 3rd in AWS for a r3-series (memory optimized) and i3-series (storage optimized) based cluster.  Both tests fixed to the same runtime version and cluster size. The data represents the average of the full benchmark’s runtime per day, for a total of 7 days prior to January 3 (before is in blue) and 7 days after January 3 (after is in red). We exclude January 3rd to prevent partial results.  As mentioned, the i3-series has the Databricks Cache enabled on the local SSDs, resulting in roughly half of the total execution time (faster) compared to the r3-series results.

Overall, they’re seeing a degredation of 2-5%.  Click through for some more information on how they collected their metrics.

Comments closed

Comparing Data Lake Job Runs

Yanan Cai shows how to compare stats on different executions of a job:

Troubleshooting issues in recurring job is a time-consuming task. It starts with searching through the Job Browser to find instances of a recurring job and identifying both baseline and anomalous performance. This is followed by multi-way comparisons between job instances to figure out what has been changed in the query, data or environment. This is followed by analysis to discover which changes may have performance impact. While this is happening production workloads continue to under-perform or go down.

Azure Data Lake Tools for Visual Studio now makes it easy to spot anomalies and quickly trace the key characteristics across recurring job instances allowing for an efficient debugging experience. The Pipeline Browser automatically groups recurring jobs to simplify discovery of all runs. The Related Job View collects data about inputs, outputs and execution across multiple runs into a single visualization.

Read on for more.

Comments closed

Out Of Memory Conditions And In-Memory OLTP

Ned Otter explains how a memory-optimized database can run out of its most critical resource:

What can cause a memory-optimized database to run out of memory? It could be that resource consumption (memory) exceeded:

  • the relevant percentage of committed_target_kb from the sys.dm_os_sys_info DMV (explained in a moment)
  • MAX_MEMORY_PERCENT value of a Resource Pool that the database is bound to (if running Enterprise Edition and using Resource Governor)

or:

  • garbage collection is not operational (the purpose of GC is to reclaim memory consumed by stale row versions)
  • updates to memory-optimized table variables caused row versions to be created, and because GC does not operate on table variables, you ran out of memory (for table variables that have a very large amount of rows)

The only thing that can prevent GC from working is a long running transaction.

This is the final post of Ned’s resource issues miniseries and it’s been a good one.

Comments closed

Visual Principles And Dashboards

I continue my series on dashboard visualization by looking at pictures:

In a bit more detail, you can make a dashboard glanceable by following these guidelines:

  1. Ensure that there is clear purpose in your metric design and display.  In other words, think about which metrics you want to show, how you want to show them, and where you put metrics in relation to one another.

  2. Group metrics by function into sections.  Look at the dashboard above.  It has four clusters of metrics:  those around revenue, new customers, revenue per customer, and customer acquisition cost.  All of the revenue metrics are clustered in the top-left quadrant of the dashboard.  Furthermore, all revenue-related metrics (that is, revenue metrics and revenue per customer metrics) are on the left-hand side of the dashboard, so the CEO can focus on that half and learn about revenue and revenue per customer.  She doesn’t need to look in the top-left corner for one revenue measure and in the bottom right for another; she can focus down to a portion of the dashboard and get an answer.

  3. It should be easy to see and differentiate those clusters of metrics.  Our natural instinct might be to put borders around the clusters, but whitespace is your friend—remember, less is more.  If you add a bit more whitespace between clusters of measures, you’ll make it easy for people to see that there’s a difference without distracting them with unnecessary lines.

I cover the Rule of Thirds, Glanceability, and Color Vision Deficiency, three important considerations for a designer.

Comments closed

SQL Server Backups To Azure Blob Storage

Kevin Hill shows  how to configure SQL Server to back up a database to Azure blob storage:

Note the “no blobs found” in the container.  After a successful backup, you will see it here.

Click on ‘Container Properties’ to get the URL for this specific container…this will be used in Backup and Restore statements.  Click the button next to the URL to copy it.  For now just remember where this is or copy it to Notepad, Query window etc.  When we start to build our T-SQL statements, we will need both the Access key from earlier and the URL.

Kevin gives clear, step by step instructions on the process.

Comments closed