
April 27, 2017

Securing Azure Storage

Christos Matskas has an article on securing Azure blobs and containers:

All communication with Azure Storage via connection strings and BLOB URLs enforces the use of HTTPS, which provides Encryption in Transit. You can enforce the use of “Always HTTPS” by setting the connection string like this: “DefaultEndpointsProtocol=https;AccountName=myblob1…” or in SAS signatures, as in the example below:

https://myblob1.blob.core.windows.net/?sv=2015-04-05&ss=bfqt&srt=sco&sp=rwdlacup&se=2016-09-22T02:21:41Z&st=2016-09-21T18:21:41Z&spr=https&sig=hxInpKBYAxvwdI9kbBglbrgcl1EJjHqDRTF2lVGeSUU%3D

To protect data at rest, the service provides an option to encrypt the data as it is stored in the account. There’s no additional cost associated with encrypting the data at rest, and it’s a good idea to switch it on as soon as the account is created. There is a one-click setting at the Storage Account level to enable it, and the encryption is applied to both new and existing storage accounts. The data is encrypted with the AES-256 cipher, and the feature is now generally available in all Azure regions and Azure clouds (public, government, etc.).

There’s some good information here, making it worth the read.
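
If you want to flip those switches from the command line rather than the portal, here is a minimal PowerShell sketch. It assumes the AzureRM module of that era; the account and resource group names are illustrative, and you should verify the -EnableEncryptionService parameter against your module version:

# Enforce HTTPS for all traffic (encryption in transit)
$connectionString = "DefaultEndpointsProtocol=https;AccountName=myblob1;AccountKey=<your-key>"

# Turn on Storage Service Encryption for blobs (encryption at rest)
Set-AzureRmStorageAccount -ResourceGroupName "MyResourceGroup" `
    -Name "myblob1" `
    -EnableEncryptionService Blob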


Survival Analysis

Joseph Rickert explains what survival analysis is and shows an example with R:

Looking at the Task View on a small screen is a bit like standing too close to a brick wall – left-right, up-down, bricks all around. It is a fantastic edifice that gives some idea of the significant contributions R developers have made both to the theory and practice of Survival Analysis. As well-organized as it is, however, I imagine that even survival analysis experts need some time to find their way around this task view. (I would be remiss not to mention that we all owe a great deal of gratitude to Arthur Allignol and Aurelien Latouche, the task view maintainers.) Newcomers, people either new to R or new to survival analysis or both, must find it overwhelming. So, it is with newcomers in mind that I offer the following slim trajectory through the task view that relies on just a few packages: survival, KMsurv, OIsurv and ranger.

The survival package, which began life as an S package in the late ’90s, is the cornerstone of the entire R Survival Analysis edifice. Not only is the package itself rich in features, but the object created by the Surv() function, which contains failure time and censoring information, is the basic survival analysis data structure in R.

Survival analysis is an interesting field of study.  In engineering fields, the most common use is calculating mean time to failure, but that’s certainly not the only place you’re liable to see it.


Outliers In Histograms

Edwin Thoen has an interesting solution to a classic problem with histograms:

Two strategies that make the above into something more interpretable are taking the logarithm of the variable or omitting the outliers. Neither shows the original distribution, however. Another way to go is to create one bin for all the outlier values. This way we would see the original distribution where the density is the highest, while at the same time getting a feel for the number of outliers. A quick and dirty implementation of this would be:

# needs dplyr and ggplot2; hist_data is a data frame with a numeric column x
library(dplyr)
library(ggplot2)
hist_data %>%
  mutate(x_new = ifelse(x > 10, 10, x)) %>%  # lump all outliers into the bin at 10
  ggplot(aes(x_new)) +
  geom_histogram(binwidth = .1, col = "black", fill = "cornflowerblue")

Edwin then shows a nicer solution, so read the whole thing.


Handling Rogue Queries In Spark

Alicja Luszczak, et al., introduce the Query Watchdog:

The previous query would cause problems on many different systems, regardless of whether you’re using Databricks or another data warehousing tool. Luckily, as a user of Databricks, this customer has a feature available called the Query Watchdog that can help solve this problem.

Note: Query Watchdog is available on clusters created with version 2.1-db3 and greater.

A Query Watchdog is a simple process that checks whether or not a given query is creating too many output rows for the number of input rows at a task level. We can set a property to control this and in this example we will use a ratio of 1000 (which is the default).

It looks like this is an all-or-nothing process, but it’s a very interesting start.


Deploying Reports With PowerShell

Jana Sattainathan has created a few PowerShell functions to automate SQL Server Reporting Services report deployment:

In this post, I want to publish a few functions that I created around SSRS. They are related to and depend on each other.

  • Get-SSRS – Given the SSRS URI, returns the WSDL endpoint

  • Get-SSRSReport – Returns one or more reports based on inputs

  • Get-SSRSSharedDataSource – Returns one or more shared data sources based on inputs

  • Get-SSRSReportDataSource – Returns the data source information on a report-by-report basis, based on inputs

  • Set-SSRSReportDataSource – Sets the data source of a report to the given data source

  • Install-SSRS – Deploys an SSRS report to a specific folder and also optionally sets the data source for the deployed report

Very useful.
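
To give a feel for how these fit together, here is a rough sketch of a deployment pipeline. The parameter names are hypothetical, so check Jana’s post for the actual signatures:

# Hypothetical usage; parameter names are illustrative, not Jana's actual signatures
$ssrs = Get-SSRS -Uri "http://myserver/ReportServer/ReportService2010.asmx?wsdl"
$reports = Get-SSRSReport -SSRS $ssrs -FolderPath "/Finance"
Install-SSRS -SSRS $ssrs -ReportFile "C:\Reports\Sales.rdl" -TargetFolder "/Finance"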


Cloudera Accessing Azure Data Lake Store

The Azure Data Lake team has announced that you can now access Azure Data Lake Store using a Cloudera cluster:

The Azure Data Lake (ADL) vision from the beginning has been to transform business data into intelligence by providing analytics on any data at cloud scale. ADL enterprise customers gain insights on their business data using a wide range of tools and platforms. Today’s release of Cloudera Enterprise 5.11 brings another very valuable and widely-used Hadoop computation platform to the set of platforms that can leverage ADLS. No matter what big data analytics platform you choose, Azure Data Lake Store provides a single high throughput enterprise-scale hierarchical file system data lake repository for big data.

Anyone with an Azure subscription can now deploy Cloudera clusters with ADLS. To get started, you can use the Cloudera Enterprise Data Hub template or the Cloudera Director template on Azure Marketplace to create a Cloudera cluster. Once the cluster is up, see here for more information on how to set up your Cloudera cluster with ADLS today!

That’s an interesting development.


Code Formatting

Bert Wagner has a few tips on code formatting to make it more readable:

The second example above consistently indents lines, adds new lines, and follows consistent coding patterns. This makes it easy to skim the code quickly.

Books have chapters, headings, and paragraphs defined by formatting that make it easy to find what is needed at a glance — formatting code makes it possible to find things easily too.

The examples Bert uses are all in C#, but the principles apply to most languages.  I think consistency is key, even more so than the specifics of your ideal format.  This reduces friction between developers, at least outside of the “what should our coding standards be?” meetings…
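
Since Bert’s examples are C#, here is the same contrast in a quick PowerShell sketch of my own (identical logic, only the formatting differs):

# Hard to skim: inconsistent spacing, everything jammed onto one line
if($rows.Count -gt 0){foreach($r in $rows){Write-Output $r.Name}}

# Easy to skim: consistent indentation, one statement per line
if ($rows.Count -gt 0) {
    foreach ($r in $rows) {
        Write-Output $r.Name
    }
}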


Removing Tempdb Files

Erin Stellato relates a painful experience of tempdb filling up a VM’s drive:

I made a mistake with a script today. I created three new tempdb files sized at 10GB each that filled up a hard drive.

Whoops.

Luckily it was in one of my own testing VMs, so it wasn’t awful. Fixing it, however, was a fun one.

NOTE: All work was done in a test environment. Proceed with caution if you’re running these commands in Production and make sure you understand the ramifications.

It’s a good opportunity to learn from Erin’s experience.
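
For flavor, the usual shape of the fix is to empty the extra file and then remove it. This is a hedged sketch rather than Erin’s exact script, the logical file name is illustrative, and tempdb can fight back on the empty step, so read her post for the wrinkles:

# Sketch only; tempdev2 is an illustrative logical file name
Invoke-Sqlcmd -ServerInstance "localhost" -Query @"
USE tempdb;
DBCC SHRINKFILE (tempdev2, EMPTYFILE);
ALTER DATABASE tempdb REMOVE FILE tempdev2;
"@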


Getting The SqlServer PowerShell Module

Drew Furgiuele shows how to install the SqlServer PowerShell module from the PowerShell Gallery:

That’s because, out of the box, Server 2012 R2 is running PowerShell 4.0. These Gallery cmdlets require PowerShell 5. To upgrade to PowerShell 5.0, you need to install Windows Management Framework 5.0. This is compatible with versions of Windows as far back as Windows 7, and Windows Server as far back as 2008 R2. Anything earlier, and you’re out of luck. This also requires the .NET Framework 4.5 (or above). That means system updates, which could (potentially) lead to system reboots. Plan (and for the love of God, test) accordingly!

There are a couple of other hitches as well. One, and this sort of goes without saying, you need internet access for this to work. If your machines are behind any kind of filtering or firewall restrictions that prevent them from talking out to the internet, you’ll need to either open them up or use the Save-Module feature to download the bits and install them yourself. Two, you need Administrator access for this to work. And three, if you do install them manually, you might have different versions installed for different users (or service accounts).

They’ve made it nice and easy, so read Drew’s post and give it a try.
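
Once the prerequisites are sorted, the happy path is a one-liner, and Save-Module covers the no-internet-access case Drew describes:

# Install straight from the PowerShell Gallery (requires PowerShell 5.0+)
Install-Module -Name SqlServer

# No internet on the target server? Save the module on a connected machine,
# then copy the folder over and install it manually.
Save-Module -Name SqlServer -Path "C:\Temp\Modules"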


Partitioned Columnstore Tables

Denny Cherry makes an important point about dealing with columnstore tables:

ColumnStore indexes are all the rage with data warehouses. They’re fast, they’re new(ish), and they solve all sorts of problems when dealing with massive amounts of data.  However, they can cause some issues as well if you aren’t very careful about how you set up your partitions on the ColumnStore index.  This is because you can’t split a ColumnStore partition once it contains data.

Now, if everything is going according to plan, you create your partitions well in advance and there are no issues.

However, if everything hasn’t gone according to plan and someone forgot to create the partitions, you end up with rows in the final partition and can’t create any more partitions, because you can’t split the one that has data.

Ideally, you get those ducks in a row first.  Keep reading for a repro script and a couple of potential workarounds.
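
The standard preventative is to split the empty rightmost partition before data ever lands in it. A minimal sketch, with hypothetical partition function and scheme names:

# Hypothetical names (psSales, pfSales); split while the rightmost partition is still empty
Invoke-Sqlcmd -ServerInstance "localhost" -Database "DW" -Query @"
ALTER PARTITION SCHEME psSales NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pfSales() SPLIT RANGE ('2017-05-01');
"@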
