Press "Enter" to skip to content

Month: September 2018

Switching To Managed Disks In Azure

Chris Seferlis walks us through an easy method to convert unmanaged disks to managed disks in Azure:

First off, why would you want a managed disk over an unmanaged one?

  • Greater scalability due to much higher IOPS and storage limits. There’s no longer a need to add storage accounts as you add disk space, which had been a challenge for users running large virtual machines that required large amounts of storage.

  • Better availability and reliability, as disks are isolated from one another in different storage scale units.

  • Managed disks offer over 99.99% uptime and always store three replicas of the data.

  • More granular access control through role-based access control (RBAC), giving you the ability to assign access to specific people in your organization.

Keep reading to learn how to switch.

Flint: Time Series With Spark

Li Jin and Kevin Rasmussen cover the concepts of Flint, a time-series library built on Apache Spark:

Time series analysis has two components: time series manipulation and time series modeling.

Time series manipulation is the process of transforming data into features for training a model; it is used for tasks like data cleaning and feature engineering. Typical functions in time series manipulation include:

  • Joining: joining two time-series datasets, usually by the time
  • Windowing: feature transformation based on a time window
  • Resampling: changing the frequency of the data
  • Filling in missing values or removing NA rows

Time series modeling is the process of identifying patterns in time-series data and training models for prediction. It is a complex topic; it includes specific techniques such as ARIMA and autocorrelation, as well as all manner of general machine learning techniques (e.g., linear regression) applied to time series data.

Flint focuses on time series manipulation. In this blog post, we demonstrate Flint’s time series manipulation features and how it works with other libraries, e.g., Spark ML, for a simple time series modeling task.

Basho went all-in on a time-series product for Riak and it did not work out well for them.  I’ll be curious to see if Flint has more staying power.
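
For readers who think in T-SQL terms, the resampling idea above maps loosely onto grouping by a truncated timestamp. This is not Flint (Flint runs on Spark), just a rough relational analogue, with dbo.Ticks and its columns as hypothetical names:

    -- Rough analogue of time-series resampling: collapse tick-level rows into
    -- one-minute bars. dbo.Ticks, TradeTime, and Price are hypothetical names.
    SELECT DATEADD(MINUTE, DATEDIFF(MINUTE, 0, t.TradeTime), 0) AS TradeMinute,
           AVG(t.Price) AS AvgPrice,
           COUNT(*) AS TickCount
    FROM dbo.Ticks AS t
    GROUP BY DATEADD(MINUTE, DATEDIFF(MINUTE, 0, t.TradeTime), 0);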

Against Multi-Cloud Models

Tyler Treat argues against companies looking at multi-cloud models:

A multi-cloud strategy looks great on paper, but it creates unneeded constraints and results in a wild-goose chase. For most, it ends up being a distraction, creating more problems than it solves and costing more money than it’s worth. I’m going to caveat that claim in just a bit because it’s a bold blanket statement, but bear with me. For now, just know that when I say “multi-cloud,” I’m referring to the idea of running the same services across vendors or designing applications in a way that allows them to move between providers effortlessly. I’m not speaking to the notion of leveraging the best parts of each cloud provider or using higher-level, value-added services across vendors.

Multi-cloud rears its head for a number of reasons, but they can largely be grouped into the following points: disaster recovery (DR), vendor lock-in, and pricing. I’m going to speak to each of these and then discuss where multi-cloud actually does come into play.

It’s an interesting article.  I think that Tyler is right, but that companies should be capable of switching between cloud providers or even creating hybrid approaches should the need arise.

An Update To ssisUnit

Bartosz Ratajczyk has added some functionality to ssisUnit:

Second – you can get and set the properties of the project and its elements, like overwriting project connection managers (I designed it with this particular need in mind). You can now set the connection string to a different server (or database): in the PropertyPath of the PropertyCommand, use \Project\ConnectionManagers, write the name of the connection manager with its extension, and use one of the Properties. You can do it during the test setup (or all-tests setup), but not during the test suite setup, as ssisUnit is not aware of the project until it loads it into memory.

Good on Bartosz for resurrecting a stable but moribund project and adding some enhancements.

Ad Hoc Functions In T-SQL

Riley Major shows a couple of techniques for approximating ad hoc functions in T-SQL, namely common table expressions and the APPLY operator:

It’s helpful to think of each APPLY as a pipe operation, taking the values from the previous derived table and passing them into the next to be manipulated. Programming T-SQL in this manner (loosely) approximates modern functional programming techniques.

It keeps each step of the logic smaller, so that it’s easier to understand. And you can expose the intermediary columns to help with debugging.

This is one of my favorite uses of the APPLY operator, as it lets you think through a problem step-by-step while still allowing the optimizer to create a set-based solution for you.
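
Here is a minimal sketch of that pipe-like pattern, with hypothetical table and column names (an illustration of the idea, not Riley’s code):

    -- Each CROSS APPLY acts like a pipe step: it computes a value from the
    -- columns before it and exposes that value to the next step.
    -- dbo.Orders and its columns are hypothetical.
    SELECT o.OrderID,
           calc.Subtotal,
           taxed.Total
    FROM dbo.Orders AS o
    CROSS APPLY (SELECT o.Quantity * o.UnitPrice AS Subtotal) AS calc
    CROSS APPLY (SELECT calc.Subtotal * 1.08 AS Total) AS taxed; -- assumed 8% tax rate

Because each derived table can reference the ones before it, debugging is as simple as selecting the intermediate columns.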

Limiting Azure Administrator Data Access

Melissa Coates gives us a look at one aspect of Azure security:

Recently a customer expressed concern that an owner of an Azure resource group automatically gains access to the data within the services contained in the resource group. In this case, the customer was specifically referring to data in Azure Data Lake Storage Gen 1 but this concept applies to Azure Storage and other data-oriented services in Azure as well. The customer’s comment prompted me to look into available alternatives. This is by no means a detailed security post…rather, I’m trying to share a few nuggets of what I learned.

Worth the read.  Much of the latest round of regulatory push seems aimed at keeping high-access insiders (like DBAs) away from sensitive information, and this post aligns with that.

Forcing MAXDOP In Azure SQL DB

Arun Sirpal shows us that you can change MAXDOP in Azure SQL Database:

In this quick post I will show you my parallel plan and how I use MAXDOP = 1 to suppress parallel plan generation so the operation will be executed serially. (Disclaimer – I am not saying this is the right thing to do, merely using it as an example of tweaking this setting; to be honest, in 10 years I have changed MAXDOP = 1 twice.) I executed a query in Azure. You can see the classic operators such as gather streams and repartition streams.

This change will affect all queries hitting that database, so it’s a coarser tool than changing cost threshold for parallelism (not allowed) or setting MAXDOP per-query (allowed).
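
For reference, the database-wide knob is exposed as a database scoped configuration; a minimal sketch of both options (the table name below is hypothetical):

    -- Database-wide: suppress parallel plans for every query against this database.
    ALTER DATABASE SCOPED CONFIGURATION SET MAXDOP = 1;

    -- Per-query: leave the database setting alone and hint a single statement.
    SELECT COUNT(*)
    FROM dbo.SomeLargeTable -- hypothetical table
    OPTION (MAXDOP 1);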

Using Table-Valued Parameters With sp_executesql

Kenneth Fisher shows how to include table-valued parameters in a dynamic SQL query:

Recently I did a presentation on dynamic SQL. In the presentation I pointed out the similarity of using sp_executesql to creating a stored procedure to do the same task. After the session I was asked: If that’s the case, can I pass a TVP (table valued parameter) into sp_executesql?

Awesome question! Let’s give it a shot.

Read on to see how to do this.
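
As a quick sketch of the technique (the type and parameter names here are made up, not Kenneth’s): declare the TVP’s type in sp_executesql’s parameter definition string, mark it READONLY, and pass it like any other parameter.

    -- A user-defined table type must exist before you can declare a TVP.
    CREATE TYPE dbo.IntList AS TABLE (Value int NOT NULL);
    GO

    DECLARE @ids dbo.IntList;
    INSERT INTO @ids (Value) VALUES (1), (2), (3);

    -- TVPs passed into dynamic SQL must be declared READONLY.
    EXEC sys.sp_executesql
        N'SELECT i.Value FROM @IdList AS i;',
        N'@IdList dbo.IntList READONLY',
        @IdList = @ids;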

Comparing Instance Configurations With sys.configurations

Jana Sattainathan has a script to compare two SQL Server instances’ configuration settings:

Steps:

1. Create a Linked Server to 2nd instance on 1st Instance
2. Create an empty table to hold the comparison data
3. Insert the 1st instance sys.configurations data into the table
4. Insert the 2nd instance sys.configurations data into the table using Linked Server
5. Compare to get the differences
6. Cleanup – Drop Linked Server and the temporary comparison table

Click through for the script.
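
Jana’s script stages both instances’ rows in a comparison table per the steps above; a condensed sketch of the same idea joins the catalog views directly (the linked server name [Instance2] is hypothetical):

    -- Compare value_in_use across two instances via a linked server.
    SELECT c1.name,
           c1.value_in_use AS instance1_value,
           c2.value_in_use AS instance2_value
    FROM sys.configurations AS c1
    INNER JOIN [Instance2].master.sys.configurations AS c2
        ON c1.configuration_id = c2.configuration_id
    WHERE CONVERT(nvarchar(128), c1.value_in_use) <> CONVERT(nvarchar(128), c2.value_in_use);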

ElasticMapReduce And RStudio

Tanzir Musabbir demonstrates how to set up Amazon ElasticMapReduce to include an RStudio edge node:

RStudio Server provides a browser-based interface for R and is a popular tool among data scientists. Data scientists use Apache Spark clusters running on Amazon EMR to perform distributed training. In a previous blog post, the author showed how you can install RStudio Server on an Amazon EMR cluster. However, in certain scenarios you might want to install it on a standalone Amazon EC2 instance and connect to a remote Amazon EMR cluster. Benefits of running RStudio on EC2 include the following:

  • By running RStudio Server on an EC2 instance, you can keep your scientific models and model artifacts on the instance. You might have to relaunch your EMR cluster to meet your application requirements. By running RStudio Server separately, you have more flexibility and don’t have to depend entirely on an Amazon EMR cluster.
  • Installing RStudio on the master node of Amazon EMR requires sharing of resources with the applications running on the same node. By running RStudio on a standalone Amazon EC2 instance, you can use resources as you need without having to share the resources with other applications.
  • You might have multiple Amazon EMR clusters in your environment. With RStudio on an edge node, you have the flexibility to connect to any EMR cluster in your environment.

There is one major difference between running RStudio Server on an Amazon EMR cluster versus running it on a standalone Amazon EC2 instance. In the latter case, the instance needs to be configured as an Amazon EMR client (or edge node). By doing so, you can submit Apache Spark jobs and other Hadoop-based jobs from an instance other than the EMR master node.

Click through for detailed, step-by-step instructions on how to do this.
