Press "Enter" to skip to content

September 17, 2018

Flint: Time Series With Spark

Li Jin and Kevin Rasmussen cover the concepts of Flint, a time-series library built on Apache Spark:

Time series analysis has two components: time series manipulation and time series modeling.

Time series manipulation is the process of manipulating and transforming data into features for training a model. Time series manipulation is used for tasks like data cleaning and feature engineering. Typical functions in time series manipulation include:

  • Joining: joining two time-series datasets, usually by time
  • Windowing: feature transformation based on a time window
  • Resampling: changing the frequency of the data
  • Filling: filling in missing values or removing NA rows
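
Flint implements these operations on Spark, but as a rough analogy in this blog's usual T-SQL terms, here is how two of them (windowing and filling) might look against a hypothetical dbo.Prices(SecurityId, TradeTime, Price) table. All object names here are invented for illustration:

    -- Windowing: trailing five-row average price per security
    SELECT SecurityId, TradeTime, Price,
           AVG(Price) OVER (PARTITION BY SecurityId
                            ORDER BY TradeTime
                            ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) AS TrailingAvgPrice
    FROM dbo.Prices;

    -- Filling: carry the prior observation forward over a single missing row
    -- (runs of consecutive NULLs would need a more involved approach)
    SELECT SecurityId, TradeTime,
           COALESCE(Price, LAG(Price) OVER (PARTITION BY SecurityId
                                            ORDER BY TradeTime)) AS FilledPrice
    FROM dbo.Prices;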

Time series modeling is the process of identifying patterns in time-series data and training models for prediction. It is a complex topic; it includes specific techniques such as ARIMA and autocorrelation, as well as all manner of general machine learning techniques (e.g., linear regression) applied to time series data.

Flint focuses on time series manipulation. In this blog post, we demonstrate Flint functionalities in time series manipulation and how it works with other libraries, e.g., Spark ML, for a simple time series modeling task.

Basho went all-in on a time-series product for Riak and it did not work out well for them.  I’ll be curious to see if Flint has more staying power.


Against Multi-Cloud Models

Tyler Treat argues against companies looking at multi-cloud models:

A multi-cloud strategy looks great on paper, but it creates unneeded constraints and results in a wild-goose chase. For most, it ends up being a distraction, creating more problems than it solves and costing more money than it’s worth. I’m going to caveat that claim in just a bit because it’s a bold blanket statement, but bear with me. For now, just know that when I say “multi-cloud,” I’m referring to the idea of running the same services across vendors or designing applications in a way that allows them to move between providers effortlessly. I’m not speaking to the notion of leveraging the best parts of each cloud provider or using higher-level, value-added services across vendors.

Multi-cloud rears its head for a number of reasons, but they can largely be grouped into the following points: disaster recovery (DR), vendor lock-in, and pricing. I’m going to speak to each of these and then discuss where multi-cloud actually does come into play.

It’s an interesting article.  I think that Tyler is right, but that companies should be capable of switching between cloud providers or even creating hybrid approaches should the need arise.


An Update To ssisUnit

Bartosz Ratajczyk has added some functionality to ssisUnit:

Second – you can get and set the properties of the project and its elements. Like – overwriting project connection managers (I designed it with this particular need in mind). You can now set the connection string to a different server (or database) – in the PropertyPath of the PropertyCommand use \Project\ConnectionManagers, write the name of the connection manager with the extension, and use one of the Properties. You can do it during the Test setup (or all tests setup), but not during the test suite setup, as ssisUnit is not aware of the project until it loads it into memory.
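
To make that concrete: a property path targeting a project connection manager's connection string might look something like the line below. The connection manager name is invented, and the exact bracket and delimiter syntax is an assumption on my part rather than something taken from the post:

    \Project\ConnectionManagers[SourceDB.conmgr]\Properties[ConnectionString]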

Good on Bartosz for resurrecting a stable but moribund project and adding some enhancements.


Ad Hoc Functions In T-SQL

Riley Major shows a couple techniques for including ad hoc functions in T-SQL, namely Common Table Expressions and the APPLY operator:

It’s helpful to think of each APPLY as a pipe operation, taking the values from the previous derived table and passing them into the next to be manipulated. Programming T-SQL in this manner (loosely) approximates modern functional programming techniques.

It keeps each step of the logic smaller, so that it’s easier to understand. And you can expose the intermediary columns to help with debugging.

This is one of my favorite uses of the APPLY operator, as it lets you think through a problem step-by-step while still allowing the optimizer to create a set-based solution for you.
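
A minimal sketch of the pattern, using an invented dbo.Orders table: each APPLY derives a value that later steps (and the final SELECT list) can reference by name, much like a pipeline.

    SELECT o.OrderId,
           calc1.Subtotal,
           calc2.Total
    FROM dbo.Orders o
        -- step 1: derive an intermediate column
        CROSS APPLY (SELECT o.Quantity * o.UnitPrice AS Subtotal) calc1
        -- step 2: build on step 1's output by name
        CROSS APPLY (SELECT calc1.Subtotal * (1 + o.TaxRate) AS Total) calc2;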


Limiting Azure Administrator Data Access

Melissa Coates gives us a look at one aspect of Azure security:

Recently a customer expressed concern that an owner of an Azure resource group automatically gains access to the data within the services contained in the resource group. In this case, the customer was specifically referring to data in Azure Data Lake Storage Gen 1 but this concept applies to Azure Storage and other data-oriented services in Azure as well. The customer’s comment prompted me to look into available alternatives. This is by no means a detailed security post…rather, I’m trying to share a few nuggets of what I learned.

Worth the read.  Much of the latest round of regulatory push seems to be in the realm of limiting high-access insiders (like DBAs) from accessing sensitive information, and this post aligns with that.


Forcing MAXDOP In Azure SQL DB

Arun Sirpal shows us that you can change MAXDOP in Azure SQL Database:

In this quick post I will show you my parallel plan and how I use MAXDOP = 1 to suppress parallel plan generation so the operation will be executed serially. (Disclaimer – I am not saying this is the right thing to do, merely using it as an example of tweaking this setting; to be honest, in 10 years I have changed MAXDOP = 1 twice.) I executed a query in Azure. You can see the classic operators such as gather streams and repartition streams.

This change will affect all queries hitting that database, so it’s a coarser tool than changing cost threshold for parallelism (not allowed) or setting MAXDOP per-query (allowed).
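
A quick sketch of both levers: the database-wide change Arun demonstrates (presumably via a database scoped configuration, the mechanism Azure SQL Database exposes for this) and the per-query hint. The table name is invented:

    -- Applies to every query against this database
    ALTER DATABASE SCOPED CONFIGURATION SET MAXDOP = 1;

    -- Applies to this statement only
    SELECT COUNT(*)
    FROM dbo.BigTable
    OPTION (MAXDOP 1);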


Using Table-Valued Parameters With sp_executesql

Kenneth Fisher shows how to include table-valued parameters in a dynamic SQL query:

Recently I did a presentation on dynamic SQL. In the presentation I pointed out the similarity of using sp_executesql to creating a stored procedure to do the same task. After the session I was asked: If that’s the case, can I pass a TVP (table valued parameter) into sp_executesql?

Awesome question! Let’s give it a shot.

Read on to see how to do this.
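
The general shape of the technique, as a minimal sketch with invented names: a TVP passed to sp_executesql must be declared READONLY in the parameter definition, just as it would be for a stored procedure.

    -- A user-defined table type and a variable of that type
    CREATE TYPE dbo.IntList AS TABLE (Id INT NOT NULL);
    GO
    DECLARE @Ids dbo.IntList;
    INSERT INTO @Ids (Id) VALUES (1), (2), (3);

    -- TVPs are passed like any other parameter, but must be READONLY
    EXEC sp_executesql
        N'SELECT Id FROM @Ids;',
        N'@Ids dbo.IntList READONLY',
        @Ids = @Ids;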


Comparing Instance Configurations With sys.configurations

Jana Sattainathan has a script to compare two SQL Server instances’ configuration settings:

Steps:

1. Create a linked server to the 2nd instance on the 1st instance
2. Create an empty table to hold the comparison data
3. Insert the 1st instance's sys.configurations data into the table
4. Insert the 2nd instance's sys.configurations data into the table via the linked server
5. Compare to get the differences
6. Cleanup: drop the linked server and the temporary comparison table

Click through for the script.
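
Jana's script stages both instances' rows in a comparison table; as a stripped-down sketch of the core comparison, assuming a linked server named [Instance2] already exists:

    -- Settings whose running values differ between the two instances
    SELECT a.name AS setting,
           a.value_in_use AS instance1_value,
           b.value_in_use AS instance2_value
    FROM sys.configurations a
        INNER JOIN [Instance2].master.sys.configurations b
            ON a.configuration_id = b.configuration_id
    WHERE a.value_in_use <> b.value_in_use;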
