Press "Enter" to skip to content

Author: Kevin Feasel

Choosing Azure Data Lake Analytics Versus Azure Databricks

Ginger Grant helps us make the decision between using Azure Data Lake Analytics and Azure Databricks:

Databricks is a recent addition to Azure that is greatly influencing the technology choices people are making when determining how to process data. Prior to the introduction of Databricks to Azure in March of 2018, if you had a lot of unstructured data stored in HDFS clusters and wanted to analyze it in a scalable fashion, the choice was Data Lake and using U-SQL with Data Lake Analytics. With the introduction of Databricks, there is now a choice between Data Lake Analytics and Databricks for analyzing data.

Click through for the comparison.


Miscellany On Java In SQL Server

Niels Berglund continues a series on SQL Server 2019 and Java support:

This is the fourth post in which I look at the Java extension in SQL Server, i.e. the ability to execute Java code from inside SQL Server. The previous three posts are:
SQL Server 2019 Extensibility Framework & Java – Hello World: We looked at installing and enabling the Java extension, as well as some very basic Java code.
SQL Server 2019 Extensibility Framework & Java – Passing Data: In this post, we discussed what is required to pass data back and forth between SQL Server and Java.
SQL Server 2019 Extensibility Framework & Java – Null Values: This, the Null Values, post is a follow up to the Passing Data post, and we look at how to handle null values in data passed to Java.
This fourth post acts as a “roundup” of miscellaneous “stuff” I did not cover in the three previous posts

If you haven’t seen the first three, check them out too.
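For a sense of what these calls look like, here is a minimal sketch of the T-SQL side, which follows the same sp_execute_external_script pattern as R and Python. The package and class name are hypothetical, and the Java class itself must follow the extension's input/output conventions that Niels covers in the Passing Data post.

EXEC sp_execute_external_script
    @language = N'Java',
    @script = N'sqljava.HelloWorld',       -- hypothetical package.class to execute
    @input_data_1 = N'SELECT 1 AS col1'    -- result set handed to the Java class
WITH RESULT SETS ((col1 int));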


The State Of Database Scoped Configurations

Niko Neugebauer takes us through the current state of Database Scoped Configurations in SQL Server:

I have already blogged about the first version of the Database Scoped Configurations for SQL Server 2016, with 4 visible options plus the procedure cache cleaning option, but SQL Server 2017 followed with 5 (listed) & 9 (in practice – DISABLE_INTERLEAVED_EXECUTION_TVF, DISABLE_BATCH_MODE_ADAPTIVE_JOINS, BATCH_MODE_MEMORY_GRANT_FEEDBACK, BATCH_MODE_ADAPTIVE_JOINS are visible and functioning), and in just another year we have received a huge upgrade to the currently available 21 for SQL Server 2019.

It seems like this is the route the SQL Server team is going down, and it makes sense: your settings for Mega-DB probably shouldn’t be the same as for the tiny database in the corner. Oh, and that whole Azure SQL Database thing.
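As a quick refresher on the syntax, these settings apply per database rather than instance-wide; the MAXDOP value below is purely illustrative:

-- See the configurations (and their values) exposed on this database
SELECT name, value, value_for_secondary
FROM sys.database_scoped_configurations;

-- Change a setting for the current database only
ALTER DATABASE SCOPED CONFIGURATION SET MAXDOP = 4;

-- The procedure cache cleaning option mentioned above
ALTER DATABASE SCOPED CONFIGURATION CLEAR PROCEDURE_CACHE;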


Building A Kubernetes Cluster With Kubespray

Chris Adkin continues a series on Kubernetes clusters:

In essence, Kubespray is a collection of Ansible playbooks: YAML files that specify what actions should take place against one or more machines listed in a hosts.ini file, which resides in what is known as an inventory. Of all the infrastructure-as-code tools available at the time of writing, Ansible is the most popular and has the greatest traction. Examples of playbooks produced by Microsoft can be found on GitHub for automating tasks in Azure and deploying SQL Server availability groups on Linux. The good news for anyone into PowerShell is that PowerShell modules can be installed via Ansible and PowerShell commands can be executed via Ansible. Also, there are people already using PowerShell Desired State Configuration with Ansible. Ansible’s popularity is down to the fact that it is easy to pick up and agentless, relying instead on ssh; hence one of the steps in this post includes the creation of ssh keys. This free tutorial is highly recommended for anyone wishing to pick up Ansible.

Click through for a step-by-step tutorial.


Creating An Azure Storage Account

John Morehouse walks us through setting up an Azure Storage Account through the Azure Portal:

Azure offers a lot of features that enable IT professionals to really enhance their environment. One feature that I really like about Azure is storage accounts. Disk is relatively cheap, and this continues to hold true in the cloud: for less than $100 per month, you can get up to 5TB of storage, including redundancy to another Azure region.

Read on to learn how to set up one of these.


Diving Into OPTION(RECOMPILE)

Arthur Daniels explains some of the nuance behind OPTION(RECOMPILE) on T-SQL statements:

SQL Server will compile an execution plan specifically for the statement that the query hint is on. There are some benefits, like something called “constant folding.” To us, that just means that the execution plan might be better than a normal execution plan compiled for the current statement.
It also means that the statement itself won’t be vulnerable to parameter sniffing from other queries in cache. In fact, the statement with option recompile won’t be stored in cache.

Click through for a couple of demos as well as a discussion of positives and negatives regarding its use.
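For a feel of the syntax, here is a minimal sketch; the table and parameter names are illustrative (borrowed from AdventureWorks):

-- The hint applies only to this statement: the plan is compiled for the
-- current value of @CustomerID and is not stored in the plan cache.
DECLARE @CustomerID int = 11000;

SELECT SalesOrderID, OrderDate, TotalDue
FROM Sales.SalesOrderHeader
WHERE CustomerID = @CustomerID
OPTION (RECOMPILE);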


MLflow 0.8.1 Released

Aaron Davidson, et al., announce a new version of Databricks MLflow:

When scoring Python models as Apache Spark UDFs, users can now filter UDF outputs by selecting from an expanded set of result types. For example, specifying a result type of pyspark.sql.types.DoubleType filters the UDF output and returns the first column that contains double precision scalar values. Specifying a result type of pyspark.sql.types.ArrayType(DoubleType) returns all columns that contain double precision scalar values. The example code demonstrates result type selection using the result_type parameter, and the short example notebook illustrates a Spark model being logged and then loaded as a Spark UDF.

Read on for a pretty long list of updates.


File Formats Supported In HDFS

Manoj Pandey covers a few of the file types supported by the Hadoop Distributed File System:

HDFS, or the Hadoop Distributed File System, is the distributed file system provided by the Hadoop big data platform. The primary objective of HDFS is to store data reliably even in the presence of node failures in the cluster, which is facilitated by data replication across different racks in the cluster infrastructure. Files stored in HDFS are then used for further processing by different data processing engines like Hadoop MapReduce, Hive, Spark, Impala, and Pig.

There are a few other formats not included in this list, including RCFile (which has been superseded by both ORC and Parquet), but this hits the highlights.
