Press "Enter" to skip to content

Author: Kevin Feasel

Lambda Architecture In Azure

Jared Zagelbaum describes the Lambda architecture pattern and explains how you can use tooling in Azure to implement it:

Lambda is an organic result of the limitations of existing tools. Distributed systems architects and developers commonly criticize its complexity – and rightly so. Those of us that have worked extensively in Extract-Transform-Load and symmetric multiprocessing systems see red flags when code is replicated in multiple services. Ensuring data quality and code conformity across multiple systems, whether massively parallel processing (MPP) or symmetrically parallel system (SMP), has the same best practice: the least amount of times you reproduce code is always the correct number of times.

We reproduce code in lambda because different services in MPP systems are better at different tasks. The maturity of tools historically hasn’t allowed us to process streams and batch in a single tool. This is starting to change, with Apache Spark emerging as a single preferred compute service for stream and batch querying, hence the timing of Azure Databricks. However, on the storage side, what was meant to be an immutable store that is the data lake in practice, can become the dreaded swamp when governance or testing fails; which is not uncommon. A fundamentally different assumption to how we process data is required to combat this degradation. Enter: the kappa architecture, which we’ll examine in the next post of this series.

Interesting reading.

Comments closed

Cloning A SQL Server Installation

Jana Sattainathan shows how you can find the configuration options used when installing SQL Server:

Let us say that you want to install SQL Server on another host say NewHost but you want it to have the same settings as another host/instance on say ModelHost. In fact, let us say that these are almost identical hosts with similar drive locations and such and you are positive that you want the exact setup on both.

The hard way is to look at what features ModelHost has and try to click through the installation wizard with the correct options selected/input.

Fortunately, there is an easier way to clone an installation. Even if you installed manually using the installation wizard, SQL Server still generates a configuration file with all the settings used (except passwords) for the installation. We can simply use that file from the ModelHost to drive the installation on NewHost

Click through to see where that configuration file is and how you can use it.

Comments closed

Don’t Run Services As Root On Linux

Kellyn Pot’vin-Gorman explains why running SQL Server as root is a bad idea:

Although enhancements have changed Windows installations for applications to run with a unique user, I created a mssql OS user even back on SQL Server 2000 on Windows as I had a tendency to use similar security practices for all database platforms as a multi-platform DBA.  With that being said-  yes, it introduced complexity, but it was for a reason: users should be restricted to the least amount of privileges required.  To grant any application or database “God” powers on a host is akin to granting DBA to every user in the database, but at the host level.  As important as security is to DBAs INSIDE the database, it should be just as important to us OUTSIDE of it on the host it resides on.

Security is important and has become more complex with the increase of security breaches and introduction of the cloud.  One of the most simple ways to do this is to ensure that all application owners on a host are granted only the privileges they require.  The application user should only utilize SUDO, stick bit, iptables, SUID, SGID and proper group creation/allocation if and when required.

It’s the same reason we don’t recommend giving everyone sa rights to databases.  Read on for more.

Comments closed

Automatic Tuning In SQL Server 2017

Arun Sirpal shows off one of the more interesting features in SQL Server 2017:

Before we begin any further let’s do a little recap. Automatic tuning in SQL Server 2017 notifies you whenever a potential performance issue is detected, and lets you apply corrective actions, or lets the Database Engine automatically fix performance problems, this is also available in Azure SQL Database.

There are 2 parts to it, automatic plan correction and automatic index management, for SQL Server 2017, automatic index management it IS NOT part of the product.

To switch automatic plan correction on you will need to run the following code against your database.

I’m looking forward to seeing this expand much further.

Comments closed

Beware Multi-Assignment dplyr::mutate() Statements

John Mount hits on an issue when using dplyr backed by a database in R:

Notice the above gives an incorrect result: all of the x_i columns are identical, and all of the y_i columns are identical. I am not saying the above code is in any way desirable (though something like it does arise naturally in certain test designs). If this is truly “incorrect dplyr code” we should have seen an error or exception. Unless you can be certain you have no code like that in a database backed dplyr project: you can not be certain you have not run into the problem producing silent data and result corruption.

The issue is: dplyr on databases does not seem to have strong enough order of assignment statement execution guarantees. The running counter “delta” is taking only one value for the entire lifetime of the dplyr::mutate() statement (which is clearly not what the user would want).

Read on for a couple of suggested solutions.

Comments closed

Leveraging Hive In Pyspark

Fisseha Berhane shows how to use Spark to connect Python to Hive:

If we are using earlier Spark versions, we have to use HiveContext which is variant of Spark SQL that integrates with data stored in Hive. Even when we do not have an existing Hive deployment, we can still enable Hive support.
In this tutorial, I am using standalone Spark. When not configured by the Hive-site.xml, the context automatically creates metastore_db in the current directory.

As shown below, initially, we do not have metastore_db but after we instantiate SparkSession with Hive support, we see that metastore_db has been created. Further, when we execute create database command, spark-warehouse is created.

Click through for a bunch of examples.

Comments closed

Stream Reactor Update

Andrew Stevenson announces Stream Reactor 1.0.0 for Kafka Connect 1.0:

Stream Reactor is an Apache License, Version 2.0 open source collection of components built on top of Kafka and provides Kafka Connect compatible connectors to move data between Kafka and popular data stores. Stream Reactor provides source connectors to publish data into Kafka and sink connectorsto bring data from Kafka into other systems. The connectors support KCQL (Kafka Connect Query Language), an open source component of Lenses SQL Enginethat provides an elegant and simple SQL like syntax for selecting fields and routing from sources or topics to Kafka or the target system (topic to target entity mapping, field selection, auto creation, auto evolution, error policies).

We hope you find Stream Reactor useful, and want to give it a try! Stream Reactor has over 25 connectors available, tested and documented, supporting both Kafka 0.11 and Kafka 1.0 and you can give it a go by downloading Lenses Development Environment or find the jars on GitHub, or even build the code locally and help us improve and add even more connectors.

Read on for more details, as well as a link to the GitHub repo.

Comments closed

Filtering Class Type In Server Audits

Solomon Rutzky shows how to filter the class_type field in a SQL Server audit to filter out scalar valued functions:

According to the documentation for CREATE SERVER AUDIT, I should be able to add a WHERE clause (starting in SQL Server 2012) to do simple filtering. The documentation states that the list of fields that can be filtered on is found in the documentation for the [sys.fn_get_audit_file] system function. Selecting from that function showed that the class_type field contains the “object type” (“FN” = Function, “P” = Stored Procedure, etc.) value. That’s just what I needed. So I’m just about done, right? Not so fast!

Nothing’s ever that easy, it seems.  Read on for the full story.

Comments closed

SQL Server And Polygons

Bert Wagner shows that SQL Server interprets polygon definitions a bit differently from GeoJSON:

Discerning eyes might notice that SQL Server didn’t shade in the area inside of the polygon — it instead shaded in everything in the world EXCEPT for the interior of our polygon.

If this is the first time you’ve encountered this behavior then you’re probably confused by this behavior — I know I was.

Read on to learn more about left-hand and right-hand polygon specifications and how to translate from one to the other.

Comments closed

Defining A Data Lake

Derik Hammer gives us a definition of the data lake:

Data lake, a term originally coined by James Dixon, the founder and CTO of Pentaho, is used to describe a data store which can scale to extremely large sizes, in an affordable manner. A data lake is also designed to store the raw data, in its original format, so it can be used immediately, rather than waiting weeks for the IT department to massage it into a format that the data warehouse can accept and/or use effectively.

The data lake concept always includes the capability to scale to an enormous size. However, you do not need petabytes of data to find use in a data lake. It can be used as cheap storage for long-term archival data. It can be used to transform data before attempting to ingest into a data warehouse with the convenience of retaining the original and transformed versions of the data. It also can be used as the centralized staging location for ingestion into the data warehouse, simplifying the loading processes.

I would like to take this opportunity to remind readers that the Aristotelian opposite of the Data Lake is the Data Swamp.  Derik uses this term as well and it makes me feel warm and fuzzy inside to see broad adoption of this term.

Comments closed