Day: January 27, 2020

Deploying a Big Data Cluster to a Multi-Node kubeadm Cluster

Mohammad Darab shows how we can deploy a SQL Server Big Data Cluster on a multi-node kubeadm cluster:

There are a few assumptions before we get started:

1. You have at least 3 virtual machines running with the minimum hardware requirements
2. All your virtual machines are running Ubuntu Server 16.04 and have OpenSSH installed
3. All the virtual machines have static IPs and are on the same subnet
4. All the virtual machines are updated and have been rebooted

Mohammad shows how to set up the cluster, configure Kubernetes, and then install Big Data Clusters. Definitely worth the read if you’re interested in building a Big Data Cluster on-premises.

Updates in Confluent Platform 5.4

Tim Berglund takes us through what has changed in Confluent Platform 5.4:

Role-Based Access Control (RBAC)

Back in July, we announced the preview for RBAC as part of the Confluent Platform 5.3 release. After gathering feedback and learning from everyone who tried it out, we are now pleased to announce the availability of RBAC in Confluent Platform 5.4. You can now make use of this feature in production environments with Confluent’s full support.

RBAC offers a centralized security implementation for enabling access to resources across the entire Confluent Platform with just the right level of granularity. You can control the permissions you grant to users and groups for specific platform resources, starting at the cluster level and moving all the way down to individual topics, consumer groups, or even individual connectors. You do this by assigning users or groups to roles. This gets you out of the game of managing the individual permissions of a huge number of principals, a real problem for large enterprise deployments.

RBAC delivers comprehensive authorization enforced via all user interfaces (Confluent Control Center UI, CLI, and APIs), and across all Confluent Platform components (Control Center, Schema Registry, REST Proxy, MQTT Proxy, Kafka Connect, and KSQL). Given the distributed architecture not only of Apache Kafka but also of other platform components like Connect and KSQL, having a single framework to centrally manage and enforce security authorizations across all the components is, in a word, essential for managing security at scale.

Click through for several more features and where you can try it out, either on-premises or in a major cloud host.
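
To make the role-assignment idea concrete, here is a minimal sketch of how a role binding check might work. It is a generic illustration, not Confluent's API: the `RoleBinding` class, the role names, and the `is_allowed` helper are all hypothetical, and the only assumption taken from the excerpt is that bindings can apply anywhere from the cluster level down to individual resources.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleBinding:
    """Hypothetical binding of a user or group to a role on a resource."""
    principal: str            # e.g. "User:alice" or "Group:analytics"
    role: str                 # key into ROLE_OPERATIONS
    resource_type: str        # "Cluster", "Topic", "Connector", ...
    resource_name: str = "*"  # exact name, or a prefix ending in "*"


# Hypothetical roles and the operations each one grants.
ROLE_OPERATIONS = {
    "TopicReader": {"Read"},
    "TopicOwner": {"Read", "Write"},
    "ClusterOperator": {"Read", "Write", "Manage"},
}


def name_matches(pattern, name):
    """Match an exact resource name or a trailing-wildcard prefix."""
    if pattern.endswith("*"):
        return name.startswith(pattern[:-1])
    return pattern == name


def is_allowed(bindings, principals, operation, resource_type, resource_name):
    """Return True if any binding for these principals grants the operation."""
    for binding in bindings:
        if binding.principal not in principals:
            continue
        if operation not in ROLE_OPERATIONS.get(binding.role, set()):
            continue
        # A cluster-level binding covers every resource beneath the cluster.
        if binding.resource_type == "Cluster":
            return True
        if (binding.resource_type == resource_type
                and name_matches(binding.resource_name, resource_name)):
            return True
    return False


BINDINGS = [
    RoleBinding("Group:analytics", "TopicReader", "Topic", "orders-*"),
    RoleBinding("User:ops", "ClusterOperator", "Cluster"),
]
ALICE = {"User:alice", "Group:analytics"}  # a user plus her group memberships
```

Because Alice inherits her access from the group, adding a new analyst means adding one group membership rather than a fresh set of per-topic permissions, which is the scaling benefit the excerpt describes.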

Spark is Not ACID Compliant

Kundan Kumarr explains how it is that Apache Spark is not ACID compliant:

Atomicity means that the Spark DataFrame writer should write either the full data or nothing to the data source. Consistency, on the other hand, ensures that the data is always in a valid state.

As the Spark documentation makes clear, when a DataFrame is saved to a data source in overwrite mode, the existing data is deleted before the new data is written. If the job fails, the original data will be lost or corrupted and no new data will be written.

Click through for an explanation of these two along with a demo, and then an explanation of how Spark Datasets don’t follow the Isolation or Durability properties either. I don’t think any of this is earth-shattering to people, but it is a good reminder that Spark doesn’t fit all use cases.
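
The delete-then-write problem can be reproduced without Spark at all. The sketch below is a plain-Python simulation on a local file, not Spark code: `fail_after` is a hypothetical knob that simulates a job dying partway through, and the write-to-temp-then-rename variant is one common workaround, not something the linked post prescribes.

```python
import os
import shutil
import tempfile


def _write_rows(path, rows, fail_after=None):
    """Write rows one per line, optionally failing after `fail_after` rows."""
    with open(path, "w") as out:
        for i, row in enumerate(rows):
            if fail_after is not None and i >= fail_after:
                raise RuntimeError("simulated job failure")
            out.write(row + "\n")


def overwrite_delete_then_write(path, rows, fail_after=None):
    """Non-atomic overwrite: the old data is removed before the new write."""
    if os.path.exists(path):
        os.remove(path)  # the existing data is gone from this point on
    _write_rows(path, rows, fail_after)


def overwrite_atomic(path, rows, fail_after=None):
    """Write to a temporary file and swap it in only if the write succeeds."""
    tmp_path = path + ".tmp"
    try:
        _write_rows(tmp_path, rows, fail_after)
        os.replace(tmp_path, path)  # atomic rename on the same filesystem
    except Exception:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def read_rows(path):
    if not os.path.exists(path):
        return []
    with open(path) as src:
        return [line.rstrip("\n") for line in src]


def run_demo(writer):
    """Seed a file with old rows, fail an overwrite, and return what is left."""
    workdir = tempfile.mkdtemp()
    try:
        path = os.path.join(workdir, "table.csv")
        overwrite_delete_then_write(path, ["old-1", "old-2"])
        try:
            writer(path, ["new-1", "new-2", "new-3"], fail_after=1)
        except RuntimeError:
            pass
        return read_rows(path)
    finally:
        shutil.rmtree(workdir)


if __name__ == "__main__":
    print(run_demo(overwrite_delete_then_write))  # old rows are gone
    print(run_demo(overwrite_atomic))             # old rows survive
```

After the simulated failure, the delete-then-write version leaves neither the old data nor the complete new data behind, while the rename-based version still holds the original rows.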

Explaining Black Box Models with LIME

Holger von Jouanne-Diedrich takes us through the intuition of LIME:

There is a new hot area of research to make black-box models interpretable, called Explainable Artificial Intelligence (XAI). If you want to gain some intuition on one such approach (called LIME), read on!

Before we dive right into it, it is important to point out when and why you would need interpretability of an AI. While it might be a desirable goal in itself, it is not necessary in many fields, at least not for users of an AI. For example, with text translation, character recognition, and speech recognition, it is not that important why they do what they do, but simply that they work.

In other areas, like medical applications (determining whether tissue is malignant), financial applications (granting a loan to a customer) or applications in the criminal-justice system (gauging the risk of recidivism) it is of the utmost importance (and sometimes even required by law) to know why the machine arrived at its conclusions.

One approach to make AI models explainable is called LIME for Local Interpretable Model-Agnostic Explanations. There is already a lot in this name!

LIME is not trivial to use and it can be very slow, but it is a great way to visualize models.
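
The "local" and "model-agnostic" parts of the name can be sketched in a few lines. This is a deliberately simplified, one-feature illustration and not the LIME library: it perturbs a single instance, weights the perturbed points by their proximity to it, and fits a weighted straight line as the interpretable surrogate. The `black_box` function, the evenly spaced perturbation grid (real LIME samples randomly), and the kernel width are all assumptions made for the sake of a reproducible example.

```python
import math


def black_box(x):
    """Stand-in for an opaque model; any callable would do."""
    return x * x


def local_linear_explanation(model, x0, radius=1.0, n_samples=21, kernel_width=0.5):
    """Fit a proximity-weighted straight line to `model` around `x0`.

    Returns (slope, intercept) of the local surrogate model.
    """
    # Perturb the instance on an evenly spaced grid centred on x0.
    step = 2 * radius / (n_samples - 1)
    xs = [x0 - radius + i * step for i in range(n_samples)]
    ys = [model(x) for x in xs]

    # Points closer to x0 get larger weights (Gaussian kernel).
    ws = [math.exp(-((x - x0) ** 2) / (kernel_width ** 2)) for x in xs]

    # Closed-form weighted least squares for a single feature.
    w_sum = sum(ws)
    x_mean = sum(w * x for w, x in zip(ws, xs)) / w_sum
    y_mean = sum(w * y for w, y in zip(ws, ys)) / w_sum
    cov = sum(w * (x - x_mean) * (y - y_mean) for w, x, y in zip(ws, xs, ys))
    var = sum(w * (x - x_mean) ** 2 for w, x in zip(ws, xs))
    slope = cov / var
    return slope, y_mean - slope * x_mean


if __name__ == "__main__":
    slope, intercept = local_linear_explanation(black_box, x0=3.0)
    print(f"near x=3 the model behaves like y = {slope:.2f} * x + {intercept:.2f}")
```

The surrogate only claims to describe the model near the chosen instance: explaining a different instance means fitting a different line, which is why the explanations are local rather than global.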

Improving Join Performance on ADF Data Flows

Mark Kromer has a few tips on improving ADF data flow join performance:

When you include literal values in your join conditions, Spark may see that as a requirement to perform a full cartesian product first, then filter out the joined values. But if you (1) ensure that you have column values from both sides of your join condition and (2) avoid using literal conditions to represent the results of one side of your join condition, you can avoid this Spark-induced cartesian product and improve the performance of your joins and data flows.

In other words, avoid this for your join condition: source1@movieId == '1'. Instead, implement that with a dummy derived column.

There are several good tips in this post.
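
To see why the dummy derived column helps, here is a small plain-Python illustration. It is neither ADF nor Spark code, and the row data and function names are made up for the example. A condition that only compares one side to a literal gives the engine no key to pair rows on, so every pair has to be checked; adding a constant column to the other side turns the same logic into an equality between two columns.

```python
def cross_join_then_filter(left, right, predicate):
    """Evaluate `predicate` on every left/right pair (cartesian product)."""
    pairs_checked = 0
    matches = []
    for lrow in left:
        for rrow in right:
            pairs_checked += 1
            if predicate(lrow, rrow):
                matches.append((lrow, rrow))
    return matches, pairs_checked


def hash_join(left, right, left_key, right_key):
    """Equi-join: index the right side by key, then probe once per left row."""
    index = {}
    for rrow in right:
        index.setdefault(rrow[right_key], []).append(rrow)
    lookups = 0
    matches = []
    for lrow in left:
        lookups += 1
        for rrow in index.get(lrow[left_key], []):
            matches.append((lrow, rrow))
    return matches, lookups


movies = [{"movieId": "1"}, {"movieId": "2"}, {"movieId": "3"}]
ratings = [{"rating": r} for r in (5, 4, 3, 2)]

# Literal-only condition: movieId == '1', with nothing from the right side.
literal_matches, pairs_checked = cross_join_then_filter(
    movies, ratings, lambda lrow, rrow: lrow["movieId"] == "1"
)

# Same logic with a dummy derived column on the right side to join against.
ratings_with_dummy = [{**rrow, "dummy": "1"} for rrow in ratings]
keyed_matches, lookups = hash_join(movies, ratings_with_dummy, "movieId", "dummy")
```

Both versions pair the same rows, but the first inspects every combination of movie and rating, while the second does one keyed lookup per movie.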

Auditing Login Events Using Service Broker

Max Vernon takes us through using Service Broker to audit login events:

Logging to the SQL Server Error Log or the Windows Security Event Log means you’ll need some kind of tool to slice and dice the data after the fact. It’s difficult to respond to events as they happen with this kind of auditing, and hard to create simple T-SQL queries to inspect the data. You could create a login trigger at the server level, but that will only allow you to capture successful logins. Coding the trigger incorrectly can result in everyone being locked out of the server. You’ll then need to use the Dedicated Administrator Connection, otherwise known as the DAC, to log in to the server and disable the errant trigger. Not fun.

Luckily, there is a much better option: using SQL Server’s built-in Event Notification service to receive login events through Service Broker. This event stream is asynchronous to the login process, meaning it won’t interrupt or slow down the login process, and it allows you to capture both successful and failed logins to a table, either locally or remotely. For larger SQL Server infrastructures, it’s not uncommon to set up a single SQL Server instance to gather this information for central analysis.

This blog post shows how to set up a database locally for auditing login events via SQL Server Event Notifications and Service Broker.

Click through for a script-heavy post which helps you all the way through the process.

Add-ClusterNode Error: Keyset Does Not Exist

Jonathan Kehayias troubleshoots a Windows Server clustering problem:

While working on a video recording for Paul this week I ran into an interesting problem with one of my Windows Server 2016 clusters. While attempting to add a new node to the cluster I ran into an exception calling Add-ClusterNode:

The server ‘’ could not be added to the cluster.
An error occurred while adding node ‘’ to cluster ‘SQL2K16-WSFC’.

Keyset does not exist

The Windows account I was using was the domain administrator account, and I had just recently made modifications that involved the certificate store on this specific VM, so I decided to take a backup of the VMDK and then revert to a snapshot to try again, and this time it worked. Needless to say, I was intrigued as to what I could have done that would be causing this error to happen.

Read on to see what the root cause was and how you can fix it.

Viewing Power BI Audit and Activity Logs

Jeff Pries gives us the rundown on auditing in Power BI:

When using the cloud-based Power BI Service, every action that is taken while logged into the portal, whether it is viewing or publishing a report, creating a new workspace, or even signing up for a pro trial license, is logged within the Microsoft servers as part of the Office 365 audit logs.

These logs can be accessed in a couple of ways: through the Office 365 Audit Log functionality, using the Office 365 Admin Center or PowerShell cmdlets, or through the new Power BI Activity Log (Power BI Get Activity Events) functionality, accessible via a PowerShell cmdlet (Get-PowerBIActivityEvent) and an API. There are a few examples out there already on how to use these commands to access the data (and I have a post on accessing the data using the Power BI API and C# coming out in a few weeks), but there doesn’t seem to be a lot out there about the data itself, which is what I plan to focus on here.

Read on for more details as well as the structure around a forthcoming application to parse these logs and store them locally in SQL Server.
