Protecting Hadoop Clusters From Malware

Michael Yoder and Suraj Acharya remind us that Hadoop clusters are made up of computers on a network, which means people will try to install malicious software:

Roughly two years ago there was a spate of attacks against the open source database solution MongoDB, as well as Hadoop. These attacks were ransomware: the attacker wiped or encrypted data and then demanded money to restore that data. Just like the recent attacks, the only Hadoop clusters affected were those that were directly connected to the internet and had no security features enabled. Cloudera published a blog post about this threat in January 2017. That blog post laid out how to ensure that your Hadoop cluster is not directly connected to the internet and encouraged the reader to enable Cloudera’s security and governance features.

That blog post has renewed relevance today with the advent of XBash and DemonBot.

The origin story of XBash and DemonBot illustrates how security researchers view the Hadoop ecosystem and the lifecycle of a vulnerability. Back in 2016 at the Hack.lu conference in Luxembourg, two security researchers gave a talk entitled Hadoop Safari: Hunting for Vulnerabilities. They described Hadoop and its security model and then suggested some “attacks” against clusters that had no security features enabled. These attacks are akin to breaking into a house while the front door is wide open.

Their advice is simple, but simple is good here: it means you should be able to implement the advice without much trouble.

t-closeness And Data Anonymity

John Cook shares some thoughts about k-anonymity and t-closeness:

The idea of k-anonymity is that every database record appears at least k times. If you have a lot of records and few fields, your value of k could be high. But as you get more fields, it becomes more likely that a combination of fields is unique. If k = 1, then k-anonymity offers no anonymity.

Another problem with k-anonymity is that it doesn’t offer group privacy. A database could be k-anonymous but reveal information about a group if that group is homogeneous with respect to some field. That is, the method is subject to a homogeneity attack.

This is intended to be a “get you thinking” type of post, and John does have links to related posts which flesh things out a bit more.
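To make the two ideas in the excerpt concrete, here is a minimal Python sketch. It assumes the usual refinement that k is measured over a chosen set of quasi-identifier fields (the excerpt speaks of whole records), and the field names and sample rows are invented for illustration. It computes k for a small table and then lists the groups that are exposed to a homogeneity attack because every record in the group shares the same sensitive value.

```python
from collections import defaultdict


def k_anonymity(records, quasi_identifiers):
    """Group records by their quasi-identifier values and return (k, groups).

    k is the size of the smallest group; k == 1 means at least one
    record is unique and therefore not anonymous at all.
    """
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[q] for q in quasi_identifiers)
        groups[key].append(rec)
    k = min((len(g) for g in groups.values()), default=0)
    return k, dict(groups)


def homogeneous_groups(groups, sensitive_field):
    """Return the keys of groups whose members all share one sensitive value."""
    return [
        key
        for key, recs in groups.items()
        if len({r[sensitive_field] for r in recs}) == 1
    ]


# Hypothetical, already-generalized sample data.
records = [
    {"zip": "300**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "300**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "300**", "age": "40-49", "diagnosis": "flu"},
    {"zip": "300**", "age": "40-49", "diagnosis": "asthma"},
]

k, groups = k_anonymity(records, ["zip", "age"])
exposed = homogeneous_groups(groups, "diagnosis")
print(f"k = {k}; homogeneous groups: {exposed}")
```

In this sample the table is 2-anonymous, yet anyone known to be in the 30-39 group is revealed to have the same diagnosis, which is the gap that stronger criteria such as t-closeness try to address.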

Useful PowerShell Aliases For Docker

Elton Stoneman shares a few useful aliases in PowerShell for managing Docker containers:

Docker PowerShell Alias #2 – drmf

Removes all containers, whether they’re running or not. Useful when you want to reset your running containers and get back to zero:

function Remove-AllContainers {
  docker container rm -f $(docker container ls -aq)
}
Set-Alias drmf Remove-AllContainers

Use with caution

Elton shares several more at the link and also includes a link to a GitHub gist with them all.

Azure ML Studio Supports R 3.4

David Smith notes that Azure ML Studio now supports R version 3.4:

With the Execute R Script module you can immediately use more than 650 R packages which come preinstalled in the Azure ML Studio environment. You can also use other R packages (including packages not on CRAN) and source in R scripts you develop elsewhere (as shown above), although this does require the time to install them in the Studio environment. You can even create custom ML Studio models encapsulating R code for others to use in the drag-and-drop environment.

If you’re new to Azure ML Studio, check out the Quickstart Tutorial for R to learn how to use the Execute R Script module, and to check out what’s new in the latest update follow the link below.

Click through for more details.

Finding Who Changed Auto-Tuning Settings On Azure SQL DB

Arun Sirpal shows us the extended event to watch to learn who changed that auto-tuning setting:

It is said to be safe, reliable and proven using complex algorithms and built-in intelligence where it can do the following (see this link for more details):

  1. CREATE INDEX – identifies indexes that may improve performance of your workload, creates indexes, and automatically verifies that performance of queries has improved.
  2. DROP INDEX – identifies redundant and duplicate indexes daily, except for unique indexes, and indexes that were not used for a long time (>90 days). Please note that at this time the option is not compatible with applications using partition switching and index hints.
  3. FORCE LAST GOOD PLAN – identifies SQL queries using an execution plan that is slower than the previous good plan, and runs those queries using the last known good plan instead of the regressed plan.

Personally I don’t enable the option where it is allowed a “free-for-all” when creating/dropping indexes and forcing certain query plans. I like controlling the change, especially for production databases. To enforce this, I wanted to use Extended Events to know when or if someone changed my settings for automatic tuning against my database.

Click through for the script.

Using Microsoft Flow To Find Power BI Data Sources In Use

Chris Webb continues his series on using Microsoft Flow to extend Power BI:

The problem with self-service BI is that you never quite know what your users are up to. For example, what data sources are they using? Are there hundreds of Excel files being used as data sources for reports that you don’t know about? If so, where are they? Could they and should they be replaced by a database or something else more robust? In this post I’ll show you how you can use Microsoft Flow and the Power BI REST API (see part 1 to find out how to create a Flow custom connector to call the Power BI API) to get the details of all the data sources used in all of the workspaces of your Power BI tenant.

I’ll admit that doing this turned out to be a bit trickier than I had expected. My plan was to use the GetDatasetsAsAdmin endpoint to get a list of all datasets, loop over each one and then call the (undocumented, but in the REST API’s Swagger file and therefore in my custom connector) GetDatasourcesAsAdmin endpoint to get the datasources in each dataset. Both these endpoints require administrative permissions to call, so I made sure my custom connector had the correct permissions (at least Tenant.Read.All – you can check this in the Azure Portal on the app you registered in Azure Active Directory) and I ran the Flow as a user with Power BI Admin permissions. But I kept getting 404 errors when requesting the data sources for certain datasets.

Chris explains why those 404s appear and what you can do about them.
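Chris builds this in Microsoft Flow, but the loop he describes can also be sketched in Python as a rough illustration. This is not his implementation: the function names are hypothetical, the access token is assumed to have been obtained elsewhere with admin permissions, and the sketch assumes the admin endpoints return their results in a `value` array. It lists every dataset, requests each dataset's data sources, and records the datasets that return a 404 instead of failing outright.

```python
import json
import urllib.error
import urllib.request

# Base URL of the Power BI admin REST endpoints.
API_ROOT = "https://api.powerbi.com/v1.0/myorg/admin"


def http_get_json(url, token):
    """GET a URL with a bearer token and parse the JSON body."""
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def list_datasources(token, get_json=http_get_json):
    """Return (found, skipped).

    found maps dataset id -> list of data sources;
    skipped lists the dataset ids whose data source request returned 404.
    get_json is injectable so the loop can be exercised without a network.
    """
    datasets = get_json(f"{API_ROOT}/datasets", token)["value"]
    found, skipped = {}, []
    for dataset in datasets:
        url = f"{API_ROOT}/datasets/{dataset['id']}/datasources"
        try:
            found[dataset["id"]] = get_json(url, token)["value"]
        except urllib.error.HTTPError as err:
            if err.code != 404:
                raise  # anything other than "not found" is a real failure
            skipped.append(dataset["id"])
    return found, skipped
```

Collecting the 404s in a separate list, rather than swallowing them, keeps a record of which datasets need the follow-up investigation that Chris walks through in his post.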

