This post explains kerberizing an existing Hadoop cluster using Ambari. Kerberos helps with the Authentication part of enterprise security (with data protection being among the remaining parts).
HDP uses Kerberos, an industry standard for authenticating users and resources and providing a strong identity for users. Apache Ambari can kerberize an existing cluster by using an existing MIT key distribution center (KDC) or Microsoft's Active Directory.
This was a lot easier than I expected.
Within Machine Learning many tasks are – or can be reformulated as – classification tasks.
In classification tasks we try to produce a model that captures the relationship between the input data and the class each input belongs to. This model is built from the feature values of the input data. For example, suppose a dataset contains data points belonging to the classes Apples, Pears and Oranges; based on the features of each data point (weight, color, size, etc.) we try to predict its class.
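As a minimal sketch of that idea, here is a nearest-neighbour classifier in plain Python. The feature values (weight in grams, a color score, size in centimetres) and the training examples are made up for illustration, not taken from the post:

```python
import math

# Hypothetical training data: (weight_g, color_score, size_cm) -> class label
training_data = [
    ((150, 0.80, 7.0), "Apple"),
    ((160, 0.70, 7.5), "Apple"),
    ((170, 0.40, 8.0), "Pear"),
    ((180, 0.35, 8.5), "Pear"),
    ((140, 0.90, 6.5), "Orange"),
    ((135, 0.95, 6.0), "Orange"),
]

def classify(features):
    """Predict the class of a data point from its nearest training example."""
    nearest = min(
        training_data,
        key=lambda item: math.dist(item[0], features),  # Euclidean distance
    )
    return nearest[1]

print(classify((155, 0.75, 7.2)))  # nearest to the Apple examples
```

A real model would normalise the features first (here weight dominates the distance), which is exactly the kind of detail the linked post walks through.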
Ahmet has his entire post saved as a Jupyter notebook.
It’s important to note that Athena is not a general purpose database. Under the hood is Presto, a query execution engine that runs on top of the Hadoop stack. Athena’s purpose is to ask questions rather than insert records quickly or update random records with low latency.
That being said, Presto's performance is impressive given that it can work on some of the world's largest datasets. Analysts at Facebook use Presto daily against a multi-petabyte data warehouse, so having such a powerful tool available via a simple web interface, with no servers to manage, is pretty amazing to say the least.
Athena is Amazon’s response to Azure Data Lake Analytics. Check out Mark’s blog post for a good way of getting started with Athena.
There are two types of resource locks in Azure.
- Delete – Authorised users can read and modify a resource, but they cannot delete it.
- ReadOnly – Authorised users can read a resource but they cannot edit or delete it.
For this blog post I create a delete lock on one of my SQL Databases.
My overly simplistic advice: lock any production resource which you wouldn’t want accidentally deleted. It won’t prevent a malicious user from doing something catastrophic, but it can prevent the “Oops, I meant to click the thing above this” class of mistake.
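A lock can also be declared in an ARM template rather than through the portal. A minimal sketch follows; the lock name and notes are made up, and note that the ARM representation of the portal's "Delete" lock level is spelled `CanNotDelete`:

```json
{
  "type": "Microsoft.Authorization/locks",
  "apiVersion": "2016-09-01",
  "name": "DoNotDelete",
  "properties": {
    "level": "CanNotDelete",
    "notes": "Prevents accidental deletion of this production SQL Database."
  }
}
```

Putting the lock in the template means it is reapplied on every deployment, so it survives environment rebuilds.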
The Comprehensive R Archive Network (CRAN) is where one can download open-source R and its packages, and it contains lots of information about R.
Microsoft R Open, a fully CRAN-compatible distribution of R built with the Intel MKL for improved performance, can be downloaded here.
One thing I would push a little bit on that list is R Tools for Visual Studio. My default R IDE is still R Studio, but RTVS has made some nice improvements, and it’s worth checking out.
HTAP (Hybrid Transactional/Analytical Processing) describes the capability of a single database to perform both online transaction processing (OLTP) and online analytical processing (OLAP) for real-time operational intelligence. Gartner coined the term in 2014.
In the SQL Server world you can think of it as: In-memory analytics (columnstore) + in-memory OLTP = real-time operational analytics. Microsoft supports this in SQL Server 2016 (see SQL Server 2016 real-time operational analytics).
I’m not completely sold on HTAP yet, particularly once you get to high-scale OLTP systems doing hundreds of thousands of transactions per second. That said, there’s always more and more pressure to get data available for analytics faster and faster.
I have not yet come across a setup where applying compression was not an option. Obviously there is a CPU cost while the backup executes, and it will affect the other tasks running on the server (even if you have your data and backup directories on different drives). But in my experience, the impact is negligible.
This may not be the case with the encryption option, which has a much larger footprint on the server. You should use it with some caution in production; test on smaller subsets of the data if in doubt.
Another thing to keep in mind, as always when dealing with encryption: remember the password. There is no way to retrieve the data other than with the proper password.
My goal is to be able to rebuild any cube from the relational database, but even with that goal in mind, it is smart to have backups.
SQL Server doesn’t really track index creation or modification dates by default.
I say “really”, because SQL Server’s default trace captures things like index create and alter commands. However, the default trace rolls over pretty quickly on most active servers, and it’s rare that you’re looking up the creation date for an index you created five minutes ago.
I think it’s fine that SQL Server doesn’t permanently store the creation date and modification date for most indexes, because not everyone wants this information — so why not make the default as lightweight as possible?
That said, Kendra has several methods for answering the question of when a particular index was created.