Press "Enter" to skip to content

Day: December 6, 2016

Constrained Delegation

Regis Baccaro shows how to allow non-domain admins to configure Kerberos Constrained Delegation:

Now I need to add some special permissions to computer objects, so I click Add again. Once again, I’ll select the DBA group, then I need to switch to Descendant Computer objects. I click Write and then scroll down until I see Validated write to service principal name. I’ll click the box to enable it, and then OK, OK, and OK.

The end result looks like this:

Two permissions for the DBA group:

  • All descendant objects: Write all properties

  • Descendant computer objects: Validated write to service principal name

Regis has the whole process documented well, so check it out.


Avoiding Statistical Mistakes

Adrian Sampson explains some common mistakes in statistical analysis, particularly in computer science papers:

It’s tempting to think, when p ≥ α, that you’ve found the opposite thing from the p < α case: that you get to conclude that there is no statistically significant difference between the two averages. Don’t do that!

Simple statistical tests like the t-test only tell you when averages are different; they can’t tell you when they’re the same. When they fail to find a difference, there are two possible explanations: either there is no difference or you haven’t collected enough data yet. So when a test fails, it could be your fault: if you had run a slightly larger experiment with a slightly larger N, the test might have successfully found the difference. It’s always wrong to conclude that the difference does not exist.

It’s an interesting read.  H/T Emmanuelle Rieuf.
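
To make the failure mode concrete, here’s a quick sketch in Python (my own illustration, not from Adrian’s post): the same real difference is invisible to a t-test at a small sample size but shows up at a larger one.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two populations whose true means really do differ, by a small amount.
def sample(n):
    a = rng.normal(loc=10.0, scale=2.0, size=n)
    b = rng.normal(loc=10.5, scale=2.0, size=n)
    return a, b

# With a small sample, the t-test will often fail to find the difference...
a, b = sample(20)
print("n = 20:   p = %.3f" % stats.ttest_ind(a, b).pvalue)

# ...but the difference is real, and a larger sample reveals it.
a, b = sample(2000)
print("n = 2000: p = %.6f" % stats.ttest_ind(a, b).pvalue)

# A large p-value at n = 20 means "not enough evidence," not "no difference."
```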


Log Aggregation With Kafka And Redis

Asaf Yigal has a two-part series comparing Apache Kafka and Redis for moving log events into Elasticsearch.  Part 1 explains the technologies:

Redis is a bit different from Kafka in terms of its storage and various functionalities. At its core, Redis is an in-memory data store that can be used as a high-performance database, a cache, and a message broker. It is perfect for real-time data processing.

The various data structures supported by Redis are strings, hashes, lists, sets, and sorted sets. Redis also has various clients written in several languages which can be used to write custom programs for the insertion and retrieval of data. This is an advantage over Kafka since Kafka only has a Java client. The main similarity between the two is that they both provide a messaging service. But for the purpose of log aggregation, we can use Redis’ various data structures to do it more efficiently.
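
As a rough illustration of that idea (my sketch, not from the series; the key name and event fields are made up), a Redis list can act as a simple log buffer: shippers LPUSH events onto one end and an indexer pops them off the other.

```python
import json
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)

# Shipper side: push each log event onto the head of a list.
event = {"ts": "2016-12-06T10:15:00Z", "level": "ERROR", "msg": "disk full"}
r.lpush("logs:app1", json.dumps(event))

# Indexer side: block until an event arrives, pop it from the tail,
# and hand it off to Elasticsearch (the indexing call is omitted here).
while True:
    item = r.brpop("logs:app1", timeout=5)
    if item is None:
        break  # queue drained
    _key, raw = item
    doc = json.loads(raw)
    # es.index(index="logs", body=doc)  # hypothetical Elasticsearch call
    print(doc["level"], doc["msg"])
```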

Part 2 compares the two technologies and explains which works better when:

Kafka heavily relies on the machine memory (RAM). As we see in the previous graph, utilizing the memory and storage is an optimal way to maintain a steady throughput. Its performance depends on the data consumption rate. In the case that consumers don’t consume data fast enough, Kafka will have to read from a disk and not from memory which will slow down its performance.

As you might expect, the answer for which technology to use is “it depends.”
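
One way to watch for the slow-consumer situation Asaf describes is to measure consumer lag: the gap between the end of the log and the consumer’s current position. Here’s a rough sketch with the kafka-python client (broker address, topic, and group names are all hypothetical):

```python
from kafka import KafkaConsumer, TopicPartition  # pip install kafka-python

consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",
    group_id="log-indexer",        # hypothetical consumer group
    enable_auto_commit=False,
)

tp = TopicPartition("app-logs", 0)  # hypothetical topic, partition 0
consumer.assign([tp])

# Compare the end of the log with where this consumer currently is.
end_offset = consumer.end_offsets([tp])[tp]
position = consumer.position(tp)

lag = end_offset - position
print("consumer lag: %d messages" % lag)
# A steadily growing lag means consumers aren't keeping up -- the point at
# which Kafka starts serving reads from disk rather than from memory.
```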


Apache Ranger On Elastic MapReduce

Varun Rao explains role-based access control using Apache Ranger on Amazon Elastic MapReduce:

Using the HUE SQL Editor, execute the following query.

These queries use external tables, and Hive leverages EMRFS to access the data stored in S3. Because HiveServer2 (where Hue is submitting these queries) is checking with Ranger to grant or deny before accessing any data in S3, you can create fine-grained SQL-based permissions for users even though there is a single EC2 role specified for the cluster (which is used by all requests the cluster makes to S3). For more information, see Additional Features of Hive on Amazon EMR.

If your job includes securing a Hadoop cluster, this is a nice read, even if you don’t use EMR.
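
To see the enforcement point in action, you could run the same query against HiveServer2 as two different users and watch Ranger allow one and deny the other. A minimal sketch using the pyhive client (the host, usernames, and table are all hypothetical, not from Varun’s walkthrough):

```python
from pyhive import hive  # pip install "pyhive[hive]"

def run_as(username):
    # HiveServer2 consults Ranger before touching the S3-backed table.
    conn = hive.Connection(host="emr-master.example.com", port=10000,
                           username=username)
    cursor = conn.cursor()
    try:
        # Hypothetical external table backed by S3 via EMRFS.
        cursor.execute("SELECT * FROM sales_external LIMIT 10")
        print(username, "->", len(cursor.fetchall()), "rows")
    except Exception as err:
        print(username, "-> denied:", err)
    finally:
        conn.close()

run_as("analyst1")  # covered by a Ranger policy: rows come back
run_as("intern1")   # no policy grants access: an authorization error instead
```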


Cleaning Up SSISDB

Peter Schott extends a script to clean up the SSIS catalog database:

I really appreciate what MS has done w/ the SSIS Catalog. We have built-in logging at a level that wasn’t possible in prior releases, but that comes at a cost. The default retention is pretty high (365 days) and MS chose to handle cleanup using cascading deletes. This sort of coding makes life easier for the developers, but almost always performs poorly. That’s especially the case when you have one parent row with hundreds of thousands of child rows related to it.  The problem is compounded because the first time people realize they need to lower the retention limit is about the same time that the database is starting to fill up or has filled up. At that point, it’s too late to lower the retention by a large number because the corresponding delete will still cause issues.

Click through for a script which helps extricate you from sticky situations.  The ideal scenario here would be to set your retention period correctly and not have to delete rows directly, but sometimes you’re stuck in a less-than-ideal situation.
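
If you do have to dig yourself out manually, the general technique is to delete old rows in small batches so that no single cascading delete gets out of hand. Here’s a rough sketch of that approach in Python with pyodbc (the connection string, cutoff, and batch size are illustrative; Peter’s T-SQL script is the real reference, and anything that touches SSISDB’s internal tables deserves testing first):

```python
import pyodbc  # pip install pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=.;DATABASE=SSISDB;"
    "Trusted_Connection=yes;")
cursor = conn.cursor()

# Delete operations older than 30 days a small batch at a time, so each
# cascading delete stays cheap and the transaction log doesn't balloon.
batch_delete = """
DELETE TOP (100) FROM internal.operations
WHERE created_time < DATEADD(DAY, -30, SYSDATETIMEOFFSET());
"""
while True:
    cursor.execute(batch_delete)
    conn.commit()            # keep each transaction small
    if cursor.rowcount == 0:
        break                # nothing older than the cutoff remains

conn.close()
```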


Data Flow Sequence Containers

Todd McDermid is excited about data flow groups in Integration Services:

Data Flow Groups

Data Flow Groups is what they’re calling it, and it’s deceptively simple to use.  One of the reasons I’m sure I (and SSIS people I talk to who DID NOT LET ME KNOW IT WAS THERE) missed it is because I was expecting it to be a component in the toolbox.  Not so.
Code up your Data Flow as you normally would.  Then go and select the components that you want to group together – via clicking and dragging a selection window, or click-selecting components.  Any component combinations you want.  Then right-click and select Group.

I admit that I didn’t know it existed either.  This does seem rather useful.


Multivariate Analysis In R

Mala Mahadevan looks at using R to describe data sets with two explanatory variables:

From the plot we can see that type 3 trees have the smallest circumference while type 4 have the largest, with type 2 close to type 4. We can also see that type 1 trees have the thinnest dispersion of circumference while type 4 has the highest, closely followed by type 2.  We can also see that there are no significant outliers in this data.

Understanding whether variables are categorical or continuous is vital to understanding what you can and should do with them.
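
As a numeric companion to that kind of plot reading (a Python sketch of my own, not Mala’s R code, with made-up stand-in data), grouped summary statistics tell the same story: center and spread per category.

```python
import pandas as pd

# Hypothetical stand-in for the trees data: a categorical type and a
# continuous circumference measurement.
df = pd.DataFrame({
    "type": ["1", "1", "2", "2", "3", "3", "4", "4"],
    "circumference": [30, 33, 58, 69, 25, 27, 62, 75],
})

# The median shows which type runs largest; the standard deviation shows
# the dispersion that a boxplot's box and whiskers convey visually.
summary = df.groupby("type")["circumference"].agg(["median", "std", "min", "max"])
print(summary)
```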


Custom Functions In Power BI Desktop

Reza Rad explains custom functions:

Benefits of Custom Functions

  • Re-Use of Code
  • Increasing Consistency
  • Reducing Redundancy

With a custom function, you are able to re-use a query multiple times. If you want to change part of it, there is only one place to make that change instead of in multiple copies. You can call the function from anywhere in your code, and you reduce redundant steps, which normally cause extra maintenance of the code.

I like Reza’s example of reading from a holidays table, as it’s easy enough to follow without being so trivial that it leaves you to wonder what the real value is.
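
Reza’s examples are written in Power Query’s M language, but the reuse principle translates to any language. A trivial Python rendering of his holidays idea (the function and data here are mine, not his): define the lookup once, then call it from every query that needs it.

```python
import datetime

# Define the holiday-lookup logic exactly once...
def is_holiday(date, holidays):
    """Return True when the date appears in the holidays table."""
    return date in holidays

# ...then reuse it from any dataset that needs the flag.
holidays = {datetime.date(2016, 12, 25), datetime.date(2017, 1, 2)}

orders = [datetime.date(2016, 12, 6), datetime.date(2016, 12, 25)]
print([(d, is_holiday(d, holidays)) for d in orders])
# Changing the holiday logic happens in one place, not in every query.
```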
