HDF 3.2 Updates

Dinesh Chandrasekhar walks us through some of the updates to Hortonworks Data Flow version 3.2:

Kerberos keytab isolation
Kerberos keytabs can now be isolated at a per principal level. This allows for users in a multi-tenant environment to safely be able to reference specific keytabs and principals. This ensures that just because a user has access to a HDFS keytab they will not have access to all of the HDFS principals. This provides a more granular control so that users are limited to only the principals they require.

Kafka 1.1.1 Support
In HDF 3.2, Kafka has been upgraded from 1.0.0 to 1.1.1. Key features and improvements have been added with respect to security and governance. In addition to these bug fixes, an important new feature was added to capture producer and topic metrics at partition level without instrumenting or configuring interceptors on the clients. This provides a non-invasive approach to capture important metrics for producers without refactoring/modifying your existing Kafka clients

Hive 3 support
Apache NiFi now supports Hive 3 running on HDP 3.0. This support ensures better performance for Hive streaming to HDP, Hive streaming to S3, and the ability to write directly to ORC from NiFi without first converting your datasets to Avro. Writing directly to ORC for better Hive query performance is accomplished by using the NiFi PutORC processor. With HDF 3.2, a few other processors related to HBase and HDFS have also been updated and enhanced.

Looks like there are some good updates to this version.

Using The glue Package In R

Kevin Feasel

2018-08-24

R

Evgeni Chasnovski shows the glue package and also works around some trickiness with NULL:

Recently, fate lead me to try using {glue} in a package. I was very pleased to how it makes code more readable, which I believe is a very important during package development. However, I stumbled upon this pretty unexpected behavior:

y <- NULL
paste("I have", x, "apples and", y, "oranges.")
## [1] "I have 10 apples and oranges."
str(glue("I have {x} apples and {y} oranges."))
## Classes 'glue', 'character' chr(0)

If one of the expressions is evaluated into NULL then the output becomes empty string.

glue reminds me of string formatting in .NET languages.  On the whole, that’s a good thing.

Window Functions Have Defaults, Too

Steve Jones reminds us that when running a window function, there is a default window in place:

What I want to do is compare the passing yards each year with the most current value for that player, showing the plus or minus. This means that for Aaron Rodgers, who threw for 1675 yards in 2017, I’d want to show this for the first few years of his career:

This shows me an easy view of the years where he was better in his career than he is now. Last year was likely a down year because of injury, but we’ll see this year.

In any case, if I run this query using LAST_VALUE() for the final year of his career, I don’t get the right results.

It’s good to keep in mind the full syntax for a window function for just this reason.

Scheduling Jupyter Notebooks

Matthew Seal, et al, explain how they schedule runs of Jupyter notebooks:

On the surface, notebooks pose a lot of challenges: they’re frequently changed, their cell outputs need not match the code, they’re difficult to test, and there’s no easy way to dynamically configure their execution. Furthermore, you need a notebook server to run them, which creates architectural dependencies to facilitate execution. These issues caused some initial push-back internally at the idea. But that has changed as we’ve brought in new tools to our notebook ecosystem.

The biggest game-changer for us is Papermill. Papermill is an nteract library built for configurable and reliable execution of notebooks with production ecosystems in mind. What Papermill does is rather simple. It take a notebook path and some parameter inputs, then executes the requested notebook with the rendered input. As each cell executes, it saves the resulting artifact to an isolated output notebook.

Papermill does look quite interesting.

A Map Per Row In Power BI

Chris Webb shows an example of including a map per row in a Power BI table:

Since my post last week on using the Google Image Charts API to create sparklines and small multiples in Power BI has proved very popular, I thought I would do a follow-up showing how to use the Azure Maps API to create map small multiples. Here’s an example of what’s possible, a table from a sample report I built that displays crimes committed in London (sourced from here) in June 2018 with one row for each crime and a map column displaying the location of the crime:

Be sure to read Prateek Raina’s warning in the comments, though.

Finding Dependencies In SQL Server Objects

Lori Brown walks us through the sys.dm_sql_referencing_entities and referenced entities DMVs:

In both cases you can see that sys.sql_expression_dependencies provides the same basic info.  But when it comes to tables are referenced by a stored procedure, sys.sql_expression_dependencies gives you less detail than you can get when using sys.dm_sql_referenced_entities.  The biggest difference in both examples is that with sys.sql_expression_dependencies you will get info on cross-database and cross-server dependencies which can be super helpful.

However, once again I have to sound a note of caution because even sys.sql_expression_dependencies does not catch things referenced in a dynamic SQL string.  Sales.Orders is found in dynamic SQL in the, [Application].[Configuration_EnableInMemory] & [DataLoadSimulation].[Configuration_ApplyDataLoadSimulationProcedures] stored procedures but it does not catch this.  So far the best way to find objects in dynamic SQL strings that I know of is to check for the object in sys.sql_modules.

Read the whole thing.

Finding A Schema’s Owner

Jack Vamvas shows how to find out which user owns a particular schema in a database:

Question: How can I find the owner of a SQL Server schema ?   I want o find the owner through a t-sql solultion , rather than looking through the GUI.

Answer: To find a schema owner you can use either sys.schema view or the information_schema.schemata. Since SQL 2005,  information_schema.schemata has come into line with sys.schema.     information_schema.schemata returns schemas just from the current database.

Click through for simple examples of both methods.

Capturing UDF CPU Times

Jonathan Kehayias notes an improvement in recent versions of SQL Server:

Microsoft has been enhancing the contents of the ShowplanXML output for SQL Server over the last few releases and in SQL Server 2017 CU3, they introduced user-defined function (UDF) execution statistics into the QueryTimeStats node of the XML output. This was also back ported to SQL Server 2016 in Service Pack 2 for actual execution plans. This feature allows you to definitively know the impact of scalar UDF execution as part of the performance characteristics of a query. However, there is an interesting catch associated with using this feature; you have to collect the actual execution plan using an up to date version of SQL Server Management Studio or using SentryOne Plan Explorer, or the information will be removed from the execution plan.

This is a good improvement.  Historically, user-defined function costs were hidden in SSMS, as you’d see the cost of a single execution.  This made them look a lot more benign than they actually were.

Categories

August 2018
MTWTFSS
« Jul Sep »
 12345
6789101112
13141516171819
20212223242526
2728293031