Press "Enter" to skip to content

Author: Kevin Feasel

HDF 3.2 Updates

Dinesh Chandrasekhar walks us through some of the updates to Hortonworks Data Flow version 3.2:

Kerberos keytab isolation
Kerberos keytabs can now be isolated at a per-principal level. This allows users in a multi-tenant environment to safely reference specific keytabs and principals, and ensures that a user with access to an HDFS keytab does not automatically have access to all of the HDFS principals. This provides more granular control, limiting users to only the principals they require.

Kafka 1.1.1 Support
In HDF 3.2, Kafka has been upgraded from 1.0.0 to 1.1.1. Key features and improvements have been added with respect to security and governance. In addition, an important new feature was added to capture producer and topic metrics at the partition level without instrumenting or configuring interceptors on the clients. This provides a non-invasive approach to capturing important metrics for producers without refactoring or modifying your existing Kafka clients.

Hive 3 support
Apache NiFi now supports Hive 3 running on HDP 3.0. This support ensures better performance for Hive streaming to HDP, Hive streaming to S3, and the ability to write directly to ORC from NiFi without first converting your datasets to Avro. Writing directly to ORC for better Hive query performance is accomplished by using the NiFi PutORC processor. With HDF 3.2, a few other processors related to HBase and HDFS have also been updated and enhanced.

Looks like there are some good updates to this version.

Using The glue Package In R

Evgeni Chasnovski shows the glue package and also works around some trickiness with NULL:

Recently, fate led me to try using {glue} in a package. I was very pleased with how it makes code more readable, which I believe is very important during package development. However, I stumbled upon this pretty unexpected behavior:

library(glue)

x <- 10
y <- NULL
paste("I have", x, "apples and", y, "oranges.")
## [1] "I have 10 apples and oranges."
str(glue("I have {x} apples and {y} oranges."))
## Classes 'glue', 'character' chr(0)

If one of the expressions evaluates to NULL, then the whole output becomes a zero-length character vector.
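One way to work around this, as a minimal sketch of my own (not necessarily the fix from the post), is to coalesce NULL to a placeholder inside the braces before glue interpolates it, using a null-coalescing helper like the one rlang provides:

library(glue)

# Null-coalescing helper; rlang (and recent base R) provide the same operator
`%||%` <- function(x, y) if (is.null(x)) y else x

x <- 10
y <- NULL
# Coalescing inside the braces keeps glue from collapsing the result to character(0)
glue("I have {x} apples and {y %||% 0} oranges.")
## I have 10 apples and 0 oranges.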

glue reminds me of string formatting in .NET languages.  On the whole, that’s a good thing.

Window Functions Have Defaults, Too

Steve Jones reminds us that when running a window function, there is a default window in place:

What I want to do is compare the passing yards each year with the most current value for that player, showing the plus or minus. This means that for Aaron Rodgers, who threw for 1675 yards in 2017, I’d want to show this for the first few years of his career:

This shows me an easy view of the years where he was better in his career than he is now. Last year was likely a down year because of injury, but we’ll see this year.

In any case, if I run this query using LAST_VALUE() for the final year of his career, I don’t get the right results.

It’s good to keep in mind the full syntax for a window function for just this reason.

Scheduling Jupyter Notebooks

Matthew Seal, et al, explain how they schedule runs of Jupyter notebooks:

On the surface, notebooks pose a lot of challenges: they’re frequently changed, their cell outputs need not match the code, they’re difficult to test, and there’s no easy way to dynamically configure their execution. Furthermore, you need a notebook server to run them, which creates architectural dependencies to facilitate execution. These issues caused some initial push-back internally at the idea. But that has changed as we’ve brought in new tools to our notebook ecosystem.

The biggest game-changer for us is Papermill. Papermill is an nteract library built for configurable and reliable execution of notebooks with production ecosystems in mind. What Papermill does is rather simple. It takes a notebook path and some parameter inputs, then executes the requested notebook with the rendered input. As each cell executes, it saves the resulting artifact to an isolated output notebook.

Papermill does look quite interesting.

A Map Per Row In Power BI

Chris Webb shows an example of including a map per row in a Power BI table:

Since my post last week on using the Google Image Charts API to create sparklines and small multiples in Power BI has proved very popular, I thought I would do a follow-up showing how to use the Azure Maps API to create map small multiples. Here’s an example of what’s possible, a table from a sample report I built that displays crimes committed in London (sourced from here) in June 2018 with one row for each crime and a map column displaying the location of the crime:

Be sure to read Prateek Raina’s warning in the comments, though.

Finding Dependencies In SQL Server Objects

Lori Brown walks us through the sys.dm_sql_referencing_entities and sys.dm_sql_referenced_entities DMVs:

In both cases you can see that sys.sql_expression_dependencies provides the same basic info.  But when it comes to which tables are referenced by a stored procedure, sys.sql_expression_dependencies gives you less detail than you can get when using sys.dm_sql_referenced_entities.  The biggest difference in both examples is that with sys.sql_expression_dependencies you will get info on cross-database and cross-server dependencies, which can be super helpful.

However, once again I have to sound a note of caution, because even sys.sql_expression_dependencies does not catch things referenced in a dynamic SQL string.  Sales.Orders is found in dynamic SQL in the [Application].[Configuration_EnableInMemory] and [DataLoadSimulation].[Configuration_ApplyDataLoadSimulationProcedures] stored procedures, but it does not catch this.  So far, the best way I know of to find objects in dynamic SQL strings is to check for the object in sys.sql_modules.

Read the whole thing.

Finding A Schema’s Owner

Jack Vamvas shows how to find out which user owns a particular schema in a database:

Question: How can I find the owner of a SQL Server schema?  I want to find the owner through a T-SQL solution, rather than looking through the GUI.

Answer: To find a schema owner, you can use either the sys.schemas catalog view or information_schema.schemata. Since SQL 2005, information_schema.schemata has come into line with sys.schemas. information_schema.schemata returns schemas just from the current database.

Click through for simple examples of both methods.

Capturing UDF CPU Times

Jonathan Kehayias notes an improvement in recent versions of SQL Server:

Microsoft has been enhancing the contents of the ShowplanXML output for SQL Server over the last few releases, and in SQL Server 2017 CU3, they introduced user-defined function (UDF) execution statistics into the QueryTimeStats node of the XML output. This was also backported to SQL Server 2016 in Service Pack 2 for actual execution plans. This feature allows you to definitively know the impact of scalar UDF execution as part of the performance characteristics of a query. However, there is an interesting catch associated with using this feature: you have to collect the actual execution plan using an up-to-date version of SQL Server Management Studio or SentryOne Plan Explorer, or the information will be removed from the execution plan.

This is a good improvement.  Historically, user-defined function costs were hidden in SSMS, as you’d see the cost of a single execution.  This made them look a lot more benign than they actually were.

The Risk Of Data Silos

Gaurav Dhillon argues that data silos are a major impediment to effective use of data:

The greatest stumbling block our respondents identified as hindering their attempts at better utilizing data is one that has existed for some time but seems to have worsened as data volumes have grown – data silos. Only 2 percent of our respondents considered their business to be completely effective at data sharing – for the rest, data silos are a real problem.

The causes for this are numerous, and span inconsistency of systems being used (42 percent), different data formats (38 percent), and a lack of coordinated data strategy (37 percent). On top of this, over a third highlight a lack of technology integration (36 percent) and/or legacy technology barriers (36 percent) as blocking attempts to effectively share data.

My first response is to say that this is in part due to the growth of microservices architecture, which seems to push data siloing.  But at the same time, this has been the case for a long time, so I don’t think it’s either a necessary or a sufficient explanation.

Solving Linear Optimization Problems In R

Mic walks us through a linear optimization problem and solves it with the lpSolve package:

I’m going to implement in R an example of linear optimization that I found in the book “Modeling and Solving Linear Programming with R” by Jose M. Sallan, Oriol Lordan and Vincenc Fernandez.  The example is named “Production of two models of chairs” and can be found at page 57, section 3.5. I’m going to solve only the first point.

The problem text is the following:

A company produces two models of chairs: 4P and 3P. The model 4P needs 4 legs, 1 seat and 1 back. On the other hand, the model 3P needs 3 legs and 1 seat. The company has an initial stock of 200 legs, 500 seats and 100 backs. If the company needs more legs, seats and backs, it can buy standard wood blocks, whose cost is 80 euro per block. The company can produce 10 seats, 20 legs and 2 backs from a standard wood block. The cost of producing the model 4P is 30 euro/chair, while the cost of the model 3P is 40 euro/chair. Finally, the company states that the minimum number of chairs to produce is 1000 units per month. Define a linear programming model which minimizes the total cost (the production costs of the two chairs, plus the buying of new wood blocks).
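As a rough sketch of how this might be set up with lpSolve (the variable names and ordering below are my own, not necessarily the formulation from the book or the post), with decision variables for the number of 4P chairs, 3P chairs, and wood blocks to buy:

library(lpSolve)

# Decision variables: x1 = 4P chairs, x2 = 3P chairs, b = wood blocks bought
f.obj <- c(30, 40, 80)             # 30 euro per 4P chair, 40 per 3P chair, 80 per block

f.con <- rbind(
  c(4, 3, -20),                    # legs: 4P uses 4, 3P uses 3, each block adds 20
  c(1, 1, -10),                    # seats: each chair uses 1, each block adds 10
  c(1, 0,  -2),                    # backs: only the 4P model needs one, each block adds 2
  c(1, 1,   0)                     # total chairs produced
)
f.dir <- c("<=", "<=", "<=", ">=")
f.rhs <- c(200, 500, 100, 1000)    # stock of legs, seats, backs; minimum chairs per month

sol <- lp("min", f.obj, f.con, f.dir, f.rhs, all.int = TRUE)
sol$solution                       # chairs of each model and blocks to buy
sol$objval                         # minimum total cost

Each <= row restates a stock limit with the per-block contribution moved to the left-hand side, and the final >= row enforces the 1000-chair minimum.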

I remember solving this exact problem (down to the four legs versus three legs bit) in grad school.  We used LINGO to do this, though I haven’t seen that language since.  H/T R-Bloggers
