
Month: August 2018

Scheduling Jupyter Notebooks

Matthew Seal, et al., explain how they schedule runs of Jupyter notebooks:

On the surface, notebooks pose a lot of challenges: they’re frequently changed, their cell outputs need not match the code, they’re difficult to test, and there’s no easy way to dynamically configure their execution. Furthermore, you need a notebook server to run them, which creates architectural dependencies to facilitate execution. These issues caused some initial push-back internally at the idea. But that has changed as we’ve brought in new tools to our notebook ecosystem.

The biggest game-changer for us is Papermill. Papermill is an nteract library built for configurable and reliable execution of notebooks with production ecosystems in mind. What Papermill does is rather simple. It takes a notebook path and some parameter inputs, then executes the requested notebook with the rendered input. As each cell executes, it saves the resulting artifact to an isolated output notebook.

Papermill does look quite interesting.
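
To make the quoted description concrete, here is a minimal sketch of a parameterized run. The one Papermill call it uses is execute_notebook(input_path, output_path, parameters=...); the helper names, the paths, and the run label are my own assumptions for illustration, not anything from the Netflix post.

```python
from pathlib import PurePosixPath


def output_path_for(input_path, output_dir, run_label):
    """Build the path of the isolated output notebook for one run (hypothetical naming scheme)."""
    stem = PurePosixPath(input_path).stem
    return str(PurePosixPath(output_dir) / f"{stem}_{run_label}.ipynb")


def run_notebook(input_path, output_dir, run_label, parameters):
    """Execute one notebook with the given parameters and return the output notebook path."""
    # Imported lazily so the path helper above works even without Papermill installed.
    import papermill as pm

    output_path = output_path_for(input_path, output_dir, run_label)
    # Papermill injects `parameters` into the notebook and writes the executed
    # copy, with cell outputs, to output_path. The input notebook is left untouched.
    pm.execute_notebook(input_path, output_path, parameters=parameters)
    return output_path


# Hypothetical usage:
# run_notebook("reports/daily.ipynb", "runs", "2018-08-31", {"region": "us"})
```

A scheduler would then only need to call run_notebook with a different label and parameter set per run, which seems to be the appeal.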

Comments closed

A Map Per Row In Power BI

Chris Webb shows an example of including a map per row in a Power BI table:

Since my post last week on using the Google Image Charts API to create sparklines and small multiples in Power BI has proved very popular, I thought I would do a follow-up showing how to use the Azure Maps API to create map small multiples. Here’s an example of what’s possible, a table from a sample report I built that displays crimes committed in London (sourced from here) in June 2018 with one row for each crime and a map column displaying the location of the crime:

Be sure to read Prateek Raina’s warning in the comments, though.


Finding Dependencies In SQL Server Objects

Lori Brown walks us through the sys.dm_sql_referencing_entities and sys.dm_sql_referenced_entities dynamic management functions:

In both cases you can see that sys.sql_expression_dependencies provides the same basic info. But when it comes to tables that are referenced by a stored procedure, sys.sql_expression_dependencies gives you less detail than you can get when using sys.dm_sql_referenced_entities. The biggest difference in both examples is that with sys.sql_expression_dependencies you will get info on cross-database and cross-server dependencies, which can be super helpful.

However, once again I have to sound a note of caution because even sys.sql_expression_dependencies does not catch things referenced in a dynamic SQL string. Sales.Orders is found in dynamic SQL in the [Application].[Configuration_EnableInMemory] and [DataLoadSimulation].[Configuration_ApplyDataLoadSimulationProcedures] stored procedures, but it does not catch this. So far the best way to find objects in dynamic SQL strings that I know of is to check for the object in sys.sql_modules.

Read the whole thing.


Finding A Schema’s Owner

Jack Vamvas shows how to find out which user owns a particular schema in a database:

Question: How can I find the owner of a SQL Server schema? I want to find the owner through a T-SQL solution, rather than looking through the GUI.

Answer: To find a schema owner you can use either the sys.schemas view or information_schema.schemata. Since SQL Server 2005, information_schema.schemata has come into line with sys.schemas. information_schema.schemata returns schemas just from the current database.

Click through for simple examples of both methods.


Capturing UDF CPU Times

Jonathan Kehayias notes an improvement in recent versions of SQL Server:

Microsoft has been enhancing the contents of the ShowplanXML output for SQL Server over the last few releases and in SQL Server 2017 CU3, they introduced user-defined function (UDF) execution statistics into the QueryTimeStats node of the XML output. This was also backported to SQL Server 2016 in Service Pack 2 for actual execution plans. This feature allows you to definitively know the impact of scalar UDF execution as part of the performance characteristics of a query. However, there is an interesting catch associated with using this feature: you have to collect the actual execution plan using an up-to-date version of SQL Server Management Studio or using SentryOne Plan Explorer, or the information will be removed from the execution plan.

This is a good improvement.  Historically, user-defined function costs were hidden in SSMS, as you’d see the cost of a single execution.  This made them look a lot more benign than they actually were.


The Risk Of Data Silos

Gaurav Dhillon argues that data silos are a major impediment to effective use of data:

The greatest stumbling block our respondents identified as hindering their attempts at better utilizing data is one that has existed for some time but seems to have worsened as data volumes have grown – data silos. Only 2 percent of our respondents considered their business to be completely effective at data sharing – for the rest, data silos are a real problem.

The causes for this are numerous, and span inconsistency of systems being used (42 percent), different data formats (38 percent), and a lack of coordinated data strategy (37 percent). On top of this, over a third highlight a lack of technology integration (36 percent) and/or legacy technology barriers (36 percent) as blocking attempts to effectively share data.

My first response is to say that this is in part due to the growth of microservices architecture, which seems to push data siloing.  But at the same time, this has been the case for a long time, so I don’t think it’s either a necessary or a sufficient explanation.


Solving Linear Optimization Problems In R

Mic walks us through a linear optimization problem and solves it with the lpSolve package:

I’m going to implement in R an example of linear optimization that I found in the book “Modeling and Solving Linear Programming with R” by Jose M. Sallan, Oriol Lordan and Vincenc Fernandez.  The example is named “Production of two models of chairs” and can be found at page 57, section 3.5. I’m going to solve only the first point.

The problem text is the following:

A company produces two models of chairs: 4P and 3P. The model 4P needs 4 legs, 1 seat and 1 back. On the other hand, the model 3P needs 3 legs and 1 seat. The company has an initial stock of 200 legs, 500 seats and 100 backs. If the company needs more legs, seats and backs, it can buy standard wood blocks, whose cost is 80 euro per block. The company can produce 10 seats, 20 legs and 2 backs from a standard wood block. The cost of producing the model 4P is 30 euro/chair, meanwhile the cost of the model 3P is 40 euro/chair. Finally, the company informs that the minimum number of chairs to produce is 1000 units per month. Define a linear programming model, which minimizes the total cost (the production costs of the two chairs, plus the buying of new wood blocks).

I remember solving this exact problem (down to the four legs versus three legs bit) in grad school.  We used LINGO to do this, though I haven’t seen that language since.  H/T R-Bloggers
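
As a sanity check on the formulation, here is a small brute-force sketch in Python. It is not lpSolve and not the simplex method. It assumes whole chairs and whole wood blocks, and it assumes the company makes exactly the 1000-chair minimum, since any extra chair only adds cost. The function names are mine.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive b."""
    return -(-a // b)


def blocks_needed(n_4p, n_3p):
    """Fewest wood blocks that cover the legs, seats and backs beyond the initial stock."""
    legs = 4 * n_4p + 3 * n_3p
    seats = n_4p + n_3p
    backs = n_4p  # only the 4P model has a back
    return max(
        0,
        ceil_div(legs - 200, 20),   # 200 legs in stock, 20 legs per block
        ceil_div(seats - 500, 10),  # 500 seats in stock, 10 seats per block
        ceil_div(backs - 100, 2),   # 100 backs in stock, 2 backs per block
    )


def total_cost(n_4p, n_3p):
    """Production cost of both models plus the cost of the purchased blocks, in euro."""
    return 30 * n_4p + 40 * n_3p + 80 * blocks_needed(n_4p, n_3p)


def solve(min_chairs=1000):
    """Try every split of min_chairs between 4P and 3P and keep the cheapest."""
    candidates = (
        (total_cost(n_4p, min_chairs - n_4p), n_4p, min_chairs - n_4p,
         blocks_needed(n_4p, min_chairs - n_4p))
        for n_4p in range(min_chairs + 1)
    )
    return min(candidates)


cost, n_4p, n_3p, blocks = solve()
print(cost, n_4p, n_3p, blocks)  # 48680 420 580 161
```

The 4P chair is cheaper to build but needs more legs and a back, so the search settles at the split where the leg and back constraints meet.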


The Luminance Illusion With gganimate

David Smith highlights an example of the luminance illusion:

Colin created this animation in R using the gganimate package (available on GitHub from author Thomas Lin Pedersen), and the process is delightfully simple. It begins with a chart of 10 “points”, each being the same grey square equally spaced across the shaded background. Then, a simple command animates the transitions from one point to the next, and interpolates between them smoothly:

library(gganimate)
gg_animated <- gg + 
  transition_time(t) + 
  ease_aes('linear')

Check it out, both as a parlor trick and a way of getting a grip on the gganimate package.


Styling In ggplot2

The folks at Jumping Rivers show an example of creating a nice-looking plot with ggplot2:

The changes we’ve made so far would be impossible for any package to do for us – how would the package know the plot title? We can now improve the look and feel of the plot. There are two complementary ways of doing this: scales and themes. The ggplot scales control things like colours and point size. In the latest version of ggplot2, version 3.0.0, the Viridis colour palette was introduced. This palette is particularly useful for creating colour-blind friendly palettes.

g + scale_colour_viridis_d() # d for discrete

With a few lines of code, those default graphs can look a lot nicer.


Formatting Queries As JSON With FOR JSON

Eduardo Pivaral shows off the FOR JSON functionality in SQL Server 2016 and later:

For most real-world applications, JSON AUTO will not give you the control you need over your output format. For more control, you must use the JSON PATH option, along with the ROOT option, as follows:

SELECT TOP 10
 id,
 dataVarchar,
 dataNumeric,
 dataInt,
 dataDate
FROM [dbo].[MyTestTable]
FOR JSON PATH, ROOT('TestTable')

Eduardo has several examples along these lines.
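
For a sense of what that query hands back, here is a rough Python sketch of the output shape: an array of row objects wrapped under the ROOT key, with NULL columns left out (the FOR JSON default unless you add INCLUDE_NULL_VALUES). The sample rows and the function name are made up for illustration; only the column names come from the query above.

```python
import json

# Hypothetical rows standing in for the result of the SELECT above.
rows = [
    {"id": 1, "dataVarchar": "abc", "dataNumeric": 1.5, "dataInt": 10, "dataDate": "2018-08-01"},
    {"id": 2, "dataVarchar": None, "dataNumeric": 2.25, "dataInt": 20, "dataDate": "2018-08-02"},
]


def for_json_path(rows, root):
    """Mimic the shape of FOR JSON PATH, ROOT(root): {root: [row, row, ...]}."""
    # Drop None values to mirror how FOR JSON omits NULL columns by default.
    cleaned = [{k: v for k, v in row.items() if v is not None} for row in rows]
    return json.dumps({root: cleaned}, separators=(",", ":"))


print(for_json_path(rows, "TestTable"))
```

The second row comes out without a dataVarchar key at all, which is worth knowing before you write a consumer that expects every key on every row.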
