Microsoft R Open 3.5.2 and 3.5.3

David Smith announces Microsoft R Open 3.5.2 and reveals when 3.5.3 comes out:

It’s taken a little bit longer than usual, but Microsoft R Open 3.5.2 (MRO) is now available for download for Windows and Linux. This update is based on R 3.5.2, and accordingly fixes a few minor bugs compared to MRO 3.5.1. The main change you will note is that new CRAN packages released since R 3.5.1 can now be used with this version of MRO.

David also lets us know that they’re working on 3.6.0’s release.

Exploratory Data Analysis on Categorical Variables

Giorgio Garziano continues digging into earthquake data:

To understand relationships or dependencies among categorical variables, we take advantage of various types of tables and graphical methods. Stratifying variables can also be included in order to highlight whether the relationship between two primary variables is the same or different for all levels of the stratifying variable under consideration.

Contingency tables are said to be one-way when they involve just one categorical variable. They are said to be two-way when they involve two categorical variables, and so on (N-way).
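As a rough illustration of the one-way versus two-way distinction, here is a minimal Python sketch using only the standard library. The records, the category labels, and the helper names `one_way` and `two_way` are hypothetical and are not taken from Giorgio's post, which works in R.

```python
from collections import Counter

# Hypothetical records: (magnitude_class, depth_class) for each earthquake.
quakes = [
    ("light", "shallow"),
    ("light", "shallow"),
    ("moderate", "shallow"),
    ("moderate", "deep"),
    ("strong", "deep"),
    ("light", "deep"),
]


def one_way(records, index):
    """Count occurrences of each level of a single categorical variable."""
    return Counter(record[index] for record in records)


def two_way(records, row_index, col_index):
    """Count occurrences of each pair of levels for two categorical variables."""
    return Counter((record[row_index], record[col_index]) for record in records)


magnitude_counts = one_way(quakes, 0)   # one-way table on magnitude class
cross_counts = two_way(quakes, 0, 1)    # two-way table: magnitude by depth

for (magnitude, depth), count in sorted(cross_counts.items()):
    print(f"{magnitude:<10}{depth:<10}{count}")
```

An N-way table follows the same pattern with a longer tuple as the key. A stratifying variable is just one more position in that tuple.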

Read on for various techniques for data analysis against categorical variables.

Problems Distributed Systems Experience

RJ Zaworski gives us examples of the types of problems you can run into with distributed systems:

Time limits: ending the neverending
Here’s one to ponder: how long can a long-running action go on before the customer (even a very patient, very digital customer) loses all interest in the outcome?
Pull up a chair. With no upper bound, we could be here a while.
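One way to put an upper bound on a long-running action is to wrap it in a timeout. RJ's post uses JavaScript; the sketch below shows the same idea in Python with `asyncio.wait_for`. The function names and durations are hypothetical stand-ins.

```python
import asyncio


async def long_running_action(seconds):
    """Hypothetical stand-in for a slow remote call."""
    await asyncio.sleep(seconds)
    return "done"


async def run_with_limit(seconds, limit):
    """Give up on the action once `limit` seconds have passed."""
    try:
        return await asyncio.wait_for(long_running_action(seconds), timeout=limit)
    except asyncio.TimeoutError:
        # The pending action is cancelled; report the timeout to the caller.
        return "timed out"


fast = asyncio.run(run_with_limit(0.01, 1.0))   # finishes inside the limit
slow = asyncio.run(run_with_limit(1.0, 0.05))   # exceeds the limit
print(fast, slow)
```

What to do after the timeout (retry, report failure, or check whether the work finished anyway) is a separate design decision that this sketch does not address.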

Read on for more in that vein with JavaScript-y solutions.

Deploying SSIS Packages with PowerShell

Aaron Nelson shows us how we can deploy an Integration Services ISPAC into the SSIS catalog with PowerShell:

In my last post, I showed how you can use the SSIS PowerShell Provider to execute an SSIS package with PowerShell.  Of course, in order to execute that SSIS package, it has to get deployed first.  In Part 5 of Andy Leonard’s “SSIS, Docker, and Windows Containers” series he used some PowerShell code from Matt Masson’s blog post to deploy an .ISPAC file to the SSIS catalog.

Click through for the code.

Figuring Out SSIS Memory Requirements

Tim Mitchell tries to give us a better answer for SSIS memory requirements than “all of it and then some”:

When planning for memory needs, it is critical to understand how SQL Server Integration Services uses memory. SSIS will allocate memory from the unallocated system memory for each package executed, and surrenders that memory shortly after the package completes its execution. The memory allocated for SSIS package executions runs in the SSIS execution runtime process (ISServerExec.exe, if you are executing the package from the SSIS catalog).

Here’s where the package design has a significant impact on memory use. If a package uses an SSIS data flow, all of the data passing through that data flow is written to memory used by SSIS. For example, consider a package that loads 10 million rows from a flat file to a table. In this case, all 10 million rows will pass through the SSIS memory space during package execution.
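To get a feel for the scale of the 10 million row example, a back-of-envelope calculation helps. In this sketch the average row width of 200 bytes is purely an assumption for illustration and does not come from Tim's post. The result is the total volume passing through the data flow, not a peak-memory figure, because the data flow generally works through rows in buffers rather than holding the whole set at once.

```python
def data_flow_volume_bytes(row_count, avg_row_bytes):
    """Total bytes that pass through the data flow for a given row count."""
    return row_count * avg_row_bytes


rows = 10_000_000          # row count from the flat file example
assumed_row_bytes = 200    # assumption: average row width, for illustration only

total_bytes = data_flow_volume_bytes(rows, assumed_row_bytes)
total_gib = total_bytes / 1024 ** 3

print(f"{total_bytes:,} bytes, about {total_gib:.2f} GiB")
```

Swapping in a measured row width for your own source gives a more realistic starting point for a sizing conversation.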

Read on as Tim goes into good detail on the topic.

What Makes for Good Coding Style

Brent Yorgey spends some time thinking about good coding style:

What is good code style? You probably have some opinions about this. In fact, I’m willing to bet you might even have some very strong opinions about this; I know I do. Whether consciously or not, we tend to frame good coding practices as a moral issue. Following good coding practices makes us feel virtuous; ignoring them makes us feel guilty. I can guess that this is why Yom said “I don’t think I could bring myself to be satisfied with partial functions” [emphasis added]. And this is why we say “good code style”, not “optimal” or “rational” or “best practice” code style.

It’s an interesting post. There are some bits on competitive programming which don’t apply in general, but there’s a lot to unpack there.

Singular or Plural Table Names

Ed Elliott kicks a hornet’s nest:

There is a lot of confusion when it comes to designing tables in SQL Server around whether to pluralize names or not. How do you choose whether to pluralize or not?

If we want to store a list of people and their details do we use “Person”, “Persons”, “People” or “Peoples”? Some people will use “People” and some will use “Person”, other persons or people would go for “Peoples” or “Persons”.

My preference is singular. In the event that I do pluralize a table, I use a grammatically correct pluralization. None of this “childs” nonsense.

Troubleshooting Database Compatibility Levels

Randolph West tells a tale about checking compatibility levels:

In that demo, the AdventureWorks sample database was initially set to compatibility level of 140 (SQL Server 2017 default compatibility) to execute a scalar UDF. At this point, the estimated execution plan showed that the UDF was given a cost of 0%, and performance was terrible (the expected behaviour). Then the database compatibility level was switched to 150 (which is all that’s required to enable this new optimization feature), the query was executed again, the UDF was inlined, and performance improved dramatically.

This is where it got interesting. As a test, the compatibility level of the database was set back to 140, but the query plan continued to inline the UDF. Curious. Flushing the plan cache didn’t change the outcome (even though we knew it wasn’t necessary). Had we discovered a bug in a preview version of SQL Server 2019? It was CTP 2.2 after all, and since then (at the time of this writing) CTP 2.5 is already available.

Read on for the answer.

Data Cleansing Options with Azure

James Serra explains when to use different Azure services for data cleansing:

Clean the data and optionally aggregate it as it sits in the source system. The tool used for this would depend on the source system that stores the data (e.g. if SQL Server, you would use stored procedures). The only benefit with this option is if you aggregate the data, you will move less data from the source system to Azure, which can be helpful if you have a small pipe to Azure and don’t need the row-level details. The disadvantages are: the raw source data is not available in the data lake, so you would always need to go back to the source system if you needed to get it again, and it may not even still exist in the source system; you would put extra stress on the source system when doing the cleaning, which could affect end users using the system; it could take a long time to clean the data as the source system may not have fast performance; and you would not be able to use other tools (e.g. Hadoop, Databricks) to clean it. Strongly advise against this option.

Read on for additional options and James’s recommendations.


May 2019