Press "Enter" to skip to content

Day: July 9, 2018

Reading JSON Data Using httr

Akshay Mahale shows us how to use a few tidyverse-friendly packages to read JSON data from web endpoints:

When it comes to R to consume such APIS we focus majorly on the package below:

  • httr This package takes it very seriously when we have to work with We data by exposing some very useful functions. It provides us with HTTP client to access APIS with GET/POST methods, passing query parameters, verifying fetched response wrt to data format and if error-free.

  • jsonlite In order to convert received JSON response to readable R Object or a data frame, jsonlite helps to convert json to R object and vice versa.

  • rlist To perform some additional manipulation on data structure of received JSON response rlist exposes some important methods list.select and list.stack. This methods are useful to get parsed json data into a tibble.

Read on for an example extracting data from a web endpoint.

Comments closed

Dealing With Heteroskedasticity

Bruno Rodrigues explains the notion of heteroskedasticity and shows ways of dealing with this issue in a linear regression:

This test shows that we can reject the null that the variance of the residuals is constant, thus heteroskedacity is present. To get the correct standard errors, we can use the vcovHC() function from the {sandwich} package (hence the choice for the header picture of this post):

lmfit %>% vcovHC() %>% diag() %>% sqrt()
## (Intercept) regionnortheast regionsouth regionwest
## 311.31088691 25.30778221 23.56106307 24.12258706
## residents young_residents per_capita_income
## 0.09184368 0.68829667 0.02999882

By default vcovHC() estimates a heteroskedasticity consistent (HC) variance covariance matrix for the parameters. There are several ways to estimate such a HC matrix, and by default vcovHC() estimates the “HC3” one. You can refer to Zeileis (2004) for more details.

We see that the standard errors are much larger than before! The intercept and regionwest variables are not statistically significant anymore.

The biggest problem with heteroskedasticity is that it can introduce bias in error terms.  That’s not the end of the world, but if the level of heteroskedasticity is serious enough, we want to find ways to account for it.  H/T R-Bloggers.

Comments closed

Building A Gantt Chart With ggplot2

Sebastian Sauer shows us how to build a gantt chart in R:

Of importance are only TaskPrevious_Evnet and Duration. In addition, we need an overall start date (“2019-03-01” in this case). Each subsequent task is assumed to follow neatly its predecessing event.

Our job is to compute the start date and end date of task given that we know the initial start date and the durations. As said, this procedure is based on the assumption that there is a frictionless and gapless sequence of tasks.

Read on for a code-heavy example.  I’ve always had a soft spot in my heart for gantt charts.

Comments closed

Configuring MSDTC

Ryan Adams gives us the skinny on the Microsoft Distributed Transaction Coordinator:

MSDTC configuration is not as straight forward as you might think.  It’s a different choice if you are using a local MSDTC, clustered MSDTC, on-premises, Azure, Failover Cluster, or Availability Group.  Every one of those variables leads you down a different path with different choices.  The goal of this article is to clarify those choices.

If you think you do not use the MSDTC, you better check again.  I frequently see folks using it that do not realize it.  The most common is Linked Servers.  You got it.  If you are using Linked Servers, you’re using the MSDTC.

You are no longer required to install an MSDTC with a SQL Failover Cluster starting in SQL Server 2008.  That does not mean you do not need an MSDTC configuration.  If you wanted to make your application database highly available, but your application did not use the MSDTC you were being required to install one anyway.  That limitation is what was removed, not that you simply didn’t need a clustered MSDTC.

Read the whole thing.

Comments closed

Using Wireshark

Julia Evans explains how she uses Wireshark:

Usually I use Wireshark to debug networking problems in production. My Wireshark workflow is:

  1. Capture packets with tcpdump (typically something like sudo tcpdump port 443 -w output.pcap
  2. scp the pcap file to my laptop (scp host:~/output.pcap .)
  3. Open the pcap file in Wireshark (wireshark output.pcap)

That’s pretty simple! But once you have a pcap file with a bunch of packets on your laptop, what do you do with it?

Wireshark is my go-to tool for diagnosing networking issues.

Comments closed

SQL Server 2008 (And R2) Nearing End Of Support

The SQL Server team reminds us that those SQL Server 2008 and 2008 R2 instances are nearing end of support:

SQL Server 2008 and 2008 R2 have had a tremendous run. But all good things come to an end, right? On July 9, 2019, Microsoft will end Extended Support, which means no more updates or support of any kind, potentially leaving you vulnerable to security and compliance issues.

The good news is, you still have plenty of time and options to avoid any heartburn caused by the technology “circle of life.” And we’ll lay out all of those options for you in a webinar on July 12.

Non-contrarian opinion:  get those old SQL Server instances updated.  Life is so much better on the 2012 and later branch.

Contrarian opinion:  core-based licensing had such a major impact on businesses that five years from now, we’ll still see more SQL Server 2008 and 2008 R2 in the wild than we see 2005 today.

Comments closed

Finding Dependencies On Tables

Lori Brown shows us a few methods for finding references to tables:

I use sys.sql_modules to check for references to strings in stored procedures. You might think that this would not be necessary since you can find that in sys.dm_sql_referencing_entities for those. However, what if you do a lot of stuff in stored procedures that uses dynamic SQL to create objects or other stored procs on the fly? sys.dm_sql_referencing_entities might miss those. For example, when I look in sys.sql_modules for all occurrences of Sales.Orders I find more.

This is necessarily an incomplete list, but at least it gets you started.

Comments closed

Package References And ssisUnit

Bartosz Ratajczyk shows us a few methods for setting package references in ssisUnit:

When you set the packages’ references in the ssisUnit tests you have four options for the source (StoragePath) of the package:

  • Filesystem – references the package in the filesystem – either within a project or standalone
  • MSDB – package stored in the msdb database
  • Package store – packages managed by Integration Services Service
  • SsisCatalog – references the package in the Integration Services Catalog

In this post, I will show you how to set the package reference (PackageRef) for each option.

Read on for an example of using each method.

Comments closed

The Hidden Performance Costs Of Collation Mismatch

Nate Johnson explains why you want collation consistency when joining tables together on varchar/nvarchar columns:

There’s a subtle difference here, vs. those many community blog posts, which I’ll repeat.  The columns are of the same type.  Just different collations.

And when the collation on the join predicates is different, bad things happen. Let’s take CustomerNumber for example. On the ERP side, it’s a nvarchar(20) collate Latin1_General_100_CI_AS. On the internal & web apps side, it’s a varchar(20) collateSQL_Latin1_General_CP1_CI_AS. As you might imagine, this is a prime field for joining because it’s the main customer identified throughout all the systems.

Let’s be clear here. This is a numeric value only. Did it need to support Unicode? Absolutely not. Should it have been an int or bigint? Probably. But did The ERP designers choose to make it Unicode string anyway? Yep.

Read on to see how Nate tries to dig himself out of this hole.

Comments closed