
Month: March 2017

Air Travel Route Maps With ggplot2

Peter Prevos wants to create a pretty map of flights he’s taken:

The first step was to create a list of all the places I have flown between at least once. Paging through my travel photos and diaries, I managed to create a pretty complete list. The structure of this document is simply a list of all routes (From, To) and every flight only gets counted once. The next step finds the spatial coordinates for each airport by searching Google Maps using the geocode function from the ggmap package. In some instances, I had to add the country name to avoid confusion between places.
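The "every flight only gets counted once" step can be sketched in a few lines. Peter's work is in R; the Python below is only an illustration. It assumes direction does not matter (A to B is the same route as B to A), which the post does not state explicitly. The function name and the sample flights are made up.

```python
def unique_routes(flights):
    """Return each route once, ignoring direction and keeping first-seen order."""
    seen = set()
    routes = []
    for origin, destination in flights:
        # Assumption: a flight A -> B is the same route as B -> A.
        key = frozenset((origin, destination))
        if key not in seen:
            seen.add(key)
            routes.append((origin, destination))
    return routes


# Hypothetical sample data, not the author's actual travel list.
flights = [
    ("Melbourne", "Sydney"),
    ("Sydney", "Melbourne"),
    ("Melbourne", "Amsterdam"),
    ("Melbourne", "Sydney"),
]
print(unique_routes(flights))
```

Each remaining (From, To) pair would then be geocoded and drawn as one line on the map.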

The end result is imperfect (as Peter mentions, ggmap isn’t wrapping around), but does fit the bill for being eye-catching.

Comments closed

Continuous Deployment In A Box

Ed Elliott has been working on a very interesting project:

What does this do?

Unblock-File *.ps1 – removes a flag that Windows puts on files to stop them being run if they have been downloaded over the internet.
.\ContinuousDeploymentFTW.ps1 – runs the install script which actually:

  • Downloads chocolatey
  • Installs git
  • Installs Jenkins 2
  • Guides you through configuring Jenkins
  • Creates a local git repo
  • Creates an SSDT project which is configured with a test project and SSDT and all the references that normally cause people problems
  • Creates a local Jenkins build which monitors your local git repo for changes
  • When code is checked into the repo, the Jenkins job jumps into action and…

If you check into the default branch “master” then Jenkins:

  • Builds the SSDT project
  • Deploys the project to the unit test database
  • Runs the tSQLt unit tests
  • Generates a deployment script for the “production” database

and what you have there is continuous delivery in a box
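As a rough mental model of the build steps listed above, here is a minimal Python sketch of a pipeline that runs stages in order. It is not how Jenkins or Ed's script implements anything: the stage callables are placeholders, `run_pipeline` is a hypothetical name, and stopping at the first failed stage is an assumption about the desired behavior.

```python
def run_pipeline(stages):
    """Run (name, callable) stages in order.

    Returns (completed_stage_names, failed_stage_name_or_None).
    Assumption: a stage that returns False stops all later stages.
    """
    completed = []
    for name, stage in stages:
        if not stage():
            return completed, name
        completed.append(name)
    return completed, None


# Placeholder stages named after the steps in the list; each just reports success.
stages = [
    ("build SSDT project", lambda: True),
    ("deploy to unit test database", lambda: True),
    ("run tSQLt unit tests", lambda: True),
    ("generate production deployment script", lambda: True),
]
completed, failed = run_pipeline(stages)
print(completed, failed)
```

Under that assumption, a failing unit test run would mean no production deployment script is generated.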

Click through for a video where Ed shows how it all works.


Replication Extended Events

Drew Furgiuele goes hunting for the most dangerous creature of all, replication-related extended events:

Extended events are great; they have all the goodness of profiler except you don’t use profiler. Win/win! More to the point, extended events let you quickly and easily view, sort, and aggregate events that occur on your instances. They also have powerful filters (really, a “where” clause) to limit noise. You have way more control over what you monitor, how you store the data, and how you view and use it. This makes them perfect for tracking replicated transactions, since we want to measure at both the individual level and the aggregate.

I fired up Management Studio and went to “New Session” looking for some replication event goodness and I found…

… nothing. I tried looking for events that had even part of the word “replication” in their names. No such thing, apparently.

This doesn’t deter Drew and he ends up building some interesting events to infer the correct answers.


Ignoring LoadGeneratorLocationError

Melissa Connors shows how to ignore LoadGeneratorLocationError errors in Visual Studio load tests:

I use Visual Studio for performance testing and overhead analysis with the SentryOne products. Currently, I have Microsoft Visual Studio Enterprise 2015 Version 14.0.25431.01 Update 3 installed. Since the first edition of 2015 (possibly even Visual Studio 2013), I’ve received a LoadGeneratorLocationError during each Load Test execution.

Since I am running the test locally, this error is noise. Furthermore, no one wants to see an error in an otherwise successful test. It simply ruins the final results report. In addition, when the Load Test was created, “On-premise Load Test” was selected, which makes this frustrating. Possibly more frustrating is that it’s called “On-premise” when you get started in the New Load Test Wizard.

Read on for the answer.


Unnecessary, Mandatory Work

Lukas Eder lays out one of the biggest performance drains today:

We’re using 8x as much memory in the database when doing SELECT * rather than SELECT film, rating. That’s not really surprising though, is it? We knew that. Yet we accepted it in many, many of our queries where we simply didn’t need all that data. We generated needless, mandatory work for the database, and it does add up. We’re using 8x too much memory (the number will differ, of course).

Now, all the other steps (disk I/O, wire transfer, client memory consumption) are also affected in the same way, but I’m skipping those.
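The effect is easy to reproduce on a small scale. The sketch below uses Python's built-in sqlite3 module as a stand-in for a real database server, with a made-up `film` table and hypothetical column names. It only compares how many characters reach the client, so it illustrates the wire transfer and client memory side rather than the database memory figure quoted above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE film ("
    "film_id INTEGER PRIMARY KEY, title TEXT, rating TEXT, "
    "description TEXT, special_features TEXT)"
)
# Made-up rows: two small columns we need, two wide ones we do not.
rows = [(i, f"Film {i}", "PG", "x" * 500, "y" * 200) for i in range(1, 1001)]
conn.executemany("INSERT INTO film VALUES (?, ?, ?, ?, ?)", rows)


def payload_chars(query):
    """Rough size, in characters, of everything the client receives for a query."""
    return sum(len(str(value)) for row in conn.execute(query) for value in row)


all_columns = payload_chars("SELECT * FROM film")
needed_columns = payload_chars("SELECT title, rating FROM film")
print(all_columns, needed_columns)
```

The exact ratio depends entirely on the table; the point is that the wide, unused columns dominate the payload.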

This article is absolutely worth reading and sharing with developers.


Finding Physical Row Location

Wayne Sheffield shows how to find the physical location of a row in SQL Server:

Acquiring the physical location of a row

SQL Server 2008 introduced a new virtual system column: “%%physloc%%”. “%%physloc%%” returns the file_id, page_id and slot_id information for the current row, in a binary format. Thankfully, SQL Server also includes a couple of functions to split this binary data into a more useful format. Unfortunately, Microsoft has not documented either the column or the functions.
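To make the “binary format” concrete, here is a Python sketch that unpacks an 8-byte value of this kind. The byte layout used (a 4-byte page ID, then a 2-byte file ID, then a 2-byte slot ID, all little-endian) is an assumption based on commonly reported behavior; as noted above, Microsoft does not document it. `decode_physloc` is a hypothetical helper name, not a SQL Server function.

```python
import struct


def decode_physloc(raw: bytes):
    """Split an 8-byte physical locator into (file_id, page_id, slot_id).

    Assumed layout (undocumented): uint32 page, uint16 file, uint16 slot,
    all little-endian.
    """
    if len(raw) != 8:
        raise ValueError("expected exactly 8 bytes")
    page_id, file_id, slot_id = struct.unpack("<IHH", raw)
    return file_id, page_id, slot_id


# Sample value chosen for illustration: file 1, page 165, slot 0.
print(decode_physloc(bytes.fromhex("A500000001000000")))
```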

Read on for two functions you can use to format this data more nicely, as well as a short re-write Wayne did to improve performance of one of them.


replyr

John Mount shows off replyr, which is dplyr for remote, distributed data sets (think SparkR or sparklyr):

Suppose we had a large data set hosted on a Spark cluster that we wished to work with using dplyr and sparklyr (for this article we will simulate such using data loaded into Spark from the nycflights13 package).

We will work a trivial example: taking a quick peek at your data. The analyst should always be able to and willing to look at the data.

It is easy to look at the top of the data, or any specific set of rows of the data.

Read on for more details.


R 3.3.3 Released

David Smith alerts us to R 3.3.3:

The R core group announced today the release of R 3.3.3 (code-name: “Another Canoe”). As the wrap-up release of the R 3.3 series, this update mainly contains minor bug-fixes. (Bigger changes are planned for R 3.4.0, expected in mid-April.) Binaries for the Windows version are already up on the CRAN master site, and binaries for all platforms will appear on your local CRAN mirror within the next couple of days.

For now, I’m holding out until R 3.4.0.


WebHCat

Jiang Mouren has a two-parter on WebHCat. First, how it works:

SSH shell and Oozie Hive actions interact directly with YARN for Hive execution, whereas programs using the HDInsight Jobs SDK or ADF (Azure Data Factory) use the WebHCat REST interface to submit jobs.

WebHCat is a REST interface for remote job (Hive, Pig, Sqoop, MapReduce) execution. WebHCat translates the job submission requests into YARN applications and reports the status based on the YARN application status. WebHCat results come from YARN, so troubleshooting some of them requires going to YARN.

Then, how to debug issues:

2.1.2. WebHCat times out

The HDInsight Gateway times out responses which take longer than 2 minutes, resulting in “502 BadGateway”. WebHCat queries YARN services for job status, and if they take longer than that, the request might time out.

When this happens, collect the following logs for further investigation:

/var/log/webhcat. Typical contents of the directory look like this:

  • webhcat.log is the log4j log to which the server writes logs
  • webhcat-console.log is the stdout of the server when it is started
  • webhcat-console-error.log is the stderr of the server process

NOTE: webhcat.log rolls over daily, so files like webhcat.log.YYYY-MM-DD will also be present. For logs from a specific time range, make sure that the appropriate file is selected.
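Picking the right files for a time range can be sketched as below. This Python helper is hypothetical and only builds file names from the webhcat.log.YYYY-MM-DD pattern mentioned above. It assumes that a rolled file's date suffix is the day its entries were written and that the current day's entries are still in webhcat.log; check both against the actual directory listing.

```python
from datetime import date, timedelta


def webhcat_logs_for_range(start: date, end: date, today: date):
    """List the log file names expected to cover start..end (inclusive)."""
    if start > end or end > today:
        raise ValueError("need start <= end <= today")
    names = []
    day = start
    while day <= end:
        if day == today:
            # Assumption: the current day's entries have not rolled over yet.
            names.append("webhcat.log")
        else:
            names.append(f"webhcat.log.{day.isoformat()}")
        day += timedelta(days=1)
    return names


print(webhcat_logs_for_range(date(2017, 3, 5), date(2017, 3, 7), today=date(2017, 3, 7)))
```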

Because HDInsight doesn’t support WebHDFS, WebHCat is the primary method for cluster access, so it’s good to know.
