Identifying Distributions with knn in R

Abhijit Telang has an interesting post on identifying arbitrary distributions with the k-nearest-neighbor algorithm in R:

You can easily see how arbitrary the shapes can be almost magically discovered, through the principle of the nearest neighbor search.

The magic happens because the methodical approach of meeting and greeting the neighbors discovers more and more neighbors (and hence the visualization becomes denser and denser) as per the formation of the shape, and on the other hand, sparser and sparser as the traversal approaches the contours of those very shapes. The sparseness around the dense shapes provides the much-needed contrast to discover hidden shapes.

Read on for a very interesting explanation.

Shoestring Budget Keyword Bidding

Steph Locke shares some experience on keyword bidding without breaking the piggy bank:

To be able to do the next stage I was interested in a solution that would give me an API or a bulk-upload interface for free or cheap. Google Ads has a Keyword Planner but having to create a campaign before I could even get started wasn’t fun. Another recommendation I’d had for keyword analysis was and I noticed they had a $7 trial and an API that looked well documented. The API didn’t seem to cover keywords but it did have a bulk upload capability.

Some of what Steph’s doing is possible within AdWords reporting, but they will tend toward finding helpful ways of maximizing your spend.

New Version of ggforce Available

Thomas Lin Pedersen announces a new version of ggforce for R:

If there is one thing of general utility lacking in ggplot2 it is probably the ability to annotate data cleanly. Sure, there’s geom_text()/geom_label()but using them requires a fair bit of fiddling to get the best placement and further, they are mainly relevant for labeling and not longer text. ggrepelhas improved immensely on the fiddling part, but the lack of support for longer text annotation as well as annotating whole areas is still an issue.

In order to at least partly address this, ggforce includes a family of geoms under the geom_mark_*() moniker. They all behaves equivalently except for how they encircle the given area(s). 

There are some really interesting features in the ggforce package, so check them out.

When Faster Disk Increases WRITELOG Waits

Paul Randal explains that WRITELOG waits can potentially increase as you get faster disk:

I was contacted last week by someone who was confused about the WRITELOG wait type. They were seeing lots of these waits, with an average wait time of 18ms. The log was stored on a Raid-1 array, using locally-attached spinning disks in the server. They figured that by moving the log to Raid-1 array of SSDs, they’d reduce the WRITELOGwait time and get better workload throughput.

They did so and got better performance, but were very surprised to now see WRITELOG as the most frequent wait type on the server, even though the average wait time was less than 1ms, and asked me to explain.

Read on for Paul’s explanation of why this is not a scary situation, or is it particularly weird. SQL Server performance is a complicated thing and trying to limit it to one measure or one query can lead you down the wrong path.

Enabling and Disabling Triggers

Kenneth Fisher shows how you can enable and disable triggers in SQL Server:

I tend to feel that a lot of people who use triggers don’t really understand them. That said, every now and again you have to deal with them. And in particular (for this post) you might need to disable and then re-enable them. Enabling and disable are identical commands in this case so I’m just going to use the DISABLEversion and you can just replace it with ENABLE as needed.

To be fair, the world might be a better place if we disabled a majority of triggers…

Changing Tables with Limited Downtime

I continue my series on near-zero downtime deployments and this time look at table changes:

The realm of More Significant Changes is not where you often want to be. There’s a lot of scaffolding code you need to write. Basically, suppose you want to make a repair on the 5th story exterior of an 8-story building. You have a couple of options: the YOLO option is to kick everybody out of the building and have people rappel from the top of the building down to the 5th story to make their changes. They need all of the people out of the building because shut up it’s a strained analogy. This approach is kind of inconvenient: people have to stay out of your building until your people are done repairing the exterior. That’s blocking in the database world.

On the other side, you can build a bunch of scaffolding and attach it to the exterior of the building, perform your repairs, and tear down that scaffolding. While the scaffolding is up, people come and go like normal and don’t even think about it. As you tear the scaffolding down, you temporarily block the door for a moment as you’re finishing your work. This is much more convenient for end users and fits the “near-zero downtime” approach we’re taking.

Strained analogies aside, this is a long post on making a series of table-related changes without your end users noticing.

Beware CPU Oversubscription with SQL Server

Monica Rathbun shares a tale of terror:

Recently I had a client complain of chronic high CPU utilization. The performance of their SQL Server had degraded, and it appeared to be related to higher than normal CPU utilization in conjunction with symptoms of unresponsive user queries.  The root cause was twofold—a third party hosting provider had overallocated virtual processors on the physical host where the virtual machine (VM) running SQL Server was residing, as well as a recent upgrade from a version of VMWare that was not patched for Spectre and Meltdown. The host had 16 physical cores and was hyperthreading (making it effectively 32 cores) until the hosting provider patched from VMWare 5.5 to a newer release (we believe 6.5) which was required for Meltdown and Spectre processor vulnerabilities. This patch disabled hyperthreading from the hypervisor to mitigate the security risk from speculative execution. Note, this patch is over a year old and a critical security risk; most software vendors (including VMWare) put this out as an immediate requirement after the announcement of the vulnerabilities.

Given this was a virtual machine, it shared a physical host with many other VMs; this is a very common configuration. However, this host was VERY overallocated.  As mentioned above, there were 16 cores–however 61 additional vCPUs had been allocated to other machines. That’s 4.3 times the number of CPUs available for allocation.  The screenshot below shows this singular Host, highlighting the vCPUs allocated.

So, uh, that’s a bad thing. Monica explains in detail why exactly it’s a bad thing, which is helpful when you’re trying to explain to the server admin why it’s a bad thing. CPU oversubscription can work for things like dev boxes or web servers, where they typically aren’t anywhere near 100% utilization. It does not work at all for busy database servers.

SSIS Catalog Dashboard

Tim Mitchell has a new GitHub repo:

The SSIS Catalog Dashboard is a simple collection of reports that provide insight into the activity within the SSIS catalog. The first of these is the Dashboard report. This report shows a summary of the number of packages that are running or have run in the recent past.

The dashboard repo, a Reporting Services project, is available on GitHub and is licensed under GPL version 3.

Performance Troubleshooting Plus Wait Stats

Jeff Mlakar builds up some thoughts on performance troubleshooting, including wait stats:

Queries go through the cycle of the SPIDS / worker threads waiting in a series like this. A thread uses the resource e.g. CPU until it needs to yield to another that is waiting. It then moves to an unordered list of threads that are SUSPENDED. The next thread on the FIFO queue of threads waiting then becomes RUNNING. If a thread on the SUSPENDED list is notified that its resource is available, it becomes RUNNABLE and goes to the bottom of the queue.

Click through for an analogy using a microwave and plenty more.


March 2019
« Feb