Press "Enter" to skip to content

Month: January 2018

Tidytext 0.1.6

Julia Silge announces a new version of tidytext:

I am pleased to announce that tidytext 0.1.6 is now on CRAN!

Most of this release, as well as the 0.1.5 release which I did not blog about, was for maintenance, updates to align with API changes from tidytext’s dependencies, and bug fixes. I just spent a good chunk of effort getting tidytext to pass R CMD check on older versions of R despite the fact that some of the packages in tidytext’s Suggests require recent versions of R. FUN TIMES. I was glad to get it working, though, because I know that we have users, some teaching on university campuses, etc., who are constrained to older versions of R in various environments.

There are some more interesting updates. For example, did you know about the new-ish stopwords package? This package provides access to stopword lists from multiple sources in multiple languages. If you would like to access these in a list data structure, go to the original package. But if you like your text tidy, I GOT YOU.
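tidytext and stopwords are R packages, but the list-versus-tidy distinction Julia is drawing carries over to any data frame library. As a rough illustration only, sketched here in Python with pandas (the words and lexicon name are made up for the example, not the package’s actual contents), here is what a stopword list looks like in each shape:

```python
import pandas as pd

# The "list data structure" case: a bare sequence of words.
# Illustrative words only, not a real lexicon.
stopword_list = ["a", "an", "the", "and", "or", "not"]

# The tidy case: one row per word, plus a column recording the source
# lexicon, so multiple lists can be stacked, joined, and filtered
# like any other tidy table.
tidy_stopwords = pd.DataFrame({
    "word": stopword_list,
    "lexicon": "example",
})
print(tidy_stopwords)
```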

Read on for examples and grab the latest version.


Visual Principles

I have a post looking at three visual principles important to creating good dashboards:

In European languages, we read from left to right and from top to bottom.  In Middle Eastern languages like Hebrew and Arabic, we read from right to left and top to bottom.  In ancient Asian languages (particularly Chinese), we read from top to bottom and right to left, but in modern Chinese, we read left to right and top to bottom.  As far as Japanese goes, we read every which way because YOLO.  The way we read biases the way we look at things.

There has been quite a bit of research on where we look on a screen or on a page. I’m going to describe a few layouts, focusing on research done on European readers.  If you poll a group of Israeli or Saudi Arabian readers, flip the results.

Read the whole thing.  The second part of that comes out soon.


Welcome, CXCONSUMER

Erik Darling points out that CXCONSUMER is now a wait type in SQL Server:

According to Pedro’s slide, but not the ENTIRELY MISSING DOCUMENTATION, this wait is the “safe” type of parallelism wait.

It’s a good thing Pedro is a dutiful blogger, so we don’t have to pull our hair out while unfurling these mysteries.

Speaking of documentation, our new CXCONSUMER friend isn’t mentioned in Query Store Wait Stats, either.

This is a very useful addition.


Performance Testing Post-Updates

Joe Chang has some quick and dirty performance tests comparing SQL Server 2016 SP1 against SQL Server 2016 SP1 CU7 (the first post-Meltdown/Spectre release):

linear sum, SQL Server 2016 SP1 CU7 (build 4466) plus OS patches vs. the SP1 base:

  • 9% faster, 12% more CPU efficient
  • individual queries range from 24% faster to 0.3% slower
  • there is probably a penalty in the recent fixes, but fixes since SP1 also made improvements?

Click through for more details.  We’ll have to see a lot more testing to know, but that’s certainly not awful.


Incrementing Matches In PowerShell Regex

Tom Rayner has an example of building multiple regex matches in PowerShell:

In the PowerShell Slack, I recently answered a question along these lines. Say you have a string that reads “first thing {} second thing {}” and you want to get to “first thing {0} second thing {1}” so that you can use the -f operator to insert values into those spots. For instance…

The question is: how can you replace the {}’s in the string with {<current number>}?
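Tom answers it in PowerShell in the post. As a hedged sketch of the same incrementing-replacement idea in Python, re.sub accepts a callable replacement, which makes it easy to number each match as it is found:

```python
import re
from itertools import count

def number_placeholders(s: str) -> str:
    """Replace each bare {} with {0}, {1}, ... in order of appearance."""
    counter = count()
    return re.sub(r"\{\}", lambda m: "{%d}" % next(counter), s)

template = number_placeholders("first thing {} second thing {}")
print(template)                       # first thing {0} second thing {1}
print(template.format("foo", "bar"))  # first thing foo second thing bar
```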

Read on for more details.


Active Directory And ElasticMapReduce

Bruno Faria shows how to use AWS’s CloudFormation to extend Active Directory into an AWS ElasticMapReduce cluster and run jobs via Kerberos:

In this example, you build a solution that allows Active Directory users to seamlessly access Amazon EMR clusters and run big data jobs. Here’s what you need before setting up this solution:

  • An AWS account
  • An Amazon EC2 key pair
  • A possible limit increase for your account (Note: Usually a limit increase will not be necessary. See the AWS Service Limits documentation if you encounter a limit error while building the solution.)

To make it easier for you to get started, I created AWS CloudFormation templates that automatically configure and deploy the solution for you. The following steps and resources are involved in setting up the solution:

  1. Create and configure an Amazon Virtual Private Cloud (Amazon VPC).
  2. Launch an Amazon EC2 Windows instance (Active Directory domain controller).
  3. Create an Amazon EMR security configuration for Kerberos and cross-realm trust.
  4. Launch an Amazon EMR cluster with Kerberos enabled and a cross-realm trust configuration.

You can use the AWS CloudFormation templates to complete each step individually, or you can deploy the entire solution through a single step.
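The post drives everything through the CloudFormation console, but the same launch can be scripted. This is a minimal sketch using boto3; the stack name, template URL, and parameter names are placeholders I made up, not the post’s actual template:

```python
import boto3

cloudformation = boto3.client("cloudformation", region_name="us-east-1")

# Placeholder stack name, template URL, and parameters -- substitute
# the template and parameters from the post.
response = cloudformation.create_stack(
    StackName="emr-kerberos-demo",
    TemplateURL="https://s3.amazonaws.com/my-bucket/emr-ad-solution.template",
    Parameters=[
        {"ParameterKey": "KeyName", "ParameterValue": "my-ec2-keypair"},
    ],
    Capabilities=["CAPABILITY_IAM"],  # required when a template creates IAM roles
)
print(response["StackId"])
```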

Read the whole thing.


Digging Into The Data Professional Survey

Melissa Connors looks at the 2018 Data Professionals Salary Survey:

This report is filtered to the United States, Private sector, full-time employees, Job Titles with more than 50 results, all primary databases, a salary between $15,000 and $200,000, and a survey year of 2018.

On the top are employees who said they work remotely 0 days per week, the middle is office employees who telecommute 1-4 days per week, and the bottom is the true remote employee who does this 5+ days per week.

The overall median salaries were $97,316 for office employees, $111,500 for part-time telecommuters, and $114,163 for full-time remote employees, which led to the click-bait title of this post. 🙂 It’s possible that this is because only more senior or highly-valued employees feel comfortable working from home, or are even allowed to, depending on the company culture.
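As a hedged illustration of the slicing described above, here is how the same filter-and-compare might look in pandas; the file name and column names are hypothetical stand-ins, not the survey’s actual schema:

```python
import pandas as pd

# Hypothetical file and column names, standing in for the survey download.
survey = pd.read_csv("salary_survey.csv")

filtered = survey[
    (survey["Country"] == "United States")
    & (survey["Sector"] == "Private")
    & (survey["SalaryUSD"].between(15_000, 200_000))
    & (survey["SurveyYear"] == 2018)
]

# Bucket telecommute days per week the way the report does:
# 0 = office, 1-4 = part-time remote, 5+ = full-time remote.
buckets = pd.cut(
    filtered["TelecommuteDaysPerWeek"],
    bins=[-1, 0, 4, 7],
    labels=["office", "part-time remote", "full-time remote"],
)
print(filtered.groupby(buckets)["SalaryUSD"].median())
```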

Click through to see all of Melissa’s findings.


Online Database Object Changes

Michael J Swart continues his online deployment series:

PROCEDURES:
Procedures are very easy to Blue-Green. Brand new procedures are added during the pre-migration phase. Obsolete procedures are dropped during the post-migration phase.

If the procedure is changing but is logically the same, then it can be altered during the pre-migration phase. This is common when the only change to a procedure is a performance improvement.

But if the procedure is changing in other ways, for instance when a new parameter is added or dropped, or when the resultset is changing, then use the Blue-Green method to replace it: during the pre-migration phase, create a new version of the procedure. It must be named differently, and the green version of the application has to be updated to call the new procedure. The original blue version of the procedure is deleted during the post-migration phase. It’s not always elegant calling a procedure something like s_USERS_Create_v2, but it works.

This has been a great series so far, and the way he does deployments matches very closely to the way we do them.


Spark And NVMe

Alicja Luszczak, et al., introduce NVMe caching in the Databricks distribution of Spark:

A particularly important and widespread use case is caching the results of scan operations. This allows users to eliminate the low throughput associated with reading remote data. For this reason, many users who intend to run the same or similar workload repeatedly decide to invest extra development time into manually optimizing their application, by instructing Spark exactly which files to cache and when to do it, and thus the name “explicit caching.”

For all its utility, Spark cache also has a number of shortcomings. First, when the data is cached in the main memory, it takes up space that could be better used for other purposes during query execution, for example, for shuffles or hash tables. Second, when the data is cached on the disk, it has to be deserialized when read — a process that is too slow to adequately utilize the high read bandwidths commonly offered by the NVMe SSDs. As a result, occasionally Spark applications actually find their performance regressing when turning on Spark caching.

Third, having to plan ahead and explicitly declare which data should be cached is challenging for users who want to interactively explore the data or build reports. While Spark cache gives data engineers all the knobs to tune, data scientists often find it difficult to reason about the cache, especially in a multi-tenant setting, where engineers still require the results to be returned as quickly as possible in order to keep the iteration time short.
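For context, the “explicit caching” mentioned at the top of the excerpt is the stock Spark pattern of calling cache() or persist() on a DataFrame yourself. A minimal PySpark sketch, with an illustrative path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("explicit-caching").getOrCreate()

# Illustrative remote path: repeated scans of remote storage are
# exactly the cost you are trying to pay only once.
df = spark.read.parquet("s3://my-bucket/events/")

df.cache()   # marks the DataFrame for caching; nothing happens yet (lazy)
df.count()   # the first action materializes the cache
df.groupBy("event_type").count().show()  # now served from cached data
```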

Read on for more details, as well as performance comparisons.
