Press "Enter" to skip to content

Month: April 2017

ggedit 0.2.0

Jonathan Sidi announces ggedit 0.2.0:

ggedit is an R package that is used to facilitate ggplot formatting. With ggedit, R users of all experience levels can easily move from creating ggplots to refining aesthetic details, all while maintaining portability for further reproducible research and collaboration.
ggedit is run from an R console or as a reactive object in any Shiny application. The user inputs a ggplot object or a list of objects. The application populates Bootstrap modals with all of the elements found in each layer, scale, and theme of the ggplot objects. The user can then edit these elements and interact with the plot as changes occur. During editing, a comparison of the script is logged, which can be directly copied and shared. The application output is a nested list containing the edited layers, scales, and themes in both object and script form, so you can apply the edited objects independent of the original plot using regular ggplot2 grammar.

This makes modifying ggplot2 visuals a lot easier for people who aren’t familiar with the concept of aesthetics and layers—like, say, the marketing team or management.


OLAP Limitations In Tableau

Tim Cost points out areas of friction when trying to use Tableau to connect to a multi-dimensional Analysis Services cube:

I love Tableau; I do NOT, however, love working with Tableau when it is connected to an OLAP cube (like Microsoft SQL Server Analysis Services). I don’t enjoy working with cube data in Tableau because basically all the coolest parts of Tableau won’t work, or won’t work in the ways you might expect. I don’t see this as a failing of Tableau; I lay the blame on the OLAP cube. The main issue with working against a cube in Tableau is that you talk to a cube with MDX, whereas we talk to almost every other data source with SQL. MDX (or Mind Destroying Expressions, as I think of them) is just a huge pain to work with. As hard as it is for ME to write MDX, for Tableau it’s even harder. Here are some things that you should consider before committing to a Tableau project with Microsoft SQL Server Analysis Services as a data source:

Click through for ten such considerations.


Understanding DBCC OPENTRAN

Kevin Hill goes into detail on what DBCC OPENTRAN does:

I have verified that new records I inserted have been read by the log reader AND distributed to the subscriber(s). This means that while you are seeing

Oldest distributed LSN : (37:157:3)

there is not an error…just info.

If you have non-distributed LSNs, there is something to troubleshoot in the replication process, which is way outside the scope of this post. A non-distributed replicated transaction/LSN CAN cause some huge log file growth and needs to be investigated. If this happens frequently, use the TABLERESULTS option to log to a regular table and alert on it.

Good information here.


Learning Azure

Grant Fritchey notes that web searches won’t always take you to the latest version of documentation:

If you’re learning Azure and you research things using a search engine, then I strongly recommend you use the ability to limit your searches to the last year. Otherwise, you may be getting incomplete or incorrect data. At this precise moment, I’d say you need to limit your searches to Google (although I honestly hate recommending one of these tools over the other; let’s keep the competition fierce) because I was able to easily get the correct information within a couple of mouse clicks.

Grant’s post makes sense, and so does the search engine behavior: in Grant’s case, those older cmdlet documentation pages have been around longer, and older resources tend to accumulate more relevant linkbacks and clicks. The same thing shows up with SQL Server documentation, where you’ll sometimes land on the 2008 R2 or 2012 version of a page rather than 2016 or vNext.

Meanwhile, Victoria Holt has a bunch of resources for the Azure curious:

Here are a whole set of links to kick start your learning of Microsoft Azure services.

  • Introduction video

  • Changes to computer thinking – Stephen Fry explains cloud computing

That’s a good set of starting links.


Table Variables Use TempDB Too

Derik Hammer proves that classic, non-memory-optimized table variables use disk:

Table variables use tempdb similarly to how temporary tables use tempdb. Table variables are not in-memory constructs, but they can become them if you use memory-optimized user-defined table types. Often I find temporary tables to be a much better choice than table variables. The main reason for this is that table variables do not have statistics and, depending upon SQL Server version and settings, the row estimates work out to be 1 row or 100 rows. In both cases these are guesses and become detrimental pieces of misinformation in your query optimization process.

It’s worth the read.


Scalable Data Analytics

David Smith covers a recent Microsoft Data Science team talk at Strata:

The tutorial covers many different techniques for training predictive models at scale, and deploying the trained models as predictive engines within production environments. Among the technologies you’ll use are Microsoft R Server running on Spark, the SparkR package, the sparklyr package and H2O (via the rsparkling package). It also touches on some non-Spark methods, like the bigmemory and ff packages for R (and various other packages that make use of them), and using the foreach package for coarse-grained parallel computations. You’ll also learn how to create prediction engines from these trained models using the mrsdeploy package.

Check out the post as well as the tutorial David links.


Architecting Kafka Streams

Bill Bejeck walks through a scenario in which one might use Kafka Streams:

Now that you’ve defined your source, you can start creating the processors that’ll do the work on the data. The first goal is to mask the credit card numbers recorded in the incoming purchase records; the first processor converts credit card numbers from 1234-5678-9123-2233 to xxxx-xxxx-xxxx-2233. The KStream.mapValues method performs the masking: it returns a new KStream instance that changes the values, as specified by the given ValueMapper, as records flow through the stream. This particular KStream instance is the parent processor for any other processors you define, and it provides the masked credit card numbers to any downstream processors working with Purchase objects.

Unfortunately, this article seems like a mixture of high-level and low-level information that appeals more to people who already know how Kafka Streams works, but it is nevertheless interesting.
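
If you haven’t worked with the Streams DSL before, here is roughly what that masking step looks like. This is a minimal sketch of my own rather than code from the article: it uses the current (post-1.0) Streams API and plain string payloads instead of Bill’s Purchase objects, and the topic names are made up.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class CardMaskingStream {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Source processor: raw purchase records arrive on this (hypothetical) topic.
        KStream<String, String> purchases = builder.stream(
            "purchases", Consumed.with(Serdes.String(), Serdes.String()));

        // mapValues returns a new KStream whose values are transformed by the given
        // ValueMapper; every processor defined downstream of it sees only masked card
        // numbers, e.g. 1234-5678-9123-2233 becomes xxxx-xxxx-xxxx-2233.
        KStream<String, String> masked = purchases.mapValues(
            value -> value.replaceAll("\\d{4}-\\d{4}-\\d{4}-(?=\\d{4})", "xxxx-xxxx-xxxx-"));

        // Sink: write the masked records out for downstream consumers.
        masked.to("purchases-masked", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "card-masking-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        new KafkaStreams(builder.build(), props).start();
    }
}
```

The key point is the one the excerpt makes: mapValues hands back a new KStream, and that stream is the parent of everything you attach afterward, so nothing downstream ever sees an unmasked card number.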


Encrypting Kinesis Records

Temitayo Olajide shows how to use Amazon’s Key Management Service to encrypt and decrypt Kinesis messages:

In this post you build encryption and decryption into sample Kinesis producer and consumer applications using the Amazon Kinesis Producer Library (KPL), the Amazon Kinesis Consumer Library (KCL), AWS KMS, and the aws-encryption-sdk. The methods and the techniques used in this post to encrypt and decrypt Kinesis records can be easily replicated into your architecture. Some constraints:

  • AWS charges for KMS API requests used for encryption and decryption; for more information, see AWS KMS Pricing.

  • You cannot use Amazon Kinesis Analytics to query Amazon Kinesis Streams with records encrypted by clients in this sample application.

  • If your application requires low-latency processing, note that there will be a slight latency hit.

Check it out, especially if you’re thinking about streaming sensitive data.
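
For a feel of the producer side, here is a rough Java sketch. Note that it is not the article’s approach: it calls the KMS Encrypt API directly (which caps plaintext at 4 KB), whereas the post uses envelope encryption via the aws-encryption-sdk together with the KPL and KCL. The stream name and key alias below are placeholders of mine.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
import com.amazonaws.services.kinesis.model.PutRecordRequest;
import com.amazonaws.services.kms.AWSKMS;
import com.amazonaws.services.kms.AWSKMSClientBuilder;
import com.amazonaws.services.kms.model.EncryptRequest;

public class EncryptedKinesisProducer {
    public static void main(String[] args) {
        // Placeholder stream name and KMS key alias.
        String streamName = "sensitive-events";
        String keyId = "alias/kinesis-demo";

        AWSKMS kms = AWSKMSClientBuilder.defaultClient();
        AmazonKinesis kinesis = AmazonKinesisClientBuilder.defaultClient();

        String payload = "{\"card\":\"1234-5678-9123-2233\"}";

        // Encrypt the payload with the KMS key before it leaves the producer.
        // Direct KMS Encrypt is limited to 4 KB of plaintext; larger records need
        // envelope encryption, which is what the aws-encryption-sdk handles.
        ByteBuffer ciphertext = kms.encrypt(new EncryptRequest()
                .withKeyId(keyId)
                .withPlaintext(ByteBuffer.wrap(payload.getBytes(StandardCharsets.UTF_8))))
            .getCiphertextBlob();

        // Consumers call KMS Decrypt on the record data before processing, which is
        // also why Kinesis Analytics cannot query these client-encrypted records.
        kinesis.putRecord(new PutRecordRequest()
            .withStreamName(streamName)
            .withPartitionKey("card-1")
            .withData(ciphertext));
    }
}
```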


Rolling A Log Analytics System

Michael Sun and Jeff Shmain put together a log analytics system using several technologies:

This is an example of tiered system design. A tiered system is a design pattern in which data is categorized and stored in the data stores that best fit each category. It can both improve performance and lower the cost of a system. One of the most famous tiered system designs is the computer memory hierarchy. In the log analytics use case, analysts mostly search for logs from recent months, but often run batch jobs to get long-term trends from logs spanning recent years. Therefore, recent logs are indexed and stored in Solr for search, while years of logs are stored in HBase for batch processing. As such, the index in Solr is small, which both improves performance and reduces cost, among other benefits.

Although only months of logs are stored in Solr, the logs before that period are stored in HBase and can be indexed on demand for further analysis.

Now that we have covered a high level architecture of a log analytics system, we will dive into more details of individual components.

This looks like a solid architecture for a logging system and can apply to other cases as well.
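
To make the tiering concrete, here is a bare-bones Java sketch of the dual-write idea: every event lands in HBase for long-term batch work, while recent events are also indexed in Solr for interactive search. The table, collection, and field names are mine, and the article’s actual pipeline does far more (ingestion, parsing, on-demand re-indexing from HBase) than this.

```java
import java.time.Instant;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class TieredLogWriter {
    public static void main(String[] args) throws Exception {
        String id = "host01-" + Instant.now().toEpochMilli();
        String message = "ERROR payment service timed out";

        // Tier 1: HBase keeps the full history cheaply for batch jobs over years of logs.
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table logs = conn.getTable(TableName.valueOf("logs"))) {
            Put put = new Put(Bytes.toBytes(id));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("msg"), Bytes.toBytes(message));
            logs.put(put);
        }

        // Tier 2: only recent logs are indexed in Solr, keeping the index small and fast
        // for search; older data can be re-indexed from HBase on demand.
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/logs").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", id);
            doc.addField("message_txt", message);
            doc.addField("event_time_dt", Instant.now().toString());
            solr.add(doc);
            solr.commit();
        }
    }
}
```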
