Press "Enter" to skip to content

Month: February 2018

Hortonworks DataFlow 3.1 Released

George Vetticaden and Haimo Liu announce Hortonworks DataFlow version 3.1:

Apache Kafka 1.0 support with full integration with HDF Services – Kafka 1.0 provides important new features, including more stringent message processing semantics with support for message headers and transactions, performance improvements, and advanced security options.

  • Apache Ambari support for Kafka 1.0 – Install, configure, manage, upgrade, monitor, and secure Kafka 1.0 clusters with Ambari.

  • Apache Ranger support for Kafka 1.0 – Manage access control policies (ACLs) using resource or tag-based security for Kafka 1.0 clusters.

  • New NiFi and SAM processors for Kafka 1.0 – New processors in NiFi and Hortonworks Streaming Analytics Manager (SAM) support Kafka 1.0 features including message headers and transactions.

Click through for the list of top changes.

Library Paths In R

Stacia Varga troubleshoots an issue integrating Power BI with R:

As I was putting together an example of using an R script as a Power BI data source, I ran into some issues on my development machine that were frankly driving me crazy. When I tried to run the query in Power BI with my R script (which ran successfully in the IDE, by the way), I kept getting this message:

DataSource.Error: ADO.NET: R script error.
Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) :
  namespace 'scales' 0.3.0 is being loaded, but >= 0.4.1 is required
Error: package or namespace load failed for 'rnoaa'
Execution halted

Stacia’s answer works as long as the .libPaths() results match expectations. Another idea would be to set the R_LIBS_USER user-level environment variable to the desired library directory, which forces that directory to come first in the results of .libPaths().
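
For illustration, here is a minimal sketch of that environment-variable approach (the library path below is a placeholder, not Stacia’s actual directory):

# Check the current library search order
.libPaths()

# Per-session fix: prepend the desired library directory (placeholder path)
.libPaths(c("C:/Users/me/Documents/R/win-library/3.4", .libPaths()))

# Persistent fix: set R_LIBS_USER (for example in ~/.Renviron) so that
# directory comes first in .libPaths() on every startup
# R_LIBS_USER=C:/Users/me/Documents/R/win-library/3.4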

Data Masking Prior To SQL Server 2016

Daniel Hutmacher shows how to roll your own data masking with SQL Server:

Dynamic data masking is a neat new feature in recent SQL Server versions that allows you to protect sensitive information from non-privileged users by masking it. But using a brute-force guessing attack, even a non-privileged user can guess the contents of a masked column. And if you’re on SQL Server 2014 or earlier, you won’t have the option of using data masking at all.

Read on to see how you can bypass dynamic data masking, and for an alternative approach that uses SQL Server column-level security instead.

Click through for the demo.

Bash For The Powershell-Minded

Mark Wilkinson has started a new series on Bash.  His first post is an introduction to the scripting language:

Bash (the Bourne Again Shell) was created in 1989 for the GNU Project as a free replacement for the Unix Bourne shell. Most modern Linux systems use Bash as their default command line shell, so if you have ever dropped to a command line on a Linux system, you have probably used Bash. Just like PowerShell, Bash is both a scripting language and a command shell/interpreter. So not only can you execute commands in an interactive shell session, but you can also write scripts that incorporate multiple commands.

Once you get your hands dirty with Bash you’ll notice a lot of features that were incorporated into PowerShell. Command substitution, for example: PowerShell’s $(Get-Date) was pulled directly from Bash’s $(date). Other features will look familiar as well, like the ability to pipe multiple commands together.

One thing you need to understand right away is that Bash is string based, not object based like PowerShell. This means you’ll find yourself doing a lot more string processing to get tasks done. Things like string splitting will be much more common. Bash does support objects, like arrays, but few if any commands output an array. As we go through this series you’ll see that this might not be as limiting as it sounds.

The best part about learning Bash is that you can then get into arguments about Bash vs ksh vs zsh.

Handling Permissions Changes With Powershell

Drew Furgiuele has a process to store and then re-run rights grants on SQL Server databases:

Permission requirements for these environments can change over time, just like the code and data going into your databases. It’s hard to track permissions because a database permission is much more than just a user principal; database objects often contain permission definitions for GRANT and DENY states, and users may belong in certain database roles in one environment, but not another. This isn’t a big deal… until it is: sooner or later your data and code drift will be different than production, or maybe some new change really breaks an environment. Then, you’ll be asked to restore these environments to either an earlier version, or, more likely, you’ll be asked to “refresh” these editions to what is currently in production.

You probably already have a process for this, but how are you handling maintaining differences in permissions between environments? Wouldn’t it be nice if you had a way to quickly evaluate, store, and then re-apply permissions as part of refresh? Even better, wouldn’t it be cool if you could do this for all your databases on a given instance? Or what about all your instances in a given environment?

You can, and you can do it pretty easily with PowerShell.

My one problem with Drew’s otherwise-excellent post is that he approved far too many entry visas in the opening GIF.  100% deny, 0 problems.

Virtualize Data Or Move It?

James Serra contrasts data virtualization with traditional ETL moving data to a warehouse:

Data virtualization integrates data from disparate sources, locations and formats, without replicating or moving the data, to create a single “virtual” data layer that delivers unified data services to support multiple applications and users.

Data movement is the process of extracting data from source systems and bringing it into the data warehouse and is commonly called ETL, which stands for extraction, transformation, and loading.

If you are building a data warehouse, should you move all the source data into the data warehouse, or should you create a virtualization layer on top of the source data and keep it where it is?

Read on for James’s thoughts.

Performance Problems With ADO.Net’s AddWithValue

Dan Guzman inveighs against using AddWithValue in ADO.Net:

The nastiness with AddWithValue is that ADO.NET infers the parameter definition from the supplied object value. Parameters in SQL Server are inherently strongly-typed, including the SQL Server data type, length, precision, and scale. Types in .NET don’t always map precisely to SQL Server types, and are sometimes ambiguous, so AddWithValue has to make guesses about the intended parameter type.

The guesses AddWithValue makes can have huge implications when wrong because SQL Server uses well-defined data type precedence rules when expressions involve unlike data types; the value with the lower precedence is implicitly converted to the higher type. The implicit conversion itself isn’t particularly costly, but it is a major performance concern when it is the column value rather than the parameter value that must be converted, especially in a WHERE or JOIN clause predicate. The implicit column value conversion can prevent indexes on the column from being used with an index seek (i.e. a non-sargable expression), resulting in a full scan of every row in the table or index.

Read the whole thing.

Dealing With Dates In R

Mathew McLean shows how to convert strings to dates using a couple of well-known packages and introduces flipTime:

The package flipTime provides utilities for working with time series and date-time data. The package can be installed from GitHub using

require(devtools)
install_github("Displayr/flipTime")

I will discuss only two functions from the package in this post, AsDate() and AsDateTime(). These are used for the conversion of date and date-time strings, respectively. These functions build on the convenience and speed of the lubridate package, and they provide additional functionality that makes them easier to use. The functions are smart about identifying the proper format to use, so the user doesn’t need to specify the format(s) as inputs. At the same time, both AsDate() and AsDateTime() are careful not to convert any strings to dates when they are not formatted as dates, and they will warn the user when the dates are in an ambiguous format.
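
To see the difference, here is a small sketch, assuming flipTime is installed as above; the date strings are made up for illustration:

library(lubridate)
library(flipTime)

# With lubridate, you choose the parser to match the format yourself
dmy("28-02-2018")                  # day-month-year
mdy("02/28/2018")                  # month-day-year

# AsDate() and AsDateTime() try to work out the format on their own
AsDate("28-02-2018")
AsDateTime("2018-02-28 14:30:00")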

Check it out.

ARIMA In R

Subhasree Chatterjee shows us how to use R to implement an ARIMA model:

Once the data is ready and satisfies all the assumptions of modeling, to determine the order of the model to be fitted to the data, we need three variables: p, d, and q which are non-negative integers that refer to the order of the autoregressive, integrated, and moving average parts of the model respectively.

To examine which p and q values will be appropriate, we need to run the acf() and pacf() functions.

pacf() at lag k is the partial autocorrelation function, which describes the correlation between all data points that are exactly k steps apart, after accounting for their correlation with the data between those k steps. It helps to identify the number of autoregressive (AR) coefficients (the value of p) in an ARIMA model.
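
As a rough sketch of that workflow in base R (AirPassengers is a stock example dataset, not the data from the original post, and the orders below are purely illustrative):

# Difference the log-transformed series, then inspect the correlograms
series <- log(AirPassengers)
acf(diff(series))     # helps suggest q, the moving average order
pacf(diff(series))    # helps suggest p, the autoregressive order

# Fit an ARIMA(p, d, q) model and forecast the next 12 periods
fit <- arima(series, order = c(1, 1, 1))
predict(fit, n.ahead = 12)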

ARIMA feels like it should be too simple to work, but it does.

Logs Are For Parsing

Tim Wilde shares an oft-forgotten truth:

How often have you found yourself contemplating some hare-brained regex scheme in order to extract an inkling of value from a string and wishing the data had just arrived in a well-structured package without all the textual fluff?

So why do we insist on writing prose in our logs? Take “Exception while processing order 1234 for customer abc123” for example. There are at least four important pieces of information drowning in that one sentence alone:

  1. An exception was raised!
  2. During order processing
  3. Order number 1234
  4. Customer abc123

Being an exception log message, it’s more than likely followed by a stack trace, too. And stack traces certainly don’t conform to carefully crafted log layout patterns.

Logging is something we tend to forget about and slap in at the last minute.  We also think about it from the viewpoint of a developer looking at a single error message.  Those are both mistakes that lead to a huge amount of extra work later.
