Press "Enter" to skip to content


Joining Tables In SparkR

WenSui Liu has a script to join tables together in SparkR:

# INNER JOIN: two equivalent ways to express it
# merge() follows base R semantics; all = FALSE keeps only matching rows
showDF(merge(sum1, sum2, by.x = "month1", by.y = "month2", all = FALSE))
# join() takes an explicit join condition plus a join-type string
showDF(join(sum1, sum2, sum1$month1 == sum2$month2, "inner"))
#+------+-------+------+-------+
#|month1|min_dep|month2|max_dep|
#+------+-------+------+-------+
#|     3|    -25|     3|    911|
#|     2|    -33|     2|    853|
#+------+-------+------+-------+

There’s no commentary, so it’s all script all the time.  H/T R-bloggers
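To extend the pattern, a left outer join keeps every row of sum1 and fills in nulls where sum2 has no match. A minimal sketch, assuming the same sum1 and sum2 data frames from the script:

# LEFT OUTER JOIN, expressed both ways
showDF(merge(sum1, sum2, by.x = "month1", by.y = "month2", all.x = TRUE))
showDF(join(sum1, sum2, sum1$month1 == sum2$month2, "left_outer"))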


Transactional Replication And Temporal Tables

Transactional replication found Drew Furgiuele’s little black book of T-SQL syntax and went ballistic when it found out Drew was seeing temporal tables on the side:

So let’s say you ran this script (or, maybe someone checked it in as a database change to production). For a while, things are great: you’re making changes to data on your publisher and things are flowing nicely to your subscribers. Sooner or later though, someone’s going to ask you to set up a new subscription (or maybe you need to reinitialize one). Let’s simulate that on my lab: we’re going to remove Person.Address from replication and we’re going to put it back, and then create a snapshot. The key difference here is that now, Person.Address has system versioning turned on. When we try and add the table back to the publication, we’re in for a shock:

This could come back to bite you, so if you use replication and are interested in temporal tables, read this closely.


Allowing Native Queries In An M Project

Cedric Charlier ran into an error running native queries in his Visual Studio M project:

I had been using it for only a few days when I found an interesting case: my query had a native query:

Sql.Database(
   "server",
   "db",
   [Query = "select * from myTable where field=" & value]
)

When I tried to execute it, I received a message from the Power Query SDK that

The evaluation requires a permission that has not been provided. Data source kind: ‘SQL’. Permission kind: ‘NativeQuery’.

Read on for the solution.


What Is DevOps?

Rob Farley explains the basics of DevOps:

Traditionally, developers would develop code without thinking much about operations. They’d get some new code ready, deploy it somehow, and hope it didn’t break much. And the Operations team would brace themselves for a ton of pain, and start pushing back on change, and be seen as a “BOFH”, and everyone would be happy. I still see these kinds of places, although for the most part, people try to get along.

With DevOps, the idea is that developers work in a way that means that things don’t break.

I know, right.

My tongue-in-cheek-or-maybe-not version of this is: DevOps is when you put developers in the on-call rotation. This provides motivation to build tools that actually explain what’s going on and to write code that plays nicer with others.


The Spirit Of DevOps

Andy Yun explains how he sees DevOps:

In the world of DevOps, an Operations team might utilize a monitoring tool that feeds useful information directly back to Developers and Testers. Developers & Testers may cross-train, so both learn how to effectively write automated unit tests. Developers & Testers could cross-train with Operations, to improve application deployment automation processes.

These examples all share one common theme – teams reaching outside of their traditional skill boundaries, to actively engage, learn, and integrate. This active engagement is what has often been missing from traditional operations.

Andy’s post is a good example of the positive take on DevOps (and the one to which I subscribe).


Migrating Tables Using Powershell

Jana Sattainathan has a script to copy a table and its associated indexes from one database to another:

Recently I got a request from a user who wanted to copy a specific set of tables and their indexes into a new database to ship to the vendor for analysis. The problem was that the DB had thousands of tables (8,748 to be precise). Hunting and pecking for specific tables from that is possible but tedious. Even if I managed to do that, I would still have to manually script out the indexes and run them on the target, as the native “Import/Export Wizard” does not do indexes. It only copies the table structure and data! I am not a big fan of point and click anyway.

My first thought was to see if dbatools had something similar, though a quick glance at the command list says perhaps not yet.


Managing Data Lake Analytics Compute

Yan Li has a three-part series looking at management of Azure Data Lake compute.  First, an overview:

Scenario 2: Set One Specific Group to Different Limits

New members are joining and sharing the same ADLA account. To prevent any new members, who are just learning ADLA, from mistakenly submitting a job that consumes too much compute resource (increasing cost and blocking other jobs), customers want to set the maximum AU per job for new employees at 30 AUs while others can submit jobs with up to 100 AUs.

Default Policy:

  • Job AU limit: 100
  • Priority limit: 1

Exception Policy: New Employee Policy

  • Job AU limit: 30
  • Priority limit: 200
  • Group: New Employee Group

Next up is a look at job-level policies:

With job-level policies, you can control the maximum AUs and the maximum priority that individual users (or members of security groups) can set on the jobs that they submit. This allows you to not only control the costs incurred by your users but also control the impact they might have on high priority production jobs running in the same ADLA account.

There are two parts to a job-level policy:

  • Default Policy: This is the policy that is applied to all users of the service.
  • Exceptions: The set of “exception” policies applies to specific users.

Submitted jobs that do not violate the job-level policies are still subject to the account level policies as described in Azure Data Lake Analytics Account Level Policy.

Finally, account-level policies:

ADLA supports three types of account-level policies:

  • Maximum AUs: controls the maximum number of AUs that can be used by running jobs.
  • Maximum Number of Running Jobs: controls the maximum number of concurrently running jobs.
  • Days to Retain Job Queries: controls how long detailed information about jobs is retained in the user’s ADLS account.

There’s a good amount of information here.


Why Hadoop BI Projects Fail

Remy Rosenbaum lays out several reasons why he’s seen business intelligence projects on Hadoop fail:

In order to set up and run an effective Big Data Hadoop project that provides reliable BI, your organization will need to adopt a new mindset that addresses not only the technology, but also the organizational EIM. You will need to conduct a comprehensive analysis of your business with the help of analysts, internal domain experts, and strategists to come up with robust and relevant business use cases. You will also need buy-in from management, and take company politics into consideration.

Your Big Data project needs to work with your existing BI tools, along with your security and monitoring systems. Data security needs to be addressed because standard Hadoop implementations have relatively poor security, and many organizations are wary of keeping all their data in one location.

I do agree with these reasons, though I’m a bit surprised that I didn’t see much about “classic” BI problems like the inability of the company to standardize on terminology or definitions (e.g., what the Kimball method describes as conformed dimensions), the desire to tackle too much of the problem at once, rapidly-changing source systems (and how BI team members tend to be the last to know that something has changed), etc.


Cochran-Mantel-Haenszel Test

Mala Mahadevan explains the Cochran-Mantel-Haenszel test, with two parts up so far.  First, her data set:

Below is the script to create the table and dataset I used. This is just test data and not copied from anywhere.

Second, an introduction to the test itself and solutions in R and T-SQL:

This test is an extension of the Chi Square test I blogged about earlier. It is applied when we have to compare two groups over several levels, and the comparison may involve a third variable.
Let us consider a cohort study as an example – we have two medications, A and B, to treat asthma. We test them on a randomly selected batch of 200 people. Half of them receive drug A and half of them receive drug B. Some of them in either half develop asthma and some have it under control. The data set I have used can be found here. The summarized results are as below.
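For reference, base R ships this test as mantelhaen.test() in the stats package, which accepts a three-dimensional contingency table (treatment x outcome x stratum). A minimal sketch, using placeholder counts rather than Mala’s actual data set:

# Hypothetical 2 x 2 x 2 table: drug (A/B) x outcome x stratum.
# These counts are illustrative placeholders, not Mala's data.
asthma <- array(
  c(40, 25, 10, 25,   # stratum 1, filled column-wise
    30, 20, 20, 30),  # stratum 2
  dim = c(2, 2, 2),
  dimnames = list(
    Drug    = c("A", "B"),
    Outcome = c("Controlled", "Uncontrolled"),
    Stratum = c("Group1", "Group2")
  )
)

# Cochran-Mantel-Haenszel test across the strata
mantelhaen.test(asthma)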

This series is not yet complete, so stay tuned.
