Press "Enter" to skip to content

Day: May 10, 2016

Distributed Unit Testing

Cloudera shows off their distributed unit testing framework:

This distributed testing infrastructure started out as a Cloudera hackathon project in 2014. Todd Lipcon and I worked on a shared backend for running test tasks on a cluster, with Todd focusing on onboarding the Apache Kudu (incubating) tests, and myself on Apache Hadoop. Our prototype implementation reduced the runtime of the 1,700+ Hadoop unit tests from 8.5 hours to 15 minutes.

Since then, we’ve spent time improving the infrastructure and on-boarding additional projects. Besides Kudu and Hadoop, our distributed testing infrastructure is also being used by our Apache Hive and Apache HBase teams. We can now run all the Hadoop unit tests in less than 10 minutes!

Finally, we’re happy to announce that both our infrastructure and code are public! You can browse the webUI at http://dist-test.cloudera.org and see all the source code (ASLv2 licensed) at the cloudera/dist_test github repository. This infrastructure is already being used at upstream Apache to run the Kudu pre-commit tests.

This is an interesting look at how to scale out unit tests.  It’s a bit of a long read (especially with all the videos) but worth your time.

Comments closed

The Code Behind Power BI Parameters

Chris Webb shows us how to get to the M code used in query parameters:

From this you can see that the value returned by the parameter query is just a single piece of text – it’s the value “Monday” that is set as the Current Value, that’s to say the value returned by the parameter itself. The interesting stuff is all in the metadata record associated with the value. I blogged about metadata here, so you may want to read that post before going any further; it’s pretty clear that the fields in the metadata record correspond to the values set in the UI. All of the fields in the metadata record can be edited in the Advanced Editor if you want.

When the parameter is used in another query it is referenced like any other query value. For example, if you load the DimDate table from the Adventure Works DW sample database and use the parameter above to filter the EnglishDayNameOfWeek column then the code generated in the UI looks like this:

I’m sure that by next month, there will be a half-dozen new things added to this alone, given how fast the Power BI team can push features…

Comments closed

Deny Everything

Kenneth Fisher goes over grant, revoke, and deny for permissions:

This means that MyUser can not run a SELECT statement against any table, view or table valued function in the database.

That probably doesn’t sound like you are applying a permission does it? And that is probably where a lot of the confusion comes in. If, however, we take a look at the system views where the data resides then we can see proof that both commands, GRANT and DENY, add a permission.

Particularly interesting is exactly how the deny permission works—and that “deny” is in fact a “permission” in that you modify a permissions list.

Comments closed

Key Components For A Successful Project

Ginger Grant lists five key components for a successful data analysis project:

Security is an obvious consideration which needs to be addressed up front. Data is a very valuable commodity and only people with appropriate access should be allowed to see it. What steps are going to be employed to ensure that happens? How much administration is going to be required to implement it? These questions need to be answered up front.

I want to extend special thanks to Ginger for putting security as the top item on the list.  Also, this seems like a pretty good set of criteria for most projects, so definitely check it out.

Comments closed

Notes From A Biml User Group

There’s a Biml user group in Amsterdam and Koos van Strien took notes:

  • Historically, Varigence has always given away lots of their work for free, and they’ll continue to do so. There are few (maybe no) companies giving this percentage of their work away for free, without having the barrier set at “if you want to start working really, you need our paid product”)

  • When features are introduced as free, they will stay free forever. Sometimes this means the introduction of features in the free product needs to be postponed to see the complete impact.

  • According to Scott, this is shown in the release of Biml Express: they could’ve easily dropped some features and move it into the paid versions of Biml, but they didn’t. Only added new features.

  • The “free while in beta” announcement on Biml Online is mainly a lawyer thing – you can expect BimlOnline to remain free too.

  • If a good SaaS-model is developed, the tools will all be free. But we’re not there yet…

This sounds like it was a pretty long discussion with Scott Currie and I’m insanely jealous that there’s a Biml user group but it’s nowhere near me…

Comments closed

Waits And Latches

Paul Randal has come out with his comprehensive wait and latch type library:

I present to the community a comprehensive library of all wait types and latch classes that have existed since SQL Server 2005 (yes, it includes 2016 waits and latches).

The idea is that over time, this website will have the following information about all wait types and latch classes:

  • What they mean

  • When they were added

  • How they map into Extended Events (complete for all entries already)

  • Troubleshooting information

  • Example call stacks of where they occur inside SQL Server

  • Email link for feedback and questions

It’s not complete yet, but entries are thorough.

Comments closed

Azure SQL Database Q&A

Julie Koesmarno has a Q&A on Azure SQL Database:

Q: Is there going to be down time when I scale up/down? What’s going to happen to my existing connections?

Extracted from Change the service tier and performance level (pricing tier) of a SQL database:

Note that changing the service tier and/or performance level of a database creates a replica of the original database at the new performance level, and then switches connections over to the replica.No data is lost during this process but during the brief moment when we switch over to the replica, connections to the database are disabled, so some transactions in flight may be rolled back. This window varies, but is on average under 4 seconds, and in more than 99% of cases is less than 30 seconds. Very infrequently, especially if there are large numbers of transactions in flight at the moment connections are disabled, this window may be longer.

The duration of the entire scale-up process depends on both the size and service tier of the database before and after the change. For example, a 250 GB database that is changing to, from, or within a Standard service tier, should complete within 6 hours. For a database of the same size that is changing performance levels within the Premium service tier, it should complete within 3 hours.

Video by Joe Idziorek on Service Tiers and how to scale up and down using Azure Portal is available here.

Read the whole thing.  There are some great questions and answers in this set.

Comments closed

Large Sorts And Hashes

SQL Sasquatch looks at a scenario in which large sorts or hash operations can cause CPU to skew compared to page lookups per second:

The graph above has tempdb footprint (light blue) stacked on top of used query memory (dark blue) against the left vertical axis.  The green period has very limited use of query memory.  During the yellow period, a moderate amount of query memory was used.  During the red period, a large amount of query memory was used and at a number of points operations spilled into tempdb.  As query memory was used more extensively, the CPU:lookups/sec correlation was more disrupted.

Once fully considered, this makes sense: query memory is “stolen” from the database page buffer pool.  References to pages in the page pool are “page lookups”, but each time stolen query memory is poked and prodded… well, that’s not a page lookup.  But it has CPU cost.

Check out the whole thing; this is a thoughtful look at an interesting data oddity.

Comments closed

Real-Time Mapping

Alan Eng shows off some open source tools to visualize data on a map in real time:

Beautiful data visualizations reveal stories that numbers just cannot simply tell. Using visualizations, we can get a sense of scale, speed, direction, and trend of the data. Additionally, we can draw the attention of the audience – the key to any successful presentation – in a way that’s impossible with tabulations. While a tabular view of new online signups is informative for tracking, a dynamic map would provide a more captivating view and reveal dimensions that a table cannot.

Hence, I worked on a map visualization that depicts signups in real time. In this post we will walk through the tools used to construct this map and discuss the technology that allows the frontend to listen and to receive data from the backend. The code should be sufficient for the readers to build their own flavor of the real-time map visualization. Note that I’m not a front-end developer. I did this for the sake of curiosity!

We’ve seen Power BI achieve the same goals (e.g., here and here), but this lets you write some custom code to fit into applications.  On the database side, we tend not to think so much about good internal monitors.  We buy monitoring tools for our databases, but those don’t tell us if our applications are healthy.

Comments closed