Month: October 2017

Inline U-SQL Functions

Damien Widera shows us how to write inline functions in U-SQL:

Now let’s go to the new thing – undocumented usage of inline functions. My function is pretty simple, and I can imagine that the functions you write could be as simple as mine, but they will probably do something more useful. To simplify the coding process, you could use an inline function in your U-SQL script and not have to write any code in the C# file.

The code could look like this:

DECLARE @in string = "/Data/Aircraft/2006ByMonth/{*}.tsv";
DECLARE @out string = "/Data/Aircraft/2006ByMonth/out/CSharpFunction.tsv";
DECLARE @func Func<int,int> = (s)=>{return s+1;};

Things which make languages look more like functional languages generally get a thumbs up from me.

Copying Azure Data Lake Databases

Yanan Cai shows how to copy Azure Data Lake databases for local debugging and development:

The concept of a database is used to group related data structures and functions together. ADLA users have databases in their production environment that contain tables, assemblies, table-valued functions and other objects. Previously, when developing and tuning U-SQL queries on a local machine, developers would have to manually recreate everything in their production database. After coding they would have to identify any changes to the database and then update the production account’s database. This process took extra time and introduced errors without adding any value.

Using the Export Wizard, developers can clone the existing database environment and sample data directly to the local account. Developers can also choose to export only parts of the database to the local database. Follow the steps below to export your U-SQL databases.

Click through for the step-by-step process.

More R Services Internals Spelunking

Niels Berglund continues his series on R Services internals:

What happens is that straight after the AuthenticateConnection you will hit the WriteAsync breakpoint twice, the same way as we see in Code Snippet 4. The first hit at WriteAsync sort of makes sense, as it ties up with the call-stack we see in Code Snippet 5. But what about the second WriteAsync (the one that causes the second package to be sent), where does that come from? To try to figure that out, we start with the call-stack for that particular WriteAsync.

Execute the code in Code Snippet 1 again, and continue to the second WriteAsync. When you hit the breakpoint, do a kc again. The call-stack should now look somewhat like this (this time it is the full call-stack):

It’s interesting the kind of stuff you can find with Wireshark and a debugger.

External Memory Pressure With SQL On Linux

Anthony Nocentino explains how SQL Server on Linux reacts to memory pressure:

We can use tools like ps, top, and htop to look at our virtual and physical memory allocations. We can also look in the /proc virtual file system for our process and look at the status file. In there we’ll find the point-in-time status of a process and, most importantly, the types of memory allocations for the process. Here are the interesting values in the status file we’re going to focus on today.

  • VmSize – total current virtual address space of the process

  • VmRSS – total amount of physical memory currently allocated to the process

  • VmSwap – total amount of virtual memory currently paged out to the swap file (disk)

The differences are going to be interesting for people to troubleshoot later, particularly if you look at SOS_SCHEDULER_YIELD and give a knee-jerk reaction that the problem is with CPU.
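
If you want to poke at those values yourself, here’s a minimal Python sketch (my own, not from the post) that pulls VmSize, VmRSS, and VmSwap out of /proc for a given process ID; the pid below is a placeholder:

def memory_status(pid):
    # Read the point-in-time status file for the process and keep only the
    # virtual size, resident set size, and swapped-out totals.
    wanted = ("VmSize", "VmRSS", "VmSwap")
    values = {}
    with open(f"/proc/{pid}/status") as status_file:
        for line in status_file:
            key, _, rest = line.partition(":")
            if key in wanted:
                values[key] = rest.strip()   # e.g. "1234567 kB"
    return values

# Substitute the SQL Server process id, e.g. the output of `pgrep sqlservr`.
print(memory_status(1234))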

PowerShell’s Switch Statement

Shane O’Neill shows us PowerShell’s switch statement:

We send out notification emails if jobs fail as well, but hey, we’re DBAs, we like a backup!

Since this used to be done manually, and since I like PowerShell, I created separate functions that can take individual log files and parse out the needed data in a nice, easy format.

My problem, and the reason for this post, was figuring out, based on the name of the file in a directory, which function to call…

Click through to see how to use the switch statement, as well as how to switch on an expression.
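
PowerShell’s switch is the natural fit here; purely as a rough analogue, the same dispatch-on-filename idea looks something like this in Python (the file-name patterns and parser names are invented for illustration):

from fnmatch import fnmatch
from pathlib import Path

def parse_backup_log(path):
    print(f"parsing backup log {path}")

def parse_restore_log(path):
    print(f"parsing restore log {path}")

# Ordered (wildcard pattern, handler) pairs -- the first match wins,
# much like the cases of a switch statement.
HANDLERS = [
    ("*backup*.log", parse_backup_log),
    ("*restore*.log", parse_restore_log),
]

def dispatch(path):
    name = Path(path).name.lower()
    for pattern, handler in HANDLERS:
        if fnmatch(name, pattern):
            return handler(path)
    raise ValueError(f"no parser registered for {name}")

dispatch("/logs/NightlyBackup_20171013.log")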

Page Ranking With Kafka Streams

Hunter Kelly walks through a page ranking algorithm:

Once you have the adjacency matrix, you perform some straightforward matrix calculations to produce a vector of Hub scores and a vector of Authority scores, as follows:

  • Sum across the columns and normalize; this becomes your Hub vector
  • Multiply the Hub vector element-wise across the adjacency matrix
  • Sum down the rows and normalize; this becomes your Authority vector
  • Multiply the Authority vector element-wise down the adjacency matrix
  • Repeat

An important thing to note is that the algorithm is iterative: you perform the steps above until eventually you reach convergence (that is, the vectors stop changing) and you’re done. For our purposes, we just pick a set number of iterations, execute them, and then accept the results from that point. We’re mostly interested in the top entries, and those tend to stabilize pretty quickly.

This is an architectural-level post, so there’s no code, but there is a useful discussion of the algorithm.
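
The iteration described above is essentially the classic HITS hub/authority loop; here’s a small NumPy sketch of that loop (my own paraphrase of the steps, not the post’s implementation):

import numpy as np

def hits(adjacency, iterations=20):
    # adjacency[i, j] == 1 means page i links to page j.
    n = adjacency.shape[0]
    hubs = np.ones(n)
    authorities = np.ones(n)
    for _ in range(iterations):
        # A page's authority score sums the hub scores of the pages linking to it.
        authorities = adjacency.T @ hubs
        authorities /= np.linalg.norm(authorities)
        # A page's hub score sums the authority scores of the pages it links to.
        hubs = adjacency @ authorities
        hubs /= np.linalg.norm(hubs)
    return hubs, authorities

# Toy graph: page 0 links to pages 1 and 2, page 1 links to page 2.
adjacency = np.array([[0, 1, 1],
                      [0, 0, 1],
                      [0, 0, 0]], dtype=float)
print(hits(adjacency))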

Stateful Processing In Spark Streaming

Bill Chambers and Jules Damji look at a couple of stateful scenarios within Spark Streaming:

No streaming events are free of duplicate entries. Dropping duplicate entries in record-at-a-time systems is imperative, and often a cumbersome operation, for a couple of reasons. First, you’ll have to process small or large batches of records at a time to discard them. Second, some events, because of high network latencies, may arrive out-of-order or late, which may force you to reiterate or repeat the process. How do you account for that?

Structured Streaming, which ensures exactly-once semantics, can drop duplicate messages as they come in, based on arbitrary keys. To deduplicate data, Spark will maintain a number of user-specified keys and ensure that duplicates, when encountered, are discarded.

Just as other stateful processing APIs in Structured Streaming are bounded by declaring watermarking for late data semantics, so is dropping duplicates. Without watermarking, the maintained state can grow infinitely over the course of your stream.

In this scenario, you would still want some sort of de-duplication code at the far end of your process if you can never allow duplicates across the lifetime of the application; this sounds like it’s more about preventing bursty duplicates from sensors.
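
For reference, the watermark-plus-dropDuplicates pattern they describe looks roughly like this in PySpark; the source path, schema, and column names here are invented for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("dedup-sketch").getOrCreate()

schema = StructType([
    StructField("device_id", StringType()),
    StructField("event_time", TimestampType()),
])

events = (spark.readStream
          .schema(schema)            # streaming file sources need an explicit schema
          .json("/tmp/events"))      # placeholder input path

deduped = (events
           .withWatermark("event_time", "10 minutes")       # bound how much dedup state is kept
           .dropDuplicates(["device_id", "event_time"]))    # discard repeats on these keys

query = (deduped.writeStream
         .format("console")
         .outputMode("append")
         .start())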

Data Lake Zones

Shannon Lowder walks us through a multi-zone approach to storing data in a data lake:

Our first zone is the raw zone. This zone will serve as the landing point for source files. Like the extract (or stage) schema in our data warehouse, we want these files to match the source system as closely as possible. In the data lake, we actually go one step further: not only do we want the schema of our raw files to match the source system, we also want these files to be immutable.

Immutable means that once they are written to the raw folder, we shouldn’t be able to modify or delete them. That way, we can always reconstruct different states from these files without having to retrieve them from the source system.
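
To make the immutability rule concrete, here’s a tiny Python sketch (the paths are invented) that lands a file in the raw zone only once and strips write access afterwards:

import shutil
from pathlib import Path

def land_in_raw(source_file, raw_zone="/datalake/raw"):
    target = Path(raw_zone) / Path(source_file).name
    if target.exists():
        # Raw files are never modified or overwritten once landed.
        raise FileExistsError(f"{target} has already been landed")
    Path(raw_zone).mkdir(parents=True, exist_ok=True)
    shutil.copy2(source_file, target)
    target.chmod(0o444)   # read-only, so the file can't be changed in place
    return target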

Worth reading the whole thing.

Multi-Parameter Website Scraping With Power Query

Callum Green shows how to build up a URL based on multiple parameters, scraping data from a page for each combination of parameter values:

The sections highlighted in red are the parameters and sit between pieces of the hard-coded URL text.

Code Breakdown:

  • Text = http://www.boxofficemojo.com/monthly/?page=
  • Parameter = [Page]
  • Text = &view=calendargross&yr=
  • Parameter = [Year]
  • Text = &month=
  • Parameter = [Month]
  • Text = &p=.htm

This is a rather clever solution, and if your parameters are functionally dependent (unlike this example, where it was a simple cross join of the three domains), you can still use the solution the same way; you just need to populate your parameter combination table differently.
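
The same cross join of parameters is easy to see in a quick Python sketch (the parameter ranges below are made up; the post drives them from Power Query tables):

from itertools import product

URL_TEMPLATE = ("http://www.boxofficemojo.com/monthly/?page={page}"
                "&view=calendargross&yr={year}&month={month}&p=.htm")

pages = [1, 2]
years = [2015, 2016, 2017]
months = range(1, 13)

# One URL per (page, year, month) combination -- a simple cross join of the domains.
urls = [URL_TEMPLATE.format(page=p, year=y, month=m)
        for p, y, m in product(pages, years, months)]

print(len(urls), "URLs, for example:", urls[0])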
