XML In Scala

Mahesh Chand Kandpal shows how to create XPath statements in Scala:

We called the \() on the XML element and asked it to look for all symbol elements.  It returns an instance of scala.xml.NodeSeq, which represents a collection of XML nodes.

The \() method looks only for the elements that are direct descendants of the target element (i.e., symbol). If we want to search through all the elements in the hierarchy starting from the target element, the \\() method is used.
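The child-versus-descendant distinction described above can be sketched outside Scala as well. The following is a minimal illustration using Python's standard xml.etree.ElementTree module, not the scala.xml API from the post; the sample document and its element names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (not from the original post).
DOC = """
<symbols>
  <symbol ticker="AAPL"><units>200</units></symbol>
  <group>
    <symbol ticker="IBM"><units>215</units></symbol>
  </group>
</symbols>
""".strip()

root = ET.fromstring(DOC)


def tickers(elements):
    """Return the ticker attribute of each element, in document order."""
    return [element.get("ticker") for element in elements]


# "symbol" matches only direct children of the root,
# which is roughly the role \ plays in Scala.
direct_children = tickers(root.findall("symbol"))

# ".//symbol" matches symbol elements at any depth below the root,
# which is roughly the role \\ plays in Scala.
all_descendants = tickers(root.findall(".//symbol"))

print(direct_children)
print(all_descendants)
```

The nested symbol inside group shows up only in the second search, which is the behavior the quoted passage attributes to \\().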

Check it out, especially if you’re working with Spark, as you never know when a rogue XML file will head your way.

Domain, Range, And Codomain

Kevin Sookocheff explains the concepts of domain, range, and codomain:

That is, a function relates an input to an output. But not all input values have to work, and not all output values have to be produced. For example, you can imagine a function that only works for positive numbers, or a function that only returns natural numbers. To more clearly specify the types and values of a function's input and output, we use the terms domain, range, and codomain.

Speaking as simply as possible, we can define what can go into a function, and what can come out:

  • domain: what can go into a function

  • codomain: what may possibly come out of a function

  • range: what actually comes out of a function
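A small sketch of the three terms in Python, assuming a squaring function and a deliberately tiny, made-up domain and codomain so that the sets can be printed. None of these names come from the original post.

```python
def square(x: int) -> int:
    """A function from integers to integers."""
    return x * x


# Domain: the inputs we allow into the function (illustrative choice).
domain = {-2, -1, 0, 1, 2}

# Codomain: the values we declare the function may produce (illustrative choice).
codomain = set(range(0, 5))  # {0, 1, 2, 3, 4}

# Range: the values the function actually produces over the domain.
actual_range = {square(x) for x in domain}

# Values the codomain permits but the function never returns.
never_produced = codomain - actual_range

print(sorted(actual_range))
print(sorted(never_produced))
```

The range is always a subset of the codomain. Here it is a strict subset, because 2 and 3 are allowed outputs that no input ever produces.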

Read on for more, including a couple of examples.  These are important concepts for learning functional programming.

Running Cassandra On EC2

Prasad Alle and Provanshu Dey share some tips if you’re running Cassandra on Amazon’s EC2:

Apache Cassandra is a commonly used, high performance NoSQL database. AWS customers that currently maintain Cassandra on-premises may want to take advantage of the scalability, reliability, security, and economic benefits of running Cassandra on Amazon EC2.

Amazon EC2 and Amazon Elastic Block Store (Amazon EBS) provide secure, resizable compute capacity and storage in the AWS Cloud. When combined, you can deploy Cassandra, allowing you to scale capacity according to your requirements. Given the number of possible deployment topologies, it’s not always trivial to select the most appropriate strategy suitable for your use case.

In this post, we outline three Cassandra deployment options, as well as provide guidance about determining the best practices for your use case in the following areas:

  • Cassandra resource overview

  • Deployment considerations

  • Storage options

  • Networking

  • High availability and resiliency

  • Maintenance

  • Security

Click through to see these tips.

The Basics Of Bash: Writing Data

Mark Wilkinson hits us with some basic Bash output management:

If you have experience with PowerShell, some properties of Bash variables will feel familiar. In Bash, variables are denoted with a $ just like in PowerShell, but unlike PowerShell the $ is only needed when they are being referenced. When you are assigning a value to a variable, the $ is left off:

    #!/bin/bash
    set -e
    set -u
    my_var="World"
    printf "Hello ${my_var}\n"

Above we assigned a value to my_var without using the $, but when we then referenced it in the printf statement, we had to use a $. We also enclosed the variable name in curly braces. This is not required in all cases, but it is a good idea to get in the habit of using them. In cases where you are using positional parameters above 9 (we’ll talk about this later) or you are using a variable in the middle of a string, the braces are required, but there is no harm in adding them every time you use a variable in a string.

The basic syntax is pretty familiar to most programming languages, and there’s nothing scary about outputs, even when Mark starts getting into streams.

Using The Power Query SDK

Chris Webb shows how to build M queries in Visual Studio:

Writing M in the Advanced Editor in Excel or Power BI can be a frustrating experience unless you’re the kind of masochist who loves writing code in Notepad. There are some options for writing M code outside Excel and Power BI, for example Lars Schreiber’s M extension for Notepad++ (see here for details) or the M extension for Visual Studio Code (available from the Visual Studio Marketplace here; more details on Brett Powell’s blog here), but the trouble with them is that you have to copy the code back into Excel or Power BI to run it. What many people don’t realise, however, is that it is possible to write M code and have IntelliSense, formatting, keyword highlighting and also the ability to execute your own M queries, using the Power Query SDK in Visual Studio.

The Power Query SDK (which you can download here) supports Visual Studio 2015 and 2017 and is intended for people who are writing custom Data Connectors for Power BI. To let you test your Data Connector you can create a .pq file containing M code, and this in fact allows you to run any M query you want whether you’re building a Data Connector or not.

And then, once you get comfortable with M, start learning F#.  That will allow you to laugh haughtily at those poor object-oriented sods out there.

An Introduction To Splunk

Victoria Holt has some basics on Splunk:

Splunk, a software platform, has the capability to leverage machine data for data management and analytics. It can be used for:

  • Data driven decision making
  • Alerts for network security threats
  • Report on system failures
  • Analyse and improve functionality

It enables performance analysis, dashboard creation, monitoring, troubleshooting and investigation of the real-time data collected. An Edureka learning video showed the Splunk components.

Advanced Splunk queries are still a bit like magic to me, but this is a very powerful service once you get a handle on how it works.

Analyzing Spatial Data With Cosmos DB

Ben Jarvis shows how to query spatial data from Cosmos DB:

The above code connects to Cosmos DB and retrieves the details for the base airfield that was specified. It then calculates the range of the aircraft in meters by multiplying the endurance (in hours) by the true airspeed in knots (nautical miles per hour) and then multiplying that by 1852 (the number of meters in a nautical mile). A Linq query is then run against Cosmos DB using the built-in spatial functions to find airfields within the specified distance. The result is then converted into a JSON array that can be understood by the Google Maps API that is being used on the client side.

The client side uses the Google Maps API to plot the airfields on a map, giving us a view like the one below when given a base airfield of Blackbushe (EGLK), a true airspeed of 100 kts and an endurance of 4.5 hours.
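The range arithmetic in the excerpt is easy to check by hand. The sketch below is in Python rather than the post's .NET code, and it uses a haversine great-circle distance as a stand-in for Cosmos DB's built-in spatial functions, which is an assumption of this example. The airfield names and coordinates are made up; only the 4.5 hours, 100 knots, and 1852 meters per nautical mile come from the excerpt.

```python
import math

METERS_PER_NAUTICAL_MILE = 1852
EARTH_RADIUS_M = 6_371_000  # mean Earth radius, an assumption of this sketch


def range_in_meters(endurance_hours, true_airspeed_knots):
    """Endurance (hours) x airspeed (nautical miles per hour) x meters per nautical mile."""
    return endurance_hours * true_airspeed_knots * METERS_PER_NAUTICAL_MILE


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def airfields_in_range(base, airfields, max_distance_m):
    """Names of the airfields no farther than max_distance_m from the base."""
    return [
        name
        for name, (lat, lon) in airfields.items()
        if haversine_m(base[0], base[1], lat, lon) <= max_distance_m
    ]


# Values from the excerpt: 4.5 hours of endurance at 100 knots.
max_range = range_in_meters(4.5, 100)

# Hypothetical base and airfields on the equator, 1 and 10 degrees of longitude away.
base = (0.0, 0.0)
airfields = {"NEAR": (0.0, 1.0), "FAR": (0.0, 10.0)}
reachable = airfields_in_range(base, airfields, max_range)

print(max_range)
print(reachable)
```

With the excerpt's figures the range works out to 833,400 meters (450 nautical miles), so only the hypothetical airfield one degree away falls inside it.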

Click through for .NET code to load and analyze the data.

Bash For The PowerShell-Minded

Mark Wilkinson has started a new series on Bash.  His first post is an introduction to the scripting language:

Bash (the Bourne Again Shell) was created in 1989 for the GNU Project as a free replacement for the Unix Bourne shell. Most modern Linux systems use Bash as their default command line shell, so if you have ever dropped to a command line on a Linux system, you have probably used Bash. Just like PowerShell, Bash is both a scripting language and a command shell/interpreter. So not only can you execute commands in an interactive shell session, but you can also write scripts that incorporate multiple commands.

Once you get your hands dirty with Bash you’ll notice a lot of features that were incorporated into PowerShell. Things like command substitution: $(Get-Date) were directly pulled from Bash $(date). Other features will look familiar as well, like the ability to pipe multiple commands together.

One thing you need to understand right away is that Bash is string based, not object based like PowerShell. This means you’ll find yourself doing a lot more string processing to get tasks done. Things like string splitting will be much more common. Bash does support objects, like arrays, but few if any commands output an array. As we go through this series you’ll see that this might not be as limiting as it sounds.

The best part about learning Bash is that you can then get into arguments about Bash vs ksh vs zsh.

A Definition Of Functional Programming

Kevin Sookocheff contrasts functional programming with its imperative cousin:

Functional programming is a form of declarative programming that expresses a computation directly as pure functional transformation of data. A functional program can be viewed as a declarative program where computations are specified as pure functions.
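To make the contrast concrete, here is a minimal Python sketch (the function names and sample data are invented for illustration). Both versions compute the sum of the squares of the even numbers in a list: the first spells out the steps and mutates an accumulator, while the second composes pure functions over the data.

```python
def sum_even_squares_imperative(numbers):
    """Imperative style: explicit loop, branch, and a mutated accumulator."""
    total = 0
    for n in numbers:
        if n % 2 == 0:
            total += n * n
    return total


def is_even(n):
    """Pure predicate: the result depends only on the argument."""
    return n % 2 == 0


def square(n):
    """Pure transformation with no side effects."""
    return n * n


def sum_even_squares_functional(numbers):
    """Functional style: filter, then map, then reduce with sum."""
    return sum(map(square, filter(is_even, numbers)))


data = [1, 2, 3, 4, 5, 6]
imperative_result = sum_even_squares_imperative(data)
functional_result = sum_even_squares_functional(data)

print(imperative_result, functional_result)
```

The functional version reads much like a SQL query: a WHERE clause (filter), a projection (map), and an aggregate (sum).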

I think that if you’re a set-based SQL developer, functional programming languages will make the most intuitive sense.  They’re a bit harder to wrap your mind around if you’ve grown up as an imperative C-style developer, but are still worth the effort.

AWS Glue Now Supports Scala

Mehul Shah, et al., announce that AWS Glue officially supports Scala:

We are excited to announce AWS Glue support for running ETL (extract, transform, and load) scripts in Scala. Scala lovers can rejoice because they now have one more powerful tool in their arsenal. Scala is the native language for Apache Spark, the underlying engine that AWS Glue offers for performing data transformations.

Beyond its elegant language features, writing Scala scripts for AWS Glue has two main advantages over writing scripts in Python. First, Scala is faster for custom transformations that do a lot of heavy lifting because there is no need to shovel data between Python and Apache Spark’s Scala runtime (that is, the Java virtual machine, or JVM). You can build your own transformations or invoke functions in third-party libraries. Second, it’s simpler to call functions in external Java class libraries from Scala because Scala is designed to be Java-compatible. It compiles to the same bytecode, and its data structures don’t need to be converted.

To illustrate these benefits, we walk through an example that analyzes a recent sample of the GitHub public timeline available from the GitHub archive. This site is an archive of public requests to the GitHub service, recording more than 35 event types ranging from commits and forks to issues and comments.

Functional languages tend to be very good for ETL tasks, and Scala is a great choice due to its relationship with Spark.

