Category: Misc Languages

AWS Glue Now Supports Scala

Published 2018-01-19 by Kevin Feasel

Mehul Shah, et al, announce that AWS Glue officially supports Scala:

We are excited to announce AWS Glue support for running ETL (extract, transform, and load) scripts in Scala. Scala lovers can rejoice because they now have one more powerful tool in their arsenal. Scala is the native language for Apache Spark, the underlying engine that AWS Glue offers for performing data transformations.

Beyond its elegant language features, writing Scala scripts for AWS Glue has two main advantages over writing scripts in Python. First, Scala is faster for custom transformations that do a lot of heavy lifting because there is no need to shovel data between Python and Apache Spark’s Scala runtime (that is, the Java virtual machine, or JVM). You can build your own transformations or invoke functions in third-party libraries. Second, it’s simpler to call functions in external Java class libraries from Scala because Scala is designed to be Java-compatible. It compiles to the same bytecode, and its data structures don’t need to be converted.

To illustrate these benefits, we walk through an example that analyzes a recent sample of the GitHub public timeline available from the GitHub archive. This site is an archive of public requests to the GitHub service, recording more than 35 event types ranging from commits and forks to issues and comments.

Functional languages tend to be very good for ETL tasks, and Scala is a great choice due to its relationship with Spark.

Comments closed

Breeze: Mathematics In Scala

Published 2017-12-28 by Kevin Feasel

Nitin Aggarwal introduces the mathematics library behind Spark’s machine learning library, MLlib:

In simple terms, Breeze is a Scala library that extends the Scala collection library to provide support for vectors and matrices in addition to providing a whole bunch of functions that support their manipulation. We could safely compare Breeze to NumPy in Python terms. Breeze forms the foundation of MLlib—the Machine Learning library in Spark

Breeze comprises four libraries:

breeze-math: Numerics and Linear Algebra. Fast linear algebra backed by native libraries (via JBlas) where appropriate.
breeze-process: Tools for tokenizing, processing, and massaging data, especially textual data. Includes stemmers, tokenizers, and stop word filtering, among other features.
breeze-learn: Optimization and Machine Learning. Contains state-of-the-art routines for convex optimization, sampling distributions, several classifiers, and DSLs for Linear Programming and Belief Propagation.
breeze-viz: (Very alpha) Basic support for plotting, using JFreeChart.

Read on for samples and basic usage.

Comments closed

The Triumph Of Functional Programming

Published 2017-12-20 by Kevin Feasel

Amanda LeClair and Michael Facemire have a new report on functional programming:

The customer-facing software development world is outgrowing stateful, object-oriented (OO) development. The bar for great, intuitive customer experience has been raised by ambient, conversation-driven user interfaces, like through Amazon Alexa. Functional programming allows enterprises to take better advantage of compute power to deliver those experiences at scale; better flexibility for delivering the right output; and a more efficient way of delivering customer value. FP also reduces regression defects in code, simplifies code creation and maintenance, and allows for greater code reuse.

Just as object-oriented programming (OOP) emerged as the solution to the limitations of procedural programming at the dawn of the internet boom in the mid-’90s, FP is emerging as the solution to the limitations of OOP today. The shift is already underway– 53% of global developers reported that at least some teams in their companies are practicing functional programming and are planning to expand their usage.

Alexey Sommer notes that functional programming has been sneaking into C# bit by bit for well over a decade:

Retrospective

C# 1.0 Visual Studio 2002

C# 1.1 Visual Studio 2003 – #line, pragma, xml doc comments

C# 2.0 Visual Studio 2005 – Generics, Anonymous methods, iterators/yield, static classes

C# 3.0 Visual Studio 2008 – LINQ, Lambda Expressions, Implicit typing, Extension methods

C# 4.0 Visual Studio 2010 – dynamic, Optional parameters and named arguments

C# 5.0 Visual Studio 2012 – async/await, Caller Information, some breaking changes

C# 6.0 Visual Studio 2015 – Null-conditional operators, String Interpolation

C# 7.0 Visual Studio 2017 – Tuples, Pattern matching, Local functions

I strongly believe that if you are a database developer and need to pick up a non-SQL programming language, functional languages will be a lot easier for you to get than object-oriented languages. Many of the principles line up much smoother with functional languages, as you can most clearly see with the relationship between Scala and Spark.

Comments closed

Pipes And More Pipes In R

Published 2017-12-19 by Kevin Feasel

Gabriel (de Selding?) has a tutorial on how to use the various pipes in R:

In F#, the pipe-forward operator |> is syntactic sugar for chained method calls. Or, stated more simply, it lets you pass an intermediate result onto the next function.

Remember that “chaining” means that you invoke multiple method calls. As each method returns an object, you can actually allow the calls to be chained together in a single statement, without needing variables to store the intermediate results.

In R, the pipe operator is, as you have already seen, %>%. If you’re not familiar with F#, you can think of this operator as being similar to the +in a ggplot2 statement. Its function is very similar to that one that you have seen of the F# operator: it takes the output of one statement and makes it the input of the next statement. When describing it, you can think of it as a “THEN”.

Auto-recommended for the F# love, and a good tutorial to boot.

John Mount has a few interesting notes on the topic:

data.table has essentially used the square bracket sequence “][” in a manner equivalent to piping in R since about 2006. Here is an example.
The Bizarro Pipe “->.;” has always been possible in R, though I worked it out and started teaching it in 2016. And it is as quick to type as any other notation once you bind it to a keyboard shortcut.

Read on for the rest of his notes, too.

Comments closed

Error Handling In Scala

Published 2017-12-15 by Kevin Feasel

Manish Mishra gives a few examples of how to handle errors in Scala:

Try[T] is another construct to capture the success or a failure scenarios. It returns a value in both cases. Put any expression in Try and it will return Success[T] if the expression is successfully evaluated and will return Failure[T] in the other case meaning you are allowed to return the exception as a value. However with one restriction that it in case of failures it will only return Throwable types:
def validateZipCode(zipCode:String): Try[Int] = Try(zipCode.toInt)
But Throwing an exception doesn’t make much sense here since it is not much of a calculation. Although we can take this example to understand the use case. If the given string is not a number, it will be a failure. The value from the Try can be extracted in same as Option. It can be matched

As you write more complicated Spark operations, handling errors becomes critical.

Comments closed

Azure Functions Basics

Published 2017-11-28 by Kevin Feasel

Vincent-Philippe Lauzon explains the basics of Azure Functions:

In general, serverless refers to an economical model where we pay for compute resources used as opposed to “servers”.

Wait… isn’t that what the Cloud is about?

Well, yes, on a macro-scale it is, but serverless brings it to a micro-scale.

In the cloud we can provision a VM, for example, run it for 3 hours and pay for 3 hours. But we can’t pay for 5 seconds of compute on a VM because it won’t have time to boot.

A lot of compute services have a “server-full” model. In Azure, for instance, a Web App comes in number of instances. Each instance has a VM associated to it. We do not manage that VM but we pay for its compute regardless of the number of requests it processes.

In a serverless model, we pay for micro-transactions.

This is the first part in a series and is aimed at giving a conceptual explanation.

Comments closed

Building Dynamic Row Headers With ML Services

Published 2017-11-17 by Kevin Feasel

Dave Mason tries to get around his RESULT SETS limitation when using SQL Server Machine Learning Services:

The columns in the data frame clearly have names, but SQL Server isn’t using them. The data frame columns have types in R too (more on this in a moment). Now that makes me wonder about the data types for the data returned by SQL. How is that determined? If SQL isn’t using the column names, can I assume it isn’t making use of the R column types either?

For a point of reference, let’s run some more R code to show the column names and types. As before, the rvest package is used to scrape a web page, with each HTML <table> found becoming a data frame in the “tables” list (line 3). A data frame of table metadata is created by calling data.frame(). The first parameter is a vector of column names (line 4), the second parameter is a vector of column classes (line 5), and the third parameter causes the row “names” to be incrementing digits (line 6).

This is a work in progress as Dave continues his series.

Comments closed

Basics Of Elasticsearch In .NET

Published 2017-10-16 by Kevin Feasel

Ivan Cesar gives us a brief tutorial of the Elasticsearch .NET API:

To be able to search something, we must store some data into ES. The term used is “indexing.”

The term “mapping” is used for mapping our data in the database to objects which will be serialized and stored in Elasticsearch. We will be using Entity Framework (EF) in this tutorial.

Generally, when using Elasticsearch, you are probably looking for a site-wide search engine solution. You will either use some sort of feed or digest, or Google-like search which returns all the results from various entities, such as users, blog entries, products, categories, events, etc.

These will probably not just be one table or entity in your database, but rather, you will want to aggregate diverse data and maybe extract or derive some common properties like title, description, date, author/owner, photo, and so on. Another thing is, you probably won’t do it in one query, but if you are using an ORM, you will have to write a separate query for each of those blog entries, users, products, categories, events, or something else.

Check out Ivan’s tutorial for several examples. Elasticsearch is really good for text-based search and simple aggregations, but it probably shouldn’t be a primary data store for any data you really care about.

Comments closed

Working With CosmosDB

Published 2017-10-11 by Kevin Feasel

Derik Hammer has an introductory article showing how to work with CosmosDB to store and use document-style data:

Querying Cosmos DB is more powerful and versatile. The CreateDocumentQuery method is used to create an IQueryable<T> object, a member of System.Linq, which can output the query results. The ToList() method will output a List<T> object from the System.Collections.Generic namespace.

Derik also shows how to import the data into Power BI and visualize it. It’s a nice article if you’ve never played with CosmosDB before.

Comments closed

Comparing Tree Graphs In SQL

Published 2017-10-09 by Kevin Feasel

Dmitriy Vlasov shows how to compare two trees in PL/SQL:

During the day, various changes are received by the accounting system from the design system. Production planning is based on the data from the accounting system. Conditions allow you to accept all the changes for the day and recalculate the product specification at night. However, as I wrote above, it is unclear how the yesterday state of the product differs from the today one.

I would like to see what was removed from the tree and what was added to it, as well as which part or assembly replaced another one. For example, if an intermediate node was added to the tree branch, it would be wrong to assume that all the downstream elements were removed from the old places and added to the new ones. They remained where they were, but the insert of the mediation node took place. In addition, the element can ‘travel’ up and down only within one branch of the tree due to the specifics of the manufacturing process.

This is Oracle-specific; migrating it to another platform like SQL Server would take a bit of doing.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31