Press "Enter" to skip to content

Category: Misc Languages

Grouping Data with Spark

Ed Elliott has two quick examples of grouping data in Spark:

I have been playing around with the new Azure Synapse Analytics, and I realised that this is an excellent opportunity for people to move to Apache Spark. Synapse Analytics ships with .NET for Apache Spark C# support many people will surely try to convert T-SQL code or SSIS code into Apache Spark code. I thought it would be awesome if there were a set of examples of how to do something in T-SQL, then translated into how to do that same thing in Spark SQL and the Spark DataFrame API in C#.

Click through for the first example, GROUP BY.

Leave a Comment

Loading a Spark DataFrame in .NET

Ed Elliott shows how to get data and convert it into a Spark DataFrame using .NET:

When I first started working with Apache Spark, one of the things I struggled with was that I would have some variable or data in my code that I wanted to work on with Apache Spark. To get the data in a state that Apache Spark can process it involves putting the data into a DataFrame. How do you take some data and get it into a DataFrame?

This post will cover all the ways to get data into a DataFrame in .NET for Apache Spark.

Click through for several methods.

Leave a Comment

Apache Spark Performance Tuning

Tomaz Kastrun provides a few hints when performance tuning Apache Spark code:

DataFrame versus Datasets versus SQL versus RDD is another choice, yet it is fairly easy. DataFrames, Datasets and SQL objects are all equal in performance and stability (at least from Spar 2.3 and above), meaning that if you are using DataFrames in any language, performance will be the same. Again, when writing custom objects of functions (UDF), there will be some performance degradation with both R or Python, so switching to Scala or Java might be a optimisation.

Read on for the details. My version is “When performance matters the most, be willing to switch to Scala.” It’s not always correct, but is rarely outright bad advice.

Comments closed

Using Scala in a Databricks Notebook

Tomaz Kastrun take a look at the original Spark language:

Let us start with Databricks datasets, that are available within every workspace and are here mainly for test purposes. This is nothing new; both Python and R come with sample datasets. For example the Iris dataset that is available with Base R engine and Seaborn Python package. Same goes with Databricks and sample dataset can be found in /databricks-datasets folder.

Click through for the walkthrough and introduction to Scala as it relates to Apache Spark.

Comments closed

Creating an app with Suave and F#

Diogo Souza shows off the Suave framework:

F# is the go-to language if you’re seeking functional programming within the .NET world. It is multi-paradigm, flexible, and provides smooth interoperability with C#, which brings even more power to your development stack, but did you know that you can build APIs with F#? Not common, I know, but it’s possible due to the existence of frameworks like Suave.io.

Suave is a lightweight, non-blocking web server. Since it is non-blocking, it means you can create scalable applications that perform way faster than the ordinary APIs. The whole framework was built as a non-blocking organism.

I will shout from the rooftops that data platform developers should learn functional programming. In the .NET space, that’s F#.

Comments closed

An Intro to Time Series Databases

Kyle Buzzell looks at time series databases:

As the name implies, a time series database (TSDB) makes it possible to efficiently and continuously add, process, and track massive quantities of real-time data with lightning speed and precision. While other database models have been used for these kinds of workloads in the past, TSDBs utilize specific algorithms and architecture to deal with their unique needs.

In this piece, we’ll take a deeper look at time series databases, including the unique needs of the workloads they’re built for, their benefits, common use cases, and the TSDBs out there.

Click through for an overview. Time series databases are definitely a niche product, but they are really good inside that niche.

Comments closed