Ed Elliott continues a series on SQL syntax concepts in Spark:
The next example is how to do a CTE (Common Table Expression). When creating the CTE I will also rename one of the columns from “dataType” to “x”.
Read on for the answer.
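As a rough illustration of the idea in the excerpt, here is a minimal sketch of a CTE that renames "dataType" to "x". It assumes Spark is not installed, so Python's built-in sqlite3 module stands in as the SQL engine; the table name columns_info and its rows are hypothetical. The WITH ... AS form is standard SQL, but the linked post remains the reference for the Spark SQL and DataFrame API versions.

```python
import sqlite3

# Hypothetical table and rows, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE columns_info (name TEXT, dataType TEXT)")
conn.executemany(
    "INSERT INTO columns_info VALUES (?, ?)",
    [("id", "int"), ("title", "string"), ("price", "double")],
)

# The CTE exposes dataType under the new name x;
# the outer query refers only to the renamed column.
query = """
WITH cte AS (
    SELECT name, dataType AS x
    FROM columns_info
)
SELECT name, x FROM cte ORDER BY name
"""
cursor = conn.execute(query)
cte_columns = [col[0] for col in cursor.description]
cte_rows = cursor.fetchall()

print(cte_columns)
print(cte_rows)
```

The rename happens inside the CTE, so anything that selects from cte sees only x and never the original column name.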
Ed Elliott has two quick examples of grouping data in Spark:
I have been playing around with the new Azure Synapse Analytics, and I realised that this is an excellent opportunity for people to move to Apache Spark. Synapse Analytics ships with .NET for Apache Spark C# support, so many people will surely try to convert T-SQL code or SSIS code into Apache Spark code. I thought it would be awesome if there were a set of examples of how to do something in T-SQL, then translated into how to do that same thing in Spark SQL and the Spark DataFrame API in C#.
Click through for the first example, GROUP BY.
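For readers who want the SQL starting point in front of them, here is a minimal GROUP BY sketch. It uses Python's built-in sqlite3 module as a stand-in engine rather than Spark, and the sales table and its values are hypothetical; the Spark SQL and C# DataFrame translations are in the linked post.

```python
import sqlite3

# Hypothetical table and rows, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10), ("south", 5), ("north", 7), ("south", 1)],
)

# One output row per region, with a row count and a total per group.
grouped = conn.execute(
    """
    SELECT region, COUNT(*) AS n, SUM(amount) AS total
    FROM sales
    GROUP BY region
    ORDER BY region
    """
).fetchall()

print(grouped)
```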
Ed Elliott shows how to get data and convert it into a Spark DataFrame using .NET:
When I first started working with Apache Spark, one of the things I struggled with was that I would have some variable or data in my code that I wanted to work on with Apache Spark. To get the data in a state that Apache Spark can process it involves putting the data into a DataFrame. How do you take some data and get it into a DataFrame?
This post will cover all the ways to get data into a DataFrame in .NET for Apache Spark.
Click through for several methods.
Sergey Tihon notices a problem:
After reading all three of these samples, I realised that I do not fully understand what the Label column is used for. Later I came to the conclusion that all three samples are most likely incorrect, and here is why.
Click through for a description of the problem as well as the answer.
Tomaz Kastrun provides a few hints when performance tuning Apache Spark code:
DataFrames versus Datasets versus SQL versus RDDs is another choice, yet it is fairly easy. DataFrames, Datasets and SQL objects are all equal in performance and stability (at least from Spark 2.3 and above), meaning that if you are using DataFrames in any language, performance will be the same. Again, when writing custom objects or functions (UDFs), there will be some performance degradation with both R and Python, so switching to Scala or Java might be an optimisation.
Read on for the details. My version is “When performance matters the most, be willing to switch to Scala.” It’s not always correct, but is rarely outright bad advice.
Tomaz Kastrun takes a look at the original Spark language:
Let us start with Databricks datasets, which are available within every workspace and are here mainly for test purposes. This is nothing new; both Python and R come with sample datasets. For example, the Iris dataset is available with the base R engine and the Seaborn Python package. The same goes for Databricks, where sample datasets can be found in the /databricks-datasets folder.
Click through for the walkthrough and introduction to Scala as it relates to Apache Spark.
Mike Bronowski does a thing I don’t want to do:
A while ago I wrote a post about saving Outlook attachments with PowerShell, and that was actually the thing I learned from the topic I want to describe today.
I could not use PowerShell at that moment (security, security), so had to figure it out in the most common scripting language in office – VBA.
Read on to learn how.
Diogo Souza shows off the Suave framework:
F# is the go-to language if you’re seeking functional programming within the .NET world. It is multi-paradigm, flexible, and provides smooth interoperability with C#, which brings even more power to your development stack, but did you know that you can build APIs with F#? Not common, I know, but it’s possible due to the existence of frameworks like Suave.io.
Suave is a lightweight, non-blocking web server. Since it is non-blocking, it means you can create scalable applications that perform way faster than the ordinary APIs. The whole framework was built as a non-blocking organism.
I will shout from the rooftops that data platform developers should learn functional programming. In the .NET space, that’s F#.
Patrick Smacchia looks at two new operators in C#:
C# 8 added the index ^ and range .. operators. In this post I am attempting to demystify both in the most comprehensive way.
Read on for the demos. I’m not sure I like how the range operator is exclusive on the right-hand side, but I suppose it’s just a matter of remembering that in this language, it’s exclusive and in others it can be inclusive.
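To make the exclusive right-hand side concrete, here is a small sketch in Python, whose slices are also half-open. The C# expressions in the comments are the rough array equivalents as I understand them; the list contents and the inclusive_slice helper are hypothetical and only illustrate what an inclusive end would look like.

```python
letters = ["a", "b", "c", "d", "e"]

# Index from the end: roughly C# letters[^1].
last = letters[-1]

# Half-open range: roughly C# letters[1..3].
# Index 1 is included, index 3 is not.
middle = letters[1:3]

# Everything except the last element: roughly C# letters[..^1].
all_but_last = letters[:-1]


def inclusive_slice(seq, start, end):
    """Hypothetical helper: slice with an inclusive end index."""
    return seq[start:end + 1]


print(last, middle, all_but_last)
print(inclusive_slice(letters, 1, 3))
```

One practical upside of the half-open convention is that the length of a range is simply end minus start, and adjacent ranges such as 0..2 and 2..5 do not overlap.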
Kyle Buzzell looks at time series databases:
As the name implies, a time series database (TSDB) makes it possible to efficiently and continuously add, process, and track massive quantities of real-time data with lightning speed and precision. While other database models have been used for these kinds of workloads in the past, TSDBs utilize specific algorithms and architecture to deal with their unique needs.
In this piece, we’ll take a deeper look at time series databases, including the unique needs of the workloads they’re built for, their benefits, common use cases, and the TSDBs out there.
Click through for an overview. Time series databases are definitely a niche product, but they are really good inside that niche.