Press "Enter" to skip to content

Day: July 17, 2020

Sparklyr 1.3 Released

Yitao Li announces sparklyr 1.3:

sparklyr 1.3 is now available on CRAN, with the following major new features:

Higher-order Functions to easily manipulate arrays and structs
– Support for Apache Avro, a row-oriented data serialization framework
Custom Serialization using R functions to read and write any data format
Other Improvements such as compatibility with EMR 6.0 & Spark 3.0, and initial support for Flint time series library

Between this and the work from the Spark side, we are seeing some nice quality of life improvements for Spark and R.

Comments closed

Case-Insensitive Searches in Cosmos DB

Hasan Savran gives us a few techniques for case-insensitive searching in Azure Cosmos DB:

Data gets saved as JSON in Azure Cosmos DB. JSON documents supports string, numeric and Boolean data types. Because of limited data types, you might end up picking string data type for most of your data. You may need to use Cosmos DB’s string functions to help you in many situations. For example; you might need to find names start with bunch of characters or you might need to find data ends like a word or bunch of characters. You may need to find all data that has address contains a street name. What do you do in these cases? Azure Cosmos DB has couple of system functions to help you, I will focus on the following functions in this post. These are the functions just got updated in June 2020.

Click through to learn a bit about those money-saving updates.

Comments closed

Comparing Integration Services and Azure Data Factory

Tim Mitchell compares SQL Server Integration Services to Azure Data Factory:

For the better part of 15 years, SQL Server Integration Services (SSIS) has been the go-to enterprise extract-transform-load (ETL) tool for shops running on Microsoft SQL Server. More recently, Microsoft added Azure Data Factory (ADF) to its stable of enterprise ETL tools. In this post, I’ll be comparing SSIS and Azure Data Factory to share how they are alike and how they differ. I’ll also review the strengths and shortcomings of each, including the architectures in which each of these is likely to do well.

Read on for Tim’s thoughts on the subject. Tim lays out his biases up-front but also gives you a good feel for where both products are in their lifecycles.

Comments closed

Getting Started with Redshift

Rahul Mehta has a primer on AWS Redshift:

AWS Redshift is a columnar data warehouse service on AWS cloud that can scale to petabytes of storage, and the infrastructure for hosting this warehouse is fully managed by AWS cloud. Redshift operates in a clustered model with a leader node, and multiple worked nodes, like any other clustered or distributed database models in general. It is based on Postgres, so it shares a lot of similarities with Postgres, including the query language, which is near identical to Structured Query Language (SQL). This Redshift supports creating almost all the major database objects like Databases, Tables, Views, and even Stored Procedures. In this article, we will explore how to create your first Redshift cluster on AWS and start operating it.

I’m not really the biggest fan of Redshift around, but Rahul does a good job walking us through the basics of the product.

Comments closed

Handling Time Zones in Power BI

Teo Lachev has some tips on working with time zones in Power BI:

A client has Power BI models connected to Dynamics Online. Dynamics stores all dates in UTC instead of keeping the time offset, such as 7/14/2020 1:21:29 AM +00:00. Naturally, the users want to see dates localized to the US Eastern Time zone. Easy, right? Use the Power Query ToLocal time transformation (in the Transform ribbon, expand Time, and then click To Local) to offset with the desired number of hours. But there are a few issues with this approach:

Read on to understand why this approach won’t cut it, and to get one which will.

Comments closed

Incremental Pipline Development with Azure Data Factory

Andy Leonard shows how you can incrementally develop Azure Data Factory pipelines:

A friend pinged me recently to ask about rolling back Azure Data Factory (ADF) pipeline versions. My response was a question: Are you using source control with ADF? That did not help the current situation.

I thought of the way I often build ADF pipelines and shared my methodology, which is relatively simple (it has to be simple for me to understand it!):

Click through for Andy’s approach.

Comments closed

Getting Details by the Max Value of a Column in Power BI

Ed Hansberry solves a problem in a world without window functions:

When you group data in Power Query, you are able to do a number of aggregations, like count, sum, min, max, etc. The problem is each of those aggregations only operate on that column. Let me show you what I mean. Suppose you have this data:

You need to return a table to the DAX model (or an Excel table) that only has one record per product, and you want the latest date, and the sales person associated with that date. So for product A, it is June 18, 2020 and Suresh, for B it is June 14, 2020 and Ana, and so on.

In the SQL world, this is where a combination of common table expression and ROW_NUMBER() could get the top record partitioned by product ordered by date descending. Click through for Ed’s DAX-based solutions.

Comments closed