Press "Enter" to skip to content

Month: December 2020

Using Scala in a Databricks Notebook

Tomaz Kastrun take a look at the original Spark language:

Let us start with Databricks datasets, that are available within every workspace and are here mainly for test purposes. This is nothing new; both Python and R come with sample datasets. For example the Iris dataset that is available with Base R engine and Seaborn Python package. Same goes with Databricks and sample dataset can be found in /databricks-datasets folder.

Click through for the walkthrough and introduction to Scala as it relates to Apache Spark.

Comments closed

More ETL Antipatterns

Tim Mitchell continues a series on ETL anti-patterns. First up is lazy metadata:

Metadata management in ETL processes can be challenging. When exchanging data between systems and across different mediums (unstructured data, flat files, XML, or relational data), the definition of data types can get blurry. Some sources – including unstructured data and flat files – have little if any metadata configuration. On the other end of the spectrum is relational (RDBMS) data, which strictly defines data types and constraints.

Because of the challenges of this exchange process, many developers opt for an approach that I refer to as lazy metadata. In this design pattern, the ETL pipeline is configured very generically, using little if any metadata definition beyond that which is absolutely necessary for data movement.

Read on to see if you suffer from lazy metadata. If so, ask your doctor if Metatol is right for you.

Next, the lack of error handling logic:

Even the most robust extract-transform-load process will fail at some point. Even if there are no flaws in the ETL code, there are factors beyond the control of that process – network, authentication, and DNS, to name a few – that could break a load. When building ETL load logic, one must think beyond simply what the successful payload should look like, but what will happen if any component of that load fails.

This is infuriating in how common it is, even among people who know better. Err, please don’t look at my ETL processes…

Finally, Tim reminds us that ETL logic is source code:

In most data projects, building the extract-transform-load (ETL) logic takes a significant amount of time. Enterprise ETL processes must do several things well: retrieve enough data to satisfy the business needs, apply any needed transformations to that data, and load it to the destination(s) without interruption to any other business processes. The work that goes into building and validating that ETL logic can be significant, making the resulting code a very valuable asset to the enterprise.

However, in my travels I’ve discovered that there’s a lot of ETL code that doesn’t get the kind of care it deserves. Failing to treat ETL logic as source code can be a costly and time-consuming mistake.

I think much of this comes from SSIS building out giant XML files which were essentially binaries. BIML went a long way toward allowing us to treat ETL with SSIS as a proper language and store code in source control, but even if you’re hand-editing SSIS packages, I agree with Tim completely.

Comments closed

TF-IDF in .NET for Spark, Updated

Ed Elliott has been busy:

Apache Spark has had a machine learning API for quite some time and this has been partially implemented in .NET for Apache Spark.

In this post we will look at how we can use the Apache Spark ML API from .NET. This is the second version of this post, the first version was written before version 1 of .NET for Apache Spark and there was a vital piece of the implementation missing which meant although we could build the model in .NET, we couldn’t actually use it. The necessary functionality is now available and so I am updating the post. To see the previous version go to: https://the.agilesql.club/2020/07/tf-idf-in-.net-for-apache-spark-using-spark-ml/

Read on for more information, as well as a call to action.

Comments closed

Apache Spark Basics in Azure Synapse Analytics

Euan Garden shows off some Apache Spark functionality in Azure Synapse Analytics:

Apache Spark has been a long-time favorite tool amongst data engineers and data scientists; it is well known for handling large scale data processing and complex machine learning workloads.

Azure Synapse Analytics offers a fully managed and integrated Apache Spark experience. By leveraging Apache Spark in Azure Synapse, you can benefit from integrated security, fully managed provisioning, and tight-coupling to other Azure services, such as SQL databases (dedicated and serverless), Azure Key Vault , ADLS Gen2, and Azure Blob Storage as well as fast starting, high performance compute instances.

Click through for the demo.

Comments closed

The Importance of Composite Models

Paul Turley lays out the significance of composite models in Power BI:

There are been many attempts by Microsoft and other vendors to create a data modelling architecture that provides for fast access to cached data, direct access to live data and scaled-out connections to established data models. Remember ROLAP and HOLAP storage in multidimensional cubes? These were great concepts with significant trade-off limitations. No other vendor has anything like this. Way back in the day, Microsoft jumped on the Ralph Kimball bandwagon to promote the idea that a company should have a “one version of the truth” exposed through their data warehouse and cubes or semantic data models. They met customer demand and gave us a BI tool that, in order to bring data together from multiple sources, makes it easy to create a lot of data silos. Arguably, there are design patterns to minimize data duplication but to use governed datasets, self-service report designers are limited to connecting to large, central models that might only be authored and managed by IT. This new feature can restore balance to the force and bring us back to “one version of the truth” again.

Read on for Paul’s early thoughts on the feature.

Comments closed

Small Multiples in Power BI

Chris Webb takes us through a new feature in Power BI:

While the long-awaited small multiples feature that previewed in the December 2020 release is an obvious boost to Power BI’s data visualisation capabilities, did you know that you can use it to improve report performance too?

Earlier this year I wrote blog posts showing how you can improve report performance by showing the same amount of data in fewer visuals (for example by replacing several cards with a single table) and how the number of visuals on a page affects report performance even if they aren’t displaying any data; several other people have written similar posts too. Small multiples are just another way you can replace several visuals with a single visual that displays the same data.

I liked this feature for the visualization improvements, but if you can throw in performance improvements as well, I’m sold.

Comments closed

Correlated Subqueries which Don’t

Daniel Hutmacher gives us an eye test:

The developer wrote this pretty little query to show us which accounts are up for review (which in our case means they have a “30” flag).

SELECT account, balance, 'For review' AS [status]
FROM #accounts WHERE account IN (SELECT account FROM #accountFlags WHERE flag=30) ORDER BY account;

Did you spot it?

I did, but in fairness, I’ve been burned enough times by this that I check for it.

Comments closed

Sync Logins between Availability Group Replicas

Taryn Pratt has a process:

Always On Availability Groups can support up to nine availability replicas, and while we don’t use anywhere near that many replicas in each of our clusters, we do have 2 replicas per cluster (3 servers total), with the replicas being used as a readable secondary.

Since we use readable secondaries in our environments, the application needs to connect to both the primary and the secondary servers with the same login. The catch is, logins don’t automatically sync across replicas. If the logins don’t sync, the application won’t connect to a secondary, which results in login failures.

Read on for one way to solve the problem.

Comments closed