Press "Enter" to skip to content

Author: Kevin Feasel

MLflow on Databricks Community Edition

Jules Damji and Siddharth Murching have an interesting announcement:

Today, we are excited to extend Databricks Community Edition with hosted MLflow for free, as part of our ongoing commitment to help developers learn about machine learning lifecycle. With the Community Edition, you can try tutorials that demonstrate how to track results and experiments as you build machine learning models—a crucial stage in the machine learning model’s development lifecycle.

MLflow is an open-source platform for the machine learning lifecycle with four components: MLflow Tracking, MLflow Projects, MLflow Models, and MLflow Registry. MLflow is now included in Databricks Community Edition, meaning that you can utilize its Tracking and Model APIs within a notebook or from your laptop just as easily as you would with managed MLflow in Databricks Enterprise Edition.

I like showing off Databricks Community Edition, and I’m glad to see them extend it a bit.
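
If you haven't used the Tracking API before, it's only a few calls. Here's a minimal sketch in Python; the experiment path, parameter, and metric values are invented for illustration, and inside a Databricks notebook the tracking URI is already configured for you.

```python
import mlflow

# On Databricks, tracking is preconfigured; elsewhere you would point
# the client at a server first via mlflow.set_tracking_uri(...).
mlflow.set_experiment("/Users/you@example.com/demo")  # illustrative path

with mlflow.start_run():
    # Log hyperparameters and evaluation metrics as you train.
    mlflow.log_param("alpha", 0.5)
    mlflow.log_metric("rmse", 0.78)
```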

Streaming ETL of Rail Data with Kafka

Robin Moffatt has an interesting architecture and implementation for Kafka:

Trains are an excellent source of streaming data—their movements around the network are an unbounded series of events. Using this data, Apache Kafka® and Confluent Platform can provide the foundations for both event-driven applications as well as an analytical platform. With tools like KSQL and Kafka Connect, the concept of streaming ETL is made accessible to a much wider audience of developers and data engineers. The platform shown in this article is built using just SQL and JSON configuration files—not a scrap of Java code in sight.

The code is also available in a GitHub repo.
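
To make "not a scrap of Java code" concrete: the ingestion side of a platform like this is typically just a connector definition handed to Kafka Connect over its REST API. A hedged sketch in Python (the connector class, topic, and endpoint are placeholders, not the exact configuration from Robin's repo):

```python
import requests

# A source connector is just JSON configuration; Kafka Connect's REST
# API (port 8083 by default) does the rest. Values are placeholders.
connector = {
    "name": "train-movements-source",
    "config": {
        "connector.class": "io.confluent.connect.activemq.ActiveMQSourceConnector",
        "kafka.topic": "train_movements",
        "activemq.url": "tcp://feeds.example.com:61619",
    },
}

resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()
```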

Determining Your Isolation Level

Bob Pusateri explains different transaction isolation levels and where you can figure out your current settings:

Isolation level settings for the entire database exist only as a default. All connections created will inherit the database’s isolation level. Furthermore, a database only has two possible isolation levels:
– read committed
– read committed snapshot (also known as “RCSI”)

There are four additional isolation levels (read uncommitted, repeatable read, serializable, snapshot) which can only be selected at the connection or statement level. In SQL Server and Azure SQL Database Managed Instance, the default database isolation level is read committed. In Azure SQL Database, the default is read committed snapshot.

Read the whole thing.
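
To see where your own databases and sessions stand, the settings are exposed through system views. A quick sketch using Python and pyodbc (the connection string is a placeholder):

```python
import pyodbc

# Placeholder connection string; point it at your own instance.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=master;Trusted_Connection=yes;"
)
cursor = conn.cursor()

# Database-level default: read committed vs. read committed snapshot.
cursor.execute(
    "SELECT name, is_read_committed_snapshot_on FROM sys.databases;"
)
for name, rcsi in cursor.fetchall():
    print(name, "read committed snapshot" if rcsi else "read committed")

# Per-session levels: 1 = read uncommitted, 2 = read committed,
# 3 = repeatable read, 4 = serializable, 5 = snapshot.
cursor.execute(
    "SELECT session_id, transaction_isolation_level "
    "FROM sys.dm_exec_sessions WHERE is_user_process = 1;"
)
```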

Upgrading Servers in an Availability Group

Thomas Rushton has a checklist for upgrading servers connected by an Availability Group:

This is a checklist put together and followed for an upgrade of a pair of physical SQL Server 2012 servers which hosted a single availability group of several terabytes of data with minimal downtime.

The availability group was configured with synchronous commit and automatic failover.

Minimizing downtime here is great, but it’s not automatic: you still need to do work on your end to get this right.
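
One concrete example of that work: before any planned failover in the upgrade sequence, you'd confirm the secondary databases are fully synchronized. A sketch of that check (connection details are placeholders; the DMV is real):

```python
import pyodbc

# Placeholder connection string; run this against the primary replica.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=ag-primary;DATABASE=master;Trusted_Connection=yes;"
)
cursor = conn.cursor()

# With synchronous commit, every database should show SYNCHRONIZED
# before you fail over; anything else means more waiting.
cursor.execute(
    "SELECT DB_NAME(database_id) AS db, synchronization_state_desc "
    "FROM sys.dm_hadr_database_replica_states;"
)
for db, state in cursor.fetchall():
    print(db, state)
```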

Making the SQL Server 2019 Edition Cut

Brent Ozar noticed an update to Microsoft’s “what’s in which edition” document:

The most important stuff:
– Standard Edition is still capped at 128GB RAM.
– Accelerated Database Recovery is in Standard Edition.
– Automatic tuning, batch mode for row store, adaptive memory grants, adaptive joins, and memory-optimized TempDB are Enterprise Edition only.

Click through for more of Brent’s thoughts. Of interest to me is that PolyBase control nodes may now run in Standard Edition. It makes me wonder if they’ll cap that feature somehow, where you get N nodes maximum in a Standard Edition scale-out group, or if it’s uncapped like Enterprise Edition.
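
If you're not sure which edition (and therefore which feature set) a given instance is running, SERVERPROPERTY will tell you; a quick sketch, again with a placeholder connection string:

```python
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=master;Trusted_Connection=yes;"
)
cursor = conn.cursor()
cursor.execute(
    "SELECT SERVERPROPERTY('Edition'), SERVERPROPERTY('ProductVersion');"
)
edition, version = cursor.fetchone()
print(edition, version)
```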

AGs and Redo Latency

David Fowler explains that even in synchronous mode, there can still be redo latency for Availability Groups:

There’s a misconception that if you have a synchronous replica, that replica has to commit any transactions that are passed to it by the primary before the primary can then commit those same transactions. This ensures that there is no possibility of data loss and that the data would always be available on the secondary.

With this misconception comes our problem.

Read on to understand the full issue.
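
The short version: synchronous commit guarantees the log is hardened on the secondary, not that it has been redone there yet. You can watch the redo queue yourself with a DMV query; a sketch with a placeholder connection string:

```python
import pyodbc

# Placeholder connection string; query the secondary replica.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=ag-secondary;DATABASE=master;Trusted_Connection=yes;"
)
cursor = conn.cursor()

# redo_queue_size (KB) is log received but not yet applied; while it
# is nonzero, readable-secondary queries can see stale data even
# under synchronous commit.
cursor.execute(
    "SELECT DB_NAME(database_id) AS db, redo_queue_size, redo_rate "
    "FROM sys.dm_hadr_database_replica_states;"
)
for db, queue_kb, rate_kb_s in cursor.fetchall():
    print(db, queue_kb, rate_kb_s)
```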

Scala Views

Girish Bharti takes us through a performance-tuning technique in Scala:

We all know the power of lazy variables in Scala programming. If you are developing the application with huge data then you must have worked with the Scala collections. Some mostly used collections are List, Seq, Vector, etc. Similarly, you must be aware of the power of Streams. The streams are a very powerful tool for handling the infinite flow of data and streams are powerful because of their lazy transformations. As we know most of the Scala collections are strict, so applying an operation on immutable collections creates a new collection. The size of the collection can be huge in the big data world. So, what if you have to apply a lot of transformations to the collection? Is there a way to handle collections in a lazy way? What if you can find a way to apply operations on your usual collections lazily? In this blog, we will be talking about the Scala views and how to use them.

Read the whole thing.
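
If you don't write Scala, the underlying idea may be easier to see in Python, where generator pipelines play a similar role to Scala views: nothing is computed until elements are demanded, and no intermediate collections are materialized. A rough analogue (not Scala, and not from Girish's post):

```python
from itertools import islice

# Strict: each step materializes a full intermediate list.
doubled = [x * 2 for x in range(1_000_000)]
filtered = [x for x in doubled if x % 3 == 0]
first_ten = filtered[:10]

# Lazy (view-like): generators defer all work until elements are
# pulled, so only the ten needed elements are ever computed.
lazy = (x for x in (x * 2 for x in range(1_000_000)) if x % 3 == 0)
first_ten_lazy = list(islice(lazy, 10))
```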

Delta Lake to Become an Open Standard

Michael Armbrust and Reynold Xin have exciting news about Delta Lake:

At today’s Spark + AI Summit Europe in Amsterdam, we announced that Delta Lake is becoming a Linux Foundation project. Together with the community, the project aims to establish an open standard for managing large amounts of data in data lakes. The Apache 2.0 software license remains unchanged.

Delta Lake focuses on improving the reliability and scalability of data lakes. Its higher level abstractions and guarantees, including ACID transactions and time travel, drastically simplify the complexity of real-world data engineering architecture. Since we open sourced Delta Lake six months ago, we have been humbled by the reception. The project has been deployed at thousands of organizations and processes exabytes of data each month, becoming an indispensable pillar in data and AI architectures.

Read on to see what this means for Delta Lake.
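
If you haven't tried Delta Lake, the API surface is small. A minimal PySpark sketch (the path is a placeholder, and it assumes a Spark session with the Delta Lake package on the classpath):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-demo").getOrCreate()

# Writing in Delta format lays down Parquet files plus a transaction
# log, which is what provides ACID guarantees and time travel.
spark.range(100).write.format("delta").mode("overwrite").save("/tmp/delta/demo")

# Time travel: read the table as it looked at an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/demo")
```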

Benchmarking JSON Query Times

Silvano Coriani compares different options for loading and querying JSON data in Azure SQL Database:

Storing and retrieving data from JSON fragments is a common need in many application scenarios, like IoT solutions or microservice-based architectures. These fragments can be persisted in a variety of data stores, from blob or file shares, to relational and non-relational databases, and there’s a long standing debate in the industry on what’s the database technology that fits “better” for this task.
 
Azure SQL Database offers several options for parsing, transforming and querying JSON data, and this article doesn’t pretend to provide a definitive answer to that debate, but rather to explore these options for common scenarios like data loading and retrieving, and benchmarking results to provide a clear indication of how Azure SQL Database will perform manipulating JSON data.

Read on for the results.
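
For a sense of what's being benchmarked: T-SQL can shred JSON into rows and typed columns with OPENJSON. A hedged sketch from Python (the connection string and document shape are invented for illustration):

```python
import pyodbc

# Placeholder connection string for an Azure SQL database.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=yourserver.database.windows.net;DATABASE=yourdb;"
    "UID=youruser;PWD=yourpassword;"
)
cursor = conn.cursor()

# OPENJSON turns a JSON array into a rowset with typed columns.
doc = '[{"deviceId": 1, "reading": 21.5}, {"deviceId": 2, "reading": 19.8}]'
cursor.execute(
    "SELECT deviceId, reading FROM OPENJSON(?) "
    "WITH (deviceId int '$.deviceId', reading float '$.reading');",
    doc,
)
for row in cursor.fetchall():
    print(row.deviceId, row.reading)
```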

Query Folding with Power BI Dataflows

Matthew Roche shares a few important points about Power BI dataflows and query folding:

In a recent post I mentioned an approach for working around the import-only nature of Power BI dataflows as a data source in Power BI Desktop, and in an older post I shared information about the enhanced compute engine that’s currently available in preview.

Some recent conversations have led me to believe that I should summarize a few points about dataflows and query folding, because these existing posts don’t make them easy to find and understand.

Read on for those points.
