Press "Enter" to skip to content

Category: Cloud

Summarizing Data & AI Summit Announcements

Zach Stagers hits the high notes:

One of the biggest cheers of the keynote was that Delta is being fully open sourced! Databricks continue to share their incredible work to help drive our industry forward. Delta already has wide adoption, but with the open sourced version now being levelled up to the same standard as the ‘proprietary’ one, this should help cement it as the default choice for lake-based storage.

There were some announcements of things to come with Delta too, such as optimised deletes and updates by removing single rows instead of having to completely rewrite the file. It’ll be really interesting to see how this works, and just how much it boosts performance.

Read on for more notes on several big announcements.


Date-Time Binning in Cosmos DB

Hasan Savran bins some data:

I wrote about the Date_Bucket() function in SQL Server a couple of weeks ago. The Azure Cosmos DB team announced the same functionality under a different name: the DateTimeBin() function. It works exactly the same as the Date_Bucket() function in SQL Server.

The Cosmos DB version of the function has the same number of parameters, but the order is different. All the datetime parameters must be in ISO 8601 format (YYYY-MM-DDThh:mm:ss.fffffffZ).

Read on to see how it works.
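
To make the parameter-order difference concrete, here is a minimal sketch of calling DateTimeBin() through the azure-cosmos Python SDK; the account, database, container, and eventTime property are all made up for illustration.

```python
# Minimal sketch: binning documents into six-hour buckets with DateTimeBin().
# The account URL, key, database, container, and eventTime property are hypothetical.
from azure.cosmos import CosmosClient

client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<account-key>")
container = client.get_database_client("SalesDb").get_container_client("Orders")

# Cosmos DB:  DateTimeBin(<date_time>, <date_time_part>, <bin_size>)
# SQL Server: DATE_BUCKET(<datepart>, <number>, <date>)  -- same idea, different parameter order
query = """
SELECT c.id,
       c.eventTime,
       DateTimeBin(c.eventTime, 'hour', 6) AS sixHourBin
FROM c
"""

for item in container.query_items(query=query, enable_cross_partition_query=True):
    print(item["id"], item["sixHourBin"])
```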


Data Lakehouse Cleanrooms in Databricks

Matei Zaharia, et al, announce an interesting idea:

We are excited to announce data cleanrooms for the Lakehouse, allowing businesses to easily collaborate with their customers and partners on any cloud in a privacy-safe way. Participants in the data cleanrooms can share and join their existing data, and run complex workloads in any language – Python, R, SQL, Java, and Scala – on the data while maintaining data privacy.

With the demand for external data greater than ever, organizations are looking for ways to securely exchange their data and consume external data to foster data-driven innovations. Historically, organizations have leveraged data sharing solutions to share data with their partners and relied on mutual trust to preserve data privacy. But the organizations relinquish control over the data once it is shared and have little to no visibility into how data is consumed by their partners across various platforms. This exposes potential data misuse and data privacy breaches. With stringent data privacy regulations, it is imperative for organizations to have control and visibility into how their sensitive data is consumed. As a result, organizations need a secure, controlled and private way to collaborate on data, and this is where data cleanrooms come into the picture.

Read on to learn more about how this all works. It’s definitely a lot better than sending off a bunch of CSVs…


Triggering a Power BI Dataset Refresh from Synapse

Nick Edwards updates a dataset:

Log in to powerbi.com and in the top right-hand corner locate “Settings” and then “Admin portal”.

Under “Tenant settings” locate “Developer Settings” and then “Allow service principals to use Power BI APIs”.

Enable this setting using the toggle. Next, under the heading “Apply to:”, select “Specific security groups (Recommended)”. Then add the newly created security group “AzureSynapsePowerBIIntegration” and click apply.

Click through for the full process.
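
Nick drives this from a Synapse pipeline; under the covers it comes down to calling the Power BI REST API’s refresh endpoint with a service principal. Here is a rough Python sketch of that call, with all of the IDs and the secret as placeholders.

```python
# Rough sketch: trigger a Power BI dataset refresh using a service principal.
# Tenant ID, client ID, secret, workspace ID, and dataset ID are placeholders.
import msal
import requests

TENANT_ID = "<tenant-id>"
CLIENT_ID = "<app-registration-client-id>"
CLIENT_SECRET = "<client-secret>"
WORKSPACE_ID = "<workspace-guid>"
DATASET_ID = "<dataset-guid>"

# Acquire an app-only token for the Power BI REST API.
app = msal.ConfidentialClientApplication(
    CLIENT_ID,
    authority=f"https://login.microsoftonline.com/{TENANT_ID}",
    client_credential=CLIENT_SECRET,
)
token = app.acquire_token_for_client(
    scopes=["https://analysis.windows.net/powerbi/api/.default"]
)["access_token"]

# POST to the refreshes endpoint; a 202 response means the refresh was accepted.
resp = requests.post(
    f"https://api.powerbi.com/v1.0/myorg/groups/{WORKSPACE_ID}/datasets/{DATASET_ID}/refreshes",
    headers={"Authorization": f"Bearer {token}"},
    json={"notifyOption": "NoNotification"},
)
resp.raise_for_status()
```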


SharePoint Lists Showing 100 Items in Logic Apps

Koen Verbeeck needs more than 100 results:

I was reading a SharePoint List using the “Get Items” activity in an Azure Logic App. I explain how you can create such a Logic App in the blog post Reading a SharePoint List with Azure Logic App.

It all worked fine for a while, but recently the list grew larger than 100 items. Suddenly, I started getting complaints that some items didn’t make it into the data warehouse. What’s going on? I ran the Logic App and I could see only 100 items were inserted into the SQL Server table:

Read on to see how you can bump that number past 100.


Restoring SQL Managed Instance Backups to SQL Server 2022

Mladen Andzic previews how we can take a Managed Instance backup and restore it on-premises:

Restoring a backup file is the easiest way to copy a SQL Server database to another instance. It allows you to create a copy of your production database for easier troubleshooting or debugging of an issue, to provide a copy of a database to your end users or eligible third parties, or as a light-weight business continuity/disaster recovery solution to restore functionality on another instance of SQL Server. These are just a few use cases, and the list is much longer and there are some very inventive ways of using backup-restore in the wild.

This article explains the challenges of cross-release restore to an older version of the SQL engine and announces the private preview of a capability to restore a backup of a database taken from Azure SQL Managed Instance to an instance of SQL Server 2022.

So much SQL Server functionality has been built with the idea of getting you from on-premises into the cloud (specifically Azure), so it’s good to see some development effort spent on the entirely reasonable and realistic scenario in which Azure is not the best choice for a company. There are plenty of such companies still willing to throw money at Microsoft for a good product.


Software Engineering Practices for Notebooks

Rafi Kurlansik and Austin Ford explain how to get the most out of notebooks, using Databricks as an example:

Notebooks are a popular way to start working with data quickly without configuring a complicated environment. Notebook authors can quickly go from interactive analysis to sharing a collaborative workflow, mixing explanatory text with code. Often, notebooks that begin as exploration evolve into production artifacts. For example,

1. A report that runs regularly based on newer data and evolving business logic.

2. An ETL pipeline that needs to run on a regular schedule, or continuously.

3. A machine learning model that must be re-trained when new data arrives.

Perhaps surprisingly, many Databricks customers find that with small adjustments, notebooks can be packaged into production assets, and integrated with best practices such as code review, testing, modularity, continuous integration, and versioned deployment.

Read on for several tips and recommendations.
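
One recurring recommendation along these lines is pulling transformation logic out of notebook cells into plain functions you can import and unit test. Here is a minimal, generic illustration; the function, schema, and file layout are made up, not taken from the article.

```python
# A minimal illustration of extracting notebook logic into a testable function.
# In practice the function would live in a module (e.g. transformations.py) that the
# notebook imports; the DataFrame schema here is invented for the example.
from pyspark.sql import DataFrame, SparkSession
import pyspark.sql.functions as F


def add_line_total(df: DataFrame) -> DataFrame:
    """Business logic pulled out of a notebook cell so it can be unit tested."""
    return df.withColumn("line_total", F.col("amount") * F.col("quantity"))


def test_add_line_total():
    # A plain pytest-style test that runs locally, outside any notebook.
    spark = SparkSession.builder.master("local[1]").getOrCreate()
    df = spark.createDataFrame([(2.0, 3)], ["amount", "quantity"])
    assert add_line_total(df).first()["line_total"] == 6.0
```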


Stopping Azure Kubernetes Service Nodes

Andrew Pruski wants to shut the whole thing down:

A while back I wrote a post on Adjusting Pod Eviction Timings in Kubernetes. To test the changes made in that post I had to shut down nodes in an Azure Kubernetes Service cluster.

This can be done easily in the Azure portal.

However, I did a presentation recently and didn’t want to have to keep jumping into the portal from VS Code… so I wanted to be able to shut down the nodes in code.

So here’s how to use the azure-cli to shut down a node in an Azure Kubernetes Service cluster.

Read on to see how, but also read Andrew’s warning/disclaimer so you don’t mess anything up in a production environment.
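
Andrew’s post uses the azure-cli. As a rough Python equivalent via the Azure SDK (assuming a VMSS-backed node pool, with the subscription, node resource group, scale set name, and instance ID as placeholders), deallocating the scale set instance shuts the node down:

```python
# Rough sketch: deallocate a single AKS node (a VMSS instance) with the Azure SDK.
# Subscription ID, node resource group, scale set name, and instance ID are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

SUBSCRIPTION_ID = "<subscription-id>"
NODE_RESOURCE_GROUP = "MC_myRg_myAksCluster_eastus"  # the AKS-managed "MC_" resource group
VMSS_NAME = "aks-nodepool1-12345678-vmss"
INSTANCE_ID = "0"

compute = ComputeManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Deallocate stops the node and releases its compute; begin_start() brings it back.
poller = compute.virtual_machine_scale_set_vms.begin_deallocate(
    NODE_RESOURCE_GROUP, VMSS_NAME, INSTANCE_ID
)
poller.result()
print(f"Deallocated instance {INSTANCE_ID} of {VMSS_NAME}")
```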


PHI De-Identification in Databricks with NLP

Amir Kermany, et al, share a set of notebooks:

John Snow Labs, the leader in Healthcare natural language processing (NLP), and Databricks are working together to help organizations process and analyze their text data at scale with a series of Solution Accelerator notebook templates for common NLP use cases. You can learn more about our partnership in our previous blog, Applying Natural Language Processing to Health Text at Scale.

To help organizations automate the removal of sensitive patient information, we built a joint Solution Accelerator for PHI removal that builds on top of the Databricks Lakehouse for Healthcare and Life Sciences. John Snow Labs provides two commercial extensions on top of the open-source Spark NLP library — both of which are useful for de-identification and anonymization tasks — that are used in this Accelerator:

This is a really interesting scenario.


Connecting to Azure SQL DB over VPN

Reitse Eskens has some routing issues:

To make sure the on-premises connection uses the VPN and the private endpoint, we need to make sure the on-premises DNS (it’s always DNS) recognizes the traffic and redirects it to the VPN connection. But whatever we tried on the firewall, the traffic kept going the wrong way. It did have something to do with the on-premises DNS setup in the end.

When we tried to connect to the Azure SQL instance by IP address, it threw an error because the instance wasn’t found. You can only connect to it with the FQDN (dbname.database.windows.net).

Click through to see what the problem was and how Reitse solved it.
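
As a reminder of the FQDN point, the connection string should always carry the server name rather than an IP address. A generic pyodbc example, with the server, database, and credentials as placeholders:

```python
# Minimal sketch: connect to Azure SQL DB over the private endpoint using the FQDN.
# Server name, database, and credentials are placeholders; connecting by IP address fails.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:myserver.database.windows.net,1433;"  # FQDN, resolved by the on-premises DNS
    "Database=mydb;"
    "Uid=myuser;Pwd=<password>;"
    "Encrypt=yes;TrustServerCertificate=no;"
)
print(conn.execute("SELECT @@VERSION").fetchone()[0])
```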
