Minimum Viable Data Mesh in Azure

Paul Andrew was on a podcast:

For Paul, delivering a single data mesh data product on its own is not all that valuable – if you are going to go to the expense of implementing data mesh, you need to be able to satisfy use cases that cross domains. The greater value is in cross-domain interoperability: getting to a data product that wasn’t possible before. And you need to deliver the data platform alongside those first 2-3 data products; otherwise you create a very hard-to-support data asset, not really a data product.

When thinking about a minimum viable data mesh, Paul views an approach leveraging DevOps and CI/CD – Continuous Integration/Continuous Delivery – as crucial. You need repeatability and reproducibility to really call something a data product.

Click through for the interview as well as Scott Hirleman’s summary.


Finding Indexing Metrics in Cosmos DB

Hasan Savran looks at the numbers:

You might need Composite Indexes to make your queries more efficient; Cosmos DB does not create any Composite Indexes for you. You need to figure out which properties should have composite indexes, and then you need to change the indexing policy file to create them.

Indexing Metrics come to your aid when you need help with your indexing policy. They tell you which indexes the current query uses and give you hints about which other indexes you could create to make the query faster/cheaper. Like many other features of Cosmos DB, you need to write code using the SDK to see Indexing Metrics. The following example shows how to enable Indexing Metrics for your queries.

Click through for a code sample which shows how to collect index metrics.
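As a rough illustration of the same idea in Python, here is a minimal sketch, assuming a recent azure-cosmos package that exposes a populate_index_metrics flag on queries (the account, database, container, and query are all made up; check your SDK version's docs for the exact flag and header names):

```python
import base64

from azure.cosmos import CosmosClient

# Hypothetical account, database, and container names.
client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("appdb").get_container_client("orders")

# Ask Cosmos DB to return index utilization info alongside the results.
# The flag name follows the azure-cosmos SDK docs; older versions may lack it.
results = container.query_items(
    query="SELECT c.id FROM c WHERE c.city = @city AND c.total > @total",
    parameters=[
        {"name": "@city", "value": "Istanbul"},
        {"name": "@total", "value": 100},
    ],
    enable_cross_partition_query=True,
    populate_index_metrics=True,
)
items = list(results)  # drain the iterator so response headers get populated

# The metrics come back in a response header, base64-encoded.
raw = container.client_connection.last_response_headers.get(
    "x-ms-cosmos-index-utilization", ""
)
print(base64.b64decode(raw).decode("utf-8", errors="replace") if raw else "no metrics")
```

The decoded payload lists the single and composite indexes the query used, plus suggested composite indexes you could add to the indexing policy.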


Troubleshooting Firewall Issues with Azure SQL MI

Emanuele Meazzo sees a problem pop up regularly:

Here is something that will save you lots of time and headaches when trying to connect to Azure SQL Managed Instances, especially from on-premises servers or from other clouds; I had to repeat this multiple times to multiple actors, so I know it will happen to someone else too.

In most cases, “Connect Timeout” and/or “Cannot open server xxx requested by the login; Login failed” errors are caused by the firewall configuration and a lack of understanding of the SQL MI networking model. Let me explain:

Read on for that explanation.
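One quick sanity check before digging into firewall rules is to see whether the relevant ports are reachable at all. Here is a minimal sketch in Python; the hostname is hypothetical, and the port numbers follow the SQL MI networking model (1433 on the private endpoint, 3342 on the public endpoint, 11000-11999 under the redirect connection policy):

```python
import socket

def tcp_probe(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical instance name. The private endpoint listens on 1433; the
# public endpoint (if enabled) listens on 3342.
host = "mymi.public.abc123.database.windows.net"
for port in (1433, 3342):
    print(f"{host}:{port} ->", "reachable" if tcp_probe(host, port) else "blocked")
```

If the port shows as blocked, you are in firewall/NSG territory; if it is reachable but the login still fails, the problem is elsewhere.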


Cost Savings with Azure Data Factory

Koen Verbeeck maximizes the savings:

As you might’ve noticed, pricing in ADF is not the same as it was in SSIS for example. In SSIS, you pay your SQL Server license and you’re done (well, and you buy a server to run it on). It doesn’t matter what you do with SSIS, the cost is the same. If you run 1 package or 1000 packages, there’s no difference except in your electricity bill. However, in ADF you pay more if you use it more. You pay for each action you do, you pay for each activity you use and for how long things are running. There are a couple of guidelines you can follow to try to minimize costs:

Read on for those guidelines and some specific helpful items.
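To make the billing model concrete, here is some back-of-the-envelope arithmetic. The prices are illustrative placeholders, not current list prices (check the ADF pricing page); the point is the shape of the model, where every run and every hour shows up on the bill:

```python
# Illustrative ADF cost model; all prices are placeholders.
activity_runs_per_day = 500      # orchestration: billed per activity run
price_per_1000_runs = 1.00       # illustrative

copy_hours_per_day = 2.0         # data movement: billed per DIU-hour
dius = 4                         # Data Integration Units per copy activity
price_per_diu_hour = 0.25        # illustrative

daily = (activity_runs_per_day / 1000) * price_per_1000_runs
daily += copy_hours_per_day * dius * price_per_diu_hour
print(f"~${daily:.2f}/day, ~${daily * 30:.2f}/month")

# Halving the number of runs (by batching) or the DIU count halves the
# corresponding term, which is the shape of guidelines like Koen's.
```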


Azure Resource Locks

Craig Porteous explains the benefit (and pain) behind resource locks in Azure:

In theory, these are perfect for preventing accidental (or deliberate) deletion of resources in Azure. They don’t prevent the deletion of data though, only operating at the “control plane” of a resource. That still sounds great though. Turn them on everywhere! That’s another layer of security in your cloud data platform. Right?

Yeah, here’s where the pain comes in. I tried using resource group locks but there are some resources which use delete capabilities, such as Azure Media Service. A delete lock means no ability to delete uploaded videos.


Low-Code Churn Prediction with Synapse Analytics

Gavita Regunath shows off a capability in Azure Synapse Analytics:

We will build a machine learning solution to predict churn using Azure Synapse Analytics and Azure Machine Learning.

Azure Synapse Analytics is Microsoft’s limitless analytics platform that combines enterprise data warehousing and big data analytics. In simple terms, it is a one-stop-shop that allows you to ingest, prepare, and manage data that can then be used for machine learning and business intelligence, all from a single place. It provides a unified platform and encourages collaboration between data and machine learning professionals.

This article will show you how to build an end-to-end solution to train a machine learning model from Azure Synapse Analytics using AutoML functionality within Azure Machine Learning. Using the T-SQL PREDICT statement, we can then use the trained machine learning model to make predictions against the churn dataset stored in the SQL pool table. One of the key benefits of working from within Azure Synapse is that all the necessary steps required to train and make predictions with the trained model can be done from a single platform, Azure Synapse.

Click through for the three-step process and a demonstration.
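For a flavor of the scoring step, here is a hedged sketch of what a PREDICT query against a dedicated SQL pool can look like, wrapped in Python via pyodbc. The connection details, table, and model names are all hypothetical, and the model is assumed to be stored as ONNX in a table, which is what Synapse's PREDICT runtime expects:

```python
import pyodbc

# Hypothetical workspace, database, table, and model names.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=myworkspace.sql.azuresynapse.net;Database=sqlpool01;"
    "Authentication=ActiveDirectoryInteractive;"
)

# T-SQL PREDICT runs the ONNX model stored in dbo.Models against the
# rows in dbo.ChurnData, entirely inside the dedicated SQL pool.
sql = """
SELECT d.CustomerId, p.Churn_predicted
FROM PREDICT(
        MODEL = (SELECT model FROM dbo.Models WHERE model_name = 'churn'),
        DATA = dbo.ChurnData AS d,
        RUNTIME = ONNX)
WITH (Churn_predicted BIT) AS p;
"""
for row in conn.execute(sql):
    print(row.CustomerId, row.Churn_predicted)
```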


Replacing Common Table Expressions in ADF Dataflows

Jeet Kainth needs an alternative:

At the time of writing, it is not possible to write a query using a CTE in the source of a dataflow. However, there are a few options to deal with this limitation:

– re-write the query using subqueries instead of CTEs

– use a stored procedure that contains the query and reference the stored proc in the source of the dataflow

– write the query as a view and reference the view in the source of the dataflow (this is my preferred method and the one I will demo here)

Jeet focuses on the third alternative. I’d lean toward the second or the third alternative myself, probably the second one (stored procedures), but both allow me to create an interface between ADF and the database. That way, underlying table changes will be less likely to require code changes in ADF.
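To illustrate the view option, here is a hypothetical before/after: the CTE query a dataflow source won't accept, and the view that wraps it so the source can simply select from the view. Table and column names are invented for the example:

```python
# The query with a CTE: not usable directly in an ADF dataflow source.
cte_query = """
WITH latest AS (
    SELECT CustomerId, OrderDate,
           ROW_NUMBER() OVER (PARTITION BY CustomerId
                              ORDER BY OrderDate DESC) AS rn
    FROM dbo.Orders
)
SELECT CustomerId, OrderDate FROM latest WHERE rn = 1;
"""

# The same logic wrapped in a view: the CTE is fine inside the view,
# and the dataflow source just references dbo.vLatestOrders.
create_view = """
CREATE VIEW dbo.vLatestOrders AS
WITH latest AS (
    SELECT CustomerId, OrderDate,
           ROW_NUMBER() OVER (PARTITION BY CustomerId
                              ORDER BY OrderDate DESC) AS rn
    FROM dbo.Orders
)
SELECT CustomerId, OrderDate FROM latest WHERE rn = 1;
"""
```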


Azure Shared Disk with Zone-Redundant Storage

Dave Bermingham runs some tests:

What makes this interesting is that you can now build shared-storage-based failover cluster instances that span Availability Zones (AZs). With cluster nodes residing in different AZs, users can now qualify for the 99.99% availability SLA. Prior to support for ZRS, Azure Shared Disks only supported Locally Redundant Storage (LRS), limiting cluster deployments to a single AZ and leaving users susceptible to outages should an AZ go offline.

There are however a few limitations to be aware of when deploying an Azure Shared Disk with ZRS.

Dave also checks how the performance of ZRS shared disks compares to locally-redundant storage.


Partial Update Operations in Cosmos DB

Hasan Savran partially deflates the partial update bubble:

Partial Update was one of the features most wanted by Cosmos DB customers. In a regular update operation, you need to send the whole JSON document to Cosmos DB. This can be silly if your data model is large and you want to update one field in it. With a regular update, your request object will be large because you need to send the whole data model. A regular update operation needs more resources from the client/SDK and more network bandwidth.

You might think that partial updates would cost fewer request units. Unfortunately, this is not the case, because Cosmos DB still needs to open the JSON document, change the necessary properties, and save the data. Cosmos DB uses almost the same amount of CPU and memory for a regular update as for a partial update.

The fact that it costs just about as much as a full write does reduce the value of partial updates. Still, there is some value in reducing bandwidth requirements or in making changes when you don’t know the entire contents of the document up front.
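For reference, here is what a partial update looks like from the Python SDK, which exposes it as patch_item; the account, container, and document details are hypothetical. The request carries only the listed operations rather than the whole document, which is where the bandwidth savings come from even though the RU cost stays close to a full replace:

```python
from azure.cosmos import CosmosClient

# Hypothetical account, database, container, and document.
client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("appdb").get_container_client("users")

# Only these operations go over the wire, not the whole document.
container.patch_item(
    item="user-42",
    partition_key="user-42",
    patch_operations=[
        {"op": "replace", "path": "/lastLogin", "value": "2022-06-01T09:00:00Z"},
        {"op": "incr", "path": "/loginCount", "value": 1},
    ],
)
```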
