Public endpoint, ability to connect to Azure SQL Database Managed Instance from Internet, without VPN has reached global availability today. The release of this feature will help support many new integration scenarios.
The public endpoint for Managed Instance can today be enabled/disabled via PowerShell script. The support for Azure portal will be coming within the next week or so as soon as all updates are rolled out.
Click through to learn how to enable it with Powershell.
In any Introduction level CosmosDB talk, Presenter will suggest you pay more attention to the Partitioning part of the talk. Microsoft wants you to create great solution by using Azure CosmosDB and there are a lot of resources out there for developers. The challenge in Azure CosmosDB is, you don’t need a DBA in CosmosDB and developers may not pay attention to details like Partitioning when they create databases in Azure CosmosDB.
It is crucial to pick the right partition key for your databases in CosmosDB simply because you cannot repartition your databases. If you pick a wrong partition key for your data model, it will be simply too late to change it later. You might have good amount of data in your databases, that means you spent good amount of RU to insert this data and only way to fix this problem will be start from scratch and upload all data back again to CosmosDB with the right partition key.
Read on for a couple of analogies and why this is so important.
I wrote it because AzCopy was weak and inconsistent. It was fragile, needing constant attention and monitoring in case a journalling file got stuck. Also, AzCopy didn’t keep files in sync. If a file was deleted locally (as part of a cleanup to delete old backups), AzCopy was unable to delete files remotely, so it was messy to maintain files in Blob Storage containers. The uploader was written to keep files in sync, and not have to fuss with AzCopy.
The real value of this tool though, is being able to recover the latest backup files (full, differential and transaction logs where available) which are needed to recover from a catastrophic failure. Without any knowledge of the backups, just knowing the database name, it can parse the list of files in Azure, download the necessary ones to recover, and build a T-SQL script to restore them. Literally all you need to do is run the downloader, then run the restore script.
Randolph talks about how the state of AzCopy has changed and offers up some new guidance as well as tooling updates.
Databricks is a managed Spark framework, similar to what we saw with HDInsight in the previous post. The major difference between the two technologies is that HDInsight is more of a managed provisioning service for Hadoop, while Databricks is more like a managed Spark platform. In other words, HDInsight is a good choice if we need the ability to manage the cluster ourselves, but don’t want to deal with provisioning, while Databricks is a good choice when we simply want to have a Spark environment for running our code with little need for maintenance or management.
Azure Databricks is not a Microsoft product. It is owned and managed by the company Databricks and available in Azure and AWS. However, Databricks is a “first party offering” in Azure. This means that Microsoft offers the same level of support, functionality and integration as it would with any of its own products. You can read more about Azure Databricks here, hereand here.
Click through for a demonstration of the product.
Azure Cosmos DB Change Feed exposes Cosmos DB Logs to outside of CosmosDB. CosmosDB notifies you immediately when there is any change in your database. It supports all Inserts and Updates, Delete will be available soon. You can always use soft delete to catch delete events if you need to.
By knowing what is changed in your database, you can trigger all kind of events and you can make your application work very smart. SQL Server has similar functionality but like many other features Log shipping is usually blocked by DBAs or the company policies. In CosmosDB, you don’t need to do anything to enable Change Feed feature! It’s already enabled, all you need to do is to configure it. Easiest way to catch change feed events is Azure Functions.
When I hear someone describe the change feed, I immediately imagine it as a Kafka topic.
We’re delighted to release the Azure Toolkit for IntelliJ support for SQL Server Big Data Cluster Spark job development and submission. For first-time Spark developers, it can often be hard to get started and build their first application, with long and tedious development cycles in the integrated development environment (IDE). This toolkit empowers new users to get started with Spark in just a few minutes. Experienced Spark developers also find it faster and easier to iterate their development cycle.
The toolkit extends IntelliJ support for the Spark job life cycle starting from creation, authoring, and debugging, through submission of jobs to SQL Server Big Data Clusters. It enables you to enjoy a native Scala and Java Spark application development experience and quickly start a project using built-in templates and sample code. The integration with SQL Server Big Data Cluster empowers you to quickly submit a job to the big data cluster as well as monitor its progress. The Spark console allows you to check schemas, preview data, and validate your code logic in a shell-like environment while you can develop Spark batch jobs within the same toolkit.
It looks pretty good from my vantage point.
When Azure SQL Data Warehouse was chosen to implement a multi-dimensional data warehouse, it may have seemed like the ideal choice. Why? because it was plain to see: keywords: “SQL”, “Warehouse”. However, no, SQL Data Warehouse is ideal only when you have data loads that are quite high, not when it is only several 100GBs. Armed with a few more reasons as to why not (A good reference for choosing Azure SQL Data Warehouse), I had confronted them. But the rebuke then was that they did get good enough performance, and that cost wasn’t a problem. Until of course a few months later when complex queries started hitting the system, and despite being able to afford that cost, the value of paying that amount did not seem worth it.
Having a good architectural understanding of the Azure or AWS platform—even if you aren’t deeply familiar with all of the tools—can help avoid these types of problems.
To make the above possible, we provide a Bring Your Own VNET (also called VNET Injection) feature, which allows customers to deploy the Azure Databricks clusters (data plane) in their own-managed VNETs. Such workspaces could be deployed using Azure Portal, or in an automated fashion using ARM Templates, which could be run using Azure CLI, Azure Powershell, Azure Python SDK, etc.
With this capability, the Databricks workspace NSG is also managed by the customer. We manage a set of inbound and outbound NSG rules using a Network Intent Policy, as those are required for secure, bidirectional communication with the control/management plane.
This is a good article if the defaults won’t get past corporate security.
This post describes how to generate big datasets with Hive in HDInsight, specifically TPC-DS benchmarking datasets. There are many tools for generating sample data, and this one is particularly nice due to its familiarity and ability to generate massive datasets up to 100 terabytes in size. The intended purpose of TPC data is for benchmarking purposes, but big sample datasets are also very useful for learning big data tools, proofs of concept, testing, etc.
The TPC (Transaction Processing Performance Council) provides tools for generating the benchmarking data, but using them to generate big data is not trivial, and would take a very long time on modest hardware. Thankfully someone has written a nice utility that uses Hive and Python to run the generator on a Hadoop cluster. While Hadoop clusters are not easy to setup, using a Hadoop cloud service like Azure HDInsight is remarkably easy. With HDInsight, you can use a powerful cluster of machines to generate the data quickly, and when you’re done you can delete the cluster, leaving the data in place.
Most of the instructions should follow through to work with on-prem or non-HDInsight Hadoop clusters, though there will be some changes to accommodate differences in HDInsight.
The first step in any type of analysis is to understand the dataset itself. A Databricks dashboard can provide a concise format in which to present relevant information about the data to clients, as well as a quick reference for analysts when returning to a project.
To create this dashboard, a user can simply switch to Dashboard view instead of Code view under the View tab. The user can either click on an existing dashboard or create a new one. Creating a new dashboard will automatically display any of the visualizations present in the notebook. Customization of the dashboard is easily achieved by clicking on the chart icon in the top right corner of the desired command cells to add new elements.
This isn’t quite a step-by-step guide but does spur on ideas.