This first step when using T-SQL to read Extended Files that are stored in an Azure Storage Account is to create a database credential. Of course the credential will provide essential security information to connect to the Azure Storage Account. This first data point you will need is the URL to a blog storage container in you storage account. If you look below, you can see where you would place your storage account name and the blob storage container name.
Dave gives us the grand tour of the configuration process, including where things differ between on-prem SQL Server and Azure SQL Database (which is quite a bit)
This is best used for those single databases that are ever changing with unpredictable patterns. With the concept of being billed per second (based on the vcores used) rather than per hour means that pricing can become more granular especially now with auto-pause becoming possible. The auto-pause delay defines the period of time the database must be inactive before it is automatically paused (only charged for storage). You should only use this if you can afford some delay in compute warm-up after idle usage periods, otherwise it is best to stick with provisioned compute tiers ( classic tiers).
I could see this being useful for dev or test databases, or maybe a personal site with heavy external caching.
Managed Instance has memory that is proportional to the number of cores. As an example, in Gen5 architecture you have 5.1GB of memory per vCore, meaning that 8-core instance will have 41GB memory. Periodically you should check is this amount of memory good for your workload.
Do not monitor does the Managed Instance use ~100% of available memory. Some people believe that this is an issue (like hitting 100% CPU), but this is expected and desired behavior. Managed Instance should use as much as possible memory to cache the pages from disk into the buffer pool. The only case where you will not see near 100% usage of memory is the case where you have the databases much smaller that the available memory size so all of them can fit into the memory.
The spoiler version is that it’s the same process as on-prem.
Now, we all know that the consultant answer to “which should I use?” is It Depends ™ 🙂 But what does it depend on?
To me, it boils down to a few key questions you need to ask:
– What is the task or problem you are trying to solve?
– Where and how will you use the output?
– Which tool are you most comfortable using?
Read on to see how they both work.
To avoid this overhead, you must track the idleness of the EMR cluster and terminate it if it is running idle for long hours. There is the Amazon EMR native IsIdle Amazon CloudWatch metric, which determines the idleness of the cluster by checking whether there’s a YARN job running. However, you should consider additional metrics, such as SSH users connected or Presto jobs running, to determine whether the cluster is idle. Also, when you execute any Spark jobs in Apache Zeppelin, the IsIdle metric remains active (1) for long hours, even after the job is finished executing. In such cases, the IsIdle metric is not ideal in deciding the inactivity of a cluster.
In this blog post, we propose a solution to cut down this overhead cost. We implemented a bash script to be installed in the master node of the EMR cluster, and the script is scheduled to run every 5 minutes. The script monitors the clusters and sends a CUSTOM metric EMR-INUSE (0=inactive; 1=active) to CloudWatch every 5 minutes. If CloudWatch receives 0 (inactive) for some predefined set of data points, it triggers an alarm, which in turn executes an AWS Lambda function that terminates the cluster.
We went a slightly different route for auto-termination, killing after a fixed number of hours.
What this is going to do is create an Azure Container Instance Group with one container it in, running SQL Server 2019 CTP 2.5. It’ll be publicly exposed to the internet on port 1433 (I’ll cover fixing that in a future post) so we’ll get a public IP that we can use to connect to.
Notice that the location and resource_group_name are set using variables that retrieve the values of the resource group are going to create.
Cool! We are ready to go!
Fun stuff, and Andrew promises more.
Clean the data and optionally aggregate it as it sits in source system. The tool used for this would depend on the source system that stores the data (i.e. if SQL Server, you would use stored procedures). The only benefit with this option is if you aggregate the data, you will move less data from the source system to Azure, which can be helpful if you have a small pipe to Azure and don’t need the row-level details. The disadvantages are: the raw source data is not available in the data lake, so you would always need to go back to source system if you needed to get it again, and it may not even still exist in the source system; you would put extra stress on the source system when doing the cleaning which could affect end users using the system; it could take a long time to clean the data as the source system may not have fast performance; and you would not be able to use other tools (i.e. Hadoop, Databricks) to clean it. Strongly advise against this option
Read on for additional options and James’s recommendations.
CosmosDB Emulator is a must have tool if you develop applications for Azure CosmosDB. Also, it’s a great tool to have if you like to learn about Azure CosmosDB but you have limited access to Azure for any reason. Azure CosmosDB team constantly works to make all available tools better including the emulator. Currently, emulator supports SQL, Cassandra, MongoDB, Gremlin and Table API. Data Explorer feature supports only SQL API for now. Emulator’simplementation is different than the service and you should not use it for stress testing, you can not test global replication or latency for read and writes.
It’s also useful for testing scenarios where you want some level of integration testing but don’t want to rely on an external service.
Microsoft Graph is a comprehensive framework for accessing data in various online Microsoft services, including Azure Active Directory (AAD), Office 365, OneDrive, Teams, and more. AzureGraph is an R package that provides a simple R6-based interface to the Graph REST API, and is the companion package to AzureRMR and AzureAuth.
Currently, AzureGraph aims to provide an R interface only to the AAD part, with a view to supporting R interoperability with Azure: registered apps and service principals, users and groups. Like AzureRMR, it could potentially be extended to support other services.
Just to clarify, this is like Facebook Graph API for Azure components, not a graph database that you can store your own data in.
A user recently asked me a question on my previous blog post (Setting Variables in Azure Data Factory Pipelines) about possibility extracting the first element of a variable if this variable is set of elements (array).
So as a spoiler alert, before writing a blog post and adding a bit more clarity to the existing Microsoft ADF documentation, here is a quick answer to this question.
You’ll have to click through even for the quick answer.