The solution picks up the SQL data changes from the CDC Change Tracking system tables, creates JSON messages from the change rows, and then posts the message to an Azure Event Hub. Once landed in the Event Hub an Azure Stream Analytics (ASA) Job distributes the changes into the multiple outputs.
What I found pretty cool was that I could transmit SQL delta changes from source to target in as little as 5 seconds end to end!
There are a bunch of steps, but the end result is worth it.
For more information, read this blog: End to End Loan ChargeOff Prediction Built Using Azure HDInsight Spark Clusters and SQL Server 2016 R Service
We have published two solution templates deployable using two technology stacks for the above chargeoff scenario:-
Loan Chargeoff Prediction using SQL Server 2016 R Services – Using DSVM with SQL Server 2016 and Microsoft ML, this solution template walks through how to create and clean up a set of simulated data, use 5 different models to train, select the best performant model, perform scoring using the model and save the prediction results back to SQL Server. A PowerBI report connects to the prediction table and show interactive reports with the user on the chargeoff prediction.
Loan Chargeoff Prediction using HDInsight Spark Clusters – This solution demonstrates how to develop machine learning models for predicting loan chargeoff (including data processing, feature engineering, training and evaluating models), deploy the models as a web service (on the edge node) and consume the web service remotely with Microsoft R Server on Azure HDInsight Spark clusters. The final predictions is saved to a Hive table which could be visualized in Power BI.
These tend to be nice because they show you how the different pieces of the Azure stack tie together.
Some of the real world use cases which can be impacted due to the S3 eventual consistency model are:
Listing Files. Newly created files might not be visible for data processing. In Hive, Spark and MapReduce, this can lead to erroneous results from incomplete source data or failure to commit all intermediate results.
ETL Workflow. Systems like Oozie rely on marker files to trigger the subsequent workflows. Any delay in the visibility of these files can lead to delays in the subsequent workflows.
Existence-guarded path operations. Any action which fails if the destination path is present may see a deleted file in a listing, and so fail — even though the file has already been deleted.
Read on to see how S3Guard works and how to enable it in HDP 2.6.
We provide a few script actions for installing rsparkling on Azure HDInsight. When creating the HDInsight cluster, you can run the following script action for header node:
And run the following action for the worker node:
Please consult Customize Linux-based HDInsight clusters using Script Action for more details.
Click through for the full process.
The recommended method for working with Azure is always PowerShell. The Azure portal and SSMS are tools there for your convenience but they do not scale well. If you have multiple databases to migrate, potentially from multiple servers, using PowerShell will be much more efficient. Scripting your Azure work makes it repeatable and works towards the Infrastructure as Code concept.
In this demonstration, the below steps will be used.
Export the bacpac file to a local directory with sqlpackage.exe.
Copy the bacpac to Azure Blob Storage with AzCopy.exe
Use the PowerShell AzureRM module and cmdlets to create an Azure SQL Database from the bacpac file.
Derik shows the point-and-click way as well as the Powershell way.
Simon Whiteley continues his Polybase on Azure SQL Data Warehouse series. First, he covers data loading patterns:
That’s enough about data loading for now, there’s another major use case for Polybase that we haven’t yet discussed. Many data processing solutions have a huge, unwieldy overnight batch job that performs aggregates, lookups, analytics and various other calculations.
However, it is often the case that this is not timely enough for many business requirements. This is where Polybase can help.
If we have an External Table over the newest files, this will read these new records at query time. We can write a view that combines the External Table with our batch-produced table. This will obviously go a little slower than usual, given it has to read data from flat files each time, however the results returned will be up to date.
In order to utilise SQLDW effectively, we write SQL for our transformations, rather than relying on external tools such as SSIS. This ensures the work is being done by our compute nodes and, therefore, can be scaled up to increase performance.
General best practice, therefore, would be write stored procedures for each of the data movements we want to occur. This allows us to add in auditing, logging etc. But what we’re interested in here is the core data movement itself.
Writing a traditional INSERT statement isn’t the fastest way to get data into a table. There is a special syntax which creates a new table and inserts into it, that is automatically configured for optimal bulk loading, this is the CTAS, or “Create Table as Select” statement.
This is a pair of interesting posts from Simon.
Notice the region to upload – If you are using a work machine I would suggest getting authorisation. The great thing here is that this is GDPR compliant.
Once ready hit the upload button, it goes through 3 phases. Upload, Analysis and a recommendation.
It sends your dump files to an external service, which is important enough to point out. If you want more details on the product, Rony Chatterjee has a FAQ.
Like most wizards in the world, this one begins with a welcome page. One important piece of information that is provided, however, is the tip about using the AWS Schema Conversion Tool. The conversion tool is most useful for heterogeneous migrations, such as Oracle to RDS Microsoft SQL Server. This demonstration is using homogeneous data platforms; therefore, it is not needed.
It does seem pretty easy to do.
Download the latest client libraries for Analysis Services. This needs to be done on your local machine so you can then copy these files to your Azure Function App.
After you’ve downloaded the client libraries, the DLLs can be found in C:\Program Files (x86)\Microsoft SQL Server\140\SDK\Assemblies. The two files you need are:
C:\Program Files (x86)\Microsoft SQL Server\140\SDK\Assemblies\Microsoft.AnalysisServices.Core.DLL
C:\Program Files (x86)\Microsoft SQL Server\140\SDK\Assemblies\Microsoft.AnalysisServices.Tabular.DLL
This step is important because the documentation in Azure references the 130 assemblies, which will not work. You need the assemblies in 140 otherwise you’ll get errors.
Dustin walks through the whole process of setting up an Azure Function step by step.
In many cases, you can easily provision resources in the web-based Azure portal. If you’re never going to repeat the deployment process, then by all means use the interface in the Azure portal. It doesn’t always make sense to invest the time in automated deployments. However, ARM templates are really helpful if you’re interested in achieving repeatability, improving accuracy, achieving consistency between environments, and reducing manual effort.
Use ARM templates if you intend to:
Include the configuration of Azure resources in source control (“Infrastructure as Code”), and/or
Repeat the deployment process numerous times, and/or
Automate deployments, and/or
Employ continuous integration techniques, and/or
Utilize DevOps principles and practices, and/or
Repeatedly utilize testing infrastructure then de-provision it when finished
Melissa walks through an example of deploying a website with backing database, along with various configuration changes.