Cloud – Page 99 – Curated SQL

Tips for Using Azure Table Storage

Published 2020-10-22 by Kevin Feasel

Adrian Hills takes us through using Azure Table Storage:

Azure Table Storage is a NoSQL key-value PaaS data store that can be a great option for highly scalable, highly available systems. It supports storing petabytes of data and a flexible data schema, meaning different entities in the same table can have different schemas. References to NoSQL databases having “flexible schema” or being “schema-less” can give the impression that database schema design is a thing of the past and that you can bypass it and focus more on the application code. The reality is, even in this NoSQL world, schema design is very important and if you don’t give it due care and attention, then it can come back to bite you.
If you have an RDBMS background and are new to Azure Table Storage, it’s common to find yourself “thinking in SQL” and trying to solve database modeling requirements with a SQL approach before then trying to translate that to a key-value mindset. In this blog post, I’ll cover some of the fundamentals of Azure Table Storage and dive into some common questions you might find yourself asking about Azure Table Storage. Where code samples or references are applicable in this blog post, we’ll be focusing on .NET and using the Azure SDK (specifically relating to the Microsoft.Azure.Cosmos.Table NuGet package).

Read on for the full story.
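
To make the "flexible schema" point concrete, here is a minimal sketch in plain Python of a key-value table in which two entities with different shapes share one partition. It is an in-memory toy, not the Azure SDK: the `InMemoryTable` class and its methods are hypothetical names invented for illustration, and only the `PartitionKey` and `RowKey` property names come from Azure Table Storage itself.

```python
from collections import defaultdict


class InMemoryTable:
    """Toy stand-in for a key-value table (hypothetical; not the Azure SDK)."""

    def __init__(self):
        # partition key -> {row key -> entity}
        self._partitions = defaultdict(dict)

    def upsert(self, entity):
        # Every entity must carry the two key properties; all other
        # properties are free-form and may differ between entities.
        pk, rk = entity["PartitionKey"], entity["RowKey"]
        self._partitions[pk][rk] = dict(entity)

    def get(self, partition_key, row_key):
        # Point lookup by both keys; returns None when nothing matches.
        return self._partitions.get(partition_key, {}).get(row_key)

    def query_partition(self, partition_key):
        # Scan one partition, sorted by row key in this sketch.
        rows = self._partitions.get(partition_key, {})
        return [rows[rk] for rk in sorted(rows)]


table = InMemoryTable()
# Two entities with different schemas living in the same partition.
table.upsert({"PartitionKey": "customer-42", "RowKey": "profile",
              "Name": "Ada", "Email": "ada@example.com"})
table.upsert({"PartitionKey": "customer-42", "RowKey": "order-0001",
              "Total": 19.99, "Currency": "GBP"})

for entity in table.query_partition("customer-42"):
    print(entity["RowKey"], sorted(entity))
```

Even in this toy, the choice of keys decides which lookups are cheap, which is one way to read the excerpt's warning that schema design still matters.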


Self-Service with Azure Synapse Analytics

Published 2020-10-21 by Kevin Feasel

Paul Andrew lays out an interesting idea:

I’ve been playing around with Azure Synapse Analytics for a while now, exploring the preview features and trying to find a meaningful use case for the ‘single pane of glass’ capabilities. In this post I’m exploring one possible option/idea for creating a very simple self-service approach to dataset ingestion and consumption. Full disclosure: the below is far from technical perfection for lots of reasons; I mainly wanted to put something out there as an idea and use it to maybe start a conversation.

Click through to see Paul’s take on the matter.


The Raw Facts on Azure SQL DB Serverless

Published 2020-10-21 by Kevin Feasel

Taiob Ali gives us a briefing summary on Azure SQL Database Serverless:

Occasionally, load balancing automatically occurs if the machine cannot satisfy resource demand within a few minutes. For example, if the resource demand is 4 vCores, but only 2 vCores are available, it may take up to a few minutes to load balance before 4 vCores are provided. The database remains online during load balancing except for a brief period at the end of the operation when connections are dropped.

Click through for more points along these lines.


Azure Site-to-Site VPN Blocking Certain Traffic

Published 2020-10-20 by Kevin Feasel

Denny Cherry diagnoses a network configuration issue:

I ran across an interesting issue a couple of weeks ago when working with a client. The client has several subsidiaries, each with their own vNet. The client had a site-to-site VPN between the Azure vNets. All traffic was successfully crossing the Azure site-to-site VPN as expected. The sticking point was a software licensing server running in one of the subsidiaries’ Azure infrastructure configurations. The licensing software simply wasn’t working.

Click through to learn why.


Optical Character Recognition with Tesseract and Databricks

Published 2020-10-19 by Kevin Feasel

Alex Aleksandrov takes a look at optical character recognition with the Tesseract library:

The topic of Optical Character Recognition (OCR) is not an unexplored field to the Adatis audience. Some Adati like Kalina Ivanova (link1, link2) and Francesco Sbrescia (link3) have already explored this topic from the perspective of Azure Cognitive Services and Azure Data Lake. In my first blog, I would like to explore this topic from a different perspective: using Tesseract and Databricks.

Click through for instructions.


Indexing S3 Data with NiFi and CDP Data Hubs

Published 2020-10-16 by Kevin Feasel

Eva Nahari, et al., walk us through text indexing of S3 data with Solr, NiFi, and Cloudera Data Platform:

Data Discovery and Exploration (DDE) was recently released in tech preview in Cloudera Data Platform in public cloud. In this blog we will go through the process of indexing data from S3 into Solr in DDE with the help of NiFi in Data Flow. The scenario is the same as it was in the previous blog but the ingest pipeline differs. Spark as the ingest pipeline tool for Search (i.e. Solr) is most commonly used for batch indexing data residing in cloud storage, or if you want to do heavy transformations of the data as a pre-step before sending it to indexing for easy exploration. NiFi (as depicted in this blog) is used for real-time and often voluminous incoming event streams that need to be explorable (e.g. logs, Twitter feeds, file appends, etc.).
Our ambition is not to use any terminal or a single shell command to achieve this. We have a UI tool for every step we need to take.

Click through to see how well they do at that.


Durable Azure Functions and Azure Data Factory

Published 2020-10-16 by Kevin Feasel

Rayis Imayev wants to use Azure Functions with Azure Data Factory:

Ok, here is my problem: I have an Azure Data Factory (ADF) workflow that includes an Azure Function call to perform external operations and return an output result, which in turn is used further down my ADF pipeline. My ADF workflow (1) depends on the output result of the Azure Function call, and (2) the time efficiency of the Azure Function call is another factor to consider: if its execution time hits 230 seconds or more, ADF Azure Function will fail with a time-out error message and my workflow is screwed.

This gave Rayis the impetus to try out durable functions. Read on to see how that worked out.
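
The general idea behind getting around a fixed call time-out like the one described above is to start the work, return immediately, and let the caller poll for the result. Below is a minimal sketch of that start-then-poll pattern using only the Python standard library. It is not the Durable Functions API or Rayis's implementation: `start_job`, `get_status`, `wait_for`, and the status strings are hypothetical names chosen for illustration, and a background thread stands in for the long-running function.

```python
import threading
import time
import uuid

_jobs = {}  # job id -> {"status": ..., "output": ...}
_lock = threading.Lock()


def start_job(work, *args):
    """Kick off work in the background and return a job id right away."""
    job_id = str(uuid.uuid4())
    with _lock:
        _jobs[job_id] = {"status": "Running", "output": None}

    def runner():
        try:
            state = {"status": "Completed", "output": work(*args)}
        except Exception as exc:  # record the failure instead of hanging
            state = {"status": "Failed", "output": str(exc)}
        with _lock:
            _jobs[job_id] = state

    threading.Thread(target=runner, daemon=True).start()
    return job_id


def get_status(job_id):
    """Cheap status check; each call returns quickly."""
    with _lock:
        return dict(_jobs[job_id])


def wait_for(job_id, poll_seconds=0.05, max_polls=200):
    """Poll until the job leaves the Running state, then return its output."""
    for _ in range(max_polls):
        state = get_status(job_id)
        if state["status"] == "Completed":
            return state["output"]
        if state["status"] == "Failed":
            raise RuntimeError(state["output"])
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} still running after {max_polls} polls")


def slow_square(n):
    time.sleep(0.2)  # stands in for a long-running external operation
    return n * n


job = start_job(slow_square, 7)
result = wait_for(job)
print(result)
```

No single call here blocks for the full duration of the work, so the caller's per-call time limit applies only to the short start and status requests.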


Querying Multiple Data Sources in Azure Synapse Analytics

Published 2020-10-16 by Kevin Feasel

James Serra walks us through querying Data Lake Storage Gen2, Cosmos DB, and a table created in an Azure Synapse serverless Apache Spark pool:

As I was finishing up a demo script for my presentation at the SQL PASS Virtual Summit on 11/13 (details on my session here), I wanted to blog about part of the demo that shows a feature in the public preview of Synapse that is, frankly, very cool. It is the ability to query data as it sits in ADLS Gen2, a Spark table, and Cosmos DB and join the data together with one T-SQL statement using SQL on-demand (also called SQL serverless), hence making it a federated query (also known as data virtualization). The beauty of this is you don’t have to first write ETL to collect all the data into a relational database in order to be able to query it all together, and don’t have to provision a SQL pool, saving costs. Further, you are using T-SQL to query all of those data sources so you are able to use a reporting tool like Power BI to see the results.

Click through to see how.


Spark Infer Schema vs ADF Get Metadata

Published 2020-10-15 by Kevin Feasel

Paul Andrew compares two techniques for retrieving metadata:

For file types that don’t contain their own metadata (CSV, Text etc.) we typically have to go and figure out their structure, including attributes and data types, before doing any actual transformation work. Often I’ve used the Data Factory Metadata Activity to do this with its structure option. However, while playing around with Azure Synapse Analytics, specifically creating Notebooks in C# to run against the Apache Spark compute pools, I’ve discovered that in most cases the Data Frame infer schema option basically does a better job here.
Now, I’m sure some Spark people will probably read the above and think, well der, obviously Paul! Spark is better than Data Factory. And sure, I accept for this specific situation it certainly is. I’m simply calling that out as it might not be obvious to everyone.

Read on for a comparison of the two techniques.


MLOps with Azure Databricks and MLflow

Published 2020-10-14 by Kevin Feasel

Oliver Koernig walks us through some of the basics of MLOps using MLflow and Azure Databricks:

Most organizations today have a defined process to promote code (e.g. Java or Python) from development to QA/Test and production. Many are using Continuous Integration and/or Continuous Delivery (CI/CD) processes and oftentimes are using tools such as Azure DevOps or Jenkins to help with that process. Databricks has provided many resources to detail how the Databricks Unified Analytics Platform can be integrated with these tools (see Azure DevOps Integration, Jenkins Integration). In addition, there is a Databricks Labs project – CI/CD Templates – as well as a related blog post that provides automated templates for GitHub Actions and Azure DevOps, which makes the integration much easier and faster.
When it comes to machine learning, though, most organizations do not have the same kind of disciplined process in place.

Read on for a demonstration of the process.


Category: Cloud