In order to figure out why the straggler task took 15 minutes, we needed to catch it in the act. We reran the benchmark while monitoring the Spark UI, knowing that all but one of the tasks for the
saveoperation would complete within a few minutes. Sorting the tasks in that stage by the Status column, it did not take long for there to be only one task in the RUNNING state. We had found our skewed task and the IP address in the Host column pointed us at the executor experiencing the regression.
This is a nice case study of network troubleshooting, so of course there are Wireshark screenshots in it.
With data lakes becoming popular, and Azure Data Lake Store (ADLS) Gen2 being used for many of them, a common question I am asked about is “How can I access data in ADLS Gen2 instead of a copy of the data in another product (i.e. Azure SQL Data Warehouse)?”. The benefits of accessing ADLS Gen2 directly is less ETL, less cost, to see if the data in the data lake has value before making it part of ETL, for a one-time report, for a data scientist who wants to use the data to train a model, or for using a compute solution that points to ADLS Gen2 to clean your data. While these are all valid reasons, you still want to have a relational database (see Is the traditional data warehouse dead?). The trade-off in accessing data directly in ADLS Gen2 is slower performance, limited concurrency, limited data security (no row-level, column-level, dynamic data masking, etc) and the difficulty in accessing it compared to accessing a relational database.
Since ADLS Gen2 is just storage, you need other technologies to copy data to it or to read data in it.
Read on for the solution.
I want to do a quick summary post of the many different types of Azure SQL Database available and I am not talking about elastic pools, VMs etc, more so the singleton type.
Azure SQL Database (I call normal mode) – A choice between the DTU model (Basic, Standard and Premium) and vCore (General Purpose and Business Critical). Within this space there are two different architecture types used by Microsoft under the covers.
As the product expands, we get more and more options, and Arun clarifies where each fits.
As it is late at night my brain wasn’t working as it should be but thought I’d put a quick blog out there to say that if you are on v1.11.5 and want to upgrade to >= v1.13.10 then you have to do this in a 2 stage process by upgrading to v1.12.8 first:
Fortunately, upgrading is pretty easy using the Azure command line or even the Azure portal.
Azure Data Factory’s Mapping Data Flows have built-in capabilities to handle complex ETL scenarios that include the ability to handle flexible schemas and changing source data. We call this capability “schema drift“.
When you build transformations that need to handle changing source schemas, your logic becomes tricky. In ADF, you can either build data flows that always look for patterns in the source and utilize generic transformation functions, or you can add a Derived Column that defines your flow’s canonical model.
Click through for the discussion and comparison. Schema drift has been the bane of Integration Services’s existence, so it’s good to see them tackling the idea in Azure Data Factory.
The SQL Engine we are running in Azure SQL Database is the very latest version of the same engine customers run on their own servers, except we manage and update it. To update SQL Server or the underlying infrastructure (i.e. Service Fabric or the operating system), we must stop the SQL Server process. If that process hosts the primary database replica, we move the replica to another machine (requiring a failover).
During failover, the database may be offline for a second and still meet our 99.995% SLA. However, failover of the primary replica impacts workload because it aborts in-flight queries and transactions. We built features such as resumable index (re)build and accelerated database recovery to address these situations, but not all running operations are automatically resumable. It may be expensive to restart complex queries or transactions that were aborted due to an upgrade. So even though failovers are quick, we want to avoid them.
Read on to see how they do it. There’s no on-prem analogue yet, though perhaps that will come in time.
A SQL Server data tools project is checked out of GitHub, built into a DacPac, four containerized SQL Server instances are spun up using clones of the ‘Seed’ docker volume. The DacPac is applied to a database running inside each container, which a tSQLt test is then executed against, finally, at the end very end the tSQLt results are aggregate and published.
This is an interesting approach to the problem of lengthy tests: run them on several separate machines concurrently.
For troubleshooting any issues with AWS DMS, it is necessary to have logs enabled. The DMS logs would typically give a better picture and helps find errors or warnings that would indicate the root cause of the failure. If the logs are not available there is nothing much you can do from a detailed troubleshooting analysis perspective. So basically next step is to turn on DMS logs and kick the job again and validate if the errors are captured in the logs.
If logs are not enabled, you need to set up a new task with logging enabled so if and when it errors out, you can take a look and troubleshoot the same.
I’ll save my full rant for another day, but I’m not that impressed with DMS. It could be a failing on my part, though.
Last year I wrote about how to upload data to Azure Blob Storage Archive Tier, and included a PowerShell script to do so. It’s something I use regularly, as I have hundreds of gigabytes of photos and videos safely (and cheaply!) stored in Azure Blob Storage using Archive Tier.
But read on for a recent announcement to make the process easier.
Using a Shared Access Signature (SAS) is usually the best way to control access rights to Azure storage resources (like a container for backups) without exposing the primary / secondary storage keys. It is based on a URI and this is what I want to look at today.
Read on to learn about the components which make up a Shared Access Signature.