Azure Data Factory v2 And Decompression

Kevin Feasel

2018-04-23

Bugs, Cloud, ETL

Ben Jarvis notes a file naming bug with Azure Data Factory v2 when decompressing files:

ADF V2 natively supports decompression of files as documented at https://docs.microsoft.com/en-us/azure/data-factory/supported-file-formats-and-compression-codecs#compression-support. With this functionality ADF should change the extension of the file when it is decompressed so 1234_567.csv.gz would become 1234_567.csv however, I’ve noticed that this doesn’t happen in all cases.

In our particular case the file names and extensions of the source files are all uppercase and when ADF uploads them it doesn’t alter the file extension e.g. if I upload 1234_567.CSV.GZ I get 1234_567.CSV.GZ in blob storage rather than 1234_567.CSV.

Click through for more details and be sure to vote on his Azure Feedback bug if this affects you.

Five Books For Learning Kafka

Kevin Feasel

2018-04-23

Hadoop

Data Flair has a guide to five books to help you learn Apache Kafka:

The book “Kafka: The Definitive Guide” is written by engineers from Confluent andLinkedIn who are responsible for developing Kafka.

They explain how to deploy production Kafka clusters, write reliable event-driven microservices, and build scalable stream-processing applications with this platform. It contains detailed examples as well. Through this including the replication protocol, you’ll learn Kafka’s design principles, reliability guarantees, key APIs, and architecture details, the controller, and the storage layer. However, even if you are new to Apache Kafka as the application architect, developer, or production engineer, this practical guide shows you how to use this open source streaming platform to handle real-time data feeds.

I haven’t read any of them yet, but a couple look interesting enough to add to my to-read list.

Query Store Cleanup Can Be Blocked

Kendra Little shows that you can block Query Store cleanup:

This is an isolated test system, so I went to clean out Query Store as a reset. I didn’t need any of the old information in there, so I ran:

  • ALTER DATABASE BabbyNames SET QUERY_STORE CLEAR ALL;
  • GO

I was surprised when I didn’t see this complete very quickly, as it normally does.

Click through to see how Kendra diagnoses the issue.

The Non-Blocking Segment Operator

Hugo Kornelius notes a documentation bug with the Segment operator:

The Segment operator, like all operators, is described at the Books Online page mentioned above. Here is the description, quoted verbatim:

Segment is a physical and a logical operator. It divides the input set into segments based on the value of one or more columns. These columns are shown as arguments in the Segment operator. The operator then outputs one segment at a time.

Looking at the properties of the Segment operator, we do indeed see the argument mentioned in this description, in the Group By property (highlighted in the screenshot). So this operator reads the data returned by the Index Scan (sorted by TerritoryID, which is required for it to work; this is why the Index Scan operator is ordered to perform an ordered scan), and divides it into segments based on this column. In other words, this operator is a direct implementation of the PARTITION BY spefication. Every segment returned by the operator is what we would call a partition for the ROW_NUMBER() function in the T-SQL query. And this enables the Sequence Project operator to reset its counters and start at 1 for every new segment / partition.

Read on to understand the issue and see Hugo’s proof.

Type Information Change In Export-CSV Cmdlet

Max Trinidad notes that a default parameter in the Export-Csv cmdlet has flipped between Powershell on Windows and Powershell Core 6:

For a long time, in Windows PowerShell, we had to add the parameter “-NoTypeInformation“, so the “#TYPE …” line on the first row of the *CSV would not be included.

So, in Windows PowerShell executing the command without the “-NoTypeInformation” parameter, will produce the following result:

Now, using the same command in PowerShell Core without the “-NoTypeInformation” parameter, will produce a different result:

This is a better default, but I think it’s going to burn some people who have scripts pre-built expecting to clear out that first line.

Reading SQL Server Error Logs

Jana Sattainathan has a procedure which makes it easier to read the error logs in SQL Server:

xp_ReadErrorLog has some limitations

  • Reads only the specified error log whose ArchiveNumber is specified
  • Shows only the rows with matching string (not adjacent context info rows)

The first bullet is obvious in that we cannot read ALL the logs to look for something more holistic and meaningful with all the information we have. This MSSQLTips post does describe a method to loop through.

Click through for Jana’s stored procedure code.

Finding Queries Which Drive The Missing Index DMV

Daniel Janik shows how you can find which queries are causing you pain due to missing indexes:

Missing indexes are an important part of the indexing strategy. I usually start with sys.dm_db_index_usage_stats to find both inefficient and unused indexes and then supplement with missing indexes.

The missing index DMVs are great but they’ve always been missing something.

What are they missing you ask? They currently tell you what table they are for but not what query. How do I know if the queries that sponsored this missing index are business critical or not? Wouldn’t it be nice to know what statements caused this “missing index” to appear?

Read on to learn how to do this.

Creating Azure VMs Using Powershell: Laying The Groundwork

Robert Cain has part one of a two-part series on creating VMs in Azure using Powershell:

Creating a virtual machine is great, but it won’t be of much use unless it can communicate outside of itself. That’s where virtual networking comes in. To setup a virtual network, often abbreviated vnet, you need to accomplish three things. First is the creation of the virtual network itself. After the network is created, you need to define a security group for it. In essence, the security group defines a firewall. In the process of creating it, the PSAzure module automatically creates firewall rules that allow HTTP and RDP (Remote Desktop Protocol) traffic through the firewall. There are functions in PSAzure to create security groups at a lower level, allowing one to create alternate rules. This example will demonstrate the most common options.

The final step is to create a virtual NIC, or Network Interface Card. The NIC will form the bridge between the virtual network and the virtual machine, much like a physical network card allows a physical computer to connect to a real network. First off, a few variables are assigned. These will hold names for the security group, network and subnet names. The network addresses for the main network and subnet are also placed into into variables. Finally, a name is assigned to the NIC.

Check it out, especially if you build a lot of VMs in Azure.

Categories

April 2018
MTWTFSS
« Mar May »
 1
2345678
9101112131415
16171819202122
23242526272829
30