
Author: Kevin Feasel

Getting Current File Name In SSIS

Bill Fellows shows how to get the currently processing file name in SSIS:

So, as much as I like the built-in solution, my pattern is to use a Derived Column to inject the file name into the Data Flow. I have a variable called CurrentFileName in all my packages. That contains the design-time path for my Flat File Connection Manager (or Excel). My Connection Manager will then have the ConnectionString/ExcelFilePath property assigned to be @[User::CurrentFileName]. This positions me for success because all I need to do is ensure that whatever mechanism I am using to determine my source file correctly populates that variable. In this post, a ForEach File Enumerator will handle that.

Within my Data Flow Task, I will add a Derived Column Transformation that adds my package variable into the data flow as a new column. Here, I am specifying it will be of data type DT_STR with a length of 130.

He follows up with some Biml to drive home the point.
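
The same pattern can be sketched outside SSIS. The Python below is only an analogy, not SSIS code: the loop stands in for the ForEach File Enumerator, current_file_name plays the role of @[User::CurrentFileName], and add_file_name is a hypothetical helper doing the job of the Derived Column. The file paths and the SourceFileName column name are made up, and raising an error for names longer than 130 characters is my assumption about how to honor the DT_STR length.

```python
import csv
import io

MAX_FILE_NAME_LENGTH = 130  # mirrors the DT_STR length of 130 from the quote


def add_file_name(rows, current_file_name):
    """Yield each row with the source file name appended as a new column."""
    if len(current_file_name) > MAX_FILE_NAME_LENGTH:
        # Assumption: fail loudly instead of silently truncating the name.
        raise ValueError(
            "file name longer than %d characters" % MAX_FILE_NAME_LENGTH
        )
    for row in rows:
        yield {**row, "SourceFileName": current_file_name}


# Two in-memory "files" stand in for what a file enumerator would find.
# The paths are hypothetical.
files = {
    "C:\\data\\sales_01.csv": "id,amount\n1,10\n2,20\n",
    "C:\\data\\sales_02.csv": "id,amount\n3,30\n",
}

loaded = []
for current_file_name, content in files.items():  # the loop sets the variable
    reader = csv.DictReader(io.StringIO(content))
    loaded.extend(add_file_name(reader, current_file_name))

for row in loaded:
    print(row)
```

The point of the design is that the data flow never cares how the file was found; it only reads one variable.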

Comments closed

Pester For Configuration Checks

Rob Sewell talks about SQL Server configuration testing using Pester, a PowerShell test framework:

Occasionally, for reasons I haven’t explored, I had to test against the value property of the returned object:

    It "The Full User Database Backup should be scheduled Weekly $OlaUserFullSchedule" {
        $Return.OlaUserFullSchedule.value | Should Be $OlaUserFullSchedule
    }

I wanted to be able to run the tests against environments or groups of servers with different default values, so I parameterised the test results as well. The logical next step was to turn it into a function so that I could do some parameter splatting. This also gives me the opportunity to show all of the things that I am currently passing to the test as parameters.

This is a nice walkthrough of Rob’s contribution, available on GitHub.
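
For readers who do not write PowerShell, here is a rough Python analogy of parameterised expectations plus splatting. It is not Pester and not Rob's code: check_configuration, DEFAULT_EXPECTED, and the ola_user_diff_schedule setting are hypothetical names, and dictionary unpacking with ** is used as the closest Python counterpart to splatting a hashtable of parameters.

```python
# Hypothetical default expectations; real values depend on your environment.
DEFAULT_EXPECTED = {
    "ola_user_full_schedule": "Weekly",
    "ola_user_diff_schedule": "Daily",
}


def check_configuration(actual, **expected):
    """Compare actual settings to expected ones and return a list of failures."""
    expected = {**DEFAULT_EXPECTED, **expected}  # caller overrides win
    failures = []
    for name, want in expected.items():
        got = actual.get(name)
        if got != want:
            failures.append(f"{name}: expected {want!r}, got {got!r}")
    return failures


actual_settings = {
    "ola_user_full_schedule": "Weekly",
    "ola_user_diff_schedule": "Hourly",
}

# Unpacking a dictionary into named arguments plays the role of splatting.
production_expectations = {"ola_user_diff_schedule": "Hourly"}

default_failures = check_configuration(actual_settings)
production_failures = check_configuration(
    actual_settings, **production_expectations
)
print(default_failures)
print(production_failures)
```

Keeping the expectations in a dictionary per environment means the same checks can run against different groups of servers without editing the checks themselves.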


NUMA Node Count

Denny Cherry shows us how to find the number of NUMA nodes a SQL Server has:

The next way to see your NUMA configuration is to look at the ERRORLOG file. Every time SQL Server is started, it logs the NUMA configuration towards the beginning of the ERRORLOG file. That line will look something like this example:

Server Node configuration: node 0: CPU mask: 0x00000000000000ff:0 Active CPU mask: 0x00000000000000ff:0. This message provides a description of the NUMA configuration for this computer. This is an informational message only. No user action is required.

Denny shows four different ways of getting this answer, and then provides some general advice on the maximum number of nodes you probably want.
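
The CPU mask in that log line is a bitmap, so counting its set bits gives the number of CPUs on the node. Here is a minimal Python sketch that pulls the node number and mask out of the quoted line and counts the bits. The regular expression is my own guess at a workable pattern based on this one example, and it ignores the value after the trailing colon in the mask.

```python
import re

# The example line from the quote, trimmed to the part that carries the data.
LOG_LINE = (
    "Server Node configuration: node 0: CPU mask: 0x00000000000000ff:0 "
    "Active CPU mask: 0x00000000000000ff:0."
)

# Assumption: each node entry looks like "node <n>: CPU mask: 0x<hex>".
NODE_PATTERN = re.compile(r"node (\d+): CPU mask: (0x[0-9a-fA-F]+)")


def cpus_per_node(line):
    """Return {node_id: cpu_count} by counting the set bits in each CPU mask."""
    result = {}
    for node_id, mask_text in NODE_PATTERN.findall(line):
        mask = int(mask_text, 16)
        result[int(node_id)] = bin(mask).count("1")
    return result


node_cpus = cpus_per_node(LOG_LINE)
print(node_cpus)
```

For the quoted line, the mask 0xff has eight bits set, so node 0 has eight CPUs.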


Designing For Usability In Power BI

Avi Singh walks through a usability scenario in Power Pivot and Power BI:

Here is how we can go about combining the Products and the newly created SubCategory table for a more usable and elegant solution. I have highlighted the key steps; you can examine the solution file to go step by step through the Power BI query.
Note: Don’t be scared of the M code; 99% of these steps were generated using the Query ribbon, and I have included those screenshots as well.

When developing BI models, it’s important to keep things as simple as possible.  Your desired end user likely does not have the necessary skill level to wade through normalized table designs, so make it easy for them to get their jobs done.
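
To make the idea concrete, here is a small Python sketch of folding a lookup table into the table users actually browse. It is not M and not Avi's solution: the table contents and column names are hypothetical, and the flatten function simply does a left outer join so that products without a subcategory are kept.

```python
# Hypothetical sample data standing in for the two tables.
products = [
    {"ProductKey": 1, "ProductName": "Road Bike", "SubCategoryKey": 10},
    {"ProductKey": 2, "ProductName": "Helmet", "SubCategoryKey": 20},
    {"ProductKey": 3, "ProductName": "Gift Card", "SubCategoryKey": None},
]
sub_categories = [
    {"SubCategoryKey": 10, "SubCategoryName": "Bikes"},
    {"SubCategoryKey": 20, "SubCategoryName": "Accessories"},
]


def flatten(products, sub_categories):
    """Left-join subcategory names onto products and drop the key column."""
    lookup = {s["SubCategoryKey"]: s["SubCategoryName"] for s in sub_categories}
    return [
        {
            "ProductKey": p["ProductKey"],
            "ProductName": p["ProductName"],
            # Products with no matching subcategory get None, not an error.
            "SubCategoryName": lookup.get(p["SubCategoryKey"]),
        }
        for p in products
    ]


flat_products = flatten(products, sub_categories)
for row in flat_products:
    print(row)
```

The end user sees one Products table with a readable SubCategoryName column instead of having to follow a key into a second table.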


Colleen Morrow shows how to automate Service Broker message handling:

The other way to automate this stored procedure is by attaching it to the queue itself so that Service Broker directly executes the procedure in a process called activation.  With activation, Service Broker starts an application whenever there is work to do (e.g. when messages are in the queue).  When we enable activation for a queue, SQL Server spawns a queue monitor process to keep an eye on the queue and check every few seconds to see whether there are messages available for processing.  There are 2 types of activation:

  • Internal activation – A stored procedure like the one above is attached to the queue to process messages automatically as they arrive.

  • External activation – An external program is notified that messages need to be processed.

Colleen focuses on internal activation, which calls a stored procedure for each message received.  Check her post out.
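
As a toy illustration of internal activation, the Python below attaches a handler to a queue and runs it whenever messages are waiting. This is a deliberately simplified model and not how Service Broker is implemented: ActivatedQueue and process_message are hypothetical names, and the handler runs synchronously inside send, whereas the quote describes a queue monitor that checks the queue on its own schedule.

```python
from collections import deque


class ActivatedQueue:
    """A toy queue that runs an attached handler whenever messages are waiting."""

    def __init__(self, activation_procedure):
        self._messages = deque()
        self._activation_procedure = activation_procedure

    def send(self, message):
        self._messages.append(message)
        self._activate()

    def _activate(self):
        # Drain the queue, handing each message to the attached procedure.
        while self._messages:
            self._activation_procedure(self._messages.popleft())


processed = []


def process_message(message):
    # Stand-in for the stored procedure that receives and handles a message.
    processed.append(message.upper())


order_queue = ActivatedQueue(process_message)
order_queue.send("order created")
order_queue.send("order shipped")
print(processed)
```

The sender never calls the handler directly; it only puts a message on the queue, which is the property that makes activation useful.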


Spark Optimizations

Over at the DZone blog, we learn how to use Distribute By and Cluster By to optimize Spark performance:

Your DataFrame is skewed if most of its rows are located on a small number of partitions, while the majority of the partitions remain empty. You really should avoid such a situation. Why? This makes your application virtually not parallel – most of the time you will be waiting for a single task to finish. Even worse, in some cases you can run out of memory on some executors or cause an excessive spill of data to a disk. All of this can happen if your data is not evenly distributed.

To deal with the skew, you can repartition your data using distribute by. For the expression to partition by, choose something that you know will evenly distribute the data. You can even use the primary key of the DataFrame!

It’s interesting to see how cluster by, distribute by, and sort by can have such different performance consequences.
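
To see why the choice of key matters, here is a pure-Python simulation of spreading rows across partitions. It does not use Spark: plain modulo on integer keys stands in for Spark's hashing of the distribute by expression, and the order_id and country_id columns are hypothetical, with country_id built so that 90% of rows share one value.

```python
from collections import Counter

NUM_PARTITIONS = 8
NUM_ROWS = 8000

# Hypothetical rows: order_id is unique, while 90% of rows have country_id 1.
rows = [
    {"order_id": i, "country_id": 1 if i % 10 else i % 50}
    for i in range(NUM_ROWS)
]


def partition_sizes(rows, key):
    """Assign each row to a partition by key modulo the partition count."""
    counts = Counter(row[key] % NUM_PARTITIONS for row in rows)
    return [counts.get(p, 0) for p in range(NUM_PARTITIONS)]


skewed = partition_sizes(rows, "country_id")
even = partition_sizes(rows, "order_id")
print("by country_id:", skewed)
print("by order_id:  ", even)
```

Distributing by the skewed column leaves one partition with most of the rows and several partitions empty, while distributing by the unique key gives every partition the same share of work.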


Kafka 0.10

Kafka 0.10 is now available:

Kafka Streams: Kafka Streams was introduced as part of the tech preview release of the Confluent Platform a few months ago and is now available through Apache Kafka 0.10.0.0. Kafka Streams is a library that turns Apache Kafka into a full-featured, modern stream processing system. Kafka Streams includes a high-level language for describing common stream operations (such as joining, filtering, and aggregating records), allowing developers to quickly develop powerful streaming applications. Kafka Streams offers a true event-at-a-time processing model, handles out-of-order data, allows stateful and stateless processing, and can easily be deployed on many different systems: Kafka Streams applications can run on YARN, be deployed on Mesos, run in Docker containers, or just be embedded into existing Java applications.

There are some nice improvements in this latest version of Kafka.


Memory Is The Key

Kathi Kellenberger discusses the importance of RAM to a SQL Server instance:

In order for SQL Server to read and update data, the data must be in the buffer. SQL Server does not work directly with the data in the files on disk. Once the pages of data are in the buffer, they can be used for multiple queries. This means that the data doesn’t have to be retrieved from disk every time it’s needed, thereby decreasing the amount of I/O work required.

You may have seen this yourself when selecting all the rows of a large table twice. The second time, the query can run much faster because the data does not have to be copied from the disk to the buffer. If you run another query from a different large table, it may cause the pages from the first table to be removed to make room. If there is not enough memory, pages will have to be read from disk more frequently causing your queries to be slow.

Even with extremely fast SSDs and flash storage arrays, RAM is still typically an order of magnitude faster, so having enough RAM and using it wisely is critical to a well-functioning SQL Server instance.
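
The behavior Kathi describes can be imitated with a tiny page cache. The Python below is a sketch under a big simplifying assumption: it uses a plain least-recently-used policy, which is not a claim about how SQL Server's buffer pool actually chooses pages to evict. BufferPool, the table names, and the page counts are all hypothetical.

```python
from collections import OrderedDict


class BufferPool:
    """A tiny LRU page cache that counts reads that had to go to disk."""

    def __init__(self, capacity_pages):
        self.capacity_pages = capacity_pages
        self.pages = OrderedDict()
        self.disk_reads = 0

    def read_page(self, page_id):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)  # mark as recently used
            return
        self.disk_reads += 1  # page was not cached, so fetch it from disk
        self.pages[page_id] = True
        if len(self.pages) > self.capacity_pages:
            self.pages.popitem(last=False)  # evict the least recently used page

    def scan(self, table, page_count):
        """Read every page of a table and return how many disk reads it cost."""
        before = self.disk_reads
        for page_number in range(page_count):
            self.read_page((table, page_number))
        return self.disk_reads - before


pool = BufferPool(capacity_pages=100)
results = [
    pool.scan("orders", 80),     # cold cache: every page comes from disk
    pool.scan("orders", 80),     # warm cache: served from memory
    pool.scan("customers", 80),  # a second table pushes orders pages out
    pool.scan("orders", 80),     # orders has to be read from disk again
]
print(results)
```

With room for only 100 pages, two 80-page tables cannot both stay cached, which is the "not enough memory" case from the quote in miniature.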


Feather

David Smith discusses Feather:

Unlike most other statistical software packages, R doesn’t have a native data file format. You can certainly import and export data in any number of formats, but there’s no native “R data file format”. The closest equivalent is the saveRDS/readRDS function pair, which allows you to serialize an R object to a file and then load it back into a later R session. But these files don’t hew to a standardized format (it’s essentially a dump of R’s in-memory representation of the object), and so you can’t read the data with any software other than R.

The goal of the feather project, a collaboration of Wes McKinney and Hadley Wickham, is to create a standard data file format that can be used for data exchange by and between R, Python, and any other software that implements its open-source format. Data are stored in a computer-native binary format, which makes the files small (a 10-digit integer takes just 4 bytes, instead of the 10 ASCII characters required by a CSV file), and fast to read and write (no need to convert numbers to text and back again). Another reason why feather is fast is that it’s a column-oriented file format, which matches R’s internal representation of data. (In fact, feather is based on the Apache Arrow framework for working with columnar data stores.) When reading or writing traditional data files, R must spend significant time translating the data from column format to row format and back again; with feather the entire second step in the process below is eliminated.

Given the big speedup in read time, I can see this file format being rather useful.  I just can’t see it catching on as a common external data format, though, unless most tools get retrofitted to support the file.  So instead, it’d end up closer to something like Avro or Parquet:  formats we use in our internal tools because they’re so much faster, but not formats we send across to other companies because they’re probably using a different set of tools.
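
The size claim in the quote is easy to check with Python's standard library. This sketch does not read or write Feather files; it only compares a text encoding with a fixed-width binary one. Note the boundary I am assuming: a 10-digit integer fits in 4 bytes only up to 2,147,483,647, the largest signed 32-bit value, so larger 10-digit numbers would need a wider type.

```python
import struct

value = 2147483647  # largest signed 32-bit integer, which has 10 digits

as_text = str(value).encode("ascii")  # how a CSV file would store it
as_binary = struct.pack("<i", value)  # fixed-width 32-bit little-endian integer

print(len(as_text), len(as_binary))

# Binary data round-trips without any text parsing.
(restored,) = struct.unpack("<i", as_binary)

# A whole column of integers can be packed contiguously, column-style.
column = [1, 2, 3, 2147483647]
column_bytes = struct.pack("<%di" % len(column), *column)
print(len(column_bytes))
```

Every value in the packed column costs the same 4 bytes regardless of how many digits it has, which is also what lets a reader jump straight to the nth value.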


Looking At R Services

Gail Shaw reviews R support in SQL Server 2016:

It’s not fast. The above piece of T-SQL took ~4 seconds to execute. This is on an Azure A3 VM. Not a great machine admittedly, but the R code, which just returns the first 6 rows of a built-in data set, ran in under a second on my desktop. This is likely not something you’ll be doing as part of an OLTP process.

I hope this external_script method is temporary. It’s ugly, hard to troubleshoot, and it means I have to write my R somewhere else, probably R Studio, maybe Visual Studio, and move it over once tested and working. I’d much rather see something like

I agree that sp_execute_external_script is a mess.  It’s the worst of dynamic SQL combined with multiple languages (T-SQL for the stored procedure and R for the contents, while taking care to deal with T-SQL single-quoting).  Still, even with these issues, I think this will be a very useful tool for data analysts, particularly when dealing with rather large data sets on warehouse servers with plenty of RAM.
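
To show the quoting problem concretely, here is a small Python sketch that wraps an R script in an sp_execute_external_script call. The helper build_external_script_call, the dbo.Sales table, and the R one-liner are hypothetical; the only rule being demonstrated is that a single quote inside a T-SQL string literal has to be doubled. I am assuming the default InputDataSet and OutputDataSet names on the R side.

```python
def build_external_script_call(r_script, input_query):
    """Wrap an R script and a T-SQL query in an sp_execute_external_script call."""

    def quote(text):
        # T-SQL escapes a single quote inside a string literal by doubling it.
        return "N'" + text.replace("'", "''") + "'"

    return (
        "EXEC sp_execute_external_script\n"
        "    @language = N'R',\n"
        f"    @script = {quote(r_script)},\n"
        f"    @input_data_1 = {quote(input_query)};"
    )


# Hypothetical R script containing a string literal of its own.
r_script = "OutputDataSet <- subset(InputDataSet, region == 'West')"
input_query = "SELECT region, amount FROM dbo.Sales"

call_text = build_external_script_call(r_script, input_query)
print(call_text)
```

The R string 'West' comes out as ''West'' in the generated T-SQL, which is exactly the kind of escaping you would otherwise have to do by hand when moving tested R code into the procedure call.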
