Month: October 2018

Contrasting Integration Services And Pentaho Data Integration

Koen Verbeeck contrasts SQL Server Integration Services with Pentaho Data Integration:

For generating SSIS packages, you need to rely on Biml (much about that can be found on this blog or on the net), or older frameworks such as ezApi. Or you need 3rd party tools such as BimlStudio or TimeXtender. Using Biml means writing XML and .NET. Don’t get me wrong, I love Biml and I use it a lot in my SSIS projects.

But generating transformations in PDI is so much easier. First, you create a template (you create a transformation, but you leave certain fields empty, such as the source SQL statement and the destination table). Then you have another transformation reading metadata. This metadata is pushed to the template using the Metadata Injection Transformation. In this transformation, you point to the template and you map those empty fields to your metadata fields.
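
The template-plus-metadata idea can be sketched outside of PDI. The following is a minimal illustration in Python, not PDI's actual API: the dictionary layout, field names, queries, and table names are all hypothetical, chosen only to mirror the "empty fields filled from metadata" pattern described above.

```python
from copy import deepcopy

# Hypothetical template: a transformation whose source query and
# destination table are deliberately left empty.
TEMPLATE = {
    "name": "load_table",
    "source_sql": None,
    "destination_table": None,
}

# Hypothetical metadata rows, as a second transformation might read them
# from a control table.
METADATA = [
    {"query": "SELECT * FROM src.customer", "target": "stg.customer"},
    {"query": "SELECT * FROM src.orders", "target": "stg.orders"},
]

# Map each empty template field to the metadata field that supplies it.
MAPPING = {"source_sql": "query", "destination_table": "target"}


def inject(template, metadata_row, mapping):
    """Return a copy of the template with its empty fields filled from one row."""
    filled = deepcopy(template)
    for template_field, metadata_field in mapping.items():
        filled[template_field] = metadata_row[metadata_field]
    return filled


# One generated transformation per metadata row; the template stays untouched.
generated = [inject(TEMPLATE, row, MAPPING) for row in METADATA]
for transformation in generated:
    print(transformation["destination_table"], "<-", transformation["source_sql"])
```

The point of the pattern is that adding a new table means adding a metadata row, not building another transformation.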

It’s interesting to see where each product stands out or falls flat compared to the other, and Koen’s comparison is definitely not a one-sided bout.

Replicating Solr Indexes

Nirmal Prabhu walks us through configuring replicated Solr instances:

Step 4: [Creating master Core]

First, we need to create a core for indexing the data. The Solr create command has the following options:

  • -c <name> — Name of the core or collection to create (required).
  • -d <confdir> — The configuration directory, useful in the SolrCloud mode.
  • -n <configName> — The configuration name. This defaults to the same name as the core or collection.
  • -p <port> — Port of a local Solr instance to send the create command to; by default the script tries to detect the port by looking for running Solr instances.
  • -s <shards> — Number of shards to split a collection into, default is 1.
  • -rf <replicas> — Number of copies of each document in the collection. The default is 1.

In this example, we will use the -c parameter for core name, -rf parameter for replication and -d parameter for the configuration directory.
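
As a rough sketch of how those options combine, the Python helper below assembles a `bin/solr create` command line from the flags listed above. The helper itself, the core name `master_core`, and the configuration directory `_default` are assumptions for illustration; the linked post has the actual values. The command is only printed, not executed.

```python
import shlex


def solr_create_args(name, confdir=None, replicas=None, port=None, shards=None):
    """Build the argument list for `bin/solr create` from the options above."""
    args = ["bin/solr", "create", "-c", name]  # -c is the only required option
    if confdir is not None:
        args += ["-d", confdir]
    if port is not None:
        args += ["-p", str(port)]
    if shards is not None:
        args += ["-s", str(shards)]
    if replicas is not None:
        args += ["-rf", str(replicas)]
    return args


# Hypothetical values: core name, configuration directory, and replica count.
args = solr_create_args("master_core", confdir="_default", replicas=2)
print(" ".join(shlex.quote(arg) for arg in args))
```

Building the list first and quoting it only for display keeps the arguments safe to hand to `subprocess.run` unchanged if you do decide to execute it.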

Read on for step-by-step instructions.

Lazy Log Truncation

Paul Randal explains why Virtual Log Files might remain in status 2 even after they are cleared:

Earlier this year I was sent an interesting question about why the person was seeing lots of VLFs in the log with status = 2 (which means ‘active’) after clearing (also known as ‘truncating’) the log, even though log_reuse_wait_desc showed NOTHING.

I did some digging around and all I could find was an old blog post from 2013 that shows the behavior and mentions that this happens with mirroring and Availability Groups. I hadn’t heard of this behavior before but I guessed at the reason, and confirmed with the SQL Server team.

Read on for the answer.

Approved PowerShell Verbs

Richard Siddaway on approved verbs in PowerShell:

The other very useful set of information is the synonyms for verbs that you shouldn’t use. For instance, don’t use Append, Attach, Concatenate or Insert – use Add. Some of this information is contextual, though, as you shouldn’t use Pop or Out as a synonym for Exit BUT Pop is perfectly valid when removing an item off a stack (Pop-Location is the only cmdlet I know of that works in that way).
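
To make the synonym rule concrete, here is a small Python sketch of a name checker. The mapping contains only the four synonyms for Add named in the excerpt; the function name and the sample command names are hypothetical. The contextual cases (Pop, Out) are deliberately left out, since whether they are wrong depends on what the command does.

```python
# Discouraged synonyms and the approved verb to use instead,
# taken from the excerpt above.
APPROVED_FOR = {
    "Append": "Add",
    "Attach": "Add",
    "Concatenate": "Add",
    "Insert": "Add",
}


def suggest_name(command_name):
    """Return the command name with a discouraged verb swapped for its approved form."""
    verb, separator, noun = command_name.partition("-")
    if not separator:
        return command_name  # not in Verb-Noun form, so leave it untouched
    return APPROVED_FOR.get(verb, verb) + separator + noun


# Hypothetical command names.
print(suggest_name("Append-LogEntry"))
print(suggest_name("Pop-Location"))
```

Within PowerShell itself, the `Get-Verb` cmdlet lists the approved verbs directly.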

Read on for a link to the approved verbs list.

Going In-Depth On PowerShell Arrays

Kevin Marquette has a tour de force on PowerShell arrays:

When your array is a collection of strings or integers (value types), sometimes you will want to update the values in the array as you enumerate them. Most of the iteration loops above use a variable in the loop that holds the value. If you update that variable, the original value in the array is not updated.

The exception to that statement is the for loop. If you want to walk an array and update values inside it, then the for loop is what you are looking for.

    for ( $index = 0; $index -lt $data.count; $index++ )
    {
        $data[$index] = "Item: [{0}]" -f $data[$index]
    }

This example takes a value by index, makes a few changes, and then uses that same index to assign it back.
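
The same distinction shows up in other languages. As a rough analogue in Python (not PowerShell, and with sample data invented for the illustration), reassigning the loop variable leaves the list alone, while assigning through the index writes the value back:

```python
data = ["one", "two", "three"]

# Rebinding the loop variable changes only the local name, not the list.
for item in data:
    item = "Item: [{0}]".format(item)
unchanged = list(data)  # snapshot showing the list was not modified

# Assigning through the index writes the new value back into the list.
for index in range(len(data)):
    data[index] = "Item: [{0}]".format(data[index])

print(unchanged)
print(data)
```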

This is a book chapter-length blog post full of good information.

Safely Dropping Databases

Bob Pusateri notes a little issue when it comes to dropping databases:

At a previous employer, we had a well-defined process when dropping databases for a client. It went something like this:

  1. Confirm in writing which databases on which servers/instances are to be dropped
  2. Take a final full backup of databases
  3. Take databases offline
  4. Wait at least two weeks to make sure nothing breaks in the absence of these databases
  5. Drop databases

This is a pretty good and safe method. If taking the database offline causes some unforeseen system to stop working, it can be very quickly brought back online in-place, instead of having to locate the backup and restore it. But there’s just one problem…

Read on for that problem and its solution.

Generating Load For Kafka With JMeter

Anup Shirolkar shows us a way to use JMeter to generate load for Apache Kafka clusters:

The Anomalia Machina is going to require (at least!) one more thing as stated in the intro, loading with lots of data! Kafka is a log aggregation system and operates on a publish-subscribe mechanism. The Kafka cluster in Anomalia Machina will be accumulating a lot of events which are to be processed to discover anomalies. The exact sequence of processing is still being prototyped at this point in time, but there is a solid requirement of a tool/mechanism to load the Kafka cluster with lots of data in a hurry.

The requirements pointed me in the direction of ‘Kafka Load Testing’. When thinking of load testing, one tool comes to mind which is very widely used for load testing Java-based systems: ‘JMeter’. JMeter has a rich toolset for performing various types of testing. It also comes with many advantages, viz. open source, easy to use, platform independent, distributed testing, etc. I can use JMeter and test its ability to perform cluster loading.

Read on for the demonstration.

Data Science And Data Engineering In HDP 3.0

Saumitra Buragohain, et al., show off some of the things added to the Hortonworks Data Platform for data scientists and data engineers:

We leverage the power of HDP 3.0, from efficient storage (erasure coding) and GPU pooling to containerized TensorFlow and Zeppelin, to enable this use case. We will save the details for a different blog (please see the video). To summarize, as we trained the car on a track, we collected about 30K images with corresponding steering angle data. The training data was stored in an HDP 3.0 cluster and the TensorFlow model was trained using 6 GPU cards and then the model was deployed back on the car. The deep learning use case highlights the combined power of HDP 3.0.

Click through for more additions and demos.

Thinking About Font Sizes

Stephanie Evergreen shares some good information on font sizes:

Did you know that you regularly read type set in size 8, or even smaller? In printed materials, captions and less important information (think: photograph credits, newsletter headline subtext, magazine staff listings) are usually reduced to something between 7.5 and 9 points. We generally read that size type without much issue, like glasses. We can comfortably read those small sizes because the designers chose an effective font that keeps its clarity and legibility when shrunk.

Designers don’t make the font that tiny to give you a headache. They do it to establish a font hierarchy. Our brains interpret the biggest size as the most important and the littlest size as the least important. So we can create a hierarchy of font sizes to structure our work and communicate even more clearly.

The font hierarchy is important, but so is picking a font which is clear at the sizes you want to use.

New DMF: dm_db_page_info

Pam Lahoud shows off a new Dynamic Management Function in SQL Server 2019:

The primary use case we had in mind when developing this function was troubleshooting page-related waits. Some of the key performance scenarios that require page details to diagnose are tempdb contention, last page insert contention (also see this blog article for code samples) and page-level blocking. All of these scenarios present as either PAGELATCH or lock wait types in sys.dm_exec_requests with a page identifier (db_id:file_id:page_id) in the wait_resource column. In order to gain any insight into the problem, you have to know more details about what that page is. Is it a data page, an index page or something else? If it’s a data or index page, which object does it belong to? Prior to SQL Server 2019 CTP 2.0, the only way to gain any of this insight was to use the lightly documented command DBCC PAGE. Unfortunately, if you’re using some sort of monitoring tool or script to gather data about waits and blocking, there’s no good way to include this DBCC PAGE call in those scripts. Not to mention it requires sysadmin privileges. Enter sys.dm_db_page_info.

The sys.dm_db_page_info function takes 4 parameters: database ID, file ID, page ID and mode (‘LIMITED’ or ‘DETAILED’). It returns a table with a single row that contains the header information from the page, including things like object_id, index_id and partition_id. As a standalone function, this provides a documented and fully supported way to retrieve this information without the need for a DBCC command. The real benefit however is the ability to JOIN this function with other DMVs such as sys.dm_exec_requests or sys.sysprocesses to gather this information automatically when there is any sort of page-related contention. If you’re familiar with these views you know that the wait_resource column can contain all sorts of different resources, so it’s not in the right format to facilitate a direct join to sys.dm_db_page_info. To support this use case, we had to add another column to these views called page_resource. Whenever the wait_resource column contains a page resource, the page_resource column will be populated with an 8-byte hexadecimal value. The page ID is the first four bytes, followed by 2 bytes for the file ID and then 2 bytes for the database ID.
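
To illustrate the page_resource layout described above, here is a minimal Python sketch that splits an 8-byte value into its three parts. The field order and widths come from the excerpt; the little-endian byte order, the function name, and the sample value are assumptions made for this illustration. Inside SQL Server, the supported route is the built-in sys.fn_PageResCracker function rather than manual decoding.

```python
import struct


def decode_page_resource(value: bytes):
    """Split an 8-byte page_resource value into (database_id, file_id, page_id).

    Layout per the excerpt: 4 bytes of page ID, then 2 bytes of file ID,
    then 2 bytes of database ID. Little-endian byte order is an assumption.
    """
    if len(value) != 8:
        raise ValueError("page_resource must be exactly 8 bytes")
    page_id, file_id, database_id = struct.unpack("<IHH", value)
    return database_id, file_id, page_id


# Hypothetical sample value: page 312 (0x138) in file 1 of database 5.
sample = bytes.fromhex("3801000001000500")
print(decode_page_resource(sample))  # → (5, 1, 312)
```

The tuple is returned in database, file, page order so that it lines up with the first three parameters of sys.dm_db_page_info.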

There are also a couple of supporting functions, so check those out.
