
Author: Kevin Feasel

Ambari 2.4

Jeff Sposetti discusses improvements in Ambari 2.4:

Reduce time to troubleshoot problems. Apache Hadoop components create a lot of log data. Accessing that log data to understand what the component is telling you, especially when issues arise, is critical. Apache Ambari includes a new Log Search service that provides agents for log collection and delivers a custom UI for searching those logs. This is essential to providing a streamlined approach to searching for stack traces and exceptions across all nodes in the cluster.

I have enjoyed watching Ambari mature as a product.


Waiting For Rollback

Andrea Allred ran into an issue with a long-running job on an Availability Group:

I panicked. In this situation I would normally pull the database out of the AG and then re-add it. I didn’t have that option because it is a HUGE database and I didn’t have that much time and space to move it around. I knew a large transaction had kicked off (thank you alert email that I created to warn me about such things) but thought that surely the rollback would have cleared quickly. That led me to looking for rolling back transactions.

Fortunately, the issue was on a secondary not under heavy use, so she was able to recover in time.


Flink And Kafka Streams

Neha Narkhede and Stephan Ewen compare Apache Flink versus Kafka Streams:

Before Flink, users of stream processing frameworks had to make hard choices and trade off either latency, throughput, or result accuracy. Flink was the first open source framework (and still the only one) that has been demonstrated to deliver (1) throughput on the order of tens of millions of events per second in moderate clusters, (2) sub-second latency that can be as low as a few tens of milliseconds, (3) guaranteed exactly-once semantics for application state, as well as exactly-once end-to-end delivery with supported sources and sinks (e.g., pipelines from Kafka to Flink to HDFS or Cassandra), and (4) accurate results in the presence of out-of-order data arrival through its support for event time.

Flink is based on a cluster architecture with master and worker nodes. Flink clusters are highly available, and can be deployed standalone or with resource managers such as YARN and Mesos. This architecture is what allows Flink to use a lightweight checkpointing mechanism to guarantee exactly-once results in the case of failures, as well as allow easy and correct re-processing via savepoints without sacrificing latency or throughput.

Finally, Flink is also a full-fledged batch processing framework, and, in addition to its DataStream and DataSet APIs (for stream and batch processing respectively), offers a variety of higher-level APIs and libraries, such as CEP (for Complex Event Processing), SQL and Table (for structured streams and tables), FlinkML (for Machine Learning), and Gelly (for graph processing). Flink has been proven to run very robustly in production at very large scale by several companies, powering applications that are used every day by end customers.

The upshot is that the two products don’t do exactly the same thing, and there might be room in your organization for the two of them.
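
To make the event-time point in the quote a little more concrete, here is a toy sketch in plain Python. It is not the Flink or Kafka Streams API; the function name, the window size, and the sample events are all made up for illustration. It only shows why bucketing by the timestamp carried in the event gives the same counts regardless of arrival order. A real engine also has to decide when a window is complete (Flink uses watermarks for that), which this sketch ignores.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_size):
    """Count events per tumbling window, keyed by event time rather than arrival order.

    `events` is an iterable of (event_time, payload) pairs; this is a
    hypothetical shape chosen for the example.
    """
    counts = defaultdict(int)
    for event_time, _payload in events:
        # Assign the event to the window that contains its own timestamp.
        window_start = (event_time // window_size) * window_size
        counts[window_start] += 1
    return dict(sorted(counts.items()))


# Events listed in arrival order; note that event times are out of order.
arrivals = [(1, "a"), (12, "b"), (3, "c"), (11, "d"), (9, "e")]
out_of_order = tumbling_window_counts(arrivals, window_size=10)

# The same events arriving in timestamp order produce identical windows.
in_order = tumbling_window_counts(sorted(arrivals), window_size=10)
```

Both calls put three events in the window starting at 0 and two in the window starting at 10, because the late arrivals are still assigned by their own timestamps.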


The Joy Of Hyperparameters

Koos van Strien shows how to tune hyperparameters using Azure ML:

Today, we’ll focus on tuning the model’s properties. We won’t discuss the details of all properties (you can easily look that up in the docs); instead, we’ll look at how to test different parameter combinations inside Azure ML Studio.

As soon as you click on an untrained model inside your experiment, you’ll be presented with some parameters – or, in ML parlance, hyperparameters – you can tweak.

Parameter tuning is pretty easy using Azure ML.
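
For readers who have not seen a parameter sweep before, the idea of testing different parameter combinations can be sketched in a few lines of plain Python. This is not Azure ML code: `grid_search`, `toy_score`, and the parameter names are hypothetical, and the scoring function is a stand-in for "train the model and measure it on validation data."

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and return the best (params, score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(learning_rate, num_trees):
    # Hypothetical stand-in for a validation metric; it peaks at
    # learning_rate=0.1 and num_trees=100 purely for demonstration.
    return -abs(learning_rate - 0.1) - abs(num_trees - 100) / 1000


grid = {"learning_rate": [0.01, 0.1, 0.5], "num_trees": [50, 100, 200]}
best_params, best_score = grid_search(grid, toy_score)
```

An exhaustive grid grows multiplicatively with each added parameter (nine combinations here), which is why tuning tools often also offer random sampling of the grid.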


Tornado Visual

Devin Knight looks at the Tornado chart:

  • The Tornado has a few limitations that you should be aware of before using it

    • If there’s a legend value, it should only have 2 distinct values

    • Each distinct category value is a separate bar with left and right parts

    • Alternatively, you can have two measure values and compare them without a legend

I’m split on whether I like the tornado or not. It is intuitive and information-dense, which are two major factors in its favor. It is, however, difficult to read and compare. This seems like a useful “big picture” chart, but you’d want to organize the data in a different way when you start drilling down.
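
The data shape those limitations imply can be sketched in plain Python. This is only an illustration of the constraint, not anything from the visual itself: the `tornado_rows` helper, the record layout, and the sample sales figures are all invented for the example.

```python
def tornado_rows(records):
    """Pivot (category, legend, value) records into one left/right pair per category.

    Raises ValueError unless the legend has exactly two distinct values,
    mirroring the limitation described above.
    """
    legends = sorted({legend for _category, legend, _value in records})
    if len(legends) != 2:
        raise ValueError(f"expected exactly 2 legend values, got {len(legends)}")
    left_name, right_name = legends

    rows = {}
    for category, legend, value in records:
        left, right = rows.get(category, (0, 0))
        if legend == left_name:
            left += value
        else:
            right += value
        rows[category] = (left, right)
    return left_name, right_name, rows


# Hypothetical sample data: one bar per category, one side per legend value.
sales = [
    ("Bikes", "2015", 120),
    ("Bikes", "2016", 150),
    ("Helmets", "2015", 40),
    ("Helmets", "2016", 35),
]
left_name, right_name, rows = tornado_rows(sales)
```

Each category ends up with one value for the left side of its bar and one for the right, which is also the shape you get when you supply two measures instead of a legend.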


PowerShell Workflows

Cody Konior has a beef with PowerShell workflows:

That’s inexplicable.

One thing which does make it all work is setting $PSRunInProcessPreference which, “If this variable is specified, all activities in the enclosing scope are run in the workflow process.” Unfortunately that doesn’t explain what’s really going on and what the impacts are, so I won’t use it. But here it is turning the original failing script into a working one.

I’ve never used PowerShell workflows. It sounds like a potentially exasperating experience.


Renaming SQL Servers

Wayne Sheffield shows what to do when you need to rename your SQL Server instance:

Sometimes you make a mistake, and forget to rename a sysprep’d server before installing SQL Server. Or perhaps your corporate naming standard has changed, and you need to rename a server. Maybe you like to waste the time involved in troubleshooting connection issues after a server rename. In any case, you now find yourself where the name of the SQL Server is different than the physical name of the server itself, and you need to rename SQL Server to match the server’s physical name.

You could always rerun the setup program to rename the server. Fortunately, SQL Server provides an easier way to do this. You just need to run two stored procedures: sp_dropserver and sp_addserver.

Click through for details, including important considerations.
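
As a rough sketch of what those two calls look like, the Python helper below just builds the T-SQL text; it does not connect to anything. The helper name and the server names `OLDBOX` and `NEWBOX` are hypothetical, and this is not taken from Wayne's post. The `'local'` argument to sp_addserver is what marks the new name as the local instance, and the SQL Server service needs a restart before `@@SERVERNAME` reflects the change.

```python
def rename_instance_sql(old_name, new_name):
    """Build the T-SQL batch that swaps the registered local server name."""

    def quote(name):
        # Double any embedded single quotes so the literal stays well-formed.
        return "N'" + name.replace("'", "''") + "'"

    return "\n".join(
        [
            f"EXEC sp_dropserver {quote(old_name)};",
            f"EXEC sp_addserver {quote(new_name)}, 'local';",
        ]
    )


# Hypothetical names for illustration only.
script = rename_instance_sql("OLDBOX", "NEWBOX")
print(script)
```

Review the generated statements before running them, and read the linked post first: the considerations it covers matter more than the two procedure calls.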


CHECKDB And Indexes On Persisted Computed Columns

Arun Sirpal diagnoses a slower-than-usual DBCC CHECKDB run:

All the signs of CHECKDB Latch contention.

DBCC_OBJECT_METADATA: this latch can be a major bottleneck for DBCC consistency checks when indexes on computed columns exist. As a side note, DBCC_MULTIOBJECT_SCANNER is used to get the next set of pages to process during a consistency check.

Read on for the details and Arun’s solution.


Hadoop For .NET Developers

Elton Stoneman has a new Pluralsight course out:

My latest Pluralsight course is out now:

Hadoop for .NET Developers

It takes you through running Hadoop on Windows and using .NET to write MapReduce queries – proving that you can do Big Data on the Microsoft stack.

The course has five modules, starting with the architecture of Hadoop and working through a proof-of-concept approach, evaluating different options for running Hadoop and integrating it with .NET.

I’ve liked Elton’s courses, as he’s one of the few trainers who really takes the time to show how you can integrate .NET languages into a Hadoop ecosystem; the general philosophy is “go learn Java and Scala and Python and …”
