Day: December 21, 2016

Spark Versus Flink

Sibanjan Das compares Apache Flink to Apache Spark:

The primitive concept of Apache Flink is the high-throughput and low-latency stream processing framework which also supports batch processing. The architecture is a flip of the other Big Data processing architectures where the primary notion was the batch processing framework. This is something that organizations have been looking for over the last decade. There is a need for platforms supporting low latency data movement for applications where even a millisecond delay can lead to severe consequences. The prospect of Apache Flink seems to be significant and looks like the goal for stream processing.

While comparing these two, don’t forget about Kafka Streams.  We’ve entered the streaming era for Hadoop & friends, and it’s an exciting time.

Mixed Integer Optimization

David Smith discusses the ompr package in R:

Counterintuitively, numerical optimizations are easiest (though rarely actually easy) when all of the variables are continuous and can take any value. When integer variables enter the mix, optimization becomes much, much harder. This typically happens when the optimization is constrained by a limited selection of objects, for example packages in a weight-limited cargo shipment, or stocks in a portfolio constrained by sector weightings and transaction costs. For tasks like these, you often need an algorithm for a specialized type of optimization: Mixed Integer Programming.

For problems like these, Dirk Schumacher has created the ompr package for R. This package provides a convenient syntax for describing the variables and constraints in an optimization problem. For example, take the classic “knapsack” problem of maximizing the total value of objects in a container subject to its maximum weight limit.
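
For reference, the knapsack problem is a small 0/1 integer program: maximize the total value Σ vᵢxᵢ subject to the weight constraint Σ wᵢxᵢ ≤ W, where each xᵢ ∈ {0, 1} says whether object i goes into the container. Those binary xᵢ are exactly the integer variables that make this a mixed integer problem rather than a smooth, continuous one.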

Read the whole thing.

Understanding HDFS Disk Checks

Xiao Chen explains how the HDFS Disk Checker works for data nodes:

The function of block scanner is to scan block data to detect possible corruptions. Since data corruption may happen at any time on any block on any DataNode, it is important to identify those errors in a timely manner. This way, the NameNode can remove the corrupted blocks and re-replicate accordingly, to maintain data integrity and reduce client errors. On the other hand, we don’t want to utilize too many resources, so that disk I/O can still serve actual requests.

Therefore, block scanner needs to make sure that suspicious blocks are scanned relatively quickly, and other blocks are scanned every once in a while, at a relatively lower frequency, without significant I/O usage.

This is a nice article for operations folks who own Hadoop clusters.
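
If you do need to tune the scanner, the relevant knobs live in hdfs-site.xml. The property names and defaults below are the ones I recall from recent Hadoop releases; check your distribution's documentation before relying on them:

<!-- How often each block must be scanned; 504 hours is three weeks -->
<property>
  <name>dfs.datanode.scan.period.hours</name>
  <value>504</value>
</property>
<!-- Throttle on scanner I/O per volume, in bytes per second -->
<property>
  <name>dfs.block.scanner.volume.bytes.per.second</name>
  <value>1048576</value>
</property>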

Using Polybase To Insert Into HDFS

I have a post on writing to HDFS using Polybase:

What’s interesting is that the error message itself is correct, but it could be confusing.  Note that it’s looking for a path with this name, but it isn’t seeing a path; it’s seeing a file with that name.  Therefore, it throws an error.

This proves that you cannot control insertion into a single file by specifying the file at create time.  If you do want to keep the files nicely packed (which is a good thing for Hadoop!), you could run a job on the Hadoop cluster to concatenate all of the results of the various files into one big file and delete the other files.  You might do this as part of a staging process, where Polybase inserts into a staging table and then something kicks off an append process to put the data into the real tables.

Sometime in the future, I plan to see how it scales:  with multiple files writing to a multi-node Hadoop cluster, do I get better write performance with a Polybase scaleout cluster?  And if so, how close to linear scale can I get?
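
To make the setup concrete, here is a minimal sketch of the export pattern. The object names are hypothetical, and it assumes an external data source and file format have already been created:

-- Writing to external tables must be enabled first.
EXEC sp_configure 'allow polybase export', 1;
RECONFIGURE;
GO

-- LOCATION points at a directory, not a file; Polybase decides how many files to write underneath it.
CREATE EXTERNAL TABLE dbo.SalesExport
(
	SaleID INT,
	SaleAmount DECIMAL(10, 2)
)
WITH
(
	LOCATION = '/Export/Sales/',
	DATA_SOURCE = HadoopCluster,	-- hypothetical external data source
	FILE_FORMAT = DelimitedText	-- hypothetical external file format
);
GO

INSERT INTO dbo.SalesExport
SELECT SaleID, SaleAmount
FROM dbo.Sales;
GO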

Operator Elapsed Time

Kendra Little shows off a really cool feature in SQL Server 2016 & 2014 SP2:

SQL Server now shows Actual Elapsed CPU Time and Actual Elapsed Time (duration) for each operator in an Actual Execution Plan

For SQL Server 2016 and 2014 SP2 and higher, actual execution plans contain a bunch of new information on each operator, including how much CPU they burn, how long it takes, and how much IO is done by that operator. This was a little hard to use for a while because the information was only visible in the XML of the execution plan.

Check out Kendra’s post for more details, including a couple of caveats.
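
If you'd rather poke at the raw numbers yourself, here's a quick sketch; the query itself is arbitrary, and the attribute names are the ones I remember seeing in the plan XML:

SET STATISTICS XML ON;

SELECT o.name, c.name
FROM sys.objects AS o
	INNER JOIN sys.columns AS c
		ON o.object_id = c.object_id;

SET STATISTICS XML OFF;
GO
-- In the returned showplan XML, each operator's RunTimeCountersPerThread element
-- carries ActualElapsedms and ActualCPUms on 2016 and 2014 SP2 and later.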

New T-SQL Features

Slava Murygin looks at some new functions in the vNext CTP 1.1:

Since Microsoft introduced XML support in SQL Server, the most common string concatenation technique was use of "XML PATH('')" like this:

-- The old way: concatenate with FOR XML PATH('') and trim off the leading ', '.
SELECT SUBSTRING(
	(SELECT ', ' + name FROM master.sys.tables
	FOR XML PATH(''))
	, 3, 8000);
GO

Now you can aggregate your strings by using function “STRING_AGG”:

SELECT STRING_AGG(name, ', ') FROM master.sys.tables;

Read on for the other three.  This aggregation function alone would make some of my code a lot simpler and easier to explain to junior database developers.  I just want it to perform well, is all.
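
As a quick illustration of why I like it, grouped concatenation turns into a one-liner. This is my own sketch rather than something from Slava's post:

-- One comma-separated list of table names per schema.
SELECT s.name AS schema_name,
	STRING_AGG(t.name, ', ') AS table_list
FROM sys.tables AS t
	INNER JOIN sys.schemas AS s
		ON t.schema_id = s.schema_id
GROUP BY s.name;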

Finding Database Backup History

Jason Brimhall has a script to determine when your databases have been backed up:

I have also set the script to accept a database name parameter. If a name is provided, then only the backup history for that database is returned. If the parameter is left NULL, then the backup history for all databases will be returned. Additionally, I added a number of days parameter to limit the scope of the report to a specific range of days.

Among the data points returned in this script, you will note there is the duration of the backup, the date, and even the size of the backup. All of these attributes can help me to forecast future storage requirements both for the backup storage as well as for the data volume. Additionally, by knowing the duration of the backup and the trend of that duration, I can adjust maintenance schedules accordingly.

These types of reports are quite useful.  Aside from giving you useful information on database backups, they should also be a reminder to have a process which deletes older backup history over time so that your msdb database doesn’t grow to an unsustainable size.
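
For a flavor of what such a report looks like, here is a minimal sketch against msdb (not Jason's script, and the parameter names are made up):

DECLARE @DatabaseName sysname = NULL,
	@NumberOfDays INT = 30;

SELECT bs.database_name,
	bs.type,	-- D = full, I = differential, L = log
	bs.backup_start_date,
	DATEDIFF(SECOND, bs.backup_start_date, bs.backup_finish_date) AS duration_in_seconds,
	bs.backup_size / 1048576.0 AS backup_size_mb
FROM msdb.dbo.backupset bs
WHERE bs.backup_start_date >= DATEADD(DAY, -@NumberOfDays, GETDATE())
	AND (@DatabaseName IS NULL OR bs.database_name = @DatabaseName)
ORDER BY bs.backup_start_date DESC;

On the cleanup side, msdb ships with sp_delete_backuphistory, which takes an oldest-date-to-keep parameter and prunes everything before it.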

The Story Behind SQL Server On Linux

Slava Oks is back blogging and gives us some insight into how SQL Server on Linux came to be:

To give you an idea of the effort involved, the SQL Server RDBMS and other services that ship with it in the SQL Server product suite account for more than 40 million lines of C++ code. Even though SQL Server has a resource management layer called SQLOS, the codebase bleeds Win32 semantics throughout. This means a pure port could take years just to get compiling and booting let alone figuring out things like performance and feature parity with SQL Server on Windows. In addition, doing a porting project while other SQL Server innovation is happening in the same codebase would have been a daunting task and keep the team in a close to endless catch-up game.

In conclusion, even though the potential job-offer intrigued me, it felt like an impossible task for one to take on.

It’s a great story, one which I never would have thought possible six years ago.

New Powershell And SQL Server Previews For Linux

Max Trinidad notes that there are new versions of Powershell and SQL Server previews available for Linux users:

To download the latest PowerShell Open Source just go to the link below:

https://github.com/PowerShell/PowerShell

Just remember to remove the previous version and any existing folders, as this will be resolved later.

To download the latest SQL Server vNext just check the following Microsoft blog post, as the new CTP 1.1 includes versions for both Windows and Linux:

SQL Server next version Community Technology Preview 1.1 now available

Max has additional links and resources in that post as well.
