
Author: Kevin Feasel

Getting Started With Hadoop

Jon Morisi is looking at Hadoop:

From here I think I’ll start playing around on a sandbox.  Each of the distributions offers a way to spin up a VM or log into a cloud-based environment.  There are also Docker images out there (search Hadoop, Cloudera, or Hortonworks).  Most of these Docker images look fairly new, so don’t cut yourself.

I’m looking at the Hortonworks distro, so I’ll probably set up a Hortonworks Sandbox.

Jon includes some resources he’s used to learn a bit about the topic.  I think he’s going down the right path with videos and mailing lists—the ecosystem changes too quickly for books to have much long-term value, and mailing lists & forums tend to be better for keeping up to date.  My biggest suggestion is to get case studies and play around.  Check out studies on e-mail ingest, real-time data analysis, and analyzing fantasy sports for starters.


Audit Select Statements

Jason Brimhall shows how to build an Extended Events session that audits all SELECT statements:

I have to be a little honest here. Prior to somebody asking how they could possibly achieve a statement audit via extended events, I had not considered it as a tool for the job. I would have relied on Audit (which is Extended Events related), or some home-grown set of triggers. In this particular request, Audit was not fulfilling the want and custom triggers were not an option. Another option might have included the purchase of third-party software, but there are times when budget does not allow for nice expensive shiny software.

So, with a little prodding, I hopped into the metadata and poked around a bit to see what I could come up with to achieve this low-budget audit solution.

Read the whole thing.
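
If you just want to see the general shape of such a session, here is a rough sketch (not Jason’s exact session); it captures completed statements containing SELECT and writes them to an event file.  The session name and file path are made up for illustration:

CREATE EVENT SESSION [AuditSelects] ON SERVER
ADD EVENT sqlserver.sql_statement_completed
(
    ACTION (sqlserver.sql_text, sqlserver.username, sqlserver.client_hostname)
    -- Crude filter: keep statements whose text contains SELECT (it will also catch
    -- SELECTs embedded in other statements, so refine to taste).
    WHERE sqlserver.like_i_sql_unicode_string(sqlserver.sql_text, N'%SELECT%')
)
ADD TARGET package0.event_file (SET filename = N'C:\Temp\AuditSelects.xel')
WITH (MAX_DISPATCH_LATENCY = 5 SECONDS);
GO

ALTER EVENT SESSION [AuditSelects] ON SERVER STATE = START;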


Duplicate Statistics

Shaun J. Stuart discusses removing duplicate statistics:

I puzzled on this for a bit and got sidetracked by the strange way SSMS displays statistics columns on the Property page. Then it got to be the end of the day and I went home. The next day, I had a comment on my previous post from Aaron Bertrand who mentioned there is a related bug with the stats_column_id column of the sys.stats_columns view. It does not contain what the MSDN documentation says it contains. The Connect item for this, along with a workaround, can be found here.

The script I was using did not reference that column, but it did get me thinking that perhaps the script was not correctly identifying the first column in an index.

Shaun has an updated version of a duplicate statistics checker script that you may want to check out.
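
To give a sense of what such a check involves, here is a simplified sketch (not Shaun’s script) that flags auto-created statistics whose column is also the leading key column of an index on the same table.  It joins sys.stats_columns on column_id rather than stats_column_id, sidestepping the bug mentioned above:

SELECT OBJECT_NAME(s.object_id) AS table_name,
       s.name                   AS auto_created_stat,
       c.name                   AS column_name,
       i.name                   AS index_name
FROM sys.stats AS s
JOIN sys.stats_columns AS sc
    ON s.object_id = sc.object_id
    AND s.stats_id = sc.stats_id
JOIN sys.columns AS c
    ON sc.object_id = c.object_id
    AND sc.column_id = c.column_id
JOIN sys.index_columns AS ic
    ON ic.object_id = c.object_id
    AND ic.column_id = c.column_id
    AND ic.key_ordinal = 1          -- leading key column of the index
JOIN sys.indexes AS i
    ON i.object_id = ic.object_id
    AND i.index_id = ic.index_id
WHERE s.auto_created = 1;           -- auto-created (_WA_Sys) statistics only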


Azure SQL Data Warehouse Date Dimensions

Meagan Longoria shows how to create a date dimension in Azure SQL Data Warehouse:

Most data warehouses and data marts require a date dimension or calendar table. Those of us who have been building data warehouses in SQL Server for a while have collected our favorite scripts to build out a date dimension. For a standard date dimension, I am a fan of Aaron Bertrand’s script posted on MSSQLTips.com. But the current version (as of Aug 8, 2016) of Azure SQL Data Warehouse doesn’t support computed columns, which are used in Aaron’s script.

Click through for the script.
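
The heart of the workaround is to store every attribute explicitly and calculate it during the load rather than in a computed column.  A stripped-down sketch of that idea (not Meagan’s script; the column list and distribution choice are purely illustrative) looks something like this:

CREATE TABLE dbo.DimDate
(
    DateKey       int        NOT NULL,  -- e.g. 20160808, computed during the load
    FullDate      date       NOT NULL,
    CalendarYear  smallint   NOT NULL,  -- YEAR(FullDate), computed during the load
    CalendarMonth tinyint    NOT NULL,  -- MONTH(FullDate)
    MonthName     varchar(9) NOT NULL,  -- DATENAME(month, FullDate)
    DayOfMonth    tinyint    NOT NULL,  -- DAY(FullDate)
    DayOfWeekName varchar(9) NOT NULL   -- DATENAME(weekday, FullDate)
)
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX);
-- Populate with INSERT...SELECT (or CTAS) over a date range, evaluating the
-- expressions noted above per row instead of defining them as computed columns.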


Feed The CPUs

SQL Sasquatch is starting a new series on optimizing disk writes to maximize CPU throughput:

When I work with SQL Server batch-controlled workflows, I use the theory “feed the CPUs”.  That’s the simplest positive adaptation I could come up with of Kevin Closson’s paradigm “Everything is a CPU problem” 🙂

What I mean by “Feed the CPUs” is that memory and disk response times are primary factors determining the maximum rate for the CPUs to process the data.  Nuts & bolts of such a model for SQL Server are slightly different than a similar model for Oracle.  SQL Server access to persistent data is always through database cache, while Oracle uses shared access to database cache in SGA and private access to persistent data through direct access in PGA.

Click through for more details.
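
If you want to keep an eye on the disk response times he’s describing, one quick (if coarse) check is cumulative latency per database file.  This is a standard DMV query, not something from his series:

SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.physical_name,
       vfs.num_of_reads,
       1.0 * vfs.io_stall_read_ms / NULLIF(vfs.num_of_reads, 0)   AS avg_read_ms,
       vfs.num_of_writes,
       1.0 * vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
    ON vfs.database_id = mf.database_id
    AND vfs.file_id = mf.file_id
ORDER BY avg_write_ms DESC;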


Amazon Machine Learning

Ujjwal Ratan uses patient readmission data to demonstrate Amazon Machine Learning:

The Amazon ML endpoint created earlier can be invoked using an API call. This is very handy for building an application for end users who can interact with the ML model in real time.

Create a similar application and host it as a static website on Amazon S3. This feature of S3 allows you to host websites without any web servers and takes away the complexities of scaling hardware based on traffic routed to your application. The following is a screenshot from the application:

I think that Azure ML is still ahead of Amazon’s ML solution, but I’m happy to see the competition.


Explaining YARN Container Memory Allocations

Skumar T explains container sizes in YARN:

So, jobs on a YARN cluster run in individual containers, which are allocated by the Node Manager, which in turn gets permissions from the Resource Manager.

A few Node Manager configuration parameters are important in the context of jobs running in those containers:

  • yarn.nodemanager.resource.memory-mb (value: 8192): Amount of physical memory, in MB, that can be allocated for containers.
  • yarn.nodemanager.pmem-check-enabled (value: true): Whether physical memory limits will be enforced for containers.

The bottom half of the article goes into an extended example.


While Loops

Lukas Eder discovers that Oracle’s PL/SQL has WHILE loops:

In SQL, everything is a table (see SQL trick #1 in this article), just like in relational algebra, everything is a set.

Now, PL/SQL is a useful procedural language that “builds around” the SQL language in the Oracle database. Some of the main reasons to do things in PL/SQL (rather than e.g. in Java) are:

  • Performance (the most important reason), e.g. when doing ETL or reporting
  • Logic needs to be “hidden” in the database (e.g. for security reasons)
  • Logic needs to be reused among different systems that all access the database

Much like Java’s foreach loop, PL/SQL has the ability to define implicit cursors (as opposed to explicit ones).

The WHILE loop is a little more helpful in the SQL Server world for doing things like deleting lots of rows in small batches, but I agree with Lukas’s sentiment:  if you start writing a WHILE loop, it’s best to sit back and think about whether this is the best decision.
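
For reference, the batched-delete pattern I have in mind looks something like this (the table, column, and retention period are hypothetical):

DECLARE @BatchSize int = 5000;

WHILE 1 = 1
BEGIN
    -- Delete a small batch so each transaction stays short and the log stays manageable.
    DELETE TOP (@BatchSize)
    FROM dbo.AuditLog
    WHERE LogDate < DATEADD(YEAR, -2, GETDATE());

    -- Once a batch comes back short, there is nothing left to delete.
    IF @@ROWCOUNT < @BatchSize
        BREAK;
END;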


Checkpoint Behavior Changes

Mike Ruthruff discusses changes in checkpoint behavior in SQL Server 2016:

The following are the primary changes which will impact behavior of checkpoint in SQL Server 2016.

  1. Indirect checkpoint is the default behavior for new databases created in SQL Server 2016. Databases which were upgraded in place or restored from a previous version of SQL Server will use the previous automatic checkpoint behavior unless explicitly altered to use indirect checkpoint.

  2. When performing a checkpoint, SQL Server considers the response time of the I/Os and adjusts the amount of outstanding I/O when response times exceed a certain threshold. In versions prior to SQL Server 2016 this threshold was 20ms. In SQL Server 2016 the threshold is now 50ms. This means that SQL Server 2016 will wait longer before backing off the amount of outstanding I/O it is issuing.

  3. The SQL Server engine will consolidate modified pages into a single physical transfer if the data pages are contiguous at the physical level. In prior versions, the max size for a transfer was 256KB. Starting with SQL Server 2016 the max size of a physical transfer has been increased to 1MB, potentially making the physical transfers more efficient. Keep in mind these are based on continuity of the pages and hence workload-dependent.

Definitely read the whole thing.
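
Point 1 is easy to check after an upgrade and, if you want, to change.  A quick sketch (the database name is a placeholder):

-- A value of 0 means the database still uses automatic checkpoints;
-- new SQL Server 2016 databases default to 60 (indirect checkpoint).
SELECT name, target_recovery_time_in_seconds
FROM sys.databases;

-- Switch an upgraded database to indirect checkpoint, matching the 2016 default.
ALTER DATABASE [YourDatabase]
SET TARGET_RECOVERY_TIME = 60 SECONDS;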


Lambda Architecture Primer

James Serra explains the Lambda architecture:

A brief explanation of each layer:

Data Consumption: This is where you will import the data from all the various source systems, some of which may be streaming the data.  Others may only provide data once a day.

Stream Layer: It provides for incremental updating, making it the more complex layer.  It trades accuracy for low latency, looking at only recent data.  Data in here may be only seconds behind, but the trade-off is the data may not be clean.

Batch Layer: It looks at all the data at once and eventually corrects the data in the stream layer.  It is the single version of the truth, the trusted layer, where there is usually lots of ETL and a traditional data warehouse.  This layer is built using a predefined schedule, usually once or twice a day, including importing the data currently stored in the stream layer.

Presentation Layer: Think of it as the mediator, as it accepts queries and decides when to use the batch layer and when to use the stream layer.  Its preference would be the batch layer as that has the trusted data, but if you ask it for up-to-the-second data, it will pull from the stream layer.  So it’s a balance of retrieving what we trust versus what we want right now.

I hate the fact that this is named “lambda.”  That’s a term which is way too overloaded in computer science.  You have the architecture, lambda functions, and AWS Lambda, all of which are utterly different and yet end up in the same conversation.  This ends up confusing people unless you very specifically say things like “We’re going to use the AWS Lambda service to create lambda functions to feed data from sensors into our lambda architecture.”  And even then people still get confused.
