Erasure Coding In Hadoop

Guy Shilo explains erasure coding, a new feature in Hadoop 3:

The benefits are, of course, space-saving, and for large files also improved performance (blocks striped across datanodes can be read in parallel, and fewer blocks are written because there is no x3 replication). The larger the file, the more notable the performance gain.

Erasure coding is disabled by default and you can enable it for only certain directories in HDFS. Some articles like this one suggest that best practice is to enable erasure coding only for “cold” data that you do not write often, and for “hot” data use regular replication. However, in my tests I did not witness any problem dealing with hot data (maybe it’s evident at larger scales).
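
To put a rough number on the space saving mentioned above, here is a minimal Python sketch comparing raw storage under 3x replication with a Reed-Solomon layout. The 6 data + 3 parity layout is an assumption chosen for illustration (it matches the RS-6-3 policy name commonly used with Hadoop 3), and the function names are hypothetical, not part of any Hadoop API.

```python
def replication_overhead(replicas: int) -> float:
    """Raw bytes stored per byte of user data under n-way replication."""
    return float(replicas)


def erasure_coding_overhead(data_units: int, parity_units: int) -> float:
    """Raw bytes stored per byte of user data under a striped RS layout.

    Each stripe holds `data_units` data cells plus `parity_units` parity cells.
    """
    return (data_units + parity_units) / data_units


def raw_size_gb(logical_gb: float, overhead: float) -> float:
    """Raw cluster capacity consumed by `logical_gb` of user data."""
    return logical_gb * overhead


if __name__ == "__main__":
    logical = 100.0  # GB of user data, an arbitrary example figure
    # Assumed layouts: 3 replicas versus 6 data + 3 parity units.
    print("replication x3:", raw_size_gb(logical, replication_overhead(3)), "GB")
    print("RS(6,3):       ", raw_size_gb(logical, erasure_coding_overhead(6, 3)), "GB")
```

Under these assumptions the erasure-coded layout uses half the raw capacity of triple replication, while still surviving the loss of any three units in a stripe.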

Click through for the full story on how it works.

Converting CSV To ORC

Mark Litwintschik investigates whether Spark is faster at converting CSV files to ORC format than Hive or Presto:

Spark, Hive and Presto are all very different code bases. Spark is made up of 500K lines of Scala, 110K lines of Java and 40K lines of Python. Presto is made up of 600K lines of Java. Hive is made up of over one million lines of Java and 100K lines of C++ code. Any libraries they share are outweighed by the unique approaches they’ve taken in the architecture surrounding their SQL parsers, query planners, optimizers, code generators and execution engines when it comes to tabular form conversion.

I recently benchmarked Spark 2.4.0 and Presto 0.214 and found that Spark out-performed Presto when it comes to ORC-based queries. In this post I’m going to examine the ORC writing performance of these two engines plus Hive and see which can convert CSV files into ORC files the fastest.

The results surprised me.

Blinking Lifx Lights Without IFTTT

Kevin Feasel

2019-01-15

Python

Allison Tharp has a project to blink a set of Lifx lights in a team’s color when they score:

The first step is to generate an API token via the Lifx API here (https://cloud.lifx.com/settings). Keep this token safe and don’t let others see it!

In my functions file, I created 3 new functions for controlling the lights: invoke-setLight, invoke-Pulse, and invoke-Breathe. To understand what the API was expecting, I followed the Lifx API documentation here. As far as API documentation goes, this one is pretty good. Most functions have an interactive portion at the bottom which allows you to test it out yourself and also see what inputs the API expects.
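
For readers who would rather stay in Python, the three functions described above might be sketched as follows. This is not the author's code: the function names are hypothetical Python counterparts, and the endpoint paths and parameter names (`state`, `effects/pulse`, `effects/breathe`, `color`, `cycles`, `period`) are my reading of the Lifx HTTP API, so check them against the Lifx documentation before relying on them. The sketch only builds the requests; nothing is sent.

```python
import json
import urllib.request

LIFX_API = "https://api.lifx.com/v1"  # assumed base URL of the Lifx HTTP API


def build_lifx_request(token, method, path, payload):
    """Build (but do not send) an authenticated JSON request."""
    body = json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(f"{LIFX_API}{path}", data=body, method=method)
    req.add_header("Authorization", f"Bearer {token}")
    req.add_header("Content-Type", "application/json")
    return req


def set_light(token, color, selector="all"):
    """Turn the selected lights on and set them to a solid color."""
    payload = {"power": "on", "color": color}
    return build_lifx_request(token, "PUT", f"/lights/{selector}/state", payload)


def pulse(token, color, cycles=3, period=1.0, selector="all"):
    """Flash the selected lights between their current color and `color`."""
    payload = {"color": color, "cycles": cycles, "period": period}
    return build_lifx_request(token, "POST", f"/lights/{selector}/effects/pulse", payload)


def breathe(token, color, cycles=3, period=1.0, selector="all"):
    """Fade the selected lights to `color` and back."""
    payload = {"color": color, "cycles": cycles, "period": period}
    return build_lifx_request(token, "POST", f"/lights/{selector}/effects/breathe", payload)
```

Passing a built request to `urllib.request.urlopen` would actually send it. Load the token from an environment variable or a secrets store rather than hard-coding it.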

As a Bills fan, at least I wouldn’t have to worry about the lights wearing out from overuse.

LISTAGG In Snowflake DB

Koen Verbeeck continues investigating Snowflake capabilities:

Since SQL Server 2017, you have the STRING_AGG function, which has almost the exact same syntax as its Snowflake counterpart. There are two minor differences:
– Snowflake has an optional DISTINCT
– SQL Server has a default ascending sorting. If you want another sorting, you can specify one in the WITHIN GROUP clause. In Snowflake, there is no guaranteed sorting unless you specify it (again in the WITHIN GROUP clause).
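
To make the two differences concrete, here is a small Python emulation of the aggregate's behavior. It is only an illustration of the semantics described above, not how either database implements the function; `listagg` and its parameters are hypothetical names.

```python
def listagg(values, delimiter="", distinct=False, order_key=None, descending=False):
    """Concatenate values the way a LISTAGG-style aggregate would.

    distinct  -- drop duplicate values (the optional DISTINCT)
    order_key -- sort key standing in for a WITHIN GROUP (ORDER BY ...) clause;
                 when omitted, no particular order is promised (this sketch
                 simply keeps input order)
    """
    # Aggregates skip NULLs; None plays that role here.
    items = [str(v) for v in values if v is not None]
    if distinct:
        items = list(dict.fromkeys(items))  # keeps the first occurrence of each value
    if order_key is not None:
        items = sorted(items, key=order_key, reverse=descending)
    return delimiter.join(items)


if __name__ == "__main__":
    colors = ["red", "blue", None, "red", "green"]
    print(listagg(colors, ", "))
    print(listagg(colors, ", ", distinct=True, order_key=lambda s: s))
```

The second call corresponds to combining DISTINCT with an explicit WITHIN GROUP ordering, which is the only way to get a deterministic result in this sketch.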

It looks like LISTAGG is the ANSI standard name, though SQL Server followed Postgres’s lead in calling their function STRING_AGG.

When Synchronous AG Secondaries Are Out Of Sync

David Fowler explains that an Availability Group being set up as synchronous doesn’t mean you can never experience data loss on failover:

The primary replica is constantly monitoring the state of its secondaries. With the use of a continuous ping, the primary node always knows if the secondaries are up or down.

It’s when SQL detects that one of its synchronous replicas has gone offline that interesting things can happen.

So here’s the discussion that came up: if a synchronous replica goes offline for whatever reason, SQL won’t be able to commit any transactions and that means we can be confident that the secondary is up to date, right?

Read on to learn the answer. Which is “no.” But David explains why, so you should read that instead of just having me say it.

Straight Talk On Trace Flags

Pam Lahoud explains the purpose of trace flags and talks about a very important trace flag, 4199:

Some trace flags are used to enable enhanced debugging features such as additional logging, memory dumps etc. and are used only when you are working with Microsoft Support to provide additional data for troubleshooting. These trace flags are not ones you want to leave turned on in a production system as they may have a negative impact on your workload. An example of one of these flags would be TF 2551 which is used to trigger a filtered memory dump whenever there is an exception or assertion in the SQL Server process. These trace flags are only used for a short period of time and typically only at the recommendation of Microsoft Support, so they will likely always be around.

If you are a DBA and are not extremely familiar with trace flags, you really want to read this article.

DBAs Aren’t Going Away, DevOps + Automation Edition

Grant Fritchey argues that the DBA role is here to stay:

One of the reasons I love DevOps so much is because I’ve done it successfully. I’ve worked on teams that built fully automated deployment mechanisms to get code from Dev to Production. Further, we automated the creation of dev & test servers. We automated the creation of production servers too. We automated the heck out of everything.
And then they fired me…
Kidding.
When we started building our DevOps processes, I was supporting two development teams. As we got better at automating our work, I was supporting three teams. By the time we had fully automated all the various processes, I was supporting between five and seven teams at different levels.

To support Grant’s point, I’ve had a draft in my personal blog entitled “The Cloud is not Stealing Our Jobs” from May of 2017 that I never got around to finishing. Back in 2017, that was what was going to kill the DBA role.

The role has certainly changed over the years. I suppose if your definition of a DBA is someone who lays out indexes starting on certain drive sectors to take advantage of rotation speed on that single 5400 RPM spinning disk drive AND NOTHING ELSE, then your job might not be there. But that describes exactly zero people I have ever known in the industry.

Integrating Azure Data Studio With GitHub

Eduardo Pivaral shows how to use Azure Data Studio to push to a Git repository on GitHub:

There are a lot of source control applications and software, and each has its pros and cons, but personally, I like to use GitHub, since it is free to use and since it was recently acquired by Microsoft, support for other products is easier (SQL Server in this case).

In this post, I will show you how to implement source control for a database using GitHub and Azure Data Studio (ADS).

Click through for the step-by-step instructions.
