Press "Enter" to skip to content

Curated SQL Posts

What’s New In Hadoop 3.0?

Shubham Sinha explains some of the changes coming to Hadoop:

Integrating EC with HDFS maintains the same fault tolerance with improved storage efficiency. As an example, a 3x-replicated file with 6 blocks will consume 6*3 = 18 blocks of disk space. But with an EC (6 data, 3 parity) deployment, it will only consume 9 blocks (6 data blocks + 3 parity blocks) of disk space. This keeps the storage overhead to no more than 50%.

Since erasure coding requires additional overhead to reconstruct data (it has to perform remote reads), it is generally used for storing less frequently accessed data. Before deploying erasure coding, users should consider all of its overheads: storage, network, and CPU.

To support erasure coding effectively, HDFS needed some changes in its architecture. Let us take a look at those architectural changes.

There are some nice features coming to Hadoop version 3.
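In Hadoop 3.0, erasure coding is applied per directory through the hdfs ec subcommand. A minimal sketch, assuming a 3.0 cluster with the Reed-Solomon 6+3 policy available and a /data/cold directory of infrequently accessed files (the path is a placeholder):

# List the erasure coding policies the cluster knows about
hdfs ec -listPolicies

# Apply the Reed-Solomon 6+3 policy (6 data + 3 parity blocks, so ~50% storage
# overhead versus 200% for 3x replication) to a directory of cold data
hdfs ec -setPolicy -path /data/cold -policy RS-6-3-1024k

# Confirm which policy is now in effect on the directory
hdfs ec -getPolicy -path /data/cold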


Optimizing Kafka

Yeva Byzek explains different tuning options available within Apache Kafka:

Without needing to make any changes to Kafka configuration parameters, you can set up a development Kafka environment and test basic functionality. Yet the fact that Kafka runs straight off the shelf does not mean you won’t want to do some tuning before you go into production. The reason to tune is that different use cases will have different sets of requirements that will drive different service goals. To optimize for those service goals, there are Kafka configuration parameters that you should change. In fact, the Kafka design itself provides configuration flexibility to users, and to make sure your Kafka deployment is optimized for your service goals, you absolutely should investigate tuning the settings of some configuration parameters and benchmarking in your own environment. Ideally, you should do that before you go to production, or at least before you scale out to a larger cluster size.

We have written a white paper to help you identify those service goals, configure your Kafka deployment to optimize for them, and ensure that you are achieving them through monitoring.

Read the whole thing, especially the part about throughput versus latency.
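To get a feel for the throughput-versus-latency trade-off, here is a hedged sketch using the kafka-producer-perf-test.sh tool that ships with Kafka: the first run uses near-default producer settings, the second leans toward throughput with larger batches, a short linger, and compression. The broker address and topic name are placeholders.

# Baseline run with near-default producer settings
kafka-producer-perf-test.sh --topic perf-test --num-records 1000000 \
  --record-size 1024 --throughput -1 \
  --producer-props bootstrap.servers=broker:9092 acks=1

# Throughput-leaning run: bigger batches, a short linger, and compression
# (these generally raise throughput at the cost of some per-record latency)
kafka-producer-perf-test.sh --topic perf-test --num-records 1000000 \
  --record-size 1024 --throughput -1 \
  --producer-props bootstrap.servers=broker:9092 acks=1 \
    batch.size=262144 linger.ms=20 compression.type=lz4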


Why SQL On Linux

David Klee explains the benefits of SQL Server on Linux:

First and foremost (IMHO), Microsoft wants to appeal to developers. They want their development stack to run anywhere there are developers. Notably, Microsoft just released Visual Studio 2017 for Mac on May 10th! Many developers out there run on non-Microsoft workstations, notably Apple computers. Apple’s OSX operating system is originally derived from the FreeBSD operating system. FreeBSD and other *BSD operating systems share much in common with Linux. So, if you can make SQL Server work on the Apple, you’ve probably made it work on Linux. Arguably, covering these two platforms nails just about every widely adopted development platform out there.

Microsoft also wants to appeal to a broader customer base, which means exploring the other environments that software runs on. An exceptionally high number of the world’s servers are powered by Linux. It’s lean, mean, stable, and powerful. Lots of shops refuse to run a Windows-based server for a number of reasons, including that their in-house IT staff only have Linux knowledge. These same shops are most likely pressured to run a SQL Server for various applications. I know a number of third-party vended applications that require a SQL Server, and previously, if an organization dictated no Windows-based servers, that meant the application would never be adopted in the organization, no matter how well it would function.

David provides a good explanation and sets up the context behind his upcoming SQL Server on Linux series.


So You Want A FAQ-Bot

Steph Locke shows how easy it can be to create a Q&A bot:

Now we need to go to Azure and finish our bot.

Add a new Bot Service. You’ll need to give it a name and set which region you want to host it in. It will then set up everything in the background, which takes a couple of minutes.

Once it is successfully deployed, navigate to your bot service and Create an App, making sure to copy and paste the values from the new tab into the interface. Select the Q&A Bot type. It should bring up a popup that allows you to select your bot from a dropdown.

Bots are fun and worth learning, but that technology is still in its infancy, regardless of whose bot platform you’re using.
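If you want to query the underlying knowledge base outside of the bot, QnA Maker exposes a simple REST endpoint. A hedged sketch with curl; the region, API version, knowledge base ID, and subscription key below are placeholders and may not match your deployment, so check the URL and keys shown in the QnA Maker portal.

# Ask the knowledge base a question directly (region, ID, and key are placeholders)
curl -X POST \
  "https://westus.api.cognitive.microsoft.com/qnamaker/v2.0/knowledgebases/<KB_ID>/generateAnswer" \
  -H "Ocp-Apim-Subscription-Key: <YOUR_KEY>" \
  -H "Content-Type: application/json" \
  -d '{"question": "How do I reset my password?"}'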


Triggers And Memory-Optimized Table Modifications

Jack Li troubleshoots a customer issue when trying to modify a memory-optimized table:

Recently I assisted on a customer issue where the customer wasn’t able to alter a memory-optimized table, receiving the following error:

Msg 41317, Level 16, State 3, Procedure ddl, Line 4 [Batch Start Line 35]
A user transaction that accesses memory optimized tables or natively compiled modules cannot access more than one user database or databases model and msdb, and it cannot write to master.

If you access a memory-optimized table, you can’t span databases or access model or msdb. The ALTER statement doesn’t involve any other database.

It turns out there was a DDL trigger defined on the instance that wrote data to msdb.  Click through for Jack’s repro script.  I’d be able to use memory-optimized tables a lot more frequently (to the chagrin of company DBAs, perhaps) if they supported cross-database operations.
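If you hit Msg 41317 on a statement that seemingly touches only one database, a quick way to look for the kind of culprit Jack found is to list DDL triggers. A small sketch using sqlcmd; the server and database names are placeholders:

# Server-scoped DDL triggers that could fire during the ALTER
sqlcmd -S yourserver -E -Q "SELECT name, is_disabled FROM sys.server_triggers;"

# Database-scoped DDL triggers in the affected database
sqlcmd -S yourserver -E -d YourDatabase -Q "SELECT name, is_disabled FROM sys.triggers WHERE parent_class_desc = 'DATABASE';"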


Using Azure Cloud Shell

Jeffrey Verheul shows off a bit of Azure Cloud Shell:

Connecting to a database
Now that your Cloud Shell is ready to go, you can start using Bash. This means you can also use sqlcmd from within Bash.

You can connect to a database with sqlcmd by using the following command:

sqlcmd -S servername.database.windows.net -U username -P password

Once the connection to your database has been made, you can run queries against it.

There’s no PowerShell support yet, but Bash is currently supported and PowerShell is in the works.
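Once connected, you can also run one-off queries straight from Bash with the -Q switch; a small sketch, with the server, credentials, and database as placeholders:

# Run a single query non-interactively against a specific database
sqlcmd -S servername.database.windows.net -U username -P password \
  -d YourDatabase \
  -Q "SELECT TOP (5) name, create_date FROM sys.objects ORDER BY create_date DESC;"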


Creating An Azure Database For MySQL

Arun Sirpal shows how to create a new MySQL database in Azure:

Here you have the concept of compute units. No such thing as DTUs here but just as confusing.

Compute Units are a measure of CPU processing throughput guaranteed to be available to a single Azure Database for MySQL server. A Compute Unit is a blended measure of CPU and memory resources. In general, 50 Compute Units equate to half a core, 100 Compute Units equate to one core, and 2000 Compute Units equate to twenty cores of guaranteed processing throughput available to your server. I am not going to rehash the official documentation on these concepts, so I recommend reading https://docs.microsoft.com/en-gb/azure/mysql/concepts-compute-unit-and-storage

Different database product, different metric, it seems.  Check out Arun’s post as he walks you through the process step by step.
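If you would rather script the creation than click through the portal, the Azure CLI has an az mysql server create command. A hedged sketch; the names, region, and credentials are placeholders, and the flag that sets compute units or pricing tier has changed between CLI versions, so check az mysql server create --help for your version.

# Create an Azure Database for MySQL server (all names and credentials are placeholders)
az mysql server create \
  --resource-group my-rg \
  --name my-mysql-server \
  --location westeurope \
  --admin-user myadmin \
  --admin-password '<a-strong-password>'
# The compute units / pricing tier are set with an additional sku or compute-units
# parameter whose exact name depends on your CLI version.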


Azure MySQL Backups

Grant Fritchey focuses on an area where Azure’s MySQL Platform as a Service offering really makes sense:

Why Is MySQL Platform as a Service Important?

I am going to answer this question. There are a lot of advantages to creating, using and developing against data storage within a PaaS offering. One of the biggest for me is backups. Microsoft is automatically taking backups of the MySQL databases you create within Azure. These are real, full backups. Microsoft validates the backups. As I write this, you’ll have the ability to restore your entire database, to any point in time, at intervals of five minutes, over the preceding 35 days. By programming against a MySQL database within Azure, you are gaining protection of the information you’re storing within your database, and you don’t have to do anything to benefit from this. It’s all part of the service.

Read the whole thing.
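The point-in-time restore Grant describes is also scriptable; a hedged sketch with the Azure CLI, where the server names, resource group, and timestamp are placeholders and flag names may vary by CLI version:

# Restore a new server from the automatic backups of an existing one,
# to a specific point in time within the retention window
az mysql server restore \
  --resource-group my-rg \
  --name my-mysql-server-restored \
  --source-server my-mysql-server \
  --restore-point-in-time "2017-06-15T13:10:00Z"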


Sub-Second Hive Analytics

Carter Shanklin and Slim Bouguerra have started a series on using Hive and Druid to obtain sub-second SQL queries over terabytes of data:

We’ll show how the Hive/Druid integration delivers ultra-fast SQL analytics that can be consumed from your favorite BI tool to get accelerated business results.  And we will show benchmark results of BI queries running in just milliseconds over a 1TB dataset.

What Is Druid?

Druid is a high-performance, column-oriented, distributed data store, which is well suited for user-facing analytic applications and real-time architectures. Druid is included as a technical preview in HDP 2.6 and you can read more about Druid on our project page, or at the project website.

This first post is mostly about Druid, which sounds like it might eventually become a very interesting technology for implementing Kimball-style warehouse models but for the whole “Joins?  We don’t need no steenkin’ joins” philosophy.  But when used as one engine component (as mentioned in the post), I can see it being quite useful.
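To give a flavor of what the integration looks like from the Hive side, here is a hedged sketch of creating a Druid-backed table from Beeline. The connection string, source table, and columns are placeholders, and the exact DDL may differ from what the series shows; the key pieces are the Druid storage handler and a timestamp column exposed to Druid as __time.

# Save the CTAS statement to a file and run it through Beeline
cat > create_druid_table.sql <<'SQL'
CREATE TABLE sales_druid
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ('druid.segment.granularity' = 'MONTH')
AS
SELECT CAST(order_date AS timestamp) AS `__time`,
       customer_region,
       SUM(revenue) AS revenue
FROM sales_fact
GROUP BY CAST(order_date AS timestamp), customer_region;
SQL

beeline -u jdbc:hive2://hiveserver:10000 -f create_druid_table.sql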


Keeping Up With Analytics

Jen Underwood discusses the need to stay relevant in analytics and shares some tips on how to do so:

Although most analytics applications today still leverage older data warehouse and OLAP technologies on-premises, the pace of the cloud shift is significantly increasing. Infrastructure is getting better and is almost invisible in mature markets. Cloud fears are subsiding as more organizations witness the triumphs of early adopters. Instant, easy cloud solutions continue to win the hearts and minds of non-technical users. Cloud also accelerates time to market allowing for innovation at faster speeds than ever before. As data and analytics professionals, be sure to make time to learn a variety of cloud and hybrid analytics tools.

Exploring novel technologies across various ecosystems in the cloud world is usually as simple as spinning up a cloud image or service to get started. There are literally zillions of free and low cost resources for learning. As you dive into a new world of data, you will find common analytics architectures, design patterns, and types of technologies (hybrid connectivity, storage, compute, microservices, IoT, streaming, orchestration, database, big data, visualization, artificial intelligence, etc.) being used to solve problems.

It’s worth reading the whole thing.
