
Author: Kevin Feasel

How Kafka Does Exactly-Once

Neha Narkhede explains exactly-once messaging and describes how Kafka implements its exactly-once guarantees:

In a distributed system, the computers that make up the system can always fail independently of one another. In the case of Kafka, an individual broker can crash, or a network failure can happen while the producer is sending a message to a topic. Depending on the action the producer takes to handle such a failure, you can get different semantics:

  • At least once semantics: if the producer receives an acknowledgement (ack) from the Kafka broker and acks=all, it means that the message has been written exactly once to the Kafka topic. However, if a producer ack times out or receives an error, it might retry sending the message assuming that the message was not written to the Kafka topic. If the broker had failed right before it sent the ack but after the message was successfully written to the Kafka topic, this retry leads to the message being written twice and hence delivered more than once to the end consumer. And everybody loves a cheerful giver, but this approach can lead to duplicated work and incorrect results.

  • At most once semantics: if the producer does not retry when an ack times out or returns an error, then the message might end up not being written to the Kafka topic, and hence not delivered to the consumer. In most cases it will be, but in order to avoid the possibility of duplication, we accept that sometimes messages will not get through.

  • Exactly once semantics: even if a producer retries sending a message, it leads to the message being delivered exactly once to the end consumer. Exactly-once semantics is the most desirable guarantee, but also a poorly understood one. This is because it requires a cooperation between the messaging system itself and the application producing and consuming the messages. For instance, if after consuming a message successfully you rewind your Kafka consumer to a previous offset, you will receive all the messages from that offset to the latest one, all over again. This shows why the messaging system and the client application must cooperate to make exactly-once semantics happen.

Read on for a discussion of the technical details.  I also appreciate that Neha linked to a 60+ page design document for those wanting to dig in further.
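Under the covers, this relies on idempotent producers plus transactions. Here is a rough sketch of what turning it on looks like from a client, using the confluent-kafka Python library; the broker address, topic, key, and transactional ID are placeholders, and the transactional calls assume a reasonably recent client version:

# Sketch of exactly-once producer settings (confluent-kafka Python client).
# Broker, topic, and transactional ID below are hypothetical.
from confluent_kafka import Producer

producer = Producer({
    'bootstrap.servers': 'broker1:9092',
    'acks': 'all',                              # wait for all in-sync replicas
    'enable.idempotence': True,                 # broker de-duplicates producer retries
    'transactional.id': 'orders-producer-1',    # enables atomic writes across partitions
})

producer.init_transactions()
producer.begin_transaction()
try:
    producer.produce('orders', key='42', value='charged $10')
    producer.commit_transaction()               # becomes visible to read_committed consumers
except Exception:
    producer.abort_transaction()                # aborted messages are never delivered downstream

Consumers complete the picture by setting isolation.level to read_committed, so they never see aborted writes.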


Connecting Hadoop To LDAP

Giovanni Lanzani walks through some of the difficulties of getting LDAP working with Hadoop:

This section could probably have had far fewer workarounds if I’d known more about LDAP.

But I’m a data scientist at heart and I want to get things done.

If you’ve ever dealt with Hadoop, you know that there are a bunch of non-interactive users, i.e. users who are not supposed to log in, such as hdfs, spark, hadoop, etc. These users are important to have, but so are the groups with the same names. For example, when using airflow to launch a spark job, the log folders will be created under the airflow user, in the spark group.

LDAP, however, doesn’t (to my knowledge) allow overlapping users and groups the way Unix does.

The way I worked around this limitation was to create the spark_user (or hdfs_user, or …) in LDAP.
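In LDAP terms, that workaround boils down to a group named spark plus a separate account named spark_user. A minimal sketch with python-ldap, where the server, base DNs, and uid/gid numbers are all invented:

# Hypothetical sketch: a "spark" group and a separate "spark_user" account,
# since (per the post) overlapping user/group names were not workable here.
import ldap
import ldap.modlist as modlist

conn = ldap.initialize("ldap://ldap.example.com")
conn.simple_bind_s("cn=admin,dc=example,dc=com", "admin-password")

# The spark group that Hadoop services expect to exist.
group = {
    "objectClass": [b"posixGroup"],
    "cn": [b"spark"],
    "gidNumber": [b"10010"],
}
conn.add_s("cn=spark,ou=groups,dc=example,dc=com", modlist.addModlist(group))

# The non-interactive account, named spark_user to avoid the collision.
user = {
    "objectClass": [b"account", b"posixAccount"],
    "uid": [b"spark_user"],
    "cn": [b"spark_user"],
    "uidNumber": [b"10010"],
    "gidNumber": [b"10010"],                  # primary group: spark
    "homeDirectory": [b"/home/spark_user"],
    "loginShell": [b"/sbin/nologin"],         # not meant for interactive logins
}
conn.add_s("uid=spark_user,ou=people,dc=example,dc=com", modlist.addModlist(user))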

Also, Giovanni apparently lives in an interesting neighborhood.


Neural Nets With R And Power BI

Leila Etaati continues her series on using neural nets in Power BI:

We are going to predict concrete strength using a neural network. A neural network can be used to predict a value or a class, or to predict multiple items. In this example, we are going to predict a value: concrete strength.

I have loaded the data into Power BI first, and in the Query Editor I am going to write some R code. First we need to do some data transformations. As you can see in the picture below (numbers 2, 3, and 4), the data is not all on the same scale, so we need to normalize it before applying any machine learning. I am going to write code for that (I already explained normalization in the KNN post). To write the R code, I just click on the R transformation component (number 5).
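The normalization step Leila describes is min-max scaling before training. Her demo does it in R inside the Query Editor; as a rough stand-in, here is the same idea in Python/pandas (the column names and values are made up, not the real concrete dataset):

# Min-max normalization sketch; columns and values are placeholders.
import pandas as pd

df = pd.DataFrame({
    "cement": [540.0, 332.5, 198.6],
    "water": [162.0, 228.0, 192.0],
    "strength": [79.99, 40.27, 44.30],
})

features = ["cement", "water"]    # scale the inputs, leave the target alone
df[features] = (df[features] - df[features].min()) / (df[features].max() - df[features].min())
print(df)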

There’s a lot going on in this demo; check it out.


Loan Chargeoff Templates

Ajay Jagannathan announces a couple new Cortana Intelligence Solutions Gallery templates:

For more information, read this blog: End to End Loan ChargeOff Prediction Built Using Azure HDInsight Spark Clusters and SQL Server 2016 R Service

We have published two solution templates, deployable using two technology stacks, for the above chargeoff scenario:

  1. Loan Chargeoff Prediction using SQL Server 2016 R Services – Using DSVM with SQL Server 2016 and Microsoft ML, this solution template walks through how to create and clean up a set of simulated data, train five different models, select the best-performing one, score with that model, and save the prediction results back to SQL Server. A Power BI report connects to the prediction table and shows interactive chargeoff prediction reports to the user.

  2. Loan Chargeoff Prediction using HDInsight Spark Clusters – This solution demonstrates how to develop machine learning models for predicting loan chargeoff (including data processing, feature engineering, and training and evaluating models), deploy the models as a web service (on the edge node), and consume the web service remotely with Microsoft R Server on Azure HDInsight Spark clusters. The final predictions are saved to a Hive table, which can be visualized in Power BI.

These tend to be nice because they show you how the different pieces of the Azure stack tie together.


Wait Stats In Query Store

Andrejs Antjufejevs has great news if you’re using Query Store:

Starting today in Azure SQL Database, and from CTP 2.0 of SQL Server 2017, per-query wait stats are available in Query Store. Now you can identify exactly why, and for how long, every plan waited on a resource. Information about wait times is persisted, so you can also analyze the history to see what the problems were and why queries waited on resources.

This is a welcome improvement for query tuners on 2017.
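If you want to poke at the new data yourself, the wait information lands in the Query Store catalog view sys.query_store_wait_stats. Here is a quick sketch of pulling the heaviest wait categories per plan; the connection string and TOP threshold are placeholders:

# Sketch: heaviest wait categories per plan from Query Store's wait-stats view.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;DATABASE=MyDb;Trusted_Connection=yes;"
)
sql = """
SELECT TOP (10)
    ws.plan_id,
    ws.wait_category_desc,
    SUM(ws.total_query_wait_time_ms) AS total_wait_ms
FROM sys.query_store_wait_stats AS ws
GROUP BY ws.plan_id, ws.wait_category_desc
ORDER BY total_wait_ms DESC;
"""
for row in conn.cursor().execute(sql):
    print(row.plan_id, row.wait_category_desc, row.total_wait_ms)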


Automatic Temporal Table Data Purging

Bert Wagner has warmed the cockles of my heart with this news:

The problem with temporal tables is that they produce a lot of data. Every row-level change stored in the temporal table’s history table quickly adds up, increasing the possibility that a low-disk space warning is going to be sent to the DBA on-call.

In the future with SQL Server 2017 CTP3, Microsoft allows us to add a retention period to our temporal tables, making purging old data in a temporal table as easy as specifying a single option on the table.
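Bert's post shows the exact code; the short version is a HISTORY_RETENTION_PERIOD option on the temporal table. A rough sketch of setting it on an existing table (the table name, retention window, and connection details are all invented), issued here through pyodbc:

# Hypothetical sketch: set a history retention period on an existing temporal table.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;DATABASE=MyDb;Trusted_Connection=yes;",
    autocommit=True,
)
conn.cursor().execute("""
    ALTER TABLE dbo.Orders
    SET (SYSTEM_VERSIONING = ON (HISTORY_RETENTION_PERIOD = 3 MONTHS));
""")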

I’m in a situation where this will be very useful.


Test The DBATools Beta

Chrissy LeMaire wants you to test the beta of dbatools:

Before the official release of bagofbobbish to master and the PowerShell Gallery, we need help finding bugs. Then, we’ll need some time to resolve those bugs. Hopefully this can be done before community members show off dbatools at a few key SQLSaturdays around the world this Saturday, July 8th.

We would really appreciate it if you would download the beta from GitHub and (in a test environment) see if you can find anything that doesn’t work as expected.

If you find any bugs, please file a report on GitHub. You can also reach out to us in the Slack channel.

Currently, there aren’t any webpages for the commands listed in this post, but all commands have help, so when you need help, simply type Get-Help commandName -Examples or Get-Help commandName -Full.

Get testing.  There are a lot of new commands, so if you haven’t checked out dbatools in a while, give it a go.  Also, congrats to Rob Sewell for his newly minted MVP status.


Option Explicit In Biml VB

Ben Weissman now has nothing to stop him from writing bad VB code in Biml:

Previously, you had to declare any kind of variable and object type (instead of just using something like “var” in C#):

<#@ template Language="VB" #>
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
    <Packages>
        <# for n as integer = 1 to 25 #>
        <Package Name="MyAutomatedBiml<#= n #>"/>
        <# next #>
    </Packages>
</Biml>

That was true even for the simplest cases, like for n as integer = 1 to 25 instead of just for n = 1 to 25, even though it is clear that this can never be anything else in this context.

Now, you can use two new attributes in the template definition: optionexplicit and/or optionstrict.
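Assuming optionexplicit behaves like VB's Option Explicit Off (the attribute spelling below is my guess based on the post's description), the template above slims down to:

<#@ template Language="VB" optionexplicit="False" #>
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
    <Packages>
        <# for n = 1 to 25 #>
        <Package Name="MyAutomatedBiml<#= n #>"/>
        <# next #>
    </Packages>
</Biml>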

Read the whole thing if you want to write VB code in Biml.  If you want to write C# code in Biml, keep doing your thing.  If you want to write F# in Biml, the pitchfork mob is organizing over here.


Agorics

One of my interests about a decade ago was agorics, the study of computational markets.  Mark S. Miller and K. Eric Drexler pushed this idea in the late 1990s and collected a fair portion of the work on the topic on Drexler’s website.  A sample from the section on computation and economic order:

Trusting objects with decisions regarding resource tradeoffs will make sense only if they are led toward decisions that serve the general interest; there is no moral argument for ensuring the freedom, dignity, and autonomy of simple programs. Properly-functioning price mechanisms can provide the needed incentives.

The cost of consuming a resource is an opportunity cost: the cost of giving up alternative uses. In a market full of entities attempting to produce products that will sell for more than the cost of the needed inputs, economic theory indicates that prices typically reflect these costs.

Consider a producer, such as an object that produces services. The price of an input shows how greatly it is wanted by the rest of the system; high input prices (costs) will discourage low-value uses. The price of an output likewise shows how greatly it is wanted by the rest of the system; high output prices will encourage production. To increase (rather than destroy) value as ‘judged by the rest of the system as a whole’, a producer need only ensure that the price of its product exceeds the prices (costs) of the inputs consumed. This simple, local decision rule gains its power from the ability of market prices to summarize global information about relative values.

I still think it’s an interesting concept, and the rise of cloud computing has, to an extent, fulfilled this idea:  AWS spot pricing is one of the best examples I know of, where resource spot prices will change depending upon load.


HASSP

Drew Furgiuele wants to put SQL Server into space.  I’ve linked to the entire series thus far, which has been fun to follow.  Here’s an excerpt from his latest post:

That’s from the "Hardware and Software Requirements for Installing SQL Server" product page.  I’ve had people ask if I was using a Raspberry Pi, or some other Micro ATX PC. The answer is neither; the problem with a Pi is that it’s not a 64-bit processor. Pis use ARM architecture, and SQL Server doesn’t (yet) support ARM. Also, most Pi 3s run at 1.2 GHz and only support 1 GB of RAM. As for MicroATX form factor PCs, they’re closer to what we’d need, but they’re still heavy. Plus, you’d need a pretty substantial power supply that we couldn’t (safely) support that high up in the sky. Even if you stripped it down to bare components, it’d be pushing it.

There are companies that make small SoC solutions, but after evaluating them we determined that they were either pretty flaky or got so hot they risked bursting into flames even just booting into Windows. Instead, we found a really unique piece of hardware: the Intel Joule.
