Press "Enter" to skip to content

April 6, 2021

Sending Large Messages in Kafka

The Hadoop in Real World team shows how you can send large messages in Apache Kafka:

By default, the messages you can send and manage in Kafka should be less than 1 MB. To increase this limit, there are a few properties you need to change in both brokers and consumers.

Let’s say your messages can be up to 10 MB, so your Kafka producers are producing messages up to 10 MB. Your Kafka brokers and consumers should then be able to store and receive messages up to 10 MB, respectively.

Kafka Producer sends messages up to 10 MB ==> Kafka Broker allows, stores and manages messages up to 10 MB ==> Kafka Consumer receives messages up to 10 MB
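The knobs involved are standard Kafka configuration properties; as a sketch, a 10 MB cap (10485760 bytes) might look like the following, though click through for the post's full walkthrough:

# Broker (server.properties): accept and replicate messages up to 10 MB
message.max.bytes=10485760
replica.fetch.max.bytes=10485760

# Producer: allow individual requests up to 10 MB
max.request.size=10485760

# Consumer: raise the per-partition fetch size (defaults to 1 MB)
max.partition.fetch.bytes=10485760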

Click through to see how, but also recognize that it’s usually a really bad idea to push large messages in a broker system. Even 1MB is probably going too far—I’d try to stay under 1KB if possible.


sparklyr 1.6 Released

Carly Driggers announces a new release of sparklyr:

Sparklyr, an LF AI & Data Foundation Incubation Project, has released version 1.6! Sparklyr is an R Language package that lets you analyze data in Apache Spark, the well-known engine for big data processing, while using familiar tools in R. The R Language is widely used by data scientists and statisticians around the world and is known for its advanced features in statistical computing and graphics. 

Click through to see the changes.
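If you haven’t tried sparklyr before, the basic workflow is only a few lines of R. A minimal local-mode sketch (mine, not from the release notes):

library(sparklyr)
library(dplyr)

# Connect to a local Spark instance (assumes Spark is installed, e.g. via spark_install())
sc <- spark_connect(master = "local")

# Copy an R data frame into Spark and work with it using familiar dplyr verbs
mtcars_tbl <- copy_to(sc, mtcars)

mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg)) %>%
  collect()

spark_disconnect(sc)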


Synapse Studio in 5 Minutes

Kevin Chant wants 4 minutes and 58 seconds of your time:

In this post I want to do a five minute crash course about Synapse Studio, because I have recently been asked to do this by colleagues.

In addition, I want to clear up some confusion about what you need to do before you can access Synapse Studio.

The aim of this post is for you to have a better overview of Synapse Studio within five minutes, which happens to be the estimated reading time of this post.

Click through and be sure to start the stopwatch.


Embracing the XML

Grant Fritchey has some advice:

While XML is, without a doubt, a giant pain in the bottom, sometimes, the best way to deal with Extended Events is to simply embrace the XML.

Now, I know, just last week, I suggested ways to avoid the XML. I will freely admit, that is my default position. If I can avoid the XML, I will certainly do it. However, there are times where just embracing the XML works out nicely. Let’s talk about it a little.

Just need to do a little victory dance here. I didn’t explicitly say “embrace the XML” but close enough…

I think the biggest problem DBAs have with XML is that they end up treating it like a dreadful task: I need to shred XML for an extended event. But to do that, I have to learn how to query it using this quasi-language, and so they get stuck trying to fuss with something somebody else did, moving symbols around in the hopes that they get the right incantation. By contrast, a day or two really focusing on how XQuery and XPath work would clarify a lot and make the process much simpler.

There is a fair counter-point in asking how often you’ll use this, and if the answer is “probably never,” then poke through and just try to get it working. But I’ve got a bit of bad news: “probably never” is probably wrong.
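If you do put in that day or two, the core shredding pattern is a good starting point. A sketch against the built-in system_health session’s ring buffer target (event and action names will vary by session, and not every event carries a sql_text action):

SELECT
    xed.event_node.value('(@name)[1]', 'nvarchar(128)') AS event_name,
    xed.event_node.value('(@timestamp)[1]', 'datetime2') AS event_time,
    xed.event_node.value('(action[@name="sql_text"]/value)[1]', 'nvarchar(max)') AS sql_text
FROM
(
    -- Pull the target XML for the session out of the XE DMVs
    SELECT CAST(st.target_data AS xml) AS target_data
    FROM sys.dm_xe_sessions AS s
        INNER JOIN sys.dm_xe_session_targets AS st
            ON s.address = st.event_session_address
    WHERE s.name = N'system_health'
        AND st.target_name = N'ring_buffer'
) AS src
CROSS APPLY src.target_data.nodes('/RingBufferTarget/event') AS xed(event_node);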


Deploying an Azure Arc Enabled Data Services Controller

Chris Adkin continues a series:

If you have been following this series, you should have:

– a basic understanding of Terraform
– a Kubernetes cluster that you can connect to using kubectl
– a basic understanding of Kubernetes services
– a working MetalLB load balancer
– a basic understanding of how storage works in the world of Kubernetes
– a Kubernetes storage solution in the form of PX Store; alternatively, you can use any solution (for the purposes of this series) which supports persistent volumes, though to use the backup solution in part 9 of the series, you will need something that supports CSI
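Before moving on to the controller itself, a couple of quick kubectl sanity checks against those prerequisites can save time (assuming MetalLB sits in its default metallb-system namespace):

# Is there a storage class (ideally CSI-backed) to provision persistent volumes?
kubectl get storageclass

# Are the MetalLB pods up?
kubectl get pods -n metallb-system

# Can kubectl reach the cluster at all?
kubectl cluster-info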

From here, Chris explains the importance of the data controller and then deploys one.


Columnstore, Strings, and Windowing Functions

Erik Darling has a tale to tell:

The only columns that we were really selecting from the Comments table were UserId and CreationDate, which are an integer and a datetime.

Those are relatively easy columns to deal with, both from the perspective of reading and sorting.

In order to show you how column selection can muck things up, we need to create a more appropriate column store index, add columns to the select list, and use a where clause to restrict the number of rows we’re sorting. Otherwise, we’ll get a 16GB memory grant for every query.

Read on to see how one little (or, well, big) string column can foul up the whole works.
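To make the shape of the problem concrete, here is a rough sketch of the kind of setup Erik describes, using the Stack Overflow Comments table (the Text column stands in for whichever wide string column you add to the select list):

CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_comments
    ON dbo.Comments (UserId, CreationDate, Text);

SELECT
    c.UserId,
    c.CreationDate,
    c.Text, -- the wide string column that inflates the sort
    ROW_NUMBER() OVER (PARTITION BY c.UserId ORDER BY c.CreationDate) AS n
FROM dbo.Comments AS c
WHERE c.CreationDate >= '20131201'; -- restrict rows to keep the memory grant in check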


Indexing for Physical Join Operators

Deepthi Goguri continues a series on physical join operators:

In Part 1 of decoding the physical join operators, we learned about the different types of physical operators: nested loops, merge joins, and hash joins. We have seen when they are useful and how to take advantage of each for the performance of our queries, as well as when they need to be avoided.

In this part, we will learn more about these operators and how indexes really help these operators perform better, so that queries can execute faster.

Read on to see how to define indexes for each of the three physical operators.
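As a tiny illustration of the idea (hypothetical tables, not from the post): an index keyed on the join column gives nested loops a seek target on the inner side, and because it is sorted on CustomerID, it can also feed a merge join without an explicit sort operator:

CREATE INDEX ix_Orders_CustomerID
    ON dbo.Orders (CustomerID)
    INCLUDE (OrderDate);

SELECT
    c.CustomerID,
    o.OrderDate
FROM dbo.Customers AS c
    INNER JOIN dbo.Orders AS o
        ON o.CustomerID = c.CustomerID;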
