Datasets In Spark

Ayush Hooda explains the differences between DataFrames and Datasets in Apache Spark:

The Datasets API provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine. You can define Dataset objects and then manipulate them using functional transformations (map, flatMap, filter, and so on) similar to an RDD. The benefits are that, unlike RDDs, these transformations are now applied on a structured and strongly typed distributed collection that allows Spark to leverage Spark SQL’s execution engine for optimization.

Read on for more details and a few examples of how to operate DataFrames and Datasets.

Related Posts

Database-First or Kafka-First for Event Streaming

Gwen Shapiro takes us through a scenario where database-first writes for event streaming makes the most sense: Note that the DB does quite a lot for you: it enforces serializability, locks, your logical constraints, etc. If the DB is distributed (Vitesse, Cockroach, Spanner, Yugabyte), it does even more. If you were to go Kafka-first… well, […]

Read More

Handling Errors in Kafka Connect

Robin Moffatt shows us some techniques for handling errors in your Kafka topics: We’ve seen how setting errors.tolerance = all will enable Kafka Connect to just ignore bad messages. When it does, by default it won’t log the fact that messages are being dropped. If you do set errors.tolerance = all, make sure you’ve carefully thought through […]

Read More

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Categories

March 2019
MTWTFSS
« Feb  
 123
45678910
11121314151617
18192021222324
25262728293031