
Category: Hadoop

TF-IDF using Spark .NET

Ed Elliott shows how you can use the Spark .NET library to perform machine learning in Apache Spark:

Native Spark has two APIs for creating your ML applications. The historical one is Spark.MLLib and the newer API is Spark.ML. Much like the DataFrame API superseded the old RDD API, Spark.ML supersedes Spark.MLLib.

At the end of last year, .NET for Apache Spark had no support for either the Spark.ML or Spark.MLLib APIs, so I started implementing Spark.ML. Just as .NET for Apache Spark supports the DataFrame API and not the RDD API, I started with Spark.ML, and I believe that having the full Spark.ML API will be enough for anyone.

It’s awesome that Ed is helping to move Spark .NET forward in this way.
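
For a sense of the API surface in question, here is a minimal sketch of a TF-IDF pipeline against the Spark.ML feature transformers that the .NET bindings mirror, shown in PySpark; the documents and column names are purely illustrative.

```python
# A minimal TF-IDF pipeline using Spark.ML feature transformers (PySpark shown
# for reference); the documents and column names are purely illustrative.
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, HashingTF, IDF

spark = SparkSession.builder.appName("tfidf-sketch").getOrCreate()

docs = spark.createDataFrame(
    [(0, "spark ml supersedes mllib"), (1, "dotnet for apache spark")],
    ["id", "text"])

words = Tokenizer(inputCol="text", outputCol="words").transform(docs)
tf = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=1024).transform(words)
idf_model = IDF(inputCol="rawFeatures", outputCol="features").fit(tf)

idf_model.transform(tf).select("id", "features").show(truncate=False)
```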


FlinkSQL in Cloudera Streaming Analytics

Marton Balassi announces support for FlinkSQL in Cloudera Streaming Analytics:

Our 1.2.0.0 release of Cloudera Streaming Analytics Powered by Apache Flink brings a wide range of new functionality, including support for lineage and metadata tracking via Apache Atlas, support for connecting to Apache Kudu and the first iteration of the much-awaited FlinkSQL API.

Flink’s SQL interface democratizes stream processing, as it caters to a much larger community than the currently widely used Java and Scala APIs, which focus on the Data Engineering crowd. Generalizing SQL to stream processing and streaming analytics use cases poses a set of challenges: we have to tackle expressing infinite streams and the timeliness of records.

All is happening as Feasel’s Law foretold.
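
As a rough sketch of what streaming SQL looks like in practice (assuming a recent PyFlink release rather than Cloudera's distribution specifically; the table, connector, and columns are made up), a tumbling-window aggregate over an unbounded stream might read:

```python
# A tumbling-window count over an unbounded stream, expressed in SQL.
# Assumes a recent PyFlink release; the table, connector, and columns are made up.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Hypothetical source: the datagen connector produces an infinite stream, and
# the watermark declaration addresses the "timeliness of records" challenge.
t_env.execute_sql("""
    CREATE TABLE clicks (
        user_id STRING,
        ts      TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'datagen',
        'rows-per-second' = '5'
    )
""")

# Windowing is how SQL copes with an infinite stream: each one-minute
# tumbling window produces a finite aggregate.
result = t_env.sql_query("""
    SELECT user_id,
           TUMBLE_START(ts, INTERVAL '1' MINUTE) AS window_start,
           COUNT(*) AS clicks
    FROM clicks
    GROUP BY user_id, TUMBLE(ts, INTERVAL '1' MINUTE)
""")
result.execute().print()
```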


Comparing Koalas to PySpark

Tori Tompkins gives us an understanding of where Koalas fits in the Spark world:

One significant difference between Spark’s implementation of DataFrames and pandas is immutability.

With Spark DataFrames, you are unable to make changes to the existing object; instead, you create a brand new DataFrame based on the old one. Pandas DataFrames, however, allow you to edit the object in place. Koalas, whilst still Spark DataFrames under the hood, has kept the mutable syntax of pandas.

It does this by introducing the concept of an ‘Internal Frame’. This holds the immutable Spark DataFrame and manages the mapping between Koalas column names and Spark column names. It also maps Koalas index names to Spark column names to replicate the index functionality in pandas. It acts as a bridge between Spark and Koalas by mimicking the pandas API with Spark. The Internal Frame replicates the mutable functionality of pandas by creating copies of itself while appearing to be mutable.

Read the whole thing.
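
To make the contrast concrete, here is a minimal sketch (assuming the databricks.koalas package; the data is made up) of the same column assignment in pandas and in Koalas:

```python
# The same column assignment in pandas and in Koalas; assumes the
# databricks.koalas package is installed, and the data is made up.
import pandas as pd
import databricks.koalas as ks

pdf = pd.DataFrame({"x": [1, 2, 3]})
pdf["y"] = pdf["x"] * 2   # pandas: the existing object is edited in place

kdf = ks.DataFrame({"x": [1, 2, 3]})
kdf["y"] = kdf["x"] * 2   # Koalas: identical syntax, but under the hood a new
                          # immutable Spark DataFrame is wrapped in a new internal
                          # frame, so the object only appears to be mutated

print(kdf.to_pandas())
```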


Good Practices when Combining Spark with Cassandra

Valerie Parham-Thompson shares some insights for working with Spark and Cassandra together:

Although we are focusing on Cassandra as the data storage in this presentation, other storage sources and destinations are possible. Another frequently used data storage option is Hadoop HDFS. The previously mentioned spark-cassandra-connector has capabilities to write results to Cassandra, and in the case of batch loading, to read data directly from Cassandra.

Native data output formats available include both JSON and Parquet. The Parquet format in particular is useful for writing to AWS S3. See https://aws.amazon.com/about-aws/whats-new/2018/09/amazon-s3-announces-new-features-for-s3-select/ for more information on querying S3 files stored in Parquet format. A good use case for this is archiving data from Cassandra.

Read on for more advice.
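
As a quick illustration of the archiving pattern mentioned above, here is a PySpark sketch of reading a table through the spark-cassandra-connector and writing it to S3 as Parquet; the host, keyspace, table, and bucket names are hypothetical, and the connector (plus S3 filesystem support) must be on the cluster's classpath.

```python
# A sketch of the read-from-Cassandra, archive-to-S3 pattern. The host,
# keyspace, table, and bucket are hypothetical; the spark-cassandra-connector
# package and S3 filesystem support must be available to the cluster.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cassandra-archive")
         .config("spark.cassandra.connection.host", "cassandra-host")
         .getOrCreate())

# Read an entire table through the connector's DataFrame source.
orders = (spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(keyspace="sales", table="orders")
          .load())

# Archive to S3 in Parquet format, which can later be queried via S3 Select.
orders.write.mode("overwrite").parquet("s3a://my-archive-bucket/orders/")
```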


Cost-Cutting in Confluent Platform

Nick Bryan shares some techniques for reducing the cost of running on Confluent Platform:

To start, there are several Confluent Platform features that can greatly reduce your Kafka cluster’s infrastructure footprint. For use cases involving high data ingestion rates, lengthy data retention periods, or stringent disaster recovery requirements, Confluent Platform can help to reduce infrastructure costs by up to 50%.

One of the most important features for this cost category is Tiered Storage.

Read on for a few tips.


The Basics of Spark Streaming

Muskan Gupta gives us an introduction to Spark Streaming:

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. It was added to Apache Spark in 2013. We can get data from many sources, such as Kafka and Flume, and process it using functions such as map and reduce. After processing, we can push the data to filesystems, databases, and even live dashboards.

In Spark Streaming, we work on near-real-time data. It divides the received input stream into batches. The Spark engine processes the batches and generates the final output in batches.

Read on to understand the key mechanisms behind Spark Streaming.
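
For reference, here is a minimal sketch of that micro-batch model: the classic DStream word count over a socket source (the host and port are hypothetical).

```python
# A minimal DStream word count over a socket source; host and port are hypothetical.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-word-count")
ssc = StreamingContext(sc, batchDuration=5)   # slice the input stream into 5-second batches

lines = ssc.socketTextStream("localhost", 9999)

# Each transformation is applied batch by batch by the Spark engine.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

counts.pprint()         # output is likewise produced in batches

ssc.start()             # start receiving and processing data
ssc.awaitTermination()  # run until stopped
```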


Apache Spark Connector for SQL Server

The SQL Server team announces an open-sourced Apache Spark connector for SQL Server:

The Apache Spark Connector for SQL Server and Azure SQL is based on the Spark DataSourceV1 API and SQL Server Bulk API and uses the same interface as the built-in JDBC Spark-SQL connector. This allows you to easily integrate the connector and migrate your existing Spark jobs by simply updating the format parameter! 

This appears to be different from the old Spark connector to Azure SQL Database and SQL Server. Also, for anyone potentially confused between it and PolyBase, this is going in the opposite direction: the Spark connector lets you access a SQL Server from an Apache Spark cluster, reading SQL Server’s data and processing it across a number of executor nodes. By contrast, PolyBase lets you read data stored in Spark SQL tables from SQL Server, virtualizing it so that it looks like a regular SQL Server table.
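
Here is a hedged PySpark sketch of the "just update the format parameter" usage the announcement describes; the server, database, table, and credentials are hypothetical, and the connector jar must be available to the Spark cluster.

```python
# A sketch of using the Apache Spark Connector for SQL Server from PySpark.
# The server, database, table, and credentials are hypothetical, and the
# connector jar must be on the cluster's classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-server-read").getOrCreate()

orders = (spark.read
          .format("com.microsoft.sqlserver.jdbc.spark")   # previously: .format("jdbc")
          .option("url", "jdbc:sqlserver://myserver.database.windows.net;databaseName=sales")
          .option("dbtable", "dbo.Orders")
          .option("user", "spark_reader")
          .option("password", "<secret>")
          .load())

# From here the data is processed across the cluster's executors as usual.
orders.groupBy("CustomerID").count().show()
```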


Calculating Partitions for Processing Data Files in Apache Spark

Ajay Gupta digs into how to calculate the number of partitions the different Spark APIs use when reading from files:

Until recently, the process of picking a certain number of partitions for a set of data files always looked mysterious to me. However, during a recent optimization routine, I wanted to change the default number of partitions picked by Spark for processing a set of data files, and that is when I started to decode this process comprehensively, along with proofs. Hopefully, this description of the decoded process will also help readers understand Spark a bit more deeply and enable them to design efficient and optimized Spark routines.

This is important information if you’re tuning Spark cluster performance.
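
As a starting point, here is a small sketch of the configuration knobs that feed into that calculation and how to check the result Spark arrives at; the path is hypothetical, and the comment is a rough restatement of Spark's split-size logic rather than its exact source code.

```python
# A sketch of checking and influencing the partition count Spark picks when
# reading files. The path is hypothetical; the comment below approximates
# Spark's split-size calculation rather than quoting its source.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("partition-count")
         # Inputs to the split-size calculation for file-based sources:
         # maxSplitBytes ~= min(maxPartitionBytes,
         #                      max(openCostInBytes, totalBytes / default parallelism))
         .config("spark.sql.files.maxPartitionBytes", 128 * 1024 * 1024)  # 128 MB cap per partition
         .config("spark.sql.files.openCostInBytes", 4 * 1024 * 1024)      # per-file open cost padding
         .getOrCreate())

df = spark.read.parquet("s3a://my-bucket/events/")   # hypothetical data set
print(df.rdd.getNumPartitions())                     # the number Spark actually chose
```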
