
Monitoring Streaming Queries in PySpark

Hyukjin Kwon, et al., lay out some monitoring advice:

Streaming is one of the most important data processing techniques for ingestion and analysis. It provides users and developers with low latency and real-time data processing capabilities for analytics and triggering actions. However, monitoring streaming data workloads is challenging because the data is continuously processed as it arrives. Because of this always-on nature of stream processing, it is harder to troubleshoot problems during development and production without real-time metrics, alerting and dashboarding.

Read on to see how you can use the Observable API for alerting in PySpark; previously, it had been a Scala-only API.
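To give a rough idea of what alerting with the Observable API can look like, here is a minimal sketch rather than the authors' actual code. It assumes PySpark 3.4 or later (3.5 for `onQueryIdle`), where `DataFrame.observe` and `StreamingQueryListener` are available from Python. The metric name `quality`, the boolean column `is_error`, the 5% threshold, and the helper names are all hypothetical, chosen only for illustration.

```python
def error_ratio_exceeded(total, errors, threshold=0.05):
    """Alert rule: True when the share of bad rows is above the threshold."""
    if total <= 0:
        return False  # an empty micro-batch never triggers an alert
    return errors / total > threshold


def observe_quality(df, metric_name="quality", error_col="is_error"):
    """Attach named metrics to a DataFrame (hypothetical column names)."""
    # Imported lazily so the alert rule above can be used without Spark.
    from pyspark.sql import functions as F

    return df.observe(
        metric_name,
        F.count(F.lit(1)).alias("total"),
        # count() skips nulls, so this counts only rows where error_col is true
        F.count(F.when(F.col(error_col), 1)).alias("errors"),
    )


def make_alert_listener(metric_name="quality", threshold=0.05, notify=print):
    """Build a listener that calls notify() when a micro-batch looks bad."""
    from pyspark.sql.streaming import StreamingQueryListener

    class AlertListener(StreamingQueryListener):
        def onQueryStarted(self, event):
            pass

        def onQueryProgress(self, event):
            # observedMetrics maps each metric name to a Row of aggregates
            row = event.progress.observedMetrics.get(metric_name)
            if row is not None and error_ratio_exceeded(
                row.total, row.errors, threshold
            ):
                notify(f"ALERT: {row.errors}/{row.total} bad rows in last batch")

        def onQueryIdle(self, event):
            pass

        def onQueryTerminated(self, event):
            pass

    return AlertListener()
```

With a running session, the listener would be registered through `spark.streams.addListener(make_alert_listener())` before starting a query on `observe_quality(stream_df)`. Keeping the threshold check in a plain function means the alert rule can be tested without a Spark cluster.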