The Hadoop in Real World team has a one-liner for us:
A quick answer that might come to your mind is to call the count() function on the dataframe and check if the count is greater than 0. count() on a dataframe with a lot of records is super inefficient.
count() will do a global count of records in the dataframe from all partitions and then add all the intermediate counts together to get the final count. You will find this approach very slow for big dataframes.
Click through for a much faster one-liner.Comments closed