Although the data involved is not large in volume, the types of data processing, data analytics, and machine-learning techniques used in this area are common to many Apache Hadoop use cases. So, fantasy sports analytics provides a good (and fun) use case for exploring the Hadoop ecosystem.
Apache Spark is a natural fit in this environment. As a data processing platform with embedded SQL and machine-learning capabilities, Spark gives programmatic access to data while still providing an easy SQL access point and simple APIs to churn through the data. Users can write code in Python, Java, or Scala, and then use Apache Hive, Apache Impala (incubating), or even Cloudera Search (Apache Solr) for exploratory analysis.
Baseball was my introduction to statistics, and I think that fantasy sports is a great way of driving interest in stats and machine learning. I’m looking forward to the other two parts of this series.