Data Science And Data Engineering In HDP 3.0

Saumitra Buragohain, et al, show off some of the things added to the Hortonworks Data Platform for data scientists and data engineers:

We leverage the power of HDP 3.0, from efficient storage (erasure coding) and GPU pooling to containerized TensorFlow and Zeppelin, to enable this use case. We will save the details for a different blog (please see the video). To summarize: as we trained the car on a track, we collected about 30K images with corresponding steering angle data. The training data was stored in an HDP 3.0 cluster, the TensorFlow model was trained using 6 GPU cards, and then the model was deployed back on the car. The deep learning use case highlights the combined power of HDP 3.0.

Click through for more additions and demos.

Literate Programming And Notebooks

David Smith sums up a debate on notebooks versus literate programming:

There’s no video yet available of Joel’s talk, but you can guess the theme of that opening slide, and walking through the slides conveys the message well, I think. Yihui Xie, author and creator of the rmarkdown package, provides a detailed summary and response to Joel’s talk, where he lists Joel’s main critiques of Notebooks:

  1. Hidden state and out-of-order execution

  2. Notebooks are difficult for beginners

  3. Notebooks encourage bad habits

  4. Notebooks discourage modularity and testing

  5. Jupyter’s autocomplete, linting, and way of looking up the help are awkward

  6. Notebooks encourage bad processes

  7. Notebooks hinder reproducible + extensible science

  8. Notebooks make it hard to copy and paste into Slack/Github issues

  9. Errors will always halt execution

  10. Notebooks make it easy to teach poorly

  11. Notebooks make it hard to teach well

Read the whole thing.  I agree with some of these points, but disagree with a few on the list.

Scheduling Jupyter Notebooks

Matthew Seal, et al, explain how they schedule runs of Jupyter notebooks:

On the surface, notebooks pose a lot of challenges: they’re frequently changed, their cell outputs need not match the code, they’re difficult to test, and there’s no easy way to dynamically configure their execution. Furthermore, you need a notebook server to run them, which creates architectural dependencies to facilitate execution. These issues caused some initial push-back internally at the idea. But that has changed as we’ve brought in new tools to our notebook ecosystem.

The biggest game-changer for us is Papermill. Papermill is an nteract library built for configurable and reliable execution of notebooks with production ecosystems in mind. What Papermill does is rather simple. It takes a notebook path and some parameter inputs, then executes the requested notebook with the rendered input. As each cell executes, it saves the resulting artifact to an isolated output notebook.
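To make the "parameter inputs" and "rendered input" idea concrete, here is a minimal standard-library sketch of notebook parameterization. It is not Papermill's code: the function name inject_parameters, the toy notebook, and the parameter names region and days are all hypothetical. It assumes the convention of tagging one cell "parameters" and adding an overriding cell right after it, and it only renders the notebook; it does not execute anything.

```python
import copy
import json


def inject_parameters(notebook: dict, parameters: dict) -> dict:
    """Return a copy of `notebook` with one extra code cell that overrides defaults.

    The new cell is placed right after the first cell tagged "parameters",
    or at the top of the notebook when no such cell exists.
    """
    result = copy.deepcopy(notebook)  # leave the template notebook untouched
    source = "\n".join(f"{name} = {value!r}" for name, value in parameters.items())
    injected = {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {"tags": ["injected-parameters"]},
        "outputs": [],
        "source": source,
    }
    position = 0
    for index, cell in enumerate(result["cells"]):
        if "parameters" in cell.get("metadata", {}).get("tags", []):
            position = index + 1
            break
    result["cells"].insert(position, injected)
    return result


# Hypothetical two-cell notebook, written as the JSON structure .ipynb files use.
template = {
    "cells": [
        {"cell_type": "code", "execution_count": None,
         "metadata": {"tags": ["parameters"]}, "outputs": [],
         "source": "region = 'us'\ndays = 7"},
        {"cell_type": "code", "execution_count": None,
         "metadata": {}, "outputs": [],
         "source": "print(region, days)"},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 4,
}

rendered = inject_parameters(template, {"region": "eu", "days": 30})
print(json.dumps([cell["source"] for cell in rendered["cells"]], indent=2))
```

Because the injected cell runs after the defaults, its assignments win when the rendered notebook is executed top to bottom. With Papermill installed, the equivalent step plus execution is a call along the lines of papermill.execute_notebook(input_path, output_path, parameters={...}).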

Papermill does look quite interesting.

Using Notebooks At Netflix

Michelle Ufford, et al, explain why and how they use Jupyter Notebooks at Netflix:

Notebooks were first introduced at Netflix to support data science workflows. As their adoption grew among data scientists, we saw an opportunity to scale our tooling efforts. We realized we could leverage the versatility and architecture of Jupyter notebooks and extend it for general data access. In Q3 2017 we began this work in earnest, elevating notebooks from a niche tool to a first-class citizen of the data platform.

From our users’ perspective, notebooks offer a convenient interface for iteratively running code, exploring output, and visualizing data, all from a single cloud-based development environment. We also maintain a Python library that consolidates access to platform APIs. This means users have programmatic access to virtually the entire platform from within a notebook. Because of this combination of versatility, power, and ease of use, we’ve seen rapid organic adoption for all user types across the entire Data Platform.

Today, notebooks are the most popular tool for working with data at Netflix.

Good article.  I love notebooks for two reasons:  pedagogical purposes (it’s easier to show a demo in a notebook) and forcing you to work linearly.

Writing Better Jupyter Notebook Code

Henk Griffioen shows how to write Python code in your IDE of choice and then synchronize a Jupyter Notebook with the results:

How can you get the interactivity back and get our changes immediately in our Notebook? Add %autoreload at the top of your Notebook:

%load_ext autoreload  # Load the extension
%autoreload 2  # Autoreload all modules

%autoreload is a Jupyter extension that reloads modules before executing your code. Functions and classes loaded in notebooks get their functionality updated every time you execute a cell. This means that when new code is saved in the editor, the changes are immediately loaded in your Notebook if you run a cell.

Using %autoreload bridges the gap between Notebook and IDE. You gain all the benefits of an IDE, but you’re still as flexible as before! See the GIF at the top as an example.
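For readers who want to see what "reloading a module" means outside a notebook, here is a rough standard-library sketch. It is an illustration of the underlying idea, not the extension's implementation: the module name helpers_demo and its greet function are hypothetical, and the reload is triggered by hand with importlib.reload, whereas %autoreload does the equivalent automatically before each cell runs.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as workdir:
    # Pretend this file is the module you are editing in your IDE.
    module_path = Path(workdir) / "helpers_demo.py"
    module_path.write_text("def greet():\n    return 'hello'\n")

    sys.path.insert(0, workdir)
    importlib.invalidate_caches()
    import helpers_demo  # first import picks up the original source

    first = helpers_demo.greet()

    # "Save" an edited version. The new source has a different length, so any
    # cached bytecode is treated as stale and the file is recompiled.
    module_path.write_text("def greet():\n    return 'hello, edited'\n")
    importlib.reload(helpers_demo)  # the manual equivalent of an autoreload

    second = helpers_demo.greet()
    sys.path.remove(workdir)

print(first)
print(second)
```

Note that a reload rebinds the names inside the module object; references created earlier with from helpers_demo import greet would still point at the old function, which is one of the cases the extension tries to patch up for you.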

That’s a useful trick.  I’ve tended to use notebooks more for post-hoc work, where I’ve already structured my code and want to formalize it for others to use.

Analyzing Clickstream Data With Spark

Tony Cruz and Denny Lee analyze advertising data in Spark and predict click counts given certain input features:

Let’s look at a concrete example with the Click-Through Rate Prediction dataset of ad impressions and clicks from the data science website Kaggle.  The goal of this workflow is to create a machine learning model that, given a new ad impression, predicts whether or not there will be a click.

To build our advanced analytics workflow, let’s focus on the three main steps:

  • ETL

  • Data Exploration, for example, using SQL

  • Advanced Analytics / Machine Learning
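The three steps above can be sketched at toy scale without a cluster. This is not the Spark workflow from the post: it swaps in Python's built-in sqlite3 for Spark SQL and a naive per-device click-rate threshold for a real machine learning model. The rows, the device_type column, and the predict_click helper are all invented for illustration.

```python
import sqlite3

# Made-up ad impressions as raw text: (device_type, clicked).
raw_rows = [
    ("mobile", "1"), ("mobile", "0"), ("mobile", "1"),
    ("desktop", "0"), ("desktop", "0"), ("desktop", "1"), ("desktop", "0"),
    ("tablet", ""),  # malformed record, dropped during ETL
]

# Step 1, ETL: keep only well-formed rows, cast types, load into a table.
clean_rows = [(device, int(flag)) for device, flag in raw_rows if flag in ("0", "1")]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE impressions (device_type TEXT, clicked INTEGER)")
conn.executemany("INSERT INTO impressions VALUES (?, ?)", clean_rows)

# Step 2, exploration with SQL: click-through rate per device type.
ctr_by_device = dict(
    conn.execute(
        "SELECT device_type, AVG(clicked) FROM impressions GROUP BY device_type"
    )
)
conn.close()


# Step 3, a stand-in for the model: predict a click when the historical
# click-through rate for the device meets a threshold.
def predict_click(device_type: str, threshold: float = 0.5) -> bool:
    return ctr_by_device.get(device_type, 0.0) >= threshold


for device in ("mobile", "desktop", "tablet"):
    print(device, predict_click(device))
```

The shape is the point here: clean and load, summarize with SQL, then fit something on the features the exploration surfaced. In the original post each stage runs on Spark DataFrames, and the last stage is an actual trained model rather than a lookup.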

The Databricks blog has a couple other examples, but this was the most interesting one for me.

Sharing R Notebooks

Hanyu Cui and Hossein Falaki show how to share a notebook using RMarkdown:

RMarkdown is the dynamic document format RStudio uses. It is normal Markdown plus embedded R (or any other language) code that can be executed to produce outputs, including tables and charts, within the document. Hence, after changing your R code, you can just rerun all code in the RMarkdown file rather than redo the whole run-copy-paste cycle. And an RMarkdown file can be directly exported into multiple formats, including HTML, PDF, and Word.

Click through for the demo.

Introducing Azure Notebooks

Zach Stagers has an introductory post to Azure Notebooks:

No installation, no maintenance

As with any PaaS solution, Azure Notebooks makes it far quicker and easier to get up and running, as there’s no download or installation required. Microsoft handles all the maintenance for you too!

I’m working on a fairly big project using Azure Notebooks.  It’s very helpful getting 1GB of space, so I can include all of my data, images, etc. from a fairly large number of notebooks.  The big downside is that the server running these notebooks is pretty slow—even for a fairly simple ARIMA model, I had it sitting there for 10 minutes at 100% CPU.  So don’t expect to run a heavy workload against it.

Jupyter Notebooks In Azure

Steve Jones looks at using Jupyter Notebooks in Azure:

There’s a new feature in Azure, and I stumbled on it when someone posted a link on Twitter. Apologies, I can’t remember who, but I did click on the Azure Notebooks link and was intrigued. I’ve gotten Jupyter notebooks running on my local laptop, but these are often just on one machine. Having a place to share a notebook in the cloud is cool.

Once I clicked on the link, I found these are both R and Python notebooks, as well as F#. These allow you to essentially build a page of code and share it. It’s kind of like a REPL, kind of like a story. It’s a neat way of working through a problem. I clicked the Get Started link to get going and was prompted for a User ID.

I’m a major fan of using notebooks for validating results as well as training people.

Deploying Jupyter Notebooks

Teja Srivastasa has an example of deploying a Jupyter notebook for production use on AWS:

No one can deny how large the online support community for data science is. Today, it’s possible to teach yourself Python and other programming languages in a matter of weeks. And if you’re ever in doubt, there’s a StackOverflow thread or something similar waiting to give you the perfect piece of code to help you.

But when it came to pushing it to production, we found very little documentation online. Most data scientists seem to work on Python notebooks in a silo. They process large volumes of data and analyze it — but within the confines of Jupyter Notebooks. And most of the resources we’ve found while growing as data scientists revolve around Jupyter Notebooks.

Another option might be to use JupyterHub.
