Using XPath To Shred HTML

Shannon Lowder shows off the HTML Agility Pack project, which he uses to parse the contents of webpages:

Let’s say we wanted the table. We could use the XPath /html/body/table to retrieve it. We can also use XPath to refer to a collection. Let’s say we wanted all the rows. We would use the XPath /html/body/table/tr. We would get a collection of three rows. Notice the XPath looks a lot like a Linux or Windows folder path. That’s the idea of XPath!
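As a rough illustration of those two queries, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree rather than the HTML Agility Pack that Shannon uses. The sample page, its cell values, and the variable names are all made up for the example. ElementTree supports only a subset of XPath and evaluates paths relative to the element being searched, so `./body/table` (searched from the `html` root) stands in for `/html/body/table`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page: one table with three rows.
# It is well-formed, so an XML parser accepts it; real-world HTML often is not.
PAGE = """<html><body><table>
<tr><td>Alpha</td></tr>
<tr><td>Beta</td></tr>
<tr><td>Gamma</td></tr>
</table></body></html>"""

root = ET.fromstring(PAGE)  # root is the <html> element

# Equivalent of /html/body/table: a single node.
table = root.find("./body/table")

# Equivalent of /html/body/table/tr: a collection of nodes.
rows = root.findall("./body/table/tr")

# Pull the text of the first cell out of each row.
first_cells = [row.find("td").text for row in rows]

print(table.tag, len(rows), first_cells)
```

The folder-path analogy carries over directly: each `/` steps down one level, and a path that matches several siblings returns all of them.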

I would like to point out a couple of extra points.  First, XPath is case sensitive.  So if I had tried to use /html/body/table/TR, I would find no nodes.

Second, you can use “shorthand” in your XPath queries. //body/table/tr would get you to the same place as /html/body/table/tr did.
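Both points are easy to check for yourself. The sketch below again uses Python's xml.etree.ElementTree instead of the HTML Agility Pack, with a made-up three-row page. Because ElementTree paths are relative to the searched element, `./body/table/tr` plays the role of `/html/body/table/tr` and `.//body/table/tr` plays the role of `//body/table/tr`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page with three rows (contents are placeholders).
PAGE = (
    "<html><body><table>"
    "<tr><td>1</td></tr><tr><td>2</td></tr><tr><td>3</td></tr>"
    "</table></body></html>"
)
root = ET.fromstring(PAGE)

# Full path, matching the lowercase tag names in the document.
full_path = root.findall("./body/table/tr")

# Same path with TR in upper case: element names are case sensitive,
# so nothing matches.
wrong_case = root.findall("./body/table/TR")

# Shorthand: // searches descendants at any depth, so the leading
# html step can be skipped.
shorthand = root.findall(".//body/table/tr")

print(len(full_path), len(wrong_case), len(shorthand))
```

The shorthand form is convenient, but it searches the whole tree, so on a page with several body-like or nested table structures it can match more than the fully spelled-out path would.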

This intro is part of a series Shannon has started on scraping data from websites.

