Using XPath To Shred HTML

Shannon Lowder shows off the HTML Agility Path project to help him parse the contents of webpages:

Let’s say we wanted the table.  We could use the XPath /html/body/table to retrieve it. We can also use XPath to refer to a collection.  Let’s say we wanted all the rows. We would use the XPath /html/body/table/tr. We would get a collection of three rows.  Notice the XPath looks a lot like a Linux or windows folder path.  That’s the idea of XPath!

I would like to point out a couple of extra points.  First, XPath is case sensitive.  So if I had tried to use /html/body/table/TR, I would find no nodes.

Second, you can use “short hand” in your XPath queries.  //body/table/tr would get you to the same place /html/body/table/tr did.

This intro is part of a series Shannon has started on scraping data from websites.

Related Posts

Error Handling In Scala

Manish Mishra gives a few examples of how to handle errors in Scala: Try[T] is another construct to capture the success or a failure scenarios. It returns a value in both cases. Put any expression in Try and it will return Success[T] if the expression is successfully evaluated and will return Failure[T] in the other case […]

Read More

Azure Functions Basics

Vincent-Philippe Lauzon explains the basics of Azure Functions: In general, serverless refers to an economical model where we pay for compute resources used as opposed to “servers”. Wait…  isn’t that what the Cloud is about? Well, yes, on a macro-scale it is, but serverless brings it to a micro-scale. In the cloud we can provision […]

Read More

Categories

October 2017
MTWTFSS
« Sep Nov »
 1
2345678
9101112131415
16171819202122
23242526272829
3031