Using XPath To Shred HTML

Shannon Lowder shows off the HTML Agility Path project to help him parse the contents of webpages:

Let’s say we wanted the table.  We could use the XPath /html/body/table to retrieve it. We can also use XPath to refer to a collection.  Let’s say we wanted all the rows. We would use the XPath /html/body/table/tr. We would get a collection of three rows.  Notice the XPath looks a lot like a Linux or windows folder path.  That’s the idea of XPath!

I would like to point out a couple of extra points.  First, XPath is case sensitive.  So if I had tried to use /html/body/table/TR, I would find no nodes.

Second, you can use “short hand” in your XPath queries.  //body/table/tr would get you to the same place /html/body/table/tr did.

This intro is part of a series Shannon has started on scraping data from websites.

Related Posts

The Decorator Pattern

Nancy Jain explains the Decorator pattern: Decorator design pattern is a structural design pattern. Structural design patterns focus on Class and Object composition and decorator design pattern is about adding responsibilities to objects dynamically. Decorator design pattern gives some additional responsibility to our base class. This pattern is about creating a decorator class that can […]

Read More

Why .NET And Java Have StringBuilders

Randolph West walks us through a performance troubleshooting issue with a twist: So we branch the the code in source control, and start writing a helper class to manage the data for us closer to the application. We throw in a SqlDataAdapter, use the Fill() method to bring back all the rows from the query in one go, and then […]

Read More

Categories

October 2017
MTWTFSS
« Sep Nov »
 1
2345678
9101112131415
16171819202122
23242526272829
3031