Using XPath To Shred HTML

Shannon Lowder shows off the HTML Agility Pack project, which he uses to parse the contents of webpages:

Let’s say we wanted the table. We could use the XPath /html/body/table to retrieve it. We can also use XPath to refer to a collection. Let’s say we wanted all the rows. We would use the XPath /html/body/table/tr. We would get a collection of three rows. Notice the XPath looks a lot like a Linux or Windows folder path. That’s the idea of XPath!

I would like to point out a couple of extra points. First, XPath is case sensitive. So if I had tried to use /html/body/table/TR, I would find no nodes.

Second, you can use “shorthand” in your XPath queries. //body/table/tr would get you to the same place /html/body/table/tr did.
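
The three points above can be tried out in a few lines. What follows is a minimal sketch in Python rather than the HTML Agility Pack code from Shannon's post, and it makes some assumptions: the sample page is made up (one table, three rows, written as well-formed markup), and it uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath and evaluates paths relative to the root element. That means ./body/table/tr here stands in for /html/body/table/tr, and .//body/table/tr stands in for the //body/table/tr shorthand.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page, not taken from the original post.
# It is well-formed so that the XML parser accepts it.
PAGE = """<html>
  <body>
    <table>
      <tr><td>one</td></tr>
      <tr><td>two</td></tr>
      <tr><td>three</td></tr>
    </table>
  </body>
</html>"""

root = ET.fromstring(PAGE)  # root is the <html> element

# Equivalent of /html/body/table: a single node.
table = root.find("./body/table")

# Equivalent of /html/body/table/tr: a collection of rows.
rows = root.findall("./body/table/tr")

# Element names are case sensitive, so TR matches nothing.
wrong_case = root.findall("./body/table/TR")

# Equivalent of the //body/table/tr shorthand: search at any depth.
shorthand = root.findall(".//body/table/tr")

print(table.tag)           # table
print(len(rows))           # 3
print(len(wrong_case))     # 0
print(shorthand == rows)   # True
```

Real-world HTML is rarely well-formed enough for an XML parser, which is a big part of why a forgiving parser such as HTML Agility Pack is useful for scraping.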

This intro is part of a series Shannon has started on scraping data from websites.
