Using XPath To Shred HTML

Shannon Lowder shows off the HTML Agility Pack project, which he uses to parse the contents of web pages:

Let’s say we wanted the table. We could use the XPath /html/body/table to retrieve it. We can also use XPath to refer to a collection. Let’s say we wanted all the rows. We would use the XPath /html/body/table/tr, and we would get a collection of three rows. Notice the XPath looks a lot like a Linux or Windows folder path. That’s the idea of XPath!
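The single-node and collection lookups described above can be sketched in a few lines. This is only an illustration, not Shannon's code: his series uses the HTML Agility Pack from .NET, whereas this sketch uses Python's standard-library xml.etree.ElementTree, which supports a subset of XPath and evaluates paths relative to the root element. For that reason /html/body/table/tr is written here as ./body/table/tr. The three-row page is a made-up stand-in for the sample in the original article, and it has to be well-formed XML for this parser to accept it.

```python
import xml.etree.ElementTree as ET

# Hypothetical three-row page (assumption: the article's actual sample
# HTML is not reproduced in this excerpt).
PAGE = """<html>
  <body>
    <table>
      <tr><td>Row 1</td></tr>
      <tr><td>Row 2</td></tr>
      <tr><td>Row 3</td></tr>
    </table>
  </body>
</html>"""

root = ET.fromstring(PAGE)  # root is the <html> element

# Counterpart of /html/body/table: find() returns a single node.
table = root.find("./body/table")

# Counterpart of /html/body/table/tr: findall() returns a collection.
rows = root.findall("./body/table/tr")

print(table.tag)                         # table
print(len(rows))                         # 3
print([r.find("td").text for r in rows])
```

Each step in the path names a child of the previous step, which is what gives XPath its folder-path feel.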

I would like to point out a couple of extra things. First, XPath is case sensitive. So if I had tried to use /html/body/table/TR, I would have found no nodes.

Second, you can use “shorthand” in your XPath queries. //body/table/tr would get you to the same place /html/body/table/tr did, because a leading // matches nodes at any depth in the document.
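Both points can be checked with a short experiment. As before, this is a hedged sketch using Python's standard-library xml.etree.ElementTree rather than the HTML Agility Pack. The page is a hypothetical three-row table, and the paths are written relative to the root element (./body/... and .//body/...) because that library does not accept absolute paths on an element.

```python
import xml.etree.ElementTree as ET

# Hypothetical page with three rows (not the article's actual sample).
PAGE = (
    "<html><body><table>"
    "<tr><td>A</td></tr><tr><td>B</td></tr><tr><td>C</td></tr>"
    "</table></body></html>"
)
root = ET.fromstring(PAGE)

# Full path, the counterpart of /html/body/table/tr.
full_path = root.findall("./body/table/tr")

# Same path with TR in upper case: element names are matched
# case-sensitively, so nothing is found.
wrong_case = root.findall("./body/table/TR")

# Shorthand, the counterpart of //body/table/tr: the double slash
# searches for <body> at any depth below the starting node.
shorthand = root.findall(".//body/table/tr")

print(len(full_path))          # 3
print(len(wrong_case))         # 0
print(shorthand == full_path)  # True
```

The shorthand is convenient, but on a page where the same element nesting appears in more than one place it can match more nodes than the full path would.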

This intro is part of a series Shannon has started on scraping data from websites.

