XPath for Channels: The Good, the Bad, and the Fugly

As a web developer, I can be pretty sensitive not only to how web-based code works, but also to how it's written and presented. Unfortunately, not everyone takes as much care as they should to generate concise, clean markup for their web pages, and that can make things quite challenging when you're trying to write a plugin that parses or scrapes web content. When you don't have an RSS/Atom feed, a JSON feed, or any other kind of data feed from which to retrieve your media listing, your last resort is to parse the actual HTML to find the media you want to include in your channel. For that approach XPath is generally the tool of choice, and it's included as part of Plex's channel framework.

The reason this approach should be your last resort is simple: websites change. All the time. Most of those changes are just updated content, new media listings, or small layout tweaks, but that's not always the case. Sometimes the changes are large or intrusive, and depending on what changed and how well you craft your XPath queries, they can and will affect the functionality of your channel. When you're not working from an API or a well-maintained, backward-compatible feed, a channel is always going to be a moving target.

If you write a channel based on parsing web pages, count on updating it in the future. Depending on how often your source site changes and how wisely you construct your XPath queries, you may end up fixing your channel more often than you'd like. Hopefully this article will point out some of the issues surrounding the various approaches to XPath queries and help you make choices that get maximum results with minimal, infrequent fixes.

A Perfect World

In a perfect world, all websites would be written in clear, clean, and concise (and preferably validated) code that uses a logical layout and makes it easy to get at the sub-components you want. Anyone who's ever used the “View Source” option in a web browser knows that this is rarely the case, but let's look at an example of what a snippet of such code might look like. For the sake of simplicity, things like “Video 1” would be much more complex in the real world, but this should be enough to illustrate the point.

<html>
    <body>
        <div id="header">
            <!-- Header content in here -->
        </div>

        <div id="content">
            <!-- main area content here -->
            <ul id="videoList">
                <li>Video 1</li>
                <li>Video 2</li>
                <li>Video 3</li>
            </ul>
        </div>

        <div id="footer">
            <!-- footer content here -->
        </div>
    </body>
</html>

If all sites were built like this, with layouts (and all future layout changes) handled purely in CSS, we'd be lucky indeed! It would be rare to see something this simple in the wild, but it will suffice to demonstrate the differences between the various ways to use XPath in your channel. Even with markup this simple, things can get tricky: there are many different ways to retrieve exactly the same video data with XPath, and some tend to work out better than others. Let's try a few different approaches.
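
Before we do, it's worth noting that every query in this article can be tested outside of Plex with plain Python and the lxml library, which (as far as I can tell) speaks the same XPath dialect the channel framework exposes. Here's a minimal harness, assuming nothing beyond the sample markup above condensed into a string:

from lxml import html

# The sample page from above, condensed into a string for testing.
SAMPLE_PAGE = """
<html>
    <body>
        <div id="header"><!-- Header content in here --></div>
        <div id="content">
            <ul id="videoList">
                <li>Video 1</li>
                <li>Video 2</li>
                <li>Video 3</li>
            </ul>
        </div>
        <div id="footer"><!-- footer content here --></div>
    </body>
</html>
"""

tree = html.fromstring(SAMPLE_PAGE)

def run(query):
    # Return the text of every element the query matches.
    return [el.text_content().strip() for el in tree.xpath(query)]

print(run('//li'))   # ['Video 1', 'Video 2', 'Video 3']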

What you might get from an XPath finder plugin or addon:

/html/body/div[2]/ul/li

This is a very literal, “top down” approach to finding the elements you're looking for: we step through the document's objects one level at a time, leading directly to the desired elements. It is probably the worst approach to take when retrieving data via XPath for a Plex channel. While it's technically accurate, very explicit, and gives the result you're looking for in this example, it is also very easily broken. A single element change anywhere on the page before your desired bit of markup and you will no longer find what you're looking for. If you take this approach, keep your text editor open with your channel code loaded, because you will need it, and you will need it often.
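
To see just how fragile that is, here's a quick sketch (plain Python and lxml, hypothetical markup) of what happens when the site drops a single new div, say an ad banner, above the content area:

from lxml import html

query = '/html/body/div[2]/ul/li'

# Today's layout: #content is the second div under <body>.
today = """<html><body>
    <div id="header"></div>
    <div id="content"><ul id="videoList"><li>Video 1</li></ul></div>
    <div id="footer"></div>
</body></html>"""

# Tomorrow's layout: an ad banner has been inserted above #content,
# so the second div under <body> is no longer the one we want.
tomorrow = """<html><body>
    <div id="header"></div>
    <div id="adBanner"></div>
    <div id="content"><ul id="videoList"><li>Video 1</li></ul></div>
    <div id="footer"></div>
</body></html>"""

print(len(html.fromstring(today).xpath(query)))      # prints 1 (works)
print(len(html.fromstring(tomorrow).xpath(query)))   # prints 0 (broken)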

Quick and Dirty:

//li

This example is very quick and dirty: we're searching the entire document for every list item. It also returns the desired objects in the example above, but it too is very easily broken. If the site adds a list item anywhere within the page we're scraping, there will again be breakage. So if you use this quick-and-dirty approach, keep your channel open in your text editor for bug fixes.
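
The failure mode here is the opposite one: instead of finding nothing, you find too much. A sketch, again with hypothetical markup, where the site has added a simple navigation menu:

from lxml import html

page = """<html><body>
    <ul id="navMenu"><li>Home</li><li>About</li></ul>
    <div id="content">
        <ul id="videoList"><li>Video 1</li><li>Video 2</li></ul>
    </div>
</body></html>"""

# //li sweeps up the navigation entries right along with the videos.
for li in html.fromstring(page).xpath('//li'):
    print(li.text_content())   # Home, About, Video 1, Video 2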

A More Filtered Approach:

//div[@id="content"]//li

This is getting a little better: we're now restricting the search for list items to the div with an id of content. While it beats the previous choices, it still leaves a fair chance of future breakage if any other list items show up within the content div. That's especially important to keep in mind if your channel loads many similar (but not identical) pages, or pages with dynamic content, which is often the case: a query that works perfectly on one page may not on the next. List items are used frequently and for many different purposes, so this still leaves too much room for breakage on subsequent pages.
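
For example (hypothetical markup once more), suppose the site later adds a second list of related links inside the content div:

from lxml import html

page = """<html><body>
    <div id="content">
        <ul id="videoList"><li>Video 1</li><li>Video 2</li></ul>
        <ul id="relatedLinks"><li>Contact us</li></ul>
    </div>
</body></html>"""

# Filtering on #content isn't enough; the new list leaks in too.
for li in html.fromstring(page).xpath('//div[@id="content"]//li'):
    print(li.text_content())   # Video 1, Video 2, Contact us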

A More Strict Approach:

//div[@id="content"]/ul[@id="videoList"]/li

OK, this is better still. We're now limiting the search to list items within the unordered list with the id of videoList, which is itself within the div with the id of content. That definitely limits us to a small set of items, and they are extremely likely to contain only the content we're looking for. The catch is that if the site's layout changes and the enclosing div's id changes with it, we again have breakage. While this is often a reasonable approach, there is likely an even safer way.
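
Here's what that breakage looks like in practice, in a sketch where a redesign has renamed the wrapper div from content to mainContent. It also hints at the fix we'll look at next:

from lxml import html

# After a redesign, the wrapper's id is "mainContent" instead of "content".
page = """<html><body>
    <div id="mainContent">
        <ul id="videoList"><li>Video 1</li><li>Video 2</li></ul>
    </div>
</body></html>"""

tree = html.fromstring(page)
print(len(tree.xpath('//div[@id="content"]/ul[@id="videoList"]/li')))   # prints 0 (broken)
print(len(tree.xpath('//ul[@id="videoList"]/li')))                      # prints 2 (still fine)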

A More Flexible Approach:

//ul[@id="videoList"]/li

This restricts us to list items within the videoList unordered list. It's probably the best approach for this example, being the least likely to break due to minor site changes. It's also worth noting that if it does break (and it will break at some point!), the site has probably undergone a major layout overhaul or some other significant change that would require your attention anyway. In this example, the site could move the unordered list with the id of videoList anywhere on the page and we would still get our desired video information.
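
Inside an actual channel, the end result might look something like the sketch below. Treat the specifics as assumptions rather than gospel: BASE_URL and ListVideos are placeholders of my own, though HTML.ElementFromURL is, to the best of my knowledge, the framework helper you'd use to fetch and parse a page:

# A sketch, not working channel code: BASE_URL and ListVideos are
# hypothetical, and error handling is omitted.
BASE_URL = 'http://example.com/videos'

def ListVideos():
    # HTML.ElementFromURL fetches a page and returns a parsed tree
    # that supports .xpath(), much like lxml does.
    page = HTML.ElementFromURL(BASE_URL)
    titles = []
    for li in page.xpath('//ul[@id="videoList"]/li'):
        titles.append(li.text_content().strip())
    return titles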

Closing Words

In closing, there are a few things to take away from this article. One is that there are many different ways to get the same result, and you have to make an educated decision about which approach best suits your channel's needs. If your source site is static, isn't going to change, and you're happy with the results an XPath addon gives you, there's nothing wrong with using them. However, if you want a robust, flexible channel that will stand the test of time, make the effort to figure out what is most likely to keep working as the source site changes.

Lastly, scraping web-based content is almost always an ongoing process and will often require constant maintenance, so it probably shouldn't be your first choice for getting the data you want. But if you do go this route, it can often be done in a manner flexible enough that it won't need fixing every week when the source site's webmaster decides to rearrange the layout.