Nov 29 2014
 

About a year ago I published an article entitled Parsing HTML with C++. It is by far my most popular article (second most popular being this one), and is a top result on Google for queries such as “html c++ parsing”. Nevertheless, there is always room for improvement. Today, I present a revisit of the topic, including a simpler way to parse as well as a self-contained, ready-to-go example (which many people have been asking me for.)
 

Old solution

Before today, my prescription for HTML parsing in C++ was a combination of the following libraries and their associated wrappers:

  • cURL (via the curlpp wrapper)
  • Tidy (libtidy)
  • libxml2 (via the libxml++ wrapper)

cURL, of course, is needed to perform HTTP requests so that we have something to parse. Tidy was used to transform the HTML into XML, which was then consumed by libxml2. libxml2 provided a nice DOM tree that can be traversed with XPath expressions.
 

Shortcomings

This kludge presents a number of problems, with the primary one being no HTML5 support. Tidy doesn’t support HTML5 tags, so when it encounters one, it chokes. There is a version of Tidy in development that is supposed to support HTML5, but it is still experimental.

But the real sore point is the requirement to convert the HTML into XML before feeding it to libxml2. If only there were a way for libxml2 to consume HTML directly… Oh, wait.

At the time, I hadn’t realized that libxml2 actually had a built-in HTML parser. I even found a message on the mailing list from 2004 giving a sample class that encapsulates the HTML parser. Seeing as the last message posted was also from 2004, I suppose there isn’t much interest.
 

New solution

With knowledge of the native HTML parser in hand, we can modify the old solution to completely remove libtidy from the mix. libxml2 by default isn’t happy with HTML5 tags either, but we can fix this by silencing errors (HTML_PARSE_NOERROR) and relaxing the parser (HTML_PARSE_RECOVER).
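In isolation, the relaxed parsing call looks something like this (a minimal sketch; the markup string and variable names are placeholders, and the same call appears in context in the full example below):

#include <libxml/HTMLparser.h>
#include <libxml/tree.h>

#include <string>

int main() {
    std::string html = "<article><p>HTML5 tags no longer derail the parse</p></article>";

    // Recover from malformed or unrecognized markup and suppress the complaints
    htmlDocPtr doc = htmlReadDoc((xmlChar*)html.c_str(), NULL, NULL,
                                 HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);

    // ... traverse the document here ...

    xmlFreeDoc(doc);
    return 0;
}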

The new solution, then, requires solely cURL, libxml2, and their associated wrappers.

Below is a self-contained example that visits iplocation.net to acquire the external IP address of the current computer:

#include <libxml/tree.h>
#include <libxml/HTMLparser.h>
#include <libxml++/libxml++.h>

#include <curlpp/cURLpp.hpp>
#include <curlpp/Easy.hpp>
#include <curlpp/Options.hpp>

#include <iostream>
#include <list>
#include <sstream>
#include <string>

#define HEADER_ACCEPT "Accept:text/html,application/xhtml+xml,application/xml"
#define HEADER_USER_AGENT "User-Agent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.70 Safari/537.17"

int main() {
    std::string url = "http://www.iplocation.net/";
    curlpp::Easy request;

    // Specify the URL
    request.setOpt(curlpp::options::Url(url));

    // Specify some headers
    std::list<std::string> headers;
    headers.push_back(HEADER_ACCEPT);
    headers.push_back(HEADER_USER_AGENT);
    request.setOpt(new curlpp::options::HttpHeader(headers));
    request.setOpt(new curlpp::options::FollowLocation(true));

    // Configure curlpp to write the response to a stream
    std::ostringstream responseStream;
    curlpp::options::WriteStream streamWriter(&responseStream);
    request.setOpt(streamWriter);

    // Collect the response
    request.perform();
    std::string re = responseStream.str();

    // Parse HTML and create a DOM tree
    xmlDoc* doc = htmlReadDoc((xmlChar*)re.c_str(), NULL, NULL, HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);

    // Encapsulate raw libxml document in a libxml++ wrapper
    xmlNode* r = xmlDocGetRootElement(doc);
    xmlpp::Element* root = new xmlpp::Element(r);

    // Grab the IP address
    std::string xpath = "//*[@id=\"locator\"]/p[1]/b/font/text()";
    auto elements = root->find(xpath);
    std::cout << "Your IP address is:" << std::endl;
    std::cout << dynamic_cast<xmlpp::ContentNode*>(elements[0])->get_content() << std::endl;

    delete root;
    xmlFreeDoc(doc);

    return 0;
}

Install prerequisites and compile like this (Linux):

sudo apt-get install libcurlpp-dev libxml++2.6-dev
g++ main.cpp -lcurlpp -lcurl -g -pg `xml2-config --cflags --libs` `pkg-config libxml++-2.6 --cflags --libs` --std=c++0x
./a.out

 

Future work

In the near future, I will be releasing my own little wrapper class for cURL which simplifies a couple of workflows involving cookies and headers. It will make it easy to perform some types of requests with very few lines of code.

Something I need to investigate a little further is a small memory leak that occurs when I grab the content: dynamic_cast<xmlpp::ContentNode*>(elements[0])->get_content(). On my computer, it seems to range between 16 and 64 bytes lost. It may be a problem with libxml++ or just a false alarm from Valgrind.
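For reference, this is roughly the Valgrind invocation I mean, run against the binary built with the commands above (the exact byte counts and leak kinds will likely differ between machines and library versions):

valgrind --leak-check=full ./a.out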

Finally, I may consider following up on that mailing list post to see if we can get the HTML parser included in libxml++.

Feb 26 2013
 

I was having a hard time finding an HTML parser for my latest C++ project, so I decided to write up a quick summary of what I ended up using.

Revisited! Please see the new article here.


My #1 requirement for a parser was that it had to provide some mechanism of searching for elements. There are a couple of parsers available that only provide SAX-style parsing, which is very inconvenient for all but the simplest of parsing tasks. An ideal API would provide searching using XPath expressions, or something similar.

The only decent sources of information I found were these three questions from Stack Overflow: Library Recommendation: C++ HTML Parser, Parse html using C, and XML Parser for C. Below is a summary of what I considered along with my take on each:

  • QWebElement – Part of the Qt framework. Although it provides a rich API, I couldn’t figure out how to compile any Qt code outside of Qt Creator (I’m using Code::Blocks.)
  • htmlcxx – Standalone, tiny library. I got some code up and running with this library very fast. However, I quickly realized how limited it is (e.g. poor attribute accessors, no way to search for elements.) Limited documentation.
  • Tidy – The classic HTML cleaner/repairer has a built-in SAX-style parser. Simple to use, but like htmlcxx, limited in what it can do.
  • Tidy + libxml++ – Tidy can transform HTML into XML, so all that’s needed is a good XML parser. This was the solution I ended up using.

My final solution was to use Tidy to clean up the markup and convert it into XML. Then, I use libxml++ (a C++ wrapper for libxml) to traverse the DOM. libxml++ supports searching for elements with XPath, so I was happy.

Here’s some sample code demonstrating Tidy and libxml++.
 

Step 1: Using Tidy to clean HTML and convert it to XML:

#include <tidy/tidy.h>
#include <tidy/buffio.h>

#include <sstream>
#include <stdexcept>
#include <string>

std::string CleanHTML(const std::string &html){
    // Initialize a Tidy document
    TidyDoc tidyDoc = tidyCreate();
    TidyBuffer tidyOutputBuffer = {0};

    // Configure Tidy
    // The flags tell Tidy to output XML and disable showing warnings
    bool configSuccess = tidyOptSetBool(tidyDoc, TidyXmlOut, yes)
        && tidyOptSetBool(tidyDoc, TidyQuiet, yes)
        && tidyOptSetBool(tidyDoc, TidyNumEntities, yes)
        && tidyOptSetBool(tidyDoc, TidyShowWarnings, no);

    int tidyResponseCode = -1;

    // Parse input
    if (configSuccess)
        tidyResponseCode = tidyParseString(tidyDoc, html.c_str());

    // Process HTML
    if (tidyResponseCode >= 0)
        tidyResponseCode = tidyCleanAndRepair(tidyDoc);

    // Output the HTML to our buffer
    if (tidyResponseCode >= 0)
        tidyResponseCode = tidySaveBuffer(tidyDoc, &tidyOutputBuffer);

    // Any errors from Tidy?
    if (tidyResponseCode < 0) {
        std::ostringstream message;
        message << "Tidy encountered an error while parsing an HTML response. Tidy response code: " << tidyResponseCode;
        throw std::runtime_error(message.str());
    }

    // Grab the result from the buffer and then free Tidy's memory
    std::string tidyResult = (char*)tidyOutputBuffer.bp;
    tidyBufFree(&tidyOutputBuffer);
    tidyRelease(tidyDoc);

    return tidyResult;
}

 

Step 2: Parse the XML with libxml++:
The following code parses the HTML contained in ‘response’ (passing it to CleanHTML first.) Then, we search for the element with id ‘some_id’, output the line in the XML at which it occurs, and output how many elements matched that criterion (should be 1). For the sake of saving space, I omit error checking.

#include <libxml++/libxml++.h>

#include <iostream>

xmlpp::DomParser doc;

// 'response' contains your HTML
doc.parse_memory(CleanHTML(response));

xmlpp::Document* document = doc.get_document();
xmlpp::Element* root = document->get_root_node();

xmlpp::NodeSet elements = root->find("descendant-or-self::*[@id = 'some_id']");
std::cout << elements[0]->get_line() << std::endl;
std::cout << elements.size() << std::endl;

 
Important note about namespaces
Something that took me a while to figure out is that libxml++ requires the full namespace when selecting tags.

This won’t work: *p/text()
But this will: *[local-name() = 'p']/text()

You could also specify the full namespace manually, but I find that local-name() is a much better option.
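To make the difference concrete, here is a minimal, self-contained sketch (the XHTML fragment and variable names are just for illustration); it uses the descendant form of both expressions so they apply anywhere in the document:

#include <libxml++/libxml++.h>

#include <iostream>
#include <string>

int main() {
    // A tiny XHTML fragment; the default namespace declaration is what
    // makes a plain "p" step fail to match
    std::string xhtml =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
        "<body><p>hello</p></body></html>";

    xmlpp::DomParser parser;
    parser.parse_memory(xhtml);
    xmlpp::Element* root = parser.get_document()->get_root_node();

    // Matches nothing: "p" is treated as an element with no namespace
    std::cout << root->find("//p/text()").size() << std::endl;

    // Matches the text node: local-name() ignores the namespace entirely
    std::cout << root->find("//*[local-name() = 'p']/text()").size() << std::endl;

    return 0;
}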

HTML5
Tidy by default doesn’t seem to support HTML5. For a version of Tidy that does, see here: http://w3c.github.io/tidy-html5/

More info

To compile the example code, I use the g++ flags: `pkg-config --cflags glibmm-2.4 libxml++-2.6 --libs` -ltidy. As the flags suggest, you’ll need the glibmm library in addition to Tidy and libxml++ (and their dependencies.)
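Putting that together, a complete build command would look something like the following (this assumes the snippets above have been assembled into a single main.cpp with a main() function; adjust the file name to match your project):

g++ main.cpp `pkg-config --cflags glibmm-2.4 libxml++-2.6 --libs` -ltidy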


See the libxml++ class references: http://developer.gnome.org/libxml++/stable/annotated.html

Revisited! Please see the new article here.
