
Parsing HTML with C++: Revisited

About a year ago I published an article entitled Parsing HTML with C++. It is by far my most popular article (the second most popular being this one), and is a top result on Google for queries such as “html c++ parsing”. Nevertheless, there is always room for improvement. Today, I present a revisit of the topic, including a simpler way to parse as well as a self-contained, ready-to-go example (which many people have been asking me for).

Old solution

Before today, my prescription for HTML parsing in C++ was a combination of the following libraries and their associated wrappers:

- cURL (with the curlpp wrapper)
- HTML Tidy (libtidy)
- libxml2 (with the libxml++ wrapper)

cURL, of course, is needed to perform HTTP requests so that we have something to parse. Tidy was used to transform the HTML into XML that was then consumed by libxml2. libxml2 provided a nice DOM tree that is traversable with XPath expressions.
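
For reference, the Tidy leg of that pipeline looked roughly like the sketch below (a minimal sketch against libtidy’s C API; note that the buffer header is named tidybuffio.h in newer releases and buffio.h in older ones, and error handling is omitted):

#include <tidy.h>
#include <tidybuffio.h>

#include <string>

// Convert an HTML string into XML that libxml2's XML parser will accept.
std::string tidyToXml(const std::string& html) {
    TidyDoc tdoc = tidyCreate();
    tidyOptSetBool(tdoc, TidyXmlOut, yes);      // emit well-formed XML
    tidyOptSetBool(tdoc, TidyQuiet, yes);       // suppress status output
    tidyOptSetBool(tdoc, TidyShowWarnings, no);

    tidyParseString(tdoc, html.c_str());
    tidyCleanAndRepair(tdoc);

    TidyBuffer output;
    tidyBufInit(&output);
    tidySaveBuffer(tdoc, &output);
    std::string xml(reinterpret_cast<char*>(output.bp), output.size);

    tidyBufFree(&output);
    tidyRelease(tdoc);
    return xml;
}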

Shortcomings

This kludge presents a number of problems, with the primary one being no HTML5 support. Tidy doesn’t support HTML5 tags, so when it encounters one, it chokes. There is a version of Tidy in development that is supposed to support HTML5, but it is still experimental.

But the real sore point is the requirement to convert the HTML into XML before feeding it to libxml2. If only there were a way for libxml2 to consume HTML directly… Oh, wait.

At the time, I hadn’t realized that libxml2 actually has a built-in HTML parser. I even found a message on the libxml2 mailing list from 2004 giving a sample class that encapsulates the HTML parser. Seeing as the last message posted in that thread was also from 2004, I suppose there isn’t much interest.

New solution

With knowledge of the native HTML parser in hand, we can modify the old solution to completely remove libtidy from the mix. libxml2 by default isn’t happy with HTML5 tags either, but we can fix this by silencing errors (HTML_PARSE_NOERROR) and relaxing the parser (HTML_PARSE_RECOVER).
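
In isolation, the relevant call looks like this (a small sketch; I use htmlReadMemory here, the length-taking sibling of the htmlReadDoc call that appears in the full example below):

#include <libxml/HTMLparser.h>

#include <string>

// Parse possibly-malformed HTML into a libxml2 document. HTML_PARSE_RECOVER
// keeps the parser going on broken markup, while NOERROR/NOWARNING silence
// the complaints it would otherwise emit about unrecognized (HTML5) tags.
htmlDocPtr parseHtml(const std::string& html) {
    return htmlReadMemory(html.c_str(), static_cast<int>(html.size()),
                          NULL, NULL,
                          HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
}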

The new solution, then, requires only cURL, libxml2, and their associated wrappers (curlpp and libxml++).

Below is a self-contained example that visits iplocation.net to acquire the external IP address of the current computer:

#include <libxml/tree.h>
#include <libxml/HTMLparser.h>
#include <libxml++/libxml++.h>

#include <curlpp/cURLpp.hpp>
#include <curlpp/Easy.hpp>
#include <curlpp/Options.hpp>

#include <iostream>
#include <list>
#include <sstream>
#include <string>

#define HEADER_ACCEPT "Accept: text/html,application/xhtml+xml,application/xml"
#define HEADER_USER_AGENT "User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.70 Safari/537.17"

int main() {
    std::string url = "http://www.iplocation.net/";
    curlpp::Easy request;

    // Specify the URL
    request.setOpt(curlpp::options::Url(url));

    // Specify some headers
    std::list<std::string> headers;
    headers.push_back(HEADER_ACCEPT);
    headers.push_back(HEADER_USER_AGENT);
    request.setOpt(curlpp::options::HttpHeader(headers));
    request.setOpt(curlpp::options::FollowLocation(true));

    // Configure curlpp to write the response into a string stream
    std::ostringstream responseStream;
    curlpp::options::WriteStream streamWriter(&responseStream);
    request.setOpt(streamWriter);

    // Perform the request and collect the response
    request.perform();
    std::string response = responseStream.str();

    // Parse the HTML and create a DOM tree
    xmlDoc* doc = htmlReadDoc((xmlChar*)response.c_str(), NULL, NULL,
                              HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);

    // Encapsulate the raw libxml document in a libxml++ wrapper
    xmlNode* r = xmlDocGetRootElement(doc);
    xmlpp::Element* root = new xmlpp::Element(r);

    // Grab the IP address via an XPath query
    std::string xpath = "//*[@id=\"locator\"]/p[1]/b/font/text()";
    auto elements = root->find(xpath);
    if (!elements.empty()) {
        std::cout << "Your IP address is:" << std::endl;
        std::cout << dynamic_cast<xmlpp::ContentNode*>(elements[0])->get_content() << std::endl;
    }

    delete root;
    xmlFreeDoc(doc);

    return 0;
}

Install prerequisites and compile like this (Linux):

sudo apt-get install libcurlpp-dev libxml++2.6-dev
g++ main.cpp -lcurlpp -lcurl -g -pg `xml2-config --cflags --libs` `pkg-config libxml++-2.6 --cflags --libs` -std=c++0x
./a.out

Future work

In the near future, I will be releasing my own little wrapper class for cURL which simplifies a couple of workflows involving cookies and headers. It will make it easy to perform some types of requests with very few lines of code.

Something I need to investigate a little further is a small memory leak that occurs when I grab the content: dynamic_cast<xmlpp::ContentNode*>(elements[0])->get_content(). On my computer, it seems to lose between 16 and 64 bytes. It may be a problem with libxml++, or just a false alarm from Valgrind.
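
For anyone who wants to reproduce the measurement, a typical Valgrind run looks like this (the exact byte counts will vary from machine to machine):

valgrind --leak-check=full ./a.out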

Finally, I may consider following up on that mailing list post to see if we can get the HTML parser included in libxml++.

9 thoughts on “Parsing HTML with C++: Revisited”

  1. I have been using gumbo-parser. It works relatively well, but I will have to give this a try. Thanks for the great article. Here’s an example of using gumbo-parser if you are interested…

    #include <iostream>
    #include <cstdio>
    #include <cstring>
    #include <curl/curl.h>
    #include <vector>
    #include <unistd.h>
    #include "gumbo.h"
    
    using namespace std;
    
    CURL *curl;
    CURLcode res;
    
    int total = 0;
    
    string data;
    
    vector<string> jokes;
    
    static size_t callback(void *contents, size_t size, size_t nmemb, void *pointer) {
        // Append the received chunk to the string that CURLOPT_WRITEDATA points at
        ((string*)pointer)->append((char*)contents, size * nmemb);
        return size * nmemb;
    }
    
    void extract_links( GumboNode* node )
    {
        GumboAttribute* detail;
        if (node->type != GUMBO_NODE_ELEMENT) {
            return;
        }
        if( node->v.element.tag == GUMBO_TAG_DIV && (detail = gumbo_get_attribute(&node->v.element.attributes, "class" ) ) )
        {
            if( strstr( detail->value, "joke-box-upper") != NULL && node->v.element.children.length > 0 )
            {
                GumboNode* child = static_cast<GumboNode*>(node->v.element.children.data[0]);

                // Only text nodes have a valid v.text member of the union
                if( child->type == GUMBO_NODE_TEXT && child->v.text.text != NULL ) {
                    jokes.push_back( child->v.text.text );
                }
            }
        }
        GumboVector* children = &node->v.element.children;
        for (unsigned int i = 0; i < children->length; ++i) {
            extract_links(static_cast<GumboNode*>(children->data[i]));
        }
    }
    
    int initialize_curl( const char* url )
    {
        curl_global_init(CURL_GLOBAL_ALL);
        curl = curl_easy_init();
        if(curl) {
            curl_easy_setopt(curl, CURLOPT_URL, url);
            curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, callback);
            curl_easy_setopt(curl, CURLOPT_WRITEDATA, &data);
            curl_easy_setopt(curl, CURLOPT_USERAGENT,
                "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36");
            curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
            curl_easy_setopt(curl, CURLOPT_COOKIEFILE, "cookies.txt");
            curl_easy_setopt(curl, CURLOPT_COOKIEJAR, "cookies.txt");
            res = curl_easy_perform(curl);
            if(res != CURLE_OK) {
                fprintf(stderr, "curl error: %s\n", curl_easy_strerror(res));
                curl_easy_cleanup(curl);
                return 1;
            }
            curl_easy_cleanup(curl);
        }
        return 0;
    }
    
    void cleanup() {
        data.clear();
        jokes.clear();
    }
    
    int main (int argc, char *argv[])
    {
        while( total < 235 )
        {
            char buffer[512];

            snprintf( buffer, sizeof(buffer), "http://www.joke-db.com/c/all/clean/page:%i/sort:score/direction:desc", total);

            if( initialize_curl( buffer ) == 0 )
            {
                GumboOutput* output = gumbo_parse(data.c_str());

                extract_links(output->root);

                // Free the parse tree once the jokes have been extracted
                gumbo_destroy_output(&kGumboDefaultOptions, output);

                for(vector<string>::size_type i = 0; i != jokes.size(); i++) {
                    printf("Current joke: %s\r\n", jokes[i].c_str() );
                }

                cleanup();
            }

            total++;

            printf("Going to page... %i\r\n", total );

            sleep(1);
        }

        printf("Complete!\r\n");

        getchar();

        return 0;
    }
    
  2. Thanks for this article. It’s the best I have found today.

    A little error at line 46:
    std::string xpath = "//*[@id="locator"]/p[1]/b/font/text()";
    should be:
    std::string xpath = "//*[@id=\"locator\"]/p[1]/b/font/text()";

  3. I found more dependencies here, like libiconv, which libxml requires; one library pulls in the next.
    I have Windows 7 and VS.
    I don’t want to build 6-8 libraries, and not all of them build with a couple of clicks; you have to read long guides to configure some of the library builds for Windows.
    …no thanks, for parsing a couple of simple HTML pages this sucks…

    1. The REST SDK by Microsoft is of course a nice library.
      If you don’t need the asynchronous part of it, then one should maybe also mention the “C++ Requests Library” (cpr). It’s based on libcurl and has, in my opinion, a really nice and easy-to-use API.

      https://github.com/whoshuu/cpr
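
      For a taste of the API, fetching a page is roughly this (a quick sketch; the URL is just a placeholder):

      #include <cpr/cpr.h>
      #include <iostream>

      int main() {
          // cpr wraps libcurl behind a single synchronous call
          cpr::Response r = cpr::Get(cpr::Url{"http://www.iplocation.net/"});
          std::cout << r.status_code << std::endl; // HTTP status
          std::cout << r.text << std::endl;        // raw HTML body, ready for a parser
          return 0;
      }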
