
Parsing HTML with C++

I was having a hard time finding an HTML parser for my latest C++ project, so I decided to write up a quick summary of what I ended up using.

Revisited! Please see the new article here.


My #1 requirement for a parser was that it had to provide some mechanism of searching for elements. There are a couple of parsers available that only provide SAX-style parsing, which is very inconvenient for all but the simplest of parsing tasks. An ideal API would provide searching using XPath expressions, or something similar.

The only decent sources of information I found were these three questions from Stack Overflow: Library Recommendation: C++ HTML Parser, Parse html using C, and XML Parser for C. Below is a summary of what I considered along with my take on each:

  • QWebElement – Part of the Qt framework. Although it provides a rich API, I couldn’t figure out how to compile any Qt code outside of Qt Creator (I’m using Code::Blocks.)
  • htmlcxx – Standalone, tiny library. I got some code up and running with this library very fast. However, I quickly realized how limited it is (e.g. poor attribute accessors, no way to search for elements.) Limited documentation.
  • Tidy – The classic HTML cleaner/repairer has a built-in SAX-style parser. Simple to use, but like htmlcxx, limited in what it can do.
  • Tidy + libxml++ – Tidy can transform HTML into XML, so all that’s needed is a good XML parser. This was the solution I ended up using.

My final solution was to use Tidy to clean up the markup and convert it into XML. Then, I use libxml++ (a C++ wrapper for libxml) to traverse the DOM. libxml++ supports searching for elements with XPath, so I was happy.

Here’s some sample code demonstrating Tidy and libxml++.
 

Step 1: Using Tidy to clean HTML and convert it to XML:

#include <stdexcept>
#include <string>

#include <tidy/tidy.h>
#include <tidy/buffio.h>

std::string CleanHTML(const std::string &html){
    // Initialize a Tidy document and an empty output buffer
    TidyDoc tidyDoc = tidyCreate();
    TidyBuffer tidyOutputBuffer = {0};

    // Configure Tidy
    // The flags tell Tidy to output XML, use numeric entities,
    // and suppress warnings and other non-essential output
    bool configSuccess = tidyOptSetBool(tidyDoc, TidyXmlOut, yes)
        && tidyOptSetBool(tidyDoc, TidyQuiet, yes)
        && tidyOptSetBool(tidyDoc, TidyNumEntities, yes)
        && tidyOptSetBool(tidyDoc, TidyShowWarnings, no);

    int tidyResponseCode = -1;

    // Parse input
    if (configSuccess)
        tidyResponseCode = tidyParseString(tidyDoc, html.c_str());

    // Process HTML
    if (tidyResponseCode >= 0)
        tidyResponseCode = tidyCleanAndRepair(tidyDoc);

    // Output the cleaned markup (now XML) to our buffer
    if (tidyResponseCode >= 0)
        tidyResponseCode = tidySaveBuffer(tidyDoc, &tidyOutputBuffer);

    // Copy the result out of the buffer (bp is null if nothing was written),
    // then free Tidy's memory before returning or throwing
    std::string tidyResult;
    if (tidyResponseCode >= 0 && tidyOutputBuffer.bp)
        tidyResult.assign((const char*)tidyOutputBuffer.bp, tidyOutputBuffer.size);
    tidyBufFree(&tidyOutputBuffer);
    tidyRelease(tidyDoc);

    // Any errors from Tidy?
    // (std::to_string is needed: adding an int to a string literal only offsets the pointer)
    if (tidyResponseCode < 0)
        throw std::runtime_error("Tidy encountered an error while parsing an HTML response. Tidy response code: " + std::to_string(tidyResponseCode));

    return tidyResult;
}

 

Step 2: Parse the XML with libxml++:
The following code parses the HTML contained in ‘response’ (after passing it through CleanHTML). Then, we search for the element with id ‘some_id’. After outputting how many elements match that criterion (should be 1), we output the line in the XML at which the element occurs. For the sake of saving space I omit error checking.

#include <iostream>
#include <libxml++/libxml++.h>

xmlpp::DomParser doc;

// 'response' contains your HTML
doc.parse_memory(CleanHTML(response));

xmlpp::Document* document = doc.get_document();
xmlpp::Element* root = document->get_root_node();

xmlpp::NodeSet elemns = root->find("descendant-or-self::*[@id = 'some_id']");

// Number of matching elements (should be 1)
std::cout << elemns.size() << std::endl;

// Line in the XML at which the first match occurs
if (!elemns.empty())
    std::cout << elemns[0]->get_line() << std::endl;

 
Important note about namespaces
Something that took me a while to figure out is that libxml++ requires the full namespace when selecting tags.

This won’t work: *p/text()
But this will: *[local-name() = 'p']/text()

You could also specify the full namespace manually, but I find that local-name() is a much better option.
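Since the local-name() form gets repetitive, a small helper can build it for you. This is only a sketch: LocalNameStep is a hypothetical name of my own, not part of libxml++, and it simply produces the string shown above for a given tag name.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of libxml++): builds an XPath step that
// matches elements by tag name regardless of their namespace.
std::string LocalNameStep(const std::string& tag) {
    // A single quote in the name would break out of the XPath string literal
    if (tag.find('\'') != std::string::npos)
        throw std::invalid_argument("tag name must not contain a single quote");
    return "*[local-name() = '" + tag + "']";
}
```

With this, the working query above could be written as root->find("descendant-or-self::" + LocalNameStep("p") + "/text()").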

HTML5
Tidy by default doesn’t seem to support HTML5. For a version of Tidy that does, see here: http://w3c.github.io/tidy-html5/
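If you are stuck with a Tidy build that lacks HTML5 support, a commenter below reports that declaring the new tags through the TidyBlockTags option works, as long as all tag names go into one comma-separated string. Here is a rough sketch of building that string. JoinTagNames is a hypothetical helper, and which tags you list (and whether they should count as block-level) is an assumption you will need to check against your own markup.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: joins tag names into the single comma-separated
// string that a commenter reports Tidy's TidyBlockTags option accepts.
std::string JoinTagNames(const std::vector<std::string>& tags) {
    std::string joined;
    for (std::size_t i = 0; i < tags.size(); ++i) {
        if (i > 0)
            joined += ", ";
        joined += tags[i];
    }
    return joined;
}

// Usage inside CleanHTML, before parsing (needs the Tidy headers):
//   const std::string tags = JoinTagNames({"header", "footer", "section"});
//   tidyOptSetValue(tidyDoc, TidyBlockTags, tags.c_str());
```

The point is that the option is set once with the whole list, rather than once per tag.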

More info

To compile the example code, I use the g++ flags: `pkg-config --cflags glibmm-2.4 libxml++-2.6 --libs` -ltidy. As the flags suggest, you’ll need the glibmm library in addition to Tidy and libxml++ (and their dependencies.)


See the libxml++ class references: http://developer.gnome.org/libxml++/stable/annotated.html


30 thoughts on “Parsing HTML with C++”

    1. Hi Mark – The code is pretty much usable as is. To convert this into a complete example, all you need to do is throw Step 2 into a ‘main’ method (excluding the #include line, of course), with Step 1 code prepended to the top of the file.

  1. Hello, maybe I am too new to this topic, but how is the case handled when one formatting tag is nested inside another?
    Example:

    <a href="#" rel="nofollow">Text plus <b>bold text</b></a>

    How is it parsed into XML?

    1. Hi Alberto,

      The b tag is just a child of the a tag, so you can write XPath to select it.
      This should do it:

      descendant-or-self::*[local-name() = 'a']/*[local-name() = 'b']

      That will select all b tags that are direct children of a tags.

      Chris

      1. Hi,

        my concern is that there is both bold and non-bold text, so the element has some text content plus a child. How is this compliant with XML? And how does one separately recognize the part of the content that is outside the inner tag (b in this case) and the part inside it?

  2. Ah, I see. With XPath, to select the text itself, you need to use the text() function. This is because XPath treats text as an additional node. This is untested, but I think if you want to select the “Text plus ” text, you can use:

    descendant-or-self::*[local-name() = 'a']/text()

    and if you want to select the “bold text” try:

    descendant-or-self::*[local-name() = 'a']/*[local-name() = 'b']/text()

    As for whether it’s actually valid XML, I’m not sure but I don’t think you’ll have any trouble parsing it. If you do have trouble, let me know.

    Chris

  3. Not sure if you’re still replying to this, but do you have the HTML example?

    xmlpp requires a DTD, so I used http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd

    Now it’s throwing errors:

    terminate called after throwing an instance of ‘xmlpp::validity_error’
    what():
    Validity error:
    Line 289, column 7 (error):
    Value “center” for attribute align of img is not among the enumerated set
    Line 296, column 20 (error):
    No declaration for attribute data-form-id of element a
    Line 569, column 3 (error):
    Element img does not carry attribute alt

    1. Looking at my project from when I first used this code, I never had to specify a DTD. Starting with an instance of the DomParser (xmlpp::DomParser doc;), I simply write doc.parse_memory(CleanHTML(response));. As long as you are first passing your HTML to Tidy to clean it up, I wouldn’t think xmlpp would throw any errors.

  4. When I compile this, it doesn’t recognize the files “tiny.h” and “buffio.h”,
    even though I have them in the same folder that I’m compiling in.

  5. Hello and thanks for your tutorial ! I’m having an issue with TidyHTML though… I’m using it as a library, as you do, and I’m having trouble “tidying” those new HTML5 tags. , , and the like…
    How could I overcome this ? Should I add those tags, and if yes, how would I do that ? Be as precise as you can please, for I am not a smart man 😀

    Thanks again for all this !

    1. Just a follow-up: according to the API documentation, tidyOptSetValue(tidyDoc, TidyBlockTags, "header") should do the trick… But it doesn’t. What am I missing? Is it a whole lot more complicated, and do I also have to create those in tidyenum.h and tags.c?

      Thanks in advance for your help!

      1. Okay yeah, definitely had to check my shit before making these comments. Good thing there’s moderation heh? Anyway, I was dumb, all the “new tags” should be put in the same spot separated by commas. Sorry for the spam – maybe you could add this how-to-add-new-tags thing in the article itself ?

        Anyway thanks a lot again!

  6. Hi! I have a question, if you still look after this thread. I’m kind of a “noob” at this kind of programming. So here’s my question… I have used libcurl to get an HTML file. It’s now in the root of the project I’m coding, and I don’t understand how to use your code to just convert my (for example) “file.html” into an XHTML format that I can parse later. I don’t understand where, or which, file your program converts, and what kind of output you get. Do you put a URL somewhere in the program and get a plain string in return, or does an “.html” file get converted? That’s the part I’m stuck on. And where do I specify the file I want to be converted?
    I hope you can help me, thanks!

    1. Hi,

      The first function I define, CleanHTML, takes the contents of the HTML file itself. So, you just need to open the file and read its contents into a string. Then pass that to CleanHTML, which will produce XML. From there, just use the code in “Step 2” to actually parse the XML into a tree that you can traverse.

      Hope this helps
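
For the file-reading step described in the reply above, something like the following should work. It is a minimal sketch using only the standard library. ReadFileToString is a hypothetical helper name, and “file.html” is just the example filename from the question.

```cpp
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: reads an entire file into a std::string so that
// the result can be passed to CleanHTML.
std::string ReadFileToString(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in)
        throw std::runtime_error("Could not open file: " + path);

    std::ostringstream contents;
    contents << in.rdbuf();  // copy the whole stream into the string buffer
    return contents.str();
}
```

You would then call CleanHTML(ReadFileToString("file.html")) and hand the result to doc.parse_memory as in Step 2.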

        1. I have one more question, but only for the sake of my curiosity. What would happen if you called CleanHTML(response) in the main method without the rest of Step 2 and left it as is? Would anything happen? If yes, when would the buffer clean itself?

          Thanks again for your time.

    2. I used that curl library recently… By “parsing” the HTML file, do you mean removing all the HTML tags and getting only the text and links?

      1. By that I mean algorithmically searching a web page for the desired information and then “extracting” it. My program needs to batch-rename a list of files from an alphanumeric code to an actual “name”. I know a site where I can find the desired information just by entering the proper URL with the alphanumeric code. I then want to read the proper name into a string and use it to rename the file. Hope this answers your question.

  7. Tidy – … has a built-in SAX-style parser. (!!WRONG)

    tidyBufAppend( out, in, r ); is not sax style parsing, it’s just adding bytes to buffer;
    in fact you can see then in main()
    err = tidyParseBuffer(tdoc, &docbuf);
    which will parse the whole preloaded buffer

    so please don’t write certain things and confuse people…
    tnx

  8. Can you please upload the project and its dependent projects, such as glibmm and the gtk includes, so that the MSVC project compiles properly?
    This would probably save everyone a huge amount of time gathering the right version from the internet.

    Regards

    Cedric
