I was having a hard time finding an HTML parser for my latest C++ project, so I decided to write up a quick summary of what I ended up using.
My #1 requirement for a parser was that it had to provide some mechanism of searching for elements. There are a couple of parsers available that only provide SAX-style parsing, which is very inconvenient for all but the simplest of parsing tasks. An ideal API would provide searching using XPath expressions, or something similar.
The only decent sources of information I found were these three questions from Stack Overflow: Library Recommendation: C++ HTML Parser, Parse html using C, and XML Parser for C. Below is a summary of what I considered along with my take on each:
- QWebElement – Part of the Qt framework. Although it provides a rich API, I couldn’t figure out how to compile any Qt code outside of Qt Creator (I’m using Code::Blocks.)
- htmlcxx – Standalone, tiny library. I got some code up and running with this library very fast. However, I quickly realized how limited it is (e.g. poor attribute accessors, no way to search for elements.) Limited documentation.
- Tidy – The classic HTML cleaner/repairer has a built-in SAX-style parser. Simple to use, but like htmlcxx, limited in what it can do.
- Tidy + libxml++ – Tidy can transform HTML into XML, so all that’s needed is a good XML parser. This was the solution I ended up using.
My final solution was to use Tidy to clean up the markup and convert it into XML. Then, I use libxml++ (a C++ wrapper for libxml) to traverse the DOM. libxml++ supports searching for elements with XPath, so I was happy.
Here’s some sample code demonstrating Tidy and libxml++.
Step 1: Using Tidy to clean HTML and convert it to XML:
#include <stdexcept>
#include <string>
#include <tidy/tidy.h>
#include <tidy/buffio.h>

std::string CleanHTML(const std::string &html){
    // Initialize a Tidy document
    TidyDoc tidyDoc = tidyCreate();
    TidyBuffer tidyOutputBuffer = {0};

    // Configure Tidy
    // The flags tell Tidy to output XML and disable showing warnings
    bool configSuccess = tidyOptSetBool(tidyDoc, TidyXmlOut, yes)
        && tidyOptSetBool(tidyDoc, TidyQuiet, yes)
        && tidyOptSetBool(tidyDoc, TidyNumEntities, yes)
        && tidyOptSetBool(tidyDoc, TidyShowWarnings, no);

    int tidyResponseCode = -1;

    // Parse input
    if (configSuccess)
        tidyResponseCode = tidyParseString(tidyDoc, html.c_str());

    // Process HTML
    if (tidyResponseCode >= 0)
        tidyResponseCode = tidyCleanAndRepair(tidyDoc);

    // Output the HTML to our buffer
    if (tidyResponseCode >= 0)
        tidyResponseCode = tidySaveBuffer(tidyDoc, &tidyOutputBuffer);

    // Grab the result (if any) from the buffer and then free Tidy's memory.
    // This happens before the error check so nothing leaks when we throw.
    std::string tidyResult;
    if (tidyOutputBuffer.bp) {
        tidyResult = (char*)tidyOutputBuffer.bp;
        tidyBufFree(&tidyOutputBuffer);
    }
    tidyRelease(tidyDoc);

    // Any errors from Tidy?
    // (std::to_string is needed here: adding an int to a string literal is pointer arithmetic.)
    if (tidyResponseCode < 0)
        throw std::runtime_error("Tidy encountered an error while parsing an HTML response. Tidy response code: " + std::to_string(tidyResponseCode));

    return tidyResult;
}
Step 2: Parse the XML with libxml++:
The following code parses the HTML contained in ‘response’ (passing it to CleanHTML first). Then, we search for the element with id ‘some_id’. After outputting how many elements match that criterion (should be 1), we output the line in the XML at which the element occurs. For the sake of saving space I omit error checking.
#include <iostream>
#include <libxml++/libxml++.h>

xmlpp::DomParser doc;

// 'response' contains your HTML
doc.parse_memory(CleanHTML(response));

xmlpp::Document* document = doc.get_document();
xmlpp::Element* root = document->get_root_node();

xmlpp::NodeSet elemns = root->find("descendant-or-self::*[@id = 'some_id']");

// Number of matching elements (should be 1)
std::cout << elemns.size() << std::endl;
// Line in the XML at which the first match occurs
std::cout << elemns[0]->get_line() << std::endl;
Important note about namespaces
Something that took me a while to figure out is that libxml++ requires the full namespace when selecting tags.
This won’t work: *p/text()
But this will: *[local-name() = 'p']/text()
You could also specify the full namespace manually, but I find that local-name() is a much better option.
HTML5
Tidy by default doesn’t seem to support HTML5. For a version of Tidy that does, see here: http://w3c.github.io/tidy-html5/
More info
To compile the example code, I use the g++ flags: `pkg-config --cflags glibmm-2.4 libxml++-2.6 --libs` -ltidy. As the flags suggest, you’ll need the glibmm library in addition to Tidy and libxml++ (and their dependencies).
See the libxml++ class references: http://developer.gnome.org/libxml++/stable/annotated.html
Could you post the full source code example of this?
Thanks
Hi Mark – The code is pretty much usable as is. To convert this into a complete example, all you need to do is throw Step 2 into a ‘main’ method (excluding the #include line, of course), with Step 1 code prepended to the top of the file.
Hello, maybe I am too new to this topic, but how is the case handled when one has a formatting tag inside another?
Example:
.
How is it parsed into XML?
Hi Alberto,
The b tag is just a child of the a tag, so you can write XPath to select it.
This should do it:
descendant-or-self::*[local-name() = 'a']/*[local-name() = 'b']
That will select all b tags that are direct descendants of a tags.
Chris
Hi,
My concern is that there is both bold and non-bold text. So there is some content plus a child. How is this compliant with XML? And how does one separately recognize the part of the content that is outside the inner tag (<b> in this case) and the part that is inside it?
Ah, I see. With XPath, to select the text itself, you need to use the text() function. This is because XPath treats text as an additional node. This is untested, but I think if you want to select the “Text plus ” text, you can use:
descendant-or-self::*[local-name() = 'a']/text()
and if you want to select the “bold text” try:
descendant-or-self::*[local-name() = 'a']/*[local-name() = 'b']/text()
As for whether it’s actually valid XML, I’m not sure but I don’t think you’ll have any trouble parsing it. If you do have trouble, let me know.
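To make the "text is an additional node" idea concrete, here is a small sketch using only the standard library. It is a toy tree, not the libxml++ API: the Node struct, DirectText, and BuildExample are names invented for this illustration, and the markup it models (an a tag holding "Text plus " followed by a b tag holding "bold text") is my assumption based on the strings quoted above.

```cpp
#include <string>
#include <vector>

// Toy model of a parsed document (illustration only, not libxml++).
// A node is either an element (value = tag name) or a text node (value = the text).
struct Node {
    bool isText;
    std::string value;
    std::vector<Node> children;  // always empty for text nodes
};

// Concatenate the text nodes that are direct children of 'element'.
// This mirrors what an XPath step such as a/text() selects: text inside
// nested elements is NOT included.
std::string DirectText(const Node &element) {
    std::string result;
    for (const Node &child : element.children)
        if (child.isText)
            result += child.value;
    return result;
}

// Build the assumed example: <a>Text plus <b>bold text</b></a>
Node BuildExample() {
    Node boldText{true, "bold text", {}};
    Node b{false, "b", {boldText}};
    Node plainText{true, "Text plus ", {}};
    Node a{false, "a", {plainText, b}};
    return a;
}
```

In this model the a element has two children, a text node and the b element, so DirectText on a gives only the non-bold part, while DirectText on its second child gives the bold part. Mixed content like this is ordinary in XML.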
Chris
Not sure if you’re still replying to this, but do you have the HTML example?
xmlpp requires a DTD, so I used http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
Now it’s throwing errors:
terminate called after throwing an instance of ‘xmlpp::validity_error’
what():
Validity error:
Line 289, column 7 (error):
Value “center” for attribute align of img is not among the enumerated set
Line 296, column 20 (error):
No declaration for attribute data-form-id of element a
Line 569, column 3 (error):
Element img does not carry attribute alt
I should mention that DTD was the DTD used by the HTML page I’m trying to parse.
Looking at my project from when I first used this code, I never had to specify a DTD. Starting with an instance of the DomParser (xmlpp::DomParser doc;), I simply write: doc.parse_memory(CleanHTML(response)); As long as you are first passing your HTML to Tidy to clean it up, I wouldn’t think xmlpp would throw any errors.
Well, the DTD was specified in the HTML page I was working on, which threw off the parser.
If you’re interested in my conclusion for all my weird issues, I wrote a post too (I linked to here as well)
http://s0beit.me/cpp/parsing-html-with-c-with-extra-utf-8-woes/
I just took a look – thanks for the pingback!
When I compile, this doesn’t recognize the files “tidy.h” and “buffio.h”, even though I have them in the same folder that I’m compiling in.
You need to install the library (libtidy) either by downloading an installer for it (on Windows) or installing it from a repository (Linux).
Hello and thanks for your tutorial ! I’m having an issue with TidyHTML though… I’m using it as a library, as you do, and I’m having trouble “tidying” those new HTML5 tags. , , and the like…
How could I overcome this ? Should I add those tags, and if yes, how would I do that ? Be as precise as you can please, for I am not a smart man 😀
Thanks again for all this !
Just a follow-up: according to the API documentation, tidyOptSetValue(tidyDoc, TidyBlockTags, "header") should do the trick… But it doesn’t. What am I missing? Is it a whole lot more complicated, and do I also have to create those in tidyenum.h and tags.c?
Thanks in advance for your help!
You’re welcome! Looks like there is a fork of Tidy for HTML5: http://w3c.github.io/tidy-html5/. That’s where I’d start.
Okay yeah, definitely had to check my shit before making these comments. Good thing there’s moderation heh? Anyway, I was dumb, all the “new tags” should be put in the same spot separated by commas. Sorry for the spam – maybe you could add this how-to-add-new-tags thing in the article itself ?
Anyway thanks a lot again!
No problem – I’m not really that familiar with Tidy, beyond using it how I do in this article. I will add a link to the HTML5 port though.
Hi! I have a question, if you still look after this thread. I’m kind of a “noob” at this kind of programming. So here’s my question: I have used libcurl to get an HTML file. It’s now in the root of the project I’m coding, and I don’t understand how to use your code to just convert my (for example) “file.html” into an XHTML format that I can parse later. I don’t understand where, or which file, your program converts, and what kind of output you get. Do you put a URL somewhere in the program and get a string straight back, or is it an “.html” file that gets converted? That’s the part I’m stuck on. And where do I specify the file I want to be converted?
Hope you really can help me, thanks !
Hi,
The first function I define, CleanHTML, takes the contents of the HTML file itself. So, you just need to open the file and read its contents into a string. Then pass that to CleanHTML, which will produce XML. From there, just use the code in “Step 2” to actually parse the XML into a tree that you can traverse.
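As a rough sketch of that first step, here is one way to read a file into a string using only the standard library. ReadFileToString is a name I made up for this example; it is not part of Tidy or libxml++, and I haven't tried it against your setup.

```cpp
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper (not from the article's code): read a whole file into a string.
std::string ReadFileToString(const std::string &path) {
    // Open in binary mode so the bytes reach Tidy unchanged
    std::ifstream in(path, std::ios::binary);
    if (!in)
        throw std::runtime_error("Could not open file: " + path);

    // Stream the entire file contents into an in-memory buffer
    std::ostringstream contents;
    contents << in.rdbuf();
    return contents.str();
}
```

With that in place, something like CleanHTML(ReadFileToString("file.html")) should give you the XML string to hand to the Step 2 code.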
Hope this helps
Thanks, that helps a lot !
I have one more question, but only for the sake of my curiosity. What would happen if you did something like CleanHTML(response) in the main method without the rest of Step 2, and left it as is? Would anything happen? If yes, when would the buffer clean itself?
Thanks again for your time.
What version of the tidy lib did you use?
I get several errors like : “undefined reference to ‘[email protected]′” or “undefined reference to ‘[email protected]’.
Sorry to bother you again and thanks in advance for any help you can give me !
I used that curl library recently… By “parsing” the HTML file, do you mean removing all the HTML tags and getting only the text and links of the HTML?
By that I mean algorithmically looking through a web page to find the desired information and then “extracting” it. My program needs to rename a list of files, one after another, from an alphanumeric code to an actual “name”. I know a site on which I can find the desired information just by typing the proper URL with the alphanumeric code. I then want to take the proper name as a string and use it to rename the file. I hope this answers your question.
Tidy – … has a built-in SAX-style parser. (!!WRONG)
tidyBufAppend( out, in, r ); is not sax style parsing, it’s just adding bytes to buffer;
in fact you can see then in main()
err = tidyParseBuffer(tdoc, &docbuf);
which will parse the whole preloaded buffer
so please don’t write certain things and confuse people…
tnx
Where exactly did I imply that tidyBufAppend had anything to do with SAX?
If you clicked the link I provided (http://curl.haxx.se/libcurl/c/htmltidy.html) you would see what I was actually talking about. It’s the code beginning here:
Can you please upload the project and its dependent projects, such as glibmm and the gtk includes, so that the whole MSVC project compiles properly?
This would probably save everyone a huge amount of time gathering the right version from the internet.
Regards
Cedric
Hi Cedric – Please see my follow-up article here: http://www.mostthingsweb.com/2014/11/parsing-html-c-revisited/. It uses one less dependency and is simpler than the previous code. I don’t have a MSVC version yet, but if there is enough interest I will work on it.
Chris