About a year ago I published an article entitled Parsing HTML with C++. It is by far my most popular article (the second most popular being this one), and is a top result on Google for queries such as “html c++ parsing”. Nevertheless, there is always room for improvement. Today, I present a revisit of the topic, including a simpler way to parse as well as a self-contained, ready-to-go example (which many people have been asking me for).
Old solution
Before today, my prescription for HTML parsing in C++ was a combination of cURL, HTML Tidy, and libxml2, along with their associated wrappers (curlpp and libxml++).
cURL, of course, is needed to perform HTTP requests so that we have something to parse. Tidy was used to transform the HTML into XML that was then consumed by libxml2. libxml2 provided a nice DOM tree that is traversable with XPath expressions.
Shortcomings
This kludge presents a number of problems, with the primary one being no HTML5 support. Tidy doesn’t support HTML5 tags, so when it encounters one, it chokes. There is a version of Tidy in development that is supposed to support HTML5, but it is still experimental.
But the real sore point is the requirement to convert the HTML into XML before feeding it to libxml2. If only there was a way for libxml2 to consume HTML directly… Oh, wait.
At the time, I hadn’t realized that libxml2 actually has a built-in HTML parser. I even found a message on the mailing list from 2004 giving a sample class that encapsulates the HTML parser. Seeing as the last message posted was also in 2004, I suppose there isn’t much interest.
New solution
With knowledge of the native HTML parser in hand, we can modify the old solution to completely remove libtidy from the mix. libxml2 by default isn’t happy with HTML5 tags either, but we can fix this by silencing errors (HTML_PARSE_NOERROR) and relaxing the parser (HTML_PARSE_RECOVER).
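In isolation, the parsing step boils down to a single htmlReadDoc() call. Here is a minimal sketch (parseHtml is just an illustrative helper name):

#include <libxml/HTMLparser.h>
#include <string>

// Parse HTML leniently: recover from malformed markup and stay quiet about errors.
xmlDoc* parseHtml(const std::string& html)
{
    return htmlReadDoc(reinterpret_cast<const xmlChar*>(html.c_str()),
                       NULL,  // no base URL
                       NULL,  // let libxml2 detect the encoding
                       HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
}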
The new solution, then, requires solely cURL, libxml2, and their associated wrappers.
Below is a self-contained example that visits iplocation.net to acquire the external IP address of the current computer:
#include <libxml/tree.h>
#include <libxml/HTMLparser.h>
#include <libxml++/libxml++.h>
#include <curlpp/cURLpp.hpp>
#include <curlpp/Easy.hpp>
#include <curlpp/Options.hpp>

#include <iostream>
#include <list>
#include <sstream>
#include <string>

#define HEADER_ACCEPT "Accept:text/html,application/xhtml+xml,application/xml"
#define HEADER_USER_AGENT "User-Agent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.70 Safari/537.17"

int main()
{
    std::string url = "http://www.iplocation.net/";

    curlpp::Easy request;

    // Specify the URL
    request.setOpt(curlpp::options::Url(url));

    // Specify some headers
    std::list<std::string> headers;
    headers.push_back(HEADER_ACCEPT);
    headers.push_back(HEADER_USER_AGENT);
    request.setOpt(new curlpp::options::HttpHeader(headers));
    request.setOpt(new curlpp::options::FollowLocation(true));

    // Configure curlpp to use stream
    std::ostringstream responseStream;
    curlpp::options::WriteStream streamWriter(&responseStream);
    request.setOpt(streamWriter);

    // Collect response
    request.perform();
    std::string re = responseStream.str();

    // Parse HTML and create a DOM tree
    xmlDoc* doc = htmlReadDoc((xmlChar*)re.c_str(), NULL, NULL,
        HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);

    // Encapsulate raw libxml document in a libxml++ wrapper
    xmlNode* r = xmlDocGetRootElement(doc);
    xmlpp::Element* root = new xmlpp::Element(r);

    // Grab the IP address
    std::string xpath = "//*[@id=\"locator\"]/p[1]/b/font/text()";
    auto elements = root->find(xpath);

    std::cout << "Your IP address is:" << std::endl;
    std::cout << dynamic_cast<xmlpp::ContentNode*>(elements[0])->get_content() << std::endl;

    delete root;
    xmlFreeDoc(doc);

    return 0;
}
Install prerequisites and compile like this (Linux):
sudo apt-get install libcurlpp-dev libxml++2.6-dev
g++ main.cpp -lcurlpp -lcurl -g -pg `xml2-config --cflags --libs` `pkg-config libxml++-2.6 --cflags --libs` --std=c++0x
./a.out
Future work
In the near future, I will be releasing my own little wrapper class for cURL which simplifies a couple of workflows involving cookies and headers. It will make it easy to perform some types of requests with very few lines of code.
Something I need to investigate a little further is a small memory leak that occurs when I grab the content via the dynamic_cast. On my computer, it seems to range between 16 and 64 bytes lost. It may be a problem with libxml++ or just a false alarm from Valgrind.
Finally, I may consider following up on that mailing list post to see if we can get the HTML parser included in libxml++.
I have been using gumbo-parser. It works relatively well; I will have to give this a try. Thanks for the great article. Here’s an example of using gumbo-parser if you are interested…
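Roughly, a minimal sketch along these lines (assuming the standard gumbo.h API; it walks the parse tree and prints every link’s href, and builds with -lgumbo):

#include <gumbo.h>
#include <iostream>
#include <string>

// Recursively walk the Gumbo parse tree and print every <a href="...">.
static void findLinks(GumboNode* node)
{
    if (node->type != GUMBO_NODE_ELEMENT)
        return;

    if (node->v.element.tag == GUMBO_TAG_A) {
        GumboAttribute* href = gumbo_get_attribute(&node->v.element.attributes, "href");
        if (href)
            std::cout << href->value << std::endl;
    }

    GumboVector* children = &node->v.element.children;
    for (unsigned int i = 0; i < children->length; ++i)
        findLinks(static_cast<GumboNode*>(children->data[i]));
}

int main()
{
    std::string html = "<html><body><a href=\"http://example.com\">link</a></body></html>";
    GumboOutput* output = gumbo_parse(html.c_str());
    findLinks(output->root);
    gumbo_destroy_output(&kGumboDefaultOptions, output);
    return 0;
}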
Glad it was helpful, and thanks for the snippet! I hadn’t actually seen Gumbo before.
Chris
Thanks for this article. The best I have found today.
A little error at line 46:
std::string xpath = "//*[@id="locator"]/p[1]/b/font/text()";
should be:
std::string xpath = "//*[@id=\"locator\"]/p[1]/b/font/text()";
I found more dependencies here, like libiconv, which libxml looks for. One library pulls in the next library.
I have Windows 7 and VS.
I don’t want to build more than 6-8 libs, and not all of them build with a couple of clicks; you have to read some long guides to configure some of the library builds for Windows.
…no thanks, to parse a couple of simple HTML pages, this sucks…
If you’ve got any suggestions for a simpler Windows solution I’m all ears.
For a simpler solution – on Windows or Linux – use the open-source C++ REST SDK from Microsoft, which has curl-like functionality built in. This SDK uses a task-based processing library and C++11 features, and promises much better performance in an asynchronous world. Follow my article here: http://duncanmackay.me/blog/development/parsing-html-using-cpp/
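As a rough sketch (assuming the cpprest/http_client.h header and the pplx task API the SDK ships with), fetching a page looks something like this:

#include <cpprest/http_client.h>
#include <iostream>

int main()
{
    // http_client wraps the connection; U() produces the platform's native string type.
    web::http::client::http_client client(U("http://www.iplocation.net/"));

    // Requests return pplx tasks, so the work is asynchronous by default.
    client.request(web::http::methods::GET)
        .then([](web::http::http_response response)
        {
            return response.extract_string();
        })
        .then([](utility::string_t body)
        {
            ucout << body << std::endl;
        })
        .wait(); // block here for the sake of the example
    return 0;
}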
Thanks for the original article.
Interesting, thanks!
The REST SDK by Microsoft is of course a nice library.
If you don’t need the asynchronous part of it, then one should maybe also mention the “C++ Requests Library” (cpr). It’s based on libcurl with what is, in my opinion, a really nice and easy-to-use API.
https://github.com/whoshuu/cpr
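For example, a GET is only a couple of lines (a sketch assuming the cpr/cpr.h header):

#include <cpr/cpr.h>
#include <iostream>

int main()
{
    // One call performs the whole request; the Response holds status, headers and body.
    cpr::Response r = cpr::Get(cpr::Url{"http://www.iplocation.net/"});

    std::cout << "Status: " << r.status_code << std::endl;
    std::cout << r.text << std::endl; // raw HTML body, ready for a parser
    return 0;
}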