Apr 112015
 

Some motivation

Django has a field called ChoiceField which lets you very easily create a <select> dropdown in your forms. A great feature is the ability to transform nested choices into appropriate <optgroup> groups. For instance, the following code:

my_choices = [("Group 1", [
                 (1, "Choice 1"), (2, "Choice 2")]), 
              ("Group 2", 
                 [(3, "Choice 3"), (4, "Choice 4")]), 
              (5, "Choice 5")]
field = ChoiceField(label="Field", choices=my_choices)

…creates a field that when rendered looks something like this:

drop

Looks nice

 

The HTML that Django produced looks like this (excluding the label):

<select id="id_test" name="test">
    <optgroup label="Group 1">
        <option value="1">Choice 1</option>
        <option value="2">Choice 2</option>
    </optgroup>
    <optgroup label="Group 2">
        <option value="3">Choice 3</option>
        <option value="4">Choice 4</option>
    </optgroup>
    <option value="5">Choice 5</option>
</select>

Nice. So, what happens when we try to add more levels of nesting in the choices?

my_choices = [
    ("Group 1", [
        (0, "Choice 0"),
        ("Subgroup 1", [
            (1, "Choice 1")]),
        ("Subgroup 2", [
            (2, "Choice 2")])]),
    ("Group 2",[
        ("Subgroup 3", [
            (3, "Choice 3")]),
        ("Subgroup 4", [(4, "Choice 4")])]),
     (5, "Choice 5")]
drop2

Bad, bad, bad

 

That didn’t work. Here’s the HTML Django gave us:

<select id="id_test" name="test">
    <optgroup label="Group 1">
        <option value="0">Choice 0</option>
        <option value="Subgroup 1">[(1, &#39;Choice 1&#39;)]</option>
        <option value="Subgroup 2">[(2, &#39;Choice 2&#39;)]</option>
    </optgroup>
    <optgroup label="Group 2">
        <option value="Subgroup 3">[(3, &#39;Choice 3&#39;)]</option>
        <option value="Subgroup 4">[(4, &#39;Choice 4&#39;)]</option>
    </optgroup>
    <option value="5">Choice 5</option>
</select>

 

The problem

As it turns out, ChoiceField only supports one level of nesting. Five years ago someone submitted a patch to correct the problem, but it was rejected for good reason: HTML itself only officially supports one level of <optgroup> nesting.

To see if it would work anyway, I manually applied the patch to my Django installation. Although the patch itself worked (Django produced the nested option groups), Chrome wasn’t having any of it. My dropdown displayed incorrectly, and developer tools showed that the HTML got mangled during parsing.
 

Emulating nested optgroups

My solution to the problem is to fake it by using disabled <option> elements as group headers. Then, we apply indentation to the actual options to make them appear to belong to the groups. This is accomplished by a custom widget called NestedSelect (below), which is a cross between the original Select widget and the patch I linked to above.

from itertools import chain
from django.forms.widgets import Widget
from django.utils.encoding import force_text
from django.utils.html import format_html
from django.utils.safestring import mark_safe
from django.forms.utils import flatatt

class NestedSelect(Widget):
    allow_multiple_selected = False

    def __init__(self, attrs=None, choices=()):
        super(NestedSelect, self).__init__(attrs)
        # choices can be any iterable, but we may need to render this widget
        # multiple times. Thus, collapse it into a list so it can be consumed
        # more than once.
        self.choices = list(choices)

    def render(self, name, value, attrs=None, choices=()):
        final_attrs = self.build_attrs(attrs, name=name)
        output = [format_html('<select{}>', flatatt(final_attrs))]
        selected_choices = set(force_text(v) for v in [value])
        options = 'n'.join(self.process_list(selected_choices, chain(self.choices, choices)))
        if options:
            output.append(options)
        output.append('</select>')
        return mark_safe('n'.join(output))

    def render_group(self, selected_choices, option_value, option_label, level=0):
        padding = "&nbsp;" * (level * 4)
        output = format_html("<option disabled>%s{}</option>" % padding, force_text(option_value))
        output += "".join(self.process_list(selected_choices, option_label, level + 1))
        return output

    def process_list(self, selected_choices, l, level=0):
        output = []
        for option_value, option_label in l:
            if isinstance(option_label, (list, tuple)):
                output.append(self.render_group(selected_choices, option_value, option_label, level))
            else:
                output.append(self.render_option(selected_choices, option_value, option_label, level))
        return output

    def render_option(self, selected_choices, option_value, option_label, level):
        padding = "&nbsp;" * (level * 4)
        if option_value is None:
            option_value = ''
        option_value = force_text(option_value)
        if option_value in selected_choices:
            selected_html = mark_safe(' selected="selected"')
            if not self.allow_multiple_selected:
                # Only allow for a single selection.
                selected_choices.remove(option_value)
        else:
            selected_html = ''
        return format_html('<option value="{}"{}>%s{}</option>' % padding, option_value, selected_html,
                           force_text(option_label))

Then it’s just a matter of swapping out the widget that ChoiceField uses:

self.fields["test"] = ChoiceField(label="Field", choices=my_choices, 
     widget=NestedSelect)

Now we get something closer to what we wanted:

drop3
 

Improving the aesthetics

The solution above works, but it can be improved in two ways:

  1. The manual addition of padding results in an ugly side-effect – that padding is visible in the selected option. Look at the difference when Choice 1 and Choice 5 are selected (below). Choice 5 is aligned correctly, but Choice 1 appears to be floating out in no-man’s land.

1 2

  1. We no longer have the nice bolding effect on the option group headers. It’s not possible for us to individually style the elements that are supposed to be headers.

To overcome these shortcomings of HTML, we must turn a JavaScript based replacement: Select2. With that plugin installed, the fix is just a little bit of JavaScript:

$(function () {
    $("select").each(function(){
        // Ensure the final select box is wide enough to fit the bolded headings
        $(this).width($(this).width() * 1.1);
    }).select2({
        templateSelection: function (selection) {
            return $.trim(selection.text);
        },
        templateResult: function (option) {
            if ($(option.element).attr("disabled")) {
                return $("<b>" + option.text + "</b>").css("color", "black");
            } else {
                return option.text;
            }
        }
    });
});

 

Finished product

The final results look nice, and you even get a free search box thanks to Select2:

222

 Posted by at 5:03 pm
Nov 292014
 

About a year ago I published an article entitled Parsing HTML with C++. It is by far my most popular article (second most popular being this one), and is a top result on Google for queries such as “html c++ parsing”. Nevertheless there is always room for improvement. Today, I present a revisit of the topic including a simpler way to parse, as well as a self-contained ready to go example (which many people have been asking me for.)
 

Old solution

Before today, my prescription for HTML parsing in C++ was a combination of the following libraries and associated wrappers:

cURL, of course, is needed to perform HTTP requests so that we have something to parse. Tidy was used to transform the HTML into XML that was then consumed by libxml2. libxml2 provided a nice DOM tree that is traversable with XPath expressions.
 

Shortcomings

This kludge presents a number of problems, with the primary one being no HTML5 support. Tidy doesn’t support HTML5 tags, so when it encounters one, it chokes. There is a version of Tidy in development that is supposed to support HTML5, but it is still experimental.

But the real sore point is the requirement to convert the HTML into XML before feeding it to libxml2. If only there was a way for libxml2 to consume HTML directly… Oh, wait.

At the time, I hadn’t realized that libxml2 actually had a built in HTML parser. I even found a message on the mailing list from 2004 giving a sample class that encapsulates the HTML parser. Seeing as though the last message posted was also in 2004, I suppose that there isn’t much interest.
 

New solution

With knowledge of the native HTML parser in hand, we can modify the old solution to completely remove libtidy from the mix. libxml2 by default isn’t happy with HTML5 tags either, but we can fix this by silencing errors (HTML_PARSE_NOERROR) and relaxing the parser (HTML_PARSE_RECOVER).

The new solution, then, requires solely cURL, libxml2, and their associated wrappers.

Below is a self-contained example that visits iplocation.net to acquire the external IP address of the current computer:

#include <libxml/tree.h>
#include <libxml/HTMLparser.h>
#include <libxml++/libxml++.h>

#include <curlpp/cURLpp.hpp>
#include <curlpp/Easy.hpp>
#include <curlpp/Options.hpp>

#include <iostream>
#include <string>

#define HEADER_ACCEPT "Accept:text/html,application/xhtml+xml,application/xml"
#define HEADER_USER_AGENT "User-Agent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.70 Safari/537.17"

int main() {
    std::string url = "http://www.iplocation.net/";
	curlpp::Easy request;

	// Specify the URL
	request.setOpt(curlpp::options::Url(url));

	// Specify some headers
	std::list<std::string> headers;
	headers.push_back(HEADER_ACCEPT);
	headers.push_back(HEADER_USER_AGENT);
	request.setOpt(new curlpp::options::HttpHeader(headers));
    request.setOpt(new curlpp::options::FollowLocation(true));

	// Configure curlpp to use stream
	std::ostringstream responseStream;
	curlpp::options::WriteStream streamWriter(&responseStream);
	request.setOpt(streamWriter);

	// Collect response
    request.perform();
    std::string re = responseStream.str();

    // Parse HTML and create a DOM tree
    xmlDoc* doc = htmlReadDoc((xmlChar*)re.c_str(), NULL, NULL, HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);

    // Encapsulate raw libxml document in a libxml++ wrapper
    xmlNode* r = xmlDocGetRootElement(doc);
    xmlpp::Element* root = new xmlpp::Element(r);

    // Grab the IP address
    std::string xpath = "//*[@id=\"locator\"]/p[1]/b/font/text()";
    auto elements = root->find(xpath);
    std::cout << "Your IP address is:" << std::endl;
    std::cout << dynamic_cast<xmlpp::ContentNode*>(elements[0])->get_content() << std::endl;

    delete root;
    xmlFreeDoc(doc);

    return 0;
}

Install prerequisites and compile like this (Linux):

sudo apt-get install libcurlpp-dev libxml++2.6-dev
g++ main.cpp -lcurlpp -lcurl -g -pg `xml2-config --cflags --libs` `pkg-config libxml++-2.6 --cflags --libs` --std=c++0x
./a.out

 

Future work

In the near future, I will be releasing my own little wrapper class for cURL which simplifies a couple of workflows involving cookies and headers. It will make it easy to perform some types of requests with very few lines of code.

Something I need to investigate a little further is a small memory leak that occurs when I grab the content: dynamic_cast(elements[0])->get_content(). On my computer, it seems to range between 16-64 bytes lost. It may be a problem with libxml++ or just a false alarm by Valgrind.

Finally, I may consider following up on that mailing list post to see if we can get the HTML parser included in libxml++.

 Posted by at 9:15 pm
Feb 262013
 

I was having a hard time finding an HTML parser for my latest C++ project, so I decided to write up a quick summary of what I ended up using.

Revisited! Please see the new article here.


My #1 requirement for a parser was that it had to provide some mechanism of searching for elements. There are a couple of parsers available that only provide SAX-style parsing, which is very inconvenient for all but the simplest of parsing tasks. An ideal API would provide searching using XPath expressions, or something similar.

The only decent sources of information I found were these three questions from Stack Overflow: Library Recommendation: C++ HTML ParserParse html using C, and XML Parser for C. Below is a summary of what I considered along with my take on each:

  • QWebElement – Part of the Qt framework. Although it provides a rich API, I couldn’t figure out how to compile any Qt code outside of Qt Creator (I’m using Code::Blocks.)
  • htmlcxx – Standalone, tiny library. I got some code up and running with this library very fast. However, I quickly realized how limited it is (e.g. poor attribute accessors, no way to search for elements.) Limited documentation.
  • Tidy – The classic HTML cleaner/repairer has a built-in SAX-style parser. Simple to use, but like htmlcxx, limited in what it can do.
  • Tidy + libxml++ – Tidy can transform HTML into XML, so all that’s needed is a good XML parser. This was the solution I ended up using.

My final solution was to use Tidy to clean up the markup and convert it into XML. Then, I use libxml++ (a C++ wrapper for libxml) to traverse the DOM. libxml++ supports searching for elements with XPath, so I was happy.

Here’s some sample code demonstrating Tidy and libxml++.
 

Step 1: Using Tidy to clean HTML and convert it to XML:

#include <tidy/tidy.h>
#include <tidy/buffio.h>

std::string CleanHTML(const std::string &html){
    // Initialize a Tidy document
    TidyDoc tidyDoc = tidyCreate();
    TidyBuffer tidyOutputBuffer = {0};

    // Configure Tidy
    // The flags tell Tidy to output XML and disable showing warnings
    bool configSuccess = tidyOptSetBool(tidyDoc, TidyXmlOut, yes)
        && tidyOptSetBool(tidyDoc, TidyQuiet, yes)
        && tidyOptSetBool(tidyDoc, TidyNumEntities, yes)
        && tidyOptSetBool(tidyDoc, TidyShowWarnings, no);

    int tidyResponseCode = -1;

    // Parse input
    if (configSuccess)
        tidyResponseCode = tidyParseString(tidyDoc, html.c_str());

    // Process HTML
    if (tidyResponseCode >= 0)
        tidyResponseCode = tidyCleanAndRepair(tidyDoc);

    // Output the HTML to our buffer
    if (tidyResponseCode >= 0)
        tidyResponseCode = tidySaveBuffer(tidyDoc, &tidyOutputBuffer);

    // Any errors from Tidy?
    if (tidyResponseCode < 0)
        throw ("Tidy encountered an error while parsing an HTML response. Tidy response code: " + tidyResponseCode);

    // Grab the result from the buffer and then free Tidy's memory
    std::string tidyResult = (char*)tidyOutputBuffer.bp;
    tidyBufFree(&tidyOutputBuffer);
    tidyRelease(tidyDoc);

    return tidyResult;
}

 

Step 2: Parse the XML with libxml++:
The following code parses the HTML contained in ‘response’ (passing it to CleanHTML first.) Then, we search for the element with id ‘some_id’. After outputting how many elements match that criteria (should be 1), we output the line in the XML at which the element occurs. For the sake of saving space I omit error checking.

#include <libxml++/libxml++.h>

xmlpp::DomParser doc;

// 'response' contains your HTML
doc.parse_memory(CleanHTML(response));

xmlpp::Document* document = doc.get_document();
xmlpp::Element* root = document->get_root_node();

xmlpp::NodeSet elemns = root->find("descendant-or-self::*[@id = 'some_id']");
std::cout << elemns[0]->get_line() << std::endl;
std::cout << elemns.size() << std::endl;

 
Important note about namespaces
Something that took me a while to figure out is that libxml++ requires the full namespace when selecting tags.

This won’t work: *p/text()
But this will: *[local-name() = 'p']/text()

You could also specify the full namespace manually, but I find that local-name() is a much better option.

HTML5
Tidy by default doesn’t seem to support HTML5. For a version of Tidy that does, see here: http://w3c.github.io/tidy-html5/

More info

To compile the example code, I use the g++ flags: `pkg-config --cflags glibmm-2.4 libxml++-2.6 --libs` -ltidy. As the flags suggest, you’ll need the glibmm library in addition to Tidy and libxml++ (and their dependencies.)


See the libxml++ class references: http://developer.gnome.org/libxml++/stable/annotated.html

Revisited! Please see the new article here.

 Posted by at 2:42 pm
May 292012
 

I was looking through my referrer stats the other day and found a neat use of my port of css2xpathhttp://chrispebble.com/inlining-a-css-stylesheet-with-c/.

The author uses css2xpath to aid in inlining CSS stylesheets. Here is an excerpt of the project (I trimmed the first two paragraphs):

This week I worked on adding confirmation emails to our quiz system and was introduced to the maddeningly fickle world of HTML emails…

I had hoped that the detailed confirmation emails could simply re-use the HTML that our system already generated, but after a few test emails it became apparent that most of that HTML relied heavily on external stylesheets which no email client would render. With mounting dread I foresaw the likelihood that I would have to write another rendering engine for emails rather than re-using the one we already had…

Willing to do almost anything to avoid this code duplication I wondered if I could take the HTML rendered by the existing quiz engine and just tweak it to get it work in our emails. So how hard would it be to write some code that took these external stylesheets and applied them as inline rules (which is supported to some degree by most email clients)? Not hard at all as it turns out.

Read about the full project here: http://chrispebble.com/inlining-a-css-stylesheet-with-c/

 Posted by at 5:32 pm