· Waidanian ·

content generation

Selection of nodes in narrative HTML documents: Needledrop Level 0

· work-in-progress ·

While working as a teaching assistant in an introductory programming course designed for Computer Science and Software Engineering students, I realized that students often struggle to grasp the practical importance of recursion when it is first taught to them using binary search trees and suchlike. I designed a new exercise about writing recursive functions that parse narrative XHTML documents: precisely the kind of recursive data structure that most programmers are bound to encounter in their working lives, whether their language is low- or high-level and whether they work with numeric or symbolic data. Here is an impression of the resulting module, along with an earnest proposition to write a new module implementing the same algorithms within an industry-standard framework.

The code presented below is not exactly the same as that developed for the course, and out of respect for the CS department’s policy I will not reproduce the actual document handed out to students, which gave them the proper context and materials along with the functional specs to fulfill. The course material has not been released under an open license (to the best of my knowledge), so I will not apply such a license to this present page. Instead, I am inviting the reader to check back on this site for the upcoming announcement of a separate public GitHub project in a similar application domain – a project that has been inspired by this ‘Level 0’ module, but with more complete functionality and real applicability. The code below could probably be useful, but I will not make a separate repository out of it. What I am offering here is a report on a pedagogical project, the analysis of which could be very useful for teachers and learners alike.

Special thanks go to Jean-Francis Roy, who taught the course and with whom I worked as a TA. He developed an excellent course, supported my initiative and edited my code, in particular helping me figure out what I actually wanted to write in the first_descendant function shown below.

Here is the module’s preamble:

# This program requires Python 3.6 or newer, as it uses the new # formatted string literals, type annotations and other goodies. import xml.etree.ElementTree as ET from typing import Union

We are working with the ElementTree module from the standard library. This is a refreshing alternative to the Document Object Model for random-access into a whole XML document, with a few quirks. In particular, the module cleverly only needs two classes for the actual XML document: one for the whole tree and another for elements with all their associated data. In particular, because everything is attached to element instances, each ‘element’ is actually a structure consisting of the XML element, its attributes as a dictionary, the first child text node if there is one, a list of children elements and the following-sibling text node if there is one. See what this entails? It is somehow reminiscent of Lisp’s car and cdr (oops sorry I mean head and tail ;-) ) constructs. Very pedagogical indeed.

When this was taught, Python 3.6 had not yet been released. Now that it has been, I joyfully add type annotations whenever I can. I appreciate dynamic typing, but I find it very useful for learning purposes that functions and variables can now be annotated with a clear, uniform syntax that does not run contrary to Python’s core syntax. Even though no programmatic use is made of the annotations here, I find them helpful for reasoning about what is being done in each block.

Extracting the text content

The module’s functions accomplish work that is actually useful, but they were conceived first and foremost for their pedagogical value. The function below accomplishes a task that is similarly carried out by a pre-existing function in the ElementTree module: itertext. However, the student is not needlessly replicating work, because I decided to go further and require that certain essential attribute values be included in the text content sequence; itertext does not do that and I don’t know if it would be extensible to do so. My function also requires the student to learn what document order means.

def content_sequence(element: ET.Element) -> str: content: str = '' if element.text is not None: content += element.text for child in element: content += content_sequence(child) content += phraseable_attributes(element) if element.tail is not None: content += element.tail return content

As a useful extra, a simple function appends the values of certain attributes that could be part of the phrasing content’s text nodes. We are not setting out to write a HTML-to-Markdown converter, but this little function alone gives us an idea of what that would entail.

def phraseable_attributes(element: ET.Element) -> str: phraseable_attr: str = '' if 'title' in element.attrib.keys(): phraseable_attr += ' ({})'.format(element.attrib['title']) if ( element.tag == '{http://www.w3.org/1999/xhtml}a' and 'href' in element.attrib.keys() ): phraseable_attr += ' [{}]'.format(element.attrib['href']) if ( element.tag == '{http://www.w3.org/1999/xhtml}img' and 'alt' in element.attrib.keys() ): phraseable_attr += ' [img alt: {}]'.format(element.attrib['alt']) return phraseable_attr

Selecting the first descendant by name

The following function can be considered equivalent to the CSS selector ‘parent descendant:first-of-type’. This is very useful, for example in situations where one needs to extract a content block without having to predict exactly at what depth it lies. The ElementTree module contains some minimal XPath support, but I have explicitly avoided that because in my experience, it confuses people and makes them think that selecting XML nodes is an esoteric operation that can only be carried out by specialized declarative languages. Here, we are working with a general-purpose imperative programming language and yet the function is elegant and simple!

def first_descendant(parent: ET.Element, descendant_qname: str) -> Union[ET.Element, None]: if len(parent) == 0: return for child in parent: if child.tag == descendant_qname: return child inner = first_descendant(child, descendant_qname) if inner is not None: return inner

This is a good starting point for working with narrative XML in real scenarios. Of course, you will tell me that in real working situations the functions should be methods defined in a new class that inherits from ET.Element, and I agree with you! However, I had initially defined my module in this simplistic way because the students were being taught recursion with functions generally defined at the module level. This is legitimate in the sense that Python can also be used in a functional style, in which the use of classes would be more conservative and instead higher-order functions would be defined, for instance. I would have liked going so far as to introduce some classic higher-order functions, but unfortunately there was no time (semesters always seem so short when teaching).