· Waidanian ·
content generation
· Waidanian ·
content generation
Selection of nodes in narrative HTML documents: Needledrop Level 0
· work-in-progress ·
While working as a teaching assistant in an introductory programming course designed for Computer Science and Software Engineering students, I realized that students often struggle to grasp the practical importance of recursion when it is first taught to them using binary search trees and suchlike. I designed a new exercise about writing recursive functions that parse narrative XHTML documents: precisely the kind of recursive data structure that most programmers are bound to encounter in their working lives, whether their language is low- or high-level and whether they work with numeric or symbolic data. Here is an impression of the resulting module, along with an earnest proposition to write a new module implementing the same algorithms within an industry-standard framework.
The code presented below is not exactly the same as that developed for the course, and out of respect for the CS department’s policy I will not reproduce the actual document handed out to students, which gave them the proper context and materials along with the functional specs to fulfill. The course material has not been released under an open license (to the best of my knowledge), so I will not apply such a license to this present page. Instead, I am inviting the reader to check back on this site for the upcoming announcement of a separate public GitHub project in a similar application domain – a project that has been inspired by this ‘Level 0’ module, but with more complete functionality and real applicability. The code below could probably be useful, but I will not make a separate repository out of it. What I am offering here is a report on a pedagogical project, the analysis of which could be very useful for teachers and learners alike.
Special thanks go
to Jean-Francis
Roy, who taught the course and with whom I worked as a
TA. He developed an excellent course, supported my
initiative and edited my code, in particular helping me
figure out what I actually wanted to write in
the first_descendant
function shown below.
Here is the module’s preamble:
# This program requires Python 3.6 or newer, as it uses the new
# formatted string literals, type annotations and other goodies.
import xml.etree.ElementTree as ET
from typing import Union
We are working with the ElementTree module from the
standard library. This is a refreshing alternative to the
Document Object Model for random-access into a whole XML
document, with a few quirks. In particular, the module
cleverly only needs two classes for the actual XML document:
one for the whole tree and another for elements with all
their associated data. In particular, because everything is
attached to element instances, each ‘element’ is actually a
structure consisting of the XML element, its attributes as a
dictionary, the first child text node if there is one, a
list of children elements and the following-sibling text
node if there is one. See what this entails? It is somehow
reminiscent of Lisp’s car
and
cdr
(oops sorry I mean head
and tail
;-)
) constructs. Very pedagogical indeed.
When this was taught, Python 3.6 had not yet been released. Now that it has been, I joyfully add type annotations whenever I can. I appreciate dynamic typing, but I find it very useful for learning purposes that functions and variables can now be annotated with a clear, uniform syntax that does not run contrary to Python’s core syntax. Even though no programmatic use is made of the annotations here, I find them helpful for reasoning about what is being done in each block.
Extracting the text content
The module’s functions accomplish work that is actually
useful, but they were conceived first and foremost for their
pedagogical value. The function below accomplishes a task that
is similarly carried out by a pre-existing function in the
ElementTree module: itertext
. However,
the student is not needlessly replicating work, because I
decided to go further and require that certain essential
attribute values be included in the text content
sequence; itertext
does not do that and I don’t
know if it would be extensible to do so. My function also
requires the student to learn what document order
means.
def content_sequence(element: ET.Element) -> str:
content: str = ''
if element.text is not None:
content += element.text
for child in element:
content += content_sequence(child)
content += phraseable_attributes(element)
if element.tail is not None:
content += element.tail
return content
As a useful extra, a simple function appends the values of certain attributes that could be part of the phrasing content’s text nodes. We are not setting out to write a HTML-to-Markdown converter, but this little function alone gives us an idea of what that would entail.
def phraseable_attributes(element: ET.Element) -> str:
phraseable_attr: str = ''
if 'title' in element.attrib.keys():
phraseable_attr += ' ({})'.format(element.attrib['title'])
if ( element.tag == '{http://www.w3.org/1999/xhtml}a'
and 'href' in element.attrib.keys() ):
phraseable_attr += ' [{}]'.format(element.attrib['href'])
if ( element.tag == '{http://www.w3.org/1999/xhtml}img'
and 'alt' in element.attrib.keys() ):
phraseable_attr += ' [img alt: {}]'.format(element.attrib['alt'])
return phraseable_attr
Selecting the first descendant by name
The following function can be considered equivalent to the
CSS selector ‘parent descendant:first-of-type
’.
This is very useful, for example in situations where one
needs to extract a content block without having to predict
exactly at what depth it lies. The ElementTree module
contains some minimal XPath support, but I have explicitly
avoided that because in my experience, it confuses people and
makes them think that selecting XML nodes is an esoteric
operation that can only be carried out by specialized
declarative languages. Here, we are working with a
general-purpose imperative programming language and yet the
function is elegant and simple!
def first_descendant(parent: ET.Element,
descendant_qname: str) -> Union[ET.Element, None]:
if len(parent) == 0:
return
for child in parent:
if child.tag == descendant_qname:
return child
inner = first_descendant(child, descendant_qname)
if inner is not None:
return inner
This is a good starting point for working with narrative
XML in real scenarios. Of course, you will tell me that in
real working situations the functions should be methods
defined in a new class that inherits
from ET.Element
, and I agree with you! However, I
had initially defined my module in this simplistic way because
the students were being taught recursion with functions
generally defined at the module level. This is legitimate in
the sense that Python can also be used in a functional style,
in which the use of classes would be more conservative and
instead higher-order functions would be defined, for
instance. I would have liked going so far as to introduce some
classic higher-order functions, but unfortunately there was no
time (semesters always seem so short when teaching).