Monday, July 13, 2009

Textual Cleansing, Tokenizing and Reassembly

On the Tinderbox forum, Mark Anderson demonstrated how to cleanse, tokenize and reassemble text using Tinderbox and allied command-line utilities. I extended that work to enable graphical coding with textual output automatically incorporating the analytical coding. This article describes the mechanisms developed for these tasks.

Feel free to download the Tinderbox file illustrating these techniques.

Cleansing

Instead of using external programs to cleanse a source text, as I did with Microsoft Word, Mark's approach is to bring the text into Tinderbox, then to pipe it to sed for cleanup. Mark put the following command into a note's Rule:

$Text |= runCommand("echo" + $Text(The Gettysburg Address) + $Text(Cleanup Code))

Here's what each portion of the command does:

  1. runCommand — invokes the command-line.
  2. echo — prints its arguments (here, the source text) to standard output.
  3. $Text(The Gettysburg Address) — accesses the original text in the note named "The Gettysburg Address".
  4. $Text(Cleanup Code) — accesses the text of another note, named "Cleanup Code", containing the command codes themselves.
  5. $Text |= — assigns the cleansed text back into the current note's $Text attribute. The assignment operator is conditional: if the $Text attribute has a length of zero, then assignment will occur; once $Text has a non-zero length, no further assignment is made. Conditional assignment is essential for efficiency: no use getting your computer to repeatedly do work that needs doing only once!
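The conditional assignment can be pictured with a small Python sketch. This is not Tinderbox code: the helper name assign_if_empty and the dictionary standing in for a note are my own illustrative assumptions, and I am assuming that the right-hand side is not evaluated once the attribute already holds a value.

```python
def assign_if_empty(note: dict, attribute: str, compute) -> None:
    """Set note[attribute] only when it is currently empty (hypothetical helper)."""
    if not note.get(attribute):
        note[attribute] = compute()


calls = []


def expensive_cleanup() -> str:
    calls.append(1)  # record each time the work is really done
    return "Four \tscore \tand"


note = {"Text": ""}
assign_if_empty(note, "Text", expensive_cleanup)  # runs the cleanup
assign_if_empty(note, "Text", expensive_cleanup)  # skipped: Text is no longer empty
```

The second call leaves the note untouched, which is the behaviour that saves the repeated work.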

In Mark's example, the Cleanup Code note contains the following commands:

| sed 's/…/… \t/g'
| sed 's/—/— \t/g'
| sed 's/\\.\\.\\./\\.\\.\\.\t/g'
| sed 's/ / \t/g'


These commands are designed to clean up the following issues:

  • the first line finds all true ellipses, placing a tab after each ellipsis
  • the second line places a tab after em-dashes
  • the third line places a tab after faux ellipses
  • the last line places a tab after every space, so that every word boundary is marked by a tab.

In each case, the g modifier tells sed not to stop once it has made its first match, but to match every instance of the pattern in the text.
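For readers who want to try the substitutions outside Tinderbox, here is a minimal Python sketch of the same four steps. It is an approximation rather than Mark's method: the function name cleanse is hypothetical, and I am assuming that each \t in the sed commands ends up as a real tab character.

```python
import re


def cleanse(text: str) -> str:
    """Apply the four substitutions in the same order as the sed pipeline."""
    text = text.replace("\u2026", "\u2026 \t")  # true ellipsis, then space and tab
    text = text.replace("\u2014", "\u2014 \t")  # em-dash, then space and tab
    text = re.sub(r"\.\.\.", "...\t", text)     # faux ellipsis (three full stops), then tab
    text = text.replace(" ", " \t")             # every space, then tab
    return text


cleansed = cleanse("Four score and seven years ago")
```

Like the g modifier, str.replace and re.sub change every occurrence, not only the first.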

Tokenizing

By marking every word boundary with a tab, Mark makes it possible to use Tinderbox's explode command to tokenize the text. By tokenize, I mean placing every token (that is, every word) in a separate note. With each word in a separate note, the analyst is free to manipulate the individual tokens with all the machinery available to notes.
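The idea of exploding on a delimiter can be sketched in Python as follows. This is only a stand-in for Tinderbox's explode command, not its implementation; the function name and the choice to trim surrounding whitespace are my assumptions.

```python
def explode(cleansed: str, delimiter: str = "\t") -> list[str]:
    """Split cleansed text at each delimiter, one token per resulting item."""
    pieces = (piece.strip() for piece in cleansed.split(delimiter))
    return [piece for piece in pieces if piece]  # drop empty pieces


tokens = explode("Four \tscore \tand \tseven \tyears \tago")
```

Each item in the resulting list corresponds to one note in the Tinderbox file.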

Textual Reassembly

Tinderbox's Nakakoji View is prototypically used for exporting text outlines. Unlike the HTML View, by default Nakakoji View does not require templates to contain the traversal mechanism. Instead, Nakakoji View traverses the outline itself. I (and others) had not realized the value of the Nakakoji View until Mark Anderson demonstrated its use.

Mark Anderson showed us that Nakakoji View is a general-purpose text construction tool. He demonstrated how to reconstruct the entire Gettysburg Address as a readable text, even though every word had been tokenized into a separate note.

The technique is as follows:

  1. Create an agent that retrieves the source tokens. The agent's sort order is set to OutlineOrder. (Alternatively, you could choose to sort by SiblingOrder, which appears to give roughly similar results.)
  2. In the outline view, select the target agent.
  3. Open the Nakakoji View.
  4. Change the Nakakoji option to Selected notes.
  5. Select the appropriate template. The template itself specifies that all children will be assembled into a text with spaces between each word.

The Nakakoji View window then displays the reconstructed text.
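In outline, the reassembly amounts to sorting the tokens and joining them with spaces. The Python sketch below illustrates that idea only; the Token class and its outline_order field are hypothetical stand-ins for Tinderbox notes and their OutlineOrder attribute.

```python
from dataclasses import dataclass


@dataclass
class Token:
    name: str           # the word held by the note
    outline_order: int  # stand-in for the note's OutlineOrder


def reassemble(tokens: list[Token]) -> str:
    """Sort tokens by outline order and join them with single spaces."""
    ordered = sorted(tokens, key=lambda token: token.outline_order)
    return " ".join(token.name for token in ordered)


text = reassemble([Token("score", 2), Token("and", 3), Token("Four", 1)])
```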

Textual Reassembly with Analysis

While reassembling the exploded text into readable form demonstrated the technique, the analyst experiences the real payoff when she can reassemble the text with analytical codes.

To do this, I took Mark Anderson's tokenized text and opened the notes in Map View. My intent was to visually assign markup to the tokens.

The first step is to drag the tokens for a clause complex (i.e. a sentence) into a single line. Here is the first phrase in the first clause complex of the Gettysburg Address.

Tokens in Tinderbox's Map View.

The second step is to begin constructing a set of codes to apply to the text. For some analytical work, it is good practice to design the codes prior to beginning the markup. However, in other cases, it is good to have the flexibility to develop the codes alongside the coding activity itself. For any practical work, the toolset needs to be flexible enough to support both approaches.

In this example, while the coding scheme I used is mature, having been developed over the last 30 years by the likes of M.A.K. Halliday, Ruqaiya Hasan, J.R. Martin and Christian Matthiessen, my entry of the codes proceeded incrementally with the textual analysis.

Each code is created as a Prototype, having a unique colour for immediate visual discrimination.

Lexicogrammatical Codes from Systemic Functional Linguistics' Ideational Metafunction.

With a minimal number of codes in place, I was able to begin depicting my analysis. This image shows the marking of the first nominal phrase.

The circumstantial phrase that opens the Gettysburg Address.

The green note marking the Circumstantial-phrase is an ordinary note whose prototype is Circumstantial-phrase. In laying the note on the map, it is important not to accidentally place the note inside one of the notes already on the map. Although Tinderbox is geared to hierarchically structured spatial outlines, we want to work within the same spatial plane.

When you are applying a code for the very first time in a document, you create the note and place it on the map. Then you press the Enter key to display the Rename dialog. From the Rename dialog's Prototype drop-down list, you select the desired code. In this case, I selected Circumstantial-phrase.

After the first application of a code, whenever you create a note and place it on the map, you open it by pressing the space bar. This displays the note contents. If the Prototype attribute is not already showing in the note content header, you select it as a Key Attribute. When the Prototype Key Attribute is visible, you can select the code from the drop-down list on the right. For the sake of keeping track of the analysis, I elected to type into the note's Name the syntagm for which it coded.

The goal of applying coding in the Map View is two-fold:

  1. To visually work with the coding system, to enable the analyst to see the richness of the coding through spatial and colour discrimination.
  2. To be able to produce a coded textual output capable of being processed and visualised in other programs.


The magic that achieves these goals combines an export template and a set of agents. Here is the export template.

Tinderbox Export Template showing the logic for each item in Map View.

Here's what this code means:

  1. The first if statement prints out the note's Name if the note's Prototype is of SourceTokens. SourceTokens is my name for the tokens representing the source text, i.e., the words of the Gettysburg Address. They are the cream-coloured notes.
  2. The second if statement asks the note whether or not it is an EndTag. (The EndTag attribute is a boolean, which I'll explain soon.) If the note's EndTag boolean attribute is true, the template will print a tag similar to an HTML close tag, e.g.: </Circumstantial-phrase>
  3. The final else statement simply prints the name of the Prototype, wrapped in an HTML-style open code tag, e.g.: <Circumstantial-phrase>

So, this one template caters for three cases:

  1. The note is a word in our original text, in which case it prints the word.
  2. The note is a code, in which case it prints an opening code: <Circumstantial-phrase>
  3. The note is an EndTag for a code, in which case it prints a closing code: </Circumstantial-phrase>
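The three cases can be summarised in a few lines of Python. This is a sketch of the template's logic as described above, not the Tinderbox export code itself; the dictionary keys name, prototype and end_tag are my own stand-ins for the $Name, $Prototype and EndTag attributes.

```python
def render(note: dict) -> str:
    """Return the text one note contributes to the assembled output."""
    if note["prototype"] == "SourceTokens":
        return note["name"]               # case 1: a word of the original text
    if note.get("end_tag", False):
        return f"</{note['prototype']}>"  # case 3: closing code
    return f"<{note['prototype']}>"       # case 2: opening code


word = render({"name": "Four", "prototype": "SourceTokens"})
opening = render({"name": "Four score", "prototype": "Circumstantial-phrase"})
closing = render({"name": "Four score", "prototype": "Circumstantial-phrase", "end_tag": True})
```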

To assemble the text, I used five agents. The agents have the following roles:

  1. To assemble the SourceTokens in the correct sequence.
  2. To assemble the code StartTags in the correct sequence.
  3. To assemble the code EndTags in the correct sequence.
  4. To assemble both StartTags and EndTags in their correct sequences.
  5. To construct the final set of analysed notes, in the correct sequence.

In each case, "correct sequence" means to sort the text by Ypos then by Xpos. SourceTokens at the top of the map are arranged before SourceTokens below them. SourceTokens to the left of the map are also arranged before those on the right. So the placement of the coding notes on the map is the key to indicating exactly where the codes apply to the text.
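The sorting rule is easy to try out in Python. The sketch below assumes that Ypos grows downwards and Xpos grows to the right, and that a start tag sits in a row above the tokens it opens; the coordinates and dictionary keys are invented for illustration and are not taken from the Tinderbox file.

```python
def map_order(notes: list[dict]) -> list[dict]:
    """Sort notes top to bottom, then left to right."""
    return sorted(notes, key=lambda note: (note["ypos"], note["xpos"]))


notes = [
    {"text": "score", "ypos": 1, "xpos": 1},
    {"text": "</Circumstantial-phrase>", "ypos": 1, "xpos": 2},  # EndTag after the last token
    {"text": "<Circumstantial-phrase>", "ypos": 0, "xpos": 0},   # start tag in the row above
    {"text": "Four", "ypos": 1, "xpos": 0},
]

assembled = " ".join(note["text"] for note in map_order(notes))
```

Because the sort key is the pair (ypos, xpos), Xpos only decides the order among notes that share the same Ypos.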

SourceTokens agent. It collects notes from the Map View and arranges them in order of their spatial arrangement in the map.

Of the five agents, I tried hard to avoid designing the EndTags agent the way I did. What I wanted to do was to have the StartTags agent generate an alias to represent the starting of the tag using the Ypos and Xpos attributes. I then wanted the EndTags agent to generate from the same coded note the YExtent and XExtent, i.e. the bottom right corner of the note. I then wanted to combine these two sets of aliases to be able to produce the HTML-like opening and closing tags. Due to the design of Tinderbox aliases, it is simply not possible to do this.

Instead, I was forced to manually duplicate the tag on the Map View, and then arrange it to follow the syntagm that it closes.

Tinderbox Lexicogrammatical coding on the Map View, illustrating the use of EndTags.

In the above example, the green and blue bars descending from the phrase codes represent the EndTag notes. To create the EndTag notes, I:

  1. Duplicated the coded note.
  2. Reshaped it so that its Ypos begins at the same level as the SourceTokens, and the Xpos begins after the final SourceToken in the phrase.
  3. Applied a stamp that set the EndTag boolean attribute to true.
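The three manual steps can be expressed as a short Python sketch. It only illustrates the bookkeeping involved and is not something Tinderbox does for you: the function name make_end_tag, the width key and the dictionary layout are all assumptions of mine.

```python
def make_end_tag(code_note: dict, phrase_tokens: list[dict]) -> dict:
    """Copy a code note and move the copy to just after the phrase's last token."""
    last = max(phrase_tokens, key=lambda token: token["xpos"])
    end_tag = dict(code_note)                       # step 1: duplicate the coded note
    end_tag["ypos"] = last["ypos"]                  # step 2: same level as the tokens
    end_tag["xpos"] = last["xpos"] + last["width"]  # step 2: begin after the final token
    end_tag["end_tag"] = True                       # step 3: what the stamp sets
    return end_tag


code = {"prototype": "Circumstantial-phrase", "ypos": 0, "xpos": 0, "end_tag": False}
phrase = [
    {"name": "Four", "ypos": 1, "xpos": 0, "width": 2},
    {"name": "score", "ypos": 1, "xpos": 2, "width": 2},
]
end = make_end_tag(code, phrase)
```

The original code note is left unchanged, so it still sorts ahead of the tokens as the opening tag.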

Here is what the Gettysburg Address looks like. Only the first sentence is coded.

The complete Gettysburg Address in Map View. The first sentence is fully coded.

Here's a closer view of the first sentence.

Beginning of the first sentence of the Gettysburg Address, coded against a minimal set of ideational lexicogrammatical tags.

After generating the final, coded text through the Nakakoji View, this is the output.

The assembled output, where the first sentence is coded with a very basic set of Systemic Functional Linguistic Ideational codes.

Conclusion

The outlined techniques represent a non-obvious application of a few of the affordances provided by Tinderbox for the purpose of textual analysis. The collaboration that emerged through the serial one-upmanship on the Tinderbox forum enabled Mark Anderson and me to conceive and implement the techniques. While the mechanisms are innovative, there are some very real issues in conducting a textual analysis study using them. A later article will outline the benefits and drawbacks of these techniques.

Next Article: Coding with Footnotes and Links
