Recent Articles

Entries in Textual Analysis in Tinderbox (16)

Monday
Aug 03, 2009

Slicing through the Knowledge Café

Many knowledge management programs seek to gain ideas, insight and feedback from diverse members of the workforce. One type of exercise designed to generate diverse ideas is the Knowledge Café.

In the Knowledge Café, people from diverse areas of the business sit themselves around a table and engage in a conversation about a particular topic. Ideas that are stimulated by the discussion are captured by the moderator, written on the table cloth, or inscribed by participants on post-it notes and stuck to a wall. An exercise of this nature can rapidly generate many hundreds of comments.

Once the event is successfully concluded, and the human resources department have more than a thousand responses, a range of questions naturally emerge:

  • What do we do with the comments?
  • How can we possibly get a handle on them all?
  • How can we give feedback to busy managers?
  • How do we characterise the results of the discussion?
  • What is the general mood of the organisation?

 

What not to do

Many corporates have fallen prey to consultancies who sell them on the idea of having their experts interpret the results of the knowledge café. The reports come back with fancy language and heatmaps and recommendations that generally necessitate further study. And the organisation, whether actively or passively, resists.

Self-interpretation is a better approach

It is far better to allow staff and management to interpret the results of a knowledge café. Not being analysts, staff and management obviously cannot get their heads around the entire corpus. What they need is a way to slice through the corpus to access the comments relevant to the person, the role or the current need.

How to build a knowledge slicing machine with Tinderbox

My approach is to build a mechanism to slice the comments into multiple vectors. With those vectors, I generate HTML pages that enable staff and management to access the cluster of topics that interest them.

To build a knowledge slicing machine:

  1. Using a concordancing tool, generate a list of all words included in the comments. Exclude common words with a stoplist. Cluster multiple word forms (stemming or lemmatization). Sort by frequency of usage.
  2. Import these words into Tinderbox and throw them onto a map sorted alphabetically. Cluster the words into semantically related sets.
  3. For each set, construct an agent that returns all comments containing the key terms in the semantic set.
  4. Import into Tinderbox the comments themselves, storing the comments themselves in a carefully marked container. Watch as the agents suddenly cluster their underlying comments.
  5. Sort the agents by number of children. Take the largest result sets and begin to construct agents that cluster sub-topics. Structure a range of sub-topics.
  6. Having sliced and diced the comments a hundred different ways, export them into HTML for general sharing back to the organisational community.
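
Step 1 can be approximated without a dedicated concordancing tool. What follows is a minimal Python sketch, not the tool used on this project: the stoplist, the crude suffix-stripping rule and the sample comments are all invented for illustration.

```python
import re
from collections import Counter

# Hypothetical stoplist; a real one would be far longer.
STOPLIST = {"the", "a", "an", "and", "of", "to", "is", "are", "in", "we", "our"}

def crude_stem(word):
    """Very rough suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def concordance(comments):
    """Return (stem, count) pairs sorted by descending frequency."""
    counts = Counter()
    for comment in comments:
        for word in re.findall(r"[a-z']+", comment.lower()):
            if word not in STOPLIST:
                counts[crude_stem(word)] += 1
    return counts.most_common()

# Invented comments, not data from the knowledge café.
comments = [
    "Staff need better incident reports",
    "Reporting incidents is slow",
    "The staff are waiting",
]
word_list = concordance(comments)
```

The resulting word list is what would be imported into Tinderbox in step 2. A proper stemmer or lemmatizer would cluster word forms far more reliably than the three-suffix rule used here.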

 

Example from a large government trading organisation

The example presented here consists of 1042 comments that have been clustered into 249 semantic sets. I'm not at liberty to actually show you the comments themselves, but whenever you see the disclosure triangles you can be sure the comments are lurking just a click away.

Semantic sets, sorted by volume of comments.

These terms are not simple word searches, as they cluster a range of words that signify a particular meaning. For example, the staff query is written to aggregate comments with the words staff, employee and people.
$Prototype=Comment&(Text(staff)|Text(employee)|Text(people))

In this organisation, "people" is a synonym of employee; never of passenger. Here's the comparison with the passenger query.
$Prototype=Comment&(Text(customer)|Text(passenger)|Text(crowd))


One of the major topics in this discussion surrounded incidents. The incident topic consists of a range of sub-topics, which are ordered in this image by the number of comments within the sub-topic.

Major topics structured into sub-topics.
Again, each of these groupings is powered by queries that build on the initial semantic sets. Here's the query for "incident alerting".
inside(incident)&Text(report)
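
To make the logic of these queries concrete outside Tinderbox, here is a rough Python analogue. The term lists are taken from the queries above; the sample comments and the function names are hypothetical, and the case-insensitive substring match only approximates what Tinderbox's Text() operator does.

```python
import re

# Each semantic set maps a label to the words that signal it.
SEMANTIC_SETS = {
    "staff": ["staff", "employee", "people"],
    "passenger": ["customer", "passenger", "crowd"],
    "incident": ["incident"],
}

def matches(comment, terms):
    """True if any term occurs anywhere in the comment, ignoring case."""
    return any(re.search(re.escape(term), comment, re.IGNORECASE) for term in terms)

def run_agents(comments, semantic_sets):
    """One result list per semantic set, like one agent per set."""
    return {
        label: [c for c in comments if matches(c, terms)]
        for label, terms in semantic_sets.items()
    }

def sub_topic(parent_results, term):
    """Analogue of inside(incident)&Text(report): filter a parent's results."""
    return [c for c in parent_results if matches(c, [term])]

# Invented comments, not data from the organisation.
comments = [
    "Staff should report every incident promptly",
    "Passengers were confused during the incident",
    "People need more training",
]
results = run_agents(comments, SEMANTIC_SETS)
incident_alerting = sub_topic(results["incident"], "report")
```

A comment can land in several sets at once, which is the point of slicing: the first sample comment appears under both staff and incident.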

One of the key needs of the Knowledge Management function within the Human Resources department of this organisation was to get a handle on natural language in context. To assist in this, I structured various key terms into a thesaurus-like outline.

Top level of the Thesaurus outline.
The image that follows shows a subset of the unfurled ‘people’ outline.

The ‘people’ branch of the thesaurus outline.
Not only is it useful to identify a structure of the nouns in the language, but also of the verbs, modal auxiliaries and key objects that signify change, as shown below.

Just one screenful of the ‘change’ thesaurus outline.
The terminology used as signifiers of change was interesting enough to classify semantically; separators group related usages.

The analysis for the thesaurus outline was all performed manually. But including it within the tool enables people to link across to the comments that embody the term.

HTML Export
All images in this article represent views of the Tinderbox work environment. In order to share the ideas freely across the organisation, the data is exported into HTML pages. The HTML captures all the views of the data shown in this article. Each semantic set and topic structure is presented with word clouds auto-generated into each major grouping.

Sample word cloud auto-generated into the HTML output.

Analytical Reuse

In addition to the inherent usefulness for slicing through the mass of data in order to derive meaning, a major benefit of this approach is the analytical reuse. Subsequent knowledge cafés in this organisation already have a pre-built analytical framework. Leveraging this framework for further knowledge café events involves:

  1. Pouring the comments into the tool.
  2. Testing for unique terms not yet covered; extending as needed.
  3. Automatic generation into HTML.

Because of this approach to dynamically building result sets based on a domain-specific semantic analysis, further knowledge cafés have been processed within a few hours. Virtually all of those hours are spent performing step 2.
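
Since step 2 consumes most of the time, it helps to see what "testing for unique terms not yet covered" amounts to. This is a small Python sketch under my own assumptions: the function name, the stoplist and the sample comments are invented, and it compares whole words rather than reproducing Tinderbox's matching.

```python
import re

def uncovered_terms(comments, semantic_sets, stoplist):
    """Words used in the new comments that no existing semantic set mentions."""
    covered = {term.lower() for terms in semantic_sets.values() for term in terms}
    seen = set()
    for comment in comments:
        seen.update(re.findall(r"[a-z']+", comment.lower()))
    return sorted(seen - covered - stoplist)

# Existing framework (terms from the queries shown earlier) and invented new comments.
semantic_sets = {
    "staff": ["staff", "employee", "people"],
    "passenger": ["customer", "passenger", "crowd"],
}
new_comments = ["Staff want clearer rosters", "The crowd blocked the platform"]
candidates = uncovered_terms(new_comments, semantic_sets, stoplist={"the", "want"})
```

Each candidate term is then either added to an existing semantic set or used to seed a new one.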

Tinderbox performs admirably running a document containing ~300 agents. When you consider the minimal time for constructing the initial analysis, the opportunities for reuse, and the flexibility in constructing solutions to new challenges that may arise, you realize that Tinderbox provides an ideal base for building custom, domain-specific knowledge management applications.

Sunday
Jul 26, 2009

Coding with Footnotes and Links—The Mechanics

Introduction


The GettysburgCFL design target enables the coding of texts in a range of humanities disciplines, particularly targeting sociological and linguistic textual analysis. While the GettysburgCFL is not completely generalised, it forms a well-conceived prototype that can be adapted to fit a range of textual coding needs. It demonstrates the Tinderbox community's current known best-practice in the coding of texts within Tinderbox.

You can download the GettysburgCFL Tinderbox document.
You can read the user documentation.

This article describes the design of the mechanisms so that you can adapt it to your own textual coding needs. It is part of a larger series exploring the range of practices involved in textual analysis, the extent to which the current Tinderbox affordances support the required practices, and what further affordances may benefit textual analysts involved in analyzing legal and literary texts.

Footnote tool


GettysburgCFL anticipates analysts importing the text into Tinderbox as a single note. That single note is decomposed into constituent parts (paragraphs, sentences, clauses, phrases, words, morphemes) using a combination of the Tinderbox Explode tool (for paragraphs) and the Tinderbox Footnote tool.

See the user documentation for details.

Codes


The codes that you wish to assign to units of text should be created as notes inside the Codes section. Each code consists of:

  1. Name attribute: Set to the human-readable name of the code.
  2. CodeName attribute: Set to the machine-readable name of the code. In my case, I'm outputting HTML-like code fragments, so the machine-readable names conform to the subset of characters that are valid in HTML codes.
  3. Prototype: For the subset of your codes that are mutually exclusive, you may optionally choose to make the note realizing your code a prototype. If you do choose to use a prototype, assign a Color or BorderColor that visually represents the code in your mind.

Codes can form a hierarchy. You can see the way I have formed a hierarchy within the Ideational set of codes.

You may only use prototype inheritance when your codes are mutually exclusive. If a subset of codes are mutually exclusive, you can use prototype inheritance on that subset; don't apply inheritance to the remainder. If your mutually exclusive subset forms the dominant coding priority, then it definitely makes sense to use prototype inheritance.

Prototype inheritance is achieved by using both links and rules, as described below.

Links


GettysburgCFL uses several typed links. The link-types are leveraged by rules to automate assignment of prototypes and code fragments to the notes. Because of this, the name of each typed link must match the naming within the rules.

GettysburgCFL uses the following typed links:

  • •Ideational—the dominant coding priority in GettysburgCFL.
  • •CodeLink—all codes that are not guaranteed to be mutually exclusive with the dominant coding priority are assigned using the type CodeLink.
  • •StartCodes—the beginning of a code assigned to a note where the closing code for the syntagm does not exist within the note. This allows for decomposition of the document to suit the dominant coding priority, while allowing other codes to be assigned across multiple notes.
  • •EndCodes—used to link the note that ends the coded syntagm that was begun using the •StartCodes link type assigned to a different note.
  • •ContainedCodes—used to indicate that the syntagm held by the note participates in a syntagm that spans notes, but this particular note neither begins nor ends the syntagm.


You may wish to generalize the •Ideational name, which is specific to a coding system derived from Systemic Functional Linguistics. If you do, it should probably be called something like: •PrototypedLink. If you do change the name of this link type, you also have to change the rules written in the note named TextPart. (The rules are in the note, not in the note's Rules attribute.)

Rules


All the rules that activate the links are stored in a single note: TextPart. Here's an explanation of each rule.

$Prototype=links.outbound.•Ideational.$Name;

This rule takes the name of the Code, and assigns it as the Prototype of the note being coded. It assumes that the codes to which •Ideational links are assigned are Prototypes.

$Codes=$Codes+links.outbound.•Ideational.$CodeName;
$Codes=$Codes+links.outbound.•CodeLink.$CodeName;

These rules find the machine-readable code fragment held by the Code and assign it to the set attribute $Codes held in the coded note. The first statement does it for the dominant coding priority; the second statement for all other codes.

$StartCodes=$StartCodes+links.outbound.•StartCodes.$CodeName;
$ContainedCodes=$ContainedCodes+links.outbound.•ContainedCodes.$CodeName;
$EndCodes=$EndCodes+links.outbound.•EndCodes.$CodeName;

These rules take the codes assigned through each of the typed links, and assign them to the corresponding set attribute held by the note.

$AllCodes=$StartCodes+$Codes+$ContainedCodes+$EndCodes;

This is a convenience for the analyst: it combines all the different types of codes into a single set attribute.
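
The combined effect of these rules can be modelled in a few lines of Python. This is only a sketch of the logic as described above, not Tinderbox code: the class and function names are hypothetical, and it assumes that a note without an •Ideational link simply keeps whatever prototype it already had.

```python
from dataclasses import dataclass, field

@dataclass
class Code:
    name: str       # human-readable, e.g. "Circumstance:Temporal"
    code_name: str  # machine-readable, e.g. "Circumstance_Temporal"

@dataclass
class TextNote:
    text: str
    links: list = field(default_factory=list)  # (link_type, Code) pairs
    prototype: str = ""
    codes: set = field(default_factory=set)
    start_codes: set = field(default_factory=set)
    contained_codes: set = field(default_factory=set)
    end_codes: set = field(default_factory=set)
    all_codes: set = field(default_factory=set)

def linked(note, link_type):
    """Codes reached from this note through links of the given type."""
    return [code for kind, code in note.links if kind == link_type]

def apply_rules(note):
    """Mirror the TextPart rules: set the prototype, then accumulate code names."""
    ideational = linked(note, "•Ideational")
    if ideational:
        note.prototype = ideational[0].name
    note.codes |= {c.code_name for c in ideational}
    note.codes |= {c.code_name for c in linked(note, "•CodeLink")}
    note.start_codes |= {c.code_name for c in linked(note, "•StartCodes")}
    note.contained_codes |= {c.code_name for c in linked(note, "•ContainedCodes")}
    note.end_codes |= {c.code_name for c in linked(note, "•EndCodes")}
    note.all_codes = (
        note.start_codes | note.codes | note.contained_codes | note.end_codes
    )

phrase = TextNote(
    "Fourscore and seven years ago",
    links=[
        ("•Ideational", Code("Circumstance:Temporal", "Circumstance_Temporal")),
        ("•CodeLink", Code("Theme", "Theme")),
    ],
)
apply_rules(phrase)
```

Note that every rule only ever adds to a set. In this model, deleting a link and running the rules again leaves the old code name in place, which is the situation the cleanup agent exists to handle.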

Rule Assignment


It is essential that the rules encoded in TextPart are assigned to all the notes containing source text fragments. GettysburgCFL achieves this through the AssignTextPartRule agent. The agent's query is:

descendedFrom(Gettysburg Address)

Unless you're analyzing the Gettysburg Address, you will have to change the name of the note, and therefore the name of the text pattern in this query. If you have multiple documents, you will need to change the scope of the query to ensure that all notes containing text fragments are retrieved by the query.

The agent's action is:

$Rule=$Text(TextPart)

which simply means to take the text from the TextPart note and assign it to the Rule attribute for all the notes retrieved by the query.

Why didn't I just use prototype inheritance?
When I perform textual analysis using Tinderbox, I typically assign the decomposed text notes a Prototype named something like, "SourceText." I use it to hold attributes that I want all my source notes to share.

GettysburgCFL does not conform to that pattern, because I am concerned that the assignment and reassignment of prototypes may cause inconsistent inheritance across all the syntagm fragments. Should that occur, the mechanism is likely to break down.

Notionally, this shouldn't occur if all code-prototypes inherited from the SourceText. (Much like all Smalltalk classes inherit from Object.) But I hadn't conceived of that design in time to apply it to this document; and, one might still run into problems, especially if you get adventurous trying to automate additional facets of the analytical mechanism. Also, I have no particular default Prototype set in this document, which would be essential were you wanting to use prototype inheritance.

Cleanup


When you code a note, then change your mind about the applicability of the coding, you potentially leave the old code recorded within the note's attributes, even though you've removed the links. To clean these up:

  1. Enable the Cleanup_RunOnceThenSwitchOff agent.
  2. Select File > Update now.
  3. Switch off the agent.

The agent completely clears out all code fragment fields in all source text notes. This allows the code fragment fields to be rebuilt from the current links only.

Nakakoji template


This is the Nakakoji template I'm using to export the text.

^value(format($StartCodes, "", "<", ">", ""))^^value(format($Codes, "", "<", ">", ""))^^if(^children^)^^children(/TEMPLATES2/•LGOutput)^^else^ ^title^^endIf^^value(format($Codes, "", "</", ">", ""))^^value(format($EndCodes, "", "</", ">", ""))^

It creates text output along these lines:

<Clause_Independent> <Circumstance_Temporal><Theme> Fourscore and seven years ago</Circumstance_Temporal></Theme><Rheme><Nominal_Actor> our fathers</Nominal_Actor><Process_Relational_Existence> brought forth</Process_Relational_Existence><Circumstance_Locative> on this continent</Circumstance_Locative><Nominal_Goal> a new nation</Nominal_Goal></Rheme> </Clause_Independent>

The core of the Nakakoji template is this decision:

^if(^children^)^^children(/TEMPLATES2/•LGOutput)^^else^ ^title^^endIf^

It asks: "Are you a leaf node, or not? If you have children, I'll go and see what they want me to do. But if you're a leaf node, I'll output your text."

The wrapping code before the core decision is:

^value(format($StartCodes, "", "<", ">", ""))^^value(format($Codes, "", "<", ">", ""))^

It says: "Let me format all your StartCodes and Codes fragments as HTML-like tags." The matching segment at the back does the same thing, except that it issues the closing HTML-like tags.

Template Summary: The template says, "I will descend the Tinderbox outline containing the text. All the codes assigned at any level of rank are output into the Nakakoji, but only leaf nodes emit text."

For this reason, a text can be coded at varying levels of depth, but any one branch must be coded at the same level of depth to ensure no text is skipped.
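
The template's behaviour can be paraphrased as a short recursive function. The Python below is a sketch of that logic rather than anything Tinderbox runs: the Note class and render function are hypothetical names, codes are held in lists so that their order is predictable, and the incidental spaces the real template emits around the clause tags are not reproduced.

```python
from dataclasses import dataclass, field

@dataclass
class Note:
    title: str
    start_codes: list = field(default_factory=list)
    codes: list = field(default_factory=list)
    end_codes: list = field(default_factory=list)
    children: list = field(default_factory=list)

def render(note):
    """Opening tags, then either the children or the leaf text, then closing tags."""
    opening = "".join(f"<{code}>" for code in note.start_codes + note.codes)
    if note.children:
        body = "".join(render(child) for child in note.children)
    else:
        body = " " + note.title  # only leaf notes emit text
    closing = "".join(f"</{code}>" for code in note.codes + note.end_codes)
    return opening + body + closing

clause = Note(
    "Clause",
    codes=["Clause_Independent"],
    children=[
        Note("Fourscore and seven years ago", codes=["Circumstance_Temporal", "Theme"]),
        Note("our fathers", start_codes=["Rheme"], codes=["Nominal_Actor"]),
        Note("brought forth", codes=["Process_Relational_Existence"]),
        Note("on this continent", codes=["Circumstance_Locative"]),
        Note("a new nation", codes=["Nominal_Goal"], end_codes=["Rheme"]),
    ],
)
output = render(clause)
```

Because a note with children never emits its own title, any text held only at a parent level disappears from the output, which is why each branch has to be coded to a uniform depth.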

 

Next article: Visualising Textual Analyses

Saturday
Jul 25, 2009

Coding with Footnotes and Links

On the Tinderbox forum, Jean Goodwin inspired Paul Walters to demonstrate a technique by which text can be broken down using the footnote tool and codes assigned by creating a link to notes representing individual codes within the coding system. Jean then demonstrated how rules can be used to aggregate the codes and assign prototypes. I've refined their demonstrations, overcoming several limitations with respect to coding and output.

Feel free to download the Tinderbox file illustrating these techniques.

This article describes how to use the Tinderbox file. It covers:

  1. An overview of the analytical workspace.
  2. A walk-through as to how to code using the mechanism.
  3. Adding new codes.
  4. Changing the coding.
  5. Coding from multiple dimensions.
  6. Coding disjoint syntagms.
  7. Assumptions.

The mechanics are described in this followup article.

The Workspace


This is my preferred workspace.

Preferred workspace for textual analysis coding.

The outline contains my Prototypes, Agents and Templates sections, which I have come to adopt as a standard for all my Tinderbox projects. The Gettysburg Address note introduces the text itself, and could very easily contain the entire text. (It doesn't in my example file, but to be fully consistent with the logic I'm describing here, it could.)

The Gettysburg Address text is presented hierarchically, starting with the overall text itself (Gettysburg Address), then showing paragraphing (Paragraph 1, Paragraph 2, Paragraph 3), then each sentence, followed by each clause, with phrases below that.

The Codes Outline is really a subset of the initial outline. I've opened this additional outline so that the Codes will always be available, which is extremely important in making this technique work productively.

The Map view allows for additional visualisation opportunities. In this case, I've purposely aligned the Process (i.e. verbal phrase) in each clause. The fact that the map provides for additional juxtapositional possibilities represents a significant benefit relative to the map-based coding system illustrated in my previous article.

How to code with footnotes and links


This text is only partially coded. I'm going to navigate to a section of text that is not yet coded, and then begin the coding process. I'll walk you through the process step by step. You'll find the system pretty easy to use.

Partially analysed text, prior to coding.

Paragraph 2 has previously been broken into sentences (represented by the code ClauseComplex). The sentence starting "We have come…" has also been broken down into clauses. The first clause's phrases have been analysed. So, I'll continue by analysing the phrases in the projected clause "to dedicate…".

Select the note "to dedicate…" and press <space>. The note appears. (Ignore the thicket of attributes for now. Each one will be explained shortly.)

The note representing a clause, prior to any additional coding.

Mark out the next phrase.

Verbal phrase highlighted within the clause.

Then execute the Add Footnote As Child command by pressing Command+Shift+F. A second note appears on the screen.

Note representing the verbal phrase "to dedicate," popped up over top of its parent clause.

Now press the link icon and select the target code. The link window appears, and I choose the •Ideational link-type.

The • in •Ideational is a handy way of identifying the special link types in this application. The identifier •Ideational signifies that links of this type represent codes from the Ideational metafunction. Systemic Functional Linguistics posits that there are three metafunctions in language: Ideational, Interpersonal and Textual. Of these, the Ideational metafunction has two sub-components, Experiential and Logical. Hence my terminology.

NOTE: I've coded it as a Mental process (Process:Mental), because that's in my currently-available code-set, even though I suspect it should probably be coded as Process:Verbal. I'll return and clean this up soon.

Selecting an •Ideational link type.

Now close the note representing the new phrase. The original note now has a text link in it. If you click on the text link (Command+Option+click) it will open the child note representing the phrase we've just coded.

The clause with the linked verbal phrase.

Notice the change in the map view. The verbal phrase "to dedicate" now appears within the clause. More than that, the Process:Mental prototype has been automatically applied to the phrase. (Magic? No: But, like any sufficiently advanced technology, it is indistinguishable from magic until you're trained in the art: Read on.)

Map showing the results of the coding.

Now I'm going to quickly repeat this process to code the nominal and circumstantial phrases. Look at the outline view to see the effect.

Outline view showing the newly-coded phrases.

The outline view is the easiest means of traversing the rank hierarchy. It provides a random-access entry point to the text being coded, whereas the map view provides a visually stimulating means for considering the outcome of the coding. By careful arrangement of the inside container views on the map, the analyst can see two levels of rank at a time, giving reasonable context for analyzing the subsequent coded text.

So that's the basic coding process. To see what the final coded output looks like:

  1. Select the Gettysburg Address note.
  2. Create a New Nakakoji View.
  3. Choose the option Selected notes.
  4. Choose the Text export template /TEMPLATES2/•LGOutput


Now you can see the marked up text using an HTML-inspired coding representation.

Nakakoji output of the Gettysburg Address. Text is unevenly analysed, resulting in only a portion of the speech being visible.

Adding new codes


It's easy to expand the codes while the coding is in progress. Earlier, I really wanted a Process:Verbal code. So, let's add it now.

Simply locate the Process hierarchy in the Codes window. Select Process:Mental and duplicate it (Command+d). Press the space bar, and change:

  1. Name from Process:Mental to Process:Verbal.
  2. CodeName from Process_Mental to Process_Verbal.


There we have it. A new code. Now it's just a matter of changing the original coding.

Changing the coding


Changing the coding is slightly more complicated than you'd imagine. What you have to do is this:

  1. Locate the note and open it.
  2. Delete the existing link. (Within the note, open the links window, highlight the link, press delete.)
  3. Create a new link to the code.


And then, just before you're ready to generate your Nakakoji view, you should switch on the agent Cleanup_RunOnceThenSwitchOff. The agent scavenges the remnants of old codes and gets rid of them. (Remember to switch it off. Unpredictably nasty things will appear in your Nakakoji if you leave it enabled. You have been warned.)

Coding from multiple dimensions


As I've mentioned previously, Systemic Functional Linguistics calls for coding text from multiple perspectives. Many coding systems do. The mechanisms within this approach allow for multiple codes to be attached to any fragment of text.

Although the sample file allows for multiple codes, the Ideational perspective is privileged over Logical, Interpersonal, Textual or Cohesion codes in two ways:

  1. By being the only code hierarchy whose codes are Tinderbox prototypes.
  2. By forming the principle by which the text is decomposed into finer levels of rank.

When you link a syntagm with an ideational code using the •Ideational link, the code is added to the coded note's Prototype attribute.

So, to code from multiple parts of the hierarchy:

  1. Use at least one Ideational code, marked with an •Ideational link-type.
  2. Any other codes must be assigned using the •CodeLink link-type.

(Unpredictable things happen if you have multiple links typed as •Ideational.)

Coding disjoint syntagms


In the sample file, in all the areas I've finished coding, I've finished at the phrase rank. Each phrase attracts opening and closing HTML-style tags in the Nakakoji view. Because the text is decomposed according to the ideational perspective, neither the phrase level nor the clause level represents the starting and ending of some tags. Paragraph 1 illustrates this.

The phrase, "Fourscore and seven years ago," can be linked to the Textual code of Theme. But the textual Rheme consists of the remainder of the phrases. We want Nakakoji output like this:

<Clause_Independent> <Circumstance_Temporal><Theme> Fourscore and seven years ago</Circumstance_Temporal></Theme><Rheme><Nominal_Actor> our fathers</Nominal_Actor><Process_Relational_Existence> brought forth</Process_Relational_Existence><Circumstance_Locative> on this continent</Circumstance_Locative><Nominal_Goal> a new nation</Nominal_Goal></Rheme> </Clause_Independent>

Not with repeating <Rheme> and </Rheme> tags like this:

<Clause_Independent> <Circumstance_Temporal><Theme> Fourscore and seven years ago</Circumstance_Temporal></Theme><Rheme><Nominal_Actor> our fathers</Rheme></Nominal_Actor><Process_Relational_Existence> <Rheme> brought forth</Rheme></Process_Relational_Existence><Circumstance_Locative> <Rheme> on this continent</Rheme></Circumstance_Locative><Nominal_Goal> <Rheme> a new nation</Rheme></Nominal_Goal></Clause_Independent>


So, to code disjoint syntagms:

  1. Select the opening note, and link it to the desired code using the •StartCodes link-type.
  2. Select the closing note, and link it to the desired code using the •EndCodes link-type.
  3. Select each intervening note, and link it to the desired code using the •ContainedCodes link-type.


That will generate the desired output.
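
The three steps follow a simple positional rule, which can be stated as a small Python helper. This is a hypothetical planning aid of my own, not part of the Tinderbox file: it only lists which link-type each note should receive for a code that spans several sibling notes.

```python
def span_link_types(note_names, code):
    """Return (note, link_type, code) triples for a code spanning several notes."""
    if len(note_names) < 2:
        raise ValueError("a disjoint syntagm spans at least two notes")
    plan = []
    last = len(note_names) - 1
    for position, name in enumerate(note_names):
        if position == 0:
            link_type = "•StartCodes"      # opening note
        elif position == last:
            link_type = "•EndCodes"        # closing note
        else:
            link_type = "•ContainedCodes"  # every note in between
        plan.append((name, link_type, code))
    return plan

# The Rheme of the first clause, phrase by phrase.
rheme_plan = span_link_types(
    ["our fathers", "brought forth", "on this continent", "a new nation"],
    "Rheme",
)
```

Only the first and last notes contribute tags to the output; the •ContainedCodes links record membership of the span without emitting anything.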

Assumptions


Be aware of these assumptions:

  1. While you can code at multiple levels of detail through a document, at any one branch you need to code at a uniform level. So, if you code an individual word within a phrase, then all the words in that phrase need to be represented at the word level. (Sibling phrases need not be analysed at the word level.)
  2. Because prototypes are assigned through the primary code hierarchy, the primary hierarchy of codes must be mutually exclusive.
  3. StartCodes, ContainedCodes, EndCodes do not communicate prototype inheritance. Only the primary coding system communicates prototype inheritance.

Oops

There is one mistake I keep making: frequently forgetting to create the footnote prior to linking it. This results in me marking out the text in the clause, then immediately linking that text to a code. It's visually similar to the result I want, but the mechanism relies on coding each phrase separately. When I do make that mistake, I have to delete the link, and then remember to run the Cleanup_RunOnceThenSwitchOff agent prior to generating the Nakakoji.


Next Article: Coding with Footnotes and Links—The Mechanics

Monday
Jul 13, 2009

Textual Cleansing, Tokenizing and Reassembly

On the Tinderbox forum, Mark Anderson demonstrated how to cleanse, tokenize and reassemble text using Tinderbox and allied command-line utilities. I extended that work to enable graphical coding with textual output automatically incorporating the analytical coding. This article describes the mechanisms developed for these tasks.

Feel free to download the Tinderbox file illustrating these techniques.

Cleansing

Instead of using external programs to cleanse a source text, as I did with Microsoft Word, Mark's approach is to bring the text into Tinderbox, then to pipe it to sed for cleanup. Mark put the following command into a note's Rule:

$Text |= runCommand("echo " + $Text(The Gettysburg Address) + $Text(Cleanup Code))

Here's what each portion of the command does:

  1. runCommand — invokes the command-line.
  2. echo — prints its arguments to standard output.
  3. $Text(The Gettysburg Address) — accesses the original text in the note named "The Gettysburg Address".
  4. $Text(Cleanup Code) — accesses the text of another note, which contains the cleanup commands themselves.
  5. $Text |= — assigns the cleansed text back into the current note's $Text attribute. The assignment operator is conditional: if the $Text attribute has a length of zero, then assignment will occur; once $Text contains a non-zero length, it will avoid assignment. Conditional assignment is essential for efficiency: no use getting your computer to repeatedly do work that needs doing only once!
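
The conditional assignment in item 5 is worth seeing in isolation. Here is a minimal Python sketch of the same idea, with hypothetical names throughout; it models the note as a plain dictionary and counts how often the expensive step actually runs.

```python
def assign_if_empty(note, attribute, compute):
    """Mimic |=: run compute() and store its result only when the attribute is empty."""
    if not note.get(attribute):
        note[attribute] = compute()
    return note[attribute]

calls = []

def expensive_cleanup():
    calls.append("ran")  # record each invocation
    return "cleansed text"

note = {"Text": ""}
assign_if_empty(note, "Text", expensive_cleanup)  # runs the cleanup
assign_if_empty(note, "Text", expensive_cleanup)  # skipped: Text is no longer empty
```

The second call leaves the note untouched, which is the behaviour a rule needs if it is evaluated over and over.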

In Mark's example, the Cleanup Code note contains the following commands:

| sed 's/…/… \t/g'
| sed 's/—/— \t/g'
| sed 's/\\.\\.\\./\\.\\.\\.\t/g'
| sed 's/ / \t/g'


These commands are designed to clean up the following issues:

  • the first line finds all true ellipses, placing a tab after each ellipsis
  • the second line places a tab after em-dashes
  • the third line places a tab after faux ellipses
  • the last line places a tab after every space.

In each case, the g modifier tells sed not to stop once it has made its first match, but to match every instance of the pattern in the text.
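
For readers who prefer to see the substitutions without the shell, the same four steps can be written in Python. This is an illustrative equivalent rather than Mark's method, and it assumes that \t in the sed replacement stands for a literal tab character.

```python
def cleanse(text):
    """Apply the four substitutions in the same order as the sed pipeline."""
    text = text.replace("…", "… \t")     # true ellipsis, then a tab
    text = text.replace("—", "— \t")     # em-dash, then a tab
    text = text.replace("...", "...\t")  # faux ellipsis, then a tab
    text = text.replace(" ", " \t")      # every space, then a tab
    return text

cleaned = cleanse("Four score and seven years ago")
```

As in the original pipeline, the last step also acts on the spaces inserted by the first two, so some boundaries end up with more than one tab.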

Tokenizing

By marking every word boundary with a tab, Mark makes it possible to use Tinderbox's Explode command to tokenize the text. By tokenize, I mean placing every token (that is, every word) in a separate note. With each word in a separate note, the analyst is free to manipulate the individual tokens with all the machinery available to notes.
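
As a rough picture of what exploding on tabs produces, the following Python sketch splits tab-delimited text into one record per token. The function name and the dictionary layout are hypothetical, and whether Explode itself trims stray spaces or drops empty pieces is an assumption on my part.

```python
def explode_on_tabs(cleaned_text):
    """Split tab-delimited text into tokens, dropping empty pieces."""
    return [piece.strip() for piece in cleaned_text.split("\t") if piece.strip()]

tokens = explode_on_tabs("Four \tscore \tand \tseven \tyears \tago")

# One record per token, numbered in reading order like notes in an outline.
notes = [
    {"Name": token, "OutlineOrder": position}
    for position, token in enumerate(tokens, start=1)
]
```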

Textual Reassembly

Tinderbox's Nakakoji view is prototypically used for exporting text outlines. Unlike the HTML View, by default Nakakoji View does not require templates to contain the traversal mechanism. Instead, Nakakoji View traverses the outline itself. I (and others) hadn't realized the value of the Nakakoji view until Mark Anderson demonstrated its use.

Mark Anderson showed us that Nakakoji View is a general-purpose text construction tool. As such, he demonstrated how to reconstruct the entire Gettysburg Address as a readable text, even though every word had been tokenized by being placed in separate notes.

The technique is as follows:

  1. Create an agent that retrieves the source tokens. The agent's sort order is set to OutlineOrder. (Alternatively, you could choose to sort by SiblingOrder, which appears to give roughly similar results.)
  2. In the outline view, select the target agent.
  3. Open the Nakakoji View.
  4. Change the Nakakoji option to Selected notes.
  5. Select the appropriate template. The template itself specifies that all children will be assembled into a text with spaces between each word.

The Nakakoji View window then displays the reconstructed text.
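
The reassembly amounts to sorting the tokens back into reading order and joining them with spaces. A minimal Python sketch of that idea follows; the record layout and function name are hypothetical, and the sample tokens are deliberately listed out of order to show that the sort does the work.

```python
def reassemble(token_notes):
    """Sort tokens into outline order and join them with single spaces."""
    ordered = sorted(token_notes, key=lambda note: note["OutlineOrder"])
    return " ".join(note["Name"] for note in ordered)

# Tokens as an agent might return them before sorting.
retrieved = [
    {"Name": "seven", "OutlineOrder": 4},
    {"Name": "Four", "OutlineOrder": 1},
    {"Name": "and", "OutlineOrder": 3},
    {"Name": "score", "OutlineOrder": 2},
]
text = reassemble(retrieved)
```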

Textual Reassembly with Analysis

While reassembling the exploded text into readable form demonstrated the technique, the analyst experiences the real payoff when she can reassemble the text with analytical codes.

To do this, I took Mark Anderson's tokenized text and opened the notes in Map View. My intent was to visually assign markup to the tokens.

The first step is to drag the tokens for a clause complex (i.e. a sentence) into a single line. Here is the first phrase in the first clause complex of the Gettysburg Address.

Tokens in Tinderbox's Map View.
The second step is to begin constructing a set of codes to apply to the text. For some analytical work, it is good practice to design the codes prior to beginning the markup. However, in other cases, it is good to have the flexibility to develop the codes alongside the coding activity itself. For any practical work, it is necessary to have the flexibility in the toolset to be able to do this.

In this example, while the coding scheme I used is mature, having been developed over the last 30 years by the likes of M.A.K. Halliday, Ruqaiya Hasan, J.R. Martin and Christian Matthiessen, my entry of the codes proceeded incrementally with the textual analysis.

Each code is created as a Prototype, having a unique colour for immediate visual discrimination.

Lexicogrammatical Codes from Systemic Functional Linguistics' Ideational Metafunction.
With a minimal number of codes in place, I was able to begin depicting my analysis. This image shows the marking of the first nominal phrase.

The circumstantial phrase that opens the Gettysburg Address.
The green note marking the Circumstantial-phrase is an ordinary note whose prototype is Circumstantial-phrase. In laying the note on the map, it is important not to accidentally place it inside one of the notes already on the map. Although Tinderbox is geared to hierarchically structured spatial outlines, here we want to work within a single spatial plane.

When you are applying a code for the very first time in a document, you create the note and place it on the map. Then you press the Enter key to display the Rename dialog. From the Rename dialog's Prototype drop-down list, you select the desired code. In this case, I selected Circumstantial-phrase.

After the first application of a code, whenever you create a note and place it on the map, you open it by pressing the space bar. This displays the note contents. If the Prototype attribute is not already showing in the note content header, you select it as a Key Attribute. When the Prototype Key Attribute is visible, you can select the code from the drop-down list on the right. For the sake of keeping track of the analysis, I elected to type into the note's Name the syntagm for which it coded.

The goal of applying coding in the Map View is two-fold:

  1. To visually work with the coding system, to enable the analyst to see the richness of the coding through spatial and colour discrimination.
  2. To be able to produce a coded textual output capable of being processed and visualised in other programs.


The magic that achieves these goals combines an export template and an agent. Here is the export template.

Tinderbox Export Template showing the logic for each item in Map View.
Here's what this code means:

  1. The first if statement prints out the note's Name if the note's Prototype is of SourceTokens. SourceTokens is my name for the tokens representing the source text, i.e., the words of the Gettysburg Address. They are the cream-coloured notes.
  2. The second if statement asks the note whether or not it is an EndTag. (EndTag is a boolean attribute, which I'll explain soon.) If the note's EndTag attribute is true, the template prints a tag similar to an HTML close tag, e.g.: </Circumstantial-phrase>
  3. The final else statement simply prints the name of the Prototype, wrapped in an HTML-style open code tag, e.g.: <Circumstantial-phrase>

So, this one template caters for three cases:

  1. The note is a word in our original text, in which case it prints the word.
  2. The note is a code, in which case it prints an opening code: <Circumstantial-phrase>
  3. The note is an EndTag for a code, in which case it prints a closing code: </Circumstantial-phrase>
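
The three cases can also be sketched in Python. This is not the Tinderbox export template itself, only the same decision logic. The render function and its parameter names are hypothetical.

```python
def render(name, prototype, end_tag=False):
    """Return the text that one note contributes to the assembled output."""
    if prototype == "SourceTokens":
        # Case 1: a word of the original text prints as itself.
        return name
    if end_tag:
        # Case 3: a duplicated code note flagged as an EndTag closes the code.
        return f"</{prototype}>"
    # Case 2: any other code note opens the code.
    return f"<{prototype}>"


print(render("Four", "SourceTokens"))
print(render("opening phrase", "Circumstantial-phrase"))
print(render("opening phrase", "Circumstantial-phrase", end_tag=True))
```

Note that a code note's Name never reaches the output. Only its Prototype does, which is why the Name is free to hold the syntagm being coded.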

To assemble the text, I used five agents. The agents have the following roles:

  1. To assemble the SourceTokens in the correct sequence.
  2. To assemble the code StartTags in the correct sequence.
  3. To assemble the code EndTags in the correct sequence.
  4. To assemble both StartTags and EndTags in their correct sequences.
  5. To construct the final set of analysed notes, in the correct sequence.

In each case, "correct sequence" means sorting the notes by Ypos and then by Xpos. SourceTokens at the top of the map are arranged before SourceTokens below them. SourceTokens on the left of the map are arranged before those on the right. So the placement of the coding notes on the map is the key to indicating exactly where the codes apply to the text.
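
A small Python sketch of this two-level ordering follows. The dictionary keys borrow the Ypos and Xpos attribute names, but the coordinates are invented, and I am assuming that a larger Ypos means lower on the map.

```python
def reading_order(notes):
    """Sort notes top to bottom, then left to right within the same Ypos."""
    return sorted(notes, key=lambda n: (n["Ypos"], n["Xpos"]))


# Two lines of tokens, listed out of order. The coordinates are made up.
notes = [
    {"Name": "fathers", "Xpos": 3.0, "Ypos": 4.0},
    {"Name": "score", "Xpos": 3.0, "Ypos": 2.0},
    {"Name": "our", "Xpos": 1.0, "Ypos": 4.0},
    {"Name": "Four", "Xpos": 1.0, "Ypos": 2.0},
]
print([n["Name"] for n in reading_order(notes)])
```

In this sketch, tokens only read left to right if they share exactly the same Ypos. A token sitting even slightly lower than its neighbours would sort after the whole line, so aligning the notes carefully matters.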

SourceTokens agent. It collects notes from the Map View and arranges them in order of their spatial arrangement in the map.
Of the five agents, I tried hard to avoid designing the EndTags agent the way I did. What I wanted to do was to have the StartTags agent generate an alias to represent the starting of the tag using the Ypos and Xpos attributes. I then wanted the EndTags agent to generate from the same coded note the YExtent and XExtent, i.e. the bottom right corner of the note. I then wanted to combine these two sets of aliases to be able to produce the HTML-like opening and closing tags. Due to the design of Tinderbox aliases, it is simply not possible to do this.

Instead, I was forced to manually duplicate the tag in the Map View, and then position the duplicate so that it follows the syntagm it closes.

Tinderbox Lexicogrammatical coding on the Map View, illustrating the use of EndTags.
In the above example, the green and blue bars descending from the phrase codes represent the EndTag notes. To create the EndTag notes, I:

  1. Duplicated the coded note.
  2. Reshaped it so that its Ypos begins at the same level as the SourceTokens, and its Xpos begins after the final SourceToken in the phrase.
  3. Applied a stamp that set the EndTag boolean attribute to true.
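
The same three steps can be modelled in Python to show why the duplicate ends up in the right place once the notes are sorted. This only models the manual procedure and does not automate anything in Tinderbox. The Note class, its width field and make_end_tag are hypothetical, and the coordinates are invented.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Note:
    name: str
    prototype: str
    xpos: float
    ypos: float
    width: float = 1.0
    end_tag: bool = False


def make_end_tag(code_note, phrase_tokens):
    """Duplicate a code note, park it after the phrase's last token, and flag it."""
    last = max(phrase_tokens, key=lambda t: (t.ypos, t.xpos))
    return replace(code_note, xpos=last.xpos + last.width, ypos=last.ypos, end_tag=True)


tokens = [
    Note("Four", "SourceTokens", 1.0, 2.0),
    Note("score", "SourceTokens", 3.0, 2.0),
    Note("ago", "SourceTokens", 9.0, 2.0),
]
# The code note sits above the token line, so it sorts ahead of the tokens.
code = Note("Four score ... ago", "Circumstantial-phrase", 1.0, 1.0, width=9.0)
end = make_end_tag(code, tokens)

ordered = sorted(tokens + [code, end], key=lambda n: (n.ypos, n.xpos))
print([(n.prototype, n.end_tag) for n in ordered])
```

Because the duplicate shares the tokens' Ypos and has a larger Xpos than the final token, the sort by Ypos then Xpos places it immediately after the phrase.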

Here is what the Gettysburg Address looks like. Only the first sentence is coded.

The complete Gettysburg Address in Map View. The first sentence is fully coded.
Here's a closer view of the first sentence.

Beginning of the first sentence of the Gettysburg Address, coded against a minimal set of ideational lexicogrammatical tags.

After generating the final, coded text through the Nakakoji View, this is the output.

The assembled output, where the first sentence is coded with a very basic set of Systemic Functional Linguistic Ideational codes.

Conclusion

The techniques outlined here are a non-obvious application of a few of the affordances that Tinderbox provides for textual analysis. The collaboration that emerged through the serial one-upmanship on the Tinderbox forum enabled Mark Anderson and me to conceive and implement the techniques. While the mechanisms are innovative, there are some very real issues in conducting a textual analysis study using them. A later article will outline the benefits and drawbacks of these techniques.

Next Article: Coding with Footnotes and Links

Monday
Jul132009

State of Textual Analysis Research

Over the last week at the Tinderbox forum, Mark Anderson, Jean Goodwin, Paul Walters and Charles Turner have been advancing the collective best practices for Textual Analysis in Tinderbox. The spirit of the advance is characterised by practical experimentalism.

The next four posts on this blog aim to summarize the most successful aspects of the experimental results.

I now see that the Textual Analysis in Tinderbox research will produce five streams of output:

  1. A summary of best practices for textual analysis in Tinderbox.
  2. Practical Tinderbox examples of textual analyses.
  3. A statement of requirement for digital tool support of textual analysis.
  4. Analysis of the affordances provided by Tinderbox for textual analysis.
  5. Design recommendations for Tinderbox to better support textual analysis.