
Entries in Textual Analysis Example 1 (6)

Sunday
Jun 28, 2009

Discovering poetry

In the last post, I demonstrated reflowing text. In this post, I'll demonstrate not only reflowing text, but splitting up the verses in order to be able to encode semantic relations within Tinderbox. As a by-product of doing this, I discovered a poem within the first chapter of First John. (I'm really, really excited about finding this poem!)

While analysing 1 John 1:7, I noticed that the text consisted of (1) a condition and (2) two different consequences. Then I realised I'd seen the same pattern in 1 John 1:6. Because I enjoy marking out textual patterns, I was driven to mark the semantic relationship between the clause containing the condition and the dependent clauses containing the consequences.

There is currently no way in Tinderbox to mark these relationships. To adapt my need to Tinderbox, I made two copies of the note containing 1 John 1:7. I then deleted part of the text in each, renaming the now-three notes 1 John 1:7a, 1 John 1:7b and 1 John 1:7c. Having separated the three clauses, I was now able to create links from 1 John 1:7a to each of the other two notes with a named semantic relation: consequence.
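The note-splitting workaround can be pictured as a tiny data structure. This is only a sketch in Python, not anything Tinderbox exposes: the `Note` class and its `link_to` method are hypothetical names invented for illustration, and the clause texts are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Note:
    """A hypothetical stand-in for a note holding one clause."""
    name: str
    # Each link is stored as (relation name, name of the target note).
    links: list[tuple[str, str]] = field(default_factory=list)

    def link_to(self, other: "Note", relation: str) -> None:
        """Record a named, directed link from this note to another."""
        self.links.append((relation, other.name))


# One note per clause: the condition and its two consequences.
condition = Note("1 John 1:7a")
consequences = [Note("1 John 1:7b"), Note("1 John 1:7c")]

for clause in consequences:
    condition.link_to(clause, "consequence")

print(condition.links)
```

The point of the sketch is that the relation lives on the link, not in the text, which is why the verse had to be split into three notes before it could be expressed.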

1 John 1:7. Segmented into three clauses. (Links are set to invisible.)
As I progressed into 1 John 1:8, I again noticed that there was a condition and two consequences. The pattern was again repeated in 1 John 1:9 and 1 John 1:10. And when I reviewed 1 John 1:6, I saw that there was the same pattern there too. In fact, 1 John 1:5, the introduction to this segment of text, also contains three identifiable portions; but I'm still debating whether 1 John 1:5 is part of the pattern, is separate from the pattern but anticipates it, or should be understood as being distinct from the pattern evident in the following verses.

At the same time as discovering this pattern, I noticed that the author alternates between clause complexes relating to "walking in the darkness" and clause complexes about "walking in the light". I decided that this was obvious enough that it deserved to be indicated through a colour variation.

The English Standard Version translators entitled this section "Walking in the light." But in my view, the dominant theme in this text is better represented as "Two ways of walking." So I grabbed an adornment and labeled the section as such.

First John 1:5-10. Emerging patterns displayed with spatial and colour discriminators.
Notice, on this adornment, I've slightly offset verses 6 through 10 to the right to indicate that verse 5 is the introduction to this section. It seems ironic that I'm visually representing the metaphor of an outliner in an outliner, albeit in that outliner's spatial view.

The next thing I noticed absolutely thrilled me: poetic balance. I noticed that the theme in 6c and 8c corresponded; so I drew a link between them. Then I noticed that 10c was also on the same theme; another link. Then the pace of linking accelerated, because I noticed the theme in 6b, 8b and 10b corresponded too. Of course, it's easy to see that the themes in 6a, 8a and 10a are all pretty comparable (more links). In the other strand, 7c and 9c directly correspond. So, at this point, I became willing to accept that this entire pattern is poetically balanced through thematic correspondence.
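The balance described here is regular enough to generate mechanically. The following Python sketch assumes the two strands are verses 6, 8, 10 (darkness) and 7, 9 (light), each split into clauses a, b and c, and pairs up clauses that hold the same position within a strand. The names `STRANDS` and `correspondences` are invented for illustration. The output also includes the 7a/9a and 7b/9b pairs, which the post argues for a little further on.

```python
from itertools import combinations

# Verse numbers in each strand, as read from the post.
STRANDS = {"darkness": [6, 8, 10], "light": [7, 9]}


def correspondences(strands=STRANDS, positions="abc"):
    """Pair clauses that share a position (a, b or c) within one strand."""
    links = []
    for verses in strands.values():
        for position in positions:
            clauses = [f"{verse}{position}" for verse in verses]
            # Every pair of same-position clauses in a strand is one link.
            links.extend(combinations(clauses, 2))
    return links


for left, right in correspondences():
    print(left, "<->", right)
```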

First John 1:5-10. Links encoding analytical understanding of semantic relations between syntagms.
The texts in 7b and 9b do not overtly share the same topic. But having accepted that this pattern is well motivated, it is reasonable to accept that the author intended 7b and 9b to be thematically complementary. By juxtaposing two concepts and calling them one, the author is implicitly enriching the ideation. Thus I conclude that the author is asserting that being forgiven of sins and having fellowship with one another are intimately related.

This same line of reasoning holds true for the interpretation of 7a and 9a. In fact, there is an additional level of parallelism in 7a and 9a that I haven't indicated in the diagram. Notice that in 7a and 9a, there are two components within each syntagm: the first component describes "we"; the second component describes "he". Let us refer to these segments as 7ai, 7aii, 9ai, 9aii. (If the affordance made it easy to do, I'd certainly indicate subordinate Textspans / Syntagm Lenses to support the analysis.) So for "we," walking in the light corresponds with confession of sin; while for "he" walking in the light corresponds with being "faithful and just."

Earlier in the text, a reader may have been justified in asking of the author, "Okay, fine, you've asserted that God is light and walks in the light, but what do you mean by that?" It's not obvious from a casual reading of this passage that the author answers this question. But through visual and semantic analysis, the poetic balance becomes clear, and the analyst gains the interpretive key to understand that indeed the author does provide an answer to the question.

To me, discovering an interpretive pattern like this is truly thrilling. It's a fantastic payoff for the labor invested in the textual analysis.

Next article: Affordance critique

Friday
Jun 26, 2009

Reflowing Text

Having demonstrated Inductive Analysis on an individual verse, I proceeded to analyse each individual verse according to that pattern. But independent analysis of each verse gives little sense of the function of that stretch of language within the text.

To get a sense of how the verses relate to one another, I turned to Tinderbox's Map View to reflow the text.

The first step is to locate the first verse and expand its box so that I can see the text. Then I get the next verse, expand its box, and arrange it in relationship to the first. I proceeded by interleaving the Inductive Analysis with arranging the text.


Note expansion in Tinderbox's Map View to see the text content.
In arranging the text, I had no particular schema to follow. The relationships I expressed were intuitive, and as I pursued the study, I "invented" differentiations to recall to mind the meanings I had perceived.

In addition to this, when I perceived that a particular stretch of text consisted of a theme (or, more formally, one generic element of structure), I added an adornment with a title to signal what I saw as the dominant function or meaning within the text. This is the result of arranging 1 John 1:1-5.

Notice in the example below I placed 1 John 1:2 to the right of the main flow of the text. By this positioning, I indicated my understanding that 1 John 1:2 is an excursus.

First John 1:1-5. Arranged according to intuitive visual relationship with the analyst's interpretation of the dominant theme labeling the adornment.
In examining this flow of text, the following illustrations hint at some functionality that I wish were available in Tinderbox. I've used an external drawing program to indicate chains of lexical relations.

First John 1:1-5. Demonstrating desired functionality to support lexical analyses.
In formal terms, I am demonstrating an example of cohesion analysis. More than simply highlighting words, what I want to do is mark semantic relations between chains of words, so that I could use Tinderbox's querying mechanism to retrieve the cohesive chains. Right now in Tinderbox, I can create named semantic relations only between notes; or, if I use text links, the relation is unnamed, and I can make only one link, not the several that analysis so often calls for.

Next article: Discovering poetry

Thursday
Jun 25, 2009

Inductive Analysis


Having imported each verse in the target text into a separate note, I then generated a Tinderbox map view. Tinderbox's map view displays each text with its book name, chapter and verse reference, which is why I was so particular about setting up the text correctly for import.

A partial display of verses from First John, in the Tinderbox Map View.
But the text reference is not part of the text, merely a deictic reference. What I want to see is the text itself. So I locate 1 John 1:1, and press spacebar to open the note.

First John Chapter 1, Verse 1. Displayed in a Tinderbox note.
The first stage in this analysis is to carefully tease out the facts within the text, and then to make observations based on the facts. The final stage of the analysis is to note Application of the observation to the analyst's own life. (We won't be demonstrating this final stage in the analysis; the interaction technique is similar to the first two steps.) This style of analysis is known as the Inductive Technique in biblical studies. (As an aside, this weekend, I'm hosting a seminar in which this technique is being taught.)

Here is a completed analysis for the first verse of First John. The Facts and Observations sections are both headlined by bolding the text. This is my approximation for demarcating multiple rich text field attributes, which Tinderbox does not at this time provide. The restatement of each known fact is part of the analytical technique, and is intended to make the analyst pay close attention to each and every fact. Information in the observations section is defined by the analytical technique as "that which stands out" to the analyst once the facts have been identified. Clearly, this is a very simple, subjective analytical technique which is well suited to repetitious data entry. In later examples within this project, I'll demonstrate more rigorous linguistic analyses that are less forgiving of busywork.

First John Chapter 1, Verse 1. Analysed with a partial application of an Inductive Technique, in a Tinderbox note.
It's not my purpose at this point to rigorously examine the limitations of the affordances in supporting this very simple form of textual analysis. I'll leave that on hold for the concluding article in this series.

Next Article: Reflowing Text

Wednesday
Jun 24, 2009

Assembling the Text

Given the obvious difficulty in working with either the manuscript or published text-form, I'm fortunate that the subject of this textual analysis is so readily available in digital form.

The first step in assembling the text into analysable form within Tinderbox is to acquire my copy of the text. Given the significant investment Christians have made in making this text available, that is readily achieved. (I'm looking forward to the day when every published text is as readily available.)

Nipping over to Bible Gateway begins to get us close. Close because while my purpose is to analyse the scriptural text, this passage still has translator's headings, versification, footnotes and cross-references embedded in the text. Most of these have to be removed.

First John, Chapter 1, English Standard Version, as transmitted from Bible Gateway.

I copied and pasted this chapter, as well as each subsequent chapter, of First John into Microsoft Word.

First John, Chapter 1, English Standard Version, as pasted into Microsoft Word 2003 running in Windows XP Pro under Parallels v4 on Mac OS X 10.5.7.
Microsoft Word is invaluable in this regard, because it is the only non-programmatic way I know of searching and replacing text using regular expressions (Word calls them wildcards). The full list of regular expressions used is as follows:

Search      Replace
\(?\)       <space>
\(??\)      <space>
\[?\]       <space>
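For readers without Word, the three wildcard searches above translate fairly directly into ordinary regular expressions. Here is a minimal Python sketch, assuming the markers are always one or two characters inside parentheses, or one character inside square brackets. The function name `strip_markers` and the sample string are made up for illustration.

```python
import re

# Word's wildcard "?" matches any single character; the regex
# equivalent is ".". Each pattern mirrors one search in the list.
MARKER_PATTERNS = [
    re.compile(r"\(.\)"),   # Word: \(?\)   e.g. "(a)"
    re.compile(r"\(..\)"),  # Word: \(??\)  e.g. "(ab)"
    re.compile(r"\[.\]"),   # Word: \[?\]   e.g. "[b]"
]


def strip_markers(text: str) -> str:
    """Replace each cross-reference or footnote marker with one space."""
    for pattern in MARKER_PATTERNS:
        text = pattern.sub(" ", text)
    return text


# Made-up sample with one marker of each kind.
sample = "first (a)phrase, second [b]phrase, third (ab)phrase"
print(strip_markers(sample))
```

As in the Word version, each marker becomes a space, so doubled spaces are left behind for a later clean-up pass.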


First John, Chapter 1, English Standard Version, in Microsoft Word 2003 having cross-references and footnotes removed.
At this point, I save the document into a different version, because I'm going to take a slight detour. While I definitely want to get the text into Tinderbox, I'd first like to do a quick and coarse corpus analysis; because, hey, I'm curious.

Quick Lexical Study

I want to rank the lexical items in order of use, so that I can get a quick check on the thematic focus of the author. The thing to do is to cleanse the text of anything that is not a lexical item.

Step one is to manually delete the headings. It's a short text, so it doesn't take too long. (It's quicker than figuring out how to make Microsoft Word return all paragraphs that do not contain a verse number. If I were examining, say, the book of Isaiah, I might just stop to figure this out.)

Search      Replace
[0-9]       <space>
---         <space>
,           <space>
.           <space>
"           <space>
^p          <space>

 

Then run this replacement multiple times until it returns a result of zero replacements.

Search              Replace
<space><space>      <space>


Finally, switch off wildcard search, and run this replacement.

Search      Replace
?           <space>


Select the entire text, and toggle Shift+F3 until it is all lowercased, to obtain the following text.
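The cleansing replacements above, together with the lowercasing step, might be scripted roughly as follows. This is a sketch in Python, not a record of what was actually run: the function name `cleanse` is invented, and trimming the leading and trailing space is an extra step not described in the post.

```python
import re


def cleanse(text: str) -> str:
    """Reduce a passage to lowercase words separated by single spaces."""
    # Digits, triple hyphens, the listed punctuation, question marks
    # and line breaks all become spaces.
    text = re.sub(r'[0-9]|---|[,."?\n]', " ", text)
    # Collapse runs of spaces: the one-pass equivalent of repeating
    # the double-space replacement until it reports zero changes.
    text = re.sub(r" {2,}", " ", text)
    # Trim the ends (an addition) and lowercase everything.
    return text.strip().lower()


print(cleanse("5 God is light, and in him is no darkness at all."))
```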

First John, Chapter 1, English Standard Version, in Microsoft Word 2003 having been purged of all non-lexical items.
Here's the part where it stops feeling like manual labour.

I searched for all spaces, replacing them with paragraph marks.

Search      Replace
<space>     ^p


Then sorted all paragraphs alphabetically, to get a complete list of all uses of each word, in alphabetical order.

Copying this entire text and pasting it into Excel gives us the power of pivot tables to automatically count the number of uses of each word.
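The split, sort and pivot-table count can be approximated in a few lines of Python. This is a minimal sketch that assumes the text has already been cleansed to lowercase words separated by single spaces; `count_words` is a name made up for this example.

```python
from collections import Counter


def count_words(cleansed: str) -> Counter:
    """Count how often each space-separated word occurs."""
    return Counter(cleansed.split())


counts = count_words("god is light and in him is no darkness at all")

# Alphabetical listing, like the sorted word column with its counts.
for word in sorted(counts):
    print(word, counts[word])
```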

The following image shows four areas of working. Column A contains the complete listing of all words copied from Microsoft Word. Columns D and E show the pivot table counts. I copy the pivot table into Columns G and H (Paste Special>Values) so that I can work with it. I manually stemmed the words so that abide, abides, abiding all get included in the one count. I then re-sorted Columns J and K in descending numeric order. Finally, I was able to begin tagging the words according to parts of speech, in order to get at my target: lexical item counts.
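The manual stemming and re-sorting steps could be sketched like this in Python. The stem table is deliberately hand-built, to mirror the manual step, and only the abide forms come from the post. The name `rank_stems` and the sample counts are made up for illustration and are not the real figures.

```python
from collections import Counter

# Hand-built stem table standing in for the manual stemming step.
STEMS = {"abides": "abide", "abiding": "abide"}


def rank_stems(counts: Counter) -> list[tuple[str, int]]:
    """Merge inflected forms under one stem, then rank by frequency."""
    merged = Counter()
    for word, n in counts.items():
        merged[STEMS.get(word, word)] += n
    # Descending count; ties are broken alphabetically.
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))


# Illustrative counts only.
sample = Counter({"abide": 2, "abides": 3, "abiding": 1, "love": 4})
for stem, n in rank_stems(sample):
    print(n, stem)
```

A lookup table keeps the analyst in charge of which forms belong together, which is the flexibility the manual approach was chosen for.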

Lexical Analysis of the words of First John, in Microsoft Excel 2003, running in Windows XP Pro under Parallels v4 on Mac OS X 10.5.7.

Results of Lexical Analysis

Most Frequent Verbs in First John
46 love
34 know
23 abide
?? sin [ NOTE 1 ]
13 commandment [ NOTE 2 ]
11 hear
11 keep
10 testify

[ NOTE 1: I'll have to look more closely to see how many instances refer to sinning (verb), sin (as state) or sin (as thing). ]

[ NOTE 2: This is a nominalised verb; I'll have to look at the usages to determine the extent to which it should be treated as a verb. e.g. "keep the commandment" is verbal; whereas "when we sin against the holy commandments" would be a nominal use. ]

Most Frequent Nouns or Pronouns in First John
83 we
66 god
57 you
51 him
45 he
42 his
39 us
23 world
22 son
?? sin [ NOTE 1 ]
15 brother
15 our
14 children
13 father
13 spirit
13 jesus

Most Frequent Adjectives in First John
7 darkness [ NOTE 3 ]
6 beloved [ NOTE 4 ]
6 eternal
6 evil [ NOTE 4 ]
6 light [ NOTE 4 ]

[ NOTE 3: A nominalised adjective. I'm interested in tagging these into concepts, not just parts of speech. ]
[ NOTE 4: I'll have to read the context to determine whether the usages are adjectival, nominal or verbal. ]

Tentative Conclusions

My gloss on the above analysis is that I might expect the themes in this text to be exceptionally relationally focused. Almost every one of the top nouns is relational; virtually all the top verbs are too; and the adjectives draw strong contrasts between good and evil, so I might expect the text to thematise distinctly polar relationships, with some behavioural uses too.

Interlude

In the previous section, I could have described my process as simply cleansing the document and then running Athelstan's excellent MonoConc Pro!

But then not everyone reading this would be able to roll their own at home. Nor would I have a chance to point out the distinct benefits of digitising what can be quite a laborious activity. Of course, that is my purpose here: to build a case for digitising certain practices that today occur manually. And just as important: notice that if I had used MonoConc Pro, I wouldn't have had the power to freely stem or colour-code the results. Flexibility in digital tools matters as much as having digital tools at all, for if digital tools have insufficient flexibility, the knowledge worker is forced to route around the damage ...

First John, English Standard Version, as a cleansed corpus file in Athelstan's MonoConc Pro, running in Windows XP Pro under Parallels v4 on Mac OS X 10.5.7.

Assembling the text in Tinderbox

Turning back to my saved copy of First John in Microsoft Word, I manually highlighted individual chapters, and ran an interesting but simple search and replace on each successive chapter.

Search              Replace
(<[0-9][0-9]>)      ^p1 John 1:\1^t
(<[0-9]>)           ^p1 John 1:\1^t


The first statement says to find all two-digit numbers that comprise an entire word, and replace them with a paragraph mark, followed by the text "1 John 1:" (First John, chapter one), then the verse number found by the search, then a tab. The second statement is just like the first, only it finds single-digit numbers.
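The same versification step can be sketched in Python. This version is an adaptation, not a literal port: it assumes every standalone one- or two-digit number in the chapter is a verse number, and it handles both cases in a single pass, so digits inserted by the replacement are never re-scanned. The function name `versify` and its parameters are made up for illustration.

```python
import re


def versify(chapter_text: str, book: str = "1 John", chapter: int = 1) -> str:
    """Turn bare verse numbers into 'newline, reference, tab' prefixes."""
    return re.sub(
        # A one- or two-digit number standing alone as a whole word.
        r"\b(\d{1,2})\b",
        # Newline, then e.g. "1 John 1:5", then a tab before the text.
        lambda m: f"\n{book} {chapter}:{m.group(1)}\t",
        chapter_text,
    )


# Placeholder wording, not the biblical text.
print(versify("1 first verse text 2 second verse text"))
```

Calling `versify(text, chapter=2)` would handle the next chapter, which corresponds to running the replacement on each highlighted chapter in turn.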

The paragraph marks and tabs are very important for Tinderbox's processing.

First John, English Standard Version, fully versified, ready for import into Tinderbox.
I selected this entire text, and then pasted it into a Tinderbox note.

First John, English Standard Version, fully versified, in a Tinderbox note.
Now the work gets easy, because I simply use Tinderbox's Explode function to generate a note for each verse in the entire text.
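To see why the paragraph marks and tabs matter, here is a rough Python analogue of what the explode step achieves. It is not Tinderbox's implementation. It simply assumes that each paragraph becomes one note and that the tab separates the note's name from its text. The function name `explode` and the sample wording are invented for illustration.

```python
def explode(versified: str) -> list[tuple[str, str]]:
    """Split versified text into (note name, note text) pairs."""
    notes = []
    for line in versified.splitlines():
        if not line.strip():
            continue  # skip empty paragraphs
        # Everything before the first tab is the name; the rest is text.
        name, _, body = line.partition("\t")
        notes.append((name.strip(), body.strip()))
    return notes


# Placeholder wording, not the biblical text.
sample = "\n1 John 1:1\t first verse text \n1 John 1:2\t second verse text"
for name, body in explode(sample):
    print(name, "|", body)
```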

From a textual analysis point of view, having the text split into verses doesn't make a lot of sense. It would be better to break the verses into clause complexes (i.e. sentences) and clauses. But from a biblical studies perspective, verses are a usual method of demarcating textual components.

Tinderbox Outline, displaying one note for each verse in First John.
This image obviously reveals some post-processing that I've performed on the verses in the range 1 John 1:5 through 1:10. I'll share that with you in the next installment.

For now though, I'd just like to draw your attention to a few concordance-like word searches in Tinderbox.
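A concordance-like search of this kind is easy to picture in code. The Python sketch below is only an analogue: Tinderbox agents use their own query syntax, which is not reproduced here. The names `ABIDE` and `find_verses`, the note names and the sample sentences are all made up for illustration.

```python
import re

# Matches the three forms searched for: abide, abides, abiding.
ABIDE = re.compile(r"\babid(?:e|es|ing)\b", re.IGNORECASE)


def find_verses(notes, pattern=ABIDE):
    """Return the names of notes whose text matches the pattern."""
    return [name for name, text in notes if pattern.search(text)]


# Placeholder notes, not quotations.
notes = [
    ("note 1", "whoever abides will remain"),
    ("note 2", "nothing relevant here"),
    ("note 3", "Abiding is the theme"),
]
print(find_verses(notes))
```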

Results stemming from Tinderbox agents.
Here is the abide agent.

Tinderbox agent, demonstrating searching for abide, abides, abiding.
You may care to notice two small details.

Firstly, I've selected each note in the Tinderbox file containing a verse, and made it inherit from a prototype I've called Biblical Verse. This prototype contains absolutely no behaviour. I put it in place purely as a type marker. Of course, sometime in the future, if I want biblical verses to exhibit some look or behaviour, I have a ready facility for achieving that.

Secondly, the order of the aliases is set to SiblingOrder, which conforms it to the order of the biblical text. This is particularly important in order to sort verse 10 after verse 9, rather than after verse 1.
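The verse 10 problem is the familiar difference between alphabetical and numeric ordering. The Python sketch below illustrates it. It does not show how SiblingOrder works, since that follows the notes' position in the outline. Here the chapter and verse numbers are parsed out of the reference instead, and `verse_key` is a name made up for this example.

```python
import re


def verse_key(reference: str) -> tuple[int, int]:
    """Turn '1 John 1:10' into (1, 10) so references sort numerically."""
    match = re.search(r"(\d+):(\d+)", reference)
    if match is None:
        raise ValueError(f"no chapter:verse in {reference!r}")
    return int(match.group(1)), int(match.group(2))


refs = ["1 John 1:9", "1 John 1:10", "1 John 1:1"]

print(sorted(refs))                 # alphabetical: 1:10 lands before 1:9
print(sorted(refs, key=verse_key))  # numeric: 1:1, 1:9, 1:10
```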

Next Article: Inductive Analysis

Tuesday
Jun 23, 2009

The Cline from Image to Text

I strongly agree with Audenaert's assertion that text and image form "two ends of a continuum rather than two poles of a dichotomy." Consider this range of possibility:

  • a photograph
  • an image of a painting
  • a photograph containing readable text
  • an image of a painting with some textual elements
  • artwork consisting mainly of textual representation
  • an image of text with illustrated elements
  • an image of text with annotations
  • a PDF containing text and image regions
  • a PDF consisting of a collage of texts by multiple authors
  • an image of text consisting purely of text written by one author
  • a PDF of text consisting purely of text written by one author

Purely in terms of logical analysis of this continuum, there doesn't appear to be any single point at which it is worthwhile demarcating text from image. Moreover, the following discussion will demonstrate that imagery and textuality are inherently related. Even the most plain-text of documents communicates through extra-linguistic semiota: font, spacing, pagination, formatting, relationship of headings with text, etc.

A purely textual manuscript

Consider the requirements for analysing the following, purely textual document. This manuscript would barely meet Audenaert's requirements to qualify as a visually complex document. Yet one would imagine the tools required to analyse the document would be similar to those employed for photographs or paintings (perhaps with an extra affordance here or there). Fortunately, there is no driver from linguistic or semiotic theory for separating the study of images and text: in the book The Language of Displayed Art, Michael O'Toole describes how to analyse paintings, photographs and sculptures using analytical techniques borrowed from linguistics.

First Epistle of John (begins in right column). From: Codex Sinaiticus. 4th Century Majuscule. Care of CSNTM.org.

Plain text? or Collagic complexity?

Now consider the following image. The right-hand page consists purely of text. Despite the familiarity of the type-setting, and the evident bookishness, this image is arguably more complex than the manuscript above, because it is highly collagic. Consider just one parameter: the variety of authors and contributors to this one page:

  • The introductory material (top of page), and the notes (bottom of page) were written by a committee of ten academics from the Reformed theological tradition during the period 1988-1995.
  • The headings "The Word of Life", "Walking in the Light" and the translation notes at the bottom of the right-hand column were written by the ESV translation committee circa 2001.
  • The cross-references in the middle column were developed by a team of Bible scholars from Oxford and Cambridge Universities in the 19th Century, incorporating a cross-reference system developed by the translators of the 1611 King James Version.
  • The versification was designed by Robert Estienne in 1551.
  • This version of the biblical text was translated by the ESV translation committee circa 2001, from the United Bible Society's 1993 Greek New Testament (4th corrected ed.), which is a critical text derived from comparison of hundreds of manuscripts, including Codex Sinaiticus pictured above.

Despite this obvious complexity, we have referenced only one parameter out of the total parameter-space for analysing complexity. We've said nothing about the graphic layout, the punctuation, the paragraphing, the visual layout of the outline. While this clearly qualifies as a "visually complex document," it consists almost entirely of plain text.

First Epistle of John. From the Reformation Study Bible, English Standard Version. Published by Ligonier Ministries, Orlando, Florida.

Mechanics of document capture

I captured this image by placing my Bible on my CanoScan LiDE 25, and causing the CanoScan software to capture the image to PDF. So this image is actually an image of a PDF document. When I open the PDF document, I can clearly see that the CanoScan software has represented the image as text despite the obvious skew. But when I position my mouse pointer over the text that reads "That which was from the beginning", and then down into the second verse, the software highlights the middle column, the right-hand column and even some of the words from the facing page. The PDF may have recognized letters; it certainly does not recognize the columnar format.

Attempting to highlight an area of text using Apple's Preview software, on Mac OS X 10.5.7. Scanned from a CanoScan LiDE 25 using CanoScan software.
The down-ranking of meaning in PDF is even a challenge in documents that are printed directly to PDF and then transmitted electronically: the headers and footers, callouts, and other extra-linguistic textual features are intermingled within the text stream. PDF was designed to drive printers, not to facilitate textual analysis by knowledge workers.

At all levels of textual analysis, from authorship to graphology, from linguistic to extra-linguistic semiotic, from imagery to textuality, from source to down-stream digital processing tools, the cline between image and text is gradual. The analytical requirements are largely shared. The tool requirements for many types of text will likely draw from techniques developed for image analysis; image analysis will likely benefit from a common framework with a digital textual analysis tool, particularly one that incorporates notions of spatial hypertext.

Next Article: Assembling the Text