Data Creation

Author: Cody Kingham
Published: 29.07.17
Updated: 03.04.18

This project describes the ETCBC text processing pipeline in detail, from the beginning stages of converting an ancient text into its machine-readable format, all the way up to the encoding of text-level data on textual hierarchy. The ETCBC data pipeline has been evolving for forty years. Over that period, many resources about the ETCBC have been published, but few have gone into detail on the processes which go into creating the data.

From a research standpoint, the lack of exposure on the data creation process presents a methodological problem of reproducibility. Especially in an age when text processing is booming through the growing field of digital humanities, exposing those processes becomes even more important. In addition, the introduction of the database to the internet with SHEBANQ in 2014 has greatly expanded the accessibility of the ETCBC; at the same time, this increases the need for clear documentation on the procedures used to label and produce the data. In the same vein, the inclusion of the data into various software packages requires a clear, succinct accounting of the data to further enhance its impact and use.

While the present project does not describe every detail of the programs’ inner workings, it does provide the user with a good sense of how one would build an encoded text in the ETCBC format. The files in the example repository also provide specific reference points in the form of both encoded files and man pages on the programs which created them. The most important content is the description of the analysis files, wherein the output of the programs is represented.

The ETCBC data creation pipeline is run on a centralized server at the Vrije Universiteit Amsterdam, which researchers can log on to remotely. The server runs the SunOS flavor of Unix. Users who wish to become even more familiar with the data creation processes should first get a basic understanding of how to work from the Unix command line. Those who are working from a Mac system can already experiment on the command line by opening the Terminal app.

Most of the descriptions herein are collated from the man (“manual”) pages on the ETCBC server (many by Constantijn Sikkel), accessed by entering man [program/filename] from the Unix command line.

For a more detailed description on the rationale behind this project, read the internship report.

ETCBC Data Creation Pipeline


This diagram depicts the pipeline in the ETCBC’s data creation process. The pipeline can vary from project to project, and has changed over the years. The pipeline presented above represents a version used during the recent project on syntactic variation. The top level file, book.pil, contains the starting point for the process. The files below it in white make up the analysis files that hold and transmit the analyzed data. In the chart, green represents part of a process. Those processes are initiated with the commands indicated in green text in the center arrows (e.g. pil2wit, analyse, etc.). Thus, for example, provided that all the necessary input files are present, one can simply type in the command from the command line and initiate the process. Green boxes represent files utilized by the processes. The process files on the right are ported in from central locations on the server with UNIX commands like sccs. The process files on the left are generated by the pipeline itself.

1. Programs

The programs utilized by the data creation pipeline are well-documented in the ETCBC server man pages. A copy of each of the program’s man pages reflected in the pipeline above is available in the files repository on github.

2. Analysis Files

Each analysis file represents a stage in the data creation process, progressing from word level to phrase and clause levels. Analysis files are plain text, and are frequently operated on with additional UNIX commands (e.g. sed, awk, sort, etc.) not reflected here (to see examples, view the example Makefile).

Each file description below contains:
1. the command/location to view documentation in the ETCBC server
2. a sample of the file selected from the ETCBC encoding of the Mesha Inscription (Hebrew/Moabite)
3. a diagram of the specific parts of the file.

2.1 Raw Text Analysis

2.1.1 pil

The .pil (“Peshitta Institute of Leiden”) file is the first step in entering an ancient text into a computer-readable format. It contains the chapter and verse boundaries, transcription of the document text, and notations of variants/reconstructions found in other witnesses. The file also supports the notation of lacunae and fragmentary readings in a text.

The .pil file is generated by processing a plain text document, perhaps with the original utf8 characters copied and pasted from an original source (book.txt in the diagram above).

Source

plain text file, perhaps in unicode, processed into .pil format with sed, awk, cat, or other UNIX text processing commands

Documentation

Format of a PIL Running Text File or
/projects/calap/doc/format/format.pdf

Sample

@Mesa1 
1 'nk m$` bn km$yt  [km$/ 1A1, 1C1, 1D1] mlk [<??>mlk/ 2C1] m'b hdybny; 
2 'by mlk `l m'b $l$n $t w'nk mlkty 'Hr 'by; 
3 w'`$ hbmt z't lkm$ bqrHh; 
4 bmt y$` [bm<???>$`/ 1A1, 1D1] [b<????>$`/ 1C1] ky h$`ny mkl hmlkn wky hr'ny bkl $n'y; 
5 `mry mlk y$r'l wy`nw 't m'b ymn rbn ky y'np km$ b'rSh; 
...  
Parts
lines 1-2 of Mesa.pil

A more detailed description for each of these elements and some others, as well as the formatting rules for a .pil file, is in the documentation, Format of a PIL.

2.1.2 gt

The .gt (“graphical text”) file contains a cleaned version of the plain text document, stripped of its .pil notations and transliterated into the ETCBC transliteration. The file also contains “directives” or markers for the document’s language, book name, and verse labels.

Each row of text is preceded on the line above it with a verse header (%verse n,n). A blank newline separates individual verses.

The .gt file extension is sometimes omitted.

Source

generated by pil2wit

Documentation:

man -s5 gt

Sample

%bookname Mesa 
%language hebrew

%verse 1,1
>NK MC< BN KMCJT MLK M>B HDJBNJ

%verse 1,2
>BJ MLK <L M>B CLCN CT W>NK MLKTJ >XR >BJ 

...
Parts
header and verse 1 of Mesa.gt

Comments in a .gt file are prefixed with #.
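
The layout described above (directives, verse headers, one row of text per verse, # comments) can be read with a few lines of Python. This is only a minimal sketch based on the sample shown here; the function name parse_gt is hypothetical, and the real format documented in man -s5 gt may allow more than this sketch handles.

```python
from typing import Dict, List, Tuple


def parse_gt(text: str) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Split .gt content into header directives and (verse label, text) pairs.

    Assumption: every verse header has the form '%verse n,n' and is followed
    by a single row of transliterated text, as in the sample above.
    """
    directives: Dict[str, str] = {}
    verses: List[Tuple[str, str]] = []
    label = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank separator line or comment
        if line.startswith("%verse"):
            label = line.split(None, 1)[1]  # e.g. '1,1'
        elif line.startswith("%"):
            # other directives such as %bookname or %language
            name, _, value = line[1:].partition(" ")
            directives[name] = value.strip()
        elif label is not None:
            verses.append((label, line))
    return directives, verses


sample = """%bookname Mesa
%language hebrew

%verse 1,1
>NK MC< BN KMCJT MLK M>B HDJBNJ

%verse 1,2
>BJ MLK <L M>B CLCN CT W>NK MLKTJ >XR >BJ
"""
directives, verses = parse_gt(sample)
print(directives["language"], verses[0])
```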

2.2 Morphological Analysis

2.2.1 an

The .an (“analysis”) file breaks down the individual words into their morphological parts: prefix, core (the lexeme), infix (morphemes within the lexeme), suffix. The various morphemes are set off with marker characters. For instance, a prefix is set between exclamation marks: !J!QVL for יקטל. Verbal endings follow an opening square bracket ([), e.g. the suffix W in !J!QVL[W for יקטלו.

The file itself contains three columns with every word on its own row. The first “column” (technically only space-separated) contains the verse to which the word belongs. The second contains the surface form of the word. The third column contains the analytical representation of a word encoded with the ETCBC encoding.

The file is generated by matching the surface forms to a dictionary of previously analyzed words. The dictionary, an anzb (“Analytical ‘Zorg’ Book”) file, contains previously seen surface forms in one column with the encoding of their constituent parts in another. The user can compile anzb files from previous, composite anzbs, or they can compose their own based on their analysis of the text.
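
The dictionary lookup described above can be pictured with a toy example. The sketch below is not the analyse program itself; it only illustrates the idea of matching surface forms against previously analyzed forms. The names parse_an and lookup_words are hypothetical, and the small dictionary is filled by hand from the sample rows further below.

```python
from typing import Dict, List, Optional, Tuple

AnRow = Tuple[str, str, Optional[str]]


def parse_an(text: str) -> List[Tuple[str, str, str]]:
    """Read .an rows: verse label, surface form, analytical form."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:  # skip blank or incomplete lines
            rows.append((fields[0], fields[1], fields[2]))
    return rows


def lookup_words(verse: str, words: List[str], anzb: Dict[str, str]) -> List[AnRow]:
    """Pair each surface form with its stored analysis, or None if unseen."""
    return [(verse, word, anzb.get(word)) for word in words]


# A hand-made stand-in for an anzb dictionary (surface form -> encoding).
toy_anzb = {">NK": ">NK(J", "BN": "BN/", "MLK": "MLK/"}

for row in lookup_words("1,1", [">NK", "MC<", "BN"], toy_anzb):
    print(row)  # the unseen form MC< comes back with None
```

A form that comes back as None is one the user would still have to analyze and add to the dictionary by hand.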

Source

generated by analyse

Documentation

The .an file has no individual man page.

For details on the encoding patterns, see man -s5 word_grammar.

A detailed description of the encoding convention used for the words is available in Verheij’s Grammatica Digitalis I. A shorter description for quick referencing can be found in Description of Quest II Data File Format.

Sample

1,1 >NK                       >NK(J
1,1 MC<                       M(JC</
1,1 BN                        BN/
1,1 KMCJT                     KMCJT/
1,1 MLK                       MLK/
1,1 M>B                       M(W>B/
1,1 HDJBNJ                    H-DJBNJ/
...  
Parts
verse 1 in Mesa.an

2.2.2 at

The .at (“analyzed text”) file contains the encoded morphology from .an, but it is split into separate files for each chapter. From this file on, analysis is performed on a chapter-by-chapter basis (e.g. for the Pesher to Habakkuk: 1QpHab1.at, 1QpHab2.at, etc.).

In the .at file, there are two pieces of data per row. The first is a book, chapter, and verse label (e.g. Mesa 1,1). The second is the morphologically encoded words separated by spaces. Each row is variable length, and verses can run across multiple lines, in which case the book, chapter, and verse label is simply repeated.

The .at file also contains a language identifier at the point where each language begins in the document.

Source

generated by genat

Documentation

man -s5 at
also: man -s5 genat

Sample

%language hebrew

Mesa 1,1  >NK(J M(JC</ BN/ KMCJT/ MLK/ M(W>B/ H-DJBNJ/
Mesa 1,2  >B/-J MLK=[ <L M(W>B/ CLC/(JN C(N(H/T W->NK(J MLK=[TJ >XR/
Mesa 1,2  >B/-J
Mesa 1,3  W:n-!>!<F(H[ H-BM(H/T Z>T L-KM(WC/ B-QRXH=/
Mesa 1,4  BM(H/T JC<=/ KJ ]H](JC<[-NJ M(N-KL/ H-MLK/(JN W-KJ ]H]R>(H[-NJ
Mesa 1,4  B-KL/ FN>[/(J-J
...
Parts
verses 1-4 in Mesa1.at
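
Because a verse can run across several rows with its label repeated, a reader of the .at file has to merge those rows. The following is a minimal sketch of that step, assuming the row layout seen in the sample (book, chapter,verse, then space-separated words); parse_at is a hypothetical name.

```python
from typing import Dict, List


def parse_at(text: str) -> Dict[str, List[str]]:
    """Collect the encoded words of each verse, merging continuation rows."""
    verses: Dict[str, List[str]] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # blank line or directive such as %language
        fields = line.split()
        if len(fields) < 3:
            continue  # not a 'book chapter,verse words...' row
        label = f"{fields[0]} {fields[1]}"
        verses.setdefault(label, []).extend(fields[2:])
    return verses


sample = """%language hebrew

Mesa 1,1  >NK(J M(JC</ BN/ KMCJT/ MLK/ M(W>B/ H-DJBNJ/
Mesa 1,2  >B/-J MLK=[ <L M(W>B/ CLC/(JN C(N(H/T W->NK(J MLK=[TJ >XR/
Mesa 1,2  >B/-J
"""
for label, words in parse_at(sample).items():
    print(label, len(words))
```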

2.3 Word Level Analysis

Word level analysis files contain lexical and parsing data for individual words. The parsing is calculated by taking the morphological text from the .at file as input and applying rules found in process files such as word_grammar (see man word_grammar).

2.3.1 ps2

The .ps2 (“phrase structure 2”) file contains the word parsing data in a columnar format. The same format is preserved throughout the remainder of the process as additional columns are simply added in the subsequent .ps files (.ps3, .ps4, and .PX).

Source

generated by at2ps

Documentation

man ps2

Sample

MESA 01,01 >NKJ                0   7 -1 -1 -1 -1 -1   -1  1  1 -1 -1
MESA 01,01 MJC<                0   3 -1 -1 -1 -1 -1   -1 -1 -1  2  2
MESA 01,01 BN                  0   2 -1 -1 -1  1 -1   -1 -1  1  2  0
MESA 01,01 KMCJT               0   3 -1 -1 -1  1 -1   -1 -1  1  0  2
MESA 01,01 MLK                 0   2 -1 -1 -1  1 -1   -1 -1  1  2  0
MESA 01,01 MW>B                0   3 -1 -1 -1  1 -1   -1 -1  1  0  2
MESA 01,01 H                   0   0 -1 -1 -1 -1 -1   -1 -1 -1 -1 -1
MESA 01,01 DJBNJ              -2  13 -1 -1 -1  1 -1   -1 -1  1  0  0
                  *
...  
Parts
verse 1 in Mesa1.ps2

Every word in the .ps2 file is placed on its own row. Each column represents lexical, morphological, or parsing information which has been produced by at2ps (or sometimes syn02) based on the .an file. The data is stored as a simple integer. Unless otherwise noted below, -1 means that the category in question is not applicable (e.g. there are no preformatives in the diagram above, so all the words are set to -1 in that column). 0 can mean that the value is not present or is unknown, as with every term in the lexical set column above except for DJBNJ. However, a 0 in the part of speech column, for example, means that the word is an article.

The potential codes and their corresponding values in a .ps2 file are presented below, as derived from either the man pages of ps1, ps2, or morfset.
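
A .ps2 row can be turned into named fields with a short script. The sketch below assumes that the twelve integer columns follow the same order as the value tables in this section (lexical set first, state last); that order is inferred from the sample and the discussion here, not taken from the man pages, and the column names and parse_ps2_row are hypothetical.

```python
from typing import Dict, Union

# Assumed column order, inferred from the order of the value tables below.
PS2_COLUMNS = (
    "lexical_set", "part_of_speech", "preformative", "root_formation",
    "verbal_ending", "nominal_ending", "pronominal_suffix", "verbal_tense",
    "person", "number", "gender", "state",
)


def parse_ps2_row(line: str) -> Dict[str, Union[str, int]]:
    """Split one .ps2 row into book, verse, surface form and 12 integer codes."""
    book, verse, surface, *values = line.split()
    if len(values) != len(PS2_COLUMNS):
        raise ValueError(f"expected {len(PS2_COLUMNS)} codes, got {len(values)}")
    row: Dict[str, Union[str, int]] = {"book": book, "verse": verse, "surface": surface}
    row.update(zip(PS2_COLUMNS, (int(v) for v in values)))
    return row


row = parse_ps2_row("MESA 01,01 DJBNJ              -2  13 -1 -1 -1  1 -1   -1 -1  1  0  0")
print(row["surface"], row["lexical_set"], row["part_of_speech"], row["state"])
```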

Values

Lexical set is evaluated by combining the lexical set column with the part of speech column. If the lexical set number is 0, then there is no value for lexical set. Otherwise, the combinations are presented below.

lexical set
(columns: lexical set code, part of speech code, value)
-6   2   distributive noun
-5   2   copulative noun
-4   2   potential adverb
-4   4   anaphoric adverb
-3   2   potential preposition
-3   4   conjunctive adverb
-3  13   ordinal
-2   1   copulative verb
-2   2   noun of multitude
-2   4   focus particle
-2  12   interrogative particle
-2  13   gentilic
-1   1   quotation verb
-1   2   cardinal
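
Since the lexical set is read from a pair of columns rather than a single one, a lookup keyed on both values is a natural fit. The sketch below simply transcribes the table above into a Python dictionary; the names LEXICAL_SET and lexical_set_label are hypothetical.

```python
from typing import Optional

# (lexical set code, part of speech code) -> value, transcribed from the table.
LEXICAL_SET = {
    (-6, 2): "distributive noun",
    (-5, 2): "copulative noun",
    (-4, 2): "potential adverb",
    (-4, 4): "anaphoric adverb",
    (-3, 2): "potential preposition",
    (-3, 4): "conjunctive adverb",
    (-3, 13): "ordinal",
    (-2, 1): "copulative verb",
    (-2, 2): "noun of multitude",
    (-2, 4): "focus particle",
    (-2, 12): "interrogative particle",
    (-2, 13): "gentilic",
    (-1, 1): "quotation verb",
    (-1, 2): "cardinal",
}


def lexical_set_label(lexical_set: int, part_of_speech: int) -> Optional[str]:
    """Return the lexical set value, or None when no value applies."""
    if lexical_set == 0:
        return None  # 0 means there is no lexical set value
    return LEXICAL_SET.get((lexical_set, part_of_speech))


# DJBNJ in the sample has lexical set -2 and part of speech 13.
print(lexical_set_label(-2, 13))
```

Combinations that are not listed in the table also return None in this sketch.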

The rest of the columns are evaluated by simple correspondence to the id integer.

part of speech
0   article
1   verb
2   noun
3   proper noun
4   adverb
5   preposition
6   conjunction
7   personal pronoun
8   demonstrative pronoun
9   interrogative pronoun
10  interjection
11  negative
12  interrogative
13  adjective
preformatives
1   !!
2   !J!
3   !T!
4   !>!
5   !N!
6   !H! or !M! (Aramaic)
root formation
hebrew aramaic hebrew aramaic
1 ]] ]] 7 ]H2]
2 ]H] ]H] 8 ]H2] ]C]
3 ]N] ]HT2] 9 ]HCT] ]HCT]
4 ]2] ]HT] 10 ]HT2]
6 ]HT] 11 ]NT]
verbal ending
hebrew aramaic hebrew aramaic
1 [ [ 10 [NW [N>
2 [H [T 11 [= [=
3 [T [TH 12 [J [JN
4 [TH [TJ 13 [JN [WN
5 [T= [T= 14 [WN [N
6 [TJ [W 15 [NH [J
7 [W [H 16 [NH=
8 [TM [TWN 17 [2
9 [TN [TN
nominal endings
hebrew aramaic hebrew aramaic
1 / / 7 /H= /T
2 /H /= 8 /JM2 /H
3 /T /JN 9 /J2 /JN2
4 /JM /J 10 /2
5 /J /T= 11 //
6 /WT /N
pronominal suffixes
hebrew aramaic hebrew aramaic
1 + + 9 +NW +N>
2 +NJ +J 10 +KM +KWN
3 +J +NJ 11 +KN
4 +K +K 12 +HM
5 +K= 13 +M +HWN
6 +W +H= 14 +MW
7 +HW +HJ 15 +HN +HN
8 +H +H 16 +N
verbal tense
-1  NA
1   imperfect
2   perfect
3   imperative
4   infinitive construct
5   infinitive absolute
6   participle
11  wayyiqtol
12  weyiqtol
62  passive participle
person
-1  NA
0   unknown
1   first
2   second
3   third
number
-1  NA
0   unknown
1   sg
2   du
3   pl
gender
-1  NA
0   unknown
1   feminine
2   masculine
state
-1  NA
0   unknown
1   construct
2   absolute
3   emphatic

2.4 Phrase Level Analysis

2.4.1 ps3

The .ps3 file contains boundaries and features for phrase atoms. New data in the .ps3 records the phrase dependent part of speech (for words within the phrase), phrase type, and phrase determination through an integer value.

The .ps3 file also contains the base information of .ps2, with some modifications. The state column, for instance, is re-evaluated on the basis of the phrase-level syntax and moved toward the phrase atom block.

Source

generated by syn03 with user interaction

Documentation

man ps3

Sample


MESA 01,01 >NKJ                0   7 -1 -1 -1 -1 -1   -1  1  1 -1    -1   7   7   2
MESA 01,01 MJC<                0   3 -1 -1 -1  1 -1   -1 -1  1  2     2   3   3   2
MESA 01,01 BN                  0   2 -1 -1 -1  1 -1   -1 -1  1  2     1   2   0  -1
MESA 01,01 KMCJT               0   3 -1 -1 -1  1 -1   -1 -1  1  0     2   3   0  -1
MESA 01,01 MLK                 0   2 -1 -1 -1  1 -1   -1 -1  1  2     1   2   0  -1
MESA 01,01 MW>B                0   3 -1 -1 -1  1 -1   -1 -1  1  0     2   3   0  -1
MESA 01,01 H                   0   0 -1 -1 -1 -1 -1   -1 -1 -1 -1    -1   0   0  -1
MESA 01,01 DJBNJ              -2  13 -1 -1 -1  1 -1   -1 -1  1  0     2   3  -2   2
          *
...
Parts
verse 1 in Mesa1.ps3

Note that in the .ps3 file, the state column has shifted right into the phrase atom block. Note also that some of the values for state have changed in light of the phrase level analysis. To illustrate, compare the 5th word in the file, MLK in the .ps2:

MESA 01,01 MLK                 0   2 -1 -1 -1  1 -1   -1 -1  1  2  0

The last column reads 0, meaning that MLK, in its surface form, is in an unknown state (since the surface form מלך can be either absolute or construct). But in the .ps3 file, the value in the state column (col. 12) has now changed to 1, i.e. construct, because of that term’s relation to the following word, MW>B (“Moab”; thus, מלך מועב “king of Moab”).

MESA 01,01 MLK                 0   2 -1 -1 -1  1 -1   -1 -1  1  2   1   2   0  -1 

Phrase atom boundaries are implicitly communicated through the presence of a value for phrase atom type. Thus, only the last word of a phrase atom is marked in the .ps3 through any phrase type value other than 0. The corresponding features of the integer value can be seen in the table below. If a value is negative, it communicates a phrase atom in apposition, of the type given by the corresponding positive value.

For instance, in the sample above BN has a phrase type value of 0 (i.e. second to last column) which immediately follows a positive value in the row above. This means the BN begins a phrase atom. The next 4 words have a value of 0 (null) and thus belong to the phrase atom marked off by BN. The last word in the phrase atom (DJBNJ) is marked off with a value of -2, meaning that the phrase atom is in apposition (since it's negative) and that it is a nominal phrase atom (since the absolute value is 2, see the table below). So the whole phrase atom is: "BN KMCJT MLK MW>B H DJBNJ" which is a nominal phrase in apposition to the preceding phrase atom, MJC<.
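
The boundary rule just described can be sketched in code: words are collected until a row carries a non-zero phrase type, which closes the phrase atom. The sketch assumes that phrase type is the second-to-last column of every word row, as in the sample; phrase_atoms is a hypothetical name.

```python
from typing import List, Tuple

# (words of the phrase atom, phrase type code, is it in apposition?)
PhraseAtom = Tuple[List[str], int, bool]


def phrase_atoms(lines: List[str]) -> List[PhraseAtom]:
    """Group .ps3 word rows into phrase atoms."""
    atoms: List[PhraseAtom] = []
    current: List[str] = []
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # asterisk line, blank line, or other non-word row
        current.append(fields[2])  # surface form
        phrase_type = int(fields[-2])  # assumed: second-to-last column
        if phrase_type != 0:
            # a negative value marks apposition; the absolute value is the type
            atoms.append((current, abs(phrase_type), phrase_type < 0))
            current = []
    return atoms


rows = [
    "MESA 01,01 >NKJ   0   7 -1 -1 -1 -1 -1   -1  1  1 -1    -1   7   7   2",
    "MESA 01,01 MJC<   0   3 -1 -1 -1  1 -1   -1 -1  1  2     2   3   3   2",
    "MESA 01,01 BN     0   2 -1 -1 -1  1 -1   -1 -1  1  2     1   2   0  -1",
    "MESA 01,01 DJBNJ -2  13 -1 -1 -1  1 -1   -1 -1  1  0     2   3  -2   2",
    "          *",
]
for atom in phrase_atoms(rows):
    print(atom)
```

The sample rows here are shortened to four words for brevity, so the last phrase atom contains only BN and DJBNJ.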

Values

The values for state and part of speech (phrase dependent) remain the same as in .ps2. The other column values are:

phrase type
1   verbal phrase
2   nominal phrase
3   proper-noun phrase
4   adverbial phrase
5   prepositional phrase
7   personal pronoun phrase
8   demonstrative pronoun phrase
9   interrogative pronoun phrase
10  interjectional phrase
11  negative phrase
phrase determination
-1  NA
1   undetermined
2   determined

2.4.2 ps3.p

The .ps3.p (“.ps3 parsed”) file contains data on subphrases and subphrase relations. The subphrase allows for recursive embedding (limited to 3 levels on a single word) of various subphrase relationships. For instance, they record relationships between individual elements of a phrase, such as a nomen rectum/regens construction. The .ps3.p file contains the same columns of a .ps3 plus three additional columns.

Source

generated by parsephrases alongside user modifications

Documentation

man ps3.p

Sample

MESA 01,01 >NKJ                0   7 -1 -1 -1 -1 -1   -1  1  1 -1     -1   7   7   2       -1      -1      -1
MESA 01,01 MJC<                0   3 -1 -1 -1  1 -1   -1 -1  1  2      2   3   3   2       -1      -1      -1
MESA 01,01 BN                  0   2 -1 -1 -1  1 -1   -1 -1  1  2      1   2   0  -1        2      -1      -1
MESA 01,01 KMCJT               0   3 -1 -1 -1  1 -1   -1 -1  1  0      2   3   0  -1   -10002     106      -1
MESA 01,01 MLK                 0   2 -1 -1 -1  1 -1   -1 -1  1  2      1   2   0  -1        2      -1      -1
MESA 01,01 MW>B                0   3 -1 -1 -1  1 -1   -1 -1  1  0      2   3   0  -1   -10002  -20106     306
MESA 01,01 H                   0   0 -1 -1 -1 -1 -1   -1 -1 -1 -1     -1   0   0  -1       -1      -1      -1
MESA 01,01 DJBNJ              -2  13 -1 -1 -1  1 -1   -1 -1  1  0      2   3  -2   2   -20106      -1      -1
           *
Parts
verse 1 in Mesa1.ps3.p

The .ps3.p file introduces three new columns to the .ps3 structure, a subphrase block. Because a word can appear within several different subphrases, three columns are provided for each word; a word can in fact appear within more than three, which is a shortcoming of the .ps3.p format. Only the endpoint of a subphrase is explicitly marked, while the beginning of the subphrase must be calculated. The endpoint is indicated with a subphrase code. In the sample above, any word (i.e. row) with a value other than -1 (NA) represents the end of a subphrase. Note that in the case of the word MW>B (row 6), three different subphrase endpoints are marked.

The subphrase codes contained in the .ps3.p require some extended explanation.

The codes encapsulate data about the beginning point of the subphrase, distance from the subphrase’s mother, and the relationship to the mother. That information is calculated with the following formulae, wherein abs stands for absolute value, / is the division sign, * for multiplication, and % is for the modulo:

  1. distance to beginning point
           -1 * (abs(code) % 10000) / 100
  2. distance to mother
           code / 10000
  3. relation to mother
           abs(code) % 100

Using this information, we can interpret the three subphrases stored for the word MW>B. The subphrase codes for MW>B are -10002, -20106, and 306. Below is the interpretation for the code -10002. Note that only the whole number from the division is considered in the calculations (e.g. 0 instead of -0.02):

-10002 distance to beginning -1 * (abs(-10002) % 10000) / 100 = -0.02 or 0 Distance to beginning is 0. MW>B both begins and ends a subphrase.
-10002 dist. to mother -10002 / 10000 = -1.0002 or -1 Distance to mother is back one word. This makes MLK the mother word.
-10002 relationship abs(-10002) % 100 = 2 The relationship of MW>B to MLK is a 2, which evaluates to either a nomen regens or nomen rectum (see the codes below). Since MW>B has a mother (MLK), it must be a nomen rectum.

Based on this information, we can reconstruct the subphrase. The first word in the subphrase is MW>B itself and the subphrase also ends with MW>B (since it holds the subphrase code). The entire phrase is then MW>B. Finally, the relationship of MW>B to MLK is that of a nomen rectum or genitive (as in MLK MW>B, or “king of Moab”).

The other two codes can be evaluated in the same way. The code -20106 evaluates as: dist. to beginning = -1, dist. to mother = -2, relationship code = 6. Thus, the first word in this subphrase is MLK. The subphrase ends with MW>B (since it holds the subphrase code). The entire phrase is then MLK MW>B. The mother of the phrase is -2 words away, which places it as KMCJT (“Kemeshyat”). And its relationship to its mother is code 6, which means parallel. Since Kemeshyat itself marks the end of a subphrase, this subphrase is also parallel with the subphrase of KMCJT: so “I am Mesa, son of Kemeshyat // king of Moab”.

In the same way, the final code 306 evaluates as: dist. to beginning = -3, dist. to mother = 0, relationship code = 6. The first word of the subphrase is 3 words back, thus BN. The whole subphrase is: BN KMCJT MLK MW>B (“son of Kemeshyat, King of Moab”). It is a mother subphrase, since it does not itself have a mother (0). Its relationship to its daughter is 6 or parallel. If we move down the .ps3.p two rows and evaluate DJBNJ and its subphrase code of -20106, we see that it is the daughter of this subphrase, and that its relationship is also parallel: BN KMCJT MLK MW>B // H DJBNJ (“Son of Kemeshyat, King of Moab” // “the Dibonite”).
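
The three formulae can be checked against the worked examples with a small function. This is a minimal sketch; decode_subphrase is a hypothetical name. Note that Python's // operator rounds toward negative infinity, so the division is done on the absolute value and the sign is restored afterwards, which matches the "whole number only" rule used above.

```python
from typing import Optional, Tuple


def decode_subphrase(code: int) -> Optional[Tuple[int, int, int]]:
    """Return (distance to beginning, distance to mother, relation code).

    Returns None for -1, which marks a word that ends no subphrase.
    """
    if code == -1:
        return None
    magnitude = abs(code)
    # 1. distance to beginning point: -1 * (abs(code) % 10000) / 100
    to_beginning = -((magnitude % 10000) // 100)
    # 2. distance to mother: code / 10000, keeping only the whole number
    sign = -1 if code < 0 else 1
    to_mother = sign * (magnitude // 10000)
    # 3. relation to mother: abs(code) % 100
    relation = magnitude % 100
    return to_beginning, to_mother, relation


# The three codes stored for MW>B in the sample.
for code in (-10002, -20106, 306):
    print(code, decode_subphrase(code))
```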

Values

The values for the relationship codes are provided below.

relationship codes
2   regens/rectum
4   modifier
5   adjunct
6   parallel
8   demonstrative
13  attribute

2.5 Clause Level Analysis

2.5.1 ps4

The .ps4 file contains clause atom divisions as established by syn04 with user input. .ps4 utilizes the data from the .ps3 (n.b., not .ps3.p). It repurposes the placeholder asterisks in the .ps3 file, which previously marked only the end of a verse, to the end of each clause atom. An additional column is added to the .ps4 format, but that data is currently deprecated and no longer referenced.

Source

generated by syn04 with user interaction

Documentation

man ps4

Sample

MESA 01,01 >NKJ                0   7 -1 -1 -1 -1 -1   -1  1  1 -1    -1   7   7   2    -1
MESA 01,01 MJC<                0   3 -1 -1 -1  1 -1   -1 -1  1  2     2   3   3   2    -1
MESA 01,01 BN                  0   2 -1 -1 -1  1 -1   -1 -1  1  2     1   2   0  -1    -1
MESA 01,01 KMCJT               0   3 -1 -1 -1  1 -1   -1 -1  1  0     2   3   0  -1    -1
MESA 01,01 MLK                 0   2 -1 -1 -1  1 -1   -1 -1  1  2     1   2   0  -1    -1
MESA 01,01 MW>B                0   3 -1 -1 -1  1 -1   -1 -1  1  0     2   3   0  -1    -1
MESA 01,01 H                   0   0 -1 -1 -1 -1 -1   -1 -1 -1 -1    -1   0   0  -1    -1
MESA 01,01 DJBNJ              -2  13 -1 -1 -1  1 -1   -1 -1  1  0     2   3  -2   2    -1
            *
MESA 01,02 >B                  0   2 -1 -1 -1  1 -1   -1 -1  1  2     1   2   0  -1    -1
MESA 01,02 J                  -1   7 -1 -1 -1 -1 -1   -1  1  1 -1    -1   7   2   2    -1
MESA 01,02 MLK=                0   1  0  0  1 -1 -1    2  3  1  2    -1   1   1  -1    -1
...
Parts
verse 1-2a in Mesa1.ps4

Though the asterisk has only served to demarcate verse boundaries up to this point in the .ps files, it now demarcates clause atom boundaries. Though the boundaries in this case did not change for verse 1 of Mesa1.ps4, they did change for verse 2. Compare verse 2 in Mesa1.ps3 to Mesa1.ps4 in the examples repository. The deprecated “catm_flag” column no longer transmits any valuable data and can be disregarded (see man syn04 for more information).
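
Because the asterisk lines now close clause atoms, the word rows of a .ps4 file can be grouped by splitting on them. The following is a minimal sketch under the assumption that an asterisk line begins with * once surrounding whitespace is ignored; split_clause_atoms is a hypothetical name.

```python
from typing import List


def split_clause_atoms(lines: List[str]) -> List[List[str]]:
    """Group the surface forms of a .ps4 file into clause atoms."""
    atoms: List[List[str]] = []
    current: List[str] = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "*":
            if current:  # the asterisk closes the running clause atom
                atoms.append(current)
                current = []
        elif len(fields) > 3:
            current.append(fields[2])  # surface form of a word row
    if current:
        atoms.append(current)  # words after the last asterisk, if any
    return atoms


rows = [
    "MESA 01,01 H      0   0 -1 -1 -1 -1 -1   -1 -1 -1 -1    -1   0   0  -1    -1",
    "MESA 01,01 DJBNJ -2  13 -1 -1 -1  1 -1   -1 -1  1  0     2   3  -2   2    -1",
    "            *",
    "MESA 01,02 >B     0   2 -1 -1 -1  1 -1   -1 -1  1  2     1   2   0  -1    -1",
]
print(split_clause_atoms(rows))
```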

2.5.2 ps4.p

The .ps4.p (“ps4 parsed”) file contains data on the clause constituents, i.e. phrases, and their functions within the clause atoms. It also contains data on phrase-internal relations at the level of phrase atoms. For instance, a noun phrase from the .ps3 file might be evaluated as a subject phrase within the clause established in .ps4. Additionally, a subject phrase may have another noun phrase which functions in apposition to it which is categorized as a phrase atom relation.

The .ps4.p file combines the subphrase block of the ps3.p with the clause divisions of .ps4.

Source

generated by parseclauses with user interaction

Documentation

man ps4.p and man -s5 ct (for constituent codes)

Sample

MESA 01,01 >NKJ                0   7 -1 -1 -1 -1 -1   -1  1  1 -1    -1   7   7   2      -1      -1      -1    0  502    -1
MESA 01,01 MJC<                0   3 -1 -1 -1  1 -1   -1 -1  1  2     2   3   3   2      -1      -1      -1    0  521    -1
MESA 01,01 BN                  0   2 -1 -1 -1  1 -1   -1 -1  1  2     1   2   0  -1       2      -1      -1   -1   -1    -1
MESA 01,01 KMCJT               0   3 -1 -1 -1  1 -1   -1 -1  1  0     2   3   0  -1  -10002     106      -1   -1   -1    -1
MESA 01,01 MLK                 0   2 -1 -1 -1  1 -1   -1 -1  1  2     1   2   0  -1       2      -1      -1   -1   -1    -1
MESA 01,01 MW>B                0   3 -1 -1 -1  1 -1   -1 -1  1  0     2   3   0  -1  -10002  -20106     306   -1   -1    -1
MESA 01,01 H                   0   0 -1 -1 -1 -1 -1   -1 -1 -1 -1    -1   0   0  -1      -1      -1      -1   -1   -1    -1
MESA 01,01 DJBNJ              -2  13 -1 -1 -1  1 -1   -1 -1  1  0     2   3  -2   2  -20106      -1      -1  -11  500    -1
           *
Parts
verse 1 in Mesa1.ps4.p

In the figure above, one can see that the new clause constituent columns introduced by .ps4.p are inserted between the subphrase columns from .ps3.p and the defunct clause column from .ps4. The clause boundary asterisk has also been merged into the new file.

Like the codes utilized in .ps3.p, the codes in .ps4.p require some additional explanation.

There are two columns of new data: distance to mother and constituent codes. The two columns convey either:

  1. clause constituent functions (e.g. subject or predicate phrases, etc.) or
  2. a phrase atom relationship to another phrase (e.g. apposition, parallel, etc.).

Like the .ps3.p file with subphrases, only the endpoint of a phrase is indicated, while the start point must be inferred. In this case, the ending of a phrase is marked by a word with a distance value other than -1 (which indicates a null value). For instance, both >NKJ and MJC< (rows 1, 2) function as single word phrases, since they are preceded by no unmarked words, and since they both have a distance value other than -1. DJBNJ (last row) also marks the end of a phrase, but it is preceded by 5 words with null values. It is inferred, then, that those words belong to the phrase demarcated by DJBNJ (the full phrase being BN KMCJT MLK MW>B HDJBNJ, “son of Kemeshyat, king of Moab, the Dibonite”).

If the phrase boundary has a distance of 0, it represents a clause constituent. Its constituent code in the second column can then be referenced against the clause constituents table below. For instance, the first phrase, >NKJ, is marked 0 and is thus a clause constituent. Its constituent code is 502, which stands for a subject phrase (Subj).

Phrase endpoints with a distance less than 0 convey a phrase atom relationship to another phrase atom. In this case, the distance code must be further parsed to determine the phrase atom’s mother. The distance conveyed may be expressed in word, phrase atom, or clause atom units. To determine both the distance and which unit is being used, compare the absolute value of the code (written |code|) against the thresholds in the flowchart below:

  • if |code| ≥ 100
    • distance unit = word
    • distance = code +/- 100 (+ for negative codes, - for positive codes)
  • if 100 > |code| ≥ 10
    • distance unit = phrase atom
    • distance = code +/- 10
  • if 10 > |code| ≥ 1
    • distance unit = clause atom
    • distance = code +/- 1
  • 0 carries no distance (the phrase is a clause constituent)

The relationship to the mother is expressed in the subsequent, constituents column. Simply confer with the phrase atom relations table below for the value of the code.

To illustrate, DJBNJ has a non-zero value for distance, and thus stores a phrase atom relation. The code is -11. Since the code’s absolute value is at least 10 but less than 100, we know that the distance is expressed in phrase atoms. Because the code is negative, the distance is calculated by adding 10 to it: -11 + 10 = -1. The distance to the mother phrase atom is thus -1 phrase atoms. If we count back one phrase atom from the beginning of this phrase atom (BN KMCJT MLK MW>B HDJBNJ), we see that the mother is the predicate complement phrase atom, MJC< (“Mesa”). When we look up DJBNJ’s constituent code in the phrase atom relations table, we see that the relationship conveyed (code 500) is apposition. Thus, the phrase marked off by DJBNJ, BN KMCJT MLK MW>B HDJBNJ (“son of Kemeshyat, king of Moab, the Dibonite”) functions in apposition to its mother phrase atom, MJC< (“Mesa”).
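
The rules for reading the distance column can be sketched as a small function. Two assumptions are made here and should be checked against man ps4.p: the thresholds are applied to the absolute value of the code, and -1 is treated as the null marker (no phrase ends on that word) before any other rule is applied. The name decode_distance is hypothetical.

```python
from typing import Optional, Tuple


def decode_distance(code: int) -> Optional[Tuple[str, int]]:
    """Return (unit, distance) for a .ps4.p distance code.

    Returns None for -1 (assumed null marker: the word ends no phrase).
    A code of 0 marks a clause constituent and carries no distance.
    """
    if code == -1:
        return None
    if code == 0:
        return ("clause constituent", 0)
    magnitude = abs(code)
    if magnitude >= 100:
        unit, offset = "word", 100
    elif magnitude >= 10:
        unit, offset = "phrase atom", 10
    else:
        unit, offset = "clause atom", 1
    # add the offset to negative codes, subtract it from positive ones
    distance = code + offset if code < 0 else code - offset
    return unit, distance


# DJBNJ in the sample carries the distance code -11.
print(decode_distance(-11))
```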

Values
clause constituent
599  Unknown
505  Adjunct
504  Complement
509  Conjunction
541  Enclitic personal pronoun
552  Existence with subject suffix
550  Existence
572  Fronted element
512  Interjection
522  Interjection with subject suffix
507  Locative
508  Modifier
528  Modifier with subject suffix
540  Negative copula
542  Negative copula with subject suffix
510  Negation
503  Object
525  Predicative adjunct
523  Predicate complement with subject suffix
521  Predicate complement
501  Predicate
531  Predicate with object suffix
532  Predicate with subject suffix
534  Participle with object suffix
511  Question
519  Relative
502  Subject
515  Supplementary constituent
506  Time reference
562  Vocative
phrase atom relations
500  apposition
567  link
566  parallel
535  suffix specification
582  specification

2.6 Text Level Analysis

2.6.1 PX

The .PX (“parsed text”) file contains data at the highest level of analysis, the text level. This includes clause atom relations, clause atom hierarchy, text type, and clause atom type. It also contains new data for functional units such as clauses (made up of atoms), sentences, and paragraphs, with data for each of those units including clause type, sentence number, and paragraph number.

The .PX file also contains the previously analyzed data from the ps4.p, which itself had combined the clause atom data with phrase level data. .PX thus gives an overview of the full text, and through the clause hierarchies (combined with the newly generated .CTT file, see below) it offers new insight into the text as a whole unit.

Source

generated by syn04types with user interaction

Documentation

man PX
man usertab
man CARC

Sample

MESA 01,01 >NKJ                0   7 -1 -1 -1 -1 -1   -1  1  1 -1    -1   7   7   2      -1      -1      -1    0  502     0
MESA 01,01 MJC<                0   3 -1 -1 -1  1 -1   -1 -1  1  2     2   3   3   2      -1      -1      -1    0  521     0
MESA 01,01 BN                  0   2 -1 -1 -1  1 -1   -1 -1  1  2     1   2   0  -1       2      -1      -1   -1   -1    -1
MESA 01,01 KMCJT               0   3 -1 -1 -1  1 -1   -1 -1  1  0     2   3   0  -1  -10002     106      -1   -1   -1    -1
MESA 01,01 MLK                 0   2 -1 -1 -1  1 -1   -1 -1  1  2     1   2   0  -1       2      -1      -1   -1   -1    -1
MESA 01,01 MW>B                0   3 -1 -1 -1  1 -1   -1 -1  1  0     2   3   0  -1  -10002  -20106     306   -1   -1    -1
MESA 01,01 H                   0   0 -1 -1 -1 -1 -1   -1 -1 -1 -1    -1   0   0  -1      -1      -1      -1   -1   -1    -1
MESA 01,01 DJBNJ              -2  13 -1 -1 -1  1 -1   -1 -1  1  0     2   3  -2   2  -20106      -1      -1  -11  500     0
           *  0   1 120   7 100  19 470  52 120   0   0  .N  0 LineNr      1 ClauseNr    1:   1:   2: 200:   0   0 SentenceNr     1 TxtType: Q       Pargr: 1          ClType:NmCl
Parts
verse 1 in Mesa1.PX

The asterisk has up to now only segmented clause atoms. It is now followed by a row of 14 data columns. The columns on the “star line,” as the row is known, break down further into 12 columns of linguistic data and 2 columns of zero padding that simply serve to segment the data (helpful for parsing the end of the list with computer code). Each of those columns is described below.

beginning of data
zero padding to mark the beginning of the clause atom data

clause atom relation list
The clause atom relation list varies in length, depending on how many relationships the clause atom shares with other clause atoms. Within the list, each related clause atom is represented by a pair of integers separated by whitespace.

The first integer of a pair contains the distance (in clause atoms) from the present clause atom to the related mother/daughter atom. If the distance is negative, the relationship is upward in the hierarchy tree; if it is positive, the relationship is downward in the tree. Most clause atoms have a mother clause atom (no more than one). To find it, get the clause atom from the list with the greatest negative distance.

There are a few exceptions to the instructions above for clause atoms that either serve as the root (such as in the example above!) or have a downward connection. Root clause atoms have no mother and will usually have an “instructions” value of N (no connection; see two fields to the right, second character). Occasionally a root, motherless atom might also have an instructions value of \ (downward connection). But a downward connection might also indicate a connection to a mother, depending on the nature of the clause atom it is being related to. It must be determined whether the related atom or the clause atom at hand functions as the mother for the following atoms in the tree. If the related atom is indeed a mother, it will be found by taking the clause atom with the greatest positive distance.

The second number in each pair is a relationship code (“CARC”, or clause atom relation code) which conveys the specific nature of the relationship. The code is composed of three digits: the first digit refers to a lemma class (such as a conditional conjunction or parallel conjunction), the second digit refers to the verb type of the daughter clause’s main verb, and the third digit refers to the verb type of the mother clause’s main verb. To get the corresponding value of those codes, refer to the extensive tables and descriptions available in the Text Fabric clause relations documentation or the CodesList files in the examples repo.

To illustrate using the example above, the first clause atom in Mesa 1:1 has 4 clause atom relations in its list. The distances for all of the relations are positive (1, 7, 19, and 52), and the instructions value (two fields over) is N, which means that there is no mother clause atom. The first daughter is 1 clause atom below the present atom. Its relation code is 120, which means that the daughter is an asyndetic clause (1--) with a perfect verb (-2-) whose mother clause (the present one) has no verb (--0). The second daughter is seven atoms down and has a code of 100, which means it is asyndetic (1--) and verbless (-0-), connected to the present verbless clause (--0). The third daughter is nineteen atoms down and has a code of 470, which means it has a coordinating conjunction (4--) and a wayyiqtol main verb (-7-), and its mother (this one) is verbless (--0). The fourth daughter is fifty-two atoms down and has the same code as the first, 120.

To provide a brief example of a mother relationship with negative distance in the subsequent clause atom’s star line:

*  0  -1 120   1 422   0   0  ..  4 LineNr      2 ClauseNr    1:   1:   4: 122:   0   0 SentenceNr     2 TxtType: Q       Pargr: 1          ClType:XQtl 

The first clause atom in the relation list is -1 120, which means that the mother is back one clause atom, and is asyndetic (1--). The daughter clause (the present one) has a perfect verb (-2-) and the mother has no verb (--0).
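
The digit-by-digit reading used in these examples can be sketched in Python. This is only an illustration: the function names are hypothetical, and the two lookup tables are partial, holding just the values named in the text above (the full lists are in the CodesList files). Codes outside the range 100-999 are not covered by the three-digit description, so the sketch returns None for them, and it does not check whether a given three-digit code actually follows this scheme.

```python
# Partial lookup tables: only the values named in the text above.
LEMMA_CLASS = {1: "asyndetic", 4: "coordinating conjunction"}
VERB_TYPE = {0: "no verb", 2: "perfect", 7: "wayyiqtol"}


def decode_carc(code):
    """Split a three-digit relation code into
    (lemma class, daughter verb type, mother verb type)."""
    if not 100 <= code <= 999:
        return None  # not a three-digit code; consult the CodesList files
    return code // 100, (code // 10) % 10, code % 10


def describe_carc(code):
    """Return a readable description, with '?' for values not in the partial tables."""
    parts = decode_carc(code)
    if parts is None:
        return f"{code}: not covered by the three-digit scheme"
    lemma, daughter, mother = parts
    return (f"{code}: {LEMMA_CLASS.get(lemma, '?')}; "
            f"daughter verb: {VERB_TYPE.get(daughter, '?')}; "
            f"mother verb: {VERB_TYPE.get(mother, '?')}")


for code in (120, 100, 470):
    print(describe_carc(code))
```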

data separator
The space-separated double zero, 0 0, signals the end of the clause atom relations list and the beginning of the instructions data.
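
Because the list is closed by the double zero, it can be read pair by pair until that marker is reached. The following Python sketch shows one way to do so; the function names are hypothetical. It assumes that the star line splits cleanly on whitespace and that no real relation has both a distance and a code of 0. The helper mother_distance reads “greatest negative distance” as the negative distance with the largest magnitude, which is an interpretation, and it does not handle the downward-connection exceptions described earlier.

```python
MESA_1 = ("*  0   1 120   7 100  19 470  52 120   0   0  .N  0 "
          "LineNr      1 ClauseNr    1:   1:   2: 200:   0   0 SentenceNr     1")
MESA_2 = ("*  0  -1 120   1 422   0   0  ..  4 "
          "LineNr      2 ClauseNr    1:   1:   4: 122:   0   0 SentenceNr     2")


def parse_relations(star_line):
    """Return (relations, instructions, tab) from the start of a star line."""
    tokens = star_line.split()
    if tokens[:2] != ["*", "0"]:
        raise ValueError("not a star line")
    relations = []
    i = 2  # skip the asterisk and the leading zero padding
    while True:
        distance, code = int(tokens[i]), int(tokens[i + 1])
        i += 2
        if distance == 0 and code == 0:
            break  # the double zero closes the relation list
        relations.append((distance, code))
    instructions, tab = tokens[i], int(tokens[i + 1])
    return relations, instructions, tab


def mother_distance(relations):
    """Distance to the mother in the simple case, or None if no upward relation."""
    upward = [distance for distance, _ in relations if distance < 0]
    return min(upward) if upward else None


for line in (MESA_1, MESA_2):
    relations, instructions, tab = parse_relations(line)
    print(relations, instructions, tab, mother_distance(relations))
```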

instructions
Instructions represents data for special kinds of clause atoms in the textual hierarchy. There are two slots, which each describe a certain kind, or subtype, of clause atom. Subtype 1 describes why a clause atom does not have a predicate, if it indeed does not. This corresponds to special types of clause atoms such as ellipsis, casus pendens, etc. The second subtype indicates any special status of the clause atom in the hierarchy, with values such as q for direct speech, # for a new paragraph, or e for embedding (also N for no connections, which we have seen already). All of the possible values are below in the instructions table.

tab/indentation
This field contains a simple integer which describes how many tabs in the hierarchy the clause atom is to be indented.

line number
This field derives from the .usertab file (see usertab) where it is used to check the integrity of the file. Since these numbers are consecutive for every clause atom, it might be used in the .PX file to count which clause atom is being referred to.

clause number
This feature introduces the functional clause. Clauses are numbered consecutively within a sentence. The clause number can also be used to identify which clause a clause atom belongs to. For example, if there are numerous clauses and clause atoms within a sentence, one would first locate the sentence number and then the clause number. An example from exodus38.PX (vs. 26) helps to illustrate—only the star lines are given:

*  0  -1 100   1 100   2 223   0   0  ..  6 LineNr     68 ClauseNr    1:   1:   1: 200:   0   0 SentenceNr    45 TxtType: N       Pargr: 2          ClType:NmCl
*  0  -1 100   0   0  .e  8 LineNr     69 ClauseNr    1:   1:   1: 200:   0   0 SentenceNr    46 TxtType: N       Pargr: 2          ClType:NmCl
*  0  -2 223   1  16   2 220   0   0  d.  7 LineNr     70 ClauseNr    1:   2:   2: 200:   0   0 SentenceNr    45 TxtType: N       Pargr: 2          ClType:Defc
*  0  -1  16   0   0  .e  9 LineNr     71 ClauseNr    2:   1:   4: 106:  -2 -1011 SentenceNr    45 TxtType: N       Pargr: 2          ClType:Ptcp
*  0  -2 220   0   0  d.  8 LineNr     72 ClauseNr    1:   2:   2: 200:   0   0 SentenceNr    45 TxtType: N       Pargr: 2          ClType:Defc

First, note that these are consecutive clause atoms (see LineNr 68, 69, 70, 71, and 72). Also see the sentence numbers. They are numbered: 45, 46, 45, 45, 45. There are two sentences total, with 46 embedded within 45. And sentence 45 also has several clauses! The clause numbers within sentence 45 are 1 and 2. By bringing the sentence number and the clause number together, one can identify which clause a given clause atom belongs to.
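
The same lookup can be sketched in Python by grouping clause atoms on the pair (sentence number, clause number). The regular expression and the function name are assumptions made for this illustration; they rely only on the LineNr, ClauseNr, and SentenceNr labels visible in the sample.

```python
import re
from collections import defaultdict

STAR_LINES = [
    "*  0  -1 100   1 100   2 223   0   0  ..  6 LineNr     68 ClauseNr    1:   1:   1: 200:   0   0 SentenceNr    45 TxtType: N       Pargr: 2          ClType:NmCl",
    "*  0  -1 100   0   0  .e  8 LineNr     69 ClauseNr    1:   1:   1: 200:   0   0 SentenceNr    46 TxtType: N       Pargr: 2          ClType:NmCl",
    "*  0  -2 223   1  16   2 220   0   0  d.  7 LineNr     70 ClauseNr    1:   2:   2: 200:   0   0 SentenceNr    45 TxtType: N       Pargr: 2          ClType:Defc",
    "*  0  -1  16   0   0  .e  9 LineNr     71 ClauseNr    2:   1:   4: 106:  -2 -1011 SentenceNr    45 TxtType: N       Pargr: 2          ClType:Ptcp",
    "*  0  -2 220   0   0  d.  8 LineNr     72 ClauseNr    1:   2:   2: 200:   0   0 SentenceNr    45 TxtType: N       Pargr: 2          ClType:Defc",
]

# Captures the line number, clause number, and sentence number of a star line.
STAR_FIELDS = re.compile(r"LineNr\s+(\d+)\s+ClauseNr\s+(\d+):.*?SentenceNr\s+(\d+)")


def group_clause_atoms(star_lines):
    """Map (sentence number, clause number) to the line numbers of its clause atoms."""
    clauses = defaultdict(list)
    for line in star_lines:
        match = STAR_FIELDS.search(line)
        if match is None:
            continue  # not a star line
        line_nr, clause_nr, sentence_nr = map(int, match.groups())
        clauses[(sentence_nr, clause_nr)].append(line_nr)
    return dict(clauses)


for key, atoms in group_clause_atoms(STAR_LINES).items():
    print(key, atoms)
```

With these five star lines, clause 1 of sentence 45 collects the atoms at lines 68, 70, and 72, while line 71 is the only atom of clause 2 and line 69 is the only atom of sentence 46.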

As an aside, the clause number and the three fields following it (the least and greatest phrase numbers and the clause type code) are all followed by colons due to their original representation in the .usertab file, where the colon originally marked off a string label.

least & greatest phrase number
This number contains the least and greatest phrase (functional) number within a clause atom. Phrases are numbered consecutively within a clause atom, beginning with 1. Thus, normally the least phrase number is 1. In the Mesa example above, the least phrase number is 1: and the greatest is 2: . Therefore, the number of total phrases in that clause atom is 2.

There are cases, however, where a clause atom does not contain a complete phrase, since the phrase begins in a previous one. If there are other, complete phrases in the clause atom, the incomplete phrase is ignored and the numbering starts from the complete phrase. However, there are cases where a given clause atom does not have a single complete phrase at all. In that case, the phrase is numbered in accord with the previous clause atom. An example from genesis01.PX (vs. 7) will illustrate:


*  0  -1  10   0   0  .e  6 LineNr     21 ClauseNr    2:   1:   2: 200: -13 -1006 SentenceNr    19 TxtType: ?N      Pargr: 122        ClType:NmCl
*  0  -2 223   1  10   0   0  d.  5 LineNr     22 ClauseNr    1:   3:   3: 157:   0   0 SentenceNr    19 TxtType: ?N      Pargr: 122        ClType:Defc

There are two clause atoms. Note that in the first clause atom, there are two complete phrases (1: 2:). In the second clause atom, however, there is no complete phrase. The numbering from the first clause atom therefore carries over to the second, and the least and greatest phrase in the second clause atom is numbered as 3: 3:. Note also that even though this phrase begins in the first clause, it is not registered in the first clause since it already has two complete phrases. Only in cases where a clause atom contains no other complete phrase is the phrase numbering carried over from the previous clause atom.
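
As a small illustration, the four colon-terminated fields can be pulled out with a regular expression and the span of phrase numbers computed from them. The pattern and function name are hypothetical, and the star lines below are shortened copies of the Mesa and Genesis samples above.

```python
import re

# Clause number, least phrase number, greatest phrase number, clause type.
CLAUSE_FIELDS = re.compile(r"ClauseNr\s+(\d+):\s+(\d+):\s+(\d+):\s+(\d+):")


def phrase_range(star_line):
    """Return (least, greatest, count) for the phrase numbers of a clause atom."""
    match = CLAUSE_FIELDS.search(star_line)
    if match is None:
        raise ValueError("no ClauseNr fields found")
    _clause_nr, least, greatest, _clause_type = map(int, match.groups())
    return least, greatest, greatest - least + 1


mesa = "LineNr      1 ClauseNr    1:   1:   2: 200:   0   0 SentenceNr     1"
genesis = "LineNr     22 ClauseNr    1:   3:   3: 157:   0   0 SentenceNr    19"

print(phrase_range(mesa))     # least 1, greatest 2: two phrases
print(phrase_range(genesis))  # least 3, greatest 3: numbering carried over
```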

clause type
The clause type (functional) field contains an integer which corresponds to a type of clause. The values are provided below in the clause atom/clause type table. Note that the clause atom and clause share clause type codes, but the clause atom type is stored as a string whereas the clause is stored here as an integer.

clause constituent relation
The two integers in the clause constituent relation column contain data on relationships between two functional clauses. If there are multiple clause atoms in a clause, the constituent relation is only stored on the first clause atom.

The first integer describes the relation and can be referenced in the clause constituent relation table below.

The second integer describes the distance to the related clause, which can be conveyed in either words, phrase atoms, clause atoms, or sentence atoms. Use the flow chart below to calculate both the unit and distance.

  • if code ≥ 1000
    • distance unit = word
    • distance = code +/- 1000 (+ for negative codes, - for positive, etc.)
  • if 1000 > code ≥ 100
    • distance unit = phrase atom
    • distance = code +/- 100
  • if 100 > code ≥ 10
    • distance unit = clause atom
    • distance = code +/- 10
  • if 10 > code ≥ 1
    • distance unit = sentence atom
    • distance = code +/- 1
  • 0 is null
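
The flow chart can be sketched as a small Python function. One assumption is made loudly here: since the samples contain negative codes such as -1011 and -1006, the thresholds are applied to the absolute value of the code and the sign is carried over to the distance. The function name is hypothetical.

```python
# Thresholds from the flow chart, largest first.
UNITS = ((1000, "word"), (100, "phrase atom"), (10, "clause atom"), (1, "sentence atom"))


def decode_distance(code):
    """Return (unit, signed distance) for a distance code, or None for 0 (null)."""
    if code == 0:
        return None
    magnitude = abs(code)  # assumption: thresholds apply to the absolute value
    sign = 1 if code > 0 else -1
    for base, unit in UNITS:
        if magnitude >= base:
            return unit, sign * (magnitude - base)


# Codes taken from the exodus38.PX and genesis01.PX star lines above.
for code in (-1011, -1006, 0):
    print(code, decode_distance(code))
```

Under that reading, the code -1011 from the Exodus sample points 11 words back, and -1006 from the Genesis sample points 6 words back.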

sentence number
Sentences are numbered consecutively within chapters.

text type
The text type is a clause feature for “narrative”, “discourse”, “quotation”, or any combination/embedding of those features. The feature is repeated for every clause atom within that clause. The possible values are listed below in the text type table.
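
A combined value such as the ?N seen in the Genesis sample can be expanded character by character. The sketch below assumes that each character of the TxtType value stands for one entry of the text type table; the function name is hypothetical, and the sketch makes no claim about which type is embedded in which.

```python
# Values copied from the text type table.
TEXT_TYPES = {"?": "Unknown", "D": "Discursive", "N": "Narrative", "Q": "Quotation"}


def expand_text_type(value):
    """Expand a TxtType string into the names of its component text types."""
    return [TEXT_TYPES[char] for char in value]  # KeyError on an unlisted character


print(expand_text_type("Q"))
print(expand_text_type("?N"))
```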

paragraph number
Paragraphs are numbered consecutively within a text segment. A text segment begins at the root of the clause hierarchy. Paragraphs can be nested. As an example, paragraph 12 would refer to a nested paragraph 2 within a paragraph 1.
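
The nesting can be unpacked with a short Python sketch. It assumes that every nesting level is written as a single digit, which holds for the samples shown here (2, 12, 122) but is not confirmed for paragraph numbers above 9; the function name is hypothetical.

```python
def paragraph_path(pargr):
    """Split a paragraph number string into its nesting levels, outermost first."""
    if not pargr.isdigit():
        raise ValueError(f"unexpected paragraph number: {pargr!r}")
    return [int(digit) for digit in pargr]  # assumption: one digit per level


print(paragraph_path("12"))   # paragraph 2 nested within paragraph 1
print(paragraph_path("122"))  # value from the genesis01.PX sample above
```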

clause atom type
The clause atom type is a simple string conveying the internal structure of the clause atom.

Values
instructions
subtype 1                  subtype 2
.  NA                      .  NA
c  casus pendens           N  no connection
d  defective               \  downward connection
l  ellipsis                e  embedding
m  macrosyntactic sign     p  proleptic ellipsis
r  reopening               q  direct speech
clause atom/clause types
0 Defc Defective clause atom
99 Unkn Unknown
101 ZYq0 Zero-yiqtol-null clause
102 ZQt0 Zero-qatal-null clause
103 ZIm0 Zero-imperative-null clause
104 InfC Infinitive construct clause
105 InfA Infinitive absolute clause
106 Ptcp Participle clause
111 ZYqX Zero-yiqtol-X clause
112 ZQtX Zero-qatal-X clause
113 ZImX Zero-imperative-X clause
121 XYqt X-yiqtol clause
122 XQtl X-qatal clause
123 XImp X-imperative clause
131 xYq0 x-yiqtol-null clause
132 xQt0 x-qatal-null clause
133 xIm0 x-imperative-null clause
141 xYqX x-yiqtol-X clause
142 xQtX x-qatal-X clause
143 xImX x-imperative-X clause
151 WYq0 We-yiqtol-null clause
152 WQt0 We-qatal-null clause
153 WIm0 We-imperative-null clause
157 Way0 Wayyiqtol-null clause
161 WYqX We-yiqtol-X clause
162 WQtX We-qatal-X clause
163 WImX We-imperative-X clause
167 WayX Wayyiqtol-X clause
171 WXYq We-X-yiqtol clause
172 WXQt We-X-qatal clause
173 WXIm We-X-imperative clause
181 WxY0 We-x-yiqtol-null clause
182 WxQ0 We-x-qatal-null clause
183 WxI0 We-x-imperative-null clause
191 WxYX We-x-yiqtol-X clause
192 WxQX We-x-qatal-X clause
193 WxIX We-x-imperative-X clause
200 NmCl Nominal clause
213 AjCl Adjective clause
301 Voct Vocative clause
302 CPen Casus pendens
303 Ellp Ellipsis
304 MSyn Macrosyntactic sign
305 Reop Reopening
306 XPos Extraposition
clause constituent relations
-13 Attr Attributive clause
-6 Coor Coordinated clause
-5 Spec Specification clause
-2 RgRc Regens/rectum connection
502 Subj Subject clause
503 Objc Object clause
504 Cmpl Complement clause
505 Adju Adjunctive clause
521 PreC Predicate complement clause
525 PrAd Predicative adjunct clause
562 ReVo Referral to vocative
572 Resu Resumptive clause
text types
? Unknown
D Discursive
N Narrative
Q Quotation

2.6.2 CTT

The .CTT (“coded text tabulated”) file does not contain any new data on the text, but its contents are of special importance for presenting, using, and sharing the completed analysis of a text. The file contains a hierarchical layout of the text by entering in the indentations from the .PX file. There is also information on clause atom constituents and their functions.

Source

generated by syn04types with user interaction

Documentation

man CTT

Sample and Parts
verses 1-6 in Mesa1.CTT

The columns as they appear here are:
0. Line number (not an official part of the file format)
1. Verse Label
2. Person/Number/Gender of the predicate
3. Clause Atom Type of the daughter
4. Indication of the mother
5. Text Type
6. Paragraph Number
7. Clause Atom Number
8. Tabulation and Subtypes
9. Hierarchy made with the surface text from ct4.p