Author: Cody Kingham Published: 29.07.17 Updated: 03.04.18
This project describes the ETCBC text processing pipeline in detail, from the beginning stages of converting an ancient text into its machine-readable format, all the way up to the encoding of text level data on textual hierarchy. The ETCBC data pipeline has been in a process of evolution for forty years. Over that period of time, many resources about the ETCBC have been published, but few have gone into detail on the processes which go into creating the data.
From a research standpoint, the lack of documentation on the data creation process presents a methodological problem of reproducibility. First, in an age when text processing is booming through the growing field of digital humanities, exposing those processes becomes even more important. Second, the introduction of the database to the internet with SHEBANQ in 2014 has greatly expanded the accessibility of the ETCBC; at the same time, this increases the need for clear documentation of the procedures used to label and produce the data. In the same vein, the inclusion of the data in various software packages requires a clear, succinct account of the data to further enhance its impact and use.
While the present project does not describe every detail of the programs’ inner workings, it does give the user a good sense of how one would build an encoded text in the ETCBC format. The files in the example repository also provide specific reference points in the form of both encoded files and man pages on the programs which created them. The most important content is the description of the analysis files, wherein the output of the programs is represented.
The ETCBC data creation pipeline is run on a centralized server at the Vrije Universiteit Amsterdam, which researchers can log on to remotely. The server runs the SunOS flavor of Unix. Users who wish to become more familiar with the data creation processes should first get a basic understanding of how to work from the Unix command line. Those working on a Mac can already experiment on the command line by opening the Terminal app.
Most of the descriptions herein are collated from the man (“manual”) pages on the ETCBC server (many by Constantijn Sikkel), accessed by entering man [program/filename] from the Unix command line.
For a more detailed description on the rationale behind this project, read the internship report.
This diagram depicts the pipeline in the ETCBC’s data creation process. The pipeline can vary from project to project, and has changed over the years. The pipeline presented above represents a version used during the recent project on syntactic variation. The top level file, book.pil, contains the starting point for the process. The files below it in white make up the analysis files that hold and transmit the analyzed data. In the chart, green represents part of a process. Those processes are initiated with the commands indicated in green text in the center arrows (e.g. pil2wit, analyse, etc.). Thus, for example, provided that all the necessary input files are present, one can simply type in the command from the command line and initiate the process. Green boxes represent files utilized by the processes. The process files on the right are ported in from central locations on the server with UNIX commands like sccs. The process files on the left are generated by the pipeline itself.
The programs utilized by the data creation pipeline are well documented in the ETCBC server man pages. A copy of the man page for each program shown in the pipeline above is available in the files repository on GitHub.
Each analysis file represents a stage in the data creation process, progressing from word level to phrase and clause levels. Analysis files are plain text, and are frequently operated on with additional UNIX commands (e.g. sed, awk, sort, etc.) not reflected here (to see examples, view the example Makefile).
Each file description below contains:
1. the command/location to view documentation in the ETCBC server
2. a sample of the file selected from the ETCBC encoding of the Mesha Inscription (Hebrew/Moabite)
3. a diagram of the specific parts of the file
The .pil (“Peshitta Institute of Leiden”) file is the first step in entering an ancient text into a computer-readable format. It contains the chapter and verse boundaries, transcription of the document text, and notations of variants/reconstructions found in other witnesses. The file also supports the notation of lacunae and fragmentary readings in a text.
The .pil file is generated by processing a plain text document, perhaps with the original UTF-8 characters copied and pasted from an original source (book.txt in the diagram above).
plain text file, perhaps in unicode, processed into .pil format with sed, awk, cat, or other UNIX text processing commands
Format of a PIL Running Text File or /projects/calap/doc/format/format.pdf
@Mesa1 1 'nk m$` bn km$yt [km$/ 1A1, 1C1, 1D1] mlk [<??>mlk/ 2C1] m'b hdybny; 2 'by mlk `l m'b $l$n $t w'nk mlkty 'Hr 'by; 3 w'`$ hbmt z't lkm$ bqrHh; 4 bmt y$` [bm<???>$`/ 1A1, 1D1] [b<????>$`/ 1C1] ky h$`ny mkl hmlkn wky hr'ny bkl $n'y; 5 `mry mlk y$r'l wy`nw 't m'b ymn rbn ky y'np km$ b'rSh; ...
A more detailed description for each of these elements and some others, as well as the formatting rules for a .pil file, is in the documentation, Format of a PIL.
The .gt (“graphical text”) file contains a cleaned version of the plain text document, stripped of its .pil notations and transliterated into the ETCBC transliteration. The file also contains “directives” or markers for the document’s language, book name, and verse labels.
Each row of text is preceded by a verse header (%verse n,n) on the line above it. A blank line separates individual verses.
The .gt file extension is sometimes omitted.
generated by pil2wit
man -s5 gt
%bookname Mesa
%language hebrew

%verse 1,1
>NK MC< BN KMCJT MLK M>B HDJBNJ

%verse 1,2
>BJ MLK <L M>B CLCN CT W>NK MLKTJ >XR >BJ
...
Comments in a .gt file are prefixed with #.
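As a rough illustration of the layout described above, the following Python sketch reads the contents of a .gt file into a dictionary of directives and a dictionary of verses. The function name parse_gt is hypothetical and is not part of the ETCBC tooling; the sketch also assumes that directives each occupy their own line and that comments always begin at the start of a line, which the description above does not state explicitly.

```python
def parse_gt(text):
    """Collect directives and verse texts from the contents of a .gt file."""
    directives = {}   # e.g. {"bookname": "Mesa", "language": "hebrew"}
    verses = {}       # maps (chapter, verse) -> list of transliterated words
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank separator line or comment (assumed line-initial)
        if line.startswith("%verse"):
            chapter, verse = line.split()[1].split(",")
            current = (int(chapter), int(verse))
            verses[current] = []
        elif line.startswith("%"):
            parts = line[1:].split(None, 1)
            directives[parts[0]] = parts[1] if len(parts) > 1 else ""
        elif current is not None:
            verses[current].extend(line.split())
    return directives, verses


SAMPLE = """\
%bookname Mesa
%language hebrew

%verse 1,1
>NK MC< BN KMCJT MLK M>B HDJBNJ

%verse 1,2
>BJ MLK <L M>B CLCN CT W>NK MLKTJ >XR >BJ
"""

directives, verses = parse_gt(SAMPLE)
print(directives)
print(verses[(1, 1)])
```

Keying the verses by a (chapter, verse) tuple keeps the verse label from the header available for the later, verse-labelled files.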
The .an (“analysis”) file breaks down the individual words into their morphological parts: prefix, core (the lexeme), infix (morphemes within the lexeme), and suffix. The various morphemes are separated with marker characters. For instance, a prefix is set between exclamation marks: !J!QVL for יקטל. Verbal endings follow an opening square bracket ([), e.g. the suffix W in !J!QVL[W for יקטלו.
The file itself contains three columns with every word on its own row. The first “column” (technically only space-separated) contains the verse to which the word belongs. The second contains the surface form of the word. The third column contains the analytical representation of a word encoded with the ETCBC encoding.
The file is generated by matching the surface forms to a dictionary of previously analyzed words. The dictionary, an anzb (“Analytical ‘Zorg’ Book”) file, contains previously seen surface forms in one column with the encoding of their constituent parts in another. The user can compile anzb files from previous, composite anzbs, or they can compose their own based on their analysis of the text.
generated by analyse
The .an file has no individual man page.
For details on the encoding patterns, see man -s5 word_grammar.
A detailed description of the encoding convention used for the words is available in Verheij’s Grammatica Digitalis I. A shorter description for quick referencing can be found in Description of Quest II Data File Format.
1,1 >NK >NK(J
1,1 MC< M(JC</
1,1 BN BN/
1,1 KMCJT KMCJT/
1,1 MLK MLK/
1,1 M>B M(W>B/
1,1 HDJBNJ H-DJBNJ/
...
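The three-column layout of the .an file can be read with a few lines of Python. The sketch below is only an illustration: AnRow, parse_an, and split_markers are hypothetical names, and split_markers recognises only the two markers described above (a prefix between exclamation marks and a verbal ending after an opening square bracket). The other marker characters visible in the sample are left untouched in the middle part; see man -s5 word_grammar for the full convention.

```python
import re
from typing import NamedTuple


class AnRow(NamedTuple):
    verse: str      # verse label, e.g. "1,1"
    surface: str    # surface form of the word
    analysis: str   # morphologically encoded form


def parse_an(text):
    """Return one AnRow per word; lines without three fields are skipped."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            rows.append(AnRow(*fields))
    return rows


# Optional "!prefix!", then everything up to "[", then an optional ending.
_MARKERS = re.compile(r"^(?:!([^!]*)!)?([^\[]*)(?:\[(.*))?$")


def split_markers(analysis):
    """Return (prefix, rest, verbal_ending); absent parts are None."""
    return _MARKERS.match(analysis).groups()


SAMPLE = """\
1,1 >NK >NK(J
1,1 MC< M(JC</
1,1 BN BN/
1,1 KMCJT KMCJT/
1,1 MLK MLK/
1,1 M>B M(W>B/
1,1 HDJBNJ H-DJBNJ/
...
"""

rows = parse_an(SAMPLE)
print(rows[0])
print(split_markers("!J!QVL[W"))
```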
The .at (“analyzed text”) file contains the encoded morphology from .an, but split into separate files for each chapter. From this file on, analysis is performed on a chapter-by-chapter basis (e.g. for the Pesher to Habakkuk: 1QpHab1.at, 1QpHab2.at, etc.).
In the .at file, there are two pieces of data per row. The first is a book, chapter, and verse label (e.g. Mesa 1,1). The second is the morphologically encoded words separated by spaces. Each row is variable length, and verses can run across multiple lines, in which case the book, chapter, and verse label is simply repeated.
The .at file also contains a language identifier at the point where each language begins in the document.
generated by genat
man -s5 at also: man -s5 genat
%language hebrew
Mesa 1,1 >NK(J M(JC</ BN/ KMCJT/ MLK/ M(W>B/ H-DJBNJ/
Mesa 1,2 >B/-J MLK=[ <L M(W>B/ CLC/(JN C(N(H/T W->NK(J MLK=[TJ >XR/
Mesa 1,2 >B/-J
Mesa 1,3 W:n-!>!<F(H[ H-BM(H/T Z>T L-KM(WC/ B-QRXH=/
Mesa 1,4 BM(H/T JC<=/ KJ ]H](JC<[-NJ M(N-KL/ H-MLK/(JN W-KJ ]H]R>(H[-NJ
Mesa 1,4 B-KL/ FN>[/(J-J
...
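Because a verse in the .at file can run across several rows with a repeated label, a reader of the file has to merge those rows. The following Python sketch shows one way this might be done; parse_at is a hypothetical helper, and the sketch assumes the language identifier sits on its own line starting with %language.

```python
def parse_at(text):
    """Merge rows that share a book/chapter/verse label.

    Returns a dict mapping (book, "chapter,verse") to a tuple of
    (language, list of encoded words).
    """
    language = None
    verses = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "%language":
            language = fields[1]        # applies to the rows that follow
            continue
        if len(fields) < 3:
            continue                    # e.g. a trailing "..." line
        label = (fields[0], fields[1])
        if label not in verses:
            verses[label] = (language, [])
        verses[label][1].extend(fields[2:])
    return verses


SAMPLE = """\
%language hebrew
Mesa 1,1 >NK(J M(JC</ BN/ KMCJT/ MLK/ M(W>B/ H-DJBNJ/
Mesa 1,2 >B/-J MLK=[ <L M(W>B/ CLC/(JN C(N(H/T W->NK(J MLK=[TJ >XR/
Mesa 1,2 >B/-J
"""

verses = parse_at(SAMPLE)
for label, (language, words) in verses.items():
    print(label, language, len(words))
```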
Word level analysis files contain lexical and parsing data for individual words. The parsing is calculated by taking the morphological text from the .at file and applying rules found in process files such as word_grammar (see man word_grammar).
The .ps2 (“phrase structure 2”) file contains the word parsing data in a columnar format. The same format is preserved throughout the remainder of the process as additional columns are simply added in the subsequent .ps files (.ps3, .ps4, and .PX).
generated by at2ps
man ps2
MESA 01,01 >NKJ 0 7 -1 -1 -1 -1 -1 -1 1 1 -1 -1
MESA 01,01 MJC< 0 3 -1 -1 -1 -1 -1 -1 -1 -1 2 2
MESA 01,01 BN 0 2 -1 -1 -1 1 -1 -1 -1 1 2 0
MESA 01,01 KMCJT 0 3 -1 -1 -1 1 -1 -1 -1 1 0 2
MESA 01,01 MLK 0 2 -1 -1 -1 1 -1 -1 -1 1 2 0
MESA 01,01 MW>B 0 3 -1 -1 -1 1 -1 -1 -1 1 0 2
MESA 01,01 H 0 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
MESA 01,01 DJBNJ -2 13 -1 -1 -1 1 -1 -1 -1 1 0 0
*
...
Every word in the .ps2 file is placed on its own row. Each column represents lexical, morphological, or parsing information which has been produced by at2ps (or sometimes syn02) based on the .an file. Each value is stored as a simple integer. Unless otherwise noted below, -1 means that the category in question is not applicable (e.g. there are no preformatives in the diagram above, so all the words are set to -1 in that column). 0 can mean that the value is not present or is unknown, as with every term in the lexical set column above except for DJBNJ. However, a 0 in the part of speech column, for example, means that the word is an article.
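Since every .ps file shares this row layout (a book label, a verse reference, the word, then integer columns), a single small reader can serve for all of them. The Python sketch below is illustrative only: parse_ps_rows and the three index constants are hypothetical names, and the column positions are inferred from the sample (the lexical set is the first integer column, part of speech the second, and state the twelfth).

```python
LEXICAL_SET = 0      # first integer column
PART_OF_SPEECH = 1   # second integer column
STATE = 11           # twelfth integer column


def parse_ps_rows(text):
    """Return (book, reference, word, values) for each word row."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0] == "*":
            continue  # skip asterisk lines, blank lines and "..."
        book, ref, word = fields[:3]
        rows.append((book, ref, word, [int(v) for v in fields[3:]]))
    return rows


PS2_SAMPLE = """\
MESA 01,01 >NKJ 0 7 -1 -1 -1 -1 -1 -1 1 1 -1 -1
MESA 01,01 MJC< 0 3 -1 -1 -1 -1 -1 -1 -1 -1 2 2
MESA 01,01 BN 0 2 -1 -1 -1 1 -1 -1 -1 1 2 0
MESA 01,01 KMCJT 0 3 -1 -1 -1 1 -1 -1 -1 1 0 2
MESA 01,01 MLK 0 2 -1 -1 -1 1 -1 -1 -1 1 2 0
MESA 01,01 MW>B 0 3 -1 -1 -1 1 -1 -1 -1 1 0 2
MESA 01,01 H 0 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
MESA 01,01 DJBNJ -2 13 -1 -1 -1 1 -1 -1 -1 1 0 0
*
"""

rows = parse_ps_rows(PS2_SAMPLE)
for book, ref, word, values in rows:
    print(word, values[LEXICAL_SET], values[PART_OF_SPEECH], values[STATE])
```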
The potential codes and their corresponding values in a .ps2 file are presented below, as derived from the man pages of ps1, ps2, or morfset.
Lexical set is evaluated by comparing the lexical set column with the part of speech column. If the lexical set number is 0, then there is no value for lexical set. Otherwise, the combinations are presented below.
The rest of the columns are evaluated by simple correspondence to the id integer.
The .ps3 file contains boundaries and features for phrase atoms. New data in the .ps3 records the phrase dependent part of speech (for words within the phrase), phrase type, and phrase determination through an integer value.
The .ps3 file also contains the base information of .ps2, with some modifications. The state column, for instance, is re-evaluated on the basis of the phrase-level syntax and moved toward the phrase atom block.
generated by syn03 with user interaction
man ps3
MESA 01,01 >NKJ 0 7 -1 -1 -1 -1 -1 -1 1 1 -1 -1 7 7 2
MESA 01,01 MJC< 0 3 -1 -1 -1 1 -1 -1 -1 1 2 2 3 3 2
MESA 01,01 BN 0 2 -1 -1 -1 1 -1 -1 -1 1 2 1 2 0 -1
MESA 01,01 KMCJT 0 3 -1 -1 -1 1 -1 -1 -1 1 0 2 3 0 -1
MESA 01,01 MLK 0 2 -1 -1 -1 1 -1 -1 -1 1 2 1 2 0 -1
MESA 01,01 MW>B 0 3 -1 -1 -1 1 -1 -1 -1 1 0 2 3 0 -1
MESA 01,01 H 0 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 -1
MESA 01,01 DJBNJ -2 13 -1 -1 -1 1 -1 -1 -1 1 0 2 3 -2 2
*
...
Note that in the .ps3 file, the state column has shifted right into the phrase atom block. Note also that some of the values for state have changed in light of the phrase level analysis. To illustrate, compare the 5th word in the file, MLK in the .ps2:
MESA 01,01 MLK 0 2 -1 -1 -1 1 -1 -1 -1 1 2 0
The last column reads 0, meaning that MLK, in its surface form, is in an unknown state (since the surface form מלך can be either absolute or construct). But in the .ps3 file, the value in the state column (col. 12) has now changed to 1, i.e. construct, because of that term’s relation to the following word, MW>B (“Moab”; thus, מלך מואב “king of Moab”).
MESA 01,01 MLK 0 2 -1 -1 -1 1 -1 -1 -1 1 2 1 2 0 -1
Phrase atom boundaries are implicitly communicated through the presence of a value for phrase atom type. Only the last word of a phrase atom is marked in the .ps3, through a phrase type value other than 0. The corresponding features of the integer value can be seen in the table below. If a value is negative, it communicates a phrase atom in apposition, of the type given by the corresponding positive value.
For instance, in the sample above BN has a phrase type value of 0 (i.e. the second to last column) and immediately follows a row with a positive value. This means that BN begins a phrase atom. The next 4 words also have a value of 0 (null) and thus belong to the phrase atom begun by BN. The last word in the phrase atom (DJBNJ) is marked with a value of -2, meaning that the phrase atom is in apposition (since the value is negative) and that it is a nominal phrase atom (since the absolute value is 2, see the table below). So the whole phrase atom is "BN KMCJT MLK MW>B H DJBNJ", which is a nominal phrase in apposition to the preceding phrase atom, MJC<.
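The boundary rule just described (collect words until a row carries a non-zero phrase type) can be sketched in Python as follows. The helper phrase_atoms and the constant PHRASE_TYPE are hypothetical; the column position (the fourteenth integer column) is read off the sample rather than taken from the man page.

```python
PHRASE_TYPE = 13  # index of the phrase type among the integer columns

PS3_SAMPLE = """\
MESA 01,01 >NKJ 0 7 -1 -1 -1 -1 -1 -1 1 1 -1 -1 7 7 2
MESA 01,01 MJC< 0 3 -1 -1 -1 1 -1 -1 -1 1 2 2 3 3 2
MESA 01,01 BN 0 2 -1 -1 -1 1 -1 -1 -1 1 2 1 2 0 -1
MESA 01,01 KMCJT 0 3 -1 -1 -1 1 -1 -1 -1 1 0 2 3 0 -1
MESA 01,01 MLK 0 2 -1 -1 -1 1 -1 -1 -1 1 2 1 2 0 -1
MESA 01,01 MW>B 0 3 -1 -1 -1 1 -1 -1 -1 1 0 2 3 0 -1
MESA 01,01 H 0 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 -1
MESA 01,01 DJBNJ -2 13 -1 -1 -1 1 -1 -1 -1 1 0 2 3 -2 2
*
"""


def phrase_atoms(text):
    """Group words into phrase atoms; each atom ends at a non-zero type."""
    atoms, words = [], []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0] == "*":
            continue
        words.append(fields[2])
        ptype = int(fields[3 + PHRASE_TYPE])
        if ptype != 0:
            atoms.append({
                "words": words,
                "type": abs(ptype),         # look up in the phrase type table
                "apposition": ptype < 0,    # negative values mark apposition
            })
            words = []
    return atoms


atoms = phrase_atoms(PS3_SAMPLE)
for atom in atoms:
    print(atom)
```

For the sample this yields three phrase atoms, the last of which is the six-word nominal atom in apposition discussed above.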
The values for state and part of speech (phrase dependent) remain the same as in .ps2. The other column values are:
The .ps3.p (“.ps3 parsed”) file contains data on subphrases and subphrase relations. The subphrase allows for recursive embedding (limited to 3 levels on a single word) of various subphrase relationships. For instance, they record relationships between individual elements of a phrase, such as a nomen rectum/regens construction. The .ps3.p file contains the same columns of a .ps3 plus three additional columns.
generated by parsephrases alongside user modifications
man ps3.p
MESA 01,01 >NKJ 0 7 -1 -1 -1 -1 -1 -1 1 1 -1 -1 7 7 2 -1 -1 -1
MESA 01,01 MJC< 0 3 -1 -1 -1 1 -1 -1 -1 1 2 2 3 3 2 -1 -1 -1
MESA 01,01 BN 0 2 -1 -1 -1 1 -1 -1 -1 1 2 1 2 0 -1 2 -1 -1
MESA 01,01 KMCJT 0 3 -1 -1 -1 1 -1 -1 -1 1 0 2 3 0 -1 -10002 106 -1
MESA 01,01 MLK 0 2 -1 -1 -1 1 -1 -1 -1 1 2 1 2 0 -1 2 -1 -1
MESA 01,01 MW>B 0 3 -1 -1 -1 1 -1 -1 -1 1 0 2 3 0 -1 -10002 -20106 306
MESA 01,01 H 0 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 -1 -1 -1 -1
MESA 01,01 DJBNJ -2 13 -1 -1 -1 1 -1 -1 -1 1 0 2 3 -2 2 -20106 -1 -1
*
The .ps3.p file introduces three new columns to the .ps3 structure, a subphrase block. Because a word can appear within as many as three different subphrases (and more—this is a shortcoming of the .ps3.p format), three columns are provided for a given word. Only the endpoint of a subphrase is explicitly marked, while the beginning of the subphrase must be calculated. The endpoint is indicated with a subphrase code. In the chart above, any word (i.e. row) with a value other than -1 (NA) represents the end of a subphrase. Note that in the case of the word MW>B (row 6), three different subphrase endpoints are marked.
The subphrase codes contained in the .ps3.p require some extended explanation.
The codes encapsulate data about the beginning point of the subphrase, distance from the subphrase’s mother, and the relationship to the mother. That information is calculated with the following formulae, wherein abs stands for absolute value, / is the division sign, * for multiplication, and % is for the modulo:
distance to beginning of subphrase: -1 * (abs(code) % 10000) / 100
distance to mother: code / 10000
relationship code: abs(code) % 100
Using this information, we can interpret the three subphrases stored for the word MW>B. The subphrase codes for MW>B are -10002, -20106, and 306. Below is the interpretation for the code -10002. Note that only the whole number from the division is considered in the calculations (e.g. 0 instead of -0.02):
distance to beginning: -1 * (abs(-10002) % 10000) / 100 = -0.02, or 0
distance to mother: -10002 / 10000 = -1.0002, or -1
relationship code: abs(-10002) % 100 = 2
Based on this information, we can reconstruct the subphrase. The first word in the subphrase is MW>B itself (distance 0), and the subphrase also ends with MW>B (since it holds the subphrase code). The entire subphrase is then MW>B. Its mother is one word back (distance -1), which is MLK. Finally, relationship code 2 means that the relationship of MW>B to MLK is that of a nomen rectum or genitive (as in MLK MW>B, or “king of Moab”).
The other two codes can be evaluated in the same way. The code -20106 evaluates as: dist. to beginning = -1, dist. to mother = -2, relationship code = 6. Thus, the first word in this subphrase is MLK. The subphrase ends with MW>B (since it holds the subphrase code). The entire phrase is then MLK MW>B. The mother of the phrase is -2 words away, which places it as KMCJT (“Kemeshyat”). And its relationship to its mother is code 6, which means parallel. Since Kemeshyat itself marks the end of a subphrase, this subphrase is also parallel with the subphrase of KMCJT: so “I am Mesa, son of Kemeshyat // king of Moab”.
In the same way, the final code 306 evaluates as beginning = -3, mother = 0, rela = 6. The first word of the subphrase is 3 words back, thus BN. The whole subphrase is: BN KMCJT MLK MW>B (“son of Kemeshyat, King of Moab”). It is a mother subphrase, since it does not itself have a mother (0). Its relationship to its daughter is 6 or parallel. If we move down the .ps3.p two rows and evaluate DJBNJ and its subphrase code of -20106, we see that it is the daughter of this subphrase, and that its relationship is also parallel: BN KMCJT MLK MW>B // H DJBNJ (“Son of Kemeshyat, King of Moab” // “the Dibonite”).
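The three worked examples above can be checked with a short Python sketch. The function name decode_subphrase is hypothetical. One detail deserves care: the text keeps only the whole number of each division, dropping the fraction, so the sketch uses int(code / 10000), which truncates toward zero. Python's floor division operator (//) would instead round -1.0002 down to -2 and give the wrong mother distance for negative codes.

```python
def decode_subphrase(code):
    """Return (distance to beginning, distance to mother, relationship code)."""
    # The operand is non-negative here, so floor division equals truncation.
    to_beginning = -((abs(code) % 10000) // 100)
    # int() truncates toward zero, matching "only the whole number is considered".
    to_mother = int(code / 10000)
    relation = abs(code) % 100
    return to_beginning, to_mother, relation


for code in (-10002, -20106, 306):
    print(code, decode_subphrase(code))
```

Running the sketch on the three codes stored for MW>B reproduces the values derived in the text: (0, -1, 2), (-1, -2, 6), and (-3, 0, 6).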
The values for the relationship codes are provided below.
The .ps4 file contains clause atom divisions as established by syn04 with user input. .ps4 utilizes the data from the .ps3 (n.b., not .ps3.p). It repurposes the placeholder asterisks in the .ps3 file, which previously marked only the end of a verse, to the end of each clause atom. An additional column is added to the .ps4 format, but that data is currently deprecated and no longer referenced.
generated by syn04 with user interaction
man ps4
MESA 01,01 >NKJ 0 7 -1 -1 -1 -1 -1 -1 1 1 -1 -1 7 7 2 -1
MESA 01,01 MJC< 0 3 -1 -1 -1 1 -1 -1 -1 1 2 2 3 3 2 -1
MESA 01,01 BN 0 2 -1 -1 -1 1 -1 -1 -1 1 2 1 2 0 -1 -1
MESA 01,01 KMCJT 0 3 -1 -1 -1 1 -1 -1 -1 1 0 2 3 0 -1 -1
MESA 01,01 MLK 0 2 -1 -1 -1 1 -1 -1 -1 1 2 1 2 0 -1 -1
MESA 01,01 MW>B 0 3 -1 -1 -1 1 -1 -1 -1 1 0 2 3 0 -1 -1
MESA 01,01 H 0 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 -1 -1
MESA 01,01 DJBNJ -2 13 -1 -1 -1 1 -1 -1 -1 1 0 2 3 -2 2 -1
*
MESA 01,02 >B 0 2 -1 -1 -1 1 -1 -1 -1 1 2 1 2 0 -1 -1
MESA 01,02 J -1 7 -1 -1 -1 -1 -1 -1 1 1 -1 -1 7 2 2 -1
MESA 01,02 MLK= 0 1 0 0 1 -1 -1 2 3 1 2 -1 1 1 -1 -1
...
Though the asterisk has only served to demarcate verse boundaries up to this point in the .ps files, it now demarcates clause atom boundaries. Though the boundaries in this case did not change for verse 1 of Mesa1.ps4, they did change for verse 2. Compare verse 2 in Mesa1.ps3 to Mesa1.ps4 in the examples repository. The deprecated “catm_flag” column no longer transmits any valuable data and can be disregarded (see man syn04 for more information).
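As a small illustration of how the asterisk segments a .ps4 file, the Python sketch below splits the word rows into clause atoms. The helper split_clause_atoms is hypothetical, and the sketch assumes that each asterisk begins a line of its own, which the flattened sample does not confirm. Words after the last asterisk are returned separately, because the sample breaks off before their clause atom is closed.

```python
PS4_SAMPLE = """\
MESA 01,01 >NKJ 0 7 -1 -1 -1 -1 -1 -1 1 1 -1 -1 7 7 2 -1
MESA 01,01 MJC< 0 3 -1 -1 -1 1 -1 -1 -1 1 2 2 3 3 2 -1
MESA 01,01 BN 0 2 -1 -1 -1 1 -1 -1 -1 1 2 1 2 0 -1 -1
MESA 01,01 KMCJT 0 3 -1 -1 -1 1 -1 -1 -1 1 0 2 3 0 -1 -1
MESA 01,01 MLK 0 2 -1 -1 -1 1 -1 -1 -1 1 2 1 2 0 -1 -1
MESA 01,01 MW>B 0 3 -1 -1 -1 1 -1 -1 -1 1 0 2 3 0 -1 -1
MESA 01,01 H 0 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 -1 -1
MESA 01,01 DJBNJ -2 13 -1 -1 -1 1 -1 -1 -1 1 0 2 3 -2 2 -1
*
MESA 01,02 >B 0 2 -1 -1 -1 1 -1 -1 -1 1 2 1 2 0 -1 -1
MESA 01,02 J -1 7 -1 -1 -1 -1 -1 -1 1 1 -1 -1 7 2 2 -1
MESA 01,02 MLK= 0 1 0 0 1 -1 -1 2 3 1 2 -1 1 1 -1 -1
...
"""


def split_clause_atoms(text):
    """Return (closed clause atoms, words not yet closed by an asterisk)."""
    atoms, current = [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields == ["..."]:
            continue
        if fields[0] == "*":
            if current:
                atoms.append(current)
            current = []
        else:
            current.append(fields[2])   # third field is the word itself
    return atoms, current


atoms, unfinished = split_clause_atoms(PS4_SAMPLE)
print(atoms)
print(unfinished)
```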
The .ps4.p (“ps4 parsed”) file contains data on the clause constituents, i.e. phrases, and their functions within the clause atoms. It also contains data on phrase-internal relations at the level of phrase atoms. For instance, a noun phrase from the .ps3 file might be evaluated as a subject phrase within the clause established in .ps4. Additionally, a subject phrase may have another noun phrase which functions in apposition to it which is categorized as a phrase atom relation.
The .ps4.p file combines the subphrase block of the .ps3.p with the clause divisions of .ps4.
generated by parseclauses with user interaction
man ps4.p and man -s5 ct (for constituent codes)
MESA 01,01 >NKJ 0 7 -1 -1 -1 -1 -1 -1 1 1 -1 -1 7 7 2 -1 -1 -1 0 502 -1
MESA 01,01 MJC< 0 3 -1 -1 -1 1 -1 -1 -1 1 2 2 3 3 2 -1 -1 -1 0 521 -1
MESA 01,01 BN 0 2 -1 -1 -1 1 -1 -1 -1 1 2 1 2 0 -1 2 -1 -1 -1 -1 -1
MESA 01,01 KMCJT 0 3 -1 -1 -1 1 -1 -1 -1 1 0 2 3 0 -1 -10002 106 -1 -1 -1 -1
MESA 01,01 MLK 0 2 -1 -1 -1 1 -1 -1 -1 1 2 1 2 0 -1 2 -1 -1 -1 -1 -1
MESA 01,01 MW>B 0 3 -1 -1 -1 1 -1 -1 -1 1 0 2 3 0 -1 -10002 -20106 306 -1 -1 -1
MESA 01,01 H 0 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 -1 -1 -1 -1 -1 -1 -1
MESA 01,01 DJBNJ -2 13 -1 -1 -1 1 -1 -1 -1 1 0 2 3 -2 2 -20106 -1 -1 -11 500 -1
*
In the figure above, one can see that the new clause constituent columns introduced by .ps4.p are inserted between the subphrase columns from .ps3.p and the defunct clause column from .ps4. The clause boundary asterisk has also been merged into the new file.
Like the codes utilized in .ps3.p, the codes in .ps4.p require some additional explanation.
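There are two columns of new data: distance to mother and constituent codes. Together, the two columns convey either a clause constituent function (when the distance is 0) or a phrase atom relation (when the distance is less than 0).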
Like the .ps3.p file with subphrases, only the endpoint of a phrase is indicated, while the start point must be inferred. In this case, the ending of a phrase is marked by a word with a distance value other than -1 (which indicates a null value). For instance, both >NKJ and MJC< (rows 1, 2) function as single word phrases, since they are preceded by no unmarked words, and since they both have a distance value other than -1. DJBNJ (last row) also marks the end of a phrase, but it is preceded by 5 words with null values. It is inferred, then, that those words belong to the phrase demarcated by DJBNJ (the full phrase being BN KMCJT MLK MW>B HDJBNJ, “son of Kemeshyat, king of Moab, the Dibonite”).
If the phrase boundary has a distance of 0, it represents a clause constituent. Its constituent code in the second column can then be referenced against the clause constituents table below. For instance, the first phrase, >NKJ, is marked 0 and is thus a clause constituent. Its constituent code is 502, which stands for a subject phrase (Subj).
Phrase endpoints with a distance less than 0 convey a phrase atom relationship to another phrase atom. In this case, the distance code must be further parsed to determine the phrase atom’s mother. The distance conveyed may be expressed in word, phrase atom, or clause atom units. To determine both the distance and which unit is being used, follow the flowchart below:
The relationship to the mother is expressed in the subsequent constituents column. Simply consult the phrase atom relations table below for the value of the code.
To illustrate, DJBNJ has a non-zero value for distance, and thus stores a phrase atom relation. The code is -11. Since the code’s absolute value is greater than 10, we know that the distance conveyed by the distance code is measured in phrase atoms. That distance is calculated by adding 10 to the code. The distance to the mother phrase atom is thus -1 phrase atoms. If we count back one phrase atom from the beginning of this phrase atom (BN KMCJT MLK MW>B HDJBNJ), we see that the mother is the predicate complement phrase atom, MJC< (“Mesa”). When we look up DJBNJ’s constituent code in the phrase atom relations table, we see that the relationship conveyed (code 500) is apposition. Thus, the phrase marked off by DJBNJ, BN KMCJT MLK MW>B HDJBNJ (“son of Kemeshyat, king of Moab, the Dibonite”) functions in apposition to its mother phrase atom, MJC< (“Mesa”).
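The reading of the two new columns can be sketched in Python. The triples below are the word plus the 19th and 20th integer columns of the .ps4.p sample rows; phrases and interpret are hypothetical helper names. Only the two cases spelled out in the text are handled (a distance of 0, and a negative code whose absolute value is greater than 10); the word-unit and clause-atom-unit cases depend on the flowchart and are deliberately left uninterpreted. The mother lookup at the end additionally assumes that each entry in the result is a single phrase atom, as is the case in this sample.

```python
SAMPLE = [
    (">NKJ", 0, 502), ("MJC<", 0, 521),
    ("BN", -1, -1), ("KMCJT", -1, -1), ("MLK", -1, -1),
    ("MW>B", -1, -1), ("H", -1, -1), ("DJBNJ", -11, 500),
]


def phrases(words):
    """Group words into phrases; a distance other than -1 ends a phrase."""
    result, current = [], []
    for word, distance, code in words:
        current.append(word)
        if distance != -1:
            result.append((current, distance, code))
            current = []
    return result


def interpret(distance):
    """Classify a distance value (partial: see the assumptions above)."""
    if distance == 0:
        return ("clause constituent", None)
    if distance < -10:
        # Distance in phrase atoms is obtained by adding 10 to the code.
        return ("phrase atom relation", distance + 10)
    return ("not covered in this sketch", None)


found = phrases(SAMPLE)
for index, (words, distance, code) in enumerate(found):
    kind, offset = interpret(distance)
    mother = found[index + offset][0] if offset is not None else None
    print(" ".join(words), kind, code, mother)
```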
The .PX (“parsed text”) file contains data at the highest level of analysis, the text level. This includes clause atom relations, clause atom hierarchy, text type, and clause atom type. It also contains new data for functional units such as clauses (made up of atoms), sentences, and paragraphs, with data for each of those units including clause type, sentence number, and paragraph number.
The .PX file also contains the previously analyzed data from the .ps4.p, which itself had combined the clause atom data with phrase level data. .PX thus gives an overview of the full text, and through the clause hierarchies (combined with the newly generated .CTT file, see below) it offers new insight into the text as a whole unit.
generated by syn04types with user interaction
man PX, man usertab, man CARC
MESA 01,01 >NKJ 0 7 -1 -1 -1 -1 -1 -1 1 1 -1 -1 7 7 2 -1 -1 -1 0 502 0
MESA 01,01 MJC< 0 3 -1 -1 -1 1 -1 -1 -1 1 2 2 3 3 2 -1 -1 -1 0 521 0
MESA 01,01 BN 0 2 -1 -1 -1 1 -1 -1 -1 1 2 1 2 0 -1 2 -1 -1 -1 -1 -1
MESA 01,01 KMCJT 0 3 -1 -1 -1 1 -1 -1 -1 1 0 2 3 0 -1 -10002 106 -1 -1 -1 -1
MESA 01,01 MLK 0 2 -1 -1 -1 1 -1 -1 -1 1 2 1 2 0 -1 2 -1 -1 -1 -1 -1
MESA 01,01 MW>B 0 3 -1 -1 -1 1 -1 -1 -1 1 0 2 3 0 -1 -10002 -20106 306 -1 -1 -1
MESA 01,01 H 0 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 -1 -1 -1 -1 -1 -1 -1
MESA 01,01 DJBNJ -2 13 -1 -1 -1 1 -1 -1 -1 1 0 2 3 -2 2 -20106 -1 -1 -11 500 0
* 0 1 120 7 100 19 470 52 120 0 0 .N 0 LineNr 1 ClauseNr 1: 1: 2: 200: 0 0 SentenceNr 1 TxtType: Q Pargr: 1 ClType:NmCl
The asterisk has up to now only segmented clause atoms. It is now followed by a row of 14 data columns. The columns on the “star line,” as the row is known, break down further into 12 columns of linguistic data and 2 columns of zero padding that simply serve to segment the data (helpful for parsing the end of the list with computer code). Each of those columns is described below.
beginning of data: zero padding to mark the beginning of the clause atom data
clause atom relation list: The clause atom relation list varies in length, depending on how many relationships the clause atom shares with other clause atoms. Within the list, clause atoms that are related to the clause atom at hand are represented with pairs of numbers separated by a single space.
The first number of each pair contains the distance (in clause atoms) from the present clause atom to the related mother/daughter atom. If the distance is negative, the relationship is upward in the hierarchy tree; if it is positive, the relationship is downward in the tree. Most clause atoms have a mother clause atom (no more than one). To find it, get the clause atom from the list with the greatest negative distance.
There are a few exceptions to the instructions above for clause atoms that either serve as the root (such as in the example above!) or have a downward connection. Root clause atoms have no mother and will usually have an “instructions” value of N (no connection; see two fields to the right, second character). Occasionally a root, motherless atom might also have an instructions value of \ (downward relation). But a downward connection might also indicate a connection to a mother, depending on the nature of the clause atom it is being related to. It must be determined whether the related atom or the clause atom at hand functions as the mother for the following atoms in the tree. If the related atom is indeed a mother, it will be found by taking the clause atom with the greatest positive distance.
The second number of each pair is a relationship code (“CARC”, or clause atom relation code) which conveys the specific nature of the relationship. The code is composed of three digits: the first digit refers to a lemma class (such as a conditional conjunction or parallel conjunction), the second digit refers to the verb type of the daughter clause’s main verb, and the third digit refers to the verb type of the mother clause’s main verb. To get the corresponding value of those codes, refer to the extensive tables and descriptions available in the Text Fabric clause relations documentation or the CodesList files in the examples repo.
To illustrate using the example above, the first clause atom contained in Mesa 1:1 has 4 clause atom relations in the list. The distances for all of the relations are positive (1, 7, 19, and 52), and the instructions value (two fields over) is N, which means that there is no mother clause atom. The first daughter is 1 clause atom below the present atom. Its relation code is 120, which means that the daughter is an asyndetic clause (1--) with a perfect verb (-2-) whose mother clause (the present one) has no verb (--0). The second daughter is seven atoms down and has a code of 100, which means it is asyndetic (1--) and verbless (-0-), and is connected to the present verbless clause (--0). The third daughter is nineteen atoms down and has a code of 470, which means it has a coordinating conjunction (4--) and a wayyiqtol main verb (-7-), and that its mother (this one) is verbless (--0).
To provide a brief example of a mother relationship with negative distance in the subsequent clause atom’s star line:
* 0 -1 120 1 422 0 0 .. 4 LineNr 2 ClauseNr 1: 1: 4: 122: 0 0 SentenceNr 2 TxtType: Q Pargr: 1 ClType:XQtl
The first clause atom in the relation list is -1 120, which means that the mother is back one clause atom, and is asyndetic (1--). The daughter clause (the present one) has a perfect verb (-2-) and the mother has no verb (--0).
data separator: The space-separated double zero, 0 0, signals the end of the clause atom relations list and the beginning of the instructions data.
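The opening fields of a star line can be read mechanically: skip the leading zero, collect pairs until the 0 0 separator, then take the two fields that follow it. The Python sketch below illustrates this with the two Mesa star lines shown earlier. The names parse_star_line and split_carc are hypothetical. split_carc only splits three-digit codes into their digits, as described above; shorter codes, which appear in some other star lines, are returned as None rather than guessed at.

```python
def parse_star_line(line):
    """Return (relation pairs, instructions, tab) from a .PX star line."""
    tokens = line.split()
    if tokens[:2] != ["*", "0"]:
        raise ValueError("expected a star line starting with '* 0'")
    pairs = []
    i = 2
    # Read (distance, code) pairs until the "0 0" separator is reached.
    while i + 1 < len(tokens) and tokens[i:i + 2] != ["0", "0"]:
        pairs.append((int(tokens[i]), int(tokens[i + 1])))
        i += 2
    instructions = tokens[i + 2]   # two-character instructions field
    tab = int(tokens[i + 3])       # indentation in the hierarchy
    return pairs, instructions, tab


def split_carc(code):
    """Split a three-digit code into (lemma class, daughter verb, mother verb)."""
    if not 100 <= code <= 999:
        return None
    return code // 100, code // 10 % 10, code % 10


FIRST = ("* 0 1 120 7 100 19 470 52 120 0 0 .N 0 LineNr 1 ClauseNr 1: 1: 2: "
         "200: 0 0 SentenceNr 1 TxtType: Q Pargr: 1 ClType:NmCl")
SECOND = ("* 0 -1 120 1 422 0 0 .. 4 LineNr 2 ClauseNr 1: 1: 4: 122: 0 0 "
          "SentenceNr 2 TxtType: Q Pargr: 1 ClType:XQtl")

for star_line in (FIRST, SECOND):
    pairs, instructions, tab = parse_star_line(star_line)
    print(instructions, tab, [(d, split_carc(c)) for d, c in pairs])
```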
instructions: Instructions represents data for special kinds of clause atoms in the textual hierarchy. There are two slots, which each describe a certain kind, or subtype, of clause atom. Subtype 1 describes why a clause atom does not have a predicate, if it indeed does not. This corresponds to special types of clause atoms such as ellipsis, casus pendens, etc. The second subtype indicates any special status of the clause atom in the hierarchy, with values such as q for direct speech, # for a new paragraph, or e for embedding (also N for no connections, which we have seen already). All of the possible values are below in the instructions table.
tab/indentation: This field contains a simple integer which describes how many tabs in the hierarchy the clause atom is to be indented.
line number: This field derives from the .usertab file (see usertab) where it is used to check the integrity of the file. Since these numbers are consecutive for every clause atom, it might be used in the .PX file to count which clause atom is being referred to.
clause number: This feature introduces the functional clause. Clauses are numbered consecutively within a sentence. The clause number can also be used to identify which clause a clause atom belongs to. For example, if there are numerous clauses and clause atoms within a sentence, one would first locate the sentence number and then the clause number. An example from exodus38.PX (vs. 26) helps to illustrate; only the star lines are given:
* 0 -1 100 1 100 2 223 0 0 .. 6 LineNr 68 ClauseNr 1: 1: 1: 200: 0 0 SentenceNr 45 TxtType: N Pargr: 2 ClType:NmCl
* 0 -1 100 0 0 .e 8 LineNr 69 ClauseNr 1: 1: 1: 200: 0 0 SentenceNr 46 TxtType: N Pargr: 2 ClType:NmCl
* 0 -2 223 1 16 2 220 0 0 d. 7 LineNr 70 ClauseNr 1: 2: 2: 200: 0 0 SentenceNr 45 TxtType: N Pargr: 2 ClType:Defc
* 0 -1 16 0 0 .e 9 LineNr 71 ClauseNr 2: 1: 4: 106: -2 -1011 SentenceNr 45 TxtType: N Pargr: 2 ClType:Ptcp
* 0 -2 220 0 0 d. 8 LineNr 72 ClauseNr 1: 2: 2: 200: 0 0 SentenceNr 45 TxtType: N Pargr: 2 ClType:Defc
First, note that these are consecutive clause atoms (see LineNr 68, 69, 70, 71, and 72). Also see the sentence numbers. They are numbered: 45, 46, 45, 45, 45. There are two sentences in total, with 46 embedded within 45. And sentence 45 also has several clauses! The clause numbers within sentence 45 are 1 and 2. By bringing the sentence number and clause number together, one can identify which clause a given clause atom belongs to.
As an aside, the clause number and the three other fields following it (the least/greatest phrase numbers and the clause constituency code) are all followed by colons due to their original representation in the .usertab file where the colon originally marked off a string label.
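The lookup described for the exodus38.PX excerpt (sentence number first, then clause number) can be sketched in Python. This is a minimal sketch, not part of the ETCBC tooling: the function name is hypothetical, and the regular expression assumes that every star line carries the LineNr, ClauseNr, and SentenceNr labels exactly as they appear in the excerpt.

```python
import re
from collections import defaultdict

# Star lines from exodus38.PX (vs. 26), copied from the excerpt in this section.
EXODUS_STAR_LINES = """\
* 0 -1 100 1 100 2 223 0 0 .. 6 LineNr 68 ClauseNr 1: 1: 1: 200: 0 0 SentenceNr 45 TxtType: N Pargr: 2 ClType:NmCl
* 0 -1 100 0 0 .e 8 LineNr 69 ClauseNr 1: 1: 1: 200: 0 0 SentenceNr 46 TxtType: N Pargr: 2 ClType:NmCl
* 0 -2 223 1 16 2 220 0 0 d. 7 LineNr 70 ClauseNr 1: 2: 2: 200: 0 0 SentenceNr 45 TxtType: N Pargr: 2 ClType:Defc
* 0 -1 16 0 0 .e 9 LineNr 71 ClauseNr 2: 1: 4: 106: -2 -1011 SentenceNr 45 TxtType: N Pargr: 2 ClType:Ptcp
* 0 -2 220 0 0 d. 8 LineNr 72 ClauseNr 1: 2: 2: 200: 0 0 SentenceNr 45 TxtType: N Pargr: 2 ClType:Defc
"""

# Assumption: the labels below occur once per star line, in this order.
FIELD_RE = re.compile(r"LineNr\s+(\d+)\s+ClauseNr\s+(\d+):.*?SentenceNr\s+(\d+)")


def group_clause_atoms(text):
    """Map (sentence number, clause number) to the LineNr values of its clause atoms."""
    clauses = defaultdict(list)
    for line in text.splitlines():
        match = FIELD_RE.search(line)
        if match is None:
            continue  # not a star line in the assumed layout
        line_nr, clause_nr, sentence_nr = map(int, match.groups())
        clauses[(sentence_nr, clause_nr)].append(line_nr)
    return dict(clauses)


if __name__ == "__main__":
    for key, atoms in group_clause_atoms(EXODUS_STAR_LINES).items():
        print(key, atoms)
```

Grouped this way, clause 1 of sentence 45 is made up of the clause atoms on lines 68, 70, and 72, while line 71 is clause 2 of sentence 45 and line 69 is clause 1 of sentence 46.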
least & greatest phrase number This field contains the least and greatest (functional) phrase numbers within a clause atom. Phrases are numbered consecutively within a clause atom, beginning with 1. Thus, normally the least phrase number is 1. In the Mesa example above, the least phrase number is 1: and the greatest is 2:. Therefore, the total number of phrases in that clause atom is 2.
There are cases, however, where a clause atom does not contain a complete phrase, since the phrase begins in a previous one. If there are other, complete phrases in the clause atom, the incomplete phrase is ignored and the numbering starts from the complete phrase. However, there are cases where a given clause atom does not have a single complete phrase at all. In that case, the phrase is numbered in accord with the previous clause atom. An example from genesis01.PX (vs. 7) will illustrate:
* 0 -1 10 0 0 .e 6 LineNr 21 ClauseNr 2: 1: 2: 200: -13 -1006 SentenceNr 19 TxtType: ?N Pargr: 122 ClType:NmCl
* 0 -2 223 1 10 0 0 d. 5 LineNr 22 ClauseNr 1: 3: 3: 157: 0 0 SentenceNr 19 TxtType: ?N Pargr: 122 ClType:Defc
There are two clause atoms. Note that in the first clause atom, there are two complete phrases (1: 2:). In the second clause atom, however, there is no complete phrase. The numbering from the first clause atom therefore carries over to the second, and the least and greatest phrases in the second clause atom are numbered 3: 3:. Note also that even though this phrase begins in the first clause atom, it is not registered there, since that clause atom already has two complete phrases. Only in cases where a clause atom contains no other complete phrase is the phrase numbering carried over from the previous clause atom.
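The arithmetic behind the least and greatest phrase numbers can be sketched as follows. The function name is hypothetical, and the regular expression assumes the two phrase numbers directly follow the clause number after the ClauseNr label, as in the genesis01.PX excerpt. The count is simply greatest minus least plus one, which follows from the consecutive numbering.

```python
import re

# Star lines from genesis01.PX (vs. 7), copied from the excerpt in this section.
GENESIS_STAR_LINES = [
    "* 0 -1 10 0 0 .e 6 LineNr 21 ClauseNr 2: 1: 2: 200: -13 -1006 "
    "SentenceNr 19 TxtType: ?N Pargr: 122 ClType:NmCl",
    "* 0 -2 223 1 10 0 0 d. 5 LineNr 22 ClauseNr 1: 3: 3: 157: 0 0 "
    "SentenceNr 19 TxtType: ?N Pargr: 122 ClType:Defc",
]

# Assumption: clause number, least phrase, greatest phrase, each followed by a colon.
PHRASE_RE = re.compile(r"ClauseNr\s+\d+:\s+(\d+):\s+(\d+):")


def phrase_range(star_line):
    """Return (least, greatest, count) for the phrase numbers on one star line."""
    match = PHRASE_RE.search(star_line)
    if match is None:
        raise ValueError("no ClauseNr phrase numbers found")
    least, greatest = int(match.group(1)), int(match.group(2))
    return least, greatest, greatest - least + 1


if __name__ == "__main__":
    for line in GENESIS_STAR_LINES:
        print(phrase_range(line))
```

For the second clause atom the result is a range of 3 to 3. The single phrase counted there is the one carried over from the previous clause atom, not a complete phrase of its own.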
clause type The clause type (functional) field contains an integer which corresponds to a type of clause. The values are provided below in the clause atom/clause type table. Note that the clause atom and the clause share clause type codes, but the clause atom type is stored as a string whereas the clause type is stored here as an integer.
clause constituent relation The two integers in the clause constituent relation column contain data on relationships between two functional clauses. If there are multiple clause atoms in a clause, the constituent relation is only stored on the first clause atom.
The first integer describes the relation and can be referenced in the clause constituent relation table below.
The second integer describes the distance to the related clause, which can be conveyed in either words, phrase atoms, clause atoms, or sentence atoms. Use the flow chart below to calculate both the unit and distance.
sentence number Sentences are numbered consecutively within chapters.
text type The text type is a clause feature for “narrative”, “discourse”, “quotation”, or any combination/embedding of those features. The feature is repeated for every clause atom within that clause. The possible values are listed below in the text type table.
paragraph number Paragraphs are numbered consecutively within a text segment. A text segment begins at the root of the clause hierarchy. Paragraphs can be nested. As an example, paragraph 12 would refer to a nested paragraph 2 within a paragraph 1.
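One way to read a nested paragraph number is sketched below. It assumes that each digit stands for one nesting level, which matches the "paragraph 12" example but is only an assumption: the description here does not say how a level with more than nine paragraphs would be written. The function name is hypothetical.

```python
from typing import List


def paragraph_path(pargr: str) -> List[int]:
    """Split a paragraph number into one integer per nesting level.

    Assumption: one digit per level, outermost level first.
    """
    if not pargr.isdigit():
        raise ValueError(f"expected only digits, got {pargr!r}")
    return [int(digit) for digit in pargr]


if __name__ == "__main__":
    print(paragraph_path("12"))   # paragraph 2 nested within paragraph 1
    print(paragraph_path("122"))  # value seen in the genesis01.PX excerpt
```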
clause atom type The clause atom type is a simple string conveying the internal structure of the clause atom.
. c d l m p r
99 101 102 103 104 105 106 111 112 113 121 122 123 131 132 133 141 142 143 151 152 153 157 161 162 163 167 171 172 173 181 182 183 191 192 193 200 213 301 302 303 304 305
-13
? D Q
The .CTT (“coded text tabulated”) file does not contain any new data on the text, but its contents are of special importance for presenting, using, and sharing the completed analysis of a text. The file presents a hierarchical layout of the text by applying the indentations from the .PX file. There is also information on clause atom constituents and their functions.
man CTT
The columns as they appear here are:
0. Line number (not an official part of the file format)
1. Verse Label
2. Person/Number/Gender of the predicate
3. Clause Atom Type of the daughter
4. Indication of the mother
5. Text Type
6. Paragraph Number
7. Clause Atom Number
8. Tabulation and Subtypes
9. Hierarchy made with the surface text from ct4.p