LEXICAL VARIATION IN MEDIEVAL SPANISH: APPLYING QUANTITATIVE METHODS TO SPANISH BIBLICAL TEXTS

In this paper I carry out a quantitative analysis of lexical coincidences between nine medieval Spanish versions of a section of the Book of Judges in the Old Testament. The purpose of this analysis is to obtain an overview of the lexical variation in Spanish medieval biblical translation through the application of co-occurrence methods, correlation coefficients, clusters, principal component analysis and my own method of integration. The results of these quantitative analyses are compared with the relationships proposed in previous studies concerning the degree of relationship between the different Old Spanish biblical translations.


INTRODUCTION
In this paper I intend to carry out a quantitative analysis of lexical coincidences between medieval Spanish versions of a section of the Book of Judges in the Old Testament, for which all existing versions offer a complete translation. The purpose of this study is to obtain an overview of the lexical variation in Spanish medieval bibles through the application of various statistical methods. First, in this section I explain the method for laying out comparative lexical data in the form of a two-dimensional array of source texts and linguistic forms. I analyze Chapters 13 to 16 of the Book of Judges in the Old Testament, which [...]

1 The research developed for this paper was originally presented at the XIII Oxford Forum of Iberian Studies on Variation and Change in Ibero-Romance, held at the University of Oxford in 2008. The paper, which remained unpublished, has been expanded and translated into English for this occasion. I would like to thank Andrés Enrique-Arias for providing me with bibliographical references as well as for his suggestions concerning different stages of this research. Likewise, I wish to thank Javier Muñoz-Basols for his help with proofreading the English version, and Javier Pueyo and Enrique Pato for their comments on my first draft of this paper. This study was supported by the JSPS KAKENHI Grant Number 24520453 (Japan).
In order to focus exclusively on lexical aspects, graphophonemic, morphological and syntactic permutations have been disregarded. For this reason, I have changed the spelling of <çi>, <çe> to <ci>, <ce> and I have normalized the use of <i> and <u> as vowels, and <j> and <v> as consonants 4.
Variable spellings such as <np> - <mp>, <nn> - <ñ>, <pp> - <p>, etc. have all been converted into a common form (i.e., the second option in each pair). Some specific cases, such as the translation equivalents sizra, sidra, sisra 'intoxicating drink', have been considered as variants of the same lexical unit sidra. In addition, in those cases in which variants reflect some change in progress (i.e., omne / ombre; aparescer / aparecer; levar / llevar; ondrar / (h)onrar, etc.), the variants in question have been treated as equivalent across all texts.
In contrast to inflectional variants, derivational ones are treated as cases of lexical variation. These are the result of word formation, and when a word is derived from another, the two words are considered as different forms from the lexical point of view.

CORRELATION MATRIX AND CLUSTER
To observe the relationships between each pair of texts, the correlation coefficient Phi in its modified version ('Phi) is calculated from four data points:

The result of the calculation is as follows:

Observing this table, it is evident that some pairs of texts show a high level of correlation, while others show a relatively low one. For example, the correlation coefficient of the pair Ajuda-E3 is 0.882, the highest of the group. The correlation coefficient of the pair E7-E19 is also high, at 0.865. Should one want to assess the proximities relative to a fixed variable, these figures can also be listed in ascending or descending order. For example, with regard to the text of Ajuda, the figures can be presented as follows: E3: 0.882, E4: 0.615, E7: 0.482, E19: 0.474, Alba: 0.457, E8: 0.258, Fazienda: 0.241, GE: 0.197, indicating a descending level of correlation.
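The ordering relative to a fixed text can be reproduced mechanically. The following is a minimal Python sketch using only the Ajuda coefficients quoted above; the variable names are illustrative assumptions and not part of the original analysis.

```python
# Correlation coefficients between Ajuda and the other texts, as quoted in the text.
ajuda_correlations = {
    "GE": 0.197,
    "E3": 0.882,
    "Fazienda": 0.241,
    "E7": 0.482,
    "Alba": 0.457,
    "E4": 0.615,
    "E8": 0.258,
    "E19": 0.474,
}

# Sort the (text, coefficient) pairs from the highest to the lowest coefficient.
ranked = sorted(ajuda_correlations.items(), key=lambda item: item[1], reverse=True)

for name, value in ranked:
    print(f"{name}: {value:.3f}")
```

Passing `reverse=False` instead would give the ascending order mentioned in the text.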
From the correlation matrix, a dendrogram can be produced via cluster analysis (method of mean values between groups) 6:

The graph can be interpreted either from right to left or from left to right.
Starting from the right side, it can be observed that the figure is basically split into two groups, {E8, GE} and {Fazienda, E7, E19, E4, Ajuda, E3, Alba}, with Fazienda the first to deviate from the latter group, and then Alba. The remaining texts can be grouped as follows: {E7, E19} and {E4, Ajuda, E3}. By drawing a vertical line at any point, it is possible to establish provisional groupings. For instance, the vertical line in Figure 2 enables a classification into three main groups: {E8, GE}, {Fazienda}, {E7, E19, E4, Ajuda, E3, Alba}. For each group and subgroup, the correlation coefficient that occurs at the point of connection is indicated. As this method uses the average value of coefficients between the groups, the figure is calculated as the sum of all correlation coefficients between relevant pairs divided by the total number of pairs.
In previous studies various groupings of biblical texts have been proposed 7.
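The averaging step described above (sum of the coefficients between relevant pairs divided by the number of pairs) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name is hypothetical, the Ajuda-E3 and Ajuda-E4 coefficients are those quoted in this section, and the E3-E4 coefficient is an invented placeholder, since its actual value is not given in the text.

```python
from itertools import product

def mean_between_groups(corr, group_a, group_b):
    """Average of the coefficients over every pair (one text from each group)."""
    pairs = list(product(group_a, group_b))
    total = sum(corr[frozenset(pair)] for pair in pairs)
    return total / len(pairs)

# Pairwise coefficients keyed by unordered pairs of text names.
corr = {
    frozenset(("Ajuda", "E3")): 0.882,  # quoted in the text
    frozenset(("Ajuda", "E4")): 0.615,  # quoted in the text
    frozenset(("E3", "E4")): 0.600,     # HYPOTHETICAL value, for illustration only
}

# Coefficient at which {E4} would join the already formed group {Ajuda, E3}.
link = mean_between_groups(corr, ["E4"], ["Ajuda", "E3"])
print(round(link, 4))
```

Keying the dictionary by `frozenset` makes the lookup independent of the order in which the two texts of a pair are named.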
My lexicostatistic taxonomy basically coincides with this genealogical group.
The position of Fazienda, of Hebrew origin, is noteworthy, as it maintains its independence up to the point marked as 0.237, where it joins the rest of the family of texts translated from Hebrew. Alba is also independent to some extent from the rest of the Hebrew group. Within the family of Hebrew Bibles, the closeness between E7 and E19, on the one hand, and between Ajuda and E3, on the other, is due to each pair being two testimonies of the same translation.
The advantage of a dendrogram in the study of comparable texts is its ability to offer a multistratic classification based on accurate quantitative measurements. Ideally, the grouping based on qualitative philological analysis should coincide with the quantitative taxonomy offered by multiple correlation analysis. This is precisely the case with the lexical taxonomy described in this study.

PRINCIPAL COMPONENT ANALYSIS
The principal component analysis offers a new interpretation of the variables, which in our case are the biblical texts. It is intended to draw new component lines to explain the variation in the data matrix 8. The result of this analysis is as follows:

Table 5. Principal components

According to the table, component 1 divides texts of the 13th century {Fazienda, E8, GE} from those of the 15th {E4, E7, E19, Ajuda, Alba, E3}.
Component 2 in turn separates {Fazienda} from {E8, GE} by placing them at opposite ends of the distribution of values of that component, reflecting the independence of translation between the two extreme groups 9. The chart below [...]

8 See Woods et al. (1986: 273-295). The Principal Component Analysis is one of the multivariate methods seeking internal structure of numerical information. It presents the main components explaining the greatest variance in distribution, instead of the initial coordinate axes. By interpreting the first few components it is possible to find new classification criteria of data variables in a reasonable way.

9 I owe much to Francisco J. [...]

The data in table 3.2 shows that the first division of the texts into those translated in the 13th century and those translated in the 15th century is twice as important as the division based on source language: {Fazienda}, which was translated from Hebrew, as opposed to {E8, GE}, both translated from Latin. The lexical similarity between E8 and GE is remarkable, which is also verified in Table 4 and Figure [...]
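To make the idea of a principal component more concrete, the following Python sketch extracts the first component of a small presence/absence matrix by power iteration on its covariance matrix. It is an illustration under stated assumptions only: the data matrix is hypothetical (three unnamed texts, six linguistic forms), the function names are invented, and the procedure is not necessarily the one used to obtain Table 5.

```python
from math import sqrt

def center_columns(rows):
    """Subtract each column mean so that every variable (text) has mean zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[value - mean for value, mean in zip(row, means)] for row in rows]

def covariance_matrix(centered):
    """Sample covariance between the columns of an already centered matrix."""
    n, k = len(centered), len(centered[0])
    return [[sum(row[i] * row[j] for row in centered) / (n - 1)
             for j in range(k)] for i in range(k)]

def mat_vec(matrix, vec):
    """Multiply a square matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def first_component(cov, iterations=200):
    """Approximate the leading eigenvector and eigenvalue by power iteration."""
    k = len(cov)
    vec = [1.0 / sqrt(k)] * k
    for _ in range(iterations):
        product = mat_vec(cov, vec)
        norm = sqrt(sum(p * p for p in product))
        vec = [p / norm for p in product]
    variance = sum(v * p for v, p in zip(vec, mat_vec(cov, vec)))
    return vec, variance

# HYPOTHETICAL data: rows are linguistic forms, columns are three texts (1 = present).
data = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
    [1, 0, 1],
    [0, 1, 0],
]

cov = covariance_matrix(center_columns(data))
loadings, variance = first_component(cov)
# Share of the total variance explained by the first component.
share = variance / sum(cov[i][i] for i in range(len(cov)))
```

In this toy matrix the second and third texts receive loadings of opposite sign on the first component, which is the kind of separation between groups of texts that the section describes.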

CONCENTRATION
In Ueda (2006) I developed a method for rearranging the two axes of the two-dimensional distribution table. The purpose of this method is to look at the distribution of occurrences (Table 2) through a procedure called [...]

This list consists of unique linguistic forms, i.e. those that are not present in more than one group, to the exclusion of other parallel texts in the same verse.
In order to provide a quantitative analysis of this phenomenon of co-occurrence, I have prepared the following table:

According to the lexical data in Chapters 13 to 16 of Judges, E4 is closer to {E3, Ajuda} than to E7 (see Figures 1 and 2).
Indeed, according to the latest study of Pueyo (2008), presented in Avenoza

CONCLUSION
To obtain a more complete description and classification of medieval texts and an explanation of the causes of the similarities and differences between them, several linguistic and extralinguistic analyses should be carried out 11.
Among the linguistic factors to be analysed, graphophonemic, morphosyntactic 12, and lexical aspects should be considered in order to complement this study with special attention to quantitative aspects 13. Some of the extralinguistic work, such as the characteristics of graphs, paleographical allographs, abbreviations, etc., should also use statistical methods. Each one of these perspectives would offer interesting and complementary contributions, which should be assessed jointly in order to improve our understanding of the world of medieval biblical texts.

11 See, for example, Colón Domènech (2002).
12 Enrique-Arias (2008), Vincis (2009).
13 Francisco J. Pueyo, in evaluating an earlier version of this paper, suggests the following two points which deserve further study: «(1) Since we have taken into account different copies of the same text (E7/E19 and E3/Ajuda) it would have been interesting to detail and explain the type of variation found between each pair of copies (i.e. which categories exhibit more or less variation). This analysis would have contributed significantly to a better understanding of the mechanisms of lexical adaptation that occurred in the copying process in the Middle Ages. (2) The high degree of lexical correlation between E8 and GE requires some form of qualitative discussion, considering whether it could be caused by a direct textual relationship or [...]
The aim of this paper has been to propose the use of quantitative methods exclusively in lexical analysis. The sample text used in this analysis has been limited to four chapters of the Book of Judges. However, it could serve as a good example of how the quantitative methods described here could be applied to the linguistic analysis of other medieval Spanish texts.

REFERENCES
In total I have found 520 cases of lexical variation based on the definitions explained above. I begin the analysis by means of a two-dimensional table contrasting all the examples, with the corresponding line number (chapter and verse) on the vertical axis and the identifiers of the source texts on the horizontal axis 5:

Four points: a, b, c, d, where the point (a) corresponds to the number of cases in which both X and Y are positive (+) (i.e. a given semantic unit is observed in both texts), (b) corresponds to the number of incidences of X and not Y (i.e. the semantic unit appears in text X but not in text Y), (c) is the inverse of (b); and in (d) both X and Y are negative (-). The equation for the modified correlation coefficient Phi is:

[...] (a + b)(a + c)(b + d)(c + d)

It has been changed in order to exclude cases in point (d) indicating common negative observations (i.e. a given expression does not appear in either of the two texts being compared), which have been proved to be irrelevant to the study of the variable lexicon.
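As a rough illustration of the four counts defined above, the following Python sketch tallies a, b, c and d from two presence/absence vectors and computes the standard, unmodified Phi coefficient from them. The modified version ('Phi), which excludes the common negative observations counted in (d), is deliberately not implemented here; the function names and the two sample vectors are hypothetical.

```python
from math import sqrt

def contingency_counts(x, y):
    """Return (a, b, c, d) for two equal-length presence/absence vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    a = sum(1 for xi, yi in zip(x, y) if xi and yi)          # in both X and Y
    b = sum(1 for xi, yi in zip(x, y) if xi and not yi)      # in X but not in Y
    c = sum(1 for xi, yi in zip(x, y) if not xi and yi)      # in Y but not in X
    d = sum(1 for xi, yi in zip(x, y) if not xi and not yi)  # in neither text
    return a, b, c, d

def standard_phi(a, b, c, d):
    """Standard Phi coefficient of a 2x2 table (NOT the modified 'Phi)."""
    denominator = sqrt((a + b) * (a + c) * (b + d) * (c + d))
    if denominator == 0:
        return 0.0  # convention chosen here for degenerate tables
    return (a * d - b * c) / denominator

# HYPOTHETICAL occurrence vectors for two texts over six linguistic forms.
text_x = [1, 1, 0, 1, 0, 0]
text_y = [1, 0, 0, 1, 1, 0]

counts = contingency_counts(text_x, text_y)
phi = standard_phi(*counts)
```

Because the standard coefficient rewards shared absences through (d), it would need to be adapted before being compared with the figures reported in this paper.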

Figure 2. Tree groups

[...] 1. Another highlight is the fact that Alba is closer to the [...]

[...] probably had little to do with the design of the book and therefore did not have his own literary intention. Precisely the most literary text of these versions would be that of General Estoria, which inserts the biblical story into a historical narrative and sometimes paraphrases the biblical translated text. It is actually easier to explain Component 2 considering that the Fazienda is a translation from Hebrew and not from Latin, which can clearly influence the lexical choice of both groups. Similarly, it is worth pointing out that in this Component 2 positive values of Fazienda and E19/E7, and relatively low negative values of Alba would indicate different degrees of independence on the part of each one of the translators» [my translation].

10 Llamas (1944: 233) explains the originality of Alba: «neither Arragel [translator of Alba] really depends on the Escorial version [E3], nor the latter on the version of the Rabbi» [my translation]. And then he emphasizes the Hebrew character of E3, adding that «(...) after I-I-3 [E3] Arragel's version is the most Jewish of all the Castilian bibles» [id.].

Table 7. Co-occurrence

I would call attention here to the similarities between E8 and GE on the one hand, and E4, E3 and Ajuda on the other, which have also been demonstrated in [...] 2.

Pueyo (1996: LVII) presents an equation of the Book of Judges as follows: [E3 / Ajuda ~ E19 = E7 ~ E4] ≠ Alba. My lexicostatistic taxonomy matches this equation except for the position of E4.

(2008: 40), with regard to the Book of Judges, along with the Pentateuch and Joshua, «[E4] does not reproduce any other known romance translations, which makes it unique in the panorama of medieval Bibles (...). Thus, E4 was isolated in the scheme designed by Pueyo (...)» [my translation].

Table 1. Contrastive summary

To calculate a correlation coefficient for each pair of source texts, it is necessary to convert this contrastive table into a table listing the occurrences of each linguistic form:

Table 2. Distribution of occurrences
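The conversion from a contrastive table to a table of occurrences can be sketched in Python as follows. This is a minimal illustration, assuming that the contrastive table records one form per text and verse and that the occurrence table has one row per (verse, form) pair with a 1 or 0 for each text; the function name, the text labels X, Y, Z and the forms are all hypothetical placeholders.

```python
def to_occurrence_table(contrastive, texts):
    """Turn {verse: {text: form}} into {(verse, form): [0/1 for each text]}."""
    table = {}
    for verse, forms_by_text in contrastive.items():
        # One row per distinct form attested in this verse.
        for form in sorted(set(forms_by_text.values())):
            table[(verse, form)] = [
                1 if forms_by_text.get(text) == form else 0 for text in texts
            ]
    return table

# HYPOTHETICAL contrastive table: which form each text uses in each verse.
texts = ["X", "Y", "Z"]
contrastive = {
    "13:4": {"X": "form_a", "Y": "form_a", "Z": "form_b"},
    "13:5": {"X": "form_c", "Y": "form_d", "Z": "form_d"},
}

occurrences = to_occurrence_table(contrastive, texts)
for key, row in occurrences.items():
    print(key, row)
```

Each column of the resulting table is a presence/absence vector for one text, which is the form of input that a pairwise correlation coefficient requires.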