<TAN-A-lm>
TAN-A-lm files are used to annotate a class-1 source by specifying the lexical and morphological properties of its tokens or morphemes.
Every TAN-A-lm file has two different types of dependencies: a class-1 source (optional) and the grammatical rules defined in one or more TAN-mor files. This section should therefore be read in close conjunction with the section called “Morphological Concepts and Patterns (TAN-mor)”.
TAN-A-lm files are either source-specific or language-specific.
Source-specific TAN-A-lm files depend exclusively upon one class-1 source. Source-specific TAN-A-lm files are useful for closely analyzing the grammatical properties of the words in one particular text. Well-curated source-specific TAN-A-lm files are enormously useful for other applications, e.g., quotation detection. Any source-specific TAN-A-lm file can be converted into a language-specific one, to be used as noted below.
Language-specific TAN-A-lm files depend upon an unknown number of sources. Some language-specific TAN-A-lm files might be based upon a small, specific corpus, perhaps just one text. Others might rely upon a vast, general one. Language-specific TAN-A-lm files are useful for building language resources for computer applications. Many language-specific TAN-A-lm files become the basis for a local language catalog, which can be used to populate a new source-specific TAN-A-lm file.
Editors of TAN-A-lm files should understand the vocabulary and grammar of the languages of their sources. They should have a good sense of the rules established by the lexical and grammatical authorities adopted. They should be familiar with the conventions and assumptions of the TAN-mor files being used.
Although you must assume the point of view of a particular grammar and lexicon, you need not hold to a single one. In addition, you may bring to the analysis your own expertise and supply lexical headwords unattested in published authorities.
Although TAN-A-lm files are simple, they can be laborious to write and edit, more than any other type of TAN file. They can also be hard to read if the morphological codes are cryptic. It is customary for an editor of a TAN-A-lm file to use tools to create and edit the data.
The root element of a lexico-morphological file is TAN-A-lm.
If the file is source-specific, <source> points to the one and only TAN-T(EI) file that is the object of analysis. If the file is language-specific, <for-lang> is used in the declarations section of the <head> to indicate the languages that are covered.
For highly inflected languages, language-specific TAN-A-lm files can be enormous in size or quantity. To improve performance when validating and processing numerous or large language-specific TAN-A-lm files, the <head> may also include <tok-starts-with> and <tok-is>. It is common for language-specific TAN-A-lm files to be cataloged in a <collection> file. These become part of the local language catalog, bound to the global parameter $tan:lang-catalog-map, found in parameters/params-application-language.xsl. By including your collections of language-specific TAN-A-lm files in that parameter, you open up those resources to use in a variety of other applications. In that <collection> file, the individual <doc>s that point to language-specific TAN-A-lm files should include as children any <tok-starts-with> and <tok-is>, as in the original.
Example 6.4. Example of a catalog entry for a language-specific TAN-A-lm file
<doc href="lat-tan-a-lm-abu.xml" TAN-version="2021"
     id="tag:kalvesmaki.com,2015:tan-a-lm:lat:perseus:abu"
     lexicon="LS" morphology="perseus-dik" claimant="xslt1" root="TAN-A-lm">
   <name xmlns="tag:textalign.net,2015:ns">Perseus lexico-morphological permutations devoted exclusively to abu</name>
   <license xmlns="tag:textalign.net,2015:ns" which="Attribution-ShareAlike 3.0 Unported" licensor="perseus"/>
   <for-lang xmlns="tag:textalign.net,2015:ns">lat</for-lang>
   <tok-starts-with xmlns="tag:textalign.net,2015:ns">Abu</tok-starts-with>
   <tok-starts-with xmlns="tag:textalign.net,2015:ns">abu</tok-starts-with>
   <tok-starts-with xmlns="tag:textalign.net,2015:ns">abú</tok-starts-with>
</doc>
Conversion from a source-specific TAN-A-lm to a language-specific one is a one-way operation. There is at present no mechanism for automatically reconstructing the corpus that underlies a language-specific TAN-A-lm file.
<vocabulary-key> takes the elements other class-2 files take (see the section called “Class 2 metadata (<head>)”). It also permits two elements unique to TAN-A-lm: <lexicon> (optional) and <morphology> (mandatory). Any number of lexica and morphologies may be declared; the order is inconsequential.
There is, at present, no TAN format for lexica and dictionaries. So even if a digital form of a dictionary is identified through the digital entity metadata pattern (see the section called “Digital entity metadata pattern”), the Schematron validation routine will not attempt to check the TAN-A-lm data against the lexical authorities cited.
Because you or other TAN-A-lm editors are likely to be authorities in your own right, a <person> can be treated as if it were a <lexicon>, and be referred to by @lexicon.
<body>
The <body> of a TAN-A-lm file takes, in addition to the customary optional attributes found in other TAN files (see the section called “Edit stamp”), @lexicon and @morphology, to specify the default lexicon and grammar.
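As a minimal sketch of these defaults, the fragment below reuses the identifiers LS and perseus-dik from the catalog entry in Example 6.4. It assumes that both have been declared in the <vocabulary-key>; the single <ana> is only a placeholder.

```xml
<!-- Sketch only: LS and perseus-dik are assumed to be declared as a
     <lexicon> and a <morphology> in this file's <vocabulary-key>. -->
<body lexicon="LS" morphology="perseus-dik">
   <ana>
      <tok val="aberro"/>
      <lm>
         <l>aberro</l>
         <m>v - 1 s p i a</m>
      </lm>
   </ana>
</body>
```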
<body> has only one type of child: one or more <ana>s (short for analysis), each of which matches one or more tokens (<tok>) to one or more lexemes or morphological assertions (<lm>, which takes zero or more <l>s followed by one or more <m>s).
An <ana> may take a @tok-pop, to specify the number of tokens that the assertion applies to. This is particularly helpful for language-specific files based upon a limited corpus of texts, where the underlying data for the assertion might be difficult or impossible to retrieve. The token population can be used to calibrate levels of certainty, or to compare statistical profiles of one TAN-A-lm file against another.
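A sketch of how @tok-pop might look in a language-specific file follows. The count of 37 is invented for illustration, and the fragment assumes that @tok-pop takes a plain integer.

```xml
<!-- Hypothetical: the analysis rests on 37 tokens in the underlying
     corpus. The figure is illustrative only. -->
<ana tok-pop="37">
   <tok val="aberro"/>
   <lm>
      <l>aberro</l>
      <m>v - 1 s p i a</m>
   </lm>
</ana>
```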
If you wish to point to a linguistic token that straddles more than one token in the source, you should use multiple <tok>s, wrapping them in a <group>.
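For instance, a linguistic token spread over two adjacent tokens might be marked as in the sketch below. The reference, positions, lexeme, and morphological code are all hypothetical; the code would have to be defined in the TAN-mor file in use.

```xml
<!-- Hypothetical: tokens 1 and 2 of div 1 together form one
     linguistic token. -->
<ana>
   <group>
      <tok ref="1" pos="1"/>
      <tok ref="1" pos="2"/>
   </group>
   <lm>
      <l>New York</l>
      <m>NNP</m>
   </lm>
</ana>
```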
Any token may be the object of as many <ana>s as you like. In fact, this is preferred if you wish to register competing claims or alternatives.
Claims within an <ana> are distributed. That is, every combination of <l> and <m> (governed by <lm>) is asserted to be true for every <tok> or <group>.
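To make the distribution concrete, the sketch below pairs two <tok>s with one <l> and two <m>s, which yields four assertions in all (2 tokens × 1 lexeme × 2 codes). The lexeme and codes are borrowed from Example 6.5; the capitalized token value is an assumption added for illustration.

```xml
<!-- Equivalent to four separate claims:
     τούτῳ = οὗτος, p d - s - - - m d
     τούτῳ = οὗτος, p d - s - - - n d
     Τούτῳ = οὗτος, p d - s - - - m d
     Τούτῳ = οὗτος, p d - s - - - n d -->
<ana>
   <tok val="τούτῳ"/>
   <tok val="Τούτῳ"/>
   <lm>
      <l>οὗτος</l>
      <m>p d - s - - - m d</m>
      <m>p d - s - - - n d</m>
   </lm>
</ana>
```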
If an <lm> lacks an <l>, the token value itself, calculated for each <tok>, is taken to be the default value of the lexeme.
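A short sketch of this default, using the Latin form from Example 6.5: because the <lm> has no <l>, the lexeme is taken to be the token value, aberro.

```xml
<!-- No <l>: the lexeme defaults to the value of the token, "aberro". -->
<ana>
   <tok val="aberro"/>
   <lm>
      <m>v - 1 s p i a</m>
   </lm>
</ana>
```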
All assertions are assumed to be made with 100% confidence unless @cert is invoked. This still holds even when a <tok> is the subject of multiple <ana>s, because it is possible to be completely confident that a given word has two different grammatical profiles in the target text (e.g., puns, wordplay).
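By way of illustration, the two <ana>s below make independent claims about the same token, each at full confidence because neither carries @cert. The reference, position, lexemes, and codes are hypothetical, and the codes would have to be defined in the TAN-mor file in use.

```xml
<!-- Hypothetical pun on "lie": both readings are asserted outright. -->
<ana>
   <tok ref="1" pos="3"/>
   <lm>
      <l>lie</l>
      <m>VB</m>
   </lm>
</ana>
<ana>
   <tok ref="1" pos="3"/>
   <lm>
      <l>lie</l>
      <m>NN</m>
   </lm>
</ana>
```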
Many TAN-A-lm files will be generated by an algorithm that automatically lists all possible morphological values of each token. It is advised that such automatic calculations always include @cert in their output, with weighted values. That is, if an algorithm identifies two possible lexico-morphological profiles for a word, but one occurs nine times as often as the other, then it is advised that this be reflected in the two resultant elements, e.g.: <lm cert="0.9">...</lm> and <lm cert="0.1">...</lm>. If an algorithm is written with a more sophisticated way to weigh possibilities, then adjust the value of @cert accordingly. Be certain that the <algorithm> is credited in the <vocabulary-key> and in a <resp>.
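As a worked illustration with invented figures: if an algorithm meets one profile 9 times and another once in 10 occurrences of a token, the weights are 9/10 = 0.9 and 1/10 = 0.1. The lexeme and codes below are taken from Example 6.5, but the frequencies are hypothetical.

```xml
<!-- Hypothetical counts: 9 masculine readings, 1 neuter reading. -->
<ana>
   <tok val="τούτῳ"/>
   <lm cert="0.9">
      <l>οὗτος</l>
      <m>p d - s - - - m d</m>
   </lm>
   <lm cert="0.1">
      <l>οὗτος</l>
      <m>p d - s - - - n d</m>
   </lm>
</ana>
```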
As with TAN-A-tok files, not every word needs to be explained or described. In fact, exhaustive coverage is often undesirable, because it produces files that are overly long and time-consuming to validate or process.
A TAN-A-lm file is rendered more efficient when claims can be grouped. If a particular token invariably has a single lexico-morphological profile, this can be declared once, in a <tok> that does not have @ref. If the token has a particular profile in a given region of text, it can be specified through a @ref that encompasses the specified region. You do not need to provide a <tok> for every token, which would entail restricting @ref to leaf divs. You may do so, but such an approach can result in very long files that are time-consuming to validate, process, and edit. It is more advantageous to declare lexico-morphological properties more generally, thereby replacing numerous leaf-div <tok>s.
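The contrast can be sketched as follows. The first <ana> applies wherever the token value occurs; the second restricts the claim to one region. The lexemes and codes come from Example 6.5, while the token value Σωκράτους, the region 10 6, and the combination of @ref with @val in a single <tok> are assumptions made for illustration.

```xml
<!-- Applies throughout the source: the <tok> has no @ref. -->
<ana>
   <tok val="Σωκράτους"/>
   <lm>
      <l>Σωκράτης</l>
      <m>n e - s - - - m g -</m>
   </lm>
</ana>
<!-- Applies only within the region 10 6, assumed here to be a div
     that contains several leaf divs. -->
<ana>
   <tok ref="10 6" val="τούτῳ"/>
   <lm>
      <l>οὗτος</l>
      <m>p d - s - - - m d</m>
   </lm>
</ana>
```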
The benefits in processing time are significant. In early versions of TAN, the lexico-morphological values of the Greek Septuagint (8.3 MB) were converted to a TAN-A-lm file of 407,811 <tok>s, one per token per leaf div, grouped in 52,703 <ana>s (25.8 MB). Early 2020 validation routines took about 25 minutes (2018 validation routines took hours). The long processing time was due primarily to the TAN-A-lm file itemizing every single token in the text. That same file was revised to be more declarative along the lines advocated above. If a particular token had only one lexico-morphological profile throughout the text, then every instance was reduced to a single <ana>, with no @ref in <tok>. When a particular token value had different lexico-morphological profiles, @ref targeted the rootmost <div> that encompassed them all. This revision resulted in a smaller file (15.8 MB; 158,376 <tok>s in 54,335 <ana>s) that validated in about a third of the time (8.5 minutes).
In general, there is always a trade-off between convenience and efficiency. If your priority is speed, you should break a large file into several smaller ones, perhaps recombining them in a master file via <inclusion> (see the section called “Networked Files”).
Applications can be written to convert TAN-A-lm <m> data from one morphological system to another. This is a two-step process facilitated by the functions tan:morphological-code-conversion-maps() and tan:convert-morphological-codes(). See documentation in these guidelines or in functions/language/TAN-fn-language-extended.xsl.
Example 6.5. Examples of TAN-A-lm data
<ana>
   <group>
      <tok ref="1" pos="1 - last-1"/>
   </group>
   <lm>
      <l>ring-a-ring-a-rose</l>
      <m>NNS</m>
   </lm>
</ana>
. . . . . . .
<ana>
   <tok ref="10 6 3 2" pos="4"/>
   <tok ref="10 6 3 3" pos="15"/>
   <tok ref="10 6 4 2" pos="37"/>
   <lm>
      <l>Σωκράτης</l>
      <m>n e - s - - - m g -</m>
   </lm>
</ana>
. . . . . . .
<ana>
   <tok val="τούτῳ"/>
   <lm>
      <l>οὗτος</l>
      <m cert="0.358311302048909457">p d - s - - - m d</m>
      <m cert="0.241688697951090546">p d - s - - - n d</m>
      <m cert="0.2">p - - s - - - m d</m>
      <m cert="0.2">p - - s - - - n d</m>
   </lm>
</ana>
. . . . . . .
<ana>
   <tok val="ABERRO"/>
   <tok val="Aberro"/>
   <tok val="aberro"/>
   <lm>
      <l>aberro</l>
      <m>v - 1 s p i a</m>
   </lm>
</ana>