Validating person |
Johanna Cronenberg |
Date of validation |
25.01.2017 |
Contact for requests regarding the corpus |
Bayerisches Archiv für Sprachsignale (BAS)
Institute for Phonetics and Speech Processing
University of Munich (LMU)
Schellingstr. 3
D-80799 Munich |
Number and type of medium |
1 folder (DATA/), potentially 5 CDs (approx. 360h) |
Content of each medium: |
Directories DATA/, METADATA/, TABLE/, TEXTGRIDS/ |
Copyright statement and intellectual property rights (IPR) |
This CD or DVD contains copyrighted material. Do not distribute without the written consent of the copyright holders:
Romanische Philologie, IT-Gruppe Geisteswissenschaften (ITG)
Ludwig-Maximilians-Universität München
Geschwister-Scholl-Platz 1
D-80539 München
See COPYRIGHT.TXT |
File or Directory Name |
Contents of File or Directory |
COPYRIGHT.TXT |
File containing copyright information |
DATA/ |
Directory containing 2264 wav files |
DOC/ |
Directory containing README file (documentation in English) and the archive DOCU.zip:
COPYRIGHT.TXT: the same file as in ASD/
DOC/: directory containing the same README file as DOC/
TABLE/: the same directory as in ASD/
|
GARBAGE/ |
Directory containing 3 txt files |
METADATA/ |
Directory containing 2264 cmdi files |
README.maintenance |
File containing contact and corpus information, and status of validation |
TABLE/ |
Directory containing PROMPTS.TBL with the orthographic transcription of 44 "Wenker" sentences |
TEXTGRIDS/ |
Directory containing 788 TextGrid files |
Parameter |
Explanation of Parameter (optional) |
Value |
Acceptance of Value |
File nomenclature |
Explanation of used codes |
No coherent file nomenclature |
NOT OK |
Settings of recording sessions |
|
No coherent recording settings |
|
Channel |
|
1 |
OK |
Format of signals and annotation files |
If non standard formats are used, it is common to give a full description or to convert into a standard format |
Audio files: .wav, Annotation files: .TextGrid.utf8.phon.txt or .TextGrid.utf8.orth.txt |
OK |
Sample Coding |
|
16-bit sample integer PCM |
OK |
Compression |
|
Not compressed (wav) |
OK |
Sampling rate |
|
22050 Hz |
OK |
Valid bits per sample |
Values other than 8, 16 or 24 bits should be reported |
16 bits |
OK |
Multiplexed signals |
Exact de-multiplexing algorithm and tools |
|
n.a. |
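The signal format values in the table above (1 channel, 16-bit PCM, 22050 Hz) could be re-checked with a short script. The following is only a minimal sketch, assuming the files can be read with Python's standard `wave` module; the names `EXPECTED` and `check_wav_header` are hypothetical and are not taken from the validation scripts actually used.

```python
import wave

# Expected values copied from the table above (16 bits = 2 bytes per sample).
EXPECTED = {"channels": 1, "sample_width_bytes": 2, "sampling_rate": 22050}


def check_wav_header(path):
    """Return {parameter: (found, expected)} for every header mismatch."""
    with wave.open(path, "rb") as wav:
        found = {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sampling_rate": wav.getframerate(),
        }
    return {key: (found[key], value)
            for key, value in EXPECTED.items()
            if found[key] != value}
```

An empty result means the header matches the documented values; any other result lists what was found next to what was expected.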
Parameter |
Explanation of Parameter (optional) |
Value |
Acceptance of Value |
Multi-party |
Number of speakers, topics discussed, type of setting, formal/informal |
One (or sometimes more) speaker(s) informally talking in an interview or reading sentences in his/her dialect |
OK |
Human-human dialogues |
Type of dialogues, e.g. problem solving, information seeking, chat etc., relation between speakers, topic(s) discussed, type of setting, scenarios |
Interviews (informal chat) about various topics, among them customs, occupation of the speaker, etc. |
OK |
Human-machine dialogues |
Domain(s), topic(s), dialogue strategy followed by the machine, e.g. system driven, mixed initiative, type of system,
e.g. test, operational service, Wizard-of-Oz |
/ |
n.a. |
Parameter |
Value |
Acceptance of Value |
Speaker recruitment strategies |
The 1805 informants stem from 199 locations in Transylvania and some locations in Bavaria,
but only the location of the recording has been encoded in the CMDI metadata.
However, in the majority of cases the recording location should also be the place of living. |
OK |
Number of speakers |
1805 |
OK |
Distribution of speakers over sex, age, dialect regions |
Sex distribution: 1176 female, 629 male. Age range: 5-93, but 383 of unknown age.
All speak a Saxon dialect in Romania (Transylvania) or Bavaria (Wassertal/Oberwischau); the location is unknown for 239 |
Apart from missing entries: OK |
Description/definition of dialect regions |
Definition by address (city), region, country, and continent |
OK |
Parameter |
Explanation of Parameter (optional) |
Value |
Acceptance of Value |
Unambiguous spelling standard used in annotations |
|
Standard orthography, no punctuation |
OK |
Labeling symbols |
|
Label #1# for speaker, #2# for interviewer, more speakers labeled with #3# and so forth,
unintelligible words are marked as {unversta#|aendlich}, other noise is marked as {Gera#|aeusch} (see README.txt).
However, transcription and orthography conventions are not always consistent (see Manual Validation). |
Apart from inconsistencies (see Manual Validation): OK |
List of non-standard spellings |
Dialectal variation, names etc. |
Dialectal words orthographically transcribed in brackets (<...>),
umlauts have been transliterated to "a#|ae, o#|oe, u#|ue",
German sharp "s" has been transliterated to "s#|sz",
proper names (not place names) have been replaced by ###,
Romanian, Standard German, and Hungarian words are enclosed by brackets, see README.txt |
Apart from inconsistencies (see Manual Validation): OK |
Distinction of homographs which are not homophones |
|
See phonetic transcriptions in TEXTGRIDS/ |
OK |
Character set used in annotations |
|
Standard Latin alphabet for orthographic transcriptions, IPA character set for phonetic transcriptions |
OK |
Any other language dependent information |
Abbreviations, etc. |
|
n.a. |
Annotation manual, guidelines, instructions |
|
Provided in README.txt |
OK |
Description of quality assurance procedures |
|
Not provided |
|
Selection of annotators |
|
No information provided |
|
Training of annotators |
|
No information provided |
|
Annotation tools used |
|
Praat |
OK |
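The "#" transliterations listed above ("a#", "o#", "u#", "s#") can be mapped back to umlauts and sharp "s" mechanically. The sketch below is an illustration under assumptions: the function and constant names are hypothetical, and only the "#" variants are converted, because the alternative spellings "ae", "oe", "ue" and "sz" are ambiguous in ordinary words. The lookahead skips a "#" that is followed by another "#" or by a digit, so that name placeholders (###) and speaker labels such as #1# are left untouched.

```python
import re

# Mapping for the "#" variant of the transliteration convention.
HASH_TO_CHAR = {
    "a": "ä", "o": "ö", "u": "ü",
    "A": "Ä", "O": "Ö", "U": "Ü",
    "s": "ß",
}

# A vowel or "s" followed by "#", but not by "##" or "#<digit>".
_HASH_PATTERN = re.compile(r"([aouAOUs])#(?![#\d])")


def restore_hash_transliterations(text):
    """Replace 'a#', 'o#', 'u#', 's#' with the corresponding German letter."""
    return _HASH_PATTERN.sub(lambda m: HASH_TO_CHAR[m.group(1)], text)
```

For example, `restore_hash_transliterations("Goldstu#cke")` yields "Goldstücke", while a label such as "#1#" passes through unchanged.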
Parameter |
Method |
Result |
Acceptance of Result |
Completeness of signal files |
Script |
All expected files are present |
OK |
Completeness of metadata files |
Script |
All expected information is present |
OK |
Completeness of annotation files |
Script |
All expected files are present |
OK |
Correctness of file names |
Script |
All files are wav files; there is no coherent nomenclature to be checked |
OK |
Empty files |
Script |
No empty files |
OK |
Status of signal, annotation and metadata files |
|
Not provided |
|
Signal durations |
Script |
No information about (average) signal durations provided, all signal durations were larger than 0 |
OK |
Duration cross checks |
Script |
17 out of 352 matching wav and TextGrid.utf8.phon.txt files showed different durations: see table below |
NOT OK |
Cross checks of meta information |
Script |
All audio files are mentioned in viewclarinsession.csv and viewclarinmedia.csv,
all annotation files are mentioned in viewclarinannot.csv,
column Location.Address in viewclarinsession.csv is missing 239 entries, column Age in viewclarinactors.csv is missing 383 entries |
Apart from missing entries: OK |
Cross checks of summary listings |
|
|
n.a. |
Annotation contents |
|
|
n.a. |
Annotation tier nomenclature |
Script |
All tiers have the expected names: phon_informant, phon_interviewer, phon_comment, phon_xxx,
orth_informant, orth_interviewer, orth_comment, orth_xxx |
OK |
Annotation texts |
Script |
170 out of 436 files contain annotation labels other than the expected ones; the script reported 118 erroneous patterns |
NOT OK |
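The duration cross check reported above compares each wav file with its matching annotation file. A minimal sketch of such a comparison follows. It assumes that the .TextGrid.utf8.phon.txt files are UTF-8 text in Praat's long TextGrid format, where the first `xmax = ...` line gives the total duration; the function names and the tolerance of 0.01 s are hypothetical choices, not the settings of the validation script.

```python
import re
import wave

# First "xmax = <number>" line of a long-format TextGrid (assumed layout).
_XMAX = re.compile(r"^\s*xmax\s*=\s*([0-9.eE+-]+)", re.MULTILINE)


def wav_duration(path):
    """Duration of a wav file in seconds (frames / sampling rate)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def textgrid_duration(path):
    """Total duration in seconds as stated by the first xmax entry."""
    with open(path, encoding="utf-8") as handle:
        match = _XMAX.search(handle.read())
    if match is None:
        raise ValueError(f"no xmax entry found in {path}")
    return float(match.group(1))


def durations_differ(wav_path, textgrid_path, tolerance=0.01):
    """True if the two durations differ by more than `tolerance` seconds."""
    difference = wav_duration(wav_path) - textgrid_duration(textgrid_path)
    return abs(difference) > tolerance
```

The tolerance keeps rounding in the annotation file from being reported as a mismatch; a stricter or looser threshold may be appropriate depending on how the TextGrid files were exported.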
Point of Criticism |
Expected Value |
Examples: [interval] "Erroneous Value" - "Correct Value" |
Orthography |
Consistent spelling without spelling mistakes |
1414.TextGrid.utf8.orth.txt:
- [15] "eien" - "einen"
- [19] "Eltrn" - "Eltern"
1330.TextGrid.utf8.orth.txt:
- [6] "klleines" - "kleines"
- [14] "Klasen" - "Klassen"
- [27] "Waser" - "Wasser"
- [28] "maht" - "macht"
|
Inconsistencies in Spelling Conventions: German sharp "s" |
Ideally transcribed consistently, but has been transcribed as "s#|sz" |
1169a-02.TextGrid.utf8.orth.txt:
N_11.TextGrid.utf8.orth.txt:
|
Inconsistencies in Spelling Conventions: German Umlauts |
Ideally transcribed consistently, but have been transcribed as "a# u# o#|ae ue oe" |
1169a-02.TextGrid.utf8.orth.txt:
- [8] "verwuesteter"
- [6] "aermer"
- [7] "Hoefchen"
N_11.TextGrid.utf8.orth.txt:
- [6] "Goldstu#cke"
- [5] "tra#gt"
- No example of "o#" found in the 22 manually checked TextGrid files
|
Inconsistencies in Spelling Conventions: Unintelligible Words |
Ideally transcribed consistently, but have been transcribed as {?} | {unversta#|aendlich} |
N_11.TextGrid.utf8.orth.txt:
- [1, 2, 6, 12, 20, 22] "{?unversta#ndlich}" - "{unversta#ndlich}"
1169a-02.TextGrid.utf8.orth.txt:
|
Turn-Taking Tags |
One speaker per interval would be ideal; speaker as "#1#", interviewer as "#2#", more speakers as "#3#" and so forth |
928b-02.TextGrid.utf8.orth.txt:
- [20] "#1# #1#" - "#1#"
- Mixed turn-taking in 10 out of 154 intervals: [61, 84, 89, 92, 94, 98, 99, 112, 117, 146]
1169a-02.TextGrid.utf8.orth.txt:
- [47] "##" - "#1#"
- [65] "#1" - "#1#"
1454c-04.TextGrid.utf8.orth.txt:
- [181] "#1#" - "#2#"
- [241] "#2#" - "#1#"
|
Inconsistencies in Transcription Conventions: Dialectal Words |
One dialectal word transcribed in brackets (<...>) right after original word, more than one dialectal word in between "<...</>" |
1454c-04.TextGrid.utf8.orth.txt:
- [2] "Hanklichbacken" - "German original word<Hanklichbacken>"
|
Inconsistencies in Transcription Conventions: Standard German, Hungarian & Romanian Words |
German words in between the tags "<<d>...</d>>", Hungarian words in between "<u>...</u>",
and Romanian words in between "<r>...</r>" |
Standard German Words:
- 1169a-02.TextGrid.utf8.orth.txt: [2] "<<d>je nachdem</d>>"
- 1454c-04.TextGrid.utf8.orth.txt: [58] "<d>Herzlich Willkommem zu unserem Hochzeitsfest</d>"
Romanian Words:
- 1169a-02.TextGrid.utf8.orth.txt: [46] "<<r>Presedinte</r>>" - "<r>Presedinte</r>"
- 1169a-02.TextGrid.utf8.orth.txt: [48] "sine Lisaweta" - "<r>sine Lisaweta</r>"
Hungarian Words:
- 33-12.TextGrid.utf8.orth.txt: [1] "<u>Veszprem</u>"
|
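Some of the turn-taking problems listed above (repeated tags, malformed tags such as "##" or "#1", and more than one speaker in an interval) follow a pattern that can be searched for. The sketch below is one possible way to flag them in a single interval label. All names are hypothetical, and it assumes that ### (name placeholder) and "#" after a, o, u or s (transliteration) are the only other legitimate uses of "#". Wrong speaker assignments, such as "#1#" where "#2#" is correct, cannot be found this way and still require manual checking.

```python
import re

_VALID_TAG = re.compile(r"#\d+#")              # e.g. #1#, #2#, #3#
_REPEATED_TAG = re.compile(r"(#\d+#)\s*\1")    # e.g. "#1# #1#"
_NAME_PLACEHOLDER = re.compile(r"###")         # replaced proper names
_TRANSLITERATION = re.compile(r"(?<=[aouAOUs])#")  # a#, o#, u#, s#


def turn_tag_problems(label):
    """Return a list of turn-taking problems found in one interval label."""
    problems = []
    if _REPEATED_TAG.search(label):
        problems.append("repeated tag")
    if len(set(_VALID_TAG.findall(label))) > 1:
        problems.append("mixed turn-taking")
    # Strip every legitimate use of "#"; anything left over is malformed.
    rest = _VALID_TAG.sub("", label)
    rest = _NAME_PLACEHOLDER.sub("", rest)
    rest = _TRANSLITERATION.sub("", rest)
    if "#" in rest:
        problems.append("malformed tag")
    return problems
```

The valid tags are removed before the name placeholder so that a label like "#1# ###" is not misread; a label without any problems returns an empty list.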