Digital Scholarship at The National Library of Wales

Documentation and musings of Jason Evans, Open Data Manager at NLW

Exploring Named Entity Recognition Feasibility for Welsh Language Text

Summary

The aim of this study is to assess the viability of Named Entity Recognition and alignment to external datasets for Welsh language texts.

This document evaluates and compares four pipelines designed for Welsh Named Entity Recognition (NER), focusing on their ability to identify place names and align them with Wikidata entities. The pipelines were tested on a selection of text from historical and modern Welsh sources, with varying levels of success. Key findings and recommendations for improving Welsh NER capabilities are outlined.

Model 2 currently provides the best balance of accuracy and utility, particularly for automated entity suggestion and trend analysis. With targeted improvements to Welsh NER models, especially in mutation handling and training, further advancements can be achieved. Integrating gazetteer-based approaches or leveraging LLMs offers promising pathways for future development.

The models

1. This code implements a Welsh Named Entity Recognition (NER) pipeline. It uses a spaCy model trained on Welsh language data to identify entities in text, then queries Wikidata to find corresponding entities and displays their labels, descriptions, and URLs. The pipeline uses the spaCy library, a pre-trained Welsh language model from Bangor University, ipywidgets for interactive text input, and the Wikidata API. The user inputs text, and the code outputs the identified entities with the corresponding information from Wikidata. https://colab.research.google.com/drive/1SXoe4nyUqG_cM4kOpXGwjiOwyAg7BnDH?usp=sharing

2. This code performs Welsh Named Entity Recognition (NER) by first translating Welsh text to English using the googletrans library. It then uses a spaCy model (en_core_web_wales_ner_lg) trained on Welsh data to identify entities in the translated text. Finally, it queries Wikidata using the requests library to provide additional information about the identified entities. The code also includes an interactive component with ipywidgets to input text and display results within the notebook. https://colab.research.google.com/drive/1Tp7gZRyaLdIYzSVPfGsfM5bNsdSPNsWW?usp=sharing 

3. This code uses the GATE Cymrie Welsh Named Entity Recognizer API to identify entities in Welsh text. Then, it applies a set of rules to remove mutations, which are common in Welsh, from the identified geographic entities. Finally, it uses the Wikidata API to search and suggest matching Wikidata entities, providing labels, descriptions, and URLs. The code also includes a user interface with widgets for text input and processing, allowing users to interactively process text. https://colab.research.google.com/drive/1B_LxnimeCYOb2d5uZC6RozrL-z-0Bicw?usp=sharing 

4. This code uses the GATE Cymrie Welsh NER service to identify entities in a given text. For entities identified as locations, it attempts to remove any mutations and then searches for matches in a pre-loaded gazetteer of Welsh place names. The code uses pandas to manage the gazetteer data, and the requests library to interact with the GATE API. It also utilizes ipywidgets for the user interface. https://colab.research.google.com/drive/1aeQqrhusrxpjZc2nYlpSpQfR8c3T0DHk?usp=sharing
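The Wikidata lookup step shared by pipelines 1 to 3 can be sketched roughly as follows. This is a minimal illustration rather than the notebooks' actual code: it uses the standard library instead of requests so that it is self-contained, and the function names, the Welsh ("cy") search language, and the User-Agent string are assumptions made for the example. The wbsearchentities action is part of the public Wikidata API.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_search_url(term, language="cy", limit=5):
    """Build a wbsearchentities request URL for one entity string."""
    params = {
        "action": "wbsearchentities",
        "search": term,
        "language": language,  # assumption: search Welsh labels and aliases
        "uselang": language,
        "format": "json",
        "limit": limit,
    }
    return f"{WIKIDATA_API}?{urlencode(params)}"


def top_suggestion(payload):
    """Return the highest-ranked hit from a parsed API response, or None."""
    hits = payload.get("search", [])
    if not hits:
        return None
    hit = hits[0]
    return {
        "id": hit["id"],
        "label": hit.get("label", ""),
        "description": hit.get("description", ""),
        "url": f"https://www.wikidata.org/wiki/{hit['id']}",
    }


def search_wikidata(term, language="cy"):
    """Query the live API (needs network access) and return the top hit."""
    request = Request(
        build_search_url(term, language),
        headers={"User-Agent": "welsh-ner-sketch/0.1"},  # hypothetical agent name
    )
    with urlopen(request, timeout=10) as response:
        return top_suggestion(json.load(response))
```

Calling search_wikidata("Llandderfel") would return the first suggestion, which is the result the scoring in this study relies on. Keeping the URL building and response parsing separate from the network call makes the lookup easy to test offline.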

 

Initial test data

This small set of sentences was selected to represent a cross section of resources at NLW, from historic fiction and newspapers to modern text from our website and the Welsh Biography Online. The text was chosen specifically to test the ability to identify place names in Welsh language text. No geographic entity matches have been discarded.

Models were given a percentage score for each sentence based on the number of place names correctly recognised and matched with Wikidata’s highest ranking suggestion or, in the case of Model 4, a direct match in the place name gazetteer.
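As a rough sketch of how such a per-sentence percentage could be computed (the function name and the case-insensitive comparison are assumptions, not the method used in the notebooks):

```python
def sentence_score(expected, matched):
    """Percentage of the expected place names that were correctly matched."""
    expected_set = {name.casefold() for name in expected}
    matched_set = {name.casefold() for name in matched}
    if not expected_set:
        return 0.0
    return 100 * len(expected_set & matched_set) / len(expected_set)


# The second test sentence contains six place names.
expected = ["Cwm-Hir", "Llandrindod", "Powys",
            "Ystrad Fflur", "Pontrhydfendigaid", "Ceredigion"]
# Hypothetical output of a pipeline that matched two of them.
matched = ["Powys", "Ceredigion"]

print(round(sentence_score(expected, matched), 2))  # 33.33
```

Under this reading, the values 16.67, 33.33 and 66.67 reported for that sentence are consistent with one, two and four of its six place names being matched. A model's total is then the sum of its sentence scores, out of a possible 1000 for ten sentences.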

Model scores (%)

| Text to Analyze | Source | Model 1 | Model 2 | Model 3 | Model 4 |
| --- | --- | --- | --- | --- | --- |
| “Sefydlwyd Cymdeithas Edward Llwyd gan Dafydd Davies, Rhandirmwyn ym 1978” | NLW Blog | 0 | 100 | 100 | 100 |
| “Mae un llwybr o’r fath, sef ‘Llwybr y Mynach’, llwybr canoloesol sy’n cysylltu abatai Cwm-Hir (ger Llandrindod, Powys) ac Ystrad Fflur (ger Pontrhydfendigaid, Ceredigion) o’r ddeuddegfed ganrif, yn parhau i fod yn llwybr cyhoeddus poblogaidd hyd heddiw.” | NLW Blog | 16.67 | 33.33 | 66.67 | 66.67 |
| “Ar ôl gadael yr ysgol, gweithiodd fel newyddiadurwr gyda’r Western Daily Press ym Mryste, cyn cael ei gwysio i’r fyddin am weddill y rhyfel, yn swyddog cuddwybodaeth yn yr Eidal, yr Aifft a Phalesteina.” | Dictionary of Welsh Biography | 0 | 100 | 0 | 0 |
| “Ganwyd Hugh Matthews ar 25 Hydref 1936 yn 6 Heol Bryn-gelli, Tre-boeth, Abertawe” | Dictionary of Welsh Biography | 50 | 100 | 50 | 100 |
| “Nos Lun, Mehefin 19eg, traddodwyd darlith yn nghapel y Berth (capel y Methodistiaid Calfiniiidd ger Tregaron), gan y Parch. David Bees, Bronant” | Welsh Newspapers Online | 33.33 | 33.33 | 66.67 | 66.67 |
| “BETH WNEIR yn Llandderfel yw testyn ysgrif ddyddorol yn “Mhapur Pawb” am yr wythnos hon.” | Welsh Newspapers Online | 0 | 100 | 100 | 100 |
| “YN Nghefncoedycymer, Merthyr, boren dydd Sadwrn diweddaf, cafwyd gwas- anaethferch o’r enw Sarah Rogers, brodor o Castellnewydd Emlyn, ond yn ddiweddar yn ngwasanaeth Mr. H. O. Pearce, Tanybryn, wedi ymgrogi.” | Welsh Newspapers Online | 25 | 50 | 25 | 50 |
| “Ar y ffordd wrth fynd i Ruthyn, gwelais ddyn yn gwerthu brethyn” | Yr Hwiangerddi (1911) | 0 | 100 | 0 | 0 |
| “Enw diddorol ar gae rhwng Llangrallo a’r Coety ye ‘Kae Cocid’” | Welsh Journals Online | 0 | 50 | 100 | 100 |
| “Dywedir idd deithio cyn belled â Sir Fôn, ac iddo aros hefyd ar daith i Forgannwg a Mynwy…a bydd rhaid iddo roi’r gorau i’w swydd yn yr eglwys yn Llangrannog” | Welsh Journals Online | 0 | 100 | 50 | 75 |
| Total Scores | | 125 | 766.6 | 558.3 | 657.3 |

Model 1 performed poorly, scoring 125 out of a possible 1000. Whilst this pipeline performs well with English text, it is clear that the spaCy model, even with additional Welsh training data, struggles to identify entities in Welsh language text.

Model 2 scored the highest of all pipelines tested. This pipeline uses the same process as Model 1, but first translates the entire text into English using Google’s translation API, which allows full use of the spaCy model. It was the only pipeline to identify any of the place names in the third piece of text. Whilst pipelines 3 and 4 use a basic algorithm to identify mutations, Google’s machine translation appears to handle those mutations better. This is in large part why Model 2 outperforms Model 4.

Model 2 Process Diagram

Model 3 ranks third in this test. It uses the only freely available NER model trained solely on Welsh language data (CYMRIE). However, it struggled with mutations, so many place names were not identified at the NER stage and were therefore never passed down the pipeline to the mutation algorithm or the Wikidata search.
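To illustrate what a rule-based mutation step of this kind might look like, here is a minimal sketch that proposes possible radical (unmutated) forms for a word. It is not the rule set used in pipelines 3 and 4: the table below covers common soft, nasal and aspirate mutations of initial consonants only, ignores the soft mutation that drops an initial "g", and all names are chosen for the example.

```python
# Mutated initial -> possible radical initials (illustrative, not exhaustive).
REVERSALS = {
    "ngh": ["c"], "mh": ["p"], "nh": ["t"], "ng": ["g"],  # nasal
    "ph": ["p"], "th": ["t"], "ch": ["c"],                # aspirate
    "dd": ["d"], "b": ["p"], "d": ["t"], "g": ["c"],      # soft
    "f": ["b", "m"], "l": ["ll"], "r": ["rh"],            # soft
    "m": ["b"], "n": ["d"],                               # nasal
}
# Words starting with these digraphs are already in radical form.
RADICAL_DIGRAPHS = ("ll", "rh")


def candidate_radicals(word):
    """Return the word itself followed by possible unmutated forms."""
    lower = word.lower()
    candidates = [word]
    if lower.startswith(RADICAL_DIGRAPHS):
        return candidates
    # Try the longest mutated prefix first so "ngh" wins over "ng" and "n".
    for prefix in sorted(REVERSALS, key=len, reverse=True):
        if lower.startswith(prefix):
            for radical in REVERSALS[prefix]:
                restored = radical + word[len(prefix):]
                if word[0].isupper():
                    restored = restored[0].upper() + restored[1:]
                candidates.append(restored)
            break
    return candidates


print(candidate_radicals("Nghaergybi"))  # ['Nghaergybi', 'Caergybi']
print(candidate_radicals("Forgannwg"))   # ['Forgannwg', 'Borgannwg', 'Morgannwg']
```

Because many unmutated names also begin with these letters, the output is only a list of candidates; each one still needs to be checked against Wikidata or a gazetteer. A step like this also only helps if the NER stage has already picked out the mutated form, which is the limitation described above.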

Model 4 takes a slightly different approach from the others: rather than searching Wikidata directly, it uses a place name gazetteer that is aligned to Wikidata and includes multiple alternative spellings. This pipeline performed reasonably well, scoring 657/1000. With a better NER model, or by combining the gazetteer with the translation step used in Model 2, it is possible that this pipeline could perform significantly better.
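A gazetteer lookup of this kind can be sketched as an index from every known spelling to its entry. The notebook uses pandas for this; the example below uses plain dictionaries to stay self-contained. The two entries, their field names, and the placeholder identifiers are hypothetical and stand in for real gazetteer rows and Wikidata IDs.

```python
# Hypothetical gazetteer rows; the qid values are placeholders, not real IDs.
GAZETTEER = [
    {"qid": "Q-PLACEHOLDER-1", "label": "Rhuthun",
     "alt_spellings": ["Ruthin", "Rhuthyn", "Ruthyn"]},
    {"qid": "Q-PLACEHOLDER-2", "label": "Ynys Môn",
     "alt_spellings": ["Sir Fôn", "Môn", "Anglesey"]},
]


def build_index(gazetteer):
    """Map every spelling (case-insensitively) to its gazetteer entry."""
    index = {}
    for entry in gazetteer:
        for spelling in [entry["label"], *entry["alt_spellings"]]:
            index[spelling.casefold()] = entry
    return index


def lookup(candidates, index):
    """Return the entry for the first candidate spelling found, or None."""
    for candidate in candidates:
        entry = index.get(candidate.casefold())
        if entry is not None:
            return entry
    return None


index = build_index(GAZETTEER)
print(lookup(["Ruthyn"], index)["label"])  # Rhuthun
```

Passing a list of candidates lets the lookup accept both the form found in the text and any demutated variants, trying each in turn.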

Further Analysis of Model 2

Model 2 was further tested on two longer pieces of text, each containing the names of 10 significant places (Appendix 1). The first text was extracted from a 1910s Welsh language newspaper. Its sentence structures are somewhat archaic, and it was chosen deliberately to test the pipeline on historical content. The second piece comes from a Dictionary of Welsh Biography article written in 2020.

| Historic text place names | Modern text place names |
| --- | --- |
| Leeds | Llandudno |
| Yorkshire | Ben-y-Gogarth |
| Nghaergybi | Salt Lake City |
| Dublin | afon Missouri |
| Killingworth | Nghymru |
| New-castle-on-Tyne | Utah |
| Handsworth | Brifysgol Michigan |
| Birmingham | Mhrifysgol Pennsylvania |
| Porthmadog | Philadelphia |
| Pontardawe | Ewrop |

| Score | NER identified | Wikidata correctly matched |
| --- | --- | --- |
| Historic text | 60% | 83.3% |
| Modern text | 80% | 100% |

This test reinforces the high accuracy of the Wikidata matching where entities are successfully identified. It also suggests that the NER model struggles to identify place names in older texts, which likely differ in structure and wording from the modern texts on which the model was trained.

Recommendations

Model 2 seems effective enough to produce entity suggestions for manual checking or for trend analysis. Model 4 also seems viable and could likely be improved with some refinement of the syntax criteria; for example, it currently does not find place names of more than one word.
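One possible way to pick up multi-word place names is to test windows of consecutive words against the gazetteer, longest first. The following is a sketch of that idea only, not a change that has been made to Model 4; the function name, the tokenising pattern, and the four-word limit are assumptions.

```python
import re


def find_places(text, gazetteer, max_words=4):
    """Greedy longest-match scan for gazetteer names of up to max_words words."""
    tokens = re.findall(r"[\w'’-]+", text)
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest window first so "Ystrad Fflur" beats "Ystrad".
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate.casefold() in gazetteer:
                found.append(candidate)
                i += n
                break
        else:
            i += 1
    return found


# Gazetteer reduced to a set of lower-cased names for the example.
names = {"cwm-hir", "llandrindod", "powys", "ystrad fflur"}
text = "abatai Cwm-Hir (ger Llandrindod, Powys) ac Ystrad Fflur"
print(find_places(text, names))
# ['Cwm-Hir', 'Llandrindod', 'Powys', 'Ystrad Fflur']
```

This simple version joins words across punctuation, so a stricter tokeniser may be needed on real text.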

Training NER-specific models for Welsh using a large corpus would likely improve accuracy scores further. Providing Welsh training data to newer English or multilingual models may also give better results.

In the meantime, LLMs such as GPT-4o seem able to carry out entity recognition in Welsh to a good standard and can even provide contextual output such as the type of place or the county in which it lies. This should be explored further, although the method is expensive, both financially and environmentally.

For focused entity recognition looking only for place names or organisation names, developing a gazetteer-based model may prove most effective. This is less likely to be effective for personal names due to a high rate of ambiguity.

Testing across larger datasets and developing a workflow for identifying incorrect matches to Wikidata is also recommended. 

APPENDIX 1

Longer texts for further analysis of Model 2.

1.

Source – Y DYDD (1914) –  https://newspapers.library.wales/view/4107089/4107094/25/er%2BOR%2Bgof

“NEWYDDION. Y mae yn debygol mai dydd Mawrth, Ebrill 21ain, fydd dyddiad ail ddarlleniad Mesur Dadgysylltiad yr Eglwys yng Nghymru. Mewn cyfarfod o Gynghor Trefol Leeds ddoe, dvwedwyd i’r streic ddiweddar yn y ddinas gostio o leiaf £112,000. Gwariwyd £16,000 ar ddwyn heddgeidwaid dyeithr i’r ddinas. 

Y Mae nifer y glowyr sydd ar streic yn Swydd Efrog yn awr yn 100,000. Y mae cynhadledd o gynrychiolwyr y meistriaid a’r dynion i gymeryd lle yfory. 

Yn Nghaergybi. ddoe, cynhaliwyd cwest ar gorff dyn gafwyd yn yr Harbwr ddydd Llun, yr hwn ni lwyddwyd i’w adnabod. Dychwelodd y rheithwyr reithfarn o “Cafwyd wedi boddi.” 

Dydd lau, yn Dublin, bwriwyd dyn a ddywedai mai capten yn y fyddin ydoedd, i benyd-wasanaeth am dair blvnedd am dderbyn mil o bunau drwy ddweyd fod deng mil o bunau yn dyfod iddo trwy ewyvllys ei fam. 

Canfuwvd fod haint wedi torri allan yn mysg moch yn Killingworth, ger New-castle-on-Tyne. a dinystrir dros gant o foch. 

Gwnaed cryn ddifrod gan dân i ddau dŷ yn Handsworth, Birmingham, ddydd Llun. Credir mai merched y bleidlais ydoedd achos y tan. 

Dydd Gwener, yn Llys Trwyddedol Porthmadog, gwrthwvnebwyd y cais i adnewyddu trwydded y Maddock’s Arms Hotel. Tremadog.—Gwrthodwyd y cais, a chymellwyd diddymu y dafarn drwy dalu iawn i’r perchenog. 

Fore Sul, gwnaed difrod mawr gan dân ym Mhalas Pontardawe (Abertawe), eiddo Mr. Charles Gilbertson, o ffirm Gilbertson a’r Cwmni, gwncuthurwyr alcan a dur a chaed colled o thua £3,000.”

2.

Source – Dictionary of Welsh Biography – https://bywgraffiadur.cymru/article/c12-CANN-HUG-1857 

“Ganwyd Martha Hughes Cannon yn Stryd Madoc, Llandudno ar 1 Gorffennaf, 1857, yn ail o dair merch Peter Hughes (c.1825-1861), saer, a’i wraig Elizabeth (ganwyd Evans, c.1833-1923). Ar y pryd roedd cymuned fechan o Formoniaid yn hen bentref Llandudno ar Ben-y-Gogarth, ac mae’n debyg bod Peter ac Elizabeth Hughes yn aelodau ohoni. Eu cyfeiriad olaf yng Nghymru, a gofnodwyd yn rhestr teithwyr ‘The Underwriter’, y llong a’u cludodd ar draws Môr Iwerydd yn 1860, oedd Tanygraig sydd i fyny yn yr hen bentref. Profiad dirdynnol i’r teulu oedd croesi o afon Missouri i Salt Lake City mewn cert ychen yn 1861. Bu farw Annie, chwaer iau Martha, ar y Gwastadeddau, a bu ei thad farw dridiau ar ôl cyrraedd Utah.

Pan gyrhaeddodd Martha oedran gadael ysgol, penderfynodd ar yrfa yn gofalu am gleifion. Roedd cynlluniau ar droed ar gyfer ysbyty mamolaeth yn y ddinas, ac ystyrid y dylai afiechydon merched gan eu trin gan feddygon o ferched. Galwyd Martha gan yr Eglwys i ddilyn cwrs meddygaeth, yn un o’r tair merch gyntaf yn Utah a alwyd i wneud hynny. Yn 1878, aeth i Brifysgol Michigan i astudio am radd MD. Cychwynnodd wedyn ar radd ôl-raddedig ym Mhrifysgol Pennsylvania, a chwblhaodd ei haddysg â chwrs mewn areithio cyhoeddus, gan dderbyn gradd ‘Batchelor of Oratory’ o’r ‘National School of Elocution and Oratory’ yn Philadelphia. Yn ôl pob golwg roedd gyrfa lwyddianus a buddiol o’i blaen, ond nid felly roedd hi i fod. Ni fyddai bywyd fyth yn rhwydd i Martha Hughes. Roedd yn ferch gymhleth a diildio, wedi ei rhwygo ddwy ffordd, gan ei ffydd geidwadol ddofn ar yr un llaw a’i gwleidyddiaeth radicalaidd danllyd ar y llall. Wedi pedair blynedd llwyddiannus yn yr ysbyty, yn 1886 cefnodd yn sydyn ar ei gyrfa, a ffoi o Salt Lake City i Ewrop, gan fynd â Lizzie, ei merch saith mis oed, gyda hi, a gadael o’i hôl y gŵr yr oedd wedi ei briodi’n gyfrinachol ddeunaw mis ynghynt. Roedd Angus Munn Cannon yn un o gyfarwyddwyr yr ysbyty, yn ddinesydd Mormon blaenllaw ac yn frawd i un o’r Cworwm o Ddeuddeg Apostol. Roedd dair blynedd ar hugain yn hŷn na Martha ac yn dad i ddau ar bymtheg o blant. Martha oedd ei bedwaredd wraig.”


© 2025 Digital Scholarship at The National Library of Wales
