Croatian media vocabulary is shrinking

According to the Croatian language portal (HJP), the Croatian language contains 116,516 lexical units (words). The portal is a helpful interactive website that many people use to check the spelling, meaning and correct use of words. Vocabulary, of course, varies depending on where and how it is used. Have you ever noticed that your vocabulary changes depending on who you talk to? You use one set of words at work, whereas when talking to a child you try to avoid “difficult” words. If you are an expert in a certain field, your professional vocabulary differs significantly from everyday speech. This certainly applies to doctors and lawyers, but also to electricians, accountants and zookeepers.

The mass media also have their own vocabulary. By that, we mean the set of words used in newspaper and internet articles and in speech on radio and television news programs.


How many words do the media use?


The spark that ignited our interest was research conducted back in 1980. Dubravko Škiljan took articles from Večernji list and Vjesnik and manually counted how many different words were used. In 1980, Vjesnik’s vocabulary comprised 75,754 running words and 10,124 lemmas, while Večernji list’s comprised 54,525 running words and 8,653 lemmas.


What are running words and what are lemmas? Why do these figures differ so much and which figure should be taken into account?

When examining words, you have to bear in mind that words do not always appear in the same form. The word “kuća” can appear in a text as kuća, kući, kućama, kuće, kućom and so on. When we count all the different forms in which words appear, we are counting running words. Lemmas, on the other hand, are words in their basic form. The forms kuća, kući, kućom and kućama count as a single lemma because they all refer to the same basic form, kuća.
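The distinction can be illustrated with a minimal sketch. It assumes a tiny hand-made lookup table (`LEMMA_OF`, hypothetical and for illustration only); the research described here would have needed a full Croatian morphological lexicon or lemmatizer. The function names are likewise our own.

```python
import re

# Hypothetical, hand-made lemma lookup for illustration only.
# A real study would rely on a Croatian morphological lexicon or lemmatizer.
LEMMA_OF = {
    "kuća": "kuća", "kući": "kuća", "kuće": "kuća",
    "kućom": "kuća", "kućama": "kuća",
    "grad": "grad", "grada": "grad", "gradu": "grad",
}

def word_forms(text):
    """Return the set of distinct lowercase word forms found in the text."""
    return set(re.findall(r"\w+", text.lower()))

def lemmas(text):
    """Reduce each distinct form to its lemma; unknown forms stand for themselves."""
    return {LEMMA_OF.get(form, form) for form in word_forms(text)}

sample = "Kuća u gradu. Kuće i kućama, kući i kućom. Grad, grada."
print(len(word_forms(sample)), "different word forms")
print(len(lemmas(sample)), "lemmas")
```

For this sample, the ten different forms collapse into four lemmas (kuća, grad, u, i), which is why the two counts in the research differ so much.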


Which number then should be taken into account?


Both numbers should be taken into account because each gives different information, and together they answer the question of how large the vocabulary is. The number of running words shows how many different word forms a media outlet has used, while the number of lemmas reveals how many different basic forms it has used. In other words, the lemma count measures how diverse the words themselves are, while the running-word count shows how rich the vocabulary is once all the variants and inflected forms are included. The two figures, therefore, complement each other.


That’s all fine and dandy, but what’s the situation today?


The data at our disposal come from unpublished research conducted in 2015 by Slobodan Hadžić, PhD, the owner and founder of the company Presscut, and Artur Šilić, PhD, a long-term business associate of Presscut and an expert in NLP technology. Unlike the 1980 study, the 2015 research was carried out with the help of data processing technologies, the same ones on which media monitoring and Presscut’s archive of media releases are based, and without which it would be very difficult to collect the necessary data. Unfortunately, Vjesnik was no longer being published, so the research was conducted on articles published by Večernji list, Jutarnji list and Slobodna Dalmacija.

Do you think the media vocabulary has expanded or shrunk in the last 35 years? The answer is that the size of the vocabulary has remained roughly the same.

Vjesnik from 1980 remains unmatched, with 75,754 running words and 10,124 lemmas, but today’s articles in daily newspapers have approximately the same vocabulary size as articles published in Večernji list in 1980, which featured 54,525 running words and 8,653 lemmas. For example, Večernji list had 51,503 running words and 9,005 lemmas in 2015.
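To see what “approximately the same” means in numbers, the quoted figures can be compared directly. The short sketch below only does arithmetic on the numbers given in this article; it assumes the 1980 and 2015 samples are comparable in size, which the article does not state.

```python
# Figures quoted in the article: (running words, lemmas).
FIGURES = {
    "Vjesnik 1980": (75_754, 10_124),
    "Večernji list 1980": (54_525, 8_653),
    "Večernji list 2015": (51_503, 9_005),
}

def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100

old_forms, old_lemmas = FIGURES["Večernji list 1980"]
new_forms, new_lemmas = FIGURES["Večernji list 2015"]

forms_change = percent_change(old_forms, new_forms)
lemmas_change = percent_change(old_lemmas, new_lemmas)

print(f"running words: {forms_change:+.1f}%")  # about -5.5%
print(f"lemmas: {lemmas_change:+.1f}%")        # about +4.1%
```

For Večernji list, the number of running words fell by about 5.5% between 1980 and 2015, while the number of lemmas rose by about 4.1%.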


Where do the vocabularies differ?


What is the difference between online and print newspapers? Are there any vocabulary differences?

The answer is: there are. The vocabulary of online newspapers has shrunk. Not only is the vocabulary poorer, but online editions also score worse than the printed editions of the same publisher in the following categories:

text length in sentences, text length in words, text length in characters, average sentence length, average word length in syllables, measure of lexical density, lexical redundancy, measure of entity density, word complexity by syllable length and frequency, measures based on word complexity assessment in the newspaper corpus.
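The simplest of these categories can be computed with a few lines of code. The sketch below is only an illustration, not the method used in the research: it splits sentences on end punctuation and approximates syllables by counting vowel letters, which ignores cases such as the syllabic “r” in Croatian. Measures like lexical density or entity density would need part-of-speech tagging and entity recognition and are left out.

```python
import re

VOWELS = set("aeiou")  # rough syllable proxy; ignores syllabic "r"

def text_metrics(text):
    """Compute basic length measures for a text (illustrative only)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text.lower())
    # Count at least one syllable per word, even if it has no vowel letter.
    syllables = sum(max(1, sum(ch in VOWELS for ch in w)) for w in words)
    return {
        "sentences": len(sentences),
        "words": len(words),
        "characters": len(text),
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "avg_syllables_per_word": syllables / max(len(words), 1),
    }

metrics = text_metrics("Kuća je velika. Grad raste!")
for name, value in metrics.items():
    print(name, value)
```

Running the same function on the print and online versions of an article would give a first, rough comparison of their length and sentence structure.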

Your next objection might sound something like: “But online editions are expected to be short because people don’t have the will or focus to read long articles on screens, unlike when reading a newspaper.” You are right, but only partly.

It is reasonable for texts in online editions to be shorter, but this does not justify using fewer different words than the printed editions do.


Consequences of poor vocabulary


Language is a living phenomenon that changes and adapts as the life and environment of the society that uses it change. Many nations are proud of the size of their language’s vocabulary.

A diverse vocabulary gives a language greater possibilities of expression. A richer lexicon demonstrates the ability of the language to deal with different topics in appropriate ways.

The mass media continue to be a means of information, entertainment and education (which are their primary functions) and consciously or unconsciously influence the diversity of vocabulary used by a nation.

Using only a poor vocabulary erodes the richness of the Croatian language and leads to its impoverishment. The import of words is not a problem, but a consequence. When we stop using words from our own language, the use of foreign words becomes a necessity. Monitoring the richness of the vocabulary of the mass media is certainly a valuable indicator of changes in the richness of the Croatian language in general.