Wanneer is het woord 'de' en wanneer is het woord 'het'? (When is a word 'de' and when is it 'het'?)

After living in the Netherlands for the past five years, I have finally started to learn Dutch seriously, and it seems I am not alone. A survey done by DutchNews.nl claims that 17% of the respondents were learning Dutch during.

I am currently at A1-A2 level and I am just starting to learn about the articles de/het in Dutch. In Dutch, a noun takes either ‘de’ or ‘het’ as its definite article. If the article is ‘de’, the word is probably masculine or feminine. If it is ‘het’, the word is neuter. This is particularly confusing for me because none of the languages I know mark gender with an article before the noun.

Problem

So, in order to write and speak Dutch properly, I need to know whether a word takes de or het as its article. My Dutch teacher said, “We Dutch are crazy so there are no clear rules!” (paraphrased). This means I have to memorize every word together with its article.

The only problem is that I hate memorizing. I always liked math and science because there was clear reasoning behind them. Language is not always like that. There are a few rules, but there are always exceptions to them. This adds to the difficulty of learning the language.

Solution

Every Dutch person I know kind of knows the rules intuitively because they have been speaking the language since they were born. What if we could map out this intuition? What if we could extract rules from the chaos using data?

This is exactly what I will try to do in this blog post. The goals of this post are as follows:

If you are just interested in the fancy figures, you can go directly to the Result section.

Side note: I know that some rules already exist. I will list them below. Do correct me if I am wrong.

  1. If the word ends with ‘je’ (a diminutive) then the article is always ‘het’.

Technical jargon

  1. I will be using various NLP tools to accomplish this. Namely, I will use spaCy to extract the nouns and their vector representations. spaCy’s larger models provide these vector representations out of the box.

  2. I will use t-SNE to reduce the n-dimensional vector representation to 2D. I am using t-SNE in particular because I don’t know the exact number of clusters in advance. t-SNE does not need that number; the clusters magically (not really) show up in the projection.

Without further ado, let’s get started!

Data collection

I am using data made openly available by the Spoken Wikipedia Corpus. They have over 224 hours of spoken Dutch content. I am only interested in the text, so I downloaded the Dutch text from their website.

I am lucky that I do not have to manually collect articles from the Internet. The data is a compressed file that can be extracted easily.

A quick snapshot shows that there are documents about various topics, like Zwarte Piet and Zwolle.

Folder structure

Upon opening each directory we see that there are different files. I quickly saw that I only needed audiometa.txt.

Needed file

I collected all these files and put them together in a pandas DataFrame for further analysis. You can find the code for automatically downloading the compressed file, extracting the text and storing it in a DataFrame in this notebook.
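As a rough illustration of this collection step (not the code from the notebook), the sketch below walks an extracted corpus directory and gathers every audiometa.txt it finds. The function name, the record fields and the choice to skip files that cannot be decoded as UTF-8 are my assumptions.

```python
from pathlib import Path
import tempfile


def collect_documents(root, filename="audiometa.txt"):
    """Return one {"title", "text"} record per folder that contains `filename`.

    Assumption: each article lives in its own folder and the folder name
    serves as the title. Files that are not valid UTF-8 are skipped.
    """
    records = []
    for path in sorted(Path(root).rglob(filename)):
        try:
            text = path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            continue  # not machine readable, leave it out
        records.append({"title": path.parent.name, "text": text})
    return records


# Tiny demo on a throwaway directory tree with made-up contents.
with tempfile.TemporaryDirectory() as tmp:
    for title, text in [("Zwolle", "Zwolle is een stad."), ("Zwarte_Piet", "Een voorbeeld.")]:
        folder = Path(tmp) / title
        folder.mkdir()
        (folder / filename_demo).write_text(text, encoding="utf-8") if False else \
            (folder / "audiometa.txt").write_text(text, encoding="utf-8")
    docs = collect_documents(tmp)
    print(len(docs), "documents collected")

# With pandas installed, pd.DataFrame(docs) turns the records into a DataFrame.
```

Keeping the records as plain dictionaries means the same list can be passed straight to pandas or inspected by hand.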

There are 3065 possibly readable documents. Out of them, 3051 are machine readable.

Data pre-processing

During this step, I extracted the nouns from the text using spaCy’s powerful linguistic features. These features include things like part-of-speech tags (noun, verb, adjective). I think it is super cool that spaCy handles the parsing and extraction and makes the whole process as painless as possible. For example, the following displaCy output makes it super easy to see the relations between the words in a text.

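displaCy output for: De zwarte deur is nu gesloten, in het slot gevallen. Je bent er doorheen gegaan.
Part-of-speech tags: De/DET zwarte/ADJ deur/NOUN is/VERB nu/ADV gesloten/VERB in/ADP het/DET slot/NOUN gevallen/VERB Je/PRON bent/AUX er/ADV doorheen/ADP gegaan/VERB
Dependency arc labels, in order: det, amod, nsubj:pass, aux:pass, advmod, case, det, obl, advcl, nsubj, aux, obl, case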

It provides these neat dependencies between words, which enable me to quickly check for patterns and extract the data I need. For example, we know that ‘deur’ (door) is a noun. Its article is ‘de’, but how can we extract it based on this relation? We can extract the children that point away from the word ‘deur’, i.e. De and zwarte.

I use a very simple check to extract the article from all the documents. I have tried to write it as pseudocode below.

if word is Noun:
  extract its children
  if one of the children is 'de' or 'het':
    save the article and word
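
The pseudocode above could look roughly like this in Python. This is a sketch, not the notebook's code: it only relies on the `text`, `pos_` and `children` attributes that spaCy tokens expose, and `FakeToken` is a hypothetical stand-in so the example runs without a language model installed.

```python
from dataclasses import dataclass, field

ARTICLES = {"de", "het"}


def extract_articles(doc):
    """Yield (article, noun) pairs from an iterable of spaCy-like tokens."""
    for token in doc:
        if token.pos_ != "NOUN":
            continue
        for child in token.children:
            article = child.text.lower()
            if article in ARTICLES:
                yield article, token.text.lower()
                break  # keep the first article found for this noun


@dataclass
class FakeToken:
    """Hypothetical stand-in for a spaCy token, for illustration only."""
    text: str
    pos_: str
    children: list = field(default_factory=list)


# Hand-built fragment of the parse shown earlier: "De zwarte deur ... in het slot".
de = FakeToken("De", "DET")
zwarte = FakeToken("zwarte", "ADJ")
deur = FakeToken("deur", "NOUN", [de, zwarte])
in_ = FakeToken("in", "ADP")
het = FakeToken("het", "DET")
slot = FakeToken("slot", "NOUN", [in_, het])

pairs = list(extract_articles([de, zwarte, deur, in_, het, slot]))
print(pairs)

# With spaCy and a Dutch pipeline installed, `doc` would be nlp(text).
# Which pipeline the post used is not stated; "nl_core_news_lg" is an assumption.
```

Because the function only reads three attributes, the same code should work unchanged on a real spaCy `Doc`.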

On top of that, I also stored spaCy’s vector representation of each word. I can now use this valuable high-dimensional data for various purposes. For instance, it can be used to find relations between words using cosine similarity. I will feed it to the t-SNE algorithm to find a projection in 2D space. Furthermore, this vector data can also be used to train a machine learning algorithm that classifies whether a word requires de or het.
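
For readers who have not met cosine similarity before, here is a minimal pure-Python sketch. The three-dimensional vectors are made up for illustration; real spaCy word vectors have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Made-up toy vectors, not real embeddings.
huis = [0.9, 0.1, 0.3]
voorhuis = [0.8, 0.2, 0.4]
idee = [-0.2, 0.9, -0.5]

print(cosine_similarity(huis, voorhuis) > cosine_similarity(huis, idee))
```

Words whose vectors point in a similar direction score close to 1, which is the kind of neighbourhood structure t-SNE tries to preserve in 2D.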

Using this simple algorithm I was able to extract 21812 unique words with their corresponding article.

The code that automatically does this is available as notebook here.

Result

Cluster

t-Distributed Stochastic Neighbor Embedding (t-SNE) is quite a mouthful. I will not go into too much detail about how the algorithm works but, in short, t-SNE captures non-linear structure while addressing the crowding problem.

An excerpt from this blog provides a helpful explanation. I have copied the most important bit for you.

  1. In the high-dimensional space, create a probability distribution that dictates the relationships between various neighboring points

  2. It then tries to recreate a low dimensional space that follows that probability distribution as best as possible.
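
For the mathematically inclined, the two steps above correspond to the standard t-SNE formulation, sketched here. The symbols are my own labels rather than the quoted blog's: the x_i are the original word vectors, the y_i their 2D positions, N the number of words, and σ_i a per-point bandwidth that the algorithm derives from the perplexity setting.

```latex
% Step 1: neighbour probabilities in the high-dimensional space
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}

% Step 2: heavy-tailed (Student-t) similarities in the 2D map
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

% The map is found by minimising the mismatch between the two distributions
C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

The heavy tail of the Student-t distribution in the second step is what gives moderately distant points room to spread out in 2D, which is how t-SNE deals with the crowding problem.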

It has its downsides: “it still does make some important assumptions about the data. One assumption is that the local structure of the manifold is linear. The reason this assumption is important is that the distance between neighboring points is measured in Euclidean distance, which assumes linearity. This may seem like a relatively benign assumption, but in complex manifolds where even the local structure is complex, this poses a significant problem. In such cases, methods like autoencoders might fare better”

After some magic, we can see this chart! Feel free to hover around it. I have made the notebook available that does the work of generating this pretty chart.

So what is this really?

To be safe, always use de

My analysis showed that the majority of Dutch words are de words. In my sample of 21812 words, 76.45% used de as their article and 23.55% used het.

If the majority of cases are de, then it would be interesting to see when I should actually use het.
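
As a quick sanity check, the percentages follow directly from the de/het counts reported elsewhere in this post (16675 de words and 5137 het words). The helper below is a sketch; its name and the (article, word) pair layout are my assumptions.

```python
from collections import Counter


def article_shares(pairs):
    """Return the percentage of (article, word) pairs per article, rounded to 2 decimals."""
    counts = Counter(article for article, _ in pairs)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {article: round(100 * n / total, 2) for article, n in counts.items()}


# Placeholder words; only the counts (taken from the post) matter here.
sample = [("de", "woord")] * 16675 + [("het", "woord")] * 5137
print(article_shares(sample))
```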

Always use het if the word ends with ‘je’

For example, netje, balletje etc. are diminutive forms of Dutch words, so they are all het words. In my corpus, 97% of the words ending with ‘je’ are indeed het words. The rest seem to be exceptions:

non-dimunitive-weird-cases

I am not sure whether these are simply grammatical mistakes by the writers. For example, gewichtje is a het word, so all of them could be het.

Use het if the word ends with ‘isme’

Upon inspecting the chart, I saw a cluster of blue dots, which represent het words. 50 or so words in this cluster seem to have the suffix -isme.

suffix-isme-cluster

To verify this hypothesis, I filtered all words with the suffix -isme and checked whether that is really the case. It turns out that out of 151 words with the suffix -isme, 141 do indeed have het. There was one weird exception. The word ‘controlemechanisme’ seems not to follow this rule. It could be that there was a typo in the corpus, so I used my browser’s ‘mechanisme’ to ‘control’ this exception (sorry, not sorry). It turns out controlemechanisme is a het word, so I can accept my hypothesis that all words with the suffix -isme are het.

suffix-sme-cluster-exception
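
A suffix check like the ones used for ‘je’ and -isme can be sketched in a few lines. The function name and the toy sample are made up for illustration; the real analysis ran on the full list of extracted (article, word) pairs.

```python
def het_share_for_suffix(pairs, suffix):
    """Return (n_matching, n_het, exceptions) for words ending with `suffix`.

    `pairs` is an iterable of (article, word) tuples; `exceptions` lists the
    matching words whose article is not 'het', so they can be checked by hand.
    """
    matching = [(article, word) for article, word in pairs if word.endswith(suffix)]
    exceptions = sorted(word for article, word in matching if article != "het")
    return len(matching), len(matching) - len(exceptions), exceptions


# Toy sample, not the real corpus.
sample = [
    ("het", "socialisme"),
    ("het", "toerisme"),
    ("de", "controlemechanisme"),
    ("de", "deur"),
    ("het", "balletje"),
]

total, n_het, exceptions = het_share_for_suffix(sample, "isme")
print(f"{n_het}/{total} words ending in -isme are het; check by hand: {exceptions}")
```

Returning the exceptions rather than only a percentage makes it easy to look them up one by one, which is how a suspicious case like controlemechanisme gets caught.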

Use het if the compound word ends with ‘huis’

Compound words are combinations of two words, as you can see below, e.g. voor+huis=voorhuis.

compound word ending with het

I remember my teacher saying that compound words always receive the article of the last word. Huis is a het word, so any compound word ending in huis should also be het. Let’s verify this hypothesis. Upon inspecting my corpus, I found 94 words that end with huis, and they do indeed all have het as their article.

Generalizing: use het for compound words whose last word is a het word, and de for compound words whose last word is a de word

There are 5137 het words in my corpus. Out of those, 185 appear as the last part of at least one compound word.

I searched for all compound words built on each het word and checked whether the resulting compound word is also het. I was surprised by the result. On average, 99.23% of the compounds of a het word are also het words. The following chart shows that the word ‘huis’ produced the largest number of compound het words. This could be the reason why I spotted it in my cluster above in the first place.

compound word ending with het

Conversely, there are 16675 de words in my corpus. Out of those, 1372 appear as the last part of at least one compound word. I searched for all compound words built on each de word and checked whether the resulting compound word is also de. On average, 98.66% of the compounds of a de word are also de words.

generalizing compound words
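
One way to run such a compound check is sketched below. The post does not say how compounds were detected, so the plain "ends with the base word" test, the minimum prefix length and the function name are all my assumptions; a suffix match will also pick up words that merely happen to end in the same letters.

```python
def compound_agreement(pairs, base, min_prefix=2):
    """Share of words ending in `base` that take the same article as `base`.

    `pairs` is an iterable of (article, word) tuples. Returns None when `base`
    is not in the data or has no compounds. If a word occurs with both
    articles, the last occurrence wins.
    """
    article_of = {word: article for article, word in pairs}
    if base not in article_of:
        return None
    compounds = [
        word for word in article_of
        if word.endswith(base) and len(word) >= len(base) + min_prefix
    ]
    if not compounds:
        return None
    same = sum(article_of[word] == article_of[base] for word in compounds)
    return same / len(compounds)


# Toy sample, not the real corpus.
sample = [
    ("het", "huis"), ("het", "voorhuis"), ("het", "ziekenhuis"),
    ("de", "deur"), ("de", "voordeur"),
]
print(compound_agreement(sample, "huis"), compound_agreement(sample, "deur"))
```

Averaging this score over every base word that has at least one compound would give a figure comparable to the percentages quoted above.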

Conclusion

I learnt that there are some helpful rules, but that there are exceptions to them. This short experiment has allowed me to ask questions about the rules for de/het in Dutch and answer them using data. I also learnt to use spaCy’s language features better for both data extraction and visualization. The next step would be to create a machine learning model that can predict whether a word is het or de based on this data. Stay tuned!
