Exercise: Extracting Named Entities

From a given list of word tokens with corresponding parts of speech, we want to extract all named entities. Extracting word tokens with POS "NE" would be too simple. The difficulty in this exercise is that we want to combine multi-token named entities as one named entity entry (by concatenating the tokens with spaces in-between). Whether or not a named entity is represented by a single or by multiple tokens can be determined as follows: If consecutive word tokens have the same POS tag "NE", it is assumed they belong to the same named entity.

Example
Here is an example. Let's assume the following list as input:

List(("Peter", "NE"), ("likes", "V"), ("New", "NE"), ("York", "NE"))
Then the corresponding result should be as follows:
List("Peter", "New York")
The method we want to implement should return a list of named entities containing both single and multi token entities -- as illustrated in the above example.

Prerequisites
We want to reuse the list of word tokens with corresponding POS tags from our implementation of PosLists from the previous exercise. Thus, this task assumes that you have successfully implemented PosLists.listOfWordTokensWithPos. As before, the word tokens and POS tags are extracted from the TüBa-D/Z treebank file tuebadz_1-50.utf8.

Your Task

Your task is to implement method def getNamedEntities(): List[String]. It should extract all named entities as described above.

Pull the latest version from Git. For this exercise we have prepared:

  • NamedEntities.scala: contains method stub getNamedEntities which needs to be implemented.
  • NamedEntitiesTest.scala: contains unit tests. Before using these unit tests, you have to activate them by replacing all occurrences of ignore with in.

 

Hints for a recursive implementation:
  1. Create a recursive inner helper method with the following signature: extractAndCombineNEs(wordTokensWithPos: List[(String, String)]): List[String]
  2. Base case: if the input list is empty, return an empty list.
  3. Recursive case: if the input list is non-empty, split it with the help of the collection method span into (i) the longest prefix sublist whose elements share the same POS and (ii) the remainder of the list (i.e., the 'rest' that does not belong to this longest prefix sublist).
    • If the tokens in the extracted prefix sublist are no NEs -> recursive call with the remainder of the list.
    • If the extracted prefix sublist tokens are NEs and the remainder of the list is empty -> stop the recursion and return a list containing a single element representing the prefix sublist NE.
    • If the extracted prefix sublist tokens are NEs and the remainder of the list is non-empty -> add a list element representing the prefix sublist NE to the result AND continue recursion with the remainder of the list.