Exercise: Compound Splitting

We want to implement a simple noun compound splitter, which is based solely on a list of simplex nouns. It should split compound words (written as one word, i.e., without spaces in-between) into sequences of simplex nouns where the simplex nouns have to exist in the provided list of nouns.

Example
Here is an example. Let's assume our list of simplex nouns contains the following words:

Apfel, Baum, Schnee, Mann, Auto, Bahn, Anschluss, Stelle, Druck, Drucker, Zeugnis, Erzeugnis, Beere
Then our compound splitter should split compounds as follows:
  • "Apfelbaum" -> List("Apfel", "Baum")
  • "Schneemann" -> List("Schnee", "Mann")
  • "Autobahnanschlussstelle" -> List("Auto", "Bahn", "Anschluss", "Stelle")

Ambiguous splits
For some compound words several splits are possible. For example, the German compound "Druckerzeugnis" might be split into "Druck+Erzeugnis" or "Drucker+Zeugnis". We do not intend to identify the 'correct' split or disambiguate between several possible splits. We simply want to return all identified splits.

  • "Druckerzeugnis" -> List("Druck", "Erzeugnis"), List("Drucker", "Zeugnis")

Bound morphemes
Compounds where not all constituent parts exist in the list of simplex nouns should not be split. For example, although "Beere" is in the list of simplex nouns, "Himbeere" should not be split, since "Him" is a bound morpheme and, thus, not in the list of simplex nouns.

Your Task

Your task is to initialize the variable simplexNouns and to implement the methods split and isCompound.

  • val simplexNouns: List[String]: should represent the list of simplex nouns from file nounListFile, which is provided in the constructor of the class (class CompoundSplitter(val nounListFile: String)). You have to read the file, e.g. with the help of Source.fromURL(...).
  • def split(s: String): List[List[String]]: should split the given string s into all appropriate concatenations of simplex nouns from the given list of nouns. If there is only a single possible split, the resulting list contains only one list of splits, e.g., List(List("Schnee", "Mann")). However, in ambiguous cases such as for the word "Druckerzeugnis", the resulting list looks like List(List("Druck", "Erzeugnis"), List("Drucker", "Zeugnis")).
  • def isCompound(s: String): Boolean: is supposed to determine whether or not the given string is a compound. It should return true if there is at least one possible split given the list of simplex nouns.

Pull the latest version from Git. For this exercise we have prepared:

  • CompoundSplitter.scala: contains variable simplexNouns, which needs to be initialized appropriately, and method stubs isCompound and split which need to be implemented
  • CompoundSplitterTest.scala: contains unit tests. Before using these unit tests, you have to activate them by replacing all occurrences of ignore with in.
  • simplex_nouns_en.dict: sample list of simplex English nouns
  • simplex_nouns_de.dict: sample list of simplex German nouns

 

Hints:
  1. Start with the straight-forward subtasks:
    • Initialize variable simplexNouns by extracting all lines from file nounListFile and converting them to a list.
    • The implementation of function isCompound is a simple boolean expression, which returns true if the result of method split is not empty.
  2. Implement method split:
    • Create a recursive local function that does the splitting.
    • This recursive function takes two arguments:
      1. s: The remaining string to be split.
      2. acc: An argument that tracks the so far accumulated result.
    • Base case: If the remaining string to be split is empty, the end of the string has been reached and the recursion stops.
    • Recursive case: If the remaining string to be split is not empty
      • Filter the list of simplex nouns for those that represent a starting substring of s.
      • For each such filtered noun, get the substring of s where the simplex noun has been removed from the beginning.
      • Add the simplex noun to the accumulated result acc (adding an element to the end of a list can be done with list :+ element).
      • Recursive call to the local function with the remaining substring.
    • Do not forget to include an initial call to the recursive splitting function, where s is the original string and where acc is an initially empty list.