Exercise: Compound Splitting
We want to implement a simple noun compound splitter, which is based solely on a list of simplex nouns. It should split compound words (written as one word, i.e., without spaces in-between) into sequences of simplex nouns where the simplex nouns have to exist in the provided list of nouns.
Example
Here is an example. Let's assume our list of simplex nouns contains the following words:
Apfel, Baum, Schnee, Mann, Auto, Bahn, Anschluss, Stelle, Druck, Drucker, Zeugnis, Erzeugnis, BeereThen our compound splitter should split compounds as follows:
- "Apfelbaum" -> List("Apfel", "Baum")
- "Schneemann" -> List("Schnee", "Mann")
- "Autobahnanschlussstelle" -> List("Auto", "Bahn", "Anschluss", "Stelle")
Ambiguous splits
For some compound words several splits are possible. For example, the German compound
"Druckerzeugnis" might be split into "Druck+Erzeugnis" or "Drucker+Zeugnis". We do not intend
to identify the 'correct' split or disambiguate between several possible splits. We simply
want to return all identified splits.
- "Druckerzeugnis" -> List("Druck", "Erzeugnis"), List("Drucker", "Zeugnis")
Bound morphemes
Compounds where not all constituent parts exist in the list of simplex nouns should
not be split. For example, although "Beere" is in the list of simplex nouns, "Himbeere"
should not be split, since "Him" is a bound morpheme and, thus, not in the list of simplex
nouns.
Your Task
Your task is to initialize the variable simplexNouns
and to implement the methods
split
and isCompound
.
val simplexNouns: List[String]
: should represent the list of simplex nouns from filenounListFile
, which is provided in the constructor of the class (class CompoundSplitter(val nounListFile: String)
). You have to read the file, e.g. with the help ofSource.fromURL(...)
.def split(s: String): List[List[String]]
: should split the given strings
into all appropriate concatenations of simplex nouns from the given list of nouns. If there is only a single possible split, the resulting list contains only one list of splits, e.g.,List(List("Schnee", "Mann"))
. However, in ambiguous cases such as for the word "Druckerzeugnis", the resulting list looks likeList(List("Druck", "Erzeugnis"), List("Drucker", "Zeugnis"))
.def isCompound(s: String): Boolean
: is supposed to determine whether or not the given string is a compound. It should return true if there is at least one possible split given the list of simplex nouns.
Pull the latest version from Git. For this exercise we have prepared:
- CompoundSplitter.scala: contains variable
simplexNouns
, which needs to be initialized appropriately, and method stubsisCompound
andsplit
which need to be implemented - CompoundSplitterTest.scala: contains unit tests. Before using these unit tests, you have
to activate them by replacing all occurrences of
ignore
within
. - simplex_nouns_en.dict: sample list of simplex English nouns
- simplex_nouns_de.dict: sample list of simplex German nouns
- Start with the straight-forward subtasks:
- Initialize variable
simplexNouns
by extracting all lines from filenounListFile
and converting them to a list. - The implementation of function
isCompound
is a simple boolean expression, which returnstrue
if the result of methodsplit
is not empty. - Implement method
split
: - Create a recursive local function that does the splitting.
- This recursive function takes two arguments:
s
: The remaining string to be split.acc
: An argument that tracks the so far accumulated result.- Base case: If the remaining string to be split is empty, the end of the string has been reached and the recursion stops.
- Recursive case: If the remaining string to be split is not empty
- Filter the list of simplex nouns for those that represent a starting
substring of
s
. - For each such filtered noun, get the substring of
s
where the simplex noun has been removed from the beginning. - Add the simplex noun to the accumulated result
acc
(adding an element to the end of a list can be done withlist :+ element
). - Recursive call to the local function with the remaining substring.
- Do not forget to include an initial call to the recursive splitting function,
where
s
is the original string and whereacc
is an initially empty list.