Suggestions for your team projects

Suggestion 1: regular expressions

  • Write a program that can evaluate regular expressions. Your regular expressions should support character classes, disjunction, grouping with parentheses, optionality, and the usual closure operators `+` and `*`.
  • Convert the regular expressions to finite automata, make the automata deterministic and minimize them.
  • Create functions that let you run your automata against text, finding all matches incrementally.
  • Compare the performance of your solution with Java's built-in regular expressions (`java.util.regex`). Can you find examples where your solution is faster than Java regexes?
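
As a starting point for the determinization and matching steps, here is a minimal sketch in Python (the language is an arbitrary choice for illustration; the same structure carries over to Java). It assumes the regular expression has already been turned into an NFA stored as a dictionary from `(state, symbol)` to a set of target states. The names `EPS`, `epsilon_closure`, `determinize`, and `find_all`, and the hand-built NFA for `a(b|c)+`, are all hypothetical. Parsing, minimization, and empty matches are left out.

```python
from collections import deque

EPS = None  # label for epsilon transitions (a convention of this sketch)

def epsilon_closure(states, nfa):
    """Return all NFA states reachable from `states` via epsilon moves only."""
    stack = list(states)
    closure = set(states)
    while stack:
        s = stack.pop()
        for t in nfa.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def determinize(nfa, start, accepting, alphabet):
    """Subset construction: each DFA state is a frozenset of NFA states."""
    d_start = epsilon_closure({start}, nfa)
    transitions, d_accepting = {}, set()
    queue, seen = deque([d_start]), {d_start}
    while queue:
        current = queue.popleft()
        if current & accepting:
            d_accepting.add(current)
        for ch in alphabet:
            moved = set()
            for s in current:
                moved.update(nfa.get((s, ch), ()))
            if not moved:
                continue  # a missing entry stands for the dead state
            target = epsilon_closure(moved, nfa)
            transitions[(current, ch)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return d_start, transitions, d_accepting

def find_all(text, d_start, transitions, d_accepting):
    """Lazily yield (start, end) of the longest non-empty match at each start."""
    for i in range(len(text)):
        state, last_end = d_start, None
        for j in range(i, len(text)):
            state = transitions.get((state, text[j]))
            if state is None:
                break
            if state in d_accepting:
                last_end = j + 1
        if last_end is not None:
            yield (i, last_end)

# Hand-built NFA for a(b|c)+ : 0 -a-> 1, 1 -b/c-> 2, 2 -eps-> 1
nfa = {(0, "a"): {1}, (1, "b"): {2}, (1, "c"): {2}, (2, EPS): {1}}
dfa = determinize(nfa, start=0, accepting={2}, alphabet="abc")
matches = list(find_all("xabcc ab a", *dfa))
print(matches)  # [(1, 5), (6, 8)]
```

Because `find_all` is a generator, matches are produced one at a time as the text is scanned. It restarts the DFA at every position, which is quadratic in the worst case; that is one of the places where a performance comparison can become interesting.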

Suggestion 2: HMMs

  • Write your own HMM implementation. If you don't know how, check out Michael Collins's lecture notes here: http://www.cs.columbia.edu/~mcollins/hmms-spring2013.pdf
  • Validate your implementation by training a POS tagger on a well-known data set, e.g., the Brown corpus. Split the corpus into a training set and a test set, and evaluate your tagger on the test set.
  • Use your HMM implementation for named entity recognition. Choose a domain and corpus yourselves, but it is best to start with something where you have tagged data.
  • Try to implement an active learner with your HMM. Start training with a small sample only, then select specific samples to present to a human annotator (or to look up in the tagged corpus) for further annotation. Try different strategies for selecting those samples, based on classification confidence. Which strategy works best, i.e., gives you the largest gain with the smallest amount of training data?
  • Try your approach on a different domain and check if the results carry over.
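
For the decoding part of the HMM, a minimal sketch of the Viterbi algorithm for a bigram HMM in log space might look as follows (Python, chosen only for illustration). The function name `viterbi`, the helper `logs`, the table layout, and all probabilities below are made up for this example; a real tagger would estimate them from the training set and would need smoothing for unseen words. The sketch assumes a non-empty observation sequence.

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return (best state sequence, its log-probability) for a non-empty `obs`."""
    NEG_INF = -math.inf
    # best[t][s]: log-probability of the best path that ends in state s at time t
    best = [{s: log_start[s] + log_emit[s].get(obs[0], NEG_INF) for s in states}]
    back = []  # back[t-1][s]: predecessor of s on the best path at time t
    for t in range(1, len(obs)):
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[t - 1][p] + log_trans[p][s])
            scores[s] = (best[t - 1][prev] + log_trans[prev][s]
                         + log_emit[s].get(obs[t], NEG_INF))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]

def logs(table):
    """Convert a dict of positive probabilities to log-probabilities."""
    return {k: math.log(v) for k, v in table.items()}

# Toy parameters, invented for illustration only
states = ["DET", "NOUN", "VERB"]
log_start = logs({"DET": 0.8, "NOUN": 0.1, "VERB": 0.1})
log_trans = {
    "DET": logs({"DET": 0.05, "NOUN": 0.9, "VERB": 0.05}),
    "NOUN": logs({"DET": 0.1, "NOUN": 0.2, "VERB": 0.7}),
    "VERB": logs({"DET": 0.5, "NOUN": 0.3, "VERB": 0.2}),
}
log_emit = {
    "DET": logs({"the": 1.0}),
    "NOUN": logs({"dog": 0.6, "barks": 0.4}),
    "VERB": logs({"dog": 0.3, "barks": 0.7}),
}

path, score = viterbi(["the", "dog", "barks"], states, log_start, log_trans, log_emit)
print(path)  # ['DET', 'NOUN', 'VERB']
```

The returned log-probability is one possible confidence signal for the active learner, for example by preferring sentences whose best path has a low score relative to their length. Whether that is a good selection strategy is exactly what the experiments should show.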