BeyondCalories Diaries : Ingredient Parsing
One of the important phases of extracting nutritional information from free-text recipes is parsing the ingredient lines.
Initially I thought about building a full-blown, domain-specific, ML (Machine Learning) based IE (Information Extraction) extractor, but to keep things simple I chose a regex (Regular Expression) based approach. Another motivation for choosing a regex-based extractor was the limited vocabulary of the ingredients.
Once I decided on regex, it became obvious that ingredients come in too many shapes and sizes.
- 12 cups lettuce
- 14 large, fresh eggs
- apple, cored, peeled
- 1-2 tsp. fresh lime juice
- salt to taste
- 1/4 cup plus 2 tablespoons grated Parmesan cheese
- Two 15-ounce cans chickpeas (4 cups), rinsed and drained
It is interesting to note that in these seemingly endless variations there is a hidden structure. Each line contains, in order -
- amount/quantity (optional). ex - 12
- unit (optional). ex - cups
- Pre-modifiers (optional). ex - grated
- food item. ex - Parmesan cheese
- Post-modifiers (optional). ex - cored, peeled
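To make the ordered, mostly optional slots concrete, here is a minimal plain-Ruby sketch that splits simple lines into those five fields. It is not the parser described in this post: the `UNITS` and `PRE_MODIFIERS` vocabularies, the `IngredientLine` struct and the `split_ingredient` helper are all hypothetical names invented for illustration, and it assumes everything after the first comma is a post-modifier, so a line like "14 large, fresh eggs" would defeat it.

```ruby
# Hypothetical, deliberately tiny vocabularies (assumptions, not from the real parser).
UNITS = %w[cup cups tsp. tablespoon tablespoons ounce ounces].freeze
PRE_MODIFIERS = %w[fresh grated large chopped].freeze

# One field per slot of the structure listed above.
IngredientLine = Struct.new(:quantity, :unit, :pre_modifiers, :item, :post_modifiers)

def split_ingredient(line)
  # Everything after the first comma is treated as post-modifiers.
  head, *post_modifiers = line.split(",").map(&:strip)
  tokens = head.to_s.split

  # Each leading slot is optional, so consume a token only when it matches.
  quantity = tokens.shift if tokens.first.to_s.match?(/\A\d+(?:[\/-]\d+)?\z/)
  unit = tokens.shift if UNITS.include?(tokens.first)
  pre_modifiers = []
  pre_modifiers << tokens.shift while PRE_MODIFIERS.include?(tokens.first)

  # Whatever is left before the first comma is taken to be the food item.
  IngredientLine.new(quantity, unit, pre_modifiers, tokens.join(" "), post_modifiers)
end

p split_ingredient("1-2 tsp. fresh lime juice").to_a
p split_ingredient("apple, cored, peeled").to_a
```

Even this toy version shows why the order matters: each slot is tried once, from left to right, and skipped when absent. It also shows how quickly fixed word lists and comma splitting run out of steam on real recipes.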
Normally I'm more of a GDD (Guilt-Driven Developer) than a TDD one. (I've already discussed the role of TDD and where it makes sense. It doesn't make much sense when you are at an early stage of your startup or researching/inventing something new, because no requirements are baked yet. On the contrary, it makes perfect sense for big companies (behaving agile) and agencies, where you just have to focus on code quality rather than discovering a market or exploring R&D options.) But this is a classic case for writing the specs before embarking on coding the extractor.
    describe IngredientParser do
      it "should parse - <12 cups lettuce>" do
      end
      …
    end
The entire list can be found here - https://gist.github.com/dbose/73f30d65efa8f29fce16
As far as the implementation goes, a naive hand-coded regex would be too unwieldy for extracting the latent structure. As far as rule-based systems go, a parser is a nice way to manage this insane task. An NLP-based NP-chunker (a POS tagger with regex thrown at it) would be a nice fit as well, but I couldn't find one written in pure Ruby without resorting to JRuby-based implementations like OpenNLP Ruby - https://github.com/louismullie/open-nlp.
Instead I decided to write the grammar for a custom parser using one of two widely used libraries, Treetop (https://github.com/nathansobo/treetop) or Citrus (https://github.com/mjijackson/citrus). In the end I chose Citrus, without any specific bias.
    grammar Ingredient
      rule ingredient_line
        qualifier* quantity* (space|'-')* unit* additional_unit* unit_suffix* pre_modifiers* base_ingredient* post_modifier
      end
    end
The cool part of Citrus is that the whole "grammar" is a souped-up Ruby module and each rule is a souped-up method in that module. Just like real modules, a grammar can inherit rules from a base grammar module. The first rule is the "main" rule, which can be composed of many sub-rules. Let's look at how I coded the rule for quantity -
    rule quantity
      # captures
      #
      # => 1 1/2 tbsp...
      # => 4 to 6 pitas
      # => 4 - 6 oz fresh mozzarella
      # => 12-to 14-pounds tomatoes
      # => 4-6 pitas
      # => 3- to 3-1/2-pound
      (numeric space* (quantity_fraction_pure | quantity_range)*)1*
    end
Let's take an example of "3- to 3-1/2-pound".
The leading 3 is matched by the numeric rule -
    rule numeric
      float | quantity_fraction | number | numeric_words
    end
space is optional (denoted by the * symbol, which means zero or more occurrences). The subsequent - doesn't match the quantity_fraction_pure rule. Let's explore the quantity_range rule. Often quantities are mentioned in a range-like fashion. This rule should tame them all!
    rule quantity_range
      ('-to' | 'to' | '-' ) space ('to' space)* numeric ('-')*
    end
In our example it matches - to 3-1/2. The latter part of the range, 3-1/2, is a fraction and is matched by the quantity_fraction rule -
    rule quantity_fraction
      (number quantity_suffix number quantity_suffix* number*)
    end
quantity_suffix is a terminal rule, which is atomic (meaning it has no further composition) -
    rule quantity_suffix
      '-' | '/'
    end
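For readers without Citrus at hand, the quantity walk-through can be approximated with plain Ruby regular expressions. This is only a rough sketch, not a translation of the grammar: the constant and method names (`NUMBER`, `RANGE_SEP`, `QUANTITY`, `to_rational`, `parse_quantity`) are made up for illustration, floats and number words are left out, and it assumes that a form like 3-1/2 means three and a half.

```ruby
# A whole number, a pure fraction ("1/2"), or a mixed number ("3-1/2", "1 1/2").
NUMBER = /\d+(?:[- ]\d+\/\d+|\/\d+)?/
# Range separators seen in the examples: "-to", "- to", "to", "-".
RANGE_SEP = /\s*(?:-\s*to|to|-)\s*/
# A quantity is a number, optionally followed by a separator and an upper bound.
QUANTITY = /\A(?<low>#{NUMBER})(?:#{RANGE_SEP}(?<high>#{NUMBER}))?/

# Assumption: "3-1/2" and "3 1/2" both mean 3 + 1/2.
def to_rational(text)
  text.split(/[-\s]/).sum { |part| Rational(part) }
end

# Returns [low, high] as Rationals, or nil when the line has no leading quantity.
def parse_quantity(line)
  match = QUANTITY.match(line)
  return nil unless match

  low = to_rational(match[:low])
  high = match[:high] ? to_rational(match[:high]) : low
  [low, high]
end

p parse_quantity("3- to 3-1/2-pound")
p parse_quantity("4 - 6 oz fresh mozzarella")
p parse_quantity("salt to taste")
```

Using `Rational` rather than floats keeps fractions such as 1/4 exact, which matters once quantities are scaled or summed. The regex gets noticeably harder to read with every new variation, which is the argument for a grammar made of small named rules.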
Citrus is good but could have been better. Compared with pyparsing (http://pyparsing.wikispaces.com/), one important feature missing in Citrus is the ability to hook a custom function into the parser stream. Such a hook would open up interesting possibilities - for example
    rule pre_modifier
      lematized('peeled')
    end
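Outside of any grammar, the idea behind such a hook can be sketched in a few lines of plain Ruby. Everything here is hypothetical: `naive_lemma` is a crude suffix stripper standing in for a real lemmatizer, and `lemmatized` simply returns a predicate that a parser could call on each token. It is not a Citrus feature.

```ruby
# Crude stand-in for a lemmatizer: strips a trailing "ing", "ed" or "s".
# Assumption: good enough for regular modifiers like peel/peeled/peeling only.
def naive_lemma(word)
  word.downcase.sub(/(?:ing|ed|s)\z/, "")
end

# Builds a matcher that accepts any token sharing the same naive lemma.
def lemmatized(word)
  target = naive_lemma(word)
  ->(token) { naive_lemma(token) == target }
end

peeled = lemmatized("peeled")
p peeled.call("peeling")
p peeled.call("Peels")
p peeled.call("cored")
```

With a hook like this, one rule could cover every inflection of a modifier instead of listing each form as a separate literal.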
This is not exactly sexy compared to an ML-based approach, but it kind of fits the limited vocabulary of recipe ingredients.