BeyondCalories Diaries : Ingredient Parsing

One of the important phases of extracting nutritional information from free-text recipes is the extraction step itself.
Initially I thought about building a full-blown, domain-specific, ML (Machine Learning) based IE (Information Extraction) system, but to keep things simple I chose a regex (Regular Expression) based approach. Another motivation for choosing a regex-based extractor was the limited vocabulary of ingredients.

Once I decided on regex, it became obvious that ingredients come in too many shapes and sizes.

  • 12 cups lettuce
  • 14 large, fresh eggs
  • apple, cored, peeled
  • 1-2 tsp. fresh lime juice
  • salt to taste
  • pepper
  • 1/4 cup plus 2 tablespoons grated Parmesan cheese
  • Two 15-ounce cans chickpeas (4 cups), rinsed and drained

Interestingly, in these seemingly endless variations there is a hidden structure. Each line contains, in order -

  • amount/quantity (optional). ex - 12
  • unit (optional). ex - cups
  • Pre-modifiers (optional). ex - grated
  • food item. ex - Parmesan cheese
  • Post-modifiers (optional). ex - cored, peeled
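To make the five slots concrete, here is a minimal sketch of a container for one parsed line. The ParsedIngredient name and its field names are hypothetical (they do not come from the actual parser), and the records are filled in by hand purely to show which token is expected to land where.

```ruby
# Hypothetical container for the five slots listed above; the names are
# illustrative and not taken from the real IngredientParser.
ParsedIngredient = Struct.new(
  :quantity, :unit, :pre_modifiers, :food_item, :post_modifiers,
  keyword_init: true
)

# Hand-filled records (not parser output) for three of the sample lines.
EXAMPLES = {
  "12 cups lettuce" => ParsedIngredient.new(
    quantity: "12", unit: "cups", pre_modifiers: [],
    food_item: "lettuce", post_modifiers: []
  ),
  "apple, cored, peeled" => ParsedIngredient.new(
    quantity: nil, unit: nil, pre_modifiers: [],
    food_item: "apple", post_modifiers: %w[cored peeled]
  ),
  "pepper" => ParsedIngredient.new(
    quantity: nil, unit: nil, pre_modifiers: [],
    food_item: "pepper", post_modifiers: []
  )
}

EXAMPLES.each do |line, parsed|
  puts "#{line.inspect} -> #{parsed.to_h}"
end
```

Only the food item is mandatory; every other slot may be nil or empty, which is what makes a line like "pepper" valid.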

Normally I'm more of a GDD (Guilt-Driven Developer) than a TDD practitioner. (I've already discussed the role of TDD and where it makes sense. It doesn't make sense when you are at an early stage of your startup or researching/inventing something new, because no requirements are baked yet. It makes perfect sense for big companies (behaving agile) and agencies, where you can focus on code quality rather than on discovering a market or exploring R&D options.) But this is a classic case for writing specs before embarking on coding the extractor.

describe IngredientParser do
  it "should parse - <12 cups lettuce>" do
  end
  …
end

The entire list can be found here - https://gist.github.com/dbose/73f30d65efa8f29fce16

As for the implementation, a naive hand-coded regex would be too unwieldy for extracting this latent structure. Among rule-based systems, a parser is a nice way to manage this insane task. An NLP-based NP chunker (a POS tagger with regexes thrown at it) would be a nice fit as well, but I couldn't find one written in pure Ruby without resorting to JRuby-based implementations like OpenNLP Ruby - https://github.com/louismullie/open-nlp.

Instead I decided to write the grammar for a custom parser using one of the widely used parser libraries, Treetop (https://github.com/nathansobo/treetop) or Citrus (https://github.com/mjijackson/citrus). In the end I chose Citrus, without any specific bias.

grammar Ingredient
    rule ingredient_line
        qualifier* quantity* (space|'-')* unit* additional_unit* unit_suffix* pre_modifiers* base_ingredient* post_modifier
    end
end

The cool part of Citrus is that the whole "grammar" is a souped-up Ruby module and each rule is a souped-up method in that module. Just like real modules, rules can be inherited from a base grammar module. The first rule is the "main" rule, which can be composed of many sub-rules. Let's look at how I coded the rule for quantity -

rule quantity  
    # captures 
    #
    # => 1 1/2 tbsp...
    # => 4 to 6 pitas
    # => 4 - 6 oz fresh mozzarella
    # => 12-to 14-pounds tomatoes
    # => 4-6 pitas
    # => 3- to 3-1/2-pound
    (numeric space* (quantity_fraction_pure | quantity_range)*)1*
end  

Let's take "3- to 3-1/2-pound" as an example. The leading 3 matches the numeric rule -

rule numeric  
    float | quantity_fraction | number | numeric_words
end  

space is optional (the * suffix means zero or more occurrences). The "-" that follows doesn't match the quantity_fraction_pure rule, so let's explore the quantity_range rule. Quantities are often given as a range, and this rule should tame them all!

rule quantity_range  
  ('-to' | 'to' | '-' ) space ('to' space)* numeric ('-')*
end  

It matches "- to 3-1/2". The latter part of the range, 3-1/2, is a fraction and is matched by the quantity_fraction rule -

rule quantity_fraction  
  (number quantity_suffix number quantity_suffix* number*)
end  

quantity_suffix is a terminal rule, which is atomic (meaning it has no further composition) -

rule quantity_suffix  
  '-' | '/'
end  

Citrus is good, but it could be better. Compared with pyparsing (http://pyparsing.wikispaces.com/), one important feature it lacks is the ability to hook custom functions into the parse stream. Such hooks would open up interesting possibilities, for example -

rule pre_modifier  
  lematized('peeled')
end  

This is not exactly sexy compared to an ML-based approach, but it fits the limited vocabulary of recipe ingredients.