BeyondCalories Diaries : Ingredient Parsing

One of the important phases of successfully extracting nutritional information from free-text recipes is the extraction step itself: pulling the quantity, unit and food item out of each ingredient line.
Initially I thought about building a full-blown, domain-specific, ML (Machine Learning) based IE (Information Extraction) system, but to keep things simple I chose a regex (Regular Expression) based approach. Another motivation for choosing a regex-based extractor was the limited vocabulary of ingredients.

Once I decided on regex, it became obvious that ingredients come in far too many shapes and sizes.

  • 12 cups lettuce
  • 14 large, fresh eggs
  • apple, cored, peeled
  • 1-2 tsp. fresh lime juice
  • salt to taste
  • pepper
  • 1/4 cup plus 2 tablespoons grated Parmesan cheese
  • Two 15-ounce cans chickpeas (4 cups), rinsed and drained

Interestingly, there is a hidden structure in these seemingly endless variations. Each line contains, in order -

  • amount/quantity (optional). ex - 12
  • unit (optional). ex - cups
  • Pre-modifiers (optional). ex - grated
  • food item. ex - Parmesan cheese
  • Post-modifiers (optional). ex - cored, peeled
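
To make that ordering concrete, here is a minimal sketch of the structure as a single Ruby regex with named captures. The UNITS and MODIFIERS word lists and the split_ingredient helper are hypothetical, made up purely for illustration; this is not the extractor I ended up with.

```ruby
# Hypothetical, deliberately tiny vocabularies; real lists would be much longer.
UNITS     = %w[cups cup tablespoons tablespoon tbsp tsp ounces ounce oz].freeze
MODIFIERS = %w[large fresh grated chopped].freeze

# One optional slot per field, in the order listed above.
INGREDIENT_LINE = %r{
  \A\s*
  (?<quantity>\d+(?:\s*[-/]\s*\d+)*)?             # optional amount: 12, 1-2, 1/4
  \s*
  (?<unit>(?:#{UNITS.join('|')})\b\.?)?           # optional unit: cups, tsp.
  \s*
  (?<pre>(?:(?:#{MODIFIERS.join('|')})\b,?\s*)*)  # optional pre-modifiers
  (?<food>[^,]+?)                                 # food item, up to the first comma
  \s*
  (?:,\s*(?<post>.+))?                            # optional post-modifiers
  \z
}x

# Returns a hash of the five fields, or nil when the line does not match.
def split_ingredient(line)
  m = INGREDIENT_LINE.match(line)
  return nil unless m

  {
    quantity: m[:quantity],
    unit: m[:unit],
    pre_modifiers: m[:pre].split(/[,\s]+/),
    food: m[:food].strip,
    post_modifiers: m[:post].to_s.split(/\s*,\s*/)
  }
end

p split_ingredient("14 large, fresh eggs")
p split_ingredient("apple, cored, peeled")
```

This holds up on the simple lines, but "1/4 cup plus 2 tablespoons grated Parmesan cheese" ends up with everything after "cup" in the food slot, which is exactly the kind of case a single regex struggles with.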

Normally I'm more of a GDD (Guilt-Driven Developer) than a TDD (Test-Driven Development) person. I've already discussed the role of TDD and where it makes sense: it doesn't make any sense when you are at an early stage of your startup or researching/inventing something new, because there are no requirements baked yet. On the contrary, it makes perfect sense for big companies (behaving agile) and agencies, where you just have to focus on code quality rather than discovering a market or exploring R&D options. But this is a classic case for writing the specs before embarking on coding the extractor.

describe IngredientParser do
  it "should parse - <12 cups lettuce>" do
  end
  …
end

The entire list can be found here - https://gist.github.com/dbose/73f30d65efa8f29fce16

As far as the implementation goes, a naive hand-coded regex would be too unwieldy to extract the latent structure. As far as rule-based systems go, a parser would be a nice way to manage this insane task. An NLP-based NP chunker (a POS tagger with regexes thrown at it) would be a nice fit as well, but I couldn't find one written in pure Ruby without resorting to JRuby-based implementations like OpenNLP Ruby - https://github.com/louismullie/open-nlp.

Instead I decided to write the grammar for a custom parser, using one of the widely used libraries: TreeTop (https://github.com/nathansobo/treetop) or Citrus (https://github.com/mjijackson/citrus). In the end I chose Citrus, without any specific bias.

grammar Ingredient
	rule ingredient_line
		qualifier* quantity* (space|'-')* unit* additional_unit* unit_suffix* pre_modifiers* base_ingredient* post_modifier
	end
end

The cool part of Citrus is that the whole "grammar" is a souped-up Ruby module and each rule is a souped-up method in that module. Just like real modules, rules can be inherited from a base grammar module. The first rule is the "main" rule, which can be composed of many sub-rules. Let's look at how I coded the rule for quantity -

rule quantity
	# captures 
	#
	# => 1 1/2 tbsp...
	# => 4 to 6 pitas
	# => 4 - 6 oz fresh mozzarella
	# => 12-to 14-pounds tomatoes
	# => 4-6 pitas
	# => 3- to 3-1/2-pound
	(numeric space* (quantity_fraction_pure | quantity_range)*)1*
end

Let's take "3- to 3-1/2-pound" as an example. The leading 3 matches the numeric rule -

rule numeric
	float | quantity_fraction | number | numeric_words
end

space is optional (the * symbol denotes zero or more occurrences). The subsequent - doesn't match the quantity_fraction_pure rule, so let's explore the quantity_range rule. Quantities are often mentioned in a range-like fashion, and this rule should tame them all!

rule quantity_range
  ('-to' | 'to' | '-' ) space ('to' space)* numeric ('-')*
end

It matches "- to 3-1/2". The latter part of the range, 3-1/2, is a fraction and is matched by the quantity_fraction rule -

rule quantity_fraction
  (number quantity_suffix number quantity_suffix* number*)
end

quantity_suffix is a terminal rule, which is atomic (meaning it has no further composition) -

rule quantity_suffix
  '-' | '/'
end
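
The grammar only recognises the quantity text; turning that text into numbers is a separate step. Below is a rough sketch of that step in plain Ruby, assuming "3-1/2" and "1 1/2" are read as mixed numbers (three and a half, one and a half). The to_rational and quantity_range helpers are hypothetical names and are not part of Citrus.

```ruby
# Convert one number such as "12", "1/4", "3-1/2" or "1 1/2" into a Rational.
# Assumption: "3-1/2" means three and a half, not a range.
def to_rational(text)
  token = text.strip.chomp('-') # tolerate the trailing dash of "14-pounds"
  case token
  when %r{\A(\d+)[-\s]+(\d+)/(\d+)\z} # mixed number: 3-1/2, 1 1/2
    Rational($1.to_i) + Rational($2.to_i, $3.to_i)
  when %r{\A(\d+)/(\d+)\z}            # plain fraction: 1/4
    Rational($1.to_i, $2.to_i)
  when /\A\d+\z/                      # whole number: 12
    Rational(token.to_i)
  else
    raise ArgumentError, "not a quantity: #{text.inspect}"
  end
end

# Split a quantity into [low, high]. A single value gives [value, value].
def quantity_range(text)
  stripped = text.strip
  # "4 to 6", "12-to 14", "3- to 3-1/2" and the spaced dash in "4 - 6"
  parts = stripped.split(/\s*-?\s*\bto\b\s*|\s+-\s+/, 2)
  # A bare "4-6" has no slash, so it is treated as a range, not a mixed number.
  parts = stripped.split('-', 2) if parts.size == 1 && stripped =~ /\A\d+-\d+\z/
  [to_rational(parts.first), to_rational(parts.last)]
end

p quantity_range("3- to 3-1/2")
p quantity_range("4-6")
p quantity_range("1 1/2")
```

Rational keeps the fractions exact, which seemed a safer default than floats when the amounts are later scaled.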

Citrus is good but could be better. To compete with pyparsing (http://pyparsing.wikispaces.com/), Citrus is missing one important feature: the ability to hook a custom function into the parser stream. Such a hook would open up interesting possibilities, for example -

rule pre_modifier
  lematized('peeled')
end
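
Since Citrus offers no such hook, the idea can only be sketched outside the grammar. The snippet below is a plain-Ruby illustration of what a lematized matcher might do, assuming a very naive suffix-stripping lemma function as a stand-in for a real lemmatizer. Both lemma and this version of lematized are hypothetical.

```ruby
# Naive stand-in for a lemmatizer: lower-case the word and strip one
# common English suffix. It handles peel/peels/peeled/peeling but not
# pairs like grate/grated, where the lemma itself ends in "e".
def lemma(word)
  word.downcase.sub(/(ing|ed|es|s)\z/, '')
end

# Hypothetical hook: returns a predicate that accepts any token whose
# naive lemma equals the naive lemma of +word+.
def lematized(word)
  target = lemma(word)
  ->(token) { lemma(token) == target }
end

peeled = lematized('peeled')
puts peeled.call('peeling')
puts peeled.call('Peels')
puts peeled.call('cored')
```

With a hook like this, one rule could cover every inflection of a modifier instead of listing each form by hand.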

This is not exactly sexy compared to an ML-based approach, but it fits the limited vocabulary of recipe ingredients.