
[ Optional regex groups for comic book titles (python) ]

I am trying to extract relevant information from multiple comic book titles with varying content. However, there are only about 5 or 6 different patterns that are followed:

Examples are:

Green Lantern #21

Green Lantern #21 (Variant Cover Edition)

Dejah Thoris & Green Men Of Mars #4 (of 8)

Dejah Thoris & Green Men Of Mars #4 (of 8) (Variant Cover Edition)

Macabre One Shot

Detective Comics #21 Combo Pack

I want to capture in groups:

  1. Title (the only REQUIRED group)
  2. Issue Number
  3. Total number of issues, e.g. (of 8)
  4. All other information, e.g. (Variant Cover Edition) or 'Combo Pack'

I have the beginnings of a regex search string but am having trouble making the groups reliably optional.


It is definitely not complete. Any help anyone could give me would be greatly appreciated.

Thanks in advance!!!

Answer 1

The problem with optional groups is that the regex engine does not really look for them; it only checks whether they are present at the current position, wherever the processing so far has led it.

Using ([^#]+) to capture the title puts the engine at the right position to match the issue number if it's present. If you don't want whitespace at the end of the title, use ([^#]*[^#\s])\s* instead.
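A small sketch of the difference, using one of the titles from the question. The first pattern is deliberately naive and made up for illustration; the variable names are hypothetical too.

```python
import re

title = 'Green Lantern #21'

# Illustrative pattern only: \w+ stops at the first space, so the optional
# group is tried right after "Green ", finds no '#', and matches nothing.
naive = re.match(r'(\w+)\s*(?:#(\d+))?', title)
print(naive.groups())    # ('Green', None)

# ([^#]+) runs up to the '#', but keeps the trailing space in the title.
greedy = re.match(r'([^#]+)(?:#(\d+))?', title)
print(greedy.groups())   # ('Green Lantern ', '21')

# ([^#]*[^#\s])\s* forces the title to end on a non-space character.
trimmed = re.match(r'([^#]*[^#\s])\s*(?:#(\d+))?', title)
print(trimmed.groups())  # ('Green Lantern', '21')
```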

import re

strings = ['Green Lantern #21', 
    'Green Lantern #21 (Variant Cover Edition)', 
    'Dejah Thoris & Green Men Of Mars #4 (of 8)', 
    'Dejah Thoris & Green Men Of Mars #4 (of 8) (Variant Cover Edition)', 
    'Macabre One Shot', 
    'Detective Comics #21 Combo Pack']

# Groups: title, issue number, total number of issues, other information.
for s in strings:
    print(re.match(r'([^#]*[^#\s])\s*(?:#(\d+)\s*)?(?:\(of (\d+)\)\s*)?(.+)?', s).groups())


('Green Lantern', '21', None, None)
('Green Lantern', '21', None, '(Variant Cover Edition)')
('Dejah Thoris & Green Men Of Mars', '4', '8', None)
('Dejah Thoris & Green Men Of Mars', '4', '8', '(Variant Cover Edition)')
('Macabre One Shot', None, None, None)
('Detective Comics', '21', None, 'Combo Pack')
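
If the tuples are awkward to work with, the same pattern could be wrapped in a small helper. This is only a sketch: `parse_title` and the dictionary keys are names invented here, and converting the numbers to `int` is an assumption about what is wanted downstream.

```python
import re

# Same pattern as in the loop above, compiled once.
TITLE_RE = re.compile(r'([^#]*[^#\s])\s*(?:#(\d+)\s*)?(?:\(of (\d+)\)\s*)?(.+)?')

def parse_title(s):
    """Return the four captured fields as a dict, or None if nothing matches."""
    m = TITLE_RE.match(s)
    if m is None:
        return None
    title, issue, total, other = m.groups()
    return {
        'title': title,
        # Optional groups that did not take part in the match come back as None.
        'issue': int(issue) if issue is not None else None,
        'total': int(total) if total is not None else None,
        'other': other,
    }

print(parse_title('Dejah Thoris & Green Men Of Mars #4 (of 8) (Variant Cover Edition)'))
```

An empty string gives None, because the title group needs at least one character that is neither '#' nor whitespace.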

Answer 2

You can try this regex:

^(?P<name>.+?)(\s+(?P<issue_number>#\d+))?(\s+(?P<issues>\(of\s*\d+\)))?(\s+(?P<other>\(Variant Cover Edition\)|Combo Pack))?$


^  # beginning of string
(?P<name>.+?)   # Captures the name
(\s+(?P<issue_number>#\d+))?   # captures the issue number optionally
(\s+(?P<issues>\(of\s*\d+\)))?   # captures the number of issues optionally
(\s+(?P<other>\(Variant Cover Edition\)|Combo Pack))?   # captures other info optionally
$ # end of string
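
The commented layout above is close to what Python's re.VERBOSE flag accepts, with two catches: in verbose mode an unescaped # starts a comment and unescaped spaces in the pattern are ignored, so both have to be escaped. A sketch under those assumptions (`TITLE_RE` is just an illustrative name):

```python
import re

TITLE_RE = re.compile(r'''
    ^                                   # beginning of string
    (?P<name>.+?)                       # title, as short as possible
    (\s+(?P<issue_number>\#\d+))?       # optional issue number, hash escaped
    (\s+(?P<issues>\(of\s*\d+\)))?      # optional total number of issues
    (\s+(?P<other>\(Variant\ Cover\ Edition\)|Combo\ Pack))?  # optional extras
    $                                   # end of string
''', re.VERBOSE)

m = TITLE_RE.match('Dejah Thoris & Green Men Of Mars #4 (of 8) (Variant Cover Edition)')
print(m.groupdict())
```

Note that with this regex the issue_number and issues groups keep the surrounding punctuation ('#4' and '(of 8)'), unlike Answer 1, which captures only the digits.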

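If your input contains multiple titles, ^ and $ must no longer be tied to the whole string. Be careful with simply removing them, though: without an anchor the lazy .+? is satisfied by a single character, so with one title per line it is safer to keep them and pass the re.MULTILINE flag.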
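
A short sketch of the multi-title case, assuming one title per line. The `text` sample is put together from the question's examples, and the variable names are illustrative.

```python
import re

pattern = (r'^(?P<name>.+?)(\s+(?P<issue_number>#\d+))?'
           r'(\s+(?P<issues>\(of\s*\d+\)))?'
           r'(\s+(?P<other>\(Variant Cover Edition\)|Combo Pack))?$')

text = 'Green Lantern #21\nMacabre One Shot\nDetective Comics #21 Combo Pack'

# Without the anchors the lazy name group is satisfied by one character.
unanchored = re.search(pattern.strip('^$'), text)
print(unanchored.group('name'))  # G

# With re.MULTILINE, ^ and $ also match at the start and end of each line.
names = [m.group('name') for m in re.finditer(pattern, text, re.MULTILINE)]
print(names)  # ['Green Lantern', 'Macabre One Shot', 'Detective Comics']
```

One thing to keep in mind: \s+ also matches a newline, so an optional group could in principle reach into the following line if that line happened to start with something like '#21' or 'Combo Pack'.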