[ Optional regex groups for comic book titles (python) ]

I am trying to extract relevant information from multiple comic book titles with varying content. However, there are only about 5 or 6 different patterns that are followed:

Examples are:

Green Lantern #21

Green Lantern #21 (Variant Cover Edition)

Dejah Thoris & Green Men Of Mars #4 (of 8)

Dejah Thoris & Green Men Of Mars #4 (of 8) (Variant Cover Edition)

Macabre One Shot

Detective Comics #21 Combo Pack

I want to capture in groups:

  1. Title (the only REQUIRED group)
  2. Issue Number
  3. Total number of issues, e.g. (of 8)
  4. All other information, e.g. (Variant Cover Edition) or 'Combo Pack'

I have the beginnings of a regex search string, but I am having trouble making the groups reliably optional:

(?P<name>.*?)\s*?(?P<issue_number>#\d*)\s*?(?P<info>.*)

It is definitely not complete. Any help anyone could give me would be greatly appreciated.

Thanks in advance!!!

Answer 1


The problem with optional groups is that the regex engine does not really look for them; it only checks whether they are present at the position that matching has already led it to.

Using ([^#]+) to capture the title puts the engine at the right position to match the issue number if it's present. If you don't want whitespace at the end of the title, use ([^#]*[^#\s])\s* instead.
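
To make the point about position concrete, here is a minimal sketch comparing three title patterns on one sample string. The first pattern is a hypothetical variant of the question's regex with the issue group made optional; the variable names are made up for illustration and are not part of the original answer.

```python
import re

sample = 'Green Lantern #21'

# Hypothetical variant of the question's pattern: lazy title, optional issue group.
# The lazy .*? matches nothing, and the optional group is skipped at position 0.
lazy = re.compile(r'(?P<name>.*?)\s*(?:#(?P<issue_number>\d+))?')
lazy_result = lazy.match(sample).groupdict()
print(lazy_result)      # {'name': '', 'issue_number': None}

# Greedy title that cannot cross a '#': the engine ends up right before the issue number.
greedy = re.compile(r'(?P<name>[^#]+)(?:#(?P<issue_number>\d+))?')
greedy_result = greedy.match(sample).groupdict()
print(greedy_result)    # {'name': 'Green Lantern ', 'issue_number': '21'}

# Same idea, but the title must end in a non-space character, so it comes out trimmed.
trimmed = re.compile(r'(?P<name>[^#]*[^#\s])\s*(?:#(?P<issue_number>\d+))?')
trimmed_result = trimmed.match(sample).groupdict()
print(trimmed_result)   # {'name': 'Green Lantern', 'issue_number': '21'}
```

The lazy version succeeds immediately with an empty title because nothing forces the engine to move forward; the character-class versions work because the title itself consumes everything up to the '#'.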

import re

strings = ['Green Lantern #21', 
    'Green Lantern #21 (Variant Cover Edition)', 
    'Dejah Thoris & Green Men Of Mars #4 (of 8)', 
    'Dejah Thoris & Green Men Of Mars #4 (of 8) (Variant Cover Edition)', 
    'Macabre One Shot', 
    'Detective Comics #21 Combo Pack']

for s in strings:
    print(re.match(r'([^#]*[^#\s])\s*(?:#(\d+)\s*)?(?:\(of (\d+)\)\s*)?(.+)?', s).groups())

prints

('Green Lantern', '21', None, None)
('Green Lantern', '21', None, '(Variant Cover Edition)')
('Dejah Thoris & Green Men Of Mars', '4', '8', None)
('Dejah Thoris & Green Men Of Mars', '4', '8', '(Variant Cover Edition)')
('Macabre One Shot', None, None, None)
('Detective Comics', '21', None, 'Combo Pack')
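
If named groups are preferred, as in the question, the same pattern can be wrapped in a small helper. This is only a sketch: the group names (title, issue, total, info) and the function name parse_title are assumptions chosen for illustration, not something the answer prescribes.

```python
import re

# Same structure as the pattern above, split over several lines and given
# named groups. The names are illustrative assumptions.
TITLE_RE = re.compile(
    r'(?P<title>[^#]*[^#\s])\s*'        # title, without trailing whitespace
    r'(?:#(?P<issue>\d+)\s*)?'          # optional issue number, digits only
    r'(?:\(of (?P<total>\d+)\)\s*)?'    # optional "(of N)" total
    r'(?P<info>.+)?'                    # anything left over
)

def parse_title(s):
    """Return a dict of the four groups, or None if there is no title at all."""
    m = TITLE_RE.match(s)
    return m.groupdict() if m else None

print(parse_title('Dejah Thoris & Green Men Of Mars #4 (of 8) (Variant Cover Edition)'))
print(parse_title('Macabre One Shot'))
```

Groups that are absent come back as None, so the caller can tell "no issue number" apart from an empty string.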

Answer 2


You can try this regex:

^(?P<name>.+?)(\s+(?P<issue_number>#\d+))?(\s+(?P<issues>\(of\s*\d+\)))?(\s+(?P<other>\(Variant Cover Edition\)|Combo Pack))?$

Explanation

^  # beginning of string
(?P<name>.+?)   # Captures the name
(\s+(?P<issue_number>#\d+))?   # captures the issue number optionally
(\s+(?P<issues>\(of\s*\d+\)))?   # captures the number of issues optionally
(\s+(?P<other>\(Variant Cover Edition\)|Combo Pack))?   # captures other info optionally
$ # end of string
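
A short sketch of how this pattern behaves on single titles may help. The helper name parse and the sample title with "(Foil Cover)" are made up for illustration. Two things are worth noting: the groups keep their punctuation ('#4', '(of 8)'), and trailing text that is not one of the two listed alternatives ends up inside the name.

```python
import re

ANSWER2_RE = re.compile(
    r'^(?P<name>.+?)'
    r'(\s+(?P<issue_number>#\d+))?'
    r'(\s+(?P<issues>\(of\s*\d+\)))?'
    r'(\s+(?P<other>\(Variant Cover Edition\)|Combo Pack))?$'
)

def parse(s):
    """Return (name, issue_number, issues, other), or None if nothing matches."""
    m = ANSWER2_RE.match(s)
    if m is None:
        return None
    return (m.group('name'), m.group('issue_number'),
            m.group('issues'), m.group('other'))

print(parse('Dejah Thoris & Green Men Of Mars #4 (of 8) (Variant Cover Edition)'))
# ('Dejah Thoris & Green Men Of Mars', '#4', '(of 8)', '(Variant Cover Edition)')

# Hypothetical title with extra info the pattern does not list:
print(parse('Green Lantern #21 (Foil Cover)'))
# ('Green Lantern #21 (Foil Cover)', None, None, None)
```

If the set of possible extras is open-ended, the last alternative would have to be loosened, for example to a generic "anything left over" group.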

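If your input contains several titles, one per line, keep ^ and $ and pass the re.MULTILINE flag so that they match at every line boundary. Simply removing the anchors does not work here: without $, the lazy .+? stops after a single character and every optional group is skipped.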