[ Beautifulsoup find element by text using `find_all` no matter if there are elements in it ]
For example
bs = BeautifulSoup("<html><a>sometext</a></html>")
print bs.find_all("a",text=re.compile(r"some"))
returns [<a>sometext</a>]
but when the element being searched for has a child, e.g. an img:
bs = BeautifulSoup("<html><a>sometext<img /></a></html>")
print bs.find_all("a",text=re.compile(r"some"))
it returns []
Is there a way to use find_all to match the latter example?
Answer 1
You will need to use a hybrid approach, since text= fails when an element has child elements as well as text.
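import re
from bs4 import BeautifulSoup

bs = BeautifulSoup("<html><a>sometext<img /></a></html>")
reg = re.compile(r'some')
# search() matches anywhere in the text, like the regex filter in find_all
elements = [e for e in bs.find_all('a') if reg.search(e.text)]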
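If you would rather keep everything inside find_all, it also accepts a function that receives each tag and returns True or False. The following is a minimal sketch of that idea, not the only way to do it; the helper name a_with_matching_text, the extra `<a>other</a>` element, and the explicit "html.parser" argument are assumptions added for illustration.

```python
import re
from bs4 import BeautifulSoup

html = "<html><a>sometext<img /></a><a>other</a></html>"
bs = BeautifulSoup(html, "html.parser")
reg = re.compile(r"some")

def a_with_matching_text(tag):
    # Hypothetical helper: get_text() joins every string under the tag,
    # so child tags such as <img> do not get in the way.
    return tag.name == "a" and reg.search(tag.get_text()) is not None

# find_all calls the function once per tag and keeps the tags it accepts.
matches = bs.find_all(a_with_matching_text)
print(matches)
```

This should pick up the first link and skip the second, because the test runs against the combined text of the tag rather than against .string.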
Background
When BeautifulSoup is searching for an element and a text filter is given, it eventually calls:
self._matches(found.string, self.text)
In the two examples you gave, the .string property returns different things:
>>> bs1 = BeautifulSoup("<html><a>sometext</a></html>")
>>> bs1.find('a').string
u'sometext'
>>> bs2 = BeautifulSoup("<html><a>sometext<img /></a></html>")
>>> bs2.find('a').string
>>> print bs2.find('a').string
None
The .string property looks like this:
@property
def string(self):
    """Convenience property to get the single string within this tag.

    :Return: If this tag has a single string child, return value
     is that string. If this tag has no children, or more than one
     child, return value is None. If this tag has one child tag,
     return value is the 'string' attribute of the child tag,
     recursively.
    """
    if len(self.contents) != 1:
        return None
    child = self.contents[0]
    if isinstance(child, NavigableString):
        return child
    return child.string
If we print out the contents, we can see why .string returns None for the second example:
>>> print bs1.find('a').contents
[u'sometext']
>>> print bs2.find('a').contents
[u'sometext', <img/>]
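Since the text is still there as a separate string child, another option is to search for the strings themselves and then walk up to the enclosing tag. This is only a sketch under a few assumptions: it uses the string= keyword (available in Beautiful Soup 4.4.0 and later; older releases spell it text=), the explicit "html.parser" argument, and a variable name links that is purely illustrative.

```python
import re
from bs4 import BeautifulSoup

bs = BeautifulSoup("<html><a>sometext<img /></a></html>", "html.parser")
reg = re.compile(r"some")

# With no tag name, find_all(string=...) returns the matching strings themselves.
strings = bs.find_all(string=reg)

# Walk up from each string to its enclosing <a>, if there is one.
links = [s.find_parent("a") for s in strings]
links = [a for a in links if a is not None]
print(links)
```

Note that this tests each string on its own, so a pattern that spans several child strings (for example text split by a nested tag) would not match here, whereas a check against the tag's combined text would.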