
[ Beautifulsoup find element by text using `find_all` no matter if there are elements in it ]

For example

bs = BeautifulSoup("<html><a>sometext</a></html>")
print bs.find_all("a",text=re.compile(r"some"))

returns [<a>sometext</a>], but when the element being searched for has a child, e.g. an img:

bs = BeautifulSoup("<html><a>sometext<img /></a></html>")
print bs.find_all("a",text=re.compile(r"some"))

it returns []

Is there a way to use find_all to match the latter example?

Answer 1

You will need to use a hybrid approach since text= will fail when an element has child elements as well as text.

import re
from bs4 import BeautifulSoup

bs = BeautifulSoup("<html><a>sometext<img /></a></html>")
reg = re.compile(r'some')
elements = [e for e in bs.find_all('a') if reg.search(e.text)]
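
If you would rather keep everything inside a single find_all call, one option is to pass a function instead of text=. This is only a sketch: it is written for Python 3, it assumes the standard-library "html.parser" backend, and the helper name is_matching_link and the extra second link are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup: one link with a child <img>, one without a match.
html = "<html><a>sometext<img /></a><a>other</a></html>"
soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"some")

def is_matching_link(tag):
    # find_all calls this once per tag. get_text() joins every string
    # inside the tag, so child elements such as <img> do not get in the way.
    return tag.name == "a" and pattern.search(tag.get_text()) is not None

matches = soup.find_all(is_matching_link)
print(matches)
```

The function filter looks at the combined text of the tag rather than at .string, so it behaves like the list comprehension above while still going through find_all.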


When BeautifulSoup searches for an element and a text filter (here, a compiled regex) is given, it eventually calls:

self._matches(found.string, self.text)

In the two examples you gave, the .string property returns different things:

>>> bs1 = BeautifulSoup("<html><a>sometext</a></html>")
>>> bs1.find('a').string
u'sometext'
>>> bs2 = BeautifulSoup("<html><a>sometext<img /></a></html>")
>>> bs2.find('a').string
>>> print bs2.find('a').string
None

The .string property is implemented like this:

@property
def string(self):
    """Convenience property to get the single string within this tag.

    :Return: If this tag has a single string child, return value
     is that string. If this tag has no children, or more than one
     child, return value is None. If this tag has one child tag,
     return value is the 'string' attribute of the child tag,
     recursively.
    """
    if len(self.contents) != 1:
        return None
    child = self.contents[0]
    if isinstance(child, NavigableString):
        return child
    return child.string

If we print out the contents we can see why this returns None:

>>> print bs1.find('a').contents
[u'sometext']
>>> print bs2.find('a').contents
[u'sometext', <img/>]
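
To see the difference side by side, here is a small Python 3 sketch (assuming the "html.parser" backend; the variable name link is just for illustration) that compares .contents, .string and get_text() on the tag with the child img:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><a>sometext<img /></a></html>", "html.parser")
link = soup.find("a")

print(len(link.contents))  # 2: the string 'sometext' and the <img> tag
print(link.string)         # None, because the tag has more than one child
print(link.get_text())     # sometext
```

Since get_text() (and .text) collects every string under the tag, it still gives the regex something to match even when .string is None.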