
Introduction

Converting text into a uniform case is a good starting point for any type of text processing.

In this article, we'll show you how to convert text to lowercase using one of Python's built-in string methods, str.lower().

From a top-level view, the process is achieved through:

exampleString = "AbCdEf_1@3$"
lowercaseString = exampleString.lower()
print(lowercaseString) # abcdef_1@3$

However, especially if you're new to Python, read on. We'll also discuss one alternative approach for converting strings to lowercase, so that you have a comprehensive overview of the subject. After reading the article, you'll be able to convert any string to lowercase, and you'll know when to simply use the str.lower() method and when to choose the alternative approach instead.

How to Convert String to Lowercase in Python

Converting strings to lowercase is straightforward in Python. str.lower() is the built-in method made specifically for that purpose. It is defined on str, Python's built-in string type.

Note: Every Python built-in type has a set of methods designed to perform operations on that specific type. For example, the str type has predefined methods for removing leading and trailing whitespace, finding and replacing substrings, splitting strings into lists, and so on. One of those methods is str.lower().
Every method defined for the string type carries the str prefix in the documentation, which indicates that it is called on string instances.

The str.lower() method returns a lowercase copy of the string on which it is called. Because Python strings are immutable, the original string is left unchanged and remains available for later use. Now, let's see how to convert a string to lowercase in Python.

Let's assume that you have some string that you want to convert:

exampleString = "AbCdEf_1@3$"

As you can see, this string has both lowercase and uppercase letters, as well as some digits and special characters. Only the letters can be converted into lowercase, so you would expect the lowercase version of this string to be "abcdef_1@3$":

lowercaseString = exampleString.lower()
print(lowercaseString) # abcdef_1@3$

After calling the str.lower() method on exampleString, its lowercase copy is stored as a new object, referenced by lowercaseString. To make sure that str.lower() produces the correct output, let's compare lowercaseString to the expected lowercase version of exampleString:

if lowercaseString == "abcdef_1@3$":
    print("Lowercase string EQUAL to expected string!")
else:
    print("Lowercase string NOT EQUAL to expected string!")

This piece of code will output:

Lowercase string EQUAL to expected string!

Awesome!
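The point about str.lower() returning a copy can be checked directly. The following is a minimal sketch; the sample text and variable names are made up for illustration:

```python
# str.lower() returns a new string; the original is not modified.
original = "Hello, World!"
lowered = original.lower()

print(original)  # Hello, World!
print(lowered)   # hello, world!
```

Since the original string is untouched, you can keep using it for display while using the lowercase copy for comparisons.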

Note: The opposite of str.lower() is str.upper(), which is used in the same way. You can also check whether a string is all-lowercase or all-uppercase with str.islower() or str.isupper().
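A short sketch of these related methods, using an arbitrary sample string chosen for illustration:

```python
text = "Python Rocks"

print(text.upper())            # PYTHON ROCKS
print(text.lower())            # python rocks
print(text.islower())          # False, the string contains uppercase letters
print(text.lower().islower())  # True
print(text.upper().isupper())  # True
print("123".islower())         # False, there are no cased characters at all
```

Note that str.islower() and str.isupper() return False for strings without any cased characters, so they are not simple negations of each other.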

Why Use str.casefold() Instead of str.lower()

The previous method is suitable for most use cases. It does what it is supposed to do by following a few simple rules. Starting with Python 3.0, strings are sequences of Unicode code points, which lets str.lower() map every uppercase letter to its corresponding lowercase letter. That principle works fine in almost all use cases, but there are some instances where you should consider using the str.casefold() method instead. For example, when implementing caseless matching of two strings, str.casefold() is the way to go. Since Python uses Unicode to represent strings, the rules defined in the Unicode Standard apply to Python as well. In section 3.13, the Standard states the following:

A string X is a caseless match for a string Y if and only if:
toCasefold(X) = toCasefold(Y)

Because str.casefold() is the Python implementation of the Unicode toCasefold() operation, you should use it when implementing caseless matching.

Note: Both X.casefold() and toCasefold(X) map each character of the string X to its casefolded counterpart, as defined in the CaseFolding.txt file of the Unicode Character Database.

To illustrate the difference between str.lower() and str.casefold(), let's take a look at the German letter "ß", a lowercase letter that is equivalent to "ss". That means the following strings are supposed to be an exact caseless match:

A = "ßaBcß"
B = "ssAbCss"

But if you try to compare them using the str.lower() method, you won't get the expected result:

Al = A.lower()
Bl = B.lower()

print(Al == Bl)
# Output: False

This comparison will produce the False value, meaning that A.lower() is not equal to B.lower(). That is because the "ß" is already a lowercase letter, so the str.lower() method won't change it. Therefore, Al and Bl have the following values:

Al = "ßabcß"
Bl = "ssabcss"

Obviously, Al is not equal to Bl, so the previous comparison must produce False. To get the expected behavior, use the str.casefold() method. It is more aggressive than str.lower() because it is designed to remove all case distinctions in a string. Therefore, "ß" is replaced by "ss", and strings A and B become a caseless match:

Ac = A.casefold()
# Ac = "ssabcss"

Bc = B.casefold()
# Bc = "ssabcss"

Now, if you compare casefolded strings A and B, you'll get the expected result, the same as defined in the Unicode Standard:

print(Ac == Bc)
# Output: True
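
The comparison above can be wrapped in a small helper. This is a minimal sketch; the function name caseless_equal is a hypothetical one chosen for this example, not part of the standard library:

```python
def caseless_equal(left: str, right: str) -> bool:
    """Default caseless match: compare the casefolded forms of both strings."""
    return left.casefold() == right.casefold()


print(caseless_equal("ßaBcß", "ssAbCss"))     # True
print(caseless_equal("Straße", "STRASSE"))    # True
print("Straße".lower() == "STRASSE".lower())  # False, lower() keeps the "ß"
```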

Note: The type of caseless matching shown here is called default caseless matching, which is the most basic type of caseless matching defined by the Unicode Standard.
There are three more types of caseless matching defined in the Unicode Standard: canonical, compatibility, and identifier caseless matching. Each of them adds one or more steps to improve the correctness of the matching in more specific use cases. Those additional steps usually consist of normalizing the strings during matching, which default caseless matching does not do.
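
As an illustration of the normalization step, canonical caseless matching can be sketched with the standard unicodedata module by applying NFD normalization before and after casefolding. The function name canonical_caseless_equal and the sample strings are assumptions made for this example:

```python
import unicodedata


def canonical_caseless_equal(left: str, right: str) -> bool:
    """Sketch of canonical caseless matching: NFD, casefold, then NFD again."""
    def nfd(text: str) -> str:
        return unicodedata.normalize("NFD", text)

    return nfd(nfd(left).casefold()) == nfd(nfd(right).casefold())


precomposed = "\u00ea"  # "ê" as a single code point
decomposed = "E\u0302"  # "E" followed by a combining circumflex accent

print(precomposed.casefold() == decomposed.casefold())        # False
print(canonical_caseless_equal(precomposed, decomposed))      # True
```

Default caseless matching treats the two spellings as different because their code points differ; normalizing first makes them comparable.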

Problems with str.casefold()

Even though str.casefold() is a built-in Python method intended to implement the toCasefold() operation from the Unicode Standard, you shouldn't use it carelessly. There are some edge cases where it won't produce the desired result. For example, the Unicode Standard defines the casefolded (lowercase) version of the capital letter I as i, which is in line with its use in most languages. But that mapping doesn't work for Turkish. The Turkish language has two variants of the letter I, each with its own lowercase counterpart:

  • dotless uppercase I (identical to the usual uppercase letter I)
    • with its lowercase variant - ı
  • dotted uppercase İ
    • with its lowercase variant - i (identical to the usual lowercase letter i)

Consequently, the default Unicode mapping doesn't work for Turkish. Because of that, the Unicode Standard defines two casefolding mappings: one for Turkish and one for non-Turkish languages. The Turkish variant takes the nuances mentioned above into account, while the non-Turkish variant maps the uppercase I to its usual lowercase counterpart i. str.casefold() uses only the default (non-Turkish) casefold mapping, so it can't perform caseless matching correctly for some Turkish words.

Note: For this reason, str.casefold() is said not to pass the Turkish test!

For example, the following strings are supposed to be a caseless match in Turkish:

str1 = "Iabcİ"
str2 = "ıabci"

But, in practice, the usual comparison will yield the False value:

print(str1.casefold() == str2.casefold())
# "iabci\u0307" == "ıabci"
# Outputs: False

Note that str.casefold() folded I to i and İ to i followed by a combining dot above (U+0307), which follows the default (non-Turkish) casefold mapping of the Unicode Standard. The Turkish mapping would instead fold I to ı and İ to i, which would make the two strings match. This example illustrates a case where str.casefold() produces an incorrect caseless-matching result for a specific language. Therefore, you should pay attention to the specifics of the language you are working with.
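
One possible workaround is to apply the two Turkish-specific mappings by hand before calling str.casefold(). The sketch below assumes the text is known to be Turkish and that İ appears as the single code point U+0130; the helper name turkish_casefold is hypothetical and not part of Python:

```python
def turkish_casefold(text: str) -> str:
    """Hypothetical helper: apply the Turkish I mappings, then casefold."""
    # Turkish-specific mappings: İ (U+0130) -> i, and I -> ı (U+0131).
    text = text.replace("\u0130", "i").replace("I", "\u0131")
    return text.casefold()


str1 = "Iabc\u0130"  # "Iabcİ"
str2 = "\u0131abci"  # "ıabci"

print(str1.casefold() == str2.casefold())                # False
print(turkish_casefold(str1) == turkish_casefold(str2))  # True
```

This should only be used for text that is known to be Turkish, since it would fold an ordinary English "I" to "ı" as well.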

Conclusion

After reading this guide, you should understand the most common way to convert a string to lowercase in Python, as well as the alternative approach. We've briefly covered the str.lower() method and then dove into the details of the str.casefold() method. We've covered its basic use cases, compared it to str.lower(), and explained the basic concepts and standards surrounding the Python implementation of the casefolding algorithm. Finally, we've discussed some problematic use cases so that you are aware of the undesired results that str.casefold() can produce.

Reference: stackabuse.com
