TAGS :Viewed: 15 - Published at: a few seconds ago

[ Changing text of a column in a pandas dataframe ]

I am finding this seemingly simple operation rather tough. I have a data frame that has a column called CompanyId. Its values are 'COMP23', 'COMP55',..etc. Now, when I want to remove the prefix 'COMP' and make it numeric, it beats me. This is what I am doing:

df['companyId'] = df['companyId'].astype('str') # because type was 'object'.

df['companyId'].map(lambda x: int(x[4:]))

Where am I wrong? I noticed that the df was a series object.

Answer 1


Try:

df['companyId'] = df['companyId'].map(lambda x: int(str(x)[4:]))

Answer 2


You can use a regex pattern to extract all digits (\d+).

>>> df.CompanyId.str.extract(r'(\d+)')
0    23
1    55
Name: CompanyId, dtype: object

Note that your original method works fine.

>>> df['CompanyId'].astype('str').map(lambda x: int(x[4:]))
0    23
1    55
Name: CompanyId, dtype: int64

If there is an error, perhaps it is because there is an issue with the data.

df =  pd.DataFrame({'CompanyId': ['COMP23', 'COMP55', 'COMP', '', 'COM55']})    
df['CompanyId'].astype('str').map(lambda x: int(x[4:]))

ValueError: invalid literal for int() with base 10: ''

Note that the regex pattern still extracts the correct values:

>>> df.CompanyId.str.extract(r'(\d+)')
0     23
1     55
2    NaN
3    NaN
4     55  

Answer 3


try this:

In [210]: df['companyId'].str.replace('COMP','').astype(int)
Out[210]:
0     23
1     55
2    101
Name: companyId, dtype: int32

or

In [207]: df.companyId.str[4:].astype(int)
Out[207]:
0     23
1     55
2    101
Name: companyId, dtype: int32