
[ Extracting words from text using python regex ]

I have a text (a string) and I want to do the following in Python:

I use CountVectorizer to build a bag of words. Its documentation is here: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

It includes stop-word removal and works fine: it strips punctuation and splits the text into words. But besides the words, it returns a lot of junk, such as single letters and numbers.

CountVectorizer does have a parameter called "token_pattern", though, which takes a string (a regex) and could give me better results.

What I want to do is:
a) exclude any words that start with, end with, or contain numbers
b) exclude any numbers from the text
c) exclude any words of 2 letters or fewer
d) exclude all the http links (URLs)

For example, given the text below, the regex should give me final_text:

text = "It can be dangerous to take Fido for a ride: http://t.co/eR2WfAnZBI http://t.co/RF3bhPNPwR',each year, on average, 20 billion empty miles are incurred by trucks, which costs the economy billions"

final_text = "can dangerous take Fido for ride each year average billion empty miles are incurred trucks which costs the economy billions"

Thanks in advance for your time and attention :)

Answer 1


Here is a regex that grabs any run of letters of length 3 or more.

[a-zA-Z]{3,}
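
One thing worth checking: on its own, this pattern also matches letter runs inside mixed tokens such as eR2WfAnZBI. A minimal sketch of the difference, assuming word boundaries (\b) are an acceptable way to require whole words; the sample string is made up for illustration:

```python
import re

# Hypothetical sample mixing plain words, a URL-like token, and numbers.
sample = "ride eR2WfAnZBI abc123 20 to trucks"

# Without boundaries, letter runs inside mixed tokens still match.
loose = re.findall(r"[a-zA-Z]{3,}", sample)

# With \b on both sides, the whole token must consist of letters only.
strict = re.findall(r"\b[a-zA-Z]{3,}\b", sample)

print(loose)   # ['ride', 'WfAnZBI', 'abc', 'trucks']
print(strict)  # ['ride', 'trucks']
```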

Here is a piece of regex that grabs any line without a URL in it.

^((?!(https?:\/\/)+([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w=?$#% \.-]*)).)*$

I haven't figured out how to combine the two yet, but at the very least this is a step in the right direction. You could put each word on its own line, then remove the URLs, then match words of 3 or more letters. Ugly, but it would work.
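
In plain Python the two steps can be combined by removing the URLs first and then collecting the words. This is only a sketch: the URL pattern is a simplified stand-in for the one above (an assumption: it stops at whitespace, quotes, and commas), and extract_words is a made-up helper name.

```python
import re

# Simplified URL pattern (assumption): scheme, then everything up to
# whitespace, a quote, or a comma.
URL_RE = re.compile(r"""https?://[^\s'",]+""")

# Whole words made of letters only, 3 or more characters long.
WORD_RE = re.compile(r"\b[a-zA-Z]{3,}\b")

def extract_words(s):
    """Drop URLs, then return the remaining letter-only words of length >= 3."""
    without_urls = URL_RE.sub(" ", s)
    return WORD_RE.findall(without_urls)

text = ("It can be dangerous to take Fido for a ride: http://t.co/eR2WfAnZBI "
        "http://t.co/RF3bhPNPwR',each year, on average, 20 billion empty miles "
        "are incurred by trucks, which costs the economy billions")

final_text = " ".join(extract_words(text))
print(final_text)
```

The word pattern could then be tried as token_pattern, with the URL removal done beforehand, for example through CountVectorizer's preprocessor argument. I have not tested that combination here.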

Answer 2


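I don't know Python, but regex syntax is largely the same across programming languages, so my answer is this pattern, which matches the parts to remove (the trailing g is the global flag):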

"(\s?\w+[0-9]+\w+\s?)|([0-9]+)|(\s\w\w\s)|(http://t.co/)"g
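
Python has no trailing g flag; re.sub replaces every match by default. A rough sketch of the same remove-what-you-don't-want idea, using a pattern adapted from the one above rather than copied (the adaptation and the name strip_unwanted are assumptions):

```python
import re

# Everything that should be removed, tried in this order at each position.
REMOVE_RE = re.compile(
    r"""https?://[^\s'",]+"""   # URLs, up to whitespace, a quote, or a comma
    r"""|\b\w*\d\w*\b"""        # tokens containing a digit, including plain numbers
    r"""|\b[a-zA-Z]{1,2}\b"""   # words of one or two letters
    r"""|[^\w\s]"""             # leftover punctuation
)

def strip_unwanted(s):
    # Replace every match with a space, then collapse runs of whitespace.
    return " ".join(REMOVE_RE.sub(" ", s).split())

text = ("It can be dangerous to take Fido for a ride: http://t.co/eR2WfAnZBI "
        "http://t.co/RF3bhPNPwR',each year, on average, 20 billion empty miles "
        "are incurred by trucks, which costs the economy billions")

cleaned = strip_unwanted(text)
print(cleaned)
```

The URL alternative comes first so that a whole link is consumed before the shorter alternatives can match pieces of it.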