[ text file production with looped findings ]

I have a text file which contains 32 articles. I managed to find each article with the following code:

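import re

sections = []
current = []
with open("Aberdeen2005.txt") as f:
    for line in f:
        if re.search(r"(?i)\d+ of \d+ DOCUMENTS", line):
            # A new header starts: store the previous article, if there is one
            if current:
                sections.append("".join(current))
            current = [line]
        else:
            current.append(line)

# Store the last article, which has no header after it
if current:
    sections.append("".join(current))

print(len(sections))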

Next, I checked which articles contain the keywords I am interested in: tax and policy. If an article contains one of them, I extract the month from its first six lines:

months=['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']


for i in range(len(sections)): 

    if (' tax ' in sections[i]
    or ' Tax ' in sections[i]
    or ' policy ' in sections[i]
    or ' Policy ' in sections[i]):

        pat=re.compile("|".join([r"\b{}\b".format(m) for m in months]), re.M)
        month = pat.search("\n".join(sections[i].splitlines()[0:6]))
        print(month)

Last but not least, I want to create a text file with the months previously found:

outfile = open('C:/Users/nn/Desktop/Uncertainty_Scot/dates.txt', 'w')
outfile.write(month.group(0))
outfile.close()

Here is where the problem is: it only writes the last month. I guess this is because the write is not inside the loop. Any ideas how to do it?

Kind regards!

Answer 1


You just need to wrap your loop in a with block for your output file, as follows:

months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']

# Compile the pattern once, outside the loop
pat = re.compile("|".join([r"\b{}\b".format(m) for m in months]), re.M)

with open(r'C:\Users\nn\Desktop\Uncertainty_Scot\dates.txt', 'w') as outfile:
    for i in range(len(sections)):
        if (' tax ' in sections[i] or ' Tax ' in sections[i] or ' policy ' in sections[i] or ' Policy ' in sections[i]):
            month = pat.search("\n".join(sections[i].splitlines()[0:6]))
            print(month)
            # search() returns None when no month appears in the first six lines
            if month:
                outfile.write(month.group(0) + "\n")  # one month per line
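
As a self-contained illustration of the same pattern, here is a minimal sketch that keeps the file open for the whole loop and writes one month per line. The helper name `write_months` and the constants `MONTH_PAT` and `KEYWORDS` are made up for this example and are not part of the original code; it assumes `sections` is a list of article strings like the one built in the question.

```python
import re

MONTHS = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']

# Compiled once; matches any month name as a whole word
MONTH_PAT = re.compile("|".join(r"\b{}\b".format(m) for m in MONTHS))

# Same keyword variants as in the question
KEYWORDS = (' tax ', ' Tax ', ' policy ', ' Policy ')


def write_months(sections, path):
    """Write the month of every keyword-matching section to path.

    Returns the list of months written, in order (hypothetical helper).
    """
    found = []
    # The file stays open for the whole loop, so every match is written
    with open(path, 'w') as outfile:
        for s in sections:
            if any(k in s for k in KEYWORDS):
                # Only look at the first six lines of the article
                month = MONTH_PAT.search("\n".join(s.splitlines()[0:6]))
                if month:  # skip articles with no month in the header
                    outfile.write(month.group(0) + "\n")
                    found.append(month.group(0))
    return found
```

Calling `write_months(sections, 'dates.txt')` would then produce a file with one month per line, and the returned list can be printed for a quick check.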

You could further improve your loop by doing something like the following:

months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']

# Compile the pattern once, outside the loop
pat = re.compile("|".join([r"\b{}\b".format(m) for m in months]), re.M)

with open('C:/Users/nn/Desktop/Uncertainty_Scot/dates.txt', 'w') as outfile:
    for s in sections:
        if any(x in s.lower() for x in [' tax ', ' policy ']):
            month = pat.search("\n".join(s.splitlines()[0:6]))
            print(month)
            # search() returns None when no month appears in the first six lines
            if month:
                outfile.write(month.group(0) + "\n")  # one month per line

Converting to lowercase first means you only have to test one version of each string, and it also catches entries of the form " TAX ".
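
Note that the lowercase test still relies on the surrounding spaces, so it would miss "tax." or "Policy," at the end of a clause. One possible alternative, offered only as a sketch and not as part of the approach above, is a case-insensitive regular expression with word boundaries. The names `KEYWORD_PAT` and `has_keyword` are hypothetical.

```python
import re

# \b marks a word boundary, so punctuation next to the keyword is fine,
# while longer words such as "taxation" or "syntax" do not match
KEYWORD_PAT = re.compile(r"\b(?:tax|policy)\b", re.IGNORECASE)


def has_keyword(text):
    """Return True if text contains 'tax' or 'policy' as a whole word."""
    return KEYWORD_PAT.search(text) is not None
```

In the loop, `if has_keyword(s):` could then replace the `any(...)` test.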