TAGS :Viewed: 10 - Published at: a few seconds ago

[ Python script for CSV handling ]

I'm trying to write a python script whereby the code acts on all CSV files in all subfolders of a particular main folder, where CSVs can have similar names, and the code outputs an edited CSV file for each input CSV. The code is below:

import os
import csv
import sys


for root, dirs, files in os.walk(sys.argv[1]):
    for filename in files:
        i = 0
        while os.path.exists(str(i)+filename):
            i += 1
        with open(str(i)+filename, "wb") as g:
            symbol = filename.rpartition('_')[-1].rpartition('.')[0] 
            reader = csv.reader(filename)
            writer = csv.writer(g, delimiter='\t', quotechar='"', quoting=csv.QUOTE_ALL)
            for row in reader:
                row.insert(0, symbol.upper())
                writer.writerow(row)

To account for the possibility that I may have similar file names, I'm using an iterative string in the output filename. I also want to add a column into the output CSV such that its content is borrowed from the input filename - this is the purpose of rpartition and the rest of the code. However...my output for one of the CSV inputs looks like:

"AA"    "t"
"AA"    "a"
"AA"    "b"
"AA"    "l"
"AA"    "e"
"AA"    "_"
"AA"    "a"
"AA"    "a"
"AA"    "."
"AA"    "c"
"AA"    "s"
"AA"    "v"

The "AA" part is correct - its the symbol for that input file. However, I don't know why it proceeds to display the name of the input file (table_aa.csv) in the output, as I want the contents of the input displayed next to the "AA" column.

Could someone please and kindly have a look at the code and let me know?

****************EDIT***************** Edited the code to:

import os
import csv
import sys


for root, dirs, files in os.walk(sys.argv[1]):
    for filename in files:
        pathname = os.path.join(root, filename)
        i = 0
        while os.path.exists(str(i)+filename):
            i += 1
        with open(str(i)+filename, "wb") as g:
            symbol = filename.rpartition('_')[-1].rpartition('.')[0]
            reader = csv.reader(pathname)
            writer = csv.writer(g, delimiter='\t', quotechar='"', quoting=csv.QUOTE_ALL)
            for row in reader:
                row.insert(0, symbol.upper())
                writer.writerow(row)

But now the output is:

"AA"    "Y"
"AA"    "\"
"AA"    "Y"
"AA"    "2"
"AA"    "\"
"AA"    "t"
"AA"    "a"
"AA"    "b"
"AA"    "l"
"AA"    "e"
"AA"    "_"
"AA"    "a"
"AA"    "a"
"AA"    "."
"AA"    "c"
"AA"    "s"
"AA"    "v"

which is again, not the output in the input CSV. Any thoughts?

Answer 1


You are not using root and dirs to create correct path.

for filename in [join(root, name)) for name in files)]:

you can find more description here

Answer 2


Your main problem is that you are passing the input file name to csv.reader, so that name string is being used as the data source. You need to actually open the input file and pass the file handle tocsv.reader.

Here's a simplified version of your code that ignores the os.walk stuff. It's a good idea to make sure your code processes one file correctly before you get fancy and try to process all files in a directory. Once you have the core process working correctly then you can expand it.

To test this code I created a very simple data file:

qdata

r0c0,r0c1,r0c2,r0c3
r1c0,r1c1,r1c2,r1c3
r2c0,r2c1,r2c2,r2c3
r3c0,r3c1,r3c2,r3c3
r4c0,r4c1,r4c2,r4c3

The code:

test.py

import csv
import os

filename = 'qdata'
symbol = ['AA']

i = 0
outname = str(i) + filename
while os.path.exists(outname):
    i += 1
    outname = str(i) + filename
print('Saving to name:', outname)    

with open(filename, 'rb') as infile:
    reader = csv.reader(infile)
    with open(outname, "wb") as outfile:
        writer = csv.writer(outfile, delimiter='\t', quotechar='"', quoting=csv.QUOTE_ALL)
        for row in reader:
            row = symbol + row
            writer.writerow(row)

The output file:

0qdata

"AA"    "r0c0"  "r0c1"  "r0c2"  "r0c3"
"AA"    "r1c0"  "r1c1"  "r1c2"  "r1c3"
"AA"    "r2c0"  "r2c1"  "r2c2"  "r2c3"
"AA"    "r3c0"  "r3c1"  "r3c2"  "r3c3"
"AA"    "r4c0"  "r4c1"  "r4c2"  "r4c3"

Note that I used row = symbol + row to create a new row list. This is more efficient than modifying the list using .insert because the .insert method has to move all subsequent list items to make room for the inserted item.