[ Can I use the csv module when the end-of-record marker is not a newline? ]
I want to parse a csv-like file which uses non-ASCII delimiters. The csv module lets me set the quote character and the field delimiter. Is it possible to set the end-of-record delimiter as well, so that the file can be read with the csv module?
Take a csv-like file where, instead of:
'"', ',', '\n'
it uses
'¦', '¶', '§'
for example
data = [
    [1, r'''text "could" be
'tricky'\\'''],
    [2, r'or easy']
]
would be represented as
'1¶¦text "could" be\n\'tricky\'\\\\¦§2¶¦or easy¦'
I know how to solve this using split etc. But is there a better way with the csv module?
This expression generates examples:
chr(167).join(
    [
        chr(182).join(
            [
                '\xa6{}\xa6'.format(val) if type(val)==str else str(val)
                for val in row
            ]
        ) for row in data
    ])
Answer 1
No, you cannot directly use csv.reader() for this: the reader ignores the Dialect.lineterminator parameter, and its line endings are hard-coded. From the documentation:
Note: The reader is hard-coded to recognise either '\r' or '\n' as end-of-line, and ignores lineterminator. This behavior may change in the future.
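The documented behaviour can be checked with a minimal sketch. The sample string is made up for illustration; passing lineterminator to the reader is accepted but appears to have no effect on how records are split:

```python
import csv

# One input "line" that contains the custom terminator '§' in the middle.
lines = ['a,b§c,d']

# lineterminator is accepted as a dialect parameter, but the reader ignores it,
# so '§' is treated as ordinary field content.
rows = list(csv.reader(lines, lineterminator='§'))
print(rows)  # [['a', 'b§c', 'd']]
```

The whole string comes back as a single record, with '§' left inside the second field.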
You'd have to create a wrapper around the reader to translate your line terminators:
class LineTerminatorTranslator(object):
    def __init__(self, orig, terminator, buffer=2048):
        self._orig = orig
        self._terminator = terminator
        self._buffer = buffer

    def __iter__(self):
        terminator = self._terminator
        buffer = ''

        if hasattr(self._orig, 'read'):
            # read in chunks, rather than in lines, where possible
            iterator = iter(lambda: self._orig.read(self._buffer), '')
        else:
            iterator = iter(self._orig)

        while True:
            try:
                while terminator not in buffer:
                    buffer += next(iterator)
            except StopIteration:
                # done, yield remainder
                yield buffer
                return
            entries, _, buffer = buffer.rpartition(terminator)
            for entry in entries.split(terminator):
                yield entry
This reads the input file in chunks of 2048 characters (configurable through the buffer argument) and splits out the records by the given terminator.
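The splitting step inside the loop can be looked at in isolation. This is only an illustration with a made-up buffer value: rpartition() cuts at the last terminator, so everything before it is a run of complete records and everything after it is an incomplete tail that is carried over to the next chunk.

```python
terminator = '§'
buffer = 'a¶b§c¶d§e¶'  # two complete records plus the start of a third

# Cut at the LAST terminator: complete records on the left, leftover on the right.
entries, _, rest = buffer.rpartition(terminator)

complete = entries.split(terminator)
print(complete)  # ['a¶b', 'c¶d']
print(rest)      # 'e¶'
```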
Because csv.reader() can handle any iterable, the code can accept other iterables too, but it becomes less efficient if such an iterable produces large strings on each iteration.
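As a sketch of the non-file case, the same buffering idea can be written as a generator and fed a plain list of string chunks. The name iter_records and the sample chunks are hypothetical, not part of the answer's class. One deliberate difference: this version skips an empty remainder, which the class above would yield when the input ends with the terminator (csv.reader appears to turn such an empty string into an empty row).

```python
import csv


def iter_records(chunks, terminator):
    """Hypothetical generator variant of the buffering loop shown above."""
    buffer = ''
    for chunk in chunks:
        buffer += chunk
        if terminator in buffer:
            # Everything before the last terminator is complete; keep the tail.
            entries, _, buffer = buffer.rpartition(terminator)
            for entry in entries.split(terminator):
                yield entry
    if buffer:
        # Yield the leftover only if the input did not end with the terminator.
        yield buffer


# Chunk boundaries fall in the middle of records on purpose.
chunks = ['1¶¦or ', 'easy¦§2¶¦te', 'xt¦§']
rows = list(csv.reader(iter_records(chunks, '§'), delimiter='¶', quotechar='¦'))
print(rows)  # [['1', 'or easy'], ['2', 'text']]
```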
The code should work on both Python 2 and 3.
Demo:
>>> import csv
>>> import io
>>> sample = '1¶¦text "could" be\'tricky\n\'\\\\¦§2¶¦or easy¦'
>>> input = LineTerminatorTranslator(io.StringIO(sample), '§')
>>> list(csv.reader(input, delimiter='¶', quotechar='¦'))
[['1', 'text "could" be\'tricky\n\'\\\\'], ['2', 'or easy']]
Slightly contrived Python 2 version:
>>> import csv
>>> from cStringIO import StringIO
>>> sample = '1P|text "could" be\'tricky\n\'\\\\|T2P|or easy|'
>>> input = LineTerminatorTranslator(StringIO(sample), 'T')
>>> list(csv.reader(input, delimiter='P', quotechar='|'))
[['1', 'text "could" be\'tricky\n\'\\\\'], ['2', 'or easy']]
Answer 2
You can't read such files with the csv module. There is an option called lineterminator, but the documentation says:
The reader is hard-coded to recognise either '\r' or '\n' as end-of-line, and ignores lineterminator. This behavior may change in the future.
You could apparently use this lineterminator parameter to write such a file, but you wouldn't be able to read it back in using the csv module.
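A small sketch of that asymmetry, using the delimiters from the question and made-up rows. The choice of csv.QUOTE_NONNUMERIC is an assumption here, picked so that strings are quoted and integers are not, as in the question's format:

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(
    buf,
    delimiter='¶',
    quotechar='¦',
    lineterminator='§',            # honoured by the writer
    quoting=csv.QUOTE_NONNUMERIC,  # quote strings, leave numbers bare
)
writer.writerows([[1, 'or easy'], [2, 'x']])
text = buf.getvalue()
print(text)  # 1¶¦or easy¦§2¶¦x¦§

# Reading it back: the reader only splits records on '\r' or '\n',
# so both records arrive merged into a single row.
rows_back = list(csv.reader(io.StringIO(text), delimiter='¶', quotechar='¦'))
print(len(rows_back))  # 1
```

The writer emits '§' after every record, but the reader returns one row instead of two, so a wrapper that splits on the terminator is still needed for reading.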