[ can I use the csv module when the end of record marker is not a new line? ]

I want to parse a csv-like file which uses non-ascii delimiters. The csv module lets me set the quote character and the field delimiter. Is it possible to set the end-of-record delimiter as well, so that such a file can be read with the csv module?

Take a csv-like file where, instead of the usual quote character, field delimiter and record terminator:

'"', ',', '\n'

it uses

'¦', '¶', '§'

for example

data = [
    [1,r'''text "could" be
'tricky'\\'''],
    [2,r'or easy']
]

would be represented as

'1¶¦text "could" be\n\'tricky\'\\\\¦§2¶¦or easy¦'

I know how to solve this using split etc. But is there a better way with the csv module?

This expression generates examples:

'\xa7'.join(
        '\xb6'.join(
                '\xa6{}\xa6'.format(val) if type(val)==str else str(val)
                for val in row
        ) for row in data
)

Answer 1

No, you cannot use csv.reader() for this directly: the reader ignores the Dialect.lineterminator parameter, because the end-of-line characters it recognises are hard-coded. From the documentation:

Note: The reader is hard-coded to recognise either '\r' or '\n' as end-of-line, and ignores lineterminator. This behavior may change in the future.
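To see the documented behaviour in practice, here is a small sketch (Python 3; the sample line is made up for illustration). The reader accepts a lineterminator argument but does not act on it, so '§' ends up as ordinary field content:

```python
import csv

# Hypothetical one-line input that uses '§' where a newline would normally be.
lines = ['a,b§c,d']

# lineterminator is accepted as a format parameter, but the reader ignores it.
rows = list(csv.reader(lines, lineterminator='§'))
print(rows)  # [['a', 'b§c', 'd']]
```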

You'd have to create a wrapper around the reader to translate your line terminators:

class LineTerminatorTranslator(object):
    def __init__(self, orig, terminator, buffer=2048):
        self._orig = orig
        self._terminator = terminator
        self._buffer = buffer

    def __iter__(self):
        terminator = self._terminator
        buffer = ''

        if hasattr(self._orig, 'read'):
            # read in chunks, rather than in lines, where possible
            iterator = iter(lambda: self._orig.read(self._buffer), '')
        else:
            iterator = iter(self._orig)

        while True:
            try:
                # keep reading until at least one complete record is buffered
                while terminator not in buffer:
                    buffer += next(iterator)
            except StopIteration:
                # done, yield remainder (skipped when the input ended
                # exactly on a terminator, to avoid a spurious empty row)
                if buffer:
                    yield buffer
                return
            # everything before the last terminator is complete records;
            # the tail stays in the buffer for the next round
            entries, _, buffer = buffer.rpartition(terminator)
            for entry in entries.split(terminator):
                yield entry

This reads the input file in chunks of 2048 characters (configurable through the buffer argument) and splits out the records by the given line terminator.

Because csv.reader() can handle any iterable, the code can accept other iterables too, but becomes less efficient if such an iterable produces large strings each iteration.
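When the whole input fits in memory, the same idea can be sketched without the wrapper class by handing csv.reader() a list produced by str.split(). This is only a minimal Python 3 sketch using the sample string from the question. Like the chunked reader above, it assumes the record terminator never occurs inside a quoted field:

```python
import csv

# Sample from the question: '¦' quotes, '¶' separates fields, '§' ends records.
sample = '1¶¦text "could" be\n\'tricky\'\\\\¦§2¶¦or easy¦'

# Any iterable of strings is accepted; each item is treated as one "line".
records = sample.split('§')
rows = list(csv.reader(records, delimiter='¶', quotechar='¦'))
print(rows)
```

The newline inside the first record sits within a quoted field, so the reader keeps it as part of the field value instead of starting a new row.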

The code should work on both Python 2 and 3.

Demo on Python 3:

>>> import csv
>>> import io
>>> sample = '1¶¦text "could" be\'tricky\n\'\\\\¦§2¶¦or easy¦'
>>> input = LineTerminatorTranslator(io.StringIO(sample), '§')
>>> list(csv.reader(input, delimiter='¶', quotechar='¦'))
[['1', 'text "could" be\'tricky\n\'\\\\'], ['2', 'or easy']]

Slightly contrived Python 2 version:

>>> import csv
>>> from cStringIO import StringIO
>>> sample = '1P|text "could" be\'tricky\n\'\\\\|T2P|or easy|'
>>> input = LineTerminatorTranslator(StringIO(sample), 'T')
>>> list(csv.reader(input, delimiter='P', quotechar='|'))
[['1', 'text "could" be\'tricky\n\'\\\\'], ['2', 'or easy']]

Answer 2

You can't read such files with the csv module. There is an option called lineterminator, but the documentation says:

The reader is hard-coded to recognise either '\r' or '\n' as end-of-line, and ignores lineterminator. This behavior may change in the future.

You could apparently use this lineterminator parameter to write such a file, but you wouldn't be able to read it back in using the csv module.
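As a rough illustration of the writing side (a Python 3 sketch; the choice of csv.QUOTE_NONNUMERIC is an assumption made here so that strings are quoted and integers are not, matching the format in the question):

```python
import csv
import io

data = [[1, 'text "could" be\n\'tricky\'\\\\'], [2, 'or easy']]

out = io.StringIO()
writer = csv.writer(
    out,
    delimiter='¶',
    quotechar='¦',
    lineterminator='§',             # honoured by the writer
    quoting=csv.QUOTE_NONNUMERIC,   # quote strings, leave numbers bare
)
writer.writerows(data)

written = out.getvalue()
print(repr(written))
```

Note that the writer appends the terminator after every record, including the last one, so the output ends with '§'.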