[ How can I send a 4 byte header from Java and read it in Python? ]

I am trying to send a message over a TCP socket from a Java application and read it in Python 2.7. I want the first 4 bytes to specify the message length, so that I could do:

header = socket.recv(4)
message_length = struct.unpack(">L", header)[0]  # unpack() returns a tuple
message = socket.recv(message_length)

on the Python end.

Java side:

out = new PrintWriter(new BufferedWriter(new OutputStreamWriter(socket.getOutputStream())),true);
byte[] bytes = ByteBuffer.allocate(4).putInt(message_length).array();
String header = new String(bytes, Charset.forName("UTF-8"));
String message_w_header = header.concat(message);

This works for some message lengths (10, 102 characters) but fails for others (for example, 1017 characters). For a failing value, if I print each byte, I get:

Bytes 0 0 3 -7
Length 1017
Hex string 3f9

Bytes 0 0 3 -17
Length 1007
Hex string \x00\x00\x03\xef

I think this has something to do with bytes being signed in Java and unsigned in Python, but I can't figure out what I should do to make it work.

Answer 1

The issue is on the Java side -- b'\x03\xf9' is not a valid UTF-8 byte sequence:

>>> b'\x03\xf9'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf9 in position 1: invalid start byte

It seems new String(bytes, Charset.forName("UTF-8")) uses the 'replace' error handler: the invalid byte b'\xf9' is replaced with the Unicode replacement character '\ufffd', and b'\xef' is the first of the three bytes of that character encoded in UTF-8:

>>> b'\x03\xf9'.decode('utf-8', 'replace').encode('utf-8')
b'\x03\xef\xbf\xbd'

that is why you receive b'\x03\xef' instead of b'\x03\xf9' in Python.
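
The numbers in the question can be reproduced in Python alone. The sketch below is only an illustration: corrupt_header is a made-up helper name, and it imitates the Java String round trip with Python's 'replace' error handler, which matches the bytes observed above. It packs a length as a big-endian 4-byte header, passes it through a UTF-8 decode/encode round trip, and unpacks the first four bytes the way the receiver does.

```python
import struct

def corrupt_header(length):
    """Imitate the bytes -> text -> bytes round trip on a 4-byte header.

    Hypothetical helper for illustration only.
    """
    header = struct.pack(">L", length)         # what ByteBuffer.putInt() produces
    text = header.decode("utf-8", "replace")   # invalid bytes become U+FFFD
    wire = text.encode("utf-8")                # U+FFFD becomes b'\xef\xbf\xbd'
    return struct.unpack(">L", wire[:4])[0]    # what the receiver reads as the length

print(corrupt_header(10))    # 10
print(corrupt_header(102))   # 102
print(corrupt_header(1017))  # 1007
```

Lengths such as 10 and 102 survive because every header byte is below 0x80 and is therefore valid ASCII; 1017 contains the byte 0xf9, which gets replaced, so the receiver sees 0x03ef = 1007.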

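To fix it, send bytes in Java instead of Unicode text: write the 4-byte header and the encoded message directly to socket.getOutputStream() rather than passing them through a String and a PrintWriter. DataOutputStream.writeInt() is one option, since it writes an int as four big-endian bytes, which matches ">L".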

Unrelated: sock.recv(n) may return fewer than n bytes. If the socket is blocking, you could create a file-like object using file = sock.makefile('rb') and call file.read(n), which returns fewer than n bytes only if the connection is closed first.
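
A minimal sketch of the receiving side using makefile('rb') might look like this. The function name read_message and the EOF handling are assumptions made for the example, and a local socket.socketpair() stands in for the Java peer so the snippet can run on its own.

```python
import socket
import struct

def read_message(stream):
    """Read one length-prefixed message from a file-like object
    such as the one returned by sock.makefile('rb')."""
    header = stream.read(4)
    if len(header) < 4:
        raise EOFError("connection closed before a full header arrived")
    (length,) = struct.unpack(">L", header)
    payload = stream.read(length)  # blocks until `length` bytes arrive or EOF
    if len(payload) < length:
        raise EOFError("connection closed mid-message")
    return payload

# Local demonstration: a connected socket pair stands in for the Java sender.
sender, receiver = socket.socketpair()
body = b"x" * 1017
sender.sendall(struct.pack(">L", len(body)) + body)
sender.close()

stream = receiver.makefile("rb")
message = read_message(stream)
print(len(message))  # 1017
stream.close()
receiver.close()
```

The short-read checks matter because read(n) returns fewer than n bytes when the peer closes the connection early.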