Buffers and data representation

Lecture 1

Basic concept

Everything that is exchanged through the network is a sequence of bytes which has no meaning in the absolute. --- Matrix

As the different machines that communicate have different processors, OS and programming languages, there is no common ground.

It is the protocol that fixes the meaning of the bytes.

Example

What do the bytes 61 E2 82 AC represent (in hexadecimal) ?


  • A single number or several numbers : size in bytes ?
  • Endianness, signed vs unsigned ?
  • One or several strings : charset, size?
  • An image ...

61 E2 82 AC

Size: short (2 bytes), int (4 bytes) et long (8 bytes)

Endianness

  • BigEndian: most significant byte first
    61 E2 82 AC → 1642234540
  • LittleEndian: least significant byte first
  • AC 82 E2 61 → 2894258785

Signed vs Unsigned

AC 82 E2 61 (unsigned) → 2894258785

AC 82 E2 61 (signed) → -1400708511

Hexadecimal, signed vs unsigned

At OS or programmation level, we consider byte (8 bits) as the atomic element.

Hexadecimal :        61          E2          82          AC
Binary      : 0110 0001   1110 0010   1000 0010   1010 1100
Decimal     :        97         226         130         172

When signed, integers are represented in Two's complements (0 is only positive)

               Unsigned              Signed
Hexadecimal :        AC                  AC                          
Binary      : 1010 1100           1010 1100
Decimal     :       172                 -84

https://en.wikipedia.org/wiki/Signed_number_representations#Two's_complement

Endianness

  • BigEndian: most significant bit first (stored at the smallest index in memory)
  • LittleEndian: least significant bit first (stored at the smallest index in memory)

https://en.wikipedia.org/wiki/Endianness

61 E2 82 AC

A string ?

In ASCII (7 bits), only the first by is valid: "a???"

In ISO-8859-1, these bytes represent "aâ ¬"

In UTF-8, these bytes represent "a€"

Charset

A character is a symbol which in Java is represented by the primitive type char.

A charset is a mapping between characters and sequences of bytes.

There exist numerous charsets (ASCII, UTF-8, ISO-8859-1...)

61 E2 82 AC

To go from bytes to characters, we need a charset.

In ASCII (7 bits), ISO-8859-1 or UTF-8 61 represent 'a'

In ASCII (7 bits): E2 does not correspond to any character.

In ISO-8859-1: E2 represent 'â'

In UTF-8: E2 is not associated to any character.

But in UTF-8: E2 82 AC represents '€'

In Java

Historically, Java manipulated sequences of bytes using byte arrays byte[].

Performance problem : the internal representation of byte[] is fixed by the language specification.

Solved in java.nio by using ByteBuffer which allow for more efficient implementations.

Throughout this course, we forbid the use of byte[].

java.nio: the new inputs/outputs (1.4)

Possibility of handling memory outside the Garbage Collector (performance)

  • Streams (1) are replaced by Channel objects
  • Use of ByteBuffer instead of byte[]
  • Charset objects are introduced to represent charsets.
  • Abstract classes are used to hide platform-dependent implementations.

    (1): here, Streams means raw data (bytes) streams and not for java.util.stream.Stream

    java.nio.ByteBuffer

    Essentially an array of bytes:
    a sequence of bytes of fixed size (capacity)

    Notion of work-zone between two indices

  • position = first index in the zone
  • limit = first index outside of the zone
  • Creating a ByteBuffer

    ByteBuffer bb = ByteBuffer.allocate(1024);
    

    The factory method ByteBuffer.allocate(int capacity) creates a ByteBuffer of size capacity.
    The postion is set to 0 and the limit to the capacity.
    This object is managed by Java's Garbage Collector.

    ByteBuffer bb = ByteBuffer.allocateDirect(1024);
    

    Same result as the method above except the object is not handled by the Garbage Collector, but rather by the system.
    IO are more efficient but allocation/deallocation are much slower.

    Dogma: We reserve allocateDirect() for ByteBuffer that are used throughout the whole duration of the program.

    Acessing a ByteBuffer

    Access is relative to the current position:

  • put(b) write a b at the current position,
  • get() reads and returns the byte at the current position.
  • both calls increase the current position by one (as in a stream)
    ⇒ it reduces by one the work-zone.

    If the work-zone is empty, an exception is raised:

    BufferOverflowException or BufferUnderflowException

    Read-mode or write-mode

    Conceptually, a buffer is:

  • either in write-mode, i.e. prepared to be filled by put() calls
    ⇒ its work-zone contains useless data that will be overwritten
  • or in read-mode, i.e. prepared to be consumed by get() calls
    ⇒ its work-zone contains useful data to be read
  • You must carefully use specific methods to change a buffer's mode

    Using a ByteBuffer

    An allocated buffer is in write-mode:

    Flip

    To switch from write-mode to read-mode:
    limit := position and position := 0

    Compact

    To switch in to write-mode (and add new bytes at the end of the current work-zone without losing those not yet read):

    Other useful methods

  • remaining() returns the size of the work-zone
  • hasRemaining() returns true if the work-zone is non-empty
  • position() gives the value of the position index
  • position(int pos) sets the position index
  • limit() gives the limit index
  • limit(int pos) sets the limit index
  • clear() sets position to 0 and limit to capacity
  • Methods for primitive types

  • putInt() writes the 4 bytes of an int at the beginning of the work-zone and reduces it
  • getInt() reads the 4 bytes of an int at the beginning of the zone and reduces it
  • Similarly for putLong() and getLong() for the 8 bytes of a long.
  • And putShort() and getShort() for the 2 bytes of a short.
  • Question: which bytes does bb.putInt(1) write in bb ?

    Endianess

    The byte-order in memory for shorts,ints and longs can be:

  • Big Endian: most significant byte first -- also called Network Order
  • Little Endian: least significant byte first
  • java.nio.ByteOrder

    ByteOrder.nativeOrder() give the native byte-order of the plateform.

    By default, the order of a ByteBuffer is BigEndian but it can be modified using order(ByteOrder)

    Encoding and decoding

    A charset gives a code (over one or several bytes) for each character in this charset.

    Encoding translates a sequence of characters into a sequence of bytes.

    Decoding is the inverse operation, going from a sequence of bytes to a sequence of characters.

    Obviously, encoding/decoding only have meaning relative to a charset.

    java.nio.charset.Charset

    Represents a set of characters

    Charset charset = Charset.forName("UTF-8"); or
    Charset charset = StandardCharsets.UTF_8;

    It provides simple methods to encode and decode.

  • ByteBuffer bb = charset.encode(String s)
  • CharBuffer cb = charset.decode(ByteBuffer bb)
  • In this course, we only alllow these two methods.

    More efficient methods can be accessed via CharsetEncoder and CharsetDecoder

    Example of decoding

    Depending on the charset, a character may be encoded by one, two, three... bytes.

    We must be sure of having all of the bytes before decoding.

    FileChannel

    A FileChannel allows to read and write raw bytes from/to a file.

    We are going to use them to play with ByteBuffer before introducing UDP and TCP.

    Path path = Paths.get("~/test.txt");
    // open in read-mode
    try (FileChannel fc = FileChannel.open(path, StandardOpenOption.READ)) {
     					    ....
    }
    // open in write-mode
    try(FileChannel fc = FileChannel.open(path,
    				     StandardOpenOption.CREATE,
    				     StandardOpenOption.WRITE,
    				     StandardOpenOption.TRUNCATE_EXISTING)){
    				         ....
    }
                    

    Reading

    int fc.read(ByteBuffer bb) reads at most bb.remaining() bytes from the channel fc and stores them in the buffer bb

    read() returns the number of bytes read, or -1 if the channel is closed.

    Even if the file contains more bytes than the buffer work-zone, there is no guarant that the buffer will be completely filled by a single call.

    By default, reading is blocking, i.e., read() blocks until at least one byte is read.

    Writing

    int fc.write(ByteBuffer bb) writes bb.remaining() bytes in the channel fc, taken from the work-zone of the buffer bb

    By default, writing is blocking, i.e., write() returns when all bytes have been written (it returns this number).

    Example (1/3)

    We want to write a program that:

    • takes a filename as argument,
    • while there is input: reads integers (int) from the keyboard and writes the corresponding 4 bytes in the file in BigEndian.

    To simplify, we will first write these int in the file as soon as they are read.

    NB: not efficient, write should be grouped

    Example (2/3)

    Path path = Paths.get(args[1]);
    ByteBuffer buff = ByteBuffer.allocate(Integer.BYTES);  // 4 bytes
    try(FileChannel fc = FileChannel.open(path, StandardOpenOption.CREATE, 
            StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING);
            Scanner scan = new Scanner(System.in)) {
        while (scan.hasNextInt()) {
            buff.putInt(scan.nextInt());
            buff.flip();
            fc.write(buff);
            buff.clear();
        }
    }
    

    What do we do with the IOException raised by FileChannel.open and FileChannel.write?

    Example (3/3)

    More efficient, we take a large ByteBuffer that we fill with the int. When the buffer is full, we write it to the file before going on.

    Path path = Paths.get(args[1]);
    ByteBuffer buff = ByteBuffer.allocate(BUFFER_SIZE);
        
    try(FileChannel fc = FileChannel.open(path, StandardOpenOption.CREATE, 
            StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING);
            Scanner scan = new Scanner(System.in)) {
        while (scan.hasNextInt()) {
            if (buff.remaining()<Integer.BYTES){
                buff.flip();
                fc.write(buff);
                buff.clear();
            }
            buff.putInt(scan.nextInt());
        }
        buff.flip();
        fc.write(buff);
    }