Understanding byte encoding and decoding in Java
Understanding encoding in detail, along with the various encoding mechanisms.
In very simple terms, encoding is the process of converting information into a format that makes it efficient to store or transmit without losing the information it contains. Let's take a deeper look and try to clear up the mystery around encoding and decoding.
Let's take a human perspective on language:
- Sounds are encoded and decoded to represent certain information.
- The same sound, taken out of context or heard in a different language, will mean something else.
- The schema and semantics of encoding and decoding sound live with humans, who have consistently passed them on from one generation to the next.
To be more specific:
KAPUT in German means “broken”, while in Hindi it means an “evil son”. It is the same word with the same pronunciation, yet the meaning changes, because German and Hindi are different languages and the schema used to extract information from the same sound is different.
Let's apply the same analogy to computers. In a computer system everything is represented as a sequence of 1s and 0s. How, then, does one represent letters, numbers and special characters when the underlying storage format is just 1s and 0s?
Human Language | Computer |
Sound | Sequence of 1s and 0s |
Specific sequences of sounds form letters and words | Specific sequences of 1s and 0s form letters and words |
Humans learn and maintain consistent semantics for a language and are therefore able to communicate | Standards such as ASCII, UTF-8 and UTF-16 define the semantics, and each programming language chooses which standard to follow |
A language interpreter is required to convert from one language to another | An encoder/decoder is required to convert from one standard to another |
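The KAPUT analogy can be reproduced with bytes. The following is a minimal sketch, not part of the original program: the class name `SameBytesDifferentCharset` and its helper methods are made up for illustration. It decodes the same two bytes under two different standards, UTF-8 and ISO-8859-1, and gets two different strings.

```java
import java.nio.charset.StandardCharsets;

public class SameBytesDifferentCharset {

    // Interpret the bytes using the UTF-8 schema.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Interpret the very same bytes using the ISO-8859-1 schema.
    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Two raw bytes: 11000011 10101001
        byte[] bytes = {(byte) 0xC3, (byte) 0xA9};

        String utf8 = decodeUtf8(bytes);     // one character, U+00E9
        String latin1 = decodeLatin1(bytes); // two characters, U+00C3 and U+00A9

        System.out.println("UTF-8 length      : " + utf8.length());
        System.out.println("ISO-8859-1 length : " + latin1.length());
    }
}
```

The bytes never change; only the schema used to read them does, which is exactly the situation of the same sound being heard by speakers of two languages.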
Let's take a byte and understand what it means in its raw form.
- In raw form, a byte is a sequence of 8 bits, where each bit can take the value 0 or 1.
- The same byte can be treated as part of a char or a short.
- A byte carries no implicit information about how it is going to be treated.
- The same byte (binary sequence) will mean one thing if treated as a char and something else if treated as an int.
- What information the byte conveys depends entirely on how it is interpreted.
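The points above can be sketched in a few lines of Java. This is an illustrative example only; the class `OneByteManyMeanings` and its methods are hypothetical names, not part of the program discussed later. It takes a single byte and shows it as a bit pattern, as a number and as a character.

```java
public class OneByteManyMeanings {

    // The raw form: 8 bits, left-padded with zeros.
    static String asBits(byte b) {
        String bits = Integer.toBinaryString(b & 0xFF);
        return String.format("%8s", bits).replace(' ', '0');
    }

    // The same byte read as a number.
    static int asInt(byte b) {
        return b;
    }

    // The same byte read as a character (treated as an unsigned code point).
    static char asChar(byte b) {
        return (char) (b & 0xFF);
    }

    public static void main(String[] args) {
        byte raw = 0b0100_0001;

        System.out.println("bits : " + asBits(raw)); // 01000001
        System.out.println("int  : " + asInt(raw));  // 65
        System.out.println("char : " + asChar(raw)); // A
    }
}
```

Nothing in `raw` says whether it is the number 65 or the letter A; the calling code decides.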
ASCII (American Standard Code for Information Interchange) is a 7-bit character encoding. It therefore maps 2^7 = 128 unique sequences of 1s and 0s to symbols (a-z, A-Z, 0-9, punctuation and control characters).
Because 128 unique sequences are not enough to cover the characters of all the world's languages, the UTF-8, UTF-16 and UTF-32 encoding schemes were also adopted.
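To see how the schemes differ in practice, the sketch below counts the bytes produced for a few characters under UTF-8 and UTF-16. The class name `EncodedLengths` is hypothetical. `UTF_16BE` is used instead of `UTF_16` as a deliberate choice, because the latter prepends a 2-byte byte-order mark when encoding and would inflate the counts.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodedLengths {

    // Number of bytes the text occupies once encoded with the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // 'A' (U+0041), e with acute accent (U+00E9), euro sign (U+20AC)
        String[] samples = {"A", "\u00E9", "\u20AC"};

        for (String sample : samples) {
            System.out.println("U+" + String.format("%04X", (int) sample.charAt(0))
                    + " UTF-8 bytes: " + encodedLength(sample, StandardCharsets.UTF_8)
                    + ", UTF-16BE bytes: " + encodedLength(sample, StandardCharsets.UTF_16BE));
        }
    }
}
```

In UTF-8 the three samples take 1, 2 and 3 bytes respectively, while in UTF-16BE each takes 2 bytes.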
Key Takeaways
- An encoding holds the information about which specific character a given sequence of 1s and 0s represents.
- There is no direct relation between the represented characters and decimal numbers.
- The mapping is from a sequence of 1s and 0s to a character (00110000 => 0); 48 is simply the decimal representation of 00110000.
- Humans have been trained to think in decimal numbers, so we convert 00110000 to 48 and say that 48 is the ASCII code for 0.
- This is an associative relationship, not a direct one; it has more to do with how we think.
- Remember that decimal, octal and hexadecimal numbers are themselves a kind of encoding over binary sequences.
- The whole idea of a file system rests on the same logic: the schema for interpreting the binary data on a hard disk stays with the OS, while the hard disk itself is completely agnostic of files, characters, words, lines and sentences.
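The relationship between 00110000, 48 and the character 0 can be checked directly. The sketch below uses a hypothetical class name, `RadixViews`, and is only an illustration: it parses the bit sequence once and prints it in decimal, octal and hexadecimal notation and as an ASCII character.

```java
public class RadixViews {

    // Read a string of 1s and 0s as a binary number.
    static int fromBinary(String bits) {
        return Integer.parseInt(bits, 2);
    }

    public static void main(String[] args) {
        int value = fromBinary("00110000");

        System.out.println("decimal : " + value);                        // 48
        System.out.println("octal   : " + Integer.toOctalString(value)); // 60
        System.out.println("hex     : " + Integer.toHexString(value));   // 30
        System.out.println("char    : " + (char) value);                 // 0
    }
}
```

All four lines describe the same eight bits; decimal, octal, hex and ASCII are just different notations layered on top of them.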
package com.big.data.java.samples;

import java.nio.ByteBuffer;
import java.util.stream.IntStream;

public class EncodeDecode {

    public static void main(String[] args) {
        // A Java char is 2 bytes, but the bytes produced from a String depend on the charset.
        // getBytes() with no argument uses the JVM default charset, assumed here to be UTF-8,
        // in which the characters '0' and '1' occupy one byte each.
        // So "01010101" becomes 8 bytes, not 16.
        String characterRepresentation = "01010101";

        // Raw bytes: the schema (encoding) is no longer relevant.
        // Raw bytes are just a sequence of 1s and 0s.
        ByteBuffer rawBytes = ByteBuffer.wrap(characterRepresentation.getBytes());

        // A char is 2 bytes, so each call consumes 2 bytes
        // and only 4 iterations are needed to read all 8 bytes.
        IntStream.range(0, 4)
                .forEach(value ->
                        System.out.println("Interprete as CHAR value is " + rawBytes.getChar()));

        // Rewind the ByteBuffer's internal position to the start.
        rawBytes.rewind();

        // A byte is read one at a time, so 8 iterations are needed.
        IntStream.range(0, 8)
                .forEach(value ->
                        System.out.println("Interprete as BYTE value is " + rawBytes.get()));

        // Rewind the ByteBuffer's internal position to the start.
        rawBytes.rewind();

        // A short is 2 bytes, so each call consumes 2 bytes
        // and only 4 iterations are needed to read all 8 bytes.
        IntStream.range(0, 4)
                .forEach(value ->
                        System.out.println("Interprete as SHORT value is " + rawBytes.getShort()));

        // Rewind the ByteBuffer's internal position to the start.
        rawBytes.rewind();

        // An int is 4 bytes, so each call consumes 4 bytes
        // and only 2 iterations are needed to read all 8 bytes.
        IntStream.range(0, 2)
                .forEach(value ->
                        System.out.println("Interprete as INT value is " + rawBytes.getInt()));
    }
}
Key Takeaways
CODE | INTERPRETATION |
String characterRepresentation = "01010101"; | "01010101" has length 8. The 0s and 1s here are the visual representation of characters, each encoded as one byte in UTF-8. |
Characters 0 and 1 in the String | Character 0 => 00110000 = 48 decimal, character 1 => 00110001 = 49 decimal |
ByteBuffer rawBytes = ByteBuffer.wrap(characterRepresentation.getBytes()); | Converts characterRepresentation to raw bytes. Encoding erasure has happened: the buffer holds only bytes, with no record of what they represented. |
rawBytes.getChar() | A char is 2 bytes, so each call consumes 2 bytes from the ByteBuffer. Each call reads the bytes of "01", that is, the bit sequence 0011000000110001, which as a character is 〱. The loop runs for 0 <= i < 4 because 4 calls consume all 8 bytes. |
rawBytes.get() | Reads 1 byte at a time and returns it as a signed 1-byte integer (-128 to 127). Character 0 => 00110000 = 48 decimal, character 1 => 00110001 = 49 decimal. The loop runs for 0 <= i < 8 because 8 calls consume all 8 bytes. |
rawBytes.getShort() | A short is 2 bytes, so each call consumes 2 bytes from the ByteBuffer. Each call reads the bytes of "01", that is, the bit sequence 0011000000110001, which as a decimal number is 12337. The loop runs for 0 <= i < 4 because 4 calls consume all 8 bytes. |
rawBytes.getInt() | An int is 4 bytes, so each call consumes 4 bytes from the ByteBuffer. Each call reads the bytes of "0101", that is, the bit sequence 00110000001100010011000000110001, which as a decimal number is 808529969. The loop runs for 0 <= i < 2 because 2 calls consume all 8 bytes. |
OUTPUT
Interprete as CHAR value is 〱
Interprete as CHAR value is 〱
Interprete as CHAR value is 〱
Interprete as CHAR value is 〱
Interprete as BYTE value is 48
Interprete as BYTE value is 49
Interprete as BYTE value is 48
Interprete as BYTE value is 49
Interprete as BYTE value is 48
Interprete as BYTE value is 49
Interprete as BYTE value is 48
Interprete as BYTE value is 49
Interprete as SHORT value is 12337
Interprete as SHORT value is 12337
Interprete as SHORT value is 12337
Interprete as SHORT value is 12337
Interprete as INT value is 808529969
Interprete as INT value is 808529969
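The SHORT and INT values in the output can be reproduced by hand, which makes the arithmetic behind them explicit. The sketch below is an illustration under stated assumptions: the class `ManualInterpretation` and its helpers are hypothetical, and it relies on ByteBuffer's default big-endian byte order, where the first byte read is the most significant one.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ManualInterpretation {

    // Combine two bytes into a short, most significant byte first.
    static short toShort(byte high, byte low) {
        return (short) (((high & 0xFF) << 8) | (low & 0xFF));
    }

    // Combine four bytes into an int, most significant byte first.
    static int toInt(byte b0, byte b1, byte b2, byte b3) {
        return ((b0 & 0xFF) << 24) | ((b1 & 0xFF) << 16) | ((b2 & 0xFF) << 8) | (b3 & 0xFF);
    }

    public static void main(String[] args) {
        // Characters '0','1','0','1' become the bytes 48, 49, 48, 49.
        byte[] bytes = "0101".getBytes(StandardCharsets.UTF_8);

        short manualShort = toShort(bytes[0], bytes[1]);
        int manualInt = toInt(bytes[0], bytes[1], bytes[2], bytes[3]);

        // Absolute reads at index 0 do not move the buffer position.
        ByteBuffer buffer = ByteBuffer.wrap(bytes);

        System.out.println("manual short = " + manualShort + ", getShort = " + buffer.getShort(0));
        System.out.println("manual int   = " + manualInt + ", getInt   = " + buffer.getInt(0));
    }
}
```

In other words, 12337 is 48 * 256 + 49, and 808529969 is the same idea extended to four bytes.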