Understanding the difference between Binary and text files using Java.

Let's start with the basic ideas.

  • A file is saved on a hard disk.
  • A hard disk has only two states (magnetized and demagnetized), which represent the two states of binary.
  • Any file saved on a hard disk has to follow the semantics of these two states.
  • Hence any file saved on a hard disk is binary in nature.
  • Secondary storage just stores an array of bits. The hard disk does not know how to encode or decode a byte (8 bits).
  • Letters, words and symbols are unknown to secondary storage.
  • It is we who take a sequence of bits and map it to a symbol using an encoding (ASCII, UTF-8, UTF-16).
  • Because secondary storage has no idea how to interpret the bytes, all stored data (files, pictures, videos) is binary.
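
To make the point about encodings concrete, here is a minimal sketch that asks Java for the bytes behind the same symbol under three encodings. The class name EncodingDemo and the helper toBits are made up for this illustration, and UTF_16BE is used instead of UTF_16 so that no byte-order mark is added.

```java
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Renders each byte as 8 binary digits, with a space between bytes.
    static String toBits(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // b & 0xFF treats the byte as unsigned before converting to base 2.
            String bits = Integer.toBinaryString(b & 0xFF);
            sb.append(String.format("%8s", bits).replace(' ', '0'));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String symbol = "A";
        // The symbol is the same, only the mapping to bits changes.
        System.out.println(toBits(symbol.getBytes(StandardCharsets.US_ASCII))); // 01000001
        System.out.println(toBits(symbol.getBytes(StandardCharsets.UTF_8)));    // 01000001
        System.out.println(toBits(symbol.getBytes(StandardCharsets.UTF_16BE))); // 00000000 01000001
    }
}
```

The disk would store whichever of these bit sequences we hand it. Only the program that reads the file decides that the bits mean the letter A.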

 

Key Takeaways

  • All files are binary in nature.
  • A text file is a special case of a binary file, where the data is encoded in such a way that it is human readable (when read through a text editor).
  • A text file saved on secondary storage is stored in binary form.
  • A text file is like a protocol: the writer encodes the data in a specific way, and the text reader decodes it in the same way so that it becomes human readable.

Let's take an example and try to understand, in terms of bytes and their interpretation, what is happening.

 

Text file format: let's take a String:

String characterRepresentation = "01010101";

If the String “01010101” needs to be saved in text format, what it means is:

  • Save it in such a way that when a text editor reads it, it is able to display 01010101.

The UTF-8 representations of the characters 0 and 1 are:
Character 0 => 00110000 = 48 decimal
Character 1 => 00110001 = 49 decimal

  • Hence we take the bit sequences for the characters 0 and 1 and save them.
  • The bit sequence saved on disk will be
    : 0011000000110001001100000011000100110000001100010011000000110001
  • When given the file, the text reader reads it sequentially one byte (8 bits) at a time (each of these characters
    fits in a single UTF-8 byte), looks up in the UTF-8 encoding table which symbol each bit sequence represents, and hence displays 01010101.
  • Hence the text format is like any other protocol in which the reader and writer have a predefined way of reading and
    writing data. The final outcome is human readable because both sides do the lookup in the same UTF-8 table.
  • If one needs to save 0 in a text file (irrespective of whether it is a char or an int), it will need one byte.
  • Everything is treated as a symbol in a text file. Each symbol is mapped to a bit sequence
    using the UTF-8 encoding, and that bit sequence is saved. The integer 12 will be treated as the String "12".
  • cat, less and more all treat bytes as text, hence they end up showing unreadable characters if a binary file is opened.
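
The steps above can be sketched in Java as follows. This is only an illustration: the class name TextFileDemo, the helpers writeAsText and toBits, and the use of a temporary file are assumptions made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class TextFileDemo {
    // Writes the text as UTF-8 into a temporary file and returns the raw bytes found on disk.
    static byte[] writeAsText(String text) {
        try {
            Path file = Files.createTempFile("text-demo", ".txt");
            try {
                Files.write(file, text.getBytes(StandardCharsets.UTF_8));
                return Files.readAllBytes(file);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Renders the bytes as one continuous string of bits, 8 per byte.
    static String toBits(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            String bits = Integer.toBinaryString(b & 0xFF);
            sb.append(String.format("%8s", bits).replace(' ', '0'));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String characterRepresentation = "01010101";
        byte[] onDisk = writeAsText(characterRepresentation);

        // One byte per symbol: 8 symbols need 8 bytes.
        System.out.println(onDisk.length + " bytes");
        // The raw bits: 00110000 for each '0' and 00110001 for each '1'.
        System.out.println(toBits(onDisk));
        // What a text reader does: decode the bytes with the same encoding.
        System.out.println(new String(onDisk, StandardCharsets.UTF_8)); // 01010101
    }
}
```

The text only comes back as 01010101 because the reader decodes with the same encoding the writer used.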

 

Binary file format: (this is a misnomer, as all files are binary by default).
What it really means is: save the data in a way that depends on its type.

  • The integer 0 will be saved in 4 bytes, as an int is 4 bytes in Java.
  • The integer 0 (a base 10 number) will be converted to base 2 and this sequence of bits will be saved, hence it will occupy 32 bits (4 bytes).
  • The character 0 will be saved by looking up its encoding (00110000 in UTF-8) and saving that bit sequence.
  • While reading a binary file, the schema that was used to write the data needs to be available, as
    the file itself carries no information on how to decode the bytes.
  • Do remember that if you have two bytes, they can be decoded as either a char or a short in Java.
  • The interpretation is based on the schema with which the data was written.
  • A binary file can still be opened in a text editor: it will read the bytes sequentially, look them up in the UTF-8 table and display whatever symbols it finds.
  • The int 1 in binary will be saved as 00000000000000000000000000000001.
  • If the binary form of int 1 is read by a text editor, it will not be displayed as the digit 1. The four bytes decode to
    control characters (three NUL characters and U+0001), which is why binary files show up as unreadable characters.
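
Here is a small sketch of the binary side. It uses DataOutputStream over an in-memory buffer instead of a real file to keep the example short. The class name BinaryFormatDemo and its helper methods are assumptions made for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class BinaryFormatDemo {
    // Serializes an int the way DataOutputStream does: 4 bytes, most significant byte first.
    static byte[] intToBytes(int value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // What a text reader would do with the same bytes: decode them as UTF-8.
    static String asText(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // The same two bytes read under two different schemas.
    static char asChar(byte[] twoBytes) {
        return ByteBuffer.wrap(twoBytes).getChar();
    }

    static short asShort(byte[] twoBytes) {
        return ByteBuffer.wrap(twoBytes).getShort();
    }

    public static void main(String[] args) {
        byte[] binaryOne = intToBytes(1);
        System.out.println(binaryOne.length); // 4

        // Decoded as text, int 1 becomes four control characters with code points 0, 0, 0, 1.
        for (char c : asText(binaryOne).toCharArray()) {
            System.out.println((int) c);
        }

        byte[] twoBytes = {0x00, 0x30};
        System.out.println(asChar(twoBytes));  // 0  (the character '0')
        System.out.println(asShort(twoBytes)); // 48 (the number)
    }
}
```

Nothing in the bytes says whether 0x00 0x30 is a char or a short. That knowledge lives only in the schema shared by the writer and the reader.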