Understanding Spark Serialization

Understanding Spark Serialization , and in the process try to understand when to use lambada function , static,anonymous class and transient references

Before you start with understanding Spark Serialization, please go through the link

 

The Key take away from the link are :

  • Spark follows Java serialization rules, hence no magic is happening.
  • Unlike Map Reduce code where you had separate classes for Mapper, Reducer, Driver . In Spark you just have one class which incorporated the logic of the complete ETL(Map, Reduce Task ) and the driver.
  • The Spark class is the driver hence all the code you see is executed on driver, hence all object instantiation happens on driver. The serialized objects are sent to Executors to work as Task.
  • All Lambda/Anonymous/Static class used with the transformation are instantiated on Driver , serialized and sent to the driver.

 

Some Java Facts

  • Any anonymous class defined inside a outer class has reference to the outer class.
  • If the anonymous class needs to be serialized it will compel you to make the outer class serialized.
  • Inside the lambda function if one uses a method of the enclosing class , the class needs to be serialized , if the lambda function is being serialized.

[addToAppearHere]

Some Facts about Spark.

  • On Same Executor multiple tasks can run at the same time in the same JVM as Tasks are spawned as threads in spark.
  • Any lambda, Anonymous Class used with the spark Transformation function (map, mapPartitions, keyBy , redudeByKey …) will be instantiated on driver, serialized and sent to the executor.
  • To serialize an object means to convert its state to a byte stream so that the byte stream can be reverted back into a copy of the object.
  • A Java object is serializable if its class or any of its super class implements either the java.io.Serializable interface or its subinterface, java.io.Externalizable.

A class is never serialized only object of a class is serialized . Object serialization is needed if object needs to be persisted or transmitted over the network .

Class Component Serialization
instance variable yes
Static instance variable no
methods no
Static methods no
Static inner class no
local variables no
Transient variables no

 

Accessibility and Serializability of instance variable from Outer Class inside inner class objects

 

Inner
class
Instance
Variable
(Outer class)
Static
Instance
Variable
(Outer class)
Local
Variable
(Outer class)
Anonymous
class
Accessible
 And
 Serialized
Accessible
and  
not Serialized
Accessible
And
Serialized
Inner Static
 class
Not
Accessible
Accessible
 yet
not Serialized
Not
 Accessible

[addToAppearHere]

Rule Of Thumb :

  • Avoid using anonymous class , instead use static classes as anonymous class will force you to have the outer class serialized.
  • Avoid using static variables as a work around for serialization issue , as Multiple Task can run inside the same JVM and the static instance might not be thread safe.
  • Use Transient variables to avoid serialization issue , you will have to initialize them inside the function call and not Constructor. As on driver the constructor will be called , on Executor it will de-serialize and for the object . only way to initialize is inside the function call .
  • Use Static class in place of anonymous class.
  • Religiously follow ” attaching implements Serializable ” only for the classes which only needs to be serialized
  • Inside a “lambda function” never refer to outclass method directly , as this will leas to serialization of outer class.
  • Make methods static if it needs to be used within Lambda function directly , else use Class::func() notion but not func() directly
  • Java Map<> doesn’t implement Serializable but HashMap does .
  • Be wise when deciding over using Braodcast vs Raw DataStructures. If you see a real benefit then only use Broadcast.