Spark Input Formats And Parallel Factor

In Spark, every RDD/DataFrame has a parallel factor (its number of partitions), which it inherits from its parent RDD/DataFrame. For shuffle operations such as reduceByKey and combineByKey it can also be set explicitly via the numPartitions argument, and it can be changed at any point using coalesce() or repartition().
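A minimal Java sketch of this behavior (the input path and the partition counts 8, 2, and 16 are arbitrary values chosen only for illustration):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class ParallelFactorDemo {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("ParallelFactorDemo").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Hypothetical input path, used only for illustration.
            JavaRDD<String> lines = sc.textFile("/data/input.txt");
            System.out.println(lines.getNumPartitions()); // decided by the input splits

            // A child RDD inherits its parent's parallel factor.
            JavaRDD<String> upper = lines.map(String::toUpperCase);

            // Shuffle operations like reduceByKey accept numPartitions explicitly.
            JavaPairRDD<String, Integer> counts = lines
                    .mapToPair(line -> new Tuple2<>(line, 1))
                    .reduceByKey(Integer::sum, 8); // 8 partitions set explicitly

            // coalesce()/repartition() change the parallel factor after the fact.
            JavaRDD<String> fewer = upper.coalesce(2);
            JavaRDD<String> more = upper.repartition(16);

            sc.stop();
        }
    }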


The question arises: how does Spark decide the parallel factor of the first RDD?

Each Spark operation maps to an InputFormat in the Spark code:

Text file:
    Call:     javaSparkContext.textFile(inputPath)
    Maps to:  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], minPartitions)

Sequence file:
    Call:     javaSparkContext.sequenceFile(inputPath)
    Maps to:  val inputFormatClass = classOf[SequenceFileInputFormat[K, V]]
              hadoopFile(path, inputFormatClass, keyClass, valueClass, minPartitions)

WholeTextFile:
    Call:     javaSparkContext.wholeTextFiles(inputPath)
    Maps to:  new WholeTextFileRDD(this, classOf[WholeTextFileInputFormat], classOf[Text], classOf[Text], updateConf, minPartitions)

Parquet file:
    Call:     sqlContext.read().parquet(inputPath)
    Maps to:  ParquetFileFormat

The parallel factor of the first RDD/DataFrame is decided by the input splits, which are computed by the InputFormat used in the Spark code. Each input call (textFile, sequenceFile, parquet) maps to the appropriate InputFormat in the Spark Scala code, as the table above shows.
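A rough sketch of this (the paths are hypothetical): the minPartitions argument is only a hint passed down to the InputFormat's split computation, and the splits the InputFormat returns are what set the first RDD's parallel factor.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class FirstRddParallelism {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("FirstRddParallelism").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Default: TextInputFormat computes splits from file size and block size.
            JavaRDD<String> byDefault = sc.textFile("/data/logs.txt"); // hypothetical path
            System.out.println("default partitions: " + byDefault.getNumPartitions());

            // minPartitions is a hint to the InputFormat; the resulting split count
            // becomes the partition count of the first RDD.
            JavaRDD<String> hinted = sc.textFile("/data/logs.txt", 10);
            System.out.println("hinted partitions: " + hinted.getNumPartitions());

            // wholeTextFiles uses WholeTextFileInputFormat (one record per file),
            // so the split computation, and hence parallelism, differs for the same data.
            System.out.println("whole files: " + sc.wholeTextFiles("/data/logs-dir").getNumPartitions());

            sc.stop();
        }
    }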

How to use a custom InputFormat in Spark:

JavaPairRDD<KEY, VALUE> profileJavaRDD = sparkContext.newAPIHadoopFile(
    inputPath, CustomInputFormat.class, KEY.class, VALUE.class,
    sparkContext.hadoopConfiguration());
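For completeness, here is a minimal sketch of what such a custom InputFormat might look like (the class and its behavior are hypothetical, shown only to make the call above concrete). It disables splitting, so each file becomes exactly one split, and therefore the first RDD gets one partition per file:

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

    // Hypothetical example: a text format that never splits files.
    // Since the first RDD's parallel factor equals the number of input splits,
    // this format yields exactly one partition per input file.
    public class UnsplittableTextInputFormat extends FileInputFormat<LongWritable, Text> {

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false; // one split (and hence one partition) per file
        }

        @Override
        public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            // Reuse the standard line reader; only the split computation changes.
            return new LineRecordReader();
        }
    }

With this hypothetical class, the call above would read:

    JavaPairRDD<LongWritable, Text> profileJavaRDD = sparkContext.newAPIHadoopFile(
        inputPath, UnsplittableTextInputFormat.class, LongWritable.class, Text.class,
        sparkContext.hadoopConfiguration());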