Spark Input Formats And Parallel Factor

In Spark, every RDD/DataFrame has a parallel factor (its number of partitions), which it inherits from its parent RDD/DataFrame. For shuffle operations such as reduceByKey and combineByKey it can also be set explicitly via the numPartitions argument, and it can be changed at any point using coalesce() or repartition().
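A minimal Java sketch of this behavior (the input path and the partition counts 8, 2, and 16 are arbitrary values chosen only for illustration):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class ParallelFactorDemo {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("ParallelFactorDemo").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Hypothetical input path, used only for illustration.
            JavaRDD<String> lines = sc.textFile("/data/input.txt");
            System.out.println(lines.getNumPartitions()); // decided by the input splits

            // A child RDD inherits its parent's parallel factor.
            JavaRDD<String> upper = lines.map(String::toUpperCase);

            // Shuffle operations like reduceByKey accept numPartitions explicitly.
            JavaPairRDD<String, Integer> counts = lines
                    .mapToPair(line -> new Tuple2<>(line, 1))
                    .reduceByKey(Integer::sum, 8); // 8 partitions set explicitly

            // coalesce()/repartition() change the parallel factor after the fact.
            JavaRDD<String> fewer = upper.coalesce(2);
            JavaRDD<String> more = upper.repartition(16);

            sc.stop();
        }
    }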


The question arises: how does Spark decide the parallel factor of the first RDD?

Each Spark operation maps to an InputFormat in the Spark code:

Text file:
    Call:     javaSparkContext.textFile(inputPath)
    Maps to:  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], minPartitions)

Sequence file:
    Call:     javaSparkContext.sequenceFile(inputPath)
    Maps to:  val inputFormatClass = classOf[SequenceFileInputFormat[K, V]]
              hadoopFile(path, inputFormatClass, keyClass, valueClass, minPartitions)

WholeTextFile:
    Call:     javaSparkContext.wholeTextFiles(inputPath)
    Maps to:  new WholeTextFileRDD(this, classOf[WholeTextFileInputFormat], classOf[Text], classOf[Text], updateConf, minPartitions)

Parquet file:
    Call:     sqlContext.read().parquet(inputPath)
    Maps to:  ParquetFileFormat

The parallel factor of the first RDD/DataFrame is decided by the input splits, which are computed by the InputFormat used in the Spark code. Each input call (textFile, sequenceFile, parquet) maps to the appropriate InputFormat in the Spark Scala code, as the table above shows.
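A rough sketch of this (the paths are hypothetical): the minPartitions argument is only a hint passed down to the InputFormat's split computation, and the splits the InputFormat returns are what set the first RDD's parallel factor.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class FirstRddParallelism {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("FirstRddParallelism").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Default: TextInputFormat computes splits from file size and block size.
            JavaRDD<String> byDefault = sc.textFile("/data/logs.txt"); // hypothetical path
            System.out.println("default partitions: " + byDefault.getNumPartitions());

            // minPartitions is a hint to the InputFormat; the resulting split count
            // becomes the partition count of the first RDD.
            JavaRDD<String> hinted = sc.textFile("/data/logs.txt", 10);
            System.out.println("hinted partitions: " + hinted.getNumPartitions());

            // wholeTextFiles uses WholeTextFileInputFormat (one record per file),
            // so the split computation, and hence parallelism, differs for the same data.
            System.out.println("whole files: " + sc.wholeTextFiles("/data/logs-dir").getNumPartitions());

            sc.stop();
        }
    }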

How to use a custom InputFormat in Spark:

JavaPairRDD<KEY, VALUE> profileJavaRDD = sparkContext.newAPIHadoopFile(
    inputPath, CustomInputFormat.class, KEY.class, VALUE.class,
    sparkContext.hadoopConfiguration());
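For completeness, here is a minimal sketch of what such a custom InputFormat might look like (the class and its behavior are hypothetical, shown only to make the call above concrete). It disables splitting, so each file becomes exactly one split, and therefore the first RDD gets one partition per file:

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

    // Hypothetical example: a text format that never splits files.
    // Since the first RDD's parallel factor equals the number of input splits,
    // this format yields exactly one partition per input file.
    public class UnsplittableTextInputFormat extends FileInputFormat<LongWritable, Text> {

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false; // one split (and hence one partition) per file
        }

        @Override
        public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            // Reuse the standard line reader; only the split computation changes.
            return new LineRecordReader();
        }
    }

With this hypothetical class, the call above would read:

    JavaPairRDD<LongWritable, Text> profileJavaRDD = sparkContext.newAPIHadoopFile(
        inputPath, UnsplittableTextInputFormat.class, LongWritable.class, Text.class,
        sparkContext.hadoopConfiguration());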