Spark Input Formats and Parallel Factor
In Spark, RDDs/DataFrames have a parallel factor (the number of partitions) that they inherit from their parent RDD/DataFrame. For shuffle operations such as reduceByKey and combineByKey it can also be set explicitly via a numPartitions argument, and coalesce() or repartition() can change it after the fact, as the sketch below shows.
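For illustration, here is a minimal Java sketch (the local master and input path are hypothetical placeholders) of a child RDD inheriting its parent's partition count, plus the two explicit knobs mentioned above:

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class ParallelismDemo {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "parallelism-demo");

        // A narrow transformation produces a child RDD with the
        // same partition count as its parent.
        JavaRDD<String> lines = sc.textFile("hdfs:///tmp/input.txt");
        JavaPairRDD<String, Integer> pairs =
                lines.mapToPair(w -> new Tuple2<>(w, 1));
        System.out.println(pairs.getNumPartitions()); // same as lines

        // Shuffle operations accept an explicit partition count ...
        JavaPairRDD<String, Integer> counts = pairs.reduceByKey(Integer::sum, 8);

        // ... and coalesce()/repartition() change it after the fact.
        JavaPairRDD<String, Integer> fewer = counts.coalesce(2);
        System.out.println(fewer.getNumPartitions()); // 2

        sc.stop();
    }
}
```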
The question arises: how does Spark decide the parallel factor of the first RDD?
| Spark Operation | Input Format in Spark Code |
| --- | --- |
| Text file: javaSparkContext.textFile(inputPath) | hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], minPartitions) |
| Sequence file: javaSparkContext.sequenceFile(inputPath) | hadoopFile(path, classOf[SequenceFileInputFormat[K, V]], keyClass, valueClass, minPartitions) |
| Whole text file: javaSparkContext.wholeTextFiles(inputPath) | new WholeTextFileRDD(this, classOf[WholeTextFileInputFormat], classOf[Text], classOf[Text], updateConf, minPartitions) |
| Parquet file: sqlContext.read().parquet(inputPath) | ParquetFileFormat |
The parallel factor of the first RDD/DataFrame is decided by the input splits, which are computed by the InputFormat used in the Spark code. Each input function call (textFile, sequenceFile, parquet) maps to an appropriate InputFormat in Spark's Scala code.
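To see how the InputFormat and the minPartitions hint determine that first parallel factor, here is a minimal sketch (the log path and local master are hypothetical placeholders):

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class InputSplitDemo {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "split-demo");

        // With no hint, TextInputFormat computes splits from the file's
        // layout, typically one split per HDFS block.
        JavaRDD<String> byBlocks = sc.textFile("hdfs:///tmp/big.log");
        System.out.println(byBlocks.getNumPartitions());

        // minPartitions is forwarded to the InputFormat's split
        // computation as a hint; the actual split count can differ.
        JavaRDD<String> atLeastTen = sc.textFile("hdfs:///tmp/big.log", 10);
        System.out.println(atLeastTen.getNumPartitions());

        sc.stop();
    }
}
```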
How to use a custom InputFormat in Spark:
JavaPairRDD<KEY, VALUE> profileJavaRDD = sparkContext.newAPIHadoopFile(inputPath, CustomInputFormat.class, KEY.class, VALUE.class, sparkContext.hadoopConfiguration());
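As a concrete, runnable variant, the sketch below plugs Hadoop's KeyValueTextInputFormat in as a stand-in for a custom format (the input path and local master are hypothetical assumptions):

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CustomInputFormatDemo {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "custom-format-demo");

        // KeyValueTextInputFormat splits each line at the first tab into a
        // (key, value) pair; any InputFormat subclass plugs in the same way.
        JavaPairRDD<Text, Text> records = sc.newAPIHadoopFile(
                "hdfs:///tmp/kv.tsv",
                KeyValueTextInputFormat.class,
                Text.class,
                Text.class,
                sc.hadoopConfiguration());

        // The partition count comes from the splits computed by the InputFormat.
        System.out.println(records.getNumPartitions());

        sc.stop();
    }
}
```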