scala - How to read parquet files using `ssc.fileStream()`? What are the types passed to `ssc.fileStream()`?

Question

Welcome To Ask or Share your Answers For Others

scala - How to read parquet files using `ssc.fileStream()`? What are the types passed to `ssc.fileStream()`?

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

scala - How to read parquet files using `ssc.fileStream()`? What are the types passed to `ssc.fileStream()`?

My understanding of Spark's fileStream() method is that it takes three types as parameters: Key, Value, and Format. In case of text files, the appropriate types are: LongWritable, Text, and TextInputFormat.

First, I want to understand the nature of these types. Intuitively, I would guess that the Key in this case is the line number of the file, and the Value is the text on that line. So, in the following example of a text file:

Hello
Test
Another Test

The first row of the DStream would have a Key of 1 (0?) and a Value of Hello.

Is this correct?

Second part of my question: I looked at the decompiled implementation of ParquetInputFormat and I noticed something curious:

public class ParquetInputFormat<T>
       extends FileInputFormat<Void, T> {
//...

public class TextInputFormat
       extends FileInputFormat<LongWritable, Text>
       implements JobConfigurable {
//...

TextInputFormat extends FileInputFormat of types LongWritable and Text, whereas ParquetInputFormat extends the same class of types Void and T.

Does this mean that I must create a Value class to hold an entire row of my parquet data, and then pass the types <Void, MyClass, ParquetInputFormat<MyClass>> to ssc.fileStream()?

If so, how should I implement MyClass?

EDIT 1: I have noticed a readSupportClass which is to be passed to ParquetInputFormat objects. What kind of class is this, and how is it used to parse the parquet file? Is there some documentation that covers this?

EDIT 2: As far as I can tell, this is impossible. If anybody knows how to stream in parquet files to Spark then please feel free to share...

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Answer

深蓝 · Answer 1 · 2021-10-23T19:36:23+0000

My sample to read parquet files in Spark Streaming is below.

val ssc = new StreamingContext(sparkConf, Seconds(2))
ssc.sparkContext.hadoopConfiguration.set("parquet.read.support.class", "parquet.avro.AvroReadSupport")
val stream = ssc.fileStream[Void, GenericRecord, ParquetInputFormat[GenericRecord]](
  directory, { path: Path => path.toString.endsWith("parquet") }, true, ssc.sparkContext.hadoopConfiguration)

val lines = stream.map(row => {
  println("row:" + row.toString())
  row
})

Some points are ...

record type is GenericRecord
readSupportClass is AvroReadSupport
pass Configuration to fileStream
set parquet.read.support.class to the Configuration

I referred to source codes below for creating sample.
And I also could not find good examples.
I would like to wait better one.

https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala
https://github.com/Parquet/parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/ParquetInputFormat.java
https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala

Categories

scala - How to read parquet files using `ssc.fileStream()`? What are the types passed to `ssc.fileStream()`?

scala - How to read parquet files using `ssc.fileStream()`? What are the types passed to `ssc.fileStream()`?

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags