Continuously reading all files in an S3 bucket with Spark

I'm fairly new to Spark, and I'm not sure I understand something. Consider the following code, which reads all files in the bucket_name/2021/01/12/ folder into Spark:

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("s3://bucket_name/2021/01/12/*.csv")
  1. Will all the files be loaded into memory? What if there are millions of files there and I hit an OOM?
  2. How does Spark manage not to read the same file twice?
  3. What if new files are added to the folder while the code is running? Will Spark pick them up?
  4. I didn't find this anywhere in the docs, but say my app crashes while reading the folder: is there any way to save a pointer and restart from the last file processed? (Assuming I can't introduce any message brokers and must read from S3 directly; a sketch of what I have in mind follows this list.)
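
For what it's worth, here is a minimal sketch of the direction I had in mind, using Spark Structured Streaming's file source, which as far as I understand tracks already-seen files in a checkpoint and resumes from it after a restart (which would cover questions 2-4). The schema, column names, output path, and checkpoint path below are hypothetical placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StructType, StringType, DoubleType}

val spark = SparkSession.builder()
  .appName("s3-csv-stream")
  .getOrCreate()

// Streaming file sources require an explicit schema;
// inferSchema is not supported with readStream.
val schema = new StructType()
  .add("id", StringType)     // hypothetical column
  .add("value", DoubleType)  // hypothetical column

// Each micro-batch lists the folder and processes only files
// it has not seen before, so newly arriving files are picked up.
val stream = spark.readStream
  .format("csv")
  .option("header", "true")
  .schema(schema)
  .load("s3://bucket_name/2021/01/12/")

// The checkpoint records which files were already processed,
// so a crashed job restarts from where it left off.
val query = stream.writeStream
  .format("parquet")                                              // hypothetical sink
  .option("path", "s3://bucket_name/output/")                     // hypothetical path
  .option("checkpointLocation", "s3://bucket_name/_checkpoints/") // hypothetical path
  .start()

query.awaitTermination()

Even if this is roughly the right pattern, I'm still unsure whether the initial directory listing scales to millions of existing files, which ties back to question 1.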

1 Answer

Waiting for an expert to reply.
