I'm fairly new to Spark, and I'm not sure I understand something.
Say I have the following code to read all the files in the bucket_name/2021/01/12/ folder into Spark:
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.load("s3://bucket_name/2021/01/12/*.csv")
- Will all the files be loaded into memory? What if there are millions of files in the folder and I get an OOM?
- How does Spark manage not to read the same file twice?
- If new files are added to the folder while the code is running, will Spark pick them up?
- I couldn't find this anywhere in the docs, but say my app crashes while reading the folder: is there any way to save a pointer and restart from the last file processed? (Assume I can't introduce any message brokers and must read directly from S3.)
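For the last point, the kind of bookkeeping I have in mind is something like this (a plain-Scala sketch with no Spark involved; `pendingKeys` takes the result of an S3 listing, e.g. from the AWS SDK, and the `processed_keys.txt` ledger file is a name I made up):

```scala
import java.nio.file.{Files, Paths, Path, StandardOpenOption}
import scala.jdk.CollectionConverters._

object CheckpointSketch {
  // Hypothetical local ledger recording the keys we've already processed.
  val ledgerPath: Path = Paths.get("processed_keys.txt")

  // Read back the ledger on restart (empty set on first run).
  def alreadyProcessed(): Set[String] =
    if (Files.exists(ledgerPath))
      Files.readAllLines(ledgerPath).asScala.toSet
    else Set.empty

  // Append a key to the ledger after its file is successfully handled.
  def markProcessed(key: String): Unit =
    Files.write(ledgerPath, (key + "\n").getBytes,
      StandardOpenOption.CREATE, StandardOpenOption.APPEND)

  // Given a fresh listing of the folder, keep only the unprocessed keys.
  def pendingKeys(allKeys: Seq[String]): Seq[String] =
    allKeys.filterNot(alreadyProcessed())
}
```

So the question is really whether Spark (or some S3 feature) gives me this for free, or whether this manual ledger is the way to go.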