amazon web services - AWS Glue Crawler Classifies json file as UNKNOWN

Question

asked Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I'm working on an ETL job that will ingest JSON files into an RDS staging table. The crawler I've configured classifies JSON files without issue as long as they are under 1 MB in size. If I minify a file (instead of pretty-printing it), the crawler classifies it without issue as long as the result is under 1 MB.

I'm having trouble coming up with a workaround. I tried converting the JSON to BSON and gzipping the JSON file, but in both cases the file is still classified as UNKNOWN.

Has anyone else run into this issue? Is there a better way to do this?

1 Answer

深蓝 · Answer 1 · 2021-10-23T19:30:32+0000

I have two JSON files, 42 MB and 16 MB, partitioned on S3 under these paths:

s3://bucket/stg/year/month/_0.json
s3://bucket/stg/year/month/_1.json

I had the same problem as you: the crawler classified them as UNKNOWN.

I was able to solve it:

Create a custom classifier with the jsonPath "$[*]", then create a new crawler that uses this classifier.
Run the new crawler against the data on S3 and the proper schema will be created.
Do NOT add the classifier to your existing crawler; the change won't take effect. I don't know why, but it may be related to the classifier versioning AWS mentions in its documentation. Creating a new crawler made it work.
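
For anyone who prefers to script this rather than click through the console, here is a rough sketch of the same steps using boto3. It is only an illustration, not something I ran against the files above: the classifier name, crawler name, database name and role ARN are placeholders I made up, and the S3 path is just the prefix from my example. The request-building helpers are kept separate from the AWS calls so the payloads can be inspected without credentials.

```python
def classifier_request(name, json_path="$[*]"):
    """Build keyword arguments for glue.create_classifier.

    "$[*]" is the jsonPath from the answer above.
    """
    return {"JsonClassifier": {"Name": name, "JsonPath": json_path}}


def crawler_request(name, role_arn, database, s3_path, classifier_name):
    """Build keyword arguments for glue.create_crawler.

    The custom classifier is attached when the crawler is created,
    rather than added to an existing crawler afterwards.
    """
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        "Classifiers": [classifier_name],
    }


def create_and_run(glue, classifier_kwargs, crawler_kwargs):
    """Create the classifier, create a NEW crawler that uses it, then start it.

    `glue` is expected to be a boto3 Glue client, e.g. boto3.client("glue").
    """
    glue.create_classifier(**classifier_kwargs)
    glue.create_crawler(**crawler_kwargs)
    glue.start_crawler(Name=crawler_kwargs["Name"])


# Example usage (all names below are hypothetical placeholders):
#
#   import boto3
#   glue = boto3.client("glue")
#   create_and_run(
#       glue,
#       classifier_request("json-array-classifier"),
#       crawler_request(
#           name="stg-json-crawler",
#           role_arn="arn:aws:iam::123456789012:role/YourGlueCrawlerRole",
#           database="staging_db",
#           s3_path="s3://bucket/stg/",
#           classifier_name="json-array-classifier",
#       ),
#   )
```

The IAM role needs read access to the S3 prefix, and the crawler name must not already exist in the account.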
