I'm trying to do a nested grouping of my data: group by UserID first, then by CampaignID within each user, and then by MetricID within each campaign.
I have the following DataFrame structure:
+----------+--------+------+
|CampaignID|MetricID|UserID|
+----------+--------+------+
| 3| 1| 1|
| 4| 3| 3|
| 4| 2| 3|
| 3| 2| 2|
| 2| 3| 3|
+----------+--------+------+
I wrote the following code to group the rows by UserID:
rdd = newDf.rdd
new = rdd.groupBy(lambda x: x["UserID"]).map(lambda x: (x[0], list(x[1])))
new.take(5)
Output:
[('1',
[Row(CampaignID='3', MetricID='1', UserID='1'),
Row(CampaignID='2', MetricID='1', UserID='1'),
Row(CampaignID='1', MetricID='3', UserID='1'),
])
]
Please note that I have 10k records. So far I have grouped the data by UserID only. I'm trying to figure out how to further group it by CampaignID, then by MetricID, and then count the records that share the same MetricID, producing JSON like the following:
[{
"UserID" : "1",
"data" : [{
"CampaignID" : "1",
"data" : [{
"MetricID" : "1",
"Count" : "5"
}]
}]
}]
There are multiple CampaignIDs and multiple MetricIDs. My idea is to group first and then reduce, checking whether the same MetricID already exists in order to count the records. Any idea or code example would be useful.
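For reference, here is a minimal sketch of the count-then-nest idea in plain Python, so it runs without a Spark session. It assumes the rows have already been brought to the driver as (UserID, CampaignID, MetricID) tuples, which seems feasible at 10k records; the helper name nest_counts is hypothetical, and "Count" is emitted as a string only because the desired JSON shows it that way. In Spark itself, the per-triple counts could probably be computed first with newDf.groupBy("UserID", "CampaignID", "MetricID").count() and only the nesting done on the driver.

```python
from collections import Counter, defaultdict
import json


def nest_counts(rows):
    """Nest (UserID, CampaignID, MetricID) tuples as user -> campaign -> metric counts.

    Hypothetical helper: `rows` is any iterable of 3-tuples, e.g. built from
    newDf.collect() (assumed to fit in driver memory).
    """
    # Count identical (user, campaign, metric) triples.
    counts = Counter(rows)

    # Regroup the counts as user -> campaign -> list of metric entries.
    nested = defaultdict(lambda: defaultdict(list))
    for (user, campaign, metric), n in sorted(counts.items()):
        nested[user][campaign].append({"MetricID": metric, "Count": str(n)})

    # Convert the nested dicts into the list-of-objects layout of the target JSON.
    return [
        {
            "UserID": user,
            "data": [
                {"CampaignID": campaign, "data": metrics}
                for campaign, metrics in campaigns.items()
            ],
        }
        for user, campaigns in nested.items()
    ]


# The five sample rows from the table, reordered as (UserID, CampaignID, MetricID).
rows = [
    ("1", "3", "1"),
    ("3", "4", "3"),
    ("3", "4", "2"),
    ("2", "3", "2"),
    ("3", "2", "3"),
]
result = nest_counts(rows)
print(json.dumps(result, indent=2))
```

Counting the full triple up front avoids a separate "does this MetricID already exist" check: every distinct (user, campaign, metric) combination gets exactly one entry with its count.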