开发者

how to aggregate rows based on a condition in pyspark?

开发者 https://www.devze.com 2022-12-07 21:37 出处:网络
I am trying to aggregate some rows in my pyspark dataframe based on a condition. Here is my dataframe:

I am trying to aggregate some rows in my pyspark dataframe based on a condition. Here is my dataframe:

customer total_purchased location
john 4 Maine
john 3 Nevada
john 5 null
Mary 4 Maine
Mary 4 Florida
Mary 4 null

The result I'm looking to get will look like this:

customer total_purchased location
john 9 Maine
john 8 Nevada
Mar开发者_C百科y 8 Maine
Mary 8 Florida

The rows that had a null location are removed, and the total_purchased from the rows with the null location is added to the total for each of the non-null locations.

Is there a way to do this in pyspark without involving very many steps?

I found a very interesting idea in this post (written by pault): Combine two rows in Pyspark if a condition is met

But I wasn't able to implement it because there isn't a column to group by so easily in this scenario.


Filter all null records, group by "customer" and sum the "total_purchased". This way it will work even if you have multiple null records for a customer.

Then filter all not-null records, join above dataframe and sum the "total_purchased" with joined entry.

df = spark.createDataFrame(data=[["john",4,"Maine"],["john",3,"Nevada"],["john",5,None],["Mary",4,"Maine"],["Mary",4,"Florida"],["Mary",4,None]], schema=["customer","total_purchased","location"])

df_null = df.filter(F.col("location").isNull()) \
            .groupBy("customer") \
            .agg(F.sum("total_purchased").alias("total_for_null"))

df = df.filter(F.col("location").isNotNull()) \
       .join(df_null, on="customer", how="left") \
       .withColumn("total_purchased", F.col("total_purchased") + F.col("total_for_null")) \
       .drop("total_for_null")

Output:

+--------+---------------+--------+
|customer|total_purchased|location|
+--------+---------------+--------+
|Mary    |8              |Maine   |
|Mary    |8              |Florida |
|john    |9              |Maine   |
|john    |8              |Nevada  |
+--------+---------------+--------+
0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号