PySpark Challenge: Analyze Transactions by Month and Country

🚀 PySpark Challenge for Data Engineers!

You are given a dataset transactions_df with the columns id, country, state, amount, and trans_date. id is the primary key, and state is an enum column with the values ["approved", "declined"].

🔹 Task: Using PySpark, write a query that finds, for each month and country, the number of transactions and their total amount, and the number of approved transactions and their total amount.

📊 Example:

transactions_df

id  | country | state    | amount | trans_date
121 | US      | approved | 1000   | 2018-12-18
122 | US      | declined | 2000   | 2018-12-19
123 | US      | approved | 2000   | 2019-01-01
124 | DE      | approved | 2000   | 2019-01-07

✅ Expected Output:

month   | country | trans_count | approved_count | trans_total_amount | approved_total_amount
2018-12 | US      | 2           | 1              | 3000               | 1000
2019-01 | US      | 1           | 1              | 2000               | 2000
2019-01 | DE      | 1           | 1              | 2000               | 2000

Comment down your approach and follow for more!

#interviewpreparation #interview #pyspark #dataengineer #jobinterview

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum as _sum, count, when, date_format

# Create the Spark session
spark = SparkSession.builder.appName("TransactionAnalysis").getOrCreate()

# Example data matching the challenge
data = [
    (121, "US", "approved", 1000, "2018-12-18"),
    (122, "US", "declined", 2000, "2018-12-19"),
    (123, "US", "approved", 2000, "2019-01-01"),
    (124, "DE", "approved", 2000, "2019-01-07"),
]
columns = ["id", "country", "state", "amount", "trans_date"]
transactions_df = spark.createDataFrame(data, columns)

# Derive a yyyy-MM month bucket from trans_date
transactions_df = transactions_df.withColumn("month", date_format(col("trans_date"), "yyyy-MM"))

# Group by month and country; conditional sums isolate the approved rows.
# Aggregations are ordered to match the expected output columns.
result_df = transactions_df.groupBy("month", "country").agg(
    count("*").alias("trans_count"),
    _sum(when(col("state") == "approved", 1).otherwise(0)).alias("approved_count"),
    _sum("amount").alias("trans_total_amount"),
    _sum(when(col("state") == "approved", col("amount")).otherwise(0)).alias("approved_total_amount"),
)

# Show the result
result_df.show(truncate=False)
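
If you prefer SQL over the DataFrame API, the same aggregation can be expressed with spark.sql. A minimal sketch, assuming transactions_df from above; the temp view name "transactions" is my choice, not part of the original challenge:

# Register the DataFrame as a temp view so it can be queried with SQL
transactions_df.createOrReplaceTempView("transactions")

# CASE WHEN plays the same role as the when/otherwise conditional sums above
sql_result_df = spark.sql("""
    SELECT date_format(trans_date, 'yyyy-MM') AS month,
           country,
           COUNT(*) AS trans_count,
           SUM(CASE WHEN state = 'approved' THEN 1 ELSE 0 END) AS approved_count,
           SUM(amount) AS trans_total_amount,
           SUM(CASE WHEN state = 'approved' THEN amount ELSE 0 END) AS approved_total_amount
    FROM transactions
    GROUP BY date_format(trans_date, 'yyyy-MM'), country
""")
sql_result_df.show(truncate=False)

Both versions produce the same grouped result; the SQL form is often easier to drop into an interview whiteboard, while the DataFrame form composes better in pipelines.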
