from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession before running any queries
spark = SparkSession.builder.appName("pyspark-sql-examples").getOrCreate()

# Sample data
data = [(1, "Alice", 25), (2, "Bob", 30), (3, "Alice", 25)]
df = spark.createDataFrame(data, ["id", "name", "age"])
df.createOrReplaceTempView("people")
# Select distinct names
unique_names_df = spark.sql(
    "SELECT DISTINCT name "
    "FROM people"
)
# Show result
unique_names_df.show()
# Count the number of rows in the people view
count_df = spark.sql(
    "SELECT COUNT(*) AS total_count "
    "FROM people"
)
# Show result
count_df.show()
# Group by name and count occurrences
group_by_count_df = spark.sql(
    "SELECT name, COUNT(*) AS name_count "
    "FROM people "
    "GROUP BY name"
)
# Show result
group_by_count_df.show()
# Sample data
data = [("Alice", 1000), ("Bob", 1500), ("Alice", 2000)]
df2 = spark.createDataFrame(data, ["name", "salary"])
df2.createOrReplaceTempView("salaries")
# Group by name and sum salaries
sum_salaries_df = spark.sql(
    "SELECT name, SUM(salary) AS total_salary "
    "FROM salaries "
    "GROUP BY name"
)
# Show result
sum_salaries_df.show()
# Group by name and calculate average salary
avg_salaries_df = spark.sql(
    "SELECT name, AVG(salary) AS avg_salary "
    "FROM salaries "
    "GROUP BY name"
)
# Show result
avg_salaries_df.show()
# Filter records where salary is greater than 1200
filter_df = spark.sql(
    "SELECT * "
    "FROM salaries "
    "WHERE salary > 1200"
)
# Show result
filter_df.show()
# Order records by salary in descending order
order_by_df = spark.sql(
    "SELECT * "
    "FROM salaries "
    "ORDER BY salary DESC"
)
# Show result
order_by_df.show()
from pyspark.sql.functions import explode, split, col
# Sample data
data = [("Hello world",), ("Hello PySpark",), ("Spark is great",)]
df = spark.createDataFrame(data, ["text"])
# Split the text into words
words_df = df.select(explode(split(col("text"), " ")).alias("word"))
# Count occurrences of each word
word_count_df = words_df.groupBy("word").count()
# Show result
word_count_df.show()
# Sample data
data = [(1, "Alice", 25), (2, "Bob", 30), (3, "Cathy", 22)]
df = spark.createDataFrame(data, ["id", "name", "age"])
# Filter rows where age > 25
filtered_df = df.filter(col("age") > 25)
# Show result
filtered_df.show()
# Sample data for DataFrame 1
data1 = [(1, "Alice"), (2, "Bob"), (3, "Cathy")]
df1 = spark.createDataFrame(data1, ["id", "name"])
# Sample data for DataFrame 2
data2 = [(1, "HR"), (2, "Engineering"), (4, "Marketing")]
df2 = spark.createDataFrame(data2, ["id", "department"])
# Inner join on 'id'
joined_df = df1.join(df2, on="id", how="inner")
# Show result
joined_df.show()
from pyspark.sql.functions import avg
# Sample data
data = [("Alice", "HR", 25), ("Bob", "Engineering", 30), ("Cathy", "HR", 28)]
df = spark.createDataFrame(data, ["name", "department", "age"])
# Group by department and calculate average age
avg_age_df = df.groupBy("department").agg(avg("age").alias("avg_age"))
# Show result
avg_age_df.show()
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
# Sample data
data = [(1, "Alice"), (2, "Bob"), (3, "Cathy")]
df = spark.createDataFrame(data, ["id", "name"])
# Define a UDF to add a prefix to a name
def add_prefix(name):
    return "Mr./Ms. " + name
add_prefix_udf = udf(add_prefix, StringType())
# Apply the UDF
df_with_prefix = df.withColumn("name_with_prefix", add_prefix_udf(col("name")))
# Show result
df_with_prefix.show()
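# For a simple prefix like this, the built-in concat and lit functions give the
# same result without the Python UDF's serialization overhead. This is an
# optional alternative sketch, not required by the example above.
from pyspark.sql.functions import concat, lit
df_with_prefix_builtin = df.withColumn("name_with_prefix", concat(lit("Mr./Ms. "), col("name")))
df_with_prefix_builtin.show()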
# Sample data with missing values
data = [(1, "Alice", 25), (2, "Bob", None), (3, "Cathy", 28)]
df = spark.createDataFrame(data, ["id", "name", "age"])
# Fill missing values in 'age' with a default value of 0
filled_df = df.na.fill({"age": 0})
# Show result
filled_df.show()
# Sample data
data = [(1, "Alice", 25), (2, "Bob", 30), (3, "Cathy", 22)]
df = spark.createDataFrame(data, ["id", "name", "age"])
# Write DataFrame to CSV
df.write.csv("/path/to/output", header=True)
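# To sanity-check the write, the CSV output can be read back into a DataFrame.
# A minimal sketch: "/path/to/output" is the same placeholder path used above,
# and inferSchema=True asks Spark to infer column types from the data.
df_check = spark.read.csv("/path/to/output", header=True, inferSchema=True)
df_check.show()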
In this guide, we will take a deep dive into managing data lakes on Databricks using SQL within PySpark. A data lake is a centralized repository that lets you store all of your structured and unstructured data at any scale.
Let's start by loading data into the data lake with a PySpark SQL query. Spark SQL can query a file directly by prefixing the backtick-quoted path with its format (here, csv):
spark.sql("SELECT * FROM csv.`/mnt/data/sample.csv`").show()
After loading data into the data lake, you can perform transformations using SQL queries in PySpark. For example, assuming a sales table (or temporary view) has been registered, the following groups rows by category:
df_grouped = spark.sql("SELECT category, SUM(price) AS total_price FROM sales GROUP BY category")
df_grouped.show()
Once you have processed the data, you can write it back to your data lake in various formats, such as Parquet.
df_grouped.write.format("parquet").save("/mnt/data/output/")
This query finds duplicate records based on a specific column (assuming a customers table or view is available):
spark.sql("SELECT email, COUNT(email) AS email_count FROM customers GROUP BY email HAVING COUNT(email) > 1").show()
This query retrieves the top 10 categories by total price:
spark.sql("SELECT category, SUM(price) FROM sales GROUP BY category ORDER BY SUM(price) DESC LIMIT 10").show()
The same data-lake workflow can also be expressed with the PySpark DataFrame API on Databricks. Let's again start by loading data from the data lake; inferSchema is enabled so that numeric columns such as price are read with numeric types:
df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/mnt/data/sample.csv")
df.show()
After loading data into the data lake, you can perform transformations using PySpark. For example, let's perform a group by operation.
df_grouped = df.groupBy("category").sum("price")
df_grouped.show()
Once you have processed the data, you can write it back to your data lake in various formats such as Parquet or Delta. Using mode("overwrite") replaces any existing output at the target path:
df_grouped.write.format("parquet").mode("overwrite").save("/mnt/data/output/")
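Since Delta is mentioned above, here is a minimal sketch of the equivalent Delta write. It assumes a Delta-enabled runtime (which Databricks provides out of the box), and the output path is illustrative:
df_grouped.write.format("delta").mode("overwrite").save("/mnt/data/output_delta/")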
This code finds duplicate records based on a specific column (here, assuming the loaded DataFrame has an email column):
df_duplicates = df.groupBy("email").count().filter("count > 1")
df_duplicates.show()
This code retrieves the top categories by total price, mirroring the SQL GROUP BY / ORDER BY query shown earlier.
df_top_categories = df.groupBy("category").sum("price").orderBy("sum(price)", ascending=False)
df_top_categories.show(10)
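Note that show(10) only controls how many rows are printed; to keep just the top 10 rows in the DataFrame itself (matching the SQL LIMIT 10 used earlier), limit can be used:
df_top10 = df_top_categories.limit(10)
df_top10.show()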