Getting Started with Spark SQL

This tutorial introduces Spark SQL programming in a JupyterLab environment using Python. It demonstrates how to:

  • Query Hive Tables: Use Spark SQL queries to retrieve data from tables stored in Hive.
  • Filter and Aggregate Data: Filter results based on specific conditions and perform aggregations like counting occurrences within a partition range.
  • Work with Spark DataFrames: Store the queried data in Spark DataFrames, which offer powerful functionalities for manipulating and analyzing data.
  • Convert to Pandas DataFrames (Optional): Convert Spark DataFrames to Pandas DataFrames to take advantage of Pandas' extensive data analysis capabilities (an example appears at the end of this tutorial).
Start by importing Pandas, which is needed later if you convert the Spark DataFrame to a Pandas DataFrame:

import pandas as pd

The Spark SQL code below queries a Hive table, counts the records in each date partition within a given range, and stores the results (partition name and record count) in a Spark DataFrame for further analysis.

phrase_df = spark.sql("""
SELECT event_partition, COUNT(*) AS records
FROM DATASET_NAME.TABLE_NAME   -- replace with your Hive database and table
WHERE event_partition BETWEEN '2024-01-01' AND '2024-01-30'
GROUP BY event_partition
ORDER BY 2 DESC
""")

The code below displays the contents of the Spark DataFrame phrase_df with the following options:

  • Number of rows: shows at most 10 rows.
  • Truncation: displays full column values without truncating them.

phrase_df.show(10, False)  # display up to 10 rows without truncating column values

The code snippet below demonstrates how to group data by a specific column using the Spark DataFrame API after retrieving data from Hive with Spark SQL:

group_df = phrase_df.groupBy("records").count()  # count how many partitions share each record count
group_df.show()

Cache the results in memory so that subsequent actions reuse them instead of re-running the query:

phrase_df.cache()  # marks the DataFrame for in-memory caching; it is materialized the next time an action runs
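
As mentioned at the start of this tutorial, a Spark DataFrame can optionally be converted to a Pandas DataFrame to take advantage of the Pandas ecosystem. Below is a minimal sketch, assuming the phrase_df DataFrame above is small enough to fit in driver memory (the pandas_df variable name is illustrative):

pandas_df = phrase_df.toPandas()   # collects all rows to the driver, so use only for small result sets
print(pandas_df.head())            # standard Pandas operations are now available
print(pandas_df["records"].sum())  # for example, total records across the queried partitions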

The best way to understand and learn these operations is through hands-on experience. Follow the steps below to create the sample notebook in your Syntasa environment:

  1. Download the sample notebook .ipynb file from this article.
  2. Create a new notebook in your Syntasa environment using the import notebook option.