Getting Started with Spark SQL

This tutorial introduces Spark SQL programming in a JupyterLab environment using Python. It demonstrates how to:

  • Query Hive Tables: Use Spark SQL queries to retrieve data from tables stored in Hive.
  • Filter and Aggregate Data: Filter results based on specific conditions and perform aggregations like counting occurrences within a partition range.
  • Work with Spark DataFrames: Store the queried data in Spark DataFrames, which offer powerful functionalities for manipulating and analyzing data.
  • Convert to Pandas DataFrames (Optional): Convert Spark DataFrames to Pandas DataFrames to take advantage of Pandas' extensive data analysis capabilities (an example appears at the end of this tutorial).
Start by importing Pandas, which is needed later if you convert the Spark DataFrame to a Pandas DataFrame:

import pandas as pd

The Spark SQL code below queries a Hive table, counts the records in each date partition within a given range, and stores the results (partition name and record count) in a Spark DataFrame for further analysis.

phrase_df = spark.sql("""
SELECT event_partition, COUNT(*) AS records
FROM DATASET_NAME.TABLE_NAME   -- replace with your Hive database and table
WHERE event_partition BETWEEN '2024-01-01' AND '2024-01-30'
GROUP BY event_partition
ORDER BY 2 DESC
""")

The code below displays the contents of the Spark DataFrame phrase_df with the following options:

  • Number of rows: shows at most 10 rows.
  • Truncation: displays full column values without truncating them.

phrase_df.show(10, False)  # display up to 10 rows without truncating column values

The code snippet below demonstrates how to group data by a specific column using the Spark DataFrame API after retrieving data from Hive with Spark SQL:

group_df = phrase_df.groupBy("records").count()  # count how many partitions share each record count
group_df.show()

Cache the results in memory so that subsequent actions reuse them instead of re-running the query:

phrase_df.cache()  # marks the DataFrame for in-memory caching; it is materialized the next time an action runs
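
As mentioned at the start of this tutorial, a Spark DataFrame can optionally be converted to a Pandas DataFrame to take advantage of the Pandas ecosystem. Below is a minimal sketch, assuming the phrase_df DataFrame above is small enough to fit in driver memory (the pandas_df variable name is illustrative):

pandas_df = phrase_df.toPandas()   # collects all rows to the driver, so use only for small result sets
print(pandas_df.head())            # standard Pandas operations are now available
print(pandas_df["records"].sum())  # for example, total records across the queried partitions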

The best way to understand and learn these operations is through hands-on experience. Follow the steps below to create the sample notebook in your Syntasa environment:

  1. Download the sample notebook .ipynb file from this article.
  2. Create a new notebook in your Syntasa environment using the import notebook option.