Zeppelin Notebooks

It is possible to create a SPARK runtime with Apache Zeppelin enabled.

Follow the steps below:

  1. Go to Runtime and click the plus icon.
  2. Insert your configurations for the SPARK runtime:
    • Turn on the Zeppelin switch for your runtime (turn off "Termination on completion" so that you have more time to experiment with the notebook).
    • Ensure "Idle Time Deletion Interval In secs" is set; the recommended default is 600.
  3. Run any job from GCP QA with this created runtime, for example a BQ Loader job or an AA Loader job.
  4. When a cluster is spun up with your runtime, go to STEP LOGS.
  5. Click Details in the INFRASTRUCTURE CREATE CLUSTER step. A window with the Zeppelin URL should be displayed.
  6. Copy and paste the Zeppelin URL into your local browser.
  7. You now have access to all the data from the Zeppelin notebook.

Available Interpreters

  • Spark (version 2+)
  • BigQuery
  • Python
  • ...and many more

Configuring Zeppelin with BigQuery

BigQuery requires a minor setup of PROJECT_ID when creating a new Zeppelin notebook.
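For reference, here is a minimal sketch of that setup, assuming the standard Apache Zeppelin BigQuery interpreter; the project ID, dataset, and table names below are placeholders. In the interpreter settings, set the project ID property:

    zeppelin.bigquery.project_id = my-gcp-project-id

A notebook paragraph bound to the BigQuery interpreter can then query directly:

    %bigquery
    -- my_dataset.my_table is a placeholder, for illustration only
    SELECT COUNT(*) AS row_count
    FROM my_dataset.my_table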

Using Zeppelin Notebook with Spark

There are no prerequisites for using Spark with Zeppelin. When you create a new notebook, you can attach a Spark interpreter to it and start writing your Spark code.
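As a quick sanity check, here is a minimal Scala paragraph you can paste into a new notebook (the %spark prefix selects the Spark interpreter if it is not the default):

    %spark
    // confirm the interpreter is wired up
    spark.version

    // run a small distributed computation: sum of 1..100
    spark.range(1, 101).selectExpr("sum(id) as total").show()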

Refer to the step-by-step guide below for using the Spark runtime in your Zeppelin notebook:

ZeppelinNotebooks01.png

ZeppelinNotebooks02.png

Notice the URL in your browser: the last section of the URL contains a unique code (e.g., 2ECQ4APT9). This is the name of the folder where the JSON form of the notebook is stored.

All notebooks are stored at the "PROJECT-ID/syn-cluster-config/zeppelin-notebooks" path:

ZeppelinNotebooks03.png

ZeppelinNotebooks04.png

ZeppelinNotebooks05.png

spark.version // prints the Spark version // res4: String = 2.3.3

ZeppelinNotebooks06.png

spark.sql("show databases").show(50, false)

Show databases from your notebook:

ZeppelinNotebooks07.png

Show tables in your database:

ZeppelinNotebooks08.png
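The screenshot above corresponds to a paragraph along these lines; my_database is a placeholder for one of the databases listed in the previous step:

    spark.sql("show tables in my_database").show(50, false)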

Look at the schema of one of the tables:

ZeppelinNotebooks09.png
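A sketch of the equivalent paragraph; my_database.my_table is a placeholder for one of the tables listed above:

    spark.table("my_database.my_table").printSchema()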

View records per day for this table:

ZeppelinNotebooks10.png
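A sketch of a per-day count, assuming the table has a date-like column (event_date is a hypothetical column name):

    spark.table("my_database.my_table")
      .groupBy("event_date")       // one group per day
      .count()
      .orderBy("event_date")
      .show(50, false)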

Load a DataFrame and create a table in the database:

ZeppelinNotebooks11.png
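A sketch of this step: read data into a DataFrame and persist it as a table. The source path is a placeholder; the table name atb_metrics_sample follows the screenshots:

    // read source data (path is hypothetical)
    val df = spark.read.parquet("gs://my-bucket/path/to/data")

    // register it as a table in your database
    df.write.saveAsTable("my_database.atb_metrics_sample")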

Now you can see your table atb_metrics_sample in your database:

ZeppelinNotebooks12.png

Let us save the data into this table:

ZeppelinNotebooks13.png
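A sketch of the save, assuming df is a DataFrame whose schema matches the table:

    df.write.mode("append").saveAsTable("my_database.atb_metrics_sample")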

See records in this table:

ZeppelinNotebooks14.png
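For example:

    spark.sql("select * from my_database.atb_metrics_sample limit 10").show(false)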

Features

  • Notebooks are saved to Google Cloud Storage. This allows a user to open the notebook in the future and resume their work. (All notebooks are stored at the "PROJECT-ID/syn-cluster-config/zeppelin-notebooks" path.)
  • All the Syntasa data that is usually accessible through Hive/Spark is accessible from within the Zeppelin notebook. (The time taken to run your code from the notebook depends on the Spark runtime configuration you create.)
  • Similar to Jupyter, it allows you to run your code in blocks.

Note: This feature is currently experimental. More developments will follow.