A brief overview of Apache Spark's Runtime Architecture - Spark's Driver Program in a Nutshell. PySpark is built on top of Spark's Java API. Data is processed in Python and cached and shuffled in the Java Virtual Machine (JVM). The Python driver program communicates with a local JVM, running as part of the Apache Spark framework, through a Py4j gateway that is linked to that JVM.

The PySpark Python driver program, for example the interactive PySpark shell, accesses Spark's API for the lifespan of a Spark application, whether it is started from a Spark console or by running a Microsoft Azure Databricks notebook. The PySpark shell starts a Py4j Java gateway, and the SparkContext communicates with the JVM over that gateway's port. The SparkContext uses Py4j to launch a JVM and create a JavaSparkContext. By default, the PySpark shell makes a SparkContext available as 'sc'. Py4j is used only on the driver, for local communication between the Python SparkContext and the JavaSparkContext object in the local Spark JVM process. It lets Python programs running in a Python interpreter dynamically access Java objects in a Java Virtual Machine. So, every SparkContext has an associated Py4j gateway that is linked to the local JVM of the driver program.

PySpark - SparkContext default parameters relevant to this blog's objective:
        a. Param gateway: the Py4j gateway instance, held in the reference variable _gateway
        b. Param jvm: the JVM instance, held in the reference variable _jvm
        c. Param jsc: the JavaSparkContext instance, held in the reference variable _jsc
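
The three reference variables listed above can be read straight off a SparkContext. What follows is only a minimal sketch: py4j_handles is a hypothetical helper name invented for illustration, the underscore-prefixed attributes are private PySpark internals that may change between releases, and a stand-in object replaces the real 'sc' so the snippet runs without a Spark installation.

```python
from types import SimpleNamespace


def py4j_handles(sc):
    """Collect the private Py4j-related attributes of a SparkContext-like object.

    Hypothetical helper for illustration only. `_gateway`, `_jvm`, and `_jsc`
    are PySpark internals, so a missing attribute is reported as None
    instead of raising.
    """
    return {
        "gateway": getattr(sc, "_gateway", None),  # Py4j gateway instance
        "jvm": getattr(sc, "_jvm", None),          # view onto the local JVM
        "jsc": getattr(sc, "_jsc", None),          # JavaSparkContext handle
    }


# Stand-in object (an assumption) so the sketch runs without Spark.
fake_sc = SimpleNamespace(_gateway="gateway", _jvm="jvm", _jsc="jsc")
handles = py4j_handles(fake_sc)
print(sorted(handles))
```

In a real PySpark shell the same helper would be called as py4j_handles(sc), and an expression such as sc._jvm.java.lang.System.currentTimeMillis() would then invoke a Java static method through the gateway.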

To access the Py4j gateway, use sc._gateway and read its port from sc._gateway.gateway_parameters.port. Where the environment variable PYSPARK_GATEWAY_PORT is set, it can be used to verify the gateway port.
For example:
        import os
        gateway_port = int(os.environ["PYSPARK_GATEWAY_PORT"])
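
Reading the variable directly raises a KeyError when it is not set, which can happen depending on how the application was launched. A slightly more defensive sketch is shown here. The helper name gateway_port_from_env is hypothetical, and the port value 25333 is only a sample value for the demonstration.

```python
import os


def gateway_port_from_env(environ=os.environ):
    """Return the Py4j gateway port from PYSPARK_GATEWAY_PORT.

    Returns None when the variable is unset, is not an integer, or falls
    outside the valid TCP port range. Hypothetical helper for illustration.
    """
    raw = environ.get("PYSPARK_GATEWAY_PORT")
    if raw is None:
        return None
    try:
        port = int(raw)
    except ValueError:
        return None
    return port if 0 < port < 65536 else None


# Demonstration with explicit mappings instead of the real environment.
print(gateway_port_from_env({"PYSPARK_GATEWAY_PORT": "25333"}))
print(gateway_port_from_env({}))
```

When both sources are available, the value returned here can be compared with sc._gateway.gateway_parameters.port as a sanity check.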

Now, with some of the important background out of the way, let's use the DB2 Universal JDBC Driver (db2jcc.jar) to interact with a DB2 database. The JDBC driver connects directly to the database server using Java. The connection URL takes the form jdbc:db2://server1:50000/phoned. The DB2 server in this case is listening for client connections on port 50000, which is the default port DB2 listens on after installation unless you specify otherwise. Also, note that the hostname (server1), the port number, and the database name (phoned) are all included in the database connection URL.
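
To make the parts of the connection URL concrete, here is a small sketch that assembles and takes apart a URL of the form shown above. The function names build_db2_url and parse_db2_url are hypothetical, introduced only for this illustration, and the code deals with the URL string only. It does not open a database connection.

```python
from urllib.parse import urlsplit

DB2_PREFIX = "jdbc:db2://"


def build_db2_url(host, database, port=50000):
    """Assemble a DB2 JDBC URL; 50000 is the default port named in the text."""
    return f"{DB2_PREFIX}{host}:{port}/{database}"


def parse_db2_url(url):
    """Split a DB2 JDBC URL into (host, port, database)."""
    if not url.startswith(DB2_PREFIX):
        raise ValueError(f"not a DB2 JDBC URL: {url!r}")
    # Drop the leading "jdbc:" so the remainder parses like an ordinary URL.
    parts = urlsplit(url[len("jdbc:"):])
    return parts.hostname, parts.port, parts.path.lstrip("/")


url = build_db2_url("server1", "phoned")
print(url)
print(parse_db2_url(url))
```

In a PySpark session with db2jcc.jar on the driver's classpath, a URL built this way could be passed to sc._jvm.java.sql.DriverManager.getConnection(url, user, password). That is one possible route through Py4j under those assumptions, and not necessarily the approach used in the notebook.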

 "Be a Lifelong Student. The more you learn, the more you earn and the more self-confidence you will have..." 

-Brian Tracy

