How to Fix the “No FileSystem for Scheme ‘gs'” Error in PySpark

I’ve been working on a simple PySpark project that writes a DataFrame to a Google Cloud Storage bucket. Everything seemed straightforward until I hit a dependency error while trying to use the Cloud Storage connector for Hadoop 3.

The First Error

When I ran my PySpark code, I got an error that stopped me in my tracks:

py4j.protocol.Py4JJavaError: An error occurred while calling o49.parquet.
: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "gs"
...

What Happened?

This error occurs because Hadoop’s FileSystem layer, which Spark uses for I/O, has no implementation registered for the "gs" scheme. In my original code, I was pulling the connector in as a package dependency:

.config("spark.jars.packages", "com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.26")

Because of some dependency mismatch or incompatibility, Spark wasn’t able to load the class that implements the Google Cloud Storage file system. That’s why I kept seeing the error, even after trying various versions of the connector and tweaking configurations.
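One thing worth ruling out before changing the dependency itself is whether the "gs" scheme is mapped to an implementation class at all. The sketch below is a variant of my original session setup with the connector’s file system classes registered explicitly; the two class names are the ones shipped inside the gcs-connector jar, and everything else is unchanged.

from pyspark.sql import SparkSession

# Hedged sketch: same builder as the original code, plus explicit scheme-to-class
# mappings so Hadoop knows which classes implement "gs".
spark = SparkSession.builder \
    .appName("Basic PySpark Example") \
    .config("spark.jars.packages", "com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.26") \
    .config("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem") \
    .config("spark.hadoop.fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS") \
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true") \
    .config("spark.hadoop.fs.gs.auth.service.account.json.keyfile", "<key-file>.json") \
    .getOrCreate()

If the error persists with these mappings in place, the jar itself most likely never made it onto the classpath, which points back at dependency resolution rather than configuration.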

Explanation of the Original Code

Here’s the code that was causing the error:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

def main():
    # Initialize Spark session with GCS support
    spark = SparkSession.builder \
        .appName("Basic PySpark Example") \
        .config("spark.jars.packages", "com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.26") \
        .config("spark.hadoop.google.cloud.auth.service.account.enable", "true") \
        .config("spark.hadoop.fs.gs.auth.service.account.json.keyfile", "<key-file>.json") \
        .getOrCreate()

    schema = StructType([
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True),
        StructField("city", StringType(), True)
    ])

    data = [
        ("John", 30, "New York"),
        ("Alice", 25, "London"),
        ("Bob", 35, "Paris")
    ]
    df = spark.createDataFrame(data, schema=schema)
    df.show()

    gcs_bucket = "name"
    df.write.parquet(f"gs://{gcs_bucket}/data/people.parquet")
    df.write.csv(f"gs://{gcs_bucket}/data/people.csv")

    print(f"Data written to GCS bucket: {gcs_bucket}")
    spark.stop()

if __name__ == "__main__":
    main()

How the Code is Intended to Work

  • Spark Session Initialization:
    The session is set up to support Google Cloud Storage by including the GCS connector.
  • Schema and Data Creation:
    I defined a schema and created a DataFrame with sample data.
  • Data Writing:
    The DataFrame is written to both Parquet and CSV formats in a GCS bucket.
  • Dependency Misstep:
    The configuration line that adds the connector via spark.jars.packages didn’t load the correct classes for the "gs" file system scheme, leading to the error (a quick way to confirm this is sketched just after this list).
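To confirm that the scheme really isn’t registered (rather than, say, an authentication problem), you can ask Hadoop directly which class it would use for "gs". This is a diagnostic sketch that assumes the spark session from the code above and goes through Spark’s internal _jsc/_jvm bridges, so treat it as a debugging aid rather than a stable API:

# Diagnostic sketch: resolve the "gs" scheme through Hadoop's FileSystem registry.
# If the connector isn't on the classpath, this raises the same
# "No FileSystem for scheme" error as the write did.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
fs_class = spark.sparkContext._jvm.org.apache.hadoop.fs.FileSystem.getFileSystemClass("gs", hadoop_conf)
print(fs_class.getName())  # expected: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem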

Using an Alternative Connector Approach

After a lot of trial and error, I found that changing how the connector jar is supplied fixed the issue. Instead of having Spark resolve it from Maven coordinates via spark.jars.packages, I referenced the jar file hosted on Google’s storage directly:

Modified Spark Session Initialization

Replace:

.config("spark.jars.packages", "com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.26")

With:

.config("spark.jars", "https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar")

With this setting, Spark downloads the latest GCS connector jar directly from Google’s storage, which appears to work better with my setup.
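One caveat with the -latest jar is that builds then depend on whatever Google publishes at that URL. Since spark.jars also accepts local file paths, a variant worth considering is downloading a specific connector release once and pointing Spark at the local copy. A minimal sketch, with a purely hypothetical path standing in for wherever you keep the jar:

from pyspark.sql import SparkSession

# Hedged variant: reference a locally downloaded connector jar instead of the
# remote "latest" URL. The path below is hypothetical.
spark = SparkSession.builder \
    .appName("Basic PySpark Example") \
    .config("spark.jars", "/opt/spark/jars/gcs-connector-hadoop3.jar") \
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true") \
    .config("spark.hadoop.fs.gs.auth.service.account.json.keyfile", "<key-file>.json") \
    .getOrCreate()

Either way, the important part is that the connector jar ends up on the classpath of the driver and the executors.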

Enhanced Code with Extra Practice Functionality

I didn’t stop at just fixing the dependency error. I wanted to add some practice functionality to make the project more robust and provide useful logging. Here’s the enhanced version of my code:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
import logging
import sys

def setup_logging():
    # Set up logging to both file and stdout
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s [%(levelname)s] %(message)s",
        handlers=[
            logging.StreamHandler(sys.stdout),
            logging.FileHandler("pyspark_app.log")
        ]
    )
    return logging.getLogger("PySparkApp")

def main():
    logger = setup_logging()
    logger.info("Starting PySpark application with GCS support.")

    # Initialize Spark session with GCS support using direct jar reference
    spark = SparkSession.builder \
        .appName("Basic PySpark Example") \
        .config("spark.jars", "https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar") \
        .config("spark.hadoop.google.cloud.auth.service.account.enable", "true") \
        .config("spark.hadoop.fs.gs.auth.service.account.json.keyfile", "<key-file>.json") \
        .getOrCreate()

    schema = StructType([
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True),
        StructField("city", StringType(), True)
    ])

    data = [
        ("John", 30, "New York"),
        ("Alice", 25, "London"),
        ("Bob", 35, "Paris")
    ]
    df = spark.createDataFrame(data, schema=schema)
    logger.info("DataFrame created successfully.")
    df.show()

    gcs_bucket = "name"

    try:
        df.write.parquet(f"gs://{gcs_bucket}/data/people.parquet")
        df.write.csv(f"gs://{gcs_bucket}/data/people.csv")
        logger.info(f"Data written to GCS bucket: {gcs_bucket}")
    except Exception as e:
        logger.error("Failed to write data to GCS.", exc_info=True)
    finally:
        spark.stop()
        logger.info("Spark session stopped.")

if __name__ == "__main__":
    main()
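If you want to verify the write end to end, a read-back check can go inside the try block right after the writes. A minimal sketch, assuming the spark, gcs_bucket, and logger objects from the code above:

# Sketch of a read-back verification (place inside the try block, after the writes).
df_check = spark.read.parquet(f"gs://{gcs_bucket}/data/people.parquet")
logger.info(f"Read back {df_check.count()} rows from GCS.")  # 3 rows expected here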

What’s New and Why

  • Logging Setup:
    I added a logging configuration to output messages both to the console and to a log file. This makes it easier to debug and trace what’s happening during execution.
  • Error Handling:
    The data writing step is wrapped in a try-except block to catch and log any exceptions that may occur during the write process. This helps diagnose issues quickly.
  • Cleaner Shutdown:
    Whether the write succeeds or fails, the Spark session is stopped, and an appropriate log message is recorded.

Final Thoughts

By switching from a dependency managed via spark.jars.packages to directly referencing the jar file, I finally got my PySpark application to correctly recognize the "gs" scheme and interact with Google Cloud Storage. Adding logging and error handling further improved the robustness of the project.

This experience reinforced the importance of understanding how dependency management works in distributed systems like Spark. It’s not just about getting your code to run; it’s also about making your application resilient and easier to debug in production environments.
