40,998 questions
0
votes
1
answer
86
views
ETL on an Overcomplex Data Structure with pandas and pyspark
The question is: are the proposed methods for flattening a given dataframe efficient enough, or could they be further refined?
An example of the pandas input dataframe's columns (it is a given, cannot be ...
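For context, flattening nested columns in PySpark usually combines explode for arrays with struct expansion; a minimal sketch (the "items" array and "payload" struct are hypothetical stand-ins, since the question's structure is truncated above):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.json("input.json")  # assumed nested source

flat = (
    df.withColumn("item", F.explode("items"))  # one row per array element
      .select("id", "payload.*", "item.*")     # promote struct fields to columns
)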
0
votes
0
answers
106
views
How to read/write in MinIO using Spark?
I've built a Spark and MinIO Docker container with the config below:
services:
  spark:
    build: .
    command: sleep infinity
    container_name: spark
    volumes:
      - ./spark_scripts:/opt/...
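For context, pointing Spark's s3a connector at MinIO usually comes down to a handful of Hadoop configs; a minimal sketch assuming the hadoop-aws package is on the classpath (endpoint, credentials, and bucket names are placeholders):

from pyspark.sql import SparkSession

# Sketch: s3a settings for a MinIO endpoint; all values are placeholders.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
    .getOrCreate()
)

df = spark.read.csv("s3a://bucket/data.csv", header=True)
df.write.mode("overwrite").parquet("s3a://bucket/out/")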
0
votes
1
answer
52
views
PySpark .show() fails with “Python worker exited unexpectedly” on Windows (Python 3.14)
I am facing a PySpark error on Windows while calling .show() on a DataFrame. The job fails with a Python worker crash.
Environment
OS: Windows 10
Spark: Apache Spark (PySpark)
IDE: VS Code
...
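A frequent cause of this crash is running a Python version newer than the installed PySpark supports; one common workaround is pinning the driver and workers to a supported interpreter before the session is created (a sketch, the interpreter path is a placeholder):

import os

# Pin both driver and worker interpreters; the path is a placeholder for a
# PySpark-supported Python version.
os.environ["PYSPARK_PYTHON"] = r"C:\Python311\python.exe"
os.environ["PYSPARK_DRIVER_PYTHON"] = r"C:\Python311\python.exe"

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
spark.createDataFrame([(1, "a")], ["id", "val"]).show()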
0
votes
1
answer
37
views
Spark Declarative Pipelines - How to refresh a materialized view?
I can define a materialized_view with the latest feature of Spark, but when I try to execute it again, I get an error: the location already exists.
pyspark.errors.exceptions.connect....
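For reference, a materialized view in the pyspark.pipelines API is declared roughly like this (a sketch based on my reading of the Spark 4.x declarative pipelines docs; the table name is a placeholder):

from pyspark import pipelines as dp

@dp.materialized_view
def daily_totals():
    # "spark" is the session provided to pipeline source files;
    # "orders" is a placeholder upstream table.
    return spark.read.table("orders").groupBy("day").count()

Re-running a pipeline that writes to an existing storage location is where the error above appears; whether a full refresh or clearing the location is required may depend on the Spark release.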
Best practices
0
votes
1
replies
29
views
Parallelizing REST-API requests in Databricks
I have a list of IDs and want to make a GET request to a REST API for each of the IDs and save the results in a dataframe. If I loop over the list, it takes far too long, so I tried to parallelize using ...
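One commonly suggested approach is to fan the requests out on the driver with a thread pool (I/O-bound work releases the GIL) and build the dataframe from the results; a sketch with a hypothetical URL and response shape:

from concurrent.futures import ThreadPoolExecutor

import requests

ids = ["a1", "b2", "c3"]  # placeholder IDs

def fetch(item_id):
    # Hypothetical endpoint; returns one row dict per ID.
    resp = requests.get(f"https://api.example.com/items/{item_id}", timeout=30)
    resp.raise_for_status()
    return {"id": item_id, "payload": resp.text}

with ThreadPoolExecutor(max_workers=16) as pool:
    rows = list(pool.map(fetch, ids))

df = spark.createDataFrame(rows)  # assumes an active SparkSession named spark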
0
votes
0
answers
30
views
MSCK REPAIR TABLE SYNC PARTITIONS fails
I have a PySpark job that writes a dataframe to S3 with partitions. The partition value is a string. In my PySpark script, I have the line:
spark.sql("MSCK REPAIR TABLE table_name SYNC PARTITIONS"...
6
votes
0
answers
118
views
How to make Spark reuse Python workers where we have done some costly init setup?
I'm trying to optimize the execution of pandas UDFs in PySpark. When the UDF starts, I do some costly initialization, like loading an ML model. This is a one-time operation and I don't want to do this ...
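The commonly cited pattern here is lazy, module-level initialization inside the UDF, so each (reused) worker process pays the cost at most once; a sketch where load_model() is a hypothetical stand-in for the expensive setup:

import pandas as pd

from pyspark.sql.functions import pandas_udf

_model = None

def _get_model():
    # Cached in a module-level global: runs once per Python worker process.
    global _model
    if _model is None:
        _model = load_model()  # hypothetical costly initialization
    return _model

@pandas_udf("double")
def score(features: pd.Series) -> pd.Series:
    model = _get_model()
    return pd.Series(model.predict(features))

# Note: spark.python.worker.reuse defaults to true, so this module-level
# state generally survives across tasks on the same worker.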
0
votes
0
answers
64
views
Ensure two queries in a Spark declarative pipeline process the same rows when using the availableNow trigger
I'm using Spark declarative pipelines in Databricks. My pipeline runs in triggered mode. My understanding is that in triggered mode, the streaming uses the availableNow=True option to process all data ...
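For reference, the plain Structured Streaming shape of such a trigger (a sketch; table and checkpoint names are placeholders):

(spark.readStream.table("bronze_events")
    .writeStream
    .trigger(availableNow=True)  # process all data available now, then stop
    .option("checkpointLocation", "/tmp/chk/silver_events")
    .toTable("silver_events"))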
4
votes
1
answer
129
views
Combine rows and extend timestamp column if same as previous row
I want to combine rows at the PersonID level when consecutive rows have the same JobTitleID, extending the timestamp range to cover the combined rows.
For example, this is the raw data:
I want the output ...
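A standard gaps-and-islands pattern fits this: flag changes in JobTitleID per PersonID, turn the flags into group IDs with a running sum, then aggregate the time range per group. A sketch; the StartDate/EndDate names are assumptions, since the excerpt only mentions a timestamp column:

from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy("PersonID").orderBy("StartDate")

merged = (
    df.withColumn("chg", (F.lag("JobTitleID").over(w) != F.col("JobTitleID")).cast("int"))
      .withColumn("chg", F.coalesce("chg", F.lit(0)))  # first row per person
      .withColumn("grp", F.sum("chg").over(w))         # island id
      .groupBy("PersonID", "JobTitleID", "grp")
      .agg(F.min("StartDate").alias("StartDate"),
           F.max("EndDate").alias("EndDate"))
)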
0
votes
0
answers
70
views
Spark Declarative Pipelines (SDP) – TABLE_OR_VIEW_NOT_FOUND for upstream table even though it is defined
I am trying to learn Spark Declarative Pipelines (Spark 4.0 / pyspark.pipelines) locally using the spark-pipelines CLI.
I have a simple Bronze → Silver → Gold pipeline, but I keep getting:
pyspark....
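For context, a minimal two-step dependency in pyspark.pipelines looks roughly like this (a sketch based on my reading of the Spark 4.x declarative pipelines API; names are placeholders), where the downstream step references the upstream dataset by name so the framework can order them:

from pyspark import pipelines as dp

@dp.materialized_view
def bronze():
    return spark.read.json("/data/raw/")  # "spark" provided to pipeline files

@dp.materialized_view
def silver():
    # Reading the upstream dataset by its function name declares the dependency.
    return spark.read.table("bronze").where("id IS NOT NULL")

TABLE_OR_VIEW_NOT_FOUND typically points at a name or catalog mismatch between the reference and the definition.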
Best practices
1
vote
6
replies
56
views
How to run a PySpark UDF separately over dataframe groups
Grouping a PySpark dataframe and applying a time-series analysis UDF to each group
SOLVED - see below
I have a PySpark process which takes a time-series dataframe for a site and calculates/adds features ...
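The usual tool for this is groupBy().applyInPandas, which hands each group to a pandas function; a sketch where add_features and the column names are hypothetical stand-ins:

import pandas as pd

def add_features(pdf: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical time-series feature: rolling mean per site.
    pdf = pdf.sort_values("timestamp")
    pdf["rolling_mean"] = pdf["value"].rolling(24, min_periods=1).mean()
    return pdf

result = (
    df.groupBy("site_id")
      .applyInPandas(
          add_features,
          schema="site_id string, timestamp timestamp, "
                 "value double, rolling_mean double",
      )
)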
0
votes
0
answers
75
views
Pytest spark fixture failing on startup
I have been trying hard to test my PySpark transformation on my local Windows machine.
Here is what I have done so far.
I installed the latest version of Spark, downloaded hadoop.dll and winutils, ...
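A typical session-scoped fixture for this setup looks like the sketch below (on Windows it still needs the HADOOP_HOME/winutils setup the question describes):

import pytest

from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # One local session shared across the test run; a small shuffle-partition
    # count keeps tests fast.
    session = (
        SparkSession.builder
        .master("local[2]")
        .appName("pytest-spark")
        .config("spark.sql.shuffle.partitions", "2")
        .getOrCreate()
    )
    yield session
    session.stop()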
2
votes
0
answers
79
views
'JavaPackage' object is not callable error when trying to getOrCreate() local spark session
I have set up a small Xubuntu machine with the intention of making it my single-node play-around Spark cluster. The cluster seems to be set up correctly: I can access the WebUI at port 8080, and it shows a ...
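This error is often reported when the pip-installed PySpark version does not match the cluster's Spark version; a quick check (a sketch, not a guaranteed diagnosis):

import pyspark

# Compare the client library version against the Spark version shown in the
# standalone cluster's WebUI on port 8080; if they differ, reinstall with
# pip install pyspark==<cluster version>.
print(pyspark.__version__)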
1
vote
1
answer
86
views
AWS Glue PySpark job taking 4 hours to process small JSON files from S3
I have an AWS Glue job that processes thousands of small JSON files from S3 (historical data load for Adobe Experience Platform). The job is taking approximately 4 hours to complete, which is ...
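One commonly cited mitigation for Glue's small-files overhead is file grouping on read, so many small S3 objects share a task; a sketch with placeholder paths and sizes, assuming the usual glueContext is in scope:

dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://my-bucket/adobe/json/"],  # placeholder path
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": "134217728",  # ~128 MB of input per task
    },
    format="json",
)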
-1
votes
1
answer
134
views
Optimize code to flatten Meta ads metrics data in Spark
I have two Spark scripts. The first is a bronze script that needs to read data from Kafka topics; each topic has ads platform data (tiktok_insights, meta_insights, google_insights). The structure is the same:
(id, ...
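The bronze read such a script typically starts from looks like this sketch (topic names follow the question; broker, schema handling, and checkpoint path are placeholders):

from pyspark.sql import functions as F

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "tiktok_insights,meta_insights,google_insights")
       .load())

# Keep the raw payload as a string column for the bronze layer.
bronze = raw.select(F.col("topic"),
                    F.col("value").cast("string").alias("json_payload"))

(bronze.writeStream
       .option("checkpointLocation", "/tmp/chk/bronze_ads")
       .toTable("bronze_ads"))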