40,998 questions
0
votes
1
answer
86
views
ETL on an Overcomplex Data Structure with pandas and pyspark
The question is: are the proposed methods for flattening a given dataframe efficient enough, or could they be further refined?
An example of the pandas input dataframe's columns (it is a given, cannot be ...
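For context, flattening nested columns in PySpark usually combines explode for arrays with struct expansion; a minimal sketch (the "items" array and "payload" struct are hypothetical stand-ins, since the question's structure is truncated above):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.json("input.json")  # assumed nested source

flat = (
    df.withColumn("item", F.explode("items"))  # one row per array element
      .select("id", "payload.*", "item.*")     # promote struct fields to columns
)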
0
votes
0
answers
106
views
How to read/write in MinIO using Spark?
I've built a Spark and MinIO Docker container with the config below:
services:
  spark:
    build: .
    command: sleep infinity
    container_name: spark
    volumes:
      - ./spark_scripts:/opt/...
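For context, pointing Spark's s3a connector at MinIO usually comes down to a handful of Hadoop configs; a minimal sketch assuming the hadoop-aws package is on the classpath (endpoint, credentials, and bucket names are placeholders):

from pyspark.sql import SparkSession

# Sketch: s3a settings for a MinIO endpoint; all values are placeholders.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
    .getOrCreate()
)

df = spark.read.csv("s3a://bucket/data.csv", header=True)
df.write.mode("overwrite").parquet("s3a://bucket/out/")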
0
votes
1
answer
52
views
PySpark .show() fails with “Python worker exited unexpectedly” on Windows (Python 3.14)
I am facing a PySpark error on Windows while calling .show() on a DataFrame. The job fails with a Python worker crash.
Environment
OS: Windows 10
Spark: Apache Spark (PySpark)
IDE: VS Code
...
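A frequent cause of this crash is running a Python version newer than the installed PySpark supports; one common workaround is pinning the driver and workers to a supported interpreter before the session is created (a sketch, the interpreter path is a placeholder):

import os

# Pin both driver and worker interpreters; the path is a placeholder for a
# PySpark-supported Python version.
os.environ["PYSPARK_PYTHON"] = r"C:\Python311\python.exe"
os.environ["PYSPARK_DRIVER_PYTHON"] = r"C:\Python311\python.exe"

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
spark.createDataFrame([(1, "a")], ["id", "val"]).show()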
0
votes
1
answer
37
views
Spark Declarative Pipelines - How to refresh a materialized view?
I can define a materialized_view with the latest feature of Spark, but when I try to execute it again, I get an error: the location already exists.
pyspark.errors.exceptions.connect....
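For reference, a materialized view in the pyspark.pipelines API is declared roughly like this (a sketch based on my reading of the Spark 4.x declarative pipelines docs; the table name is a placeholder):

from pyspark import pipelines as dp

@dp.materialized_view
def daily_totals():
    # "spark" is the session provided to pipeline source files;
    # "orders" is a placeholder upstream table.
    return spark.read.table("orders").groupBy("day").count()

Re-running a pipeline that writes to an existing storage location is where the error above appears; whether a full refresh or clearing the location is required may depend on the Spark release.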
Best practices
0
votes
1
replies
29
views
Parallelizing REST-API requests in Databricks
I have a list of IDs and want to make a GET request to a REST API for each of the IDs and save the results in a dataframe. If I loop over the list, it takes far too long, so I tried to parallelize using ...
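One commonly suggested approach is to fan the requests out on the driver with a thread pool (I/O-bound work releases the GIL) and build the dataframe from the results; a sketch with a hypothetical URL and response shape:

from concurrent.futures import ThreadPoolExecutor

import requests

ids = ["a1", "b2", "c3"]  # placeholder IDs

def fetch(item_id):
    # Hypothetical endpoint; returns one row dict per ID.
    resp = requests.get(f"https://api.example.com/items/{item_id}", timeout=30)
    resp.raise_for_status()
    return {"id": item_id, "payload": resp.text}

with ThreadPoolExecutor(max_workers=16) as pool:
    rows = list(pool.map(fetch, ids))

df = spark.createDataFrame(rows)  # assumes an active SparkSession named spark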
0
votes
0
answers
30
views
MSCK REPAIR TABLE SYNC PARTITIONS fails
I have a PySpark job that writes a dataframe to S3 with partitions. The partition value is a string. In my PySpark script, I have the line:
spark.sql("MSCK REPAIR TABLE table_name SYNC PARTITIONS"...
6
votes
0
answers
118
views
How to make Spark reuse Python workers where we have done some costly init setup?
I'm trying to optimize the execution of pandas UDFs in PySpark. When the UDF starts, I do some costly initialization, like loading an ML model. This is a one-time operation and I don't want to do this ...
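The commonly cited pattern here is lazy, module-level initialization inside the UDF, so each (reused) worker process pays the cost at most once; a sketch where load_model() is a hypothetical stand-in for the expensive setup:

import pandas as pd

from pyspark.sql.functions import pandas_udf

_model = None

def _get_model():
    # Cached in a module-level global: runs once per Python worker process.
    global _model
    if _model is None:
        _model = load_model()  # hypothetical costly initialization
    return _model

@pandas_udf("double")
def score(features: pd.Series) -> pd.Series:
    model = _get_model()
    return pd.Series(model.predict(features))

# Note: spark.python.worker.reuse defaults to true, so this module-level
# state generally survives across tasks on the same worker.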
0
votes
0
answers
64
views
Ensure two queries in a Spark declarative pipeline process the same rows when using the availableNow trigger
I'm using Spark declarative pipelines in Databricks. My pipeline runs in triggered mode. My understanding is that in triggered mode, the streaming uses the availableNow=True option to process all data ...
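For reference, the plain Structured Streaming shape of such a trigger (a sketch; table and checkpoint names are placeholders):

(spark.readStream.table("bronze_events")
    .writeStream
    .trigger(availableNow=True)  # process all data available now, then stop
    .option("checkpointLocation", "/tmp/chk/silver_events")
    .toTable("silver_events"))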
4
votes
1
answer
129
views
Combine rows and extend timestamp column if same as previous row
I want to combine rows at the PersonID level when consecutive rows have the same JobTitleID, extending the timestamp range to cover the combined rows.
For example, this is the raw data:
I want the output ...
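A standard gaps-and-islands pattern fits this: flag changes in JobTitleID per PersonID, turn the flags into group IDs with a running sum, then aggregate the time range per group. A sketch; the StartDate/EndDate names are assumptions, since the excerpt only mentions a timestamp column:

from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy("PersonID").orderBy("StartDate")

merged = (
    df.withColumn("chg", (F.lag("JobTitleID").over(w) != F.col("JobTitleID")).cast("int"))
      .withColumn("chg", F.coalesce("chg", F.lit(0)))  # first row per person
      .withColumn("grp", F.sum("chg").over(w))         # island id
      .groupBy("PersonID", "JobTitleID", "grp")
      .agg(F.min("StartDate").alias("StartDate"),
           F.max("EndDate").alias("EndDate"))
)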
0
votes
0
answers
70
views
Spark Declarative Pipelines (SDP) – TABLE_OR_VIEW_NOT_FOUND for upstream table even though it is defined
I am trying to learn Spark Declarative Pipelines (Spark 4.0 / pyspark.pipelines) locally using the spark-pipelines CLI.
I have a simple Bronze → Silver → Gold pipeline, but I keep getting:
pyspark....
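For context, a minimal two-step dependency in pyspark.pipelines looks roughly like this (a sketch based on my reading of the Spark 4.x declarative pipelines API; names are placeholders), where the downstream step references the upstream dataset by name so the framework can order them:

from pyspark import pipelines as dp

@dp.materialized_view
def bronze():
    return spark.read.json("/data/raw/")  # "spark" provided to pipeline files

@dp.materialized_view
def silver():
    # Reading the upstream dataset by its function name declares the dependency.
    return spark.read.table("bronze").where("id IS NOT NULL")

TABLE_OR_VIEW_NOT_FOUND typically points at a name or catalog mismatch between the reference and the definition.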
Best practices
1
vote
6
replies
56
views
How to run a PySpark UDF separately over dataframe groups
Grouping a PySpark dataframe and applying a time-series analysis UDF to each group
SOLVED - see below
I have a PySpark process which takes a time-series dataframe for a site and calculates/adds features ...
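The usual tool for this is groupBy().applyInPandas, which hands each group to a pandas function; a sketch where add_features and the column names are hypothetical stand-ins:

import pandas as pd

def add_features(pdf: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical time-series feature: rolling mean per site.
    pdf = pdf.sort_values("timestamp")
    pdf["rolling_mean"] = pdf["value"].rolling(24, min_periods=1).mean()
    return pdf

result = (
    df.groupBy("site_id")
      .applyInPandas(
          add_features,
          schema="site_id string, timestamp timestamp, "
                 "value double, rolling_mean double",
      )
)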
0
votes
0
answers
75
views
Pytest spark fixture failing on startup
I have been trying hard to test my PySpark transformation on my local Windows machine.
Here is what I have done so far.
I installed the latest version of Spark, downloaded hadoop.dll and winutils, ...
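A typical session-scoped fixture for this setup looks like the sketch below (on Windows it still needs the HADOOP_HOME/winutils setup the question describes):

import pytest

from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # One local session shared across the test run; a small shuffle-partition
    # count keeps tests fast.
    session = (
        SparkSession.builder
        .master("local[2]")
        .appName("pytest-spark")
        .config("spark.sql.shuffle.partitions", "2")
        .getOrCreate()
    )
    yield session
    session.stop()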
2
votes
0
answers
79
views
'JavaPackage' object is not callable error when trying to getOrCreate() local spark session
I have set up a small Xubuntu machine with the intention of making it my single-node play-around Spark cluster. The cluster seems to be set up correctly: I can access the WebUI at port 8080, and it shows a ...
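This error is often reported when the pip-installed PySpark version does not match the cluster's Spark version; a quick check (a sketch, not a guaranteed diagnosis):

import pyspark

# Compare the client library version against the Spark version shown in the
# standalone cluster's WebUI on port 8080; if they differ, reinstall with
# pip install pyspark==<cluster version>.
print(pyspark.__version__)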
1
vote
1
answer
86
views
AWS Glue PySpark job taking 4 hours to process small JSON files from S3
I have an AWS Glue job that processes thousands of small JSON files from S3 (historical data load for Adobe Experience Platform). The job is taking approximately 4 hours to complete, which is ...
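One commonly cited mitigation for Glue's small-files overhead is file grouping on read, so many small S3 objects share a task; a sketch with placeholder paths and sizes, assuming the usual glueContext is in scope:

dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://my-bucket/adobe/json/"],  # placeholder path
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": "134217728",  # ~128 MB of input per task
    },
    format="json",
)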
-1
votes
1
answer
134
views
Optimize code to flatten Meta ads metrics data in Spark
I have two Spark scripts. The first is a bronze script that needs to read data from Kafka topics; each topic has ads platform data (tiktok_insights, meta_insights, google_insights). The structure is the same:
(id, ...
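The bronze read such a script typically starts from looks like this sketch (topic names follow the question; broker, schema handling, and checkpoint path are placeholders):

from pyspark.sql import functions as F

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "tiktok_insights,meta_insights,google_insights")
       .load())

# Keep the raw payload as a string column for the bronze layer.
bronze = raw.select(F.col("topic"),
                    F.col("value").cast("string").alias("json_payload"))

(bronze.writeStream
       .option("checkpointLocation", "/tmp/chk/bronze_ads")
       .toTable("bronze_ads"))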