0 votes
1 answer
86 views

The question is: are the proposed methods for flattening a given dataframe efficient enough, or could they be further refined? An example of the input pandas dataframe's columns (it is a given, cannot be ...
complicat9d
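For the flattening question above, a minimal sketch of one common approach, assuming the nested data arrives as dict-valued columns (the actual schema is truncated in the excerpt):

```python
import pandas as pd

# Hypothetical input: "payload" holds nested dicts; the asker's real
# columns are not shown in full, so this structure is an assumption.
df = pd.DataFrame({
    "id": [1, 2],
    "payload": [{"a": {"x": 1, "y": 2}}, {"a": {"x": 3, "y": 4}}],
})

# json_normalize flattens nested dicts into dotted column names in one
# vectorized pass, which usually beats row-by-row Python recursion.
flat = pd.json_normalize(df["payload"].tolist())
flat.columns = [f"payload.{c}" for c in flat.columns]
result = pd.concat([df.drop(columns="payload"), flat], axis=1)
print(result)  # columns: id, payload.a.x, payload.a.y
```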
0 votes
0 answers
106 views

I've built a Spark and MinIO docker container with the below config: services: spark: build: . command: sleep infinity container_name: spark volumes: - ./spark_scripts:/opt/...
vyangesh
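A hedged sketch of the session config that typically lets Spark talk to MinIO over s3a; the endpoint, credentials, and bucket are placeholders, and it assumes the hadoop-aws jar (and its AWS SDK dependency) is on the classpath:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("minio-smoke-test")
    # hadoop-aws settings commonly needed for MinIO: a custom endpoint
    # and path-style access instead of AWS virtual-host addressing.
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
    .getOrCreate()
)

# Round-trip test against a placeholder bucket.
spark.range(10).write.mode("overwrite").parquet("s3a://test-bucket/smoke")
```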
0 votes
1 answer
52 views

I am facing a PySpark error on Windows while calling .show() on a DataFrame. The job fails with a Python worker crash. Environment: OS: Windows 10; Spark: Apache Spark (PySpark); IDE: VS Code ...
Deepika Goyal
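A common first check for this class of Windows worker crash is an interpreter mismatch between the driver and the Python workers; a sketch of pinning both to the same interpreter before the session starts (this may or may not be the asker's root cause):

```python
import os
import sys

# "Python worker exited unexpectedly" on Windows is frequently caused by
# the workers launching a different Python than the driver. Pinning both
# to the current interpreter is a common first step.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("show-test")
    .getOrCreate()
)
spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"]).show()
```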
0 votes
1 answer
37 views

I can define a materialized view with the latest feature of Spark, but when I try to execute it again, I get an error: the location already exists. pyspark.errors.exceptions.connect....
Best practices
0 votes
1 reply
29 views

I have a list of IDs and want to make a GET request to a REST API for each of the IDs and save the results in a dataframe. If I loop over the list it takes far too long, so I tried to parallelize using ...
yannik0103
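The excerpt cuts off before the parallelization code, but a thread pool is the usual fit for network-bound GET requests; a sketch with a hypothetical endpoint and ID list:

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd
import requests

ids = [1, 2, 3, 4, 5]                        # placeholder for the asker's IDs
BASE_URL = "https://api.example.com/items"   # hypothetical endpoint

def fetch(item_id):
    # Network-bound work, so threads (not processes) are usually enough.
    resp = requests.get(f"{BASE_URL}/{item_id}", timeout=10)
    resp.raise_for_status()
    return resp.json()

# max_workers bounds concurrent connections; tune to what the API allows.
with ThreadPoolExecutor(max_workers=8) as pool:
    records = list(pool.map(fetch, ids))

df = pd.DataFrame(records)
```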
0 votes
0 answers
30 views

I have a PySpark job that writes a dataframe to S3 with partitions. The partition value is a string. In my PySpark script, I have the line: spark.sql("MSCK REPAIR TABLE table_name SYNC PARTITIONS"...
Dozel • 169
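For context on the quoted line, a sketch of a partitioned write followed by the repair statement; df, the table name, and the S3 path are placeholders, since the asker's script is only partially quoted:

```python
# Write with a string-valued partition column, matching the question.
(df.write
   .mode("append")
   .partitionBy("dt")
   .parquet("s3://bucket/warehouse/table_name"))

# SYNC PARTITIONS adds newly written partition directories to the
# metastore and drops entries whose directories no longer exist.
spark.sql("MSCK REPAIR TABLE table_name SYNC PARTITIONS")
```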
6 votes
0 answers
118 views

I'm trying to optimize the execution of pandas UDFs in PySpark. When I start the UDF, I do some costly initialization, like loading an ML model. This is a one-time operation and I don't want to do this ...
Srinivas Kumar
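The iterator-of-Series pandas UDF variant is designed for exactly this case: the function body runs once per task, so a costly load is reused across all batches in that task. A sketch, with the model loader as a hypothetical stand-in:

```python
from typing import Iterator

import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def predict(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Runs once per task: the expensive initialization happens here,
    # before iterating over the incoming batches.
    model = load_my_model()  # hypothetical expensive load
    for batch in batches:
        yield pd.Series(model.predict(batch.tolist()))
```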
0 votes
0 answers
64 views

I'm using Spark declarative pipelines in Databricks. My pipeline runs in triggered mode. My understanding is that in triggered mode, the streaming uses the availableNow=True option to process all data ...
Rob Fisher • 1,025
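For reference, a plain Structured Streaming sketch of what a triggered run does: trigger(availableNow=True) processes everything available when the query starts, then stops. The source/target format and paths here are assumptions:

```python
# Delta source and sink are assumptions; the pattern is the same for
# other streaming sources.
query = (
    spark.readStream
    .format("delta")
    .load("/path/to/source")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/path/to/checkpoint")
    .trigger(availableNow=True)   # consume the backlog, then terminate
    .start("/path/to/target")
)
query.awaitTermination()
```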
4 votes
1 answer
129 views

I want to be able to combine rows at the PersonID level if JobTitleID values are the same consecutively, where the timestamp column gets extended when they match. For example, this is the raw data: I want the output ...
tommyhmt • 364
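This is the classic gaps-and-islands pattern; a sketch with guessed column names (StartTime/EndTime), since the raw data in the question is shown only as an image:

```python
from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy("PersonID").orderBy("StartTime")

# A new "island" starts whenever JobTitleID differs from the previous
# row's value for the same person; a running sum numbers the islands.
df2 = (
    df.withColumn("prev", F.lag("JobTitleID").over(w))
      .withColumn(
          "new_island",
          (F.col("prev").isNull()
           | (F.col("prev") != F.col("JobTitleID"))).cast("int"))
      .withColumn("island", F.sum("new_island").over(w))
)

# Collapse each island to one row, extending the timestamp range.
merged = (
    df2.groupBy("PersonID", "JobTitleID", "island")
       .agg(F.min("StartTime").alias("StartTime"),
            F.max("EndTime").alias("EndTime"))
       .drop("island")
)
```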
0 votes
0 answers
70 views

I am trying to learn Spark Declarative Pipelines (Spark 4.0 / pyspark.pipelines) locally using the spark-pipelines CLI. I have a simple Bronze → Silver → Gold pipeline, but I keep getting: pyspark....
AChaudhury
Best practices
1 vote
6 replies
56 views

Grouping a PySpark dataframe, applying a time series analysis UDF to each group. SOLVED, see below. I have a PySpark process which takes a time-series dataframe for a site and calculates/adds features ...
Jernau • 93
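The usual shape of that process is groupBy(...).applyInPandas(...), where each site's full series arrives as one pandas frame; a sketch with assumed column names:

```python
import pandas as pd

def add_features(pdf: pd.DataFrame) -> pd.DataFrame:
    # Runs once per site, with that site's whole time series in memory.
    pdf = pdf.sort_values("ts")
    pdf["value_lag1"] = pdf["value"].shift(1)  # example feature
    return pdf

# The output schema must be declared up front; names are assumptions.
result = (
    df.groupBy("site_id")
      .applyInPandas(
          add_features,
          schema="site_id string, ts timestamp, value double, "
                 "value_lag1 double")
)
```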
0 votes
0 answers
75 views

I have been trying hard to test my PySpark transformation on my local Windows machine. Here is what I have done so far. I installed the latest version of Spark, downloaded hadoop.dll and winutils, ...
Rishabh • 88
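A minimal pytest fixture for this kind of local transformation testing, assuming the winutils/HADOOP_HOME setup the asker already did is in place:

```python
import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

@pytest.fixture(scope="session")
def spark():
    # local[2] avoids any cluster dependency; on Windows, hadoop.dll and
    # winutils must still be reachable via HADOOP_HOME/PATH.
    session = (
        SparkSession.builder
        .master("local[2]")
        .appName("pyspark-tests")
        .getOrCreate()
    )
    yield session
    session.stop()

def test_uppercase(spark):
    df = spark.createDataFrame([("a",)], ["col"])
    assert df.select(F.upper("col")).first()[0] == "A"
```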
2 votes
0 answers
79 views

I have set up a small Xubuntu machine with the intention of making it my single-node playground Spark cluster. The cluster seems to be set up correctly: I can access the WebUI at port 8080, it shows a ...
Paweł Sopel
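One detail worth checking in a setup like this: applications connect to the standalone master URL on port 7077 (by default), not the 8080 WebUI port. A sketch with a placeholder host name:

```python
from pyspark.sql import SparkSession

# "xubuntu-box" is a placeholder; the actual master URL is shown at the
# top of the 8080 WebUI as spark://<host>:7077.
spark = (
    SparkSession.builder
    .master("spark://xubuntu-box:7077")
    .appName("cluster-smoke-test")
    .getOrCreate()
)
print(spark.range(1000).count())  # should run on the standalone worker
```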
1 vote
1 answer
86 views

I have an AWS Glue job that processes thousands of small JSON files from S3 (historical data load for Adobe Experience Platform). The job is taking approximately 4 hours to complete, which is ...
Jayron Soares
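The standard Glue-side mitigation for many small S3 objects is file grouping on read; a sketch using the groupFiles/groupSize connection options (the path is a placeholder, and glueContext is the usual Glue job boilerplate):

```python
# groupFiles batches many small S3 objects into fewer, larger read
# tasks, which is typically the first fix for the small-files problem.
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://my-bucket/adobe-historical/"],  # placeholder
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": "134217728",   # target roughly 128 MB per group
    },
    format="json",
)
```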
-1 votes
1 answer
134 views

I have two Spark scripts; the first is a bronze script that needs to read data from Kafka topics. Each topic has ads platform data (tiktok_insights, meta_insights, google_insights). The structure is the same: (id, ...
Kuldeep KV
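Since the three topics share one schema, a single bronze stream can subscribe to all of them and keep Kafka's topic column for routing; a sketch with a placeholder broker address:

```python
# Topic names come from the question; the broker address is a placeholder.
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "tiktok_insights,meta_insights,google_insights")
    .option("startingOffsets", "earliest")
    .load()
)

# One bronze stream carries all topics; the built-in 'topic' column
# identifies the source platform for downstream silver tables.
bronze = raw.selectExpr(
    "topic", "CAST(value AS STRING) AS value", "timestamp")
```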
