
I am trying to run my Spark job from Airflow. When I execute the following command in a terminal, it works fine without any issue:

spark-submit --class dataload.dataload_daily /home/ubuntu/airflow/dags/scripts/data_to_s3-assembly-0.1.jar

However, when I run the same command through Airflow, I keep getting this error:

/tmp/airflowtmpKQMdzp/spark-submit-scalaWVer4Z: line 1: spark-submit: command not found

t1 = BashOperator(task_id='spark-submit-scala',
                  bash_command='spark-submit --class dataload.dataload_daily '
                               '/home/ubuntu/airflow/dags/scripts/data_to_s3-assembly-0.1.jar',
                  dag=dag,
                  retries=0,
                  start_date=datetime(2018, 4, 14))

I have the Spark path set in my bash_profile:

export SPARK_HOME=/opt/spark-2.2.0-bin-hadoop2.7
export PATH="$SPARK_HOME/bin/:$PATH"

I have sourced this file as well. I am not sure how to debug this; can anyone help me with it?

1 Answer


You could start with bash_command = 'echo $PATH' to see whether your PATH is being updated correctly.
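For example, a throwaway debugging task could look like the sketch below (the task_id is a placeholder, and it assumes the same dag object as in your question):

from airflow.operators.bash_operator import BashOperator

# Temporary task that prints the PATH the Airflow worker actually sees
debug_path = BashOperator(task_id='print-path',
                          bash_command='echo $PATH',
                          dag=dag)

If /opt/spark-2.2.0-bin-hadoop2.7/bin is missing from that output, the worker's shell never picked up your bash_profile changes.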

You mention editing the bash_profile, but as far as I know Airflow is usually run as a different user. Since that user's bash_profile has no such changes, the path to Spark is likely missing from its PATH.

As mentioned here (How do I set an environment variable for airflow to use?) you could try setting the path in .bashrc.
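Alternatively, you can make the task independent of the login shell's PATH altogether, either by calling spark-submit via its absolute path or by passing the environment to the operator explicitly. A rough sketch, assuming the Spark install location from your bash_profile:

# Option 1: call spark-submit by its absolute path
t1 = BashOperator(task_id='spark-submit-scala',
                  bash_command='/opt/spark-2.2.0-bin-hadoop2.7/bin/spark-submit '
                               '--class dataload.dataload_daily '
                               '/home/ubuntu/airflow/dags/scripts/data_to_s3-assembly-0.1.jar',
                  dag=dag)

# Option 2: set the environment for this task via the env argument
# (note: env replaces the subprocess environment, so include everything the command needs)
t1 = BashOperator(task_id='spark-submit-scala',
                  bash_command='spark-submit --class dataload.dataload_daily '
                               '/home/ubuntu/airflow/dags/scripts/data_to_s3-assembly-0.1.jar',
                  env={'SPARK_HOME': '/opt/spark-2.2.0-bin-hadoop2.7',
                       'PATH': '/opt/spark-2.2.0-bin-hadoop2.7/bin:/usr/local/bin:/usr/bin:/bin'},
                  dag=dag)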
