I need to run Apache Airflow in a corporate network. To do that, I have to set "http_proxy", "https_proxy" and "no_proxy" on any machine that needs internet access.

Right now, the VM I'm using to run Airflow defines these environment variables in /etc/profile.

Python scripts that make HTTP requests to external websites work fine when I run them from the terminal, but when I run them inside a DAG they fail because the address cannot be resolved or reached.

It seems that Airflow runs scripts in an isolated environment. I am currently using CeleryExecutor.

First, I printed all the environment variables from inside a task with print(environ) and got this:

environ({'LANG': 'en_US.UTF-8', 'PATH': '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin', 'HOME': '/home/airflow', 'LOGNAME': 'airflow', 'USER': 'airflow', 'SHELL': '/bin/bash', 'INVOCATION_ID': '5c777ce3b07748309b972d877a0545ea', 'JOURNAL_STREAM': '9:37430', 'AIRFLOW_CONFIG': '/opt/airflow/airflow.cfg', 'AIRFLOW_HOME': '/opt/airflow', '_MP_FORK_LOGLEVEL_': '20', '_MP_FORK_LOGFILE_': '', '_MP_FORK_LOGFORMAT_': '[%(asctime)s: %(levelname)s/%(processName)s] %(message)s', 'CELERY_LOG_LEVEL': '20', 'CELERY_LOG_FILE': '', 'CELERY_LOG_REDIRECT': '1', 'CELERY_LOG_REDIRECT_LEVEL': 'WARNING', 'AIRFLOW_CTX_DAG_OWNER': 'airflow', 'AIRFLOW_CTX_DAG_ID': 'primeiro-teste', 'AIRFLOW_CTX_TASK_ID': 'extract', 'AIRFLOW_CTX_EXECUTION_DATE': '2022-12-13T16:18:17.185417+00:00', 'AIRFLOW_CTX_DAG_RUN_ID': 'manual__2022-12-13T16:18:17.185417+00:00'})

There are no proxy variables, so the script cannot reach anything outside the network.

I also checked from within a DAG which DNS servers were being used, to see whether they were correct. They were.

The only way I got the script to work was by setting these environment variables before making the HTTP request:

os.environ['HTTP_PROXY'] = os.environ['http_proxy'] = os.environ['HTTPS_PROXY'] = os.environ['https_proxy'] = "PROXY STRING"
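
The same per-task workaround could be wrapped in a small helper so that each task makes a single call. This is only a sketch: `apply_proxy` and the example proxy URL are hypothetical names, not part of Airflow, and it still has to be called in every task that needs it.

```python
import os
from typing import MutableMapping


def apply_proxy(
    proxy_url: str,
    no_proxy: str = "",
    env: MutableMapping[str, str] = os.environ,
) -> None:
    """Set the proxy variables in both lowercase and uppercase spellings.

    Hypothetical helper: writes into `env` (os.environ by default), so it
    only affects the current process and any children it starts afterwards.
    """
    for name in ("http_proxy", "https_proxy"):
        env[name] = proxy_url
        env[name.upper()] = proxy_url
    if no_proxy:
        env["no_proxy"] = no_proxy
        env["NO_PROXY"] = no_proxy


if __name__ == "__main__":
    # Placeholder values; replace with the real proxy string.
    apply_proxy("http://proxy.example.com:80", no_proxy="127.0.0.1")
    print({k: v for k, v in os.environ.items() if k.lower().endswith("_proxy")})
```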

I was hoping to find a way to define these variables for all DAGs, but when I set them the way Tomasz suggested, they don't seem to be usable unless they start with the "AIRFLOW" prefix.

1 Answer

Creating an environment file and putting it in some location is not sufficient. You have to tell Airflow where that file is when it starts, however you start it (e.g. systemd).

Airflow only sees the environment it is started with. When you run Airflow under systemd, you specify the EnvironmentFile you want Airflow to use in the [Service] section of the unit file. Environment variables not defined in that file will not be picked up by Airflow. The same EnvironmentFile line has to be present in the unit for whichever process actually runs your tasks (with CeleryExecutor, that is the Celery worker). Your unit files may look different from mine, but here is mine as an example:

[Unit]
Description=Airflow webserver daemon
After=network.target mysqld.service rabbitmq-server.service
Wants=mysqld.service rabbitmq-server.service

[Service]
EnvironmentFile=/prod/airflow/airflow.env
User=airflow
Group=airflow
Type=simple
ExecStart=/usr/bin/bash -c "source /prod/airflow/airflow_38_venv/bin/activate ; /prod/airflow/airflow_38_venv/bin/airflow webserver -p 7635 --pid /prod/airflow/run/webserver.pid"
Restart=on-failure
RestartSec=5s
PrivateTmp=true

[Install]
WantedBy=multi-user.target

EnvironmentFile can point to any location and filename that the user running Airflow has read access to. The suggested path is /etc/sysconfig/airflow, but as you can see mine differs from that recommendation.

Here is what the body of my EnvironmentFile looks like, edited to remove specific details. Again, yours will probably look different.

$ cat /prod/airflow/airflow.env
# This file is the environment file for Airflow. Put this file in /etc/sysconfig/airflow per default
# configuration of the systemd unit files.
#
AIRFLOW_CONFIG=/prod/airflow/airflow.cfg
AIRFLOW_HOME=/prod/airflow
http_proxy=http://something.proxyserver.com:80
https_proxy=http://something.proxyserver.com:80
no_proxy=*.google.com,127.0.0.1
HTTP_PROXY=http://something.proxyserver.com:80
HTTPS_PROXY=http://something.proxyserver.com:80
NO_PROXY=*.google.com,127.0.0.1
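
To confirm that the variables from the EnvironmentFile actually reach your tasks after restarting the services, a check along these lines could be run inside a task (for example as the callable of a PythonOperator). This is a minimal sketch: `proxy_settings` is a hypothetical helper name, and it assumes the worker service was started with the same EnvironmentFile.

```python
import os
from typing import Dict, Mapping

PROXY_VARS = ("http_proxy", "https_proxy", "no_proxy")


def proxy_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the proxy variables found in `env`, keyed by lowercase name.

    Hypothetical helper: checks the lowercase spelling first and falls
    back to the uppercase one, since tools differ in which they read.
    """
    found = {}
    for name in PROXY_VARS:
        value = env.get(name) or env.get(name.upper())
        if value:
            found[name] = value
    return found


if __name__ == "__main__":
    settings = proxy_settings()
    if settings:
        print("Proxy variables visible to this process:", settings)
    else:
        print("No proxy variables visible; check the unit's EnvironmentFile.")
```

If this prints no proxy variables from inside a task but does print them from a login shell, the process running the task was most likely started without the environment file.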