Problem description

Many of our company's services call external services through nginx. When business traffic ramps up, the upstream logs show a large number of connections being created, and the growing connection count puts enormous pressure on the servers. This post covers the performance work done to address that. The optimizations are:
1. Replace the base nginx image with Taobao's Tengine. Tengine's upstream supports keepalive as well as dynamic domain-name resolution, and it can be compiled with Lua support; the modules involved are ngx_http_upstream_keepalive_module and ngx_http_upstream_dynamic_module. Since there is no officially published Tengine Docker image, we have to compile and package one ourselves.
2. Adjust the pod's DNS options to optimize search-domain lookups.
3. Tune the configuration file parameters, mainly for HTTPS and WebSocket.
Implementation details

1. Compiling the Tengine image

There is not much to say here; the Dockerfile was worked out one pitfall at a time. Here it is:
```dockerfile
FROM debian:buster-slim

LABEL maintainer="VisIon <yuyiwei@bituniverse.org>"

ENV TENGINE_VERSION 2.3.2

ENV CONFIG "\
    --user=nginx \
    --group=nginx \
    --prefix=/etc/nginx \
    --sbin-path=/usr/sbin/nginx \
    --conf-path=/etc/nginx/nginx.conf \
    --lock-path=/var/lock/nginx.lock \
    --pid-path=/var/run/nginx.pid \
    --error-log-path=/var/log/nginx/error.log \
    --http-log-path=/var/log/nginx/access.log \
    --http-client-body-temp-path=/var/cache/nginx/client_temp \
    --http-proxy-temp-path=/var/cache/nginx/proxy_temp \
    --http-fastcgi-temp-path=/var/cache/nginx/fastcgi_temp \
    --http-uwsgi-temp-path=/var/cache/nginx/uwsgi_temp \
    --http-scgi-temp-path=/var/cache/nginx/scgi_temp \
    --with-http_ssl_module \
    --with-http_gzip_static_module \
    --with-http_gunzip_module \
    --with-http_auth_request_module \
    --with-http_image_filter_module \
    --with-http_addition_module \
    --with-http_dav_module \
    --with-http_realip_module \
    --with-http_v2_module \
    --with-http_stub_status_module \
    --with-http_sub_module \
    --with-http_xslt_module \
    --with-http_flv_module \
    --with-http_mp4_module \
    --with-http_secure_link_module \
    --with-mail_ssl_module \
    --with-stream \
    --add-module=modules/ngx_http_lua_module \
    --with-http_degradation_module \
    --add-module=modules/ngx_http_upstream_dynamic_module \
    --add-module=modules/ngx_http_upstream_session_sticky_module \
    "

RUN set -x \
    # create the nginx user/group the worker processes will run as
    && addgroup --system --gid 101 nginx \
    && adduser --system --disabled-login --ingroup nginx --no-create-home --home /nonexistent --gecos "nginx user" --shell /bin/false --uid 101 nginx \
    # install build dependencies and runtime tools
    && apt-get update \
    && apt-get -y install \
        gcc \
        wget \
        libxslt-dev \
        libxml2-dev \
        libc-dev \
        make \
        linux-libc-dev \
        curl \
        libpcre3 \
        libpcre3-dev \
        openssl \
        libssl-dev \
        zlib1g \
        zlib1g-dev \
        libgd-dev \
        libjemalloc-dev \
        libjemalloc2 \
        net-tools \
        procps \
    # fetch and unpack the tengine source
    && curl "http://tengine.taobao.org/download/tengine-$TENGINE_VERSION.tar.gz" -o tengine.tar.gz \
    && mkdir -p /usr/src \
    && mkdir -p /var/cache/nginx \
    && chown nginx:nginx /var/cache/nginx \
    && tar -zxC /usr/src -f tengine.tar.gz \
    && rm tengine.tar.gz \
    && cd /usr/src/tengine-$TENGINE_VERSION \
    # build LuaJIT, required by ngx_http_lua_module
    && curl -L "https://github.com/openresty/luajit2/archive/v2.1-20200102.tar.gz" -o v2.1-20200102.tar.gz \
    && tar -zxC /usr/src/tengine-$TENGINE_VERSION -f v2.1-20200102.tar.gz \
    && rm v2.1-20200102.tar.gz \
    && cd luajit2-2.1-20200102 && make && make install PREFIX=/usr/local/luajit \
    && export LUAJIT_LIB=/usr/local/luajit/lib \
    && export LUAJIT_INC=/usr/local/luajit/include/luajit-2.1 \
    && echo '/usr/local/luajit/lib' >> /etc/ld.so.conf.d/usr_local_luajit_lib.conf \
    && ldconfig \
    && cd .. \
    # configure, build, and install tengine
    && ./configure $CONFIG \
    && make \
    && make install \
    && ln -s /usr/local/luajit/lib/libluajit-5.1.so.2 /lib64/libluajit-5.1.so.2 \
    && rm -rf /etc/nginx/html/ \
    && mkdir /etc/nginx/conf.d/ \
    && mkdir -p /usr/share/nginx/html/ \
    && install -m644 html/index.html /usr/share/nginx/html/ \
    && install -m644 html/50x.html /usr/share/nginx/html/ \
    && strip /usr/sbin/nginx* \
    && cd /etc/nginx \
    && rm -rf /usr/src/tengine-$TENGINE_VERSION \
    # send logs to the container's stdout/stderr
    && ln -sf /dev/stdout /var/log/nginx/access.log \
    && ln -sf /dev/stderr /var/log/nginx/error.log

COPY nginx.conf /etc/nginx/nginx.conf

EXPOSE 80

CMD ["nginx", "-g", "daemon off;"]
```
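To build the image and confirm that the keepalive and dynamic-resolve modules made it into the binary, something like the following should work (the image tag here is just an example):

```sh
# build from the directory containing the Dockerfile and nginx.conf
docker build -t tengine:2.3.2 .

# nginx -V prints the configure arguments; look for the --add-module lines
docker run --rm tengine:2.3.2 nginx -V
```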
There is nothing special to say about default.conf, and nginx.conf differs from setup to setup. Here is mine:

nginx.conf
```nginx
user nginx;
worker_processes auto;
pid /var/run/nginx.pid;
worker_rlimit_nofile 16384;

events {
    worker_connections 8192;
}

http {
    server_names_hash_bucket_size 128;

    map $http_upgrade $connection_upgrade {
        default upgrade;
        ''      close;
    }

    log_format main '$remote_addr $host $remote_user [$time_local] "$request" '
                    '$status $bytes_sent "$http_referer" "$http_user_agent" '
                    '$request_time $upstream_response_time $upstream_addr';

    client_header_timeout 300s;  # if the client sends nothing within this time, nginx returns 408 "Request time out"
    client_body_timeout 300s;
    send_timeout 300s;           # if the client accepts nothing within this time, nginx shuts down the connection

    proxy_connect_timeout 10s;
    proxy_read_timeout 10s;
    proxy_send_timeout 10s;

    connection_pool_size 256;
    client_header_buffer_size 1k;
    large_client_header_buffers 4 2k;
    request_pool_size 4k;
    output_buffers 4 32k;
    postpone_output 1460;

    sendfile on;
    tcp_nopush on;
    keepalive_timeout 60 30;
    tcp_nodelay on;

    real_ip_header X-Forwarded-For;
    set_real_ip_from 0.0.0.0/0;

    client_max_body_size 10m;
    client_body_buffer_size 256k;

    gzip on;
    gzip_min_length 1100;
    gzip_comp_level 4;
    gzip_buffers 4 32k;
    gzip_types application/json text/plain application/x-javascript text/xml text/css;

    ignore_invalid_headers on;

    resolver 172.20.0.10 valid=10s ipv6=off;  # only valid in bthub-vpc oregon
    limit_req_status 555;                     # map rate-limited requests to status code 555
    proxy_ssl_server_name on;                 # send SNI when proxying to https upstreams

    access_log /var/log/nginx/access.log main;
    error_log /var/log/nginx/error.log;

    include /etc/nginx/conf.d/*.conf;
    include /etc/nginx/sites-enabled/*;
}
```
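Note that the $connection_upgrade map only takes effect where it is referenced. As a minimal sketch (the upstream name and path are placeholders), a WebSocket-capable location would look something like:

```nginx
location /ws/ {
    proxy_pass http://pionex-broker;          # hypothetical upstream, defined in a per-service conf
    proxy_http_version 1.1;                   # WebSocket requires HTTP/1.1
    proxy_set_header Upgrade $http_upgrade;   # forward the client's Upgrade header
    proxy_set_header Connection $connection_upgrade;  # "upgrade" for WS handshakes, "close" otherwise
}
```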
At this point the tengine 2.3.2 image is complete. The reason for using debian:buster-slim as the base image is that it ships glibc; Alpine uses musl libc, which does not support the resolver option we need below.
2. Optimizing search-domain lookups

This is the part that has to be changed in the Kubernetes Deployment. The main idea is to lower the default ndots value (Kubernetes defaults it to 5), so that external domain names are resolved directly instead of first being tried against every cluster search domain. Here is a demo:
```yaml
dnsConfig:
  options:
    - name: single-request-reopen
    - name: ndots
      value: "4"
dnsPolicy: ClusterFirst
```
Note that the single-request-reopen option is not supported on Alpine (musl's resolver ignores it), which is another reason debian:buster-slim was chosen.
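To check that the options actually landed in the pod, you can inspect the rendered resolver configuration (the pod name below is a placeholder):

```sh
# kubelet renders dnsConfig into the pod's /etc/resolv.conf
kubectl exec my-nginx-pod -- cat /etc/resolv.conf
# expect a line like: options ndots:4 single-request-reopen
```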
3. Configuration pitfalls

This is mainly about the per-service configuration:
```nginx
upstream pionex-broker {
    dynamic_resolve fallback=stale fail_timeout=30s;
    server xxx.com:443;
    keepalive 32;
    keepalive_timeout 30s;  # maximum idle time for a backend connection
}
```
If the upstream is HTTPS, the server entry must include the port (:443); otherwise nginx reports an error.
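For the keepalive pool to actually be reused, the proxying side must speak HTTP/1.1 and clear the Connection header, per the ngx_http_upstream_keepalive_module documentation. A minimal sketch, with the location path and Host value as placeholders:

```nginx
location / {
    proxy_pass https://pionex-broker;  # draws connections from the keepalive pool
    proxy_http_version 1.1;            # upstream keepalive requires HTTP/1.1
    proxy_set_header Connection "";    # strip "Connection: close" so the connection stays open
    proxy_set_header Host xxx.com;     # real upstream host, not the upstream block name
    proxy_ssl_name xxx.com;            # SNI to send; pairs with proxy_ssl_server_name on
}
```

Without proxy_ssl_name, nginx would use the upstream block name ("pionex-broker") as the SNI value, which most HTTPS backends will not accept.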
That resolves the problem: using upstream keepalive together with dynamic_resolve effectively removes the performance pressure caused by an excessive number of connections. In our tests, the connection count dropped from tens of thousands to a few hundred almost instantly, and latency fell from over ten milliseconds to a few milliseconds.
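As a rough way to reproduce the measurement, you can count established connections to the upstream from inside the nginx container before and after the change (the :443 filter assumes all upstreams are HTTPS):

```sh
# count established TCP connections to upstream port 443
ss -tn state established '( dport = :443 )' | tail -n +2 | wc -l
```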