Netty+HTTP/2:一个诡异的超时问题

背景

在描述遇到的这个问题之前，需要先介绍一下背景，是在怎样一个环境下遇到的。

服务端是基于netty实现的http server，我选择直接使用了RxNetty的http server封装。

这个server的本质是一个代理服务器，在这里先称之为proxyserver，它将客户端发送过来的请求转发给其他后端服务，proxyserver与其他服务之间也是通过HTTP进行通信。

和server的实现类似，proxyserver使用的httpclient也是基于RxNetty的封装，底层都是由netty完成。

OK，整个服务端逻辑其实非常简单，在Netty的ChannelHandler内只做了请求转发这一件事，一直以来运行也没有碰到过任何问题，直到有一天，我在proxyserver和后端服务之间开启了HTTP/2。

说到这里，有人可能会觉得奇怪，为什么要在代理服务器和后端服务之间使用HTTP/2呢，常见的反代服务Nginx也不支持这种功能，因为根据Nginx官方的测试数据显示，在代理服务器和后端服务之间开始HTTP/2反而会使得耗时增加，不如HTTP/1.1。

之所以在知道这个结论的基础上也要使用HTTP/2的原因有两个：

1.由于都是内部服务，这里使用的是HTTP/2的h2c实现，而nginx的测试是使用了h2，需要对数据进行加密，而h2c则不需要对数据进行加密，详细可以看规范介绍：https://httpwg.org/specs/rfc7540.html#versioning

所以在响应耗时方面不会存在劣势。

2.本文所述的proxyserver并不是为了取代Nginx，它的作用类似于sidecar的实现，所以更关注于在系统资源和网络开销上，而HTTP/2协议在这方面相比HTTP/1.1有比较明显的优势。

回到正题，今天要描述的一个问题就是在这样一个服务开启HTTP/2之后碰到的。

问题排查过程

某日，在各服务间开启HTTP/2后，有一个proxyserver实例始终会出现请求超时现象，而且100%可以复现，但是同样代码部署的其他实例却没有任何问题。尝试把HTTP/2关闭后，这个问题立刻消失了。

为了查明原因，看清楚请求为何超时，将对应的debug日志开启，并且抓包进行了分析，得到了如下日志：

2018-09-09 10:35:48,603 INFO [proxyserver.http.HttpRequestHandler] - method:GET; uri:/a/b/c;
2018-09-09 10:35:48,604 DEBUG [io.netty.handler.logging.LoggingHandler] - [id: 0x5a8ab566] REGISTERED
2018-09-09 10:35:53,604 DEBUG [io.netty.handler.logging.LoggingHandler] - [id: 0x6ae0666a] REGISTERED
2018-09-09 10:35:58,605 DEBUG [io.netty.handler.logging.LoggingHandler] - [id: 0xf50769cb] REGISTERED
2018-09-09 10:36:03,606 DEBUG [io.netty.handler.logging.LoggingHandler] - [id: 0x2d4c5027] REGISTERED
2018-09-09 10:36:08,606 DEBUG [io.netty.handler.logging.LoggingHandler] - [id: 0x43461a61, L:/127.0.0.1:6789 - R:/127.0.0.1:13098] READ COMPLETE
2018-09-09 10:36:08,606 DEBUG [io.netty.handler.logging.LoggingHandler] - [id: 0x43461a61, L:/127.0.0.1:6789 - R:/127.0.0.1:13098] FLUSH
2018-09-09 10:36:08,606 DEBUG [io.netty.handler.logging.LoggingHandler] - [id: 0x581de2a6] CONNECT: /10.1.8.126:80
2018-09-09 10:36:08,606 DEBUG [io.netty.handler.logging.LoggingHandler] - [id: 0x04bc9acd] CONNECT: /10.1.8.126:80
2018-09-09 10:36:08,606 DEBUG [io.netty.handler.logging.LoggingHandler] - [id: 0x214017e5] CONNECT: /10.1.9.103:80
2018-09-09 10:36:08,606 DEBUG [io.netty.handler.logging.LoggingHandler] - [id: 0x25477082] CONNECT: /10.1.9.103:80

奇怪的是，proxyserver向其他服务发出的请求始终是在proxyserver接受到请求之后20秒，并且根据对应的抓包数据分析，proxyserver和另一个服务的TCP连接就是20秒后才建立。

虽说HTTP/2关闭后就恢复正常了，可是根据这个现象来看完全不符合逻辑，应用层协议的不同为什么会影响到TCP连接的建立？

根据抓包数据来看，proxyserver的确是20秒后才发出了第一个SYN包开始建立TCP连接。而且同样的代码，为什么只有其中一个实例有问题？于是我开始怀疑环境的差异。

首先我考虑了操作系统层面的问题，是否是操作系统层面的某些因素导致连接延迟建立了呢？如果有，应该会有对应的内核日志信息。确认了一遍内核日志以及回头看netty的debug日志，我排除了这种可能性。

然后我反复确认了netty在建立连接时的代码实现，就如之前所说的，完全看不出，也不应该看出在这个层面HTTP/1.x和HTTP/2的区别：

final ChannelFuture connectFuture = clientBootstrap.connect(serverInfo.getHost(), serverInfo.getPort());

硬要说区别的话只有HTTP/2使用的ChannelHandler不同，但是，在连接没有建立，channel没有初始化的情况下，这些差别根本没有任何影响。

继续找环境的差异，但是proxyserver采用的是docker部署，环境也不应该有差别。正当我放弃这个念头的时候，突然发现，环境还的确是有差异的，没错，区别就是CPU核数和内存大小。

OK，确认了这个区别以后我开始对比实验，有问题的实例分配的资源是1核2G，其他没问题的实例都是2核4G，有问题的那台因为某些特殊原因导致分配的资源规格不统一。

为了验证是否CPU核数和内存大小的区别导致了这个问题，我分别把资源设置为2核2G和1核4G进行了对照实验，结果表明：只要是1核的都跪了！！！

到这里，至少是找到一点线索了，只有单核就不能开HTTP/2这并不能接受啊，一想到CPU核数会造成影响，自然把目光瞄准了netty和RxNetty，没错，这些框架是经常会通过调用Runtime.availableProcessors()来判断CPU核数的。

查看了RxNetty的build逻辑后确认了这一点：

public C build() {
        if (null == socketChannel) {
            socketChannel = defaultSocketChannelClass();
            if (null == eventLoopGroup) {
                eventLoopGroup = defaultEventloop(socketChannel);
            }
        }
        if (null == eventLoopGroup) {
            if (defaultSocketChannelClass() == socketChannel) {
                eventLoopGroup = defaultEventloop(socketChannel);
            } else {
                // Fail fast for defaults we do not support.
                throw new IllegalStateException("Specified a channel class but not the event loop group.");
            }
        }
        bootstrap.channel(socketChannel).group(eventLoopGroup);
        if (null != wireLogginLevel) {
            pipelineConfigurator = PipelineConfigurators.appendLoggingConfigurator(pipelineConfigurator,
                                                                                   wireLogginLevel);
        }
        if(null != sslEngineFactory) {
            appendPipelineConfigurator(PipelineConfigurators.<O,I>sslConfigurator(sslEngineFactory));
        }
        final C client = createClient();
        if (null != eventListenersFactory) {
            final MetricEventsListener<? extends ClientMetricsEvent<?>> listener =
                    newMetricsListener(eventListenersFactory, client);
            client.subscribe(listener);
        }
        return client;
    }

查看defaultEventloop的逻辑发现，这里的EventLoopGroup由一个名为SingleNioLoopProvider的对象创建

protected EventLoopGroup defaultEventloop(Class<? extends Channel> socketChannel) { return RxNetty.getRxEventLoopProvider().globalClientEventLoop(true); // get native eventloop if configured.}

RxNetty封装了一个SharedNioEventLoopGroup，服务端和客户端共用了这个EventLoopGroup。

private static volatile RxEventLoopProvider rxEventLoopProvider =
            new SingleNioLoopProvider(1, Runtime.getRuntime().availableProcessors());
public SingleNioLoopProvider(int parentEventLoopCount, int childEventLoopCount) {
        this.parentEventLoopCount = parentEventLoopCount;
        this.childEventLoopCount = childEventLoopCount;
        parentEventLoop = new SharedNioEventLoopGroup(parentEventLoopCount);
        eventLoop = new SharedNioEventLoopGroup(childEventLoopCount);
        nativeParentEventLoop = new AtomicReference<EpollEventLoopGroup>();
        nativeEventLoop = new AtomicReference<EpollEventLoopGroup>();
    }

由于这里创建的NioEventLoopGroup对应的线程数是由CPU核数确定的，所以单核情况下就只有1个线程。

看到这里，似乎有点头绪了，服务端和客户端共用一个EventLoopGroup，并且只有一个线程在处理事件。那么剩下的就是确认一下netty的处理逻辑。

netty在创建连接时，首先会先注册channel，然后向EventLoop提交任务，源码部分如下：

private ChannelFuture doResolveAndConnect(final SocketAddress remoteAddress, final SocketAddress localAddress) {
    final ChannelFuture regFuture = initAndRegister();
    final Channel channel = regFuture.channel();
 
    if (regFuture.isDone()) {
        if (!regFuture.isSuccess()) {
            return regFuture;
        }
        return doResolveAndConnect0(channel, remoteAddress, localAddress, channel.newPromise());
    } else {
        // Registration future is almost always fulfilled already, but just in case it's not.
        final PendingRegistrationPromise promise = new PendingRegistrationPromise(channel);
        regFuture.addListener(new ChannelFutureListener() {
            @Override
            public void operationComplete(ChannelFuture future) throws Exception {
                // Directly obtain the cause and do a null check so we only need one volatile read in case of a
                // failure.
                Throwable cause = future.cause();
                if (cause != null) {
                    // Registration on the EventLoop failed so fail the ChannelPromise directly to not cause an
                    // IllegalStateException once we try to access the EventLoop of the Channel.
                    promise.setFailure(cause);
                } else {
                    // Registration was successful, so set the correct executor to use.
                    // See https://github.com/netty/netty/issues/2586
                    promise.registered();
                    doResolveAndConnect0(channel, remoteAddress, localAddress, promise);
                }
            }
        });
        return promise;
    }
}
 
 
private static void doConnect(
        final SocketAddress remoteAddress, final SocketAddress localAddress, final ChannelPromise connectPromise) {
 
    // This method is invoked before channelRegistered() is triggered.  Give user handlers a chance to set up
    // the pipeline in its channelRegistered() implementation.
    final Channel channel = connectPromise.channel();
    channel.eventLoop().execute(new Runnable() {
        @Override
        public void run() {
            if (localAddress == null) {
                channel.connect(remoteAddress, connectPromise);
            } else {
                channel.connect(remoteAddress, localAddress, connectPromise);
            }
            connectPromise.addListener(ChannelFutureListener.CLOSE_ON_FAILURE);
        }
    });
}

对NioEventLoop进行debug后确认，在客户端发起请求时，成功向netty的任务队列提交了任务，但是在polltask的时候始终没有获取到建立连接的任务：

@Override
public void execute(Runnable task) {
    if (task == null) {
        throw new NullPointerException("task");
    }
 
    boolean inEventLoop = inEventLoop();
    addTask(task);
    if (!inEventLoop) {
        startThread();
        if (isShutdown() && removeTask(task)) {
            reject();
        }
    }
 
    if (!addTaskWakesUp && wakesUpForTask(task)) {
        wakeup(inEventLoop);
    }
}
 
 
/**
 * Poll all tasks from the task queue and run them via {@link Runnable#run()} method.  This method stops running
 * the tasks in the task queue and returns if it ran longer than {@code timeoutNanos}.
 */
protected boolean runAllTasks(long timeoutNanos) {
    fetchFromScheduledTaskQueue();
    Runnable task = pollTask();
    if (task == null) {
        afterRunningAllTasks();
        return false;
    }
    final long deadline = ScheduledFutureTask.nanoTime() + timeoutNanos;
    long runTasks = 0;
    long lastExecutionTime;
    for (;;) {
        safeExecute(task);
        runTasks ++;
        // Check timeout every 64 tasks because nanoTime() is relatively expensive.
        // XXX: Hard-coded value - will make it configurable if it is really a problem.
        if ((runTasks & 0x3F) == 0) {
            lastExecutionTime = ScheduledFutureTask.nanoTime();
            if (lastExecutionTime >= deadline) {
                break;
            }
        }
        task = pollTask();
        if (task == null) {
            lastExecutionTime = ScheduledFutureTask.nanoTime();
            break;
        }
    }
    afterRunningAllTasks();
    this.lastExecutionTime = lastExecutionTime;
    return true;
}

那么为什么Connect任务一直没有被执行呢？

从逻辑上推断，这里只有两种可能性：

1.任务队列堆积

2.单线程情况下，之前的任务阻塞了

通过debug，已经排除了任务队列堆积的可能性，那么就只剩下第2种可能了，Connect之前的任务有Register和netty处理请求的ChannelRead。

首先Register不是阻塞的，而且Register在任务提交时就已经执行了，并且通过日志就可以确定Register没有阻塞，execute方法中，判断不在EventLoop中则会在addTask之后直接startThread。

而startThread方法内部会判断NioEventLoop当前的状态，如果是第一次调用，则会直接启动该线程：

private void startThread() {
    if (state == ST_NOT_STARTED) {
        if (STATE_UPDATER.compareAndSet(this, ST_NOT_STARTED, ST_STARTED)) {
            try {
                doStartThread();
            } catch (Throwable cause) {
                STATE_UPDATER.set(this, ST_NOT_STARTED);
                PlatformDependent.throwException(cause);
            }
        }
    }
}

排除了Register那么阻塞的只有可能是ChannelRead了。

但是ChannelRead为什么会阻塞呢？

阻塞来源于HTTP/2的协议升级，由于最开始要对协议进行升级，确认服务端是否支持HTTP/2，有个upgrade过程，h2c的实现是先发送一个HTTP/1.1请求包含upgrade header来完成，这个请求是阻塞的。

Options / HTTP/1.1
Host: xxx.com
Connection: Upgrade, HTTP2-Settings
Upgrade: h2c
HTTP2-Settings: <base64url encoding of HTTP/2 SETTINGS payload>

于是问题就来了，在单核情况下，RxNetty创建的eventloopGroup的executor只有一个线程，ChannelRead必须等到请求发送完毕后才算执行完毕。

发送HTTP/2必须先建立TCP连接且升级协议，Connect任务又必须等Channel Read执行完毕后才能执行，如此便进入了一个循环等待的过程，直到Channel Read超时。

而在HTTP/1.1的场景下由于发起请求完全不存在阻塞的情况，只需要将请求提交即可，任务队列中的任务也就不会存在依赖关系。

整个流程如下图所示：

解决方案

确定超时的原因是HTTP/2的协议升级造成的那么首先是先想到如何从协议升级这边去优化。

考虑到协议升级属于初始化操作，我们可以选择将建立连接操作设置在应用初始化阶段而不是在处理请求阶段。

这种方案有比较明显的漏洞：

第一，如果连接断开了需要重新建立连接，那么始终无法避免在请求阶段去创建。

第二，如果客户端和服务端之间长时间没有请求，那么服务端会主动关闭空闲的连接，进而变成第一个问题。

既然无法避免在请求阶段建立连接，那么就需要想办法将协议升级操作变成非阻塞的。

但由于使用HTTP/2的情况下，只有在协议升级结束后才能确定是否可以开启HTTP/2，所以一旦这一步变成了非阻塞的，那么我们只有在协议升级完毕之前创建新的连接使用HTTP/1.1了。

虽然在开启HTTP/2之前有部分请求是采用HTTP/1.1的，但是通过这种方式，就避免了在单核情况下产生循环依赖问题导致超时了。

当然我们也可以主动把对应的EventLoopGroup使用的线程数进行一些修改来避免使用1个线程的情况。

看清本质以后，会发现这个问题其实很简单，而且是在使用netty编程的过程中常遇到的：尽量不要在ChannelHandler中使用阻塞逻辑。

而本例由于只有在建立连接时才会出现阻塞，使得在实现时疏忽了这一点，从而碰到了这样看似诡异的问题。

关注我们

酷家乐质量效能团队热衷于技术的成长和分享，几乎每个月都会举办技术分享活动（海星日），每半年举办一次技术专题竞赛分享（火星日），并将优秀内容写成技术文章。

我们尽可能保障分享到社区的内容，是我们用心编写、精心挑选的优质文章。如果您想更全面地阅读我们的文章，请您关注我们的微信公众号"酷家乐技术质量"。