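Symptoms

Starting with a certain release, instances of one of our production event-tracking collection services began restarting intermittently because of OOM, causing jitter in downstream consumption latency

Initial investigation pointed to an off-heap memory leak: the 3 GB heap showed no OOM, but the JVM process RSS had reached 5 GB, equal to the 5 GB container limit, which triggered the OOM restart

Investigation

Use pmap to inspect the process's memory segments: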

pmap -x {your_pid} | sort -k 3 -n | tail -n 20

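The output shows many off-heap segments of a few dozen MB each:

Dump some of these segments with gdb and inspect their contents with strings:

First find the start and end addresses of the mapped segment in smaps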

cat /proc/{your_pid}/smaps | grep 7fbe7c

Attach gdb to the JVM process and dump the corresponding segment

gdb -p {your_pid}
dump memory mem.bin 0x7fbe7c000000 0x7fbe7dc24000
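
The size of the range being dumped follows directly from the two addresses in the smaps header line. A minimal Java sketch, assuming the usual `start-end perms ...` header format of /proc/{pid}/smaps; the class name, the method name, and the permission/offset fields in the sample line are illustrative assumptions, not output captured from the incident:

```java
/** Hypothetical helper: computes the size of a mapping from an smaps header line. */
final class SmapsSegment {

    /**
     * Parses a header line such as
     * "7fbe7c000000-7fbe7dc24000 rw-p 00000000 00:00 0"
     * and returns the mapping size in bytes (end address minus start address).
     */
    static long sizeBytes(String headerLine) {
        // The first whitespace-separated token is the "start-end" address range.
        String range = headerLine.trim().split("\\s+")[0];
        int dash = range.indexOf('-');
        long start = Long.parseUnsignedLong(range.substring(0, dash), 16);
        long end = Long.parseUnsignedLong(range.substring(dash + 1), 16);
        return end - start;
    }

    public static void main(String[] args) {
        long bytes = sizeBytes("7fbe7c000000-7fbe7dc24000 rw-p 00000000 00:00 0");
        System.out.println(bytes + " bytes, about " + bytes / (1024 * 1024) + " MiB");
    }
}
```

For the range used in the dump command above this gives 29507584 bytes (about 28 MiB), in line with the segments of a few dozen MB seen in the pmap output.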

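Run strings on the dumped file to inspect its contents. If the contents are text, you can usually tell which component wrote them. An off-heap OOM like this is most likely caused by a component that makes native calls, where memory allocated by C/C++ code, or allocated off-heap for zero-copy operations in network I/O or broker-style components, is never freed. (This case was a bit unusual: the component did free the memory, but glibc's malloc implementation did not actually return it to the operating system.)

To stop the bleeding and verify the hypothesis, you can try calling malloc_trim manually from the gdb shell (note that this may crash the JVM, so use it with care)

call (int) malloc_trim(0)

If a manual trim releases the memory, the cause is very likely glibc malloc not returning memory to the operating system. In short, to reduce system calls and context switches, glibc does not necessarily return memory that the application has freed; it keeps it for reuse by later malloc calls (saving a few system calls). The unreturned segments therefore keep occupying memory until the process is OOM-killed.

Solution

Switch to jemalloc and tune its parameters to balance this behavior.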

See also: jemalloc integration

The following LD_PRELOAD shim turns on glibc mtrace to record allocations:

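#include <mcheck.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

/* Run automatically when the shared library is loaded and unloaded. */
void __mtracer_on(void) __attribute__((constructor));
void __mtracer_off(void) __attribute__((destructor));

void __mtracer_on(void)
{
    /* Base path of the trace file; defaults to ./malloc_trace. */
    const char *p = getenv("MALLOC_TRACE");
    char tracebuf[1024];
    if (!p)
        p = "malloc_trace";
    /* Append the pid so that every process writes its own trace file. */
    snprintf(tracebuf, sizeof(tracebuf), "%s.%d", p, (int) getpid());
    setenv("MALLOC_TRACE", tracebuf, 1);
    atexit(&__mtracer_off);
    mtrace();
}

void __mtracer_off(void)
{
    muntrace();
}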

Build:

gcc mtrace.c -fPIC -shared -o /tmp/libmtrace.so
vi /data/infra-apps/xxx/conf/xxx.sh
 
export MALLOC_TRACE="/tmp/malloc_trace.log"
export LD_PRELOAD="/tmp/libmtrace.so"
 
mtrace malloc_trace.log.478
mtrace malloc_trace.log.478 | sed '1,/Caller/d'|awk '{s[$NF]+=strtonum($2);n[$NF]++;}END{for(k in s){print k,n[k],s[k]}}'|column -t | sort -k 3 -n

Calling C / C++ / Golang from Java via JNI

Author: syf
Date: 2024-02-06
Category: Technology

C / C++

First define the Java native interface:

package jni;

/**
 * @author syf
 * @date 2024/2/6
 **/
public class JniMath {

    static {
        System.load("/tmp/libzichecpp.so");
        System.load("/tmp/libzichego.so");
    }

    public static native long multiply(long x, long y);

    public static native long multiplygo(long x, long y);

    public static void main(String[] args) {
        System.out.println(JniMath.multiply(12345, 67890));
        System.out.println(JniMath.multiplygo(12345, 67890));
    }
}

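Generate the matching header file with javah (javah was removed in JDK 10; on newer JDKs use javac -h instead)

javah jni.JniMath

The generated jni_JniMath.h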

/* DO NOT EDIT THIS FILE - it is machine generated */
#include <jni.h>
/* Header for class jni_JniMath */

#ifndef _Included_jni_JniMath
#define _Included_jni_JniMath
#ifdef __cplusplus
extern "C" {
#endif
/*
 * Class:     jni_JniMath
 * Method:    multiply
 * Signature: (JJ)J
 */
JNIEXPORT jlong JNICALL Java_jni_JniMath_multiply
  (JNIEnv *, jclass, jlong, jlong);

#ifdef __cplusplus
}
#endif
#endif

Implement the function declared in the header file:

#include "jni_JniMath.h"

JNIEXPORT jlong JNICALL Java_jni_JniMath_multiply
  (JNIEnv * env, jclass clazz, jlong argX, jlong argY) {

  return argX * argY;
}

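Compile the C source. Replace the JAVA_HOME header directories with the ones for your system (include/darwin is the macOS one; on Linux it is include/linux)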

gcc -shared -fPIC -I$JAVA_HOME/include -I$JAVA_HOME/include/darwin src/jni/jni_JniMath.c -o /tmp/libzichecpp.so

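For the call site, see the Java code above.

Golang

Go has supported the c-shared build mode since version 1.5, which lets Go code be built into a shared library that other languages can load

As before, replace the header directories in the cgo CFLAGS comments below with the ones for your system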

package main

// #cgo CFLAGS: -I/Users/syf/opt/zulu-jdk-8/include
// #cgo CFLAGS: -I/Users/syf/opt/zulu-jdk-8/include/darwin
// #include <jni.h>
import "C"

//export Java_jni_JniMath_multiplygo
func Java_jni_JniMath_multiplygo(env *C.JNIEnv, clazz C.jclass, x C.jlong, y C.jlong) C.jlong {
    return x * y
}

// -buildmode=c-shared requires package main with a main function;
// it is never called when the library is loaded from Java.
func main() {}

Build:

go build -buildmode=c-shared -o /tmp/libzichego.so src/jni/jni_math.go

For the call site, see the Java code above.

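Scenario

A low-level service in the call chain, CCC, sends business MQ messages from a scheduled compensation job; these MQ messages carry no sub-version.

Configuring MQ traffic diversion on the downstream consumer solves direct consumption. However, if the consumer logic then calls other services, it originates headless traffic (requests without a version), which lands in the baseline environment.

Below, foo, bar, aaa, bbb, and ccc are all microservices in the call chain. The expectation is that traffic reaches the v123 virtual-environment version of ccc; in practice it reaches the baseline ccc.

foo-v123 -> bar-v123 -> baseline aaa scheduled job -> MQ
                                                      |-> bbb-v123 (diverted, consumes headless message) -> baseline ccc

Solution

The MQ client already has logic that prefers the version from the trace and falls back to the instance's environment variable, which lets a version propagate along the MQ path even for headless messages.

SkyWalking had no such logic, so we added something similar: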

private static String getTraceVersionOrEnv(CorrelationContext correlationContext, String headerName) {
    String version = correlationContext.get(headerName).orElse("");
    //  sniff virtual env version from 1. trace, 2. instance environment variable
    if (VIRTUAL_ENV_HEADER.equalsIgnoreCase(headerName) && StringUtil.isEmpty(version)) {
        version = Optional.ofNullable(System.getenv(VIRTUAL_ENV_ENVIRONMENT)).orElse("");
        if (StringUtil.isNotEmpty(version)) {
            //  put the version into the correlation context; otherwise a version taken from the environment variable is never recorded
            correlationContext.put(VIRTUAL_ENV_HEADER, version);
        }
    }
    return version;
}

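Note

Previously, without correlationContext.put(VIRTUAL_ENV_HEADER, version);, the version could not be found when the outgoing call was made, because getHeaderValue in CanaryCarrierItem reads from the trace context rather than from the header.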
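
The trace-first, environment-second lookup and the reason for the write-back can be tried out in isolation. The following is only a sketch under stated assumptions: a plain Map stands in for SkyWalking's CorrelationContext, a Function stands in for System.getenv, and VersionResolver, "x-virtual-env", and "VIRTUAL_ENV" are made-up names for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical stand-in for the trace-first, environment-second lookup. */
final class VersionResolver {

    /**
     * Returns the version stored in the trace context under key. If it is absent,
     * falls back to the environment lookup and writes the value back into the
     * context so that later readers of the context can see it.
     */
    static String resolve(Map<String, String> traceContext, String key,
                          Function<String, String> env, String envName) {
        String version = traceContext.getOrDefault(key, "");
        if (version.isEmpty()) {
            String fromEnv = env.apply(envName);
            version = fromEnv == null ? "" : fromEnv;
            if (!version.isEmpty()) {
                // Without this write-back, code that reads the context
                // (rather than the return value) still sees no version.
                traceContext.put(key, version);
            }
        }
        return version;
    }

    public static void main(String[] args) {
        Map<String, String> ctx = new HashMap<>();
        String v = resolve(ctx, "x-virtual-env", name -> "v123", "VIRTUAL_ENV");
        System.out.println(v + " " + ctx);
    }
}
```

Passing the environment lookup in as a function keeps the sketch testable without touching real environment variables.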

Troubleshooting and fixing async Executors instrumentation not taking effect in SkyWalking

Author: syf
Date: 2022-11-04
Category: Technology

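TL;DR: a hot-redefinition problem

The target class to enhance, ThreadPoolExecutor, had already been loaded into the JVM by the class loader before the enhancement ran (bytebuddy.installOn → instrument.addTransformer / redefine / retransform)

As a result, instrumenting the already loaded class runs into problems. This might be solvable by digging deeper into the redefine or retransform mechanisms; the quick and stable fix is to make sure ThreadPoolExecutor is not loaded until plugin installation has finished

0. Background

A few basics of java.lang.instrument:

`ClassFileTransformer` interface

Here "class file" means Java bytecode in the class file format. The object lives in memory and has nothing to do with a file on disk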

Note the term class file is used as defined in section 3.1 of The Java™ Virtual Machine Specification, to mean a sequence of bytes in class file format, whether or not they reside in a file.

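In JDK 8 the interface has a single transform method, which is invoked to enhance a class when it is loaded, redefined, or retransformed (the transformer's transform method is invoked when classes are loaded, redefined, or retransformed.)

`redefineClasses` method

Java 5+. Replaces the definition of already loaded classes with newly supplied class file bytes. Note that the new definition cannot alter the existing class declaration; for example, it cannot add fields or change method signatures

`retransformClasses` method

Java 6+. Similar to the above, but the agent does not supply new class bytes; instead the JVM runs the existing class bytes through the registered retransformation-capable transformers again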

“Agents use these methods to retransform previously loaded classes without needing to access their class files.”

Difference between redefine and retransform: https://stackoverflow.com/questions/19009583/difference-between-redefine-and-retransform-in-javaagent

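1. The original problem

The custom executors plugin did not take effect. Debugging showed that an UnsupportedOperationException was thrown during onInstall and was never logged. The top of the exception stack was sun.instrument.InstrumentationImpl#retransformClasses, reached because SkyWalking configures ByteBuddy with the AgentBuilder.RedefinitionStrategy.RETRANSFORMATION strategy

For comparison, a custom agent that also uses ByteBuddy does take effect: https://gitlab.sunyongfei.com/platform-basic/java-agents/tree/threadpool-qy/thread-agent

The RedefinitionStrategy configured before onInstall differs between the two:

SkyWalking uses AgentBuilder.RedefinitionStrategy.RETRANSFORMATION, while the custom agent uses AgentBuilder.RedefinitionStrategy.REDEFINITION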

net.bytebuddy.agent.builder.AgentBuilder.RedefinitionStrategy:

/**
 * <p>
 * A redefinition strategy regulates how already loaded classes are modified by a built agent.
 * </p>
 * <p>
 * <b>Important</b>: Most JVMs do not support changes of a class's structure after a class was already
 * loaded. Therefore, it is typically required that this class file transformer was built while enabling
 * {@link AgentBuilder#disableClassFormatChanges()}.
 * </p>
 */
enum RedefinitionStrategy {
 
    /**
     * Disables redefinition such that already loaded classes are not affected by the agent.
     */
    DISABLED(false, false) {
        @Override
        public void apply(Instrumentation instrumentation,
                          AgentBuilder.Listener listener,
                          CircularityLock circularityLock,
                          PoolStrategy poolStrategy,
                          LocationStrategy locationStrategy,
                          DiscoveryStrategy discoveryStrategy,
                          BatchAllocator redefinitionBatchAllocator,
                          Listener redefinitionListener,
                          LambdaInstrumentationStrategy lambdaInstrumentationStrategy,
                          DescriptionStrategy descriptionStrategy,
                          FallbackStrategy fallbackStrategy,
                          RawMatcher matcher) {
            /* do nothing */
        }
 
        @Override
        protected void check(Instrumentation instrumentation) {
            throw new IllegalStateException("Cannot apply redefinition on disabled strategy");
        }
 
        @Override
        protected Collector make() {
            throw new IllegalStateException("A disabled redefinition strategy cannot create a collector");
        }
    },
 
    /**
     * <p>
     * Applies a <b>redefinition</b> to all classes that are already loaded and that would have been transformed if
     * the built agent was registered before they were loaded. The created {@link ClassFileTransformer} is <b>not</b>
     * registered for applying retransformations.
     * </p>
     * <p>
     * Using this strategy, a redefinition is applied as a single transformation request. This means that a single illegal
     * redefinition of a class causes the entire redefinition attempt to fail.
     * </p>
     * <p>
     * <b>Note</b>: When applying a redefinition, it is normally required to use a {@link TypeStrategy} that applies
     * a redefinition instead of rebasing classes such as {@link TypeStrategy.Default#REDEFINE}. Also, consider
     * the constrains given by this type strategy.
     * </p>
     */
    REDEFINITION(true, false) {
        @Override
        protected void check(Instrumentation instrumentation) {
            if (!instrumentation.isRedefineClassesSupported()) {
                throw new IllegalStateException("Cannot apply redefinition on " + instrumentation);
            }
        }
 
        @Override
        protected Collector make() {
            return new Collector.ForRedefinition();
        }
    },
 
    /**
     * <p>
     * Applies a <b>retransformation</b> to all classes that are already loaded and that would have been transformed if
     * the built agent was registered before they were loaded. The created {@link ClassFileTransformer} is registered
     * for applying retransformations.
     * </p>
     * <p>
     * Using this strategy, a retransformation is applied as a single transformation request. This means that a single illegal
     * retransformation of a class causes the entire retransformation attempt to fail.
     * </p>
     * <p>
     * <b>Note</b>: When applying a retransformation, it is normally required to use a {@link TypeStrategy} that applies
     * a redefinition instead of rebasing classes such as {@link TypeStrategy.Default#REDEFINE}. Also, consider
     * the constrains given by this type strategy.
     * </p>
     */
    RETRANSFORMATION(true, true) {
        @Override
        protected void check(Instrumentation instrumentation) {
            if (!DISPATCHER.isRetransformClassesSupported(instrumentation)) {
                throw new IllegalStateException("Cannot apply retransformation on " + instrumentation);
            }
        }
 
        @Override
        protected Collector make() {
            return new Collector.ForRetransformation();
        }
    };

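RedefinitionStrategy exists because "Most JVMs do not support changes of a class's structure after a class was already loaded. Therefore, it is typically required that this class file transformer was built while enabling disableClassFormatChanges()."

That is, most JVMs do not support changing a class's structure after it has been loaded, so you have to specify a strategy for how such already loaded classes are redefined

Instrumentation support as first reported by the OpenJDK 8 HotSpot VM: redefine supported, retransform not supported

The call in the exception screenshot above: this.retransformClasses is the sun.instrument.InstrumentationImpl#retransformClasses method obtained via reflection:

Setting a breakpoint to check native support shows that the value is lazily initialized here; only at this point is the reported retransform support accurate:

Retransform is supported, which means it was the native method retransformClasses0(mNativeAgent, classes); that threw the exception above. At this point the ThreadPoolExecutor class must already have been loaded:

Verification: switch the custom agent to the same RETRANSFORMATION strategy as SkyWalking. If it also stops working, then ThreadPoolExecutor was already loaded at enhancement time and the strategy choice is the problem

Result: the custom agent still works

So either the custom agent had not loaded ThreadPoolExecutor, in which case the strategy has no effect on that class, or this is not the problem at all

Check whether the custom agent had loaded the target class at enhancement time: set a breakpoint at apply:4812, AgentBuilder$RedefinitionStrategy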

stack:

apply:4812, AgentBuilder$RedefinitionStrategy (net.bytebuddy.agent.builder)
doInstall:9463, AgentBuilder$Default (net.bytebuddy.agent.builder)
installOn:9384, AgentBuilder$Default (net.bytebuddy.agent.builder)
instrumentation:58, ThreadPoolAgent (com.sunyongfei.platform.basic.agent.threadpool)
premain:38, ThreadPoolAgent (com.sunyongfei.platform.basic.agent.threadpool)
invoke0:-1, NativeMethodAccessorImpl (sun.reflect)
invoke:62, NativeMethodAccessorImpl (sun.reflect)
invoke:43, DelegatingMethodAccessorImpl (sun.reflect)
invoke:498, Method (java.lang.reflect)
loadClassAndStartAgent:386, InstrumentationImpl (sun.instrument)
loadClassAndCallPremain:401, InstrumentationImpl (sun.instrument)

Finding: under SkyWalking, ThreadPoolExecutor had already been loaded; under the custom agent it had not
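
The same check can be done without a debugger by asking the Instrumentation instance which classes are already loaded. This is a sketch only: LoadedClassCheck is a hypothetical helper, not part of SkyWalking or ByteBuddy, and it assumes it is called from an agent's premain before installOn runs. It uses only the standard java.lang.instrument API.

```java
import java.lang.instrument.Instrumentation;

/** Hypothetical helper for use inside an agent's premain, before installOn. */
final class LoadedClassCheck {

    /** Returns true if the JVM has already loaded a class with this name. */
    static boolean isLoaded(Instrumentation inst, String className) {
        for (Class<?> c : inst.getAllLoadedClasses()) {
            if (c.getName().equals(className)) {
                return true;
            }
        }
        return false;
    }

    /** Example premain that only reports the state; it installs nothing. */
    public static void premain(String agentArgs, Instrumentation inst) {
        String target = "java.util.concurrent.ThreadPoolExecutor";
        System.out.println(target + " loaded before install: " + isLoaded(inst, target)
                + ", retransform supported: " + inst.isRetransformClassesSupported());
    }
}
```

Logging this at the start of premain in both agents would show the difference described above without stepping through ByteBuddy internals.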

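2. Locating the problem

A breakpoint in the ThreadPoolExecutor constructor traced the loading of ThreadPoolExecutor to the logging component: obtaining the singleton FileWriter creates an asynchronous thread through ThreadPoolExecutor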

org.apache.skywalking.apm.agent.core.logging.core.FileWriter#FileWriter

private FileWriter() {
    logBuffer = new ArrayBlockingQueue(1024);
    final ArrayList<String> outputLogs = new ArrayList<String>(200);
    Executors.newSingleThreadScheduledExecutor(new DefaultNamedThreadFactory("LogFileWriter"))
             .scheduleAtFixedRate(new RunnableWithExceptionProtection(new Runnable() {
                 @Override
                 public void run() {
                     try {
                         logBuffer.drainTo(outputLogs);
                         for (String log : outputLogs) {
                             writeToFile(log + Constants.LINE_SEPARATOR);
                         }
                         try {
                             fileOutputStream.flush();
                         } catch (IOException e) {
                             e.printStackTrace();
                         }
                     } finally {
                         outputLogs.clear();
                     }
                 }
             }, new RunnableWithExceptionProtection.CallbackWhenException() {
                 @Override
                 public void handle(Throwable t) {
                 }
             }), 0, 1, TimeUnit.SECONDS);
}

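3. Fix

In org.apache.skywalking.apm.agent.core.logging.core.WriterFactory, add a FILE_WRITTER_INIT_FLAG switch: until all plugins and ByteBuddy's installOn have finished, FileWriter must not be initialized and logs can only go to STDOUT
The executors-plugin itself must not log while instrumenting; otherwise the interceptor logic is invoked recursively without end, causing stack overflow and similar problems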
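
The idea behind the switch can be sketched as a factory that hands out a STDOUT writer until the agent flips a flag. This is not the actual SkyWalking patch: GatedWriterFactory, LogWriter, allowFileWriter, and the placeholder file writer are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a writer factory gated by an init flag. */
final class GatedWriterFactory {

    interface LogWriter {
        void write(String message);
    }

    /** Writes straight to STDOUT; creates no threads or thread pools. */
    static final LogWriter STDOUT_WRITER = message -> System.out.println(message);

    // Flipped once by the agent after all plugins have been installed.
    private static volatile boolean fileWriterAllowed = false;
    private static volatile LogWriter fileWriter;

    static void allowFileWriter() {
        fileWriterAllowed = true;
    }

    /** Returns STDOUT until the flag is set, then a lazily created file writer. */
    static LogWriter getLogWriter() {
        if (!fileWriterAllowed) {
            return STDOUT_WRITER;
        }
        LogWriter w = fileWriter;
        if (w == null) {
            synchronized (GatedWriterFactory.class) {
                if (fileWriter == null) {
                    fileWriter = createFileWriter();
                }
                w = fileWriter;
            }
        }
        return w;
    }

    // Placeholder: a real writer would start its flush executor here, which is
    // the point at which ThreadPoolExecutor gets loaded.
    private static LogWriter createFileWriter() {
        List<String> buffer = new ArrayList<>();
        return message -> buffer.add(message);
    }
}
```

In this sketch the agent would call allowFileWriter() right after installOn returns, so nothing that touches ThreadPoolExecutor runs before the transformer is registered.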

A first look at gRPC & netty

Author: syf
Date: 2022-08-24
Category: Technology

gRPC & netty

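Background

While building multi-version traffic diversion for MQ in the test environment, I reused RocketMQ's netty-based remoting module so that the client can query the metadata service in a near-native way to control message diversion
While building fault detection, I used gRPC for communication; grpc-java is implemented on top of netty by default
For SkyWalking's multi-version trace propagation feature, I studied the design of SkyWalking's communication layer

gRPC overview

A high-performance, open-source, general-purpose RPC framework

Highlights

protobuf binary serialization protocol, easy to define;
Fast startup, easy to extend;
Cross-platform and cross-language; you can even use it like WebSocket on the web side;
Bidirectional streaming over HTTP/2, with tightly integrated, pluggable authentication;

Incubating at CNCF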


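Starting with protobuf

protobuf is one of the core concepts in gRPC. gRPC uses protobuf both as its interface definition language (IDL) and as the serialization format for the underlying message exchange (which can in fact be swapped for JSON and others). A protobuf file can define RPCs (endpoints) as well as data structures, and it supports package imports, oneof, enum, and similar features. protobuf definitions still differ somewhat from data structures in common programming languages; for example, there is no inheritance (proto3 no longer supports extend, although oneof can serve as a workaround)

Take one of the protobuf definitions from the fault detection service below as an example. import pulls in definitions from other packages, and option sets code generation options. The service defines one rpc, StreamChannel, whose request and response are both of type stream, making it a bidirectional stream. Entries in enum and message definitions take the form Field var = number. For message fields, the number is written into the serialized binary as part of the field's tag to identify it: field numbers 1 to 15 take 1 byte to encode and 16 to 2047 take 2 bytes, so the best practice is to reserve the first 15 numbers for the most frequently used fields.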

syntax = "proto3";

import "command/server/register.proto";
import "command/client/thread_snapshot.proto";
import "command/client/profile.proto";
import "command/server/hot_thread.proto";

package com.enmonster.platform.hts.grpc;

option java_multiple_files = true;
option java_package = "com.enmonster.platform.hts.grpc";
option java_outer_classname = "CommandDispatcherRPC";

service CommandDispatcher {
  // Bidirectional stream: server and client can talk to each other
  rpc StreamChannel (stream StreamDataPackage) returns (stream StreamDataPackage) {}
}

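// To preserve compatibility, do not change the values of existing commands when adding new ones
enum Command {
  REGISTER = 0;  // client registers with the server
  RECORD_HOT_THREAD = 1;  // record hot threads
  // commands above are handled by the server; commands below are handled by the client
  THREAD_SNAPSHOT = 11;  // collect a thread snapshot
  PROFILE = 12;  // start async-profiler sampling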
}

// Logical data packet of the bidirectional stream; an empty response means the packet is a request, a non-empty one means it is a response
message StreamDataPackage {
  string job_id = 1;  // id of the job the command belongs to
  Command command = 2;  // command
  string client_ip = 3;  // client ip, aka node ip
  oneof request {  // command request payload
    RegisterRequest register_request = 10;
    ThreadSnapshotRequest thread_snapshot_request = 11;
    HotThreadRequest hot_thread_request = 12;
    ProfileRequest profile_request = 13;
  }
  BaseResponse response = 20;  // command response payload; non-null means a command result, otherwise a command request
}

message BaseResponse {
  bool ok = 1;  // whether the command succeeded
  string message = 2;  // message
  // command result body
  oneof body {
    RegisterResponse register_response = 3;
    ThreadSnapshotResponse thread_snapshot_response = 4;
    HotThreadResponse hot_thread_response = 5;
    ProfileResponse profile_response = 6;
    ProfileResultResponse profile_result_response = 7;
  }
}

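Most field types resemble those of common programming languages: map<string, string> represents a map, repeated represents an array, and so on. Data structures from existing packages, such as google.protobuf.Timestamp, can also be reused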
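
The 1-byte and 2-byte ranges for field numbers follow from how protobuf encodes a field tag: the tag is the varint encoding of (field number << 3) | wire type, and each varint byte carries 7 bits. A small sketch of that arithmetic (ProtoTagSize is a hypothetical helper written for this post, not part of the protobuf library):

```java
/** Computes how many bytes a protobuf field tag occupies on the wire. */
final class ProtoTagSize {

    /**
     * A field tag is the varint encoding of (fieldNumber << 3) | wireType.
     * Each varint byte carries 7 payload bits.
     */
    static int tagSize(int fieldNumber) {
        int tag = fieldNumber << 3; // the 3-bit wire type does not affect the length
        int size = 1;
        while ((tag >>>= 7) != 0) {
            size++;
        }
        return size;
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 15, 16, 2047, 2048}) {
            System.out.println("field " + n + " -> " + tagSize(n) + " byte(s)");
        }
    }
}
```

In the StreamDataPackage definition above, job_id = 1 through the oneof members numbered 10 to 13 fit in a one-byte tag, while response = 20 needs two bytes.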

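Compiling protobuf

The strength of protobuf is that it is a cross-platform, cross-language DSL, one level more abstract than a high-level programming language, so definitions are very concise (a bit like writing Java interfaces). This means the output of compiling protobuf + gRPC is source code in a high-level language (Java, Python, Go, C/C++, ...), somewhat like front-end .vue files compiling to CSS and JS, though at a higher level of abstraction.

Taking Java as the example: add the protobuf-maven-plugin and run compile to get the generated Java files. (You can also use the protoc compiler binary directly; the Maven plugin itself invokes protoc.)

Fill in your own business logic

The generated code already contains the complete gRPC communication plumbing: the RPC endpoints you defined, several synchronous and asynchronous stubs that can be called directly, builders for the data structures, and so on. To use it, extend the generated ImplBase class and fill in the handler logic.

The protobuf definitions can be packaged as a separate Maven module that both server and client depend on, which keeps versions consistent and enables reuse. Because this is a bidirectional-streaming RPC, the client logic closely mirrors the server's. The client initiates the connection to the server. Here I added a heartbeat keepalive because the client needs to report its information. In fact, since gRPC runs over HTTP/2, a streaming RPC can keep transmitting indefinitely as long as no deadline/timeout is set explicitly, with no need for HTTP/1.x-style polling or periodically held long connections.

@see the HTC client code

How SkyWalking does it

As a tracing component, SkyWalking constantly reports large volumes of data to its backend server. Beyond the processing capacity and throughput of SkyWalking itself, a solid and reliable transport layer/RPC framework matters too, and SkyWalking uses gRPC for that. Since 8.x the community also offers a Kafka-based reporting path, but at least at our company gRPC has proven quite solid for trace transport and RPC performance.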

TraceSegmentServiceClient.java

Tracing.proto

RemoteServiceHandler.java

TraceSegmentReportServiceHandler.java

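I have only looked at this superficially. If you are interested, read the post on the official SkyWalking blog by Wu Sheng, the founder of SkyWalking, which covers communication, routing, and overall system architecture in substantial detail.

Interestingly, in the recently released 5.0 version RocketMQ also adopted gRPC, on top of its original pure-netty communication, as the default communication and RPC solution, which suggests that the open source community increasingly recognizes gRPC's performance and reliability.

Standing on the shoulders of giants: netty

Given how powerful gRPC is, is everything apart from protobuf (the binary serialization framework and RPC DSL) and the underlying HTTP/2 protocol implemented by gRPC itself? In grpc-java, another heavyweight sits underneath: netty, as the network layer framework. (That is not entirely precise: just as the serialization protocol can be switched from protobuf to JSON and others, the underlying transport can be replaced with OkHttp and others, and this also depends on the language.)

netty overview

netty is an asynchronous, event-driven network application framework. Its performance is well known and there is plenty of material about it. I have not written raw netty code myself; while recently working on the virtual-environment MQ feature I found that RocketMQ's communication module wraps netty very well, so I looked at how it is implemented internally. I also want to share how to extend MQ client logic on top of the rocketmq-remoting module to get the same RPC capability as the native consumer<->broker path.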

org.apache.rocketmq.remoting.netty.NettyRemotingAbstract#processRequestCommand
