Use Protobuf and gRPC to modularize an analyzer framework

The key problem in developing a “modularized” analyzer framework is that different modules have very different dependencies. For example, a frontend relies on an off-the-shelf parser library to extract the syntactic structure of source texts, while a Datalog analysis needs an engine to interpret and execute it. Mixing them can cause unexpected dependency problems, both when developing and when deploying. It gets even more complex if we want to implement some modules in a different language - e.g. using C++ to cooperate with LLVM.

When developing my framework, Interface Description Languages like Protobuf provide an easier way to modularize across languages. Using an IDL to shape the communication eases my mind when implementing new analyzer components: every component just reads from and fills in the generated data structures. An IDL can define key shared structures like IRs, too. It also reduces my workload when updating message formats. Unlike JSON or other text-only methods, the generated code can leverage language features to detect outdated interface implementations at compile time. (On the other hand, most IDLs can still interoperate with older yet compatible implementations.)
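As a minimal sketch (the message and field names below are hypothetical, not my framework's actual schema), a shared IR structure in Protobuf might look like:

```proto
syntax = "proto3";

package ir;

// A simplified function-level IR node shared across components.
message Function {
  string name = 1;
  repeated Instruction instructions = 2;
}

message Instruction {
  string opcode = 1;
  repeated string operands = 2;
  // New fields can be appended with fresh tags; components compiled
  // against the previous schema still parse the message.
  string source_location = 3;
}
```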

When deploying, the full power of an IDL reveals itself when combined with an RPC framework. RPC frameworks like gRPC provide easy and efficient communication among components, so I seldom have to worry about how the modules talk to each other. Each component is deployed as an isolated server; after wiring the servers up with configured addresses, they communicate with one another through the generated APIs. Going a step further, the servers can be deployed into containers to fully isolate their runtime dependencies.
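As an illustration only (the service and message names are made up, and the imported file is assumed to hold the shared IR messages sketched above), a component boundary can be described as a gRPC service:

```proto
syntax = "proto3";

package analyzer;

// Assumed to contain the shared IR messages sketched earlier.
import "ir.proto";

// A hypothetical frontend component exposed as a gRPC server.
service Frontend {
  // The coordinator sends a source path; the frontend parses it
  // and replies with the extracted IR.
  rpc Parse(ParseRequest) returns (ParseReply);
}

message ParseRequest {
  string source_path = 1;
}

message ParseReply {
  repeated ir.Function functions = 1;
}
```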

Thus the repo structure looks like the tree below. The analyzer framework now looks quite like a microservice system. I prefer the monorepo manner, since the IDL definitions change often and every module has to stay in sync with them.

```
- root
  - ir/                  { IDLs for shared data structures }
  - protos/              { IDLs for component interfaces and communication }
  - settings.gradle      { use Gradle to organize Java code }
  - java-commons/        { common utilities for Java modules }
  - java-module-1/
  - java-module-2/
  - ...
  - CMakeLists.txt       { use CMake to organize C++ code }
  - cpp-commons/         { common utilities for C++ modules }
    - include/
    - libsrc/
  - cpp-module-1/
  - cpp-module-2/
  - ...
  - java-core/           { coordinator that handles analysis requests and schedules the analysis workflow }
  - python-client/       { Python-implemented demo client }
  - scripts/             { scripts to launch the framework locally }
  - Dockerfile           { use Docker to deploy }
  - docker-compose.yml
```
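For instance, a docker-compose.yml along these lines wires the servers together; the service names, ports, and the mounted cache path below are all made up for illustration:

```yaml
# Hypothetical compose file - service names, ports, and paths are
# illustrative, not the framework's real configuration.
services:
  java-core:
    build: .
    environment:
      # Other modules are reachable by service name on the compose network.
      - FRONTEND_ADDR=frontend:50051
    volumes:
      - ./cache:/cache   # shared folder, see the caveat below
  frontend:
    build: .
    command: ["frontend-server", "--port=50051"]
    volumes:
      - ./cache:/cache
```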

But there are some constraints. RPC frameworks are less efficient at passing large data structures. In my case, the Datalog relations shared between components are usually too large to pass directly over sockets. As a stopgap, I cope with this by caching relations in local files and passing only file paths in the interface, which means I have to mount a host folder into every container at the same path to share the files. This prevents the framework from being fully decoupled. It would be better to use a webfs to share files under unified URIs, or to cache relations in a DB server instead of BDD files.
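A sketch of what that interface looks like, with hypothetical names: the relation payload stays on disk, and only its location travels over the RPC channel.

```proto
// Hypothetical message: components exchange a reference to a cached
// relation rather than the tuples themselves.
message RelationRef {
  // Path under the folder mounted into every container at the same
  // location, e.g. "/cache/points_to.bdd". Replacing this with a URI
  // (webfs) or a DB key would remove the shared-mount requirement.
  string path = 1;
}
```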

In general, using Protobuf with gRPC, or other IDL+RPC stacks like Thrift, saves a lot of effort when developing a multi-language framework.
