For one of our clients, we recently developed a software system that included transforming PDF files into their text representation (a PDF parser). The system consists of several components, but for the purpose of this post we are interested in two of them: the Web scraper and the PDF parser. The Web scraper is a Go program that retrieves information from websites and also from PDF files when it encounters them. If you have ever tried to programmatically extract information from PDF files, you know that it is a non-trivial problem and that only a handful of usable libraries exist. Unfortunately, we did not find a usable Go library; however, the Java library Apache PDFBox produces quite satisfactory results. We just had to solve how to call Java from Go:
- cgo. We did not want to go down this path.
- exec.Command("java", "-jar", ...) and capturing the output. This was a no-go, mainly because of the slow JVM startup.
The goal was clear: we would publish an API that the Web scraper calls whenever it needs the text representation of a PDF file. Something similar to a REST API with a JSON payload was dismissed right away. Because this was a pure server-side integration, we decided to give gRPC a try. gRPC is an open source RPC framework with support for multiple programming languages, and it uses binary, schema-based serialization through Protocol Buffers. The Protocol Buffers proto3 language is used to describe both the service interface and the structure of the payload messages. The Protocol Buffers distribution, together with the gRPC plugins, includes tooling that can generate an RPC client stub and a server skeleton in the language of your choice.
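To make the rejected exec.Command option more concrete, here is a minimal sketch of what it could look like. The helper name runAndCapture is ours for illustration, and "java -version" is only a stand-in for the real parser JAR, which is not shown in this post. The point is that every call starts a fresh process, and for Java that means a fresh JVM.

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
	"time"
)

// runAndCapture (hypothetical helper) starts an external program, feeds
// input to its stdin and returns whatever it wrote to stdout.
// A new process (for Java: a new JVM) is started on every call.
func runAndCapture(input []byte, name string, args ...string) (string, error) {
	cmd := exec.Command(name, args...)
	cmd.Stdin = bytes.NewReader(input)

	var stdout, stderr bytes.Buffer
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr

	if err := cmd.Run(); err != nil {
		return "", fmt.Errorf("running %s: %w (stderr: %q)", name, err, stderr.String())
	}
	return stdout.String(), nil
}

func main() {
	// Assumption: "java -version" stands in for the parser JAR. It only
	// starts and stops the JVM, which is the fixed cost paid per call.
	start := time.Now()
	if _, err := runAndCapture(nil, "java", "-version"); err != nil {
		fmt.Println("could not run java:", err)
		return
	}
	fmt.Printf("one JVM start took %v\n", time.Since(start).Round(time.Millisecond))
}
```

If the scraper hits many PDF files, this startup cost is paid for each of them, which is why a long-running service looked more attractive to us.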
The API for the PDF parser is relatively simple - the binary content of the PDF file serves as the input, and the output is the text representation of a PDF file. Written in proto3 it looks like this:
syntax = "proto3";

option java_multiple_files = true;
option java_package = "eu.redbyte.pdfparser.grpc";
option java_outer_classname = "PDFParserApi";

package pdfparserapi;

message ParserRequest {
    bytes content = 1;
}

message ParserResponse {
    string text = 1;
}

service PDFParser {
    rpc Parse (ParserRequest) returns (ParserResponse) {
    }
}
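To give a feel for what "binary, schema-based serialization" means for these two messages, the following sketch hand-encodes them in Go using only the standard library. It is an illustration of the Protocol Buffers wire format, not the code that protoc generates; the function names are made up for this example. Both `bytes` and `string` fields are written as a tag, a varint length, and the raw payload.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeLengthDelimited writes one length-delimited field (wire type 2):
// varint tag, varint payload length, then the payload itself.
func encodeLengthDelimited(fieldNumber int, payload []byte) []byte {
	if len(payload) == 0 {
		return nil // proto3 omits fields that hold their default value
	}
	const wireTypeLengthDelimited = 2
	tag := uint64(fieldNumber)<<3 | wireTypeLengthDelimited

	out := binary.AppendUvarint(nil, tag)
	out = binary.AppendUvarint(out, uint64(len(payload)))
	return append(out, payload...)
}

// encodeParserRequest mirrors `bytes content = 1;` from the schema.
func encodeParserRequest(content []byte) []byte {
	return encodeLengthDelimited(1, content)
}

// encodeParserResponse mirrors `string text = 1;` from the schema.
func encodeParserResponse(text string) []byte {
	return encodeLengthDelimited(1, []byte(text))
}

func main() {
	fmt.Printf("% x\n", encodeParserRequest([]byte("%PDF"))) // 0a 04 25 50 44 46
	fmt.Printf("% x\n", encodeParserResponse("hi"))          // 0a 02 68 69
}
```

The field names never appear on the wire, only the field number 1, so the overhead for a PDF payload is a few bytes regardless of its size.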
As mentioned earlier, the Protocol Buffers distribution contains the tooling to generate code from a proto3 schema. In our Gradle-based build we used the protobuf plugin, which takes care of compiling the proto3 schema to Java classes. The relevant parts of our Gradle build are:
plugins {
    id "com.google.protobuf" version "0.8.6"
    // ...
}
// ...
protobuf {
    protoc {
        artifact = "com.google.protobuf:protoc:3.6.1"
    }
    plugins {
        grpc {
            artifact = "io.grpc:protoc-gen-grpc-java:1.14.0"
        }
    }
    generateProtoTasks {
        all()*.plugins { grpc {} }
    }
}
Leaving aside the implementation details of the PDF transformation, the whole PDF parser service is really simple:
@GRpcService
public class PDFParserService extends PDFParserGrpc.PDFParserImplBase {

    private final PDFExtractor pdfExtractor;

    @Autowired
    public PDFParserService(PDFExtractor pdfExtractor) {
        this.pdfExtractor = pdfExtractor;
    }

    @Override
    public void parse(ParserRequest request, StreamObserver<ParserResponse> responseObserver) {
        try {
            // Read the PDF bytes from the request as a stream and extract the text
            String text = pdfExtractor.extract(request.getContent().newInput());
            ParserResponse response = ParserResponse.newBuilder().setText(text).build();
            // Unary call: send exactly one response, then complete
            responseObserver.onNext(response);
            responseObserver.onCompleted();
        } catch (Exception e) {
            // A plain exception reaches the client as status UNKNOWN
            responseObserver.onError(e);
        }
    }
}
If you use the gRPC Spring Boot starter as we did, you can annotate your Java class with @GRpcService for gRPC service autoconfiguration. When you run the app, the gRPC service is automagically up, listening on port 3000.
The PDF extraction service is up and listening on localhost:3000, so let’s generate a Go client. Using the protoc compiler and the plugin for Go, we can generate the client stub as follows (in our case it is a Makefile target):
protoc -I grpc/pdfparserapi grpc/pdfparserapi/pdfparserapi.proto --go_out=plugins=grpc:grpc/pdfparserapi
The actual use of the generated Go client stub is straightforward:
// gRPC client initialization
conn, err := grpc.Dial("localhost:3000", grpc.WithInsecure())
if err != nil {
	// include the underlying error so the failure can be diagnosed
	log.Fatalf("unable to connect to localhost:3000: %v", err)
}
defer conn.Close()
pdfParser := pdfparserapi.NewPDFParserClient(conn)
// ...
// gRPC client usage
response, err := pdfParser.Parse(context.Background(), &pdfparserapi.ParserRequest{Content: pdfContent})
if err != nil {
	return errors.WithStack(err)
}
// use the response
fmt.Println(response.Text)
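The call above uses context.Background(), so it waits for as long as the parser takes. A common refinement is to give each call its own deadline. The sketch below shows the pattern with the standard library only; textParser and fakeParser are hypothetical stand-ins for the generated client, reduced to plain Go types so the example runs without gRPC.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// textParser is a hypothetical stand-in for the generated PDFParserClient.
type textParser interface {
	Parse(ctx context.Context, content []byte) (string, error)
}

// parseWithDeadline gives a single call its own time budget.
func parseWithDeadline(p textParser, content []byte, timeout time.Duration) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel() // release the timer as soon as the call returns
	return p.Parse(ctx, content)
}

// fakeParser simulates a remote parser that needs `delay` to answer.
type fakeParser struct {
	delay time.Duration
	text  string
}

func (f fakeParser) Parse(ctx context.Context, _ []byte) (string, error) {
	select {
	case <-time.After(f.delay):
		return f.text, nil
	case <-ctx.Done():
		return "", ctx.Err() // the deadline expired before the answer arrived
	}
}

func main() {
	fast := fakeParser{delay: 10 * time.Millisecond, text: "hello"}
	text, err := parseWithDeadline(fast, nil, time.Second)
	fmt.Println(text, err) // hello <nil>

	slow := fakeParser{delay: time.Second, text: "late"}
	_, err = parseWithDeadline(slow, nil, 50*time.Millisecond)
	fmt.Println(errors.Is(err, context.DeadlineExceeded)) // true
}
```

With the real generated client the same context would be passed as the first argument of Parse. As far as we know, an expired deadline is then reported as a gRPC status error rather than as context.DeadlineExceeded directly, so the error check would look different.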
Using a relatively simple example, we have shown the use of the gRPC framework as a (more pragmatic) alternative to the “classic” REST API approach. And we only scratched the surface of gRPC features. We did not discuss topics such as streaming, authentication, TLS, or using middleware (for example, for rate limiting).