How to deal with non-reproducible bugs in Java web applications?

Non-reproducible bugs is probably the most challenging category of software bugs. To solve such kind of a problem you need to be like a Sherlock Holmes and use all your knowledge and experience to put all pieces of the puzzle together. In this blog post, we’ll show you how we squashed real-life bug from that category.

Introduction

In one of our healthcare projects we’ve been integrating with a government service for collection, analysis, and sharing of medical information about patients, and because it’s very sensitive data whole communication had to be encrypted and secured with digital certificates. Since it was a SOAP Web Service we decided to use Apache CXF with Merlin as a Crypto interface implementation because it’s a reliable and widely used tool in the Java world.

Once the solution was delivered to the customers and deployed onto production in the next few months medical practitioners were using it in their everyday work without any issues. But one day our monitoring dashboard in New Relic started getting red. There were errors in communication with that government SOAP Web Service. Heavy cup of coffee and we were ready to start the investigation.

Investigation

Doing the logs analysis we’ve found the error:

org.apache.wss4j.common.ext.WSSecurityException: The private key for the supplied alias does not exist in the keystore
	at org.apache.wss4j.common.crypto.Merlin.getPrivateKey(Merlin.java:767)
	at org.apache.wss4j.dom.message.WSSecSignature.computeSignature(WSSecSignature.java:548)
	at org.apache.wss4j.dom.action.SignatureAction.execute(SignatureAction.java:154)
	at org.apache.wss4j.dom.handler.WSHandler.doSenderAction(WSHandler.java:234)
	at org.apache.cxf.ws.security.wss4j.WSS4JOutInterceptor.access$100(WSS4JOutInterceptor.java:54)
	at org.apache.cxf.ws.security.wss4j.WSS4JOutInterceptor$WSS4JOutInterceptorInternal.handleMessageInternal(WSS4JOutInterceptor.java:261)
	at org.apache.cxf.ws.security.wss4j.WSS4JOutInterceptor$WSS4JOutInterceptorInternal.handleMessage(WSS4JOutInterceptor.java:135)
	at org.apache.cxf.ws.security.wss4j.WSS4JOutInterceptor$WSS4JOutInterceptorInternal.handleMessage(WSS4JOutInterceptor.java:122)
	at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
	at org.apache.cxf.endpoint.ClientImpl.doInvoke(ClientImpl.java:514)
	at org.apache.cxf.endpoint.ClientImpl.invoke(ClientImpl.java:423)
	at org.apache.cxf.endpoint.ClientImpl.invoke(ClientImpl.java:324)
	at org.apache.cxf.endpoint.ClientImpl.invoke(ClientImpl.java:277)
	at org.apache.cxf.frontend.ClientProxy.invokeSync(ClientProxy.java:96)
	at org.apache.cxf.jaxws.JaxWsClientProxy.invoke(JaxWsClientProxy.java:139)
	at .....
Caused by: java.security.UnrecoverableKeyException: Get Key failed: Given final block not properly padded. Such issues can arise 
if a bad key is used during decryption.
	at sun.security.pkcs12.PKCS12KeyStore.engineGetKey(PKCS12KeyStore.java:455)
	at java.security.KeyStore.getKey(KeyStore.java:1023)
	at org.apache.wss4j.common.crypto.Merlin.getPrivateKey(Merlin.java:756)
	... 193 common frames omitted
Caused by: javax.crypto.BadPaddingException: Given final block not properly padded. Such issues can arise 
if a bad key is used during decryption.
	at com.sun.crypto.provider.CipherCore.unpad(CipherCore.java:975)
	at com.sun.crypto.provider.CipherCore.fillOutputBuffer(CipherCore.java:1056)
	at com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:853)
	at com.sun.crypto.provider.PKCS12PBECipherCore.implDoFinal(PKCS12PBECipherCore.java:405)
	at com.sun.crypto.provider.PKCS12PBECipherCore$PBEWithSHA1AndDESede.engineDoFinal(PKCS12PBECipherCore.java:437)
	at javax.crypto.Cipher.doFinal(Cipher.java:2168)
	at sun.security.pkcs12.PKCS12KeyStore.lambda$engineGetKey$0(PKCS12KeyStore.java:396)
	at sun.security.pkcs12.PKCS12KeyStore$RetryWithZero.run(PKCS12KeyStore.java:285)
	at sun.security.pkcs12.PKCS12KeyStore.engineGetKey(PKCS12KeyStore.java:390)
	... 195 common frames omitted

As our application was deployed on Kubernetes with horizontal scaling we checked other instances of that microservice and their corresponding logs. We were really surprised when realized that this signing error was occurring only on this specific instance. Moreover, the issue couldn’t be replicated neither on development nor on the staging environment.

So how could we find out the root cause of the issue?

The only clue we had was the stack trace we found in logs, so we’ve taken a closer look at it again!

Our investigation started by attaching the debugger to the Docker container. We’ve chosen the following line to put a breakpoint at and it was really surprising when we actually didn’t hit it:

sun.security.pkcs12.PKCS12KeyStore.engineGetKey(PKCS12KeyStore.java:390

So essentially we discovered that there was a different execution path on our local machine. To confirm that and find the diverge point we were debugging the whole stack trace step by step. As we can see from logs, the exception was thrown at

org.apache.wss4j.common.crypto.Merlin.getPrivateKey(Merlin.java:756) 

so we stopped the debugger at this line and investigated the state of program variables. It was a good call as we noticed that KeyStoreSpi implementation was different from what we observed in the stack trace. But why was it different?

The answer was hidden inside the sun.security.jca.ProviderList.getService(String type, String name) method which source you can see below:

    /**
     * Return a Service describing an implementation of the specified
     * algorithm from the Provider with the highest precedence that
     * supports that algorithm. Return null if no Provider supports this
     * algorithm.
     */
    public Service getService(String type, String name) {
        for (int i = 0; i < configs.length; i++) {
            Provider p = getProvider(i);
            Service s = p.getService(type, name);
            if (s != null) {
                return s;
            }
        }
        return null;
    }

This method returns the first Java KeyStore provider that supports the claimed pkcs12 algorithm, however, the repetitive order of the provider list is not guaranteed. In our case, we registered Bouncy Castle as an additional provider, but sometimes it wasn’t registered at index 0 and that was causing problems.

Solution

Fortunately, with the help of Apache WSS4J documentation we’ve found a solution, wherein you can set KeyStore provider explicitly.

        Properties sig_props = new Properties()
        sig_props.put("org.apache.ws.security.crypto.merlin.keystore.provider", 'BC')
        // another signature properties...

        Map<String, Object> outProps = new HashMap<String, Object>()
        outProps.put(WSHandlerConstants.ACTION, WSHandlerConstants.SIGNATURE)
        outProps.put("properties", sig_props)
        outProps.put(WSHandlerConstants.SIG_PROP_REF_ID, "properties")
        // another out interceptor properties properties...

        Endpoint cxfEndpoint = ((Client) wsClient).getEndpoint()
        cxfEndpoint.getOutInterceptors().add(new WSS4JOutInterceptor(outProps))

Summary

Troubleshooting non-reproducible bugs in Java-based web applications it’s a hard task. It may consume a lot of time and make you very frustrated. Remembering few basic rules may save you hours of work and few white hairs:

  • gather and analyze application logs
  • monitor your application at runtime
  • read stack traces carefully
  • debugger is your friend!

Happy debugging!

Leave a Reply

Your email address will not be published. Required fields are marked *