TL;DR: performance optimization (and many other things, too) will soon need to become far more automated than it is now.
I haven’t blogged much recently, despite my earlier plans. I have been living in peaceful Estonia for three months now, and there has been a lot of time to think.
In this blog post, I will share my rather philosophical musings about performance optimality. Incidentally, they also apply to monitoring and to designing complex systems. There won’t be any hardcore stuff inside, not even low-level pipeline optimization. Just some ideas I have been contemplating.
Getting On The Same Page
With IPv6 and the Internet of Things coming soon, the amount of data and the number of individual processing nodes will continue to explode. As complexity grows with the ever-increasing number of levels of abstraction, new classes of potential performance and correctness issues keep being discovered.
The larger the number of states a system can be in, the larger the absolute number of states that are not optimal. And yet the smallest things at the low level can greatly affect high-level behavior, somewhat like the butterfly effect. No matter how hard you try, you will never achieve full correctness and optimality. Because Gödel. But you can always try to do better than you did before.
To illustrate this trend: about 10 years ago, the ideas of a Performance Engineer and of APM (Application Performance Monitoring) came around. Then, about 7 years later, as the scales grew further, the notion of a Site Reliability Engineer appeared.
But it is just the beginning.
More Complex Than Us
I find it likely that within a few years, a human mind will become literally unable to comprehend the whole complexity of a system. A single human will be unable both to fully understand the low-level root cause of a particular problem on a given execution node and to grasp the full impact that this particular outage has on the system’s high-level behavior.
While narrowing a problem down to the execution-node level is not very hard to automate, it is usually humans who take it from there. In the near future, however, the root cause and the fix will need to be found faster than humans can manage.
It is not possible to offer a flawless solution, but we will have to do something. The first line of defense here is an automated tool that detects, and even predicts, some of the possible issues. To succeed, we will need a knowledge base of sorts: common low-level problems that often lead to visible high-level issues, and the symptoms of such problems.
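As a toy illustration of what the seed of such a knowledge base might look like (all symptom strings and cause names here are hypothetical, not taken from any real tool), consider a simple mapping from observed low-level symptoms to candidate root causes:

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a knowledge base mapping low-level symptoms
// to candidate root causes that tend to surface as high-level issues.
public class SymptomKnowledgeBase {
    private static final Map<String, List<String>> RULES = Map.of(
        "old-gen usage keeps growing after full GC",
            List.of("memory leak"),
        "threads spend most wall-clock time blocked on one monitor",
            List.of("lock contention"),
        "GC pause times exceed request latency budget",
            List.of("heap too small", "allocation rate too high")
    );

    // Look up candidate causes; fall back to flagging for human analysis.
    public static List<String> candidateCauses(String symptom) {
        return RULES.getOrDefault(symptom,
            List.of("unknown, needs human analysis"));
    }

    public static void main(String[] args) {
        System.out.println(
            candidateCauses("old-gen usage keeps growing after full GC"));
    }
}
```

A real knowledge base would of course need weighted, probabilistic rules fed by measurements rather than a hardcoded lookup table, but the shape of the problem is the same: symptoms in, ranked candidate causes out.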
What’s Happening Now
One of the tools that aim to automatically detect root causes is Plumbr, a Java performance monitoring solution that can detect, predict, and suggest fixes for memory leaks, contended monitors, long GC pauses, and more. That’s where I work, so in fairness, another example is jClarity Illuminate, which, according to its website, can detect overuse of system resources, application code inefficiencies, GC problems, and more. There are probably many more tools that I have not mentioned, too. Add them in if you know any; I would be glad to find more people working on the same problem!
What’s To Happen
Going further, what we really need in order to make progress are tools that let us study how high-level changes in an application’s behavior depend on low-level changes in an individual node’s setup, code, or whatnot. Only armed with these will we be able to fill up the knowledge base of potential issues.
A simple example of such a problem (and incidentally the one that set me on this train of thought) is minimizing the overhead of Plumbr. To minimize it, we need to be able to measure and monitor it. But to do that, we need a model of how our agent’s actions influence the performance of different types of applications.
At the moment, there is no customary way of doing this, so I had to build a more macro-level tool that helps analyze changes in the performance behavior of various workloads with and without the Plumbr agent. In fact, it immediately generalizes to analyzing performance differences under various JVM arguments, which is something the community could make use of, so it is likely (but not certain) that it will be open-sourced as a small contribution to the things to come.
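To make the with/without-agent comparison concrete, here is a minimal sketch of the core arithmetic (the class name, workload, and timing numbers are all made up for illustration): collect run-time samples of the same workload in both configurations and compute the relative overhead.

```java
import java.util.Arrays;

// Hypothetical sketch: estimate agent overhead by comparing mean run
// times of the same workload with and without the agent attached.
public class OverheadEstimator {

    static double mean(double[] samples) {
        return Arrays.stream(samples).average().orElse(Double.NaN);
    }

    // Relative overhead: 0.05 means the agent adds ~5% to run time.
    static double relativeOverhead(double[] baselineMs, double[] withAgentMs) {
        double base = mean(baselineMs);
        return (mean(withAgentMs) - base) / base;
    }

    public static void main(String[] args) {
        double[] baseline  = {100.0, 102.0, 98.0};   // made-up timings, ms
        double[] withAgent = {105.0, 107.0, 103.0};  // same workload, agent on
        System.out.printf("overhead: %.1f%%%n",
            relativeOverhead(baseline, withAgent) * 100);
    }
}
```

In practice a tool like this would also need warmup handling and variance estimates (JIT compilation and GC make single JVM timings noisy), but comparing sample means per workload per JVM-argument set is the essential building block.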
This might all seem like abstract stuff, but I really do believe the future will look like this, and I would like to take part in making it happen. Also, the above is definitely very concrete compared to what follows:
It has also occurred to me that performance optimality is a Universal Instrumental Value. I am not referring to Roko’s ideas of promoting “Universal” Instrumental Values to Terminal Values; I am using this term to describe things that a supermajority of agents would pursue instrumentally. It seems pretty obvious that any sufficiently rational agent will be interested in ensuring that its utility function is being fulfilled in an optimal way. And, quite obviously, the time it takes to get some utilons is a factor in that.
An AGI is a great example to illustrate my reasoning. To build a system that complex and have it perform well, we would almost certainly require tools far more advanced than those we have now. For this particular system, though, correctness matters much more than performance: would you rather have a superintelligence that requires an extra month of thinking before finding a cure for cancer, or one that dropped nukes because of a programming error?