Machine transcript
Lovely. Good afternoon, ladies and gentlemen, it's wonderful to be here with you today,
my first GeeCON, I'm excited to be here. I'm here to talk about testing, but before
I start, allow me to introduce myself really briefly. My name is Jan Stępień, I'm based
in Berlin, where I work at INNOQ, we're a consultancy, and as a consultant, I help people
build better software, sometimes as a software developer, sometimes as an architect, or a
trainer.
Enough about me, let's talk about you. My goal for today is to make you better
at testing. I'm sure that you write excellent tests, and I want you to keep writing those
tests. I want to give you another tool, another very interesting tool which can help you tackle
those most difficult cases, those really special systems under test which require particular
attention.
Let me start with a general question. Why do we test in the first place? Because
we don't need to, but we do anyway. One could say it's because of quality, quality of the
software we build, but I think that's nonsense. We can manually go through user stories, click
through the interface, and see that the database produces the right results and saves the right
data, and the quality is there. But we don't do it because we don't have time. So, in our
case, testing is all about automation. We want to automatically get the same assurances
but without any human input. And we automate so many things already, right?
We automate
our builds in our CI pipelines, we automate our provisioning with Terraform, Ansible,
you name it, and then finally, using continuous deployment, continuous delivery, we automatically
push all this stuff to production and it just works. It does so, among other things, thanks
to testing. Just like Fred George said yesterday, the day before, you won't deploy 600 times
a week if you don't have all those steps automated and taken care of, right? This is a prerequisite.
And tests belong to this group. So let's talk about the story of automation and maybe quality,
keeping Slido in mind. I would love to have your questions. Ask questions, upvote questions
of others. And, yes, let's talk about fitness tracking. This is our system under test. Back
when I was running, I used to have such a thing.
This is what I call business logic.
It's not a whole lot of it. It's a very simple case. I'm using raw numbers. But we ended
up with such a function, and now it's our task before we start, say, refactoring, for
example, we need some tests to make sure that the existing behaviour remains in place and
nothing falls apart.
So, we can write a handful of tests. For example, those three points
might be good enough. I'm ignoring the divide by zero case because it's not really meaningful,
so it's not a behaviour we want to test for and explicitly preserve in the refactoring.
So we look at those three cases and we're like, this might be enough, and if it's not,
how many more tests would we need to be certain that our system is sufficiently well tested?
Ten more, a hundred more, a thousand more? It's difficult to say. And also, what led
us to those particular values? Are they special in any way? Particularly interesting? Do
they literally appear in user stories?
Let's move things around a little bit. I extracted
all the values into a single collection, a single table, a list of inputs and expected
outputs, and then the assertion is completely decoupled from actual values. They are fed
from outside.
Now, with this thing, I want us to take a step back and talk to a domain
expert in this, in the world of fitness tracking to gain a better understanding of what we
are actually testing. Our domain expert will be our primary school math teacher.
A over
B equals A over B. If we multiply both sides by B and simplify the right side, we end up
with the following property. That's a key word for today. That's a property. We got
it thanks to a conversation with an expert who knows the domain better than we do as
software specialists.
Now we can take this property and put it in this piece of code
we wrote before. Notice what is changing. Something has changed inside of the assertion.
That's cool. But what is far more interesting is the fact that we don't need return values
of this function any more. We don't care what those return values are. We just keep feeding
inputs, and for arbitrary inputs, we should always - we should always see this property
being satisfied. So, what would prevent us from generating more and more completely arbitrary
inputs, hopefully exercising various exciting edge cases, and see whether the property
holds for all of them?
Exactly this reasoning, this approach is called generative or property-based
testing, where we subject our properties to large volumes of automatically generated
inputs. The approach is 20 years old. It was published in 1999. The first library implementing
it was called QuickCheck. It comes from the functional programming world, but it quickly
moved to other domains and got implemented in other programming languages.
In Java, we
have at least two libraries implementing this principle. Let's take a look how this
property, this assertion, would look like when expressed in those libraries. The first
one is JUnit QuickCheck. It looks like nearly like normal JUnit, but instead of having this
@test annotations, we have those @property annotations. It looks a bit different because
those functions have parameters. Thanks to this @property annotation, the test framework
generates values having read the types of arguments, generates arbitrary values of those types and
feeds them into our function, running it, I think, by default 100 times, but it's a
configurable parameter, so, if you want to be really sure, you can run it overnight on
your CI.
That's one thing. Another library which I like a lot because it's a bit less
annotation-heavy and more explicit and reads like English is QuickTheories. It looks like
this. In QuickTheories, we can say for all integers which are positive, and for any double,
check that the following holds. That's exactly what we wrote in plain English a couple of
slides before.
So, we see we have built-in generators of values, simple numbers. We have
also generators of strings, and we can say I want only ASCII strings, or I want alphanumeric
strings, or some very exciting strings. Now let's run this test against our function.
We end up with an error which is not all that surprising, right? If we zoom in, we
see that it's an assertion error. The assertion of the equality check has failed. We see
those two values which were fed into the function which led to the problem. One and zero. The
smallest found falsifying value. For those values, something doesn't work. We expected
one but not a number. We can't change mathematics. All we can do is either work with our system
under test, or our property, and try to somehow narrow it down so that it generates values
only within our domain. Values legal in our world.
Let's return to the property and take
a look at this generator of doubles. It's not any double. It's any double which is not
zero. After a chat with our domain expert, we decide that a ten is a good value to begin
with. Nobody wants an average for less than ten seconds of steps.
So, with ten in place,
we can say that we want all doubles between ten and one million, for example. Or, an alternative
approach, we can return to the previous version and add a call to assuming which takes a predicate.
Within two values which are generated, it says, yes, those values are okay, or it says,
no, we need to generate other values because those, for any reason, are not something we
can run the assertion on. This way, we narrow down the domain, in this case, the domain
of doubles to what we need. Now, we run this again, quite certain that this time the assertion
holds because why wouldn't it? It doesn't. Our testing library very quickly found some
very exciting values, ten and a tiny bit for which the equality doesn't hold because one is not
nearly one.
At this point, there's not much we can do, right? We hit something as fundamental
as before. If we fire up our browser console and add two numbers, we realise that computers
and mathematics have very different understanding of addition. There's nothing we can do. We
just have to, again, work with our property and say, well, maybe a better check would
be, say, falls within certain margin of error.
This way, we keep working with our properties,
making them actually describe as accurately as possible the domain that we're testing.
We already see several new degrees of automation. We don't have just automatic execution of
tests. We have automatic generation of values which our properties will be subjected to,
and also automatic generation of edge cases, because we as humans, as developers, are really
bad at testing our own code. We are too optimistic. We need somebody to tell us there is something
in between you haven't thought of, and this tool excels at it, automatically finding mean
combinations of inputs which might result to falsifying our assumptions.
But testing is
not only about automation, right? Those of us who do test-driven development know that
testing is also about design. Having our tests drive us towards an elegant design of the
software. And property-based testing can be also very efficiently, very effectively used
for test-driven development. Let's consider a different example, also a very simple one.
Reversing lists, and by lists, I mean immutable lists—my regards to Grzegorz Piwowarek.
If we want to test drive an implementation of such a simple function, we would also start
with a property. Let's start with lists of longs. And think about the property which
will hold for all the lists of longs which are between zero and ten elements.
What would
we check for? What kind of property holds for reverse? There is no single right answer.
There are several good ideas. What I like to begin with is the fact that if we reverse
twice, we get the same thing back. That's a very nice cycle. It looks silly, but it's
quite valuable because it already conveys the fact that reverse must preserve all the
elements of the list in between calls. All right? Just two calls, just one assertion,
but already quite meaningful property. Let's implement it. Our test is completely red because
there is no code, but let's make this test green by implementing a valid version of reverse.
So we have an entirely wrong function which fully satisfies our properties. So we need
to continue narrowing it down. Think about another property which would make sure that
this is not legal.
Another thing which I thought of is the fact that if you reverse a list
which is not empty, the first thing becomes the last thing. So we can express it like
this. If we have lists of size between 1 and 100, after reversing, the last thing is equal
to the first thing. It turns out that if you run - if you add this property and run
the code, it's surprisingly difficult to write such a reverse function which is invalid.
But let's run it first. We see that our testing library has found a falsifying case, and
look how simple it is. It's just one, zero. It's an example of a value and input list which
falsifies our assumption. But it's not the only example value which our test framework
has considered.
If we scroll down, we will see that a whole lot of other values were
considered. The one at the bottom is the one which the test library stumbled upon and said
this is the first thing which doesn't satisfy the property, or in fact, all the properties
we have here. It said, well, I can give you just that, but instead of giving you something
which might be large and difficult to interpret, it begins to reverse the operation of generators
and begins so-called shrinking. It reduces the magnitude of all the inputs, so, like,
making lists shorter, making numbers smaller, making strings simpler, and tries to generate
a minimal example which still breaks your assumption, which breaks your property. Which
is, again, automation is exactly what we would do.
When we stumble upon a problem, we try
to think, okay, so what actually in this piece of data and the state of the system has led
to the problem? This will automate this process, trying to find it for us, and then it gives
us 1, 0 as a result. This is a minimal thing which breaks your assumptions, and that's
something we can work with.
It's not as impressive when we're working with lists of ints and
strings, but consider the fact that given simple generators of numbers and simple generators
of strings, we can combine those and generate arbitrarily complex data from our domain,
like, say, a customer which is making an order in our shop. When you look at it, and
after you serialise it as JSON, it's just strings and numbers. Some strings looking
like dates. In the end, it's all plain data.
By combining generators, you can generate
arbitrarily complex instances of objects from your domain, and then subject them to
various properties, discovered together with your domain experts. I think since we've already
stumbled into the world of e-commerce, allow me to bring an example.
A while ago, I worked
with a German e-commerce website, and, as customers browsed the catalogue, they ended
up on those pages with those complex addresses representing all the potential filters you
can apply to the databases. Those URLs were parsed into objects representing queries
which would be then sent to, say, Elasticsearch. The logic for, firstly, parsing those URLs
into filters to apply to the database, but even more so for generating valid URLs, this
was a source of regressions, endless complexity, and problems.
There was a lot of stuff going
on there, because we had both like your standard complexity of mapping one structure to a string
representation. We had all the constraints and requirements from SEO department which
had a very clear vision how those URLs should look like, which ones are legal, which ones
are not. Finally, this website was running in a couple of countries, so we had it all
internationalised per country.
We knew we needed to rewrite this module. It wasn't
clear how and what to begin with because there was no reasonable test coverage. We decided
before we start the refactoring, we have very solid coverage of this system. This is when
we reached for property-based testing. So we looked at a problem and said what kind
of property can we discover, can we express using our property-based testing libraries?
And the property is here on the screen. You've seen it before. It has something to do with
the reverse function. If you generate an arbitrary valid descriptor, the object on the bottom
of the slide, you generate a string out of it and you parse the string again, you have
exactly the same object.
So we have this cycle again. And what we did, we expressed exactly
this property that, for all arbitrary search filters descriptors, if you generate a URL
and parse it, we have a quality check. This single property has been an endless source
of caught regressions, fixed bugs. It helped us immensely during the refactoring. We even
were able to come to our colleagues from the SEO department and say, "We can't implement
it. We're sorry." Given this requirement and this requirement, we have this specific
data structure, this specific search, and we see a conflict that those two are mutually
incompatible. This was just fantastic.
Also, the understanding of rules, our colleagues
who work with the domain defined, and the test framework which assisted us in encoding
those rules, and also how it drove our implementation. We started with a minimal generator which
generated just simplest possible search terms, search data structures, and, as we got this
thing right, we kept adding more and more complex generators leading to discovering
various interactions between parts of this logic.
It was just an excellent experience.
This leads me to calling this property-based testing an excellent, maybe not replacement,
but something which accompanies test-driven development very well, where you drive your
tests, where you drive your implementation using properties. This leads not only to better
design, but also to a better understanding of your domain, because, to come up with a
couple of examples, and encode them as a single example-based test, it is relatively easy
compared to how much you have to invest in terms of time and understanding to work with
your colleagues and find general properties spanning the entire system. You understand
the domain you work with far better through this process.
Okay, very well. So we've already
seen some basic principles. As we keep Slido in mind, let's notice that, so far, we were
only looking at systems which were far remvoed from, so to speak, real life. There was nothing
moving inside. There was no change happening. There was no time affecting the system, no
mutable state. There was no real change of the state happening. Can property-based testing
assist us in testing stateful systems which change as time passes? It turns out, the approach,
the property-based approach is surprisingly effective at testing such complex systems.
As we discussed, we can generate arbitrary instances of our business entities of our
domain objects. And if we can generate those instances, we can also generate events describing
change to those instances. Think of all the event-driven architecture talks and keynotes
we've had so far. We can generate an event which describes something that has changed
your system, and if we can generate a single event, we can generate a sequence of them,
apply them to the system, and see whether the system satisfies our properties, satisfies
certain criteria, depending, for example, on the ordering of events, or the fact that
when one event got executed, another one suddenly cannot be reliably executed in the same way
had that previous one not been there.
Let's illustrate this approach with a simple example.
Let's implement a facade for Redis. A simple cache where we can save some data and get
some data with a handful of operations, setting a key, getting a value under that key, deleting
a key, and clearing the entire cache. In order to encode those operations as data, we need
to somehow encode the idea of an operation and then simple classes, simple plain old
objects describing those changes.
We have an operation, and then a set class which sets
a given value at a key, get, allowing us to obtain it, deleting a key, and clearing the
entire database. And using generators of sequences, generators of strings, we can generate arbitrary
instances of those operations as well as entire sequences of operations which will
be executed. And then we need some sort of interpretation mechanism which will take those
sequences and interpret them, run them against our system under test, against our class with
a Redis backing it somewhere on the network. And now, the Redis is in some state
after this run, and those operations brought some results, returned some values.
What we
need now is an oracle, and by oracle, I mean this medium, this person who stands between
us and divine entities whom we can ask is our implementation rubbish? And we run our
tests, we run our sequence of operations against Redis cache and our oracle being java.util.HashMap.
And HashMap is a simplified but fairly correct model of the system under test. It models
it without all the complexity, and we can trust it to be correct. Right? We run both
versions, and if - and we define certain post conditions.
For example, every single operation
has to return the same result as it does against HashMap. And if it doesn't, then the test
fails and shrinking begins, and the test framework begins to shrink and simplify our sequence
of operations, resulting, yielding the smallest possible sequence leading to a problem. Right?
Simplification of multi-step debugging of a complex system.
If
you want to learn more about this approach, I heartily recommend this presentation, testing
the hard stuff and staying sane. John Hughes is one of the authors of property-based testing
and works a lot with this method. He's founded a company which is providing services around
property-based testing and speaks a lot at conferences.
He's a brilliant guy. His talks
are entertaining because he takes examples from the industry, and they're frightening
sometimes. He's contracted, say, in this talk, I don't want to spoil too much because it's
a good talk, but he's describing a contract where they were testing internal communication
bus in cars and automotive industry, and they discovered that the protocol connecting various
subsystems of a car, and there is a lot of them these days, have problems of prioritising
messages, so you really, really shouldn't fiddle with your volume as you're breaking.
It's just beautiful.
And it's all thanks to, you know, tools where various sequences of
complex operations are ran against such a massive system, and then minimised into very
real problems. Yes, so watch this stuff and watch other things which John Hughes has to
offer. In the meantime, we talked about state, right? State is exciting on its own.
Things
are changing. The system is evolving. But state becomes really, really interesting when
it's concurrent and multi-threaded. And this is when things get really complex because
we have absolutely no idea what is going on. And this might lead to deadlocks. This might
lead to unexpected values. This might lead to race conditions, problems which occur when
actions executed concurrently aren't isolated from one another and affect one another, where
a sequential execution of actions wouldn't lead to a result which we have observed in
a concurrent environment.
And, again, a question, can property-based testing help us testing
concurrent stateful systems? Let's think at what does it mean to have a concurrent stateful
system? It means that we have, for example, two threads, the Greek thread and the Arabic
thread, and they execute four operations each at the same time concurrently. We have no
idea what the order will be. We have a single system, for example, Redis cache, and we
execute all those operations from two threads.
The system ended up in some state. But we
know that if those operations, if those two threads have no - don't have any impact on
each other, don't affect each other's execution, the resulting state should be the same as
if those operations were executed in some interleaving. This individual subsequences
are preserved, the order is preserved, but we don't know how individual threads interleaved
their execution.
I hope everybody is following me so far because it is getting more exciting.
We have our case with two threads, and we put those two sequences of operations we generated
of our framework, put them into two threads which are waiting behind some kind of lock,
and then we say go. Those two threads execute those operations, and something happened,
and we ended up in a state S. Our Redis cache is in a state S. Those two threads yielded
values; we can call them S.
Now, we know that if there were no interactions between those
two threads, then those operations were interleaved somehow. So S must be either the result of
execution in this order, or the result of execution in that order, or many other possible
orders.
The number very quickly gets big. So we know that sequential
executions have brought us to those, in this case, nine states. In practice, many, many
more. If S is not a member of this state, of this set, if we can't find S among sequential
executions, we know that the result of the concurrent execution has been affected by
the concurrent environment. There was some kind of race condition, some kind of interaction
between those threads which led to an unexpected result.
This is again something which John
Hughes talks about, testing next-step concurrent stateful systems. It's fascinating because
he shows in another talk of his fascinating examples of complex systems such as thread
pools and distributed databases which, depending on certain interleaving of operations, yields
inexplicable results.
With tools such as these, you can find those cases. Of course,
one important thing is we have to remember that we are dealing with non-determinism.
We have no idea whether we will see that S ever again because there is a chance that
such an interleaving of threads, such decisions of our CPU scheduler will never happen again.
So, when we begin to shrink those sequences of operations, what happens if, for example,
in the next step of shrinking, we will never see the same sequence again, we will just
never see the same problem again? Again, what John Hughes advises here is simply just run
the test ten times. It's more than likely that one of those executions will manifest
the same problem.
On JVM, we have - JVM is a wonderful beast but it has certain limitations.
For example, we don't really have such a fine-grained control over threads and their execution as
on other platforms. For example, if you work on top of the Erlang virtual machine, you
can control threads at a far finer level and actually force the virtual machine to
perform certain interleavings of those threads which makes this process far faster than running
it ten times and hoping that it will work. But still, in any case, we're standing on
the shoulders of giants.
I would like to illustrate this approach with another example. Who here
knows Jepsen and Kyle Kingsbury's work? I see some hands. This is just brilliant. Kyle
Kingsbury is a man who is grilling databases.
What he does is he takes some kind of distributed
database we all here know and trust, and he reads carefully the documentation of the database.
What happens if problems occur, if networking problems occur, if there are glitches, if
suddenly half of your cluster is disconnected from the rest? Then he reads the guarantees
which authors of the database offer in such conditions. Then he replicates exactly those
conditions in his test cluster which he has full control of because it's a bunch of virtual
machines.
He injects glitches into the network connecting database nodes. He's injecting
delays, other problems, half of the cluster is invisible, and so on. Then just generates
sequences of simple operations. It looks whether the result after the partitions are healed
after the database is back in a stable state, whether the contents of the database can be
explained by what's there in the documentation.
Kyle has given a number of talks and wrote
some excellent very deep technical articles about various databases. I don't want to name
but you know them and you use them. I can only recommend reading those and realising
that sometimes we have to trust, but sometimes we should also verify. It's really fantastic
what is happening. Such, in principle, simple methods, how far they can bring. This is nowhere
as powerful as a formal proof that something is correct or not. It's just firing random
tests and more often than not, you find problems which occur in such settings.
My friend, I
want you to be better at testing. Your tests as they are, they totally work. But keep in
mind that if you run into some really complex problems, really demanding, challenging systems
under test, there are tools for that. There are tools which allow you to automate not
only testing itself but also generation of various mean inputs to your logic. But also
automate the process of reducing those failing cases to a minimal thing you can work with,
you can debug. It's not five kilobytes of JSON. Good luck, debug this. No, you have
a minimal thing and you know that everything in this thing had somehow impact on the invalid
state of the system.
And as Daniel told us in his talk just yesterday, automate all the
things which computers can do for us. This is such an excellent example of things which
computers can do for us with advanced enough tooling. And also, keep the design in mind.
The fact that simple properties, few properties can already narrow down our design and help
us implement elegant systems just like we would do using test-driven development.
But
in this test-driven cycle, as we write a property and a simple generator just covering a small
chunk of a domain, we hope we got our implementation right and the property says no, no, no, here's
another example, now fix this, now fix that. We have more and more things to work with.
Once the test is actually green, we can switch to a different property or expand the scope
of our generator covering the larger chunk of our domain.
And all those things in the
end taken together finally lead us to quality. But quality is not a direct result of testing.
It's a consequence of a number of choices around automation, around design, around choice
of tooling which for a given case can allow us to really speed up the testing and verification
process.
Okay, my friends, if you have any questions, I have an email address, I'm on
Twitter and if you need any other links, they're all at my website. So, if you haven't written
any questions on Slido, just shoot them over there. I'll be more than happy to get back
to you later, and with this being said, this is all I've got. You've been a wonderful audience.
Thank you all so much.