TTFB – This Isn’t The Metric You’re Looking For
TTFB (Time to First Byte) is a metric used by Open Banking UK, originally defined by the Open Banking Implementation Entity (OBIE). The trouble is, it's also a textbook example of Goodhart's Law, which makes it hugely problematic for monitoring.
Essentially Goodhart states:
Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.
So, harking back to my previous comments about how self-serving the monitoring industry is: if you define a metric, you're defining something somebody can game if you don't pay attention.
This came up during a frank and open exchange of views(1) we were having with a client and OBUK about a metric we were recording for an API. But rather than use the client, who shall remain nameless, as an example, we'll use two public, real-world examples which are extremely interesting.
The problem stemmed from the nature of the defined specification.
"Time to First Byte (TTFB) is defined in the context of the OBIE reporting template as the total time taken for an API endpoint call to generate a response message and start providing to AISPs, PISPs or CBPIIs all the information as required as defined in 36(1)(a), 36(1)(b), 36(1)(c) of the RTS and 66(4)(b), 65(3) of the PSD2. The response time clock should start at the point the endpoint call is fully received by the ASPSP and should stop at the point the first byte of the response message is transmitted to the AISP, PISP, or CBPII. For further clarifications on TTLB and TTFB, please refer to the Notes section at the end of the table…"
What this looks like in diagram form is this:
So TTFB is basically t6 - t1. Are we keeping up? Good. Here's the thing: the spec assumes each of the steps in the diagram happens sequentially. The gateway waits to receive the whole request, does some work, and sends it on to the downstream service. It then waits for that service to respond before finally sending the response back to the client in one chunk.
The trouble is, that's not how it always works. A gateway may be capable of responding (or may even be designed to respond) to a request before the full response has even been calculated. As we're talking about an HTTP response, we know the first eight characters of the response will always be "HTTP/1.1". A gateway can send those immediately and then wait to see what status code it needs to send. It can stream the remaining bytes of the response as and when they're calculated. In this case, t2 and t6 happen at the same time.
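To make that concrete, here's a minimal sketch of the trick. Everything here is invented for illustration: a toy single-connection server on localhost, 0.2 seconds of simulated back-end "work", and a hand-rolled client that records time to first byte and time to last byte.

```python
import socket
import threading
import time

BODY = b'{"status": "ok"}'

def gateway(server_sock):
    # Toy "gateway": flush the protocol preamble before the response exists.
    conn, _ = server_sock.accept()
    with conn:
        conn.recv(1024)                 # the request arrives (t1)
        conn.sendall(b"HTTP/1.1 ")      # first byte leaves immediately
        time.sleep(0.2)                 # back-end work happens *afterwards*
        conn.sendall(b"200 OK\r\n"
                     b"Content-Length: %d\r\n\r\n" % len(BODY) + BODY)

server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
threading.Thread(target=gateway, args=(server,), daemon=True).start()

client = socket.create_connection(server.getsockname())
t_start = time.monotonic()
client.sendall(b"GET /accounts HTTP/1.1\r\nHost: example\r\n\r\n")
first = client.recv(1)                  # blocks until the first byte: b"H"
ttfb = time.monotonic() - t_start
rest = b""
while not rest.endswith(BODY):          # drain the rest of the response
    chunk = client.recv(4096)
    if not chunk:
        break
    rest += chunk
ttlb = time.monotonic() - t_start
client.close()

print(f"first byte: {first!r}, TTFB: {ttfb * 1000:.0f} ms, TTLB: {ttlb * 1000:.0f} ms")
```

The reported TTFB is a few milliseconds while the full response takes over 200 ms to arrive: the metric is measuring how quickly the gateway can emit an 'H', not how quickly the back-end does its job.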
The time to first byte is just telling you how quickly the gateway can respond with an 'H'. Why 'H', you ask? Because even if the rest of the stuff comes down the pipe seconds later, the first byte is always an 'H'.
What they want is a metric for when the first meaningful chunk of data is generated by the back-end, but the quick among you will see that this can be gamed really easily too. If it's a JSON API, in all likelihood that first meaningful byte is a "{".
So what you can also, perfectly validly, have is this setup:
Here's an example from two almost identical APIs from different providers, Box and Dropbox, which shows the problem in more detail. The first graph here is the processing time as measured between the t1 and t6 points.
The expectation is that everybody behaves like Box do: the query arrives and is processed, and then the gateway starts sending back the file information.
But look at Dropbox. They respond instantly that yes, they received the query and they’ll get back to you. You could argue that’s bad practice, because you could hide poor performance in there.
Well, yes you could, but only if your performance is bad.
If we look at the actual round-trip times to last byte for both services, Dropbox manages to be slightly faster overall than Box.
So, the problem here is that the TTFB metric isn't actually telling anybody what they think it is. Furthermore, we would argue that as a strict API performance metric, TTFB as defined above isn't going to tell anybody anything. You really can't ignore where the call is being made from and what impact the internet has on the process; otherwise you're essentially self-reporting that your gateway is very fast indeed, and that anybody who says otherwise is a stinking liar.
The way to address this is to be clear on your definitions and, if you're a regulatory body, be prepared to change your opinions over time as new data comes to light. You also need to be wary of metrics that are easy to measure, because they're likely to be easy to game at the point of measurement too. As our old friend Goodhart said!
If we go back to the OBUK example: the TTFB metric of t6 - t1 is essentially meaningless if somebody disputes it. No third-party AISP is going to see that number where their application is running; they're going to be seeing something like (t6 - t1) + handshake + TCP connection + whatever DNS is doing to them that day.
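A quick back-of-the-envelope version of that sum, using illustrative component timings that are entirely made up (none of these numbers come from real measurements):

```python
# What the ASPSP self-reports vs. what the third party actually experiences.
# All figures are invented for illustration.
gateway_ttfb  = 0.015   # the t6 - t1 the ASPSP reports (seconds)
dns_lookup    = 0.030   # resolver round trip on a bad day
tcp_connect   = 0.025   # SYN / SYN-ACK round trip
tls_handshake = 0.050   # further round trips to establish TLS

observed_ttfb = dns_lookup + tcp_connect + tls_handshake + gateway_ttfb
print(f"self-reported TTFB: {gateway_ttfb * 1000:.0f} ms, "
      f"TTFB observed by the AISP: {observed_ttfb * 1000:.0f} ms")
```

With numbers in this ballpark, the network overhead dwarfs the self-reported figure several times over, which is exactly why a disputed t6 - t1 settles nothing.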
Of course, the alternative, Time to Last Byte (TTLB), only becomes useful in comparisons if you're comparing the same size of response, as we are in the Box/Dropbox example. It's unfair to compare an API returning 10 items with an API returning 1000 items on TTLB alone.
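One way to see the unfairness is to normalize TTLB by what was actually transferred. A hedged sketch with made-up numbers (not real Box or Dropbox data):

```python
# Raw TTLB makes the 1000-item API look ten times slower; per-byte and
# per-item views tell the opposite story. All figures are invented.
samples = {
    "api_10_items":   {"items": 10,   "bytes": 4_000,   "ttlb_s": 0.08},
    "api_1000_items": {"items": 1000, "bytes": 400_000, "ttlb_s": 0.90},
}

for name, s in samples.items():
    throughput_kbs = s["bytes"] / s["ttlb_s"] / 1000   # kB per second
    ms_per_item = s["ttlb_s"] / s["items"] * 1000      # milliseconds per item
    print(f"{name}: TTLB {s['ttlb_s'] * 1000:.0f} ms, "
          f"{throughput_kbs:.0f} kB/s, {ms_per_item:.2f} ms/item")
```

Here the larger response has a much higher TTLB but far better throughput and per-item time, so which service is "faster" depends entirely on which normalization you agree on up front.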
Anyway, this has been a long-winded way of saying: be careful what you ask for, and be careful what you want measured. Measure many things, and agree which are actually the most useful for comparing different services from different vendors. You could use a blended metric that does some very clever patented statistical analysis over time on consistency and quality of service in a way that allows for comparisons. You could call it the Cloud API Service Consistency (CASC) Score, and when you've realized you needed it all along, give me a call.
But do be careful of TTFB: it's not the metric you're looking for.
(1) This sounds like it's a good thing but, if you're not aware, it's diplomat-speak for an argument.