Streaming: a skill gap?
11 Mar 2020
[You can upload directly from a browser to the underlying data store, for example to S3 using presigned URLs. However, this has its own set of drawbacks, omitted here for brevity.]
Testing
You're testing an upload with a 5KB file, and it works. Are you sure it's streaming and will work with a 5GB file? There are two options that I'm aware of.
- Actually test a 5GB file [making sure you have less than 5GB of memory available]. While this is quite a good "real" test, it can be slow.
- Hook into both sides of the streaming process, and ensure that the target receives data before the source has sent all of its data. You can do this with smaller data, and so such a test can be quick. However, this can be more brittle with respect to refactorings, i.e. the test can fail while the production behaviour continues to work.
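The second option can be sketched roughly as follows. This is a minimal illustration, not a real upload: streaming_upload and buffering_upload are hypothetical stand-ins for the code under test, and the "hooks" are simply a shared list of events that both the source and the target append to.

```python
def recording_source(chunks, events):
    """Yield chunks one at a time, recording each hand-over in `events`."""
    for index, chunk in enumerate(chunks):
        events.append(("sent", index))
        yield chunk


def streaming_upload(body, events):
    """Hypothetical code under test: handles each chunk as it arrives."""
    total = 0
    for index, chunk in enumerate(body):
        events.append(("received", index))
        total += len(chunk)
    return total


def buffering_upload(body, events):
    """Hypothetical broken variant: drains the whole source before doing anything."""
    data = b"".join(body)
    events.append(("received", 0))
    return len(data)


def target_received_before_source_finished(events):
    """True if the first 'received' event happened before the last 'sent' event."""
    kinds = [kind for kind, _ in events]
    first_received = kinds.index("received")
    last_sent = len(kinds) - 1 - kinds[::-1].index("sent")
    return first_received < last_sent


chunks = [b"aaaa", b"bbbb", b"cccc"]

streaming_events = []
streaming_total = streaming_upload(
    recording_source(chunks, streaming_events), streaming_events
)

buffering_events = []
buffering_total = buffering_upload(
    recording_source(chunks, buffering_events), buffering_events
)
```

Both variants return the same total, so a test that only checks the result cannot tell them apart; only the ordering of events does. This is also where the brittleness comes from: the test depends on where the hooks sit, not just on the outcome.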
Errors
Handling errors, i.e. communicating and responding to them, can be more difficult.
Conveniently, HTTP has some of this built in. If an HTTP body is streamed with a content-length header specifying the number of bytes, and the receiver hasn't received that many bytes by the time the connection closes, it knows an error has occurred. If transfer-encoding: chunked is used and the receiver doesn't receive a 0-length chunk at the end, it knows there has been an error.
It's not perfect though: there is no way to send an HTTP status code once the body has begun to stream. But for many situations, this is enough.
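To make the two checks concrete, here is a small sketch of what a receiver could do. It is an illustration only: the function names and the IncompleteBody exception are made up for this example, the body is held in memory for simplicity rather than streamed, and chunk extensions and trailers are ignored. Real HTTP clients do this for you.

```python
class IncompleteBody(Exception):
    """Raised when the connection closed before the full body arrived."""


def check_content_length(declared_length, received):
    """content-length case: compare the bytes received with the declared length."""
    if len(received) != declared_length:
        raise IncompleteBody(
            f"expected {declared_length} bytes, received {len(received)}"
        )
    return received


def decode_chunked(raw):
    """transfer-encoding: chunked case: require the terminating 0-length chunk."""
    body = bytearray()
    position = 0
    while True:
        line_end = raw.find(b"\r\n", position)
        if line_end == -1:
            raise IncompleteBody("connection closed before the final 0-length chunk")
        # The chunk size is hexadecimal; anything after ';' is an extension
        size = int(raw[position:line_end].split(b";")[0].decode("ascii"), 16)
        position = line_end + 2
        if size == 0:
            return bytes(body)
        chunk = raw[position:position + size]
        if len(chunk) < size:
            raise IncompleteBody("connection closed in the middle of a chunk")
        body += chunk
        position += size + 2  # skip the chunk data and its trailing CRLF


complete = b"5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n"
truncated = b"5\r\nhello\r\n6\r\n world\r\n"  # no 0-length chunk at the end
decoded = decode_chunked(complete)
```

In both cases the receiver only finds out at the end, after it may already have processed most of the body, which is why a status code alone can't carry the error.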
What to do when an error has occurred may be more tricky. With a non-streaming multi-stage pipeline, if one part fails, you can usually retry because you still have the source bytes to retry with. However, if streaming, the bytes have gone. To retry, you have to build in a mechanism to re-retrieve them from the source.
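One way to build in such a mechanism, sketched under assumptions: instead of passing the stream itself to the sending code, pass a function that can open a fresh stream from the source. The names stream_with_retry, open_source and flaky_send are hypothetical, and a real version would probably also want backoff and a narrower choice of exceptions.

```python
def stream_with_retry(open_source, send, attempts=3):
    """Call send() with a freshly opened stream, re-opening it on each retry.

    open_source: zero-argument callable returning a new iterable of bytes chunks
    send: callable that consumes such an iterable
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            # A half-consumed stream can't be rewound, so start again from the source
            return send(open_source())
        except OSError as error:
            last_error = error
    raise last_error


opened = []


def open_source():
    """Stand-in for re-fetching the data, e.g. re-requesting it from storage."""
    opened.append(1)
    return iter([b"a", b"b", b"c"])


def flaky_send(chunks):
    """Stand-in for a target that fails part-way through the first attempt."""
    received = b""
    for chunk in chunks:
        received += chunk
        if len(opened) == 1 and len(received) == 2:
            raise OSError("connection dropped mid-stream")
    return received


result = stream_with_retry(open_source, flaky_send)
```

The cost is that the source gets read twice when a retry happens, which may or may not be acceptable depending on where the bytes come from.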
Complexity
Especially when considering error handling, retrying, or, say, efficiently dealing with bandwidth differences or variation in different parts of the stream, there could be more complexity compared to a non-streaming solution.
This being said, a) you may not need to implement such things [e.g. OS-provided TCP buffers may adequately compensate for bandwidth variation], and b) I suspect the complexity is sometimes overstated and conflated with unfamiliarity [although it would be naive to think this isn't a problem, as mentioned below].
Performance
Ironically, there might be a performance penalty compared to non-streaming solutions due to what could be radically different operations / orders of operations. This could be especially true if using streaming for smaller amounts of data.
Homogeneity
Each part of the pipeline needs to support streaming. It's not the default in a lot of cases, which is unfortunate: you can use code that supports streaming to process data in a non-streaming way [by just using a single "chunk"], but it's impossible to do the opposite.
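As a small illustration of the single "chunk" point, here is a function written against an iterable of chunks, using SHA-256 hashing from Python's standard library as an arbitrary example of processing. The function name is made up for this sketch.

```python
import hashlib


def sha256_of_chunks(chunks):
    """Hash an iterable of bytes chunks without holding them all in memory."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


data = b"hello world"

# Streaming use: many small chunks, e.g. as read from a socket or file
streamed = sha256_of_chunks([b"hello", b" ", b"world"])

# Non-streaming use: the same code, given everything as one "chunk"
all_at_once = sha256_of_chunks([data])
```

The reverse doesn't work: a function that only accepts a complete bytes object needs all the bytes before it can start.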
Unfamiliarity
Streaming has an unfortunate problem: it's the skill gap itself.
Since fewer developers are familiar with it, issues are less likely to be spotted in code reviews, streaming behaviour may be accidentally broken [if there aren't appropriate tests on it], there are fewer people to ask for help, and unfortunately, any help that is given has a higher chance of being misleading.
This is admittedly a bit of a chicken/egg situation!
Wonderfully, I think you can get a lot of valuable experience from just a few small practice web-based projects.
- A GET endpoint that responds with a generated HTTP response of several GBs, just of some fake data.
- A GET endpoint that responds with a file from the filesystem of several GBs. Try with both transfer-encoding: chunked and with a specified content-length.
- Proxying a file to or from S3 through a server. Try with a plain HTTP client, not just one that is AWS-aware such as Boto3.
- Downloading a Postgres table of several GBs. Try with just a single query. Try responding with CSV or JSON.
- Accept a large CSV upload and calculate some basic stats on the columns while it is being uploaded, e.g. min, max, mean, standard deviation.
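As a starting point for the last of these, here is a rough sketch of the stats part only, with no web server involved. It assumes the upload arrives as an iterable of bytes chunks, that the CSV is UTF-8 with a header row, "\n" line endings, only numeric columns, and no newlines inside quoted fields. The standard deviation uses Welford's online algorithm, which is one common choice rather than the only one, and all names are made up for this example.

```python
import codecs
import csv
import math


class RunningStats:
    """Min, max, mean and standard deviation, updated one value at a time."""

    def __init__(self):
        self.count = 0
        self.min = math.inf
        self.max = -math.inf
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared differences from the mean

    def add(self, value):
        self.count += 1
        self.min = min(self.min, value)
        self.max = max(self.max, value)
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def stddev(self):
        """Population standard deviation of the values seen so far."""
        return math.sqrt(self._m2 / self.count) if self.count else math.nan


def iter_lines(chunks):
    """Turn bytes chunks into text lines; lines may span chunk boundaries."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    buffer = ""
    for chunk in chunks:
        buffer += decoder.decode(chunk)
        *lines, buffer = buffer.split("\n")
        yield from lines
    buffer += decoder.decode(b"", final=True)
    if buffer:
        yield buffer


def column_stats(chunks):
    """Compute per-column stats while the chunks are still arriving."""
    rows = csv.reader(iter_lines(chunks))
    header = next(rows)
    stats = {name: RunningStats() for name in header}
    for row in rows:
        for name, value in zip(header, row):
            stats[name].add(float(value))
    return stats


# Chunk boundaries deliberately fall in the middle of rows
upload = [b"a,b\n1,1", b"0\n2,20\n3", b",30\n"]
stats = column_stats(upload)
```

Memory use here depends on the number of columns and the longest line, not on the size of the upload, which is the property worth checking in the real exercise.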
Once you have done these, you would be in a much better place to weigh up the trade-offs to know if a streaming solution is right for any given real-world project. At the very least, you'll be in a better place to review colleagues' streaming-based code.
Michal Charemza