11 Mar 2020
I've noticed a bit of a skill gap: I think a lot of developers are not able to code up "streaming" solutions to problems.
However, streaming can often be useful, even needed, in what are now run-of-the-mill web applications; and wonderfully, we often don't need anything fancier than the tools already being used: we just need to know how to use them.
Streaming is any situation in which you process data concurrently with receiving it. This processing can be analyzing the data, or just forwarding it onwards.
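As a minimal sketch of processing data while it is still arriving, the example below hashes a body chunk by chunk. The `receive_chunks` function is a hypothetical stand-in for a network source, and the chunk size is an arbitrary assumption.

```python
import hashlib
from typing import Iterable, Iterator


def receive_chunks(data: bytes, chunk_size: int) -> Iterator[bytes]:
    """Hypothetical source: yields `data` one chunk at a time, as a socket might."""
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]


def streaming_sha256(chunks: Iterable[bytes]) -> str:
    """Update the hash as each chunk arrives; the full body is never assembled."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


body = b"a" * 10_000
streamed_digest = streaming_sha256(receive_chunks(body, chunk_size=64))
```

The result is the same digest you would get from hashing the whole body at once, but only one chunk needs to be in memory at any time.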
There are two main [potential] benefits.
If you start processing the data sooner, before it's all received, then you [might] finish sooner.
**Support higher concurrency / size limits**
Say you would like users to be able to upload 500MB files: in these days of video and hi-res images, this isn't a far-fetched requirement, even for a standard web application. If you don't forward the uploaded data onwards while it's still being uploaded, just a few users uploading concurrently could use all the memory on a server.
[You can upload directly from a browser to the underlying data store. For example, to S3 using presigned URLs. However, this has its own set of drawbacks, omitted here for brevity.]
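Forwarding data onwards while it is still being uploaded can be sketched as a simple read/write loop. This is an illustration only: the `io.BytesIO` objects stand in for a request body and a data store, and the 64KB chunk size is an assumption rather than a recommendation.

```python
import io
from typing import BinaryIO


def forward(source: BinaryIO, destination: BinaryIO, chunk_size: int = 64 * 1024) -> int:
    """Copy source to destination, holding at most one chunk in memory."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # empty bytes means end of stream
            break
        destination.write(chunk)
        total += len(chunk)
    return total


upload = io.BytesIO(b"x" * 1_000_000)  # stand-in for an incoming request body
store = io.BytesIO()                   # stand-in for the underlying data store
copied = forward(upload, store)
```

Memory use per upload is bounded by `chunk_size` rather than by the size of the file. The standard library's `shutil.copyfileobj` performs essentially the same loop.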
Streaming is not a perfect/one-size-fits-all solution: it does have its downsides.
**Testing**
You're testing an upload with a 5KB file, and it works. Are you sure it's streaming and will work with a 5GB file? There are two options that I'm aware of.
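One possible way to check for streaming behaviour in a test, which is not necessarily one of the options meant above, is to count how much input has been pulled by the time the first output appears. The `is_streaming` helper and both example pipelines below are hypothetical names for illustration.

```python
from typing import Callable, Iterable, Iterator


def is_streaming(pipeline: Callable[[Iterable[bytes]], Iterator[bytes]],
                 n_chunks: int = 100) -> bool:
    """True if the pipeline emits output before consuming all of its input."""
    produced = 0

    def source() -> Iterator[bytes]:
        nonlocal produced
        for _ in range(n_chunks):
            produced += 1
            yield b"x" * 1024

    for _ in pipeline(source()):
        # Inspect the counter as soon as the first output chunk appears.
        return produced < n_chunks
    return False  # no output at all


def streaming_upper(chunks: Iterable[bytes]) -> Iterator[bytes]:
    for chunk in chunks:
        yield chunk.upper()


def buffering_upper(chunks: Iterable[bytes]) -> Iterator[bytes]:
    data = b"".join(chunks)  # reads everything before emitting anything
    yield data.upper()
```

Here `is_streaming(streaming_upper)` is true and `is_streaming(buffering_upper)` is false, even though both produce identical bytes for a small test file.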
**Errors**
Handling errors, i.e. communicating and responding to them, can be more difficult.
Conveniently, HTTP has some of this built-in. If you stream an HTTP body with a content-length header specifying the number of bytes, and the receiver hasn't received that many bytes by the time the connection closes, it knows an error has occurred. If transfer-encoding: chunked is used and the receiver doesn't receive a 0-length chunk at the end, it knows there has been an error.
It's not perfect though: there is no way to send an HTTP status code once the body has begun to stream. But for many situations, this is enough.
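The chunked case can be illustrated with a small decoder that treats a missing 0-length chunk as an error. This is a simplified sketch, not a full HTTP implementation: it ignores trailers, assumes `body.read(n)` returns fewer than `n` bytes only at end of stream (true for `io.BytesIO` and buffered readers), and `TruncatedBody` is a made-up exception name.

```python
import io
from typing import BinaryIO, Iterator


class TruncatedBody(Exception):
    """Raised when the stream ends before the terminating 0-length chunk."""


def read_chunked(body: BinaryIO) -> Iterator[bytes]:
    while True:
        size_line = body.readline()
        if not size_line.endswith(b"\r\n"):
            raise TruncatedBody("stream ended before a chunk size line")
        size = int(size_line.split(b";")[0].strip(), 16)  # ignore chunk extensions
        if size == 0:
            return  # the 0-length chunk marks a complete body
        chunk = body.read(size)
        if len(chunk) != size or body.read(2) != b"\r\n":
            raise TruncatedBody("stream ended mid-chunk")
        yield chunk


complete = io.BytesIO(b"5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n")
decoded = b"".join(read_chunked(complete))
```

A body cut off after `b"5\r\nhello\r\n"` raises `TruncatedBody`, although the chunk received so far has already been yielded to the consumer, which is exactly why a status code can no longer communicate the failure.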
What to do when an error has occurred may be more tricky. With a non-streaming multi-stage pipeline, if one part fails, you can usually retry because you still have the source bytes to retry with. However, when streaming, the bytes have gone. To retry, you have to build in a mechanism to re-retrieve them from the source.
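One way to build in such a mechanism is to pass around a function that re-opens the source, rather than the stream itself. The sketch below assumes the source can be re-fetched on demand and that the destination can discard a partial attempt; `stream_with_retry`, `open_source` and `flaky_send` are all hypothetical names.

```python
from typing import Callable, Iterable, Iterator, List


def stream_with_retry(open_source: Callable[[], Iterable[bytes]],
                      send: Callable[[Iterable[bytes]], None],
                      attempts: int = 3) -> int:
    """Run send(open_source()) until it succeeds; return the attempt number used."""
    for attempt in range(1, attempts + 1):
        try:
            send(open_source())  # a fresh stream each time: the old bytes are gone
            return attempt
        except OSError:
            if attempt == attempts:
                raise
    raise ValueError("attempts must be at least 1")


def open_source() -> Iterator[bytes]:
    yield from (b"part-1", b"part-2", b"part-3")


received: List[bytes] = []
state = {"calls": 0}


def flaky_send(chunks: Iterable[bytes]) -> None:
    state["calls"] += 1
    received.clear()  # discard anything left over from a failed attempt
    for index, chunk in enumerate(chunks):
        if state["calls"] == 1 and index == 1:
            raise ConnectionError("simulated failure mid-stream")
        received.append(chunk)


attempt_used = stream_with_retry(open_source, flaky_send)
```

The first attempt fails part-way through, and the second succeeds only because `open_source` can be called again to produce the bytes from the start.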
**Complexity**
Especially when considering error handling, retrying, or say, efficiently dealing with bandwidth differences/variation in different parts of the stream, there could be more complexity compared to a non-streaming solution.
This being said, a) you may not need to implement such things [e.g. OS-provided TCP buffers may adequately compensate for bandwidth variation], and b) I suspect the complexity is sometimes overstated and conflated with unfamiliarity [although it would be naive to think this isn't a problem, as mentioned below].
**Performance**
Ironically, there might be a performance penalty compared to non-streaming solutions due to what could be radically different operations / orders of operations. This could be especially true if using streaming for smaller amounts of data.
**Homogeneity**
Each part of the pipeline needs to support streaming. In a lot of cases it's not the default, which is unfortunate: you can use code that supports streaming to process data in a non-streaming way [by just using a single "chunk"], but it's impossible to do the opposite.
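The single "chunk" point can be shown with a small sketch. The `count_newlines` function is a made-up example of streaming-capable code: it accepts any iterable of chunks, so it works equally well when handed all the data at once.

```python
from typing import Iterable


def count_newlines(chunks: Iterable[bytes]) -> int:
    """Streaming-capable: only ever looks at one chunk at a time."""
    return sum(chunk.count(b"\n") for chunk in chunks)


data = b"a\nb\nc\n"

# Streaming use: the data arrives in several pieces.
streamed = count_newlines(iter([b"a\n", b"b", b"\nc\n"]))

# Non-streaming use: the same function, given everything as a single chunk.
all_at_once = count_newlines([data])
```

A function written to take one complete `bytes` object offers no such flexibility: the caller has to assemble the whole body before calling it.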
**Unfamiliarity**
Streaming has an unfortunate problem: it's the skill gap itself.
Since fewer developers are familiar with it, issues are less likely to be spotted in code reviews, streaming behaviour may be accidentally broken [if there aren't appropriate tests on it], there are fewer people to ask for help, and unfortunately, any help that is given has a higher chance of being misleading.
This is admittedly a bit of a chicken/egg situation!
Wonderfully, I think you can get a lot of valuable experience from just a few small practice web-based projects.
transfer-encoding: chunked and with a specified
Once you have done these, you would be in a much better place to weigh up the trade-offs to know if a streaming solution is right for any given real-world project. At the very least, you'll be in a better place to review colleagues' streaming-based code.