The OS breaks all write and read requests into 4K blocks (if the request is larger than that), since that is the memory-page granularity on an Intel virtual-memory system. The only way to issue genuinely larger blocks is to bypass the operating system and do raw reads/writes, which do not need a filesystem.
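As a minimal sketch of that bypass (assuming Linux; the device name "/dev/sdX" is a placeholder, and writing to a raw device destroys its contents), O_DIRECT skips the page cache so the request size you ask for is the request size the driver sees:

```c
/* Sketch: bypass the page cache with O_DIRECT so a request larger than
 * 4K reaches the driver as one request. DANGER: "/dev/sdX" is a
 * placeholder -- writing to a raw device destroys data. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const size_t blk = 1024 * 1024;              /* 1 MiB, far above the 4K page size */
    int fd = open("/dev/sdX", O_WRONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    /* O_DIRECT requires the buffer, offset, and length to be aligned,
     * typically to 512 bytes or 4K depending on the device */
    if (posix_memalign(&buf, 4096, blk) != 0) { perror("posix_memalign"); return 1; }
    memset(buf, 0xA5, blk);

    ssize_t n = write(fd, buf, blk);             /* one request handed to the driver */
    if (n < 0) perror("write");
    else printf("wrote %zd bytes in a single request\n", n);

    free(buf);
    close(fd);
    return 0;
}
```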
If the driver supports scatter/gather, then writes can be combined when they land sequentially on the media. The purpose of using multiple threads is to mitigate the sequential nature of a single thread. Bear in mind that the operating system is always writing data of its own, independent of the application.
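A rough sketch of that multi-threaded pattern (POSIX assumed; the file name "testfile.bin", thread count, and chunk counts are arbitrary placeholders) has each thread pwrite() into its own region, so the drive never sees one neat sequential stream:

```c
/* Sketch: several writer threads, each with its own region of one test
 * file, so the combined stream is not a single sequential run. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define NTHREADS 4
#define CHUNK    4096                 /* 4K, matching the page granularity above */
#define CHUNKS_PER_THREAD 256

static int fd;                        /* shared test file */

static void *writer(void *arg)
{
    long id = (long)arg;
    char buf[CHUNK];
    memset(buf, (int)id, sizeof buf);
    off_t base = (off_t)id * CHUNKS_PER_THREAD * CHUNK;
    for (int i = 0; i < CHUNKS_PER_THREAD; i++)
        pwrite(fd, buf, CHUNK, base + (off_t)i * CHUNK);  /* no shared file offset */
    return NULL;
}

int main(void)
{
    fd = open("testfile.bin", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, writer, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    close(fd);
    return 0;
}
```

(Build with -lpthread. pwrite() is used rather than write() so the threads do not race on a shared file offset.)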
Writing 4K chunks (or smaller) defeats the driver's scatter/gather coalescing attempts, where applicable. Using different write sizes below that (on 1K boundaries) imposes a higher degree of randomness on the test when multiple threads are in play.
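One way to express that variation (a sketch only; seed, iteration count, region size, and file name are all placeholders) is to pick a random size of 1K, 2K, 3K, or 4K and a random 1K-aligned offset per write:

```c
/* Sketch: writes of 1K..4K on 1K boundaries at random 1K-aligned
 * offsets, to keep the driver from coalescing them. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const off_t region = 64L * 1024 * 1024;      /* 64 MiB test region */
    int fd = open("testfile.bin", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096];
    memset(buf, 0x5A, sizeof buf);
    srand(42);                                   /* fixed seed: repeatable pattern */

    for (int i = 0; i < 1000; i++) {
        size_t len = (size_t)(1 + rand() % 4) * 1024;          /* 1K, 2K, 3K, or 4K */
        off_t  off = ((off_t)rand() % (region / 1024)) * 1024; /* 1K-aligned offset */
        if (pwrite(fd, buf, len, off) < 0) { perror("pwrite"); break; }
    }
    close(fd);
    return 0;
}
```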
Doing a full erase (filesystem level, or drive level?) and then filling the drive up again is not realistic at all. Filling the drive with files, then updating those files and changing their sizes, more closely resembles what happens in reality. What they are doing is a general reliability test that cannot be related to real usage.
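A sketch of that more realistic workload (file count, sizes, and names are illustrative placeholders) would be a pre-fill phase followed by an update phase that rewrites and resizes files in place, rather than erase-and-refill passes:

```c
/* Sketch: phase 1 fills a directory with files of assorted sizes;
 * phase 2 keeps picking files at random, resizing them, and
 * overwriting part of their contents. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NFILES 100

int main(void)
{
    char name[64], buf[4096];
    memset(buf, 0x33, sizeof buf);
    srand(7);

    /* Phase 1: fill with files of assorted sizes */
    for (int i = 0; i < NFILES; i++) {
        snprintf(name, sizeof name, "fill_%03d.dat", i);
        int fd = open(name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }
        int chunks = 1 + rand() % 64;                   /* 4K .. 256K per file */
        for (int c = 0; c < chunks; c++)
            write(fd, buf, sizeof buf);
        close(fd);
    }

    /* Phase 2: update files in place and change their sizes */
    for (int pass = 0; pass < 1000; pass++) {
        snprintf(name, sizeof name, "fill_%03d.dat", rand() % NFILES);
        int fd = open(name, O_WRONLY, 0644);
        if (fd < 0) continue;
        ftruncate(fd, (off_t)(1 + rand() % 64) * 4096); /* grow or shrink the file */
        pwrite(fd, buf, sizeof buf, 0);                 /* overwrite the head */
        close(fd);
    }
    return 0;
}
```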
My interest lies in what a real-life test would look like compared to this static, canned test. We still need the actual count of operations performed to give the data real meaning in terms of longevity.
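Getting that count is straightforward if the test loop is instrumented; a minimal sketch (counted_pwrite is a hypothetical wrapper, not anything from the test being discussed) just tallies operations and bytes:

```c
/* Sketch: route every write through a counting wrapper so the result
 * can be reported as "N operations / M bytes written" rather than
 * pass/fail. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static uint64_t op_count, byte_count;

/* hypothetical wrapper: every write in the test goes through here */
static ssize_t counted_pwrite(int fd, const void *buf, size_t len, off_t off)
{
    ssize_t n = pwrite(fd, buf, len, off);
    if (n > 0) { op_count++; byte_count += (uint64_t)n; }
    return n;
}

int main(void)
{
    int fd = open("testfile.bin", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096];
    memset(buf, 0xEE, sizeof buf);
    for (int i = 0; i < 10000; i++)
        counted_pwrite(fd, buf, sizeof buf, (off_t)i * sizeof buf);
    close(fd);

    printf("total: %llu write ops, %llu bytes\n",
           (unsigned long long)op_count, (unsigned long long)byte_count);
    return 0;
}
```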