What are some good approaches to predicting the completion time of a long process?

tl;dr: I want to predict file copy completion. What are good methods given the start time and the current progress?

Firstly, I am aware that this is not at all a simple problem, and that predicting the future is difficult to do well. For context, I'm trying to predict the completion of a long file copy.

Current Approach:

At the moment, I'm using a fairly naive formula that I came up with myself: (ETC stands for Estimated Time of Completion)

ETC = currTime + elapsedTime * (totalSize - sizeDone) / sizeDone

This works on the assumption that the remaining data will be copied at the average speed observed so far, which may or may not be a realistic assumption (I'm dealing with tape archives here).

  • PRO: The ETC will change gradually, and becomes more and more accurate as the process nears completion.
  • CON: It doesn't react well to unexpected events, like the file copy becoming stuck or speeding up quickly.
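
As a reference point, here is how that formula might look in code; a minimal Python sketch, using time.time() timestamps and placeholder names of my own (start_time, total_size, size_done):

    import time

    def naive_etc(start_time, total_size, size_done):
        """ETC assuming the average speed observed so far continues."""
        if size_done <= 0:
            return None  # no progress yet, nothing to extrapolate from
        now = time.time()
        elapsed = now - start_time
        # elapsed / size_done is the average time per byte so far.
        return now + elapsed * (total_size - size_done) / size_done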

Another idea:

The next idea I had was to keep a record of the progress for the last n seconds (or minutes, given that these archives are supposed to take hours), and just do something like the following, where currAvg is the average time per unit of data over that window (keeping the units consistent with the first formula):

ETC = currTime + currAvg * (totalSize - sizeDone)

This is kind of the opposite of the first method in that:

  • PRO: If the speed changes quickly, the ETC will update quickly to reflect the current state of affairs.
  • CON: The ETC may jump around a lot if the speed is inconsistent.
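
For comparison, a sketch of this windowed estimator; it keeps (timestamp, bytes done) samples from the last few seconds and extrapolates from the recent speed, which is the same formula expressed as a division by speed (the class and names are illustrative, not from any particular API):

    import time
    from collections import deque

    class WindowedETC:
        """ETC based on the average speed over the last `window` seconds."""

        def __init__(self, window=60.0):
            self.window = window
            self.samples = deque()  # (timestamp, size_done) pairs

        def update(self, size_done):
            now = time.time()
            self.samples.append((now, size_done))
            # Drop samples that have fallen out of the window.
            while self.samples and now - self.samples[0][0] > self.window:
                self.samples.popleft()

        def etc(self, total_size):
            if len(self.samples) < 2:
                return None  # not enough data yet
            (t0, s0), (t1, s1) = self.samples[0], self.samples[-1]
            if t1 <= t0 or s1 <= s0:
                return None  # stalled or degenerate window: no estimate
            speed = (s1 - s0) / (t1 - t0)  # bytes per second over the window
            return t1 + (total_size - s1) / speed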

Finally

I'm reminded of the control engineering subjects I did at uni, where the objective is essentially to try to get a system that reacts quickly to sudden changes, but isn't unstable and crazy.

With that said, the other option I could think of would be to calculate the average of both of the above, perhaps with some kind of weighting (see the sketch after this list):

  • Weight the first method more if the copy has a fairly consistent long-term average speed, even if it jumps around a bit locally.
  • Weight the second method more if the copy speed is unpredictable, and is likely to do things like speed up/slow down for long periods, or stop altogether for long periods.
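
One possible way to realize that weighting, sketched under my own assumptions (using the disagreement between recent and long-term speed as a stand-in for "unpredictability"):

    def blended_etc(etc_long, etc_recent, long_speed, recent_speed):
        """Blend the two estimates: trust the long-term one while speeds agree."""
        if etc_long is None:
            return etc_recent
        if etc_recent is None:
            return etc_long
        # Relative disagreement between recent and long-term speed, clamped to [0, 1].
        w = min(1.0, abs(recent_speed - long_speed) / max(long_speed, 1e-9))
        # Stable speed -> w near 0 -> mostly the long-term estimate;
        # erratic speed -> w near 1 -> mostly the recent estimate.
        return (1.0 - w) * etc_long + w * etc_recent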

What I am really asking for is:

  • Any alternative approaches to the two I have given.
  • If and how you would combine several different methods to get a final prediction.


If you feel that the accuracy of the prediction is important, the way to go about building a predictive model is as follows:

  1. collect some real-world measurements;
  2. split them into three disjoint sets: training, validation and test;
  3. come up with some predictive models (you already have two plus a mix) and fit them using the training set;
  4. check predictive performance of the models on the validation set and pick the one that performs best;
  5. use the test set to assess the out-of-sample prediction error of the chosen model.

I'd hazard a guess that a linear combination of your current model and the "average over the last n seconds" would perform pretty well for the problem at hand. The optimal weights for the linear combination can be fitted using linear regression (a one-liner in R).
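
The answer mentions R; the equivalent fit is also a few lines with NumPy. A sketch, assuming you have logged, for many snapshots of past copies, each model's predicted remaining time alongside the remaining time actually observed (the arrays below are made-up illustration data):

    import numpy as np

    # Each row is one progress snapshot from historical transfers.
    pred1 = np.array([120.0, 90.0, 60.0, 30.0])   # model 1: remaining-time predictions
    pred2 = np.array([150.0, 80.0, 65.0, 25.0])   # model 2: remaining-time predictions
    actual = np.array([130.0, 85.0, 62.0, 28.0])  # remaining time actually observed

    X = np.column_stack([pred1, pred2])
    weights, *_ = np.linalg.lstsq(X, actual, rcond=None)

    combined = X @ weights  # linear combination of the two models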

An excellent resource for studying statistical learning methods is The Elements of Statistical Learning by Hastie, Tibshirani and Friedman. I can't recommend that book highly enough.

Lastly, your second idea (average over the last n seconds) attempts to measure the instantaneous speed. A more robust technique for this might be to use the Kalman filter, whose purpose is exactly this:

Its purpose is to use measurements observed over time, containing noise (random variations) and other inaccuracies, and produce values that tend to be closer to the true values of the measurements and their associated calculated values.

The principal advantage of using the Kalman filter rather than a fixed n-second sliding window is that it's adaptive: it will automatically use a longer averaging window when measurements jump around a lot than when they're stable.
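
For illustration, a minimal one-dimensional Kalman filter that treats the true copy speed as a slowly drifting random walk; the process and measurement variances here are assumptions you would tune for your workload:

    class SpeedKalman:
        """Scalar Kalman filter tracking copy speed as a random walk."""

        def __init__(self, process_var=1.0, measurement_var=100.0):
            self.x = None             # estimated speed
            self.p = 1e6              # estimate variance (large: unknown at start)
            self.q = process_var      # how quickly the true speed is assumed to drift
            self.r = measurement_var  # how noisy individual measurements are

        def update(self, measured_speed):
            if self.x is None:
                self.x = measured_speed
                return self.x
            self.p += self.q                # predict: only uncertainty grows
            k = self.p / (self.p + self.r)  # Kalman gain
            self.x += k * (measured_speed - self.x)
            self.p *= 1.0 - k
            return self.x

Feed it one speed measurement per tick (e.g. bytes copied in the last second) and use the filtered speed in place of a fixed-window average.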


Imho, bad implementations of ETC are wildly overused, which at least gives us a good laugh. Sometimes it might be better to display facts instead of estimations, like:

  • 5 of 10 files have been copied
  • 10 of 200 MB have been copied

Or display facts and an estimation, and make clear that it is only an estimation. But I would not display only an estimation.

Every user knows that ETCs are often completely meaningless, which makes it hard to distinguish meaningful ETCs from meaningless ones, especially for inexperienced users.


I have implemented two different solutions to address this problem:

  1. The ETC for the current transfer at start time is based on a historic speed value. This value is refined after each transfer. During the transfer I compute a weighted average between the historic data and data from the current transfer, so that the closer to the end you are the more weight is given to actual data from the transfer.

  2. Instead of showing a single ETC, show a range of time. The idea is to compute the ETC from the last 'n' seconds or minutes (like your second idea). I keep track of the best- and worst-case averages and compute a range of possible ETCs; a sketch follows below. This is kind of confusing to show in a GUI, but okay in a command-line app.
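
A sketch of the second idea under illustrative assumptions: track the best and worst window speeds seen so far and turn the remaining size into a range of ETCs:

    import time

    class RangeETC:
        """Track best/worst observed window speeds and report an ETC range."""

        def __init__(self):
            self.best_speed = None
            self.worst_speed = None

        def update(self, window_speed):
            if window_speed <= 0:
                return  # ignore stalls so the worst case stays finite
            if self.best_speed is None or window_speed > self.best_speed:
                self.best_speed = window_speed
            if self.worst_speed is None or window_speed < self.worst_speed:
                self.worst_speed = window_speed

        def etc_range(self, total_size, size_done):
            if self.best_speed is None:
                return None
            now = time.time()
            remaining = total_size - size_done
            earliest = now + remaining / self.best_speed
            latest = now + remaining / self.worst_speed
            return earliest, latest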


There are two things to consider here:

  • the exact estimation
  • how to present it to the user

1. On estimation

Apart from statistical approaches, one simple way to get a good estimate of the current speed, while smoothing out noise and spikes, is to take a weighted approach.

You already experimented with a sliding window; the idea here is to take a fairly large window but, instead of a plain average, give more weight to the most recent measurements, since they are more indicative of the current trend (a bit like a derivative).

Example: Suppose you have 10 previous windows (most recent x0, least recent x9), then you could compute the speed:

Speed = (10 * x0 + 9 * x1 + 8 * x2 + ... + 1 * x9) / (55 * window-time)

where 55 = 10 + 9 + ... + 1 is the sum of the weights and each xi is the amount copied during its window.

Once you have a good assessment of the likely speed, you are close to getting a good estimated time.
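
In code, that weighted estimate might look like this sketch (x[0] being the most recent window, as above):

    def weighted_speed(x, window_time):
        """x[0] is the amount copied in the newest window, x[-1] in the oldest."""
        n = len(x)
        weights = range(n, 0, -1)        # n for x[0] down to 1 for x[-1]
        weighted_sum = sum(w * amount for w, amount in zip(weights, x))
        total_weight = n * (n + 1) // 2  # 10 + 9 + ... + 1 = 55 when n = 10
        return weighted_sum / (total_weight * window_time)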

2. On presentation

The main thing to remember here is that you want a nice user experience, not scientific accuracy.

Studies have shown that users react very badly to slow-downs and very positively to speed-ups. Therefore, a good progress bar / estimated time should be conservative at first, reserving time for a potential slow-down.

A simple way to get that is to have a factor that is a percentage of the completion, that you use to tweak the estimated remaining time. For example:

real-completion = 0.4
presented-completion = real-completion * factor(real-completion)

Where factor maps [0..1] to [0..1], with factor(x) <= x and factor(1) = 1. For example, the cubic factor(x) = x^3 produces a nice speed-up toward the completion time; other convex shapes work too, e.g. a normalized exponential such as (e^x - 1) / (e - 1).
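
A minimal sketch of that presentation tweak, using the cubic as the factor exactly as described above (how conservative this feels in practice is something you would tune):

    def factor(x):
        """Cubic: maps [0, 1] to [0, 1], factor(x) <= x, factor(1) = 1."""
        return x ** 3

    def presented_completion(real_completion):
        # Understate progress early; the bar then "speeds up" and lands at 1.0.
        return real_completion * factor(real_completion)

    print(presented_completion(0.4))  # ~0.026: very conservative early on
    print(presented_completion(1.0))  # 1.0: truthful at completion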
