Open Bug 1887609 Opened 4 months ago Updated 15 days ago

Investigate Windows disk IO hangs

Categories

(Core :: Storage: Quota Manager, task)

Unspecified
Windows
task

Tracking

()

People

(Reporter: jstutte, Unassigned)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

Attachments

(2 files)

It seems from several QM hangs that our synchronous disk IO will not timeout on Windows before our own hang timeouts trigger. The same would probably apply for any other disk IO happening in the same configuration, but QM might just be the first component in the shutdown sequence that does this.

From a short research it seems that at least for remote disks this is a known problem. There might be some situation related to general Windows shutdown or similar where this happens quite often during our shutdown.

Before we actually implement this (which would be an XPCOM bug) we might want to first understand the situation where this happens, that is if it is really related to remote disks only, using something like PathIsNetworkPathA in our (QM shutdown?) annotations.

See Also: → 1727526
See Also: → 1760302

Note that a possible solution might need asynchronous IO at least for some canary function we can use to check a path's availability before using it.

OS: Unspecified → Windows
Summary: CHeck if it is worth having async disk IO with timeout for GetFileAttributesW → Check if it is worth having async disk IO with timeout for GetFileAttributesW

So having a variant of IsDirectory with timeout seems relatively easy to do, see WIP patch. Now the question would be, where to insert such a canary check. My naive idea for quota manager would be to have some check during a DirectoryLock acquisition. That would ensure, that at least at the beginning of any operation we have a valid directory. But from a cursory look there seems to be quite some level of indirection that made it hard for me to spot such a good place (we need to have the effective directory path).

If we find such a good place, we could first add just a normal IsDirectory there (without timeout). If my theory is right, most of the hangs caused by blocking IO should move there then and it would be enough to have a timeout there to detect it. If I am wrong, we will probably continue to see hangs on random IO functions, but the patterns of call paths I looked at so far seem to be stable.

Any suggestions?

Flags: needinfo?(jvarga)

DirectoryLock isn't a good place for multiple reasons. First of all, it's a higher level object which doesn't do any IO, it's a synchronization primitive for various higher level operations like accessing a client directory and clearing entire origins. Another thing is that a DirectoryLock can cover multiple directories, for example it can recursively cover entire <profile>/storage.

I guess the idea is to have some kind of initial check if entire storage which we want to use is "usable", that is, IO won't cause long timeouts.
Or maybe do the check as an initial step of each higher level operation which consists of acquiring a directory lock ?

Flags: needinfo?(jvarga)

(In reply to Jan Varga [:janv] from comment #4)

DirectoryLock isn't a good place for multiple reasons.

First of all, it's a higher level object which doesn't do any IO, it's a synchronization primitive for various higher level operations like accessing a client directory and clearing entire origins.

That exactly made me think it is used at the right level of abstraction. That it does no IO currently does not mean it is not "used in the right moment" (which I am not sure of, of course).

Another thing is that a DirectoryLock can cover multiple directories, for example it can recursively cover entire <profile>/storage.

That seems wanted? Our use case is to just check some root/origin directory if the device it lives on is responding.

I guess the idea is to have some kind of initial check if entire storage which we want to use is "usable", that is, IO won't cause long timeouts.
Or maybe do the check as an initial step of each higher level operation which consists of acquiring a directory lock ?

Well, basically both. I assumed that acquiring a directory lock is such an initial step that signals upcoming IO and that it gives us also a good way of rejecting the promise chain if it fails. There might be other, equally common operations we could tie this to. I'd want to avoid to add single invocations for each specialized operation we might think could block, the ones we see in our hangs might just be the first to happen and moving forward we'll find more.

Yes, the moment is right, but a DirectoryLock is just a building block for a real operation which consists of acquiring a directory lock, doing IO on a special thread and finally sending results. So it's better to avoid putting any IO directly to DirectoryLock.
A single directory lock object which covers entire <profile>/storage (such a lock is used for example by IDB idle daily maintenance) serves as a protection for not allowing clearing of particular origins while the maintenance is still ongoing. The maintenance itself needs to scan repositories which contain origin directories. So given the maintenance example and the idea that we rely just on the moment when a directory lock is required we will call the new operation with a timeout only once for entire <profile>/storage and not before every origin directory access. However that might be ok, at least as an initial step.

And now the best part ...
All the refactoring for async storage initizalization (bug 1671932) which has been already done (and also some of the ongoing/pending refactoring/improvements) already created new high level QM methods which are suitable for adding a special IO check after a directory lock is acquired. I'm talking about:
https://searchfox.org/mozilla-central/rev/b4860b54810539f0e4ab1fc46a3246daf2428439/dom/quota/QuotaManager.h#274-300
For example we initialize storage as part of OpenClientDirectory, https://searchfox.org/mozilla-central/rev/b4860b54810539f0e4ab1fc46a3246daf2428439/dom/quota/ActorsParent.cpp#5209
We can easily add something there which will test the device against very slow IO.

Thanks for the hint!

This patch alone will most likely move (shutdown) hangs occuring due to
blocking IO on Windows to happen while inside IsBaseDirectoryAccessibleOp::
DoDirectoryWork. Once this is true, we can see how to have better
timeouts.

We were talking a bit about this in the XPCOM peers meeting. Given that in general we want our IO to finish rather than abort it, we probably should understand the circumstances when this happens better before trying to mitigate something. And mitigating this just for quota manager would also mean that the next thing doing IO would just hang afterwards.

Attachment 9393838 [details] might be still helping to investigate this as it might confirm there being a common problem.

Together with that:

we might want to first understand the situation where this happens, that is if it is really related to remote disks only, using something like PathIsNetworkPathA in our (QM shutdown?) annotations.

This could just be a general annotation set on startup as it might give valuable information also for other crashes.

Summary: Check if it is worth having async disk IO with timeout for GetFileAttributesW → Investigate Windows disk IO hangs
Depends on: 1889359
Depends on: 1889372
See Also: → 1182927
You need to log in before you can comment on or make changes to this bug.