Open Bug 1887609 Opened 4 months ago Updated 15 days ago

Investigate Windows disk IO hangs

Tracking

()

Status:

NEW

People

(Reporter: jstutte, Unassigned)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

Attachments

(2 files)

WIP: Bug 1887609 - Have IsDirectoryWithTimeout for canary checks before doing sync IO. r?#xpcom-reviewers 3 months ago Jens Stutte [:jstutte] 48 bytes, text/x-phabricator-request		Details \| Review
WIP: Bug 1887609 - Check if the base directory is accessible in OpenClientDirectory and OpenStorageDirectory. r?#dom-storage-reviewers 3 months ago Jens Stutte [:jstutte] 48 bytes, text/x-phabricator-request		Details \| Review

Jens Stutte [:jstutte]

Reporter

Description

•

4 months ago

It seems from several QM hangs that our synchronous disk IO will not timeout on Windows before our own hang timeouts trigger. The same would probably apply for any other disk IO happening in the same configuration, but QM might just be the first component in the shutdown sequence that does this.

From a short research it seems that at least for remote disks this is a known problem. There might be some situation related to general Windows shutdown or similar where this happens quite often during our shutdown.

Before we actually implement this (which would be an XPCOM bug) we might want to first understand the situation where this happens, that is if it is really related to remote disks only, using something like PathIsNetworkPathA in our (QM shutdown?) annotations.

Jens Stutte [:jstutte]

Reporter

Updated

•

4 months ago

Updated

•

4 months ago

Comment 1

•

4 months ago

Note that a possible solution might need asynchronous IO at least for some canary function we can use to check a path's availability before using it.

OS: Unspecified → Windows

Jens Stutte [:jstutte]

Reporter

Updated

•

4 months ago

Summary: CHeck if it is worth having async disk IO with timeout for GetFileAttributesW → Check if it is worth having async disk IO with timeout for GetFileAttributesW

Jens Stutte [:jstutte]

Reporter

Comment 2

•

3 months ago

Attached file WIP: Bug 1887609 - Have IsDirectoryWithTimeout for canary checks before doing sync IO. r?#xpcom-reviewers — Details

Jens Stutte [:jstutte]

Reporter

Comment 3

•

3 months ago

So having a variant of IsDirectory with timeout seems relatively easy to do, see WIP patch. Now the question would be, where to insert such a canary check. My naive idea for quota manager would be to have some check during a DirectoryLock acquisition. That would ensure, that at least at the beginning of any operation we have a valid directory. But from a cursory look there seems to be quite some level of indirection that made it hard for me to spot such a good place (we need to have the effective directory path).

If we find such a good place, we could first add just a normal IsDirectory there (without timeout). If my theory is right, most of the hangs caused by blocking IO should move there then and it would be enough to have a timeout there to detect it. If I am wrong, we will probably continue to see hangs on random IO functions, but the patterns of call paths I looked at so far seem to be stable.

Any suggestions?

Flags: needinfo?(jvarga)

Jan Varga [:janv]

Comment 4

•

3 months ago

DirectoryLock isn't a good place for multiple reasons. First of all, it's a higher level object which doesn't do any IO, it's a synchronization primitive for various higher level operations like accessing a client directory and clearing entire origins. Another thing is that a DirectoryLock can cover multiple directories, for example it can recursively cover entire <profile>/storage.

I guess the idea is to have some kind of initial check if entire storage which we want to use is "usable", that is, IO won't cause long timeouts.
Or maybe do the check as an initial step of each higher level operation which consists of acquiring a directory lock ?

Flags: needinfo?(jvarga)

Jens Stutte [:jstutte]

Reporter

Comment 5

•

3 months ago

(In reply to Jan Varga [:janv] from comment #4)

DirectoryLock isn't a good place for multiple reasons.

First of all, it's a higher level object which doesn't do any IO, it's a synchronization primitive for various higher level operations like accessing a client directory and clearing entire origins.

That exactly made me think it is used at the right level of abstraction. That it does no IO currently does not mean it is not "used in the right moment" (which I am not sure of, of course).

Another thing is that a DirectoryLock can cover multiple directories, for example it can recursively cover entire <profile>/storage.

That seems wanted? Our use case is to just check some root/origin directory if the device it lives on is responding.

I guess the idea is to have some kind of initial check if entire storage which we want to use is "usable", that is, IO won't cause long timeouts.
Or maybe do the check as an initial step of each higher level operation which consists of acquiring a directory lock ?

Well, basically both. I assumed that acquiring a directory lock is such an initial step that signals upcoming IO and that it gives us also a good way of rejecting the promise chain if it fails. There might be other, equally common operations we could tie this to. I'd want to avoid to add single invocations for each specialized operation we might think could block, the ones we see in our hangs might just be the first to happen and moving forward we'll find more.

Jan Varga [:janv]

Comment 6

•

3 months ago

•

Edited

Yes, the moment is right, but a DirectoryLock is just a building block for a real operation which consists of acquiring a directory lock, doing IO on a special thread and finally sending results. So it's better to avoid putting any IO directly to DirectoryLock.
A single directory lock object which covers entire <profile>/storage (such a lock is used for example by IDB idle daily maintenance) serves as a protection for not allowing clearing of particular origins while the maintenance is still ongoing. The maintenance itself needs to scan repositories which contain origin directories. So given the maintenance example and the idea that we rely just on the moment when a directory lock is required we will call the new operation with a timeout only once for entire <profile>/storage and not before every origin directory access. However that might be ok, at least as an initial step.

And now the best part ...
All the refactoring for async storage initizalization (bug 1671932) which has been already done (and also some of the ongoing/pending refactoring/improvements) already created new high level QM methods which are suitable for adding a special IO check after a directory lock is acquired. I'm talking about:
https://searchfox.org/mozilla-central/rev/b4860b54810539f0e4ab1fc46a3246daf2428439/dom/quota/QuotaManager.h#274-300
For example we initialize storage as part of OpenClientDirectory, https://searchfox.org/mozilla-central/rev/b4860b54810539f0e4ab1fc46a3246daf2428439/dom/quota/ActorsParent.cpp#5209
We can easily add something there which will test the device against very slow IO.

Jens Stutte [:jstutte]

Reporter

Comment 7

•

3 months ago

Thanks for the hint!

Jens Stutte [:jstutte]

Reporter

Comment 8

•

3 months ago

Attached file WIP: Bug 1887609 - Check if the base directory is accessible in OpenClientDirectory and OpenStorageDirectory. r?#dom-storage-reviewers — Details

This patch alone will most likely move (shutdown) hangs occuring due to
blocking IO on Windows to happen while inside IsBaseDirectoryAccessibleOp::
DoDirectoryWork. Once this is true, we can see how to have better
timeouts.

Jens Stutte [:jstutte]

Reporter

Comment 9

•

3 months ago

We were talking a bit about this in the XPCOM peers meeting. Given that in general we want our IO to finish rather than abort it, we probably should understand the circumstances when this happens better before trying to mitigate something. And mitigating this just for quota manager would also mean that the next thing doing IO would just hang afterwards.

Attachment 9393838 [details] might be still helping to investigate this as it might confirm there being a common problem.

Together with that:

we might want to first understand the situation where this happens, that is if it is really related to remote disks only, using something like PathIsNetworkPathA in our (QM shutdown?) annotations.

This could just be a general annotation set on startup as it might give valuable information also for other crashes.

Summary: Check if it is worth having async disk IO with timeout for GetFileAttributesW → Investigate Windows disk IO hangs

Jens Stutte [:jstutte]

Reporter

Updated

•

3 months ago

Depends on: 1889359

Jens Stutte [:jstutte]

Reporter

Updated

•

3 months ago

Depends on: 1889372

Jens Stutte [:jstutte]

Reporter

Updated

•

15 days ago

Bugzilla

Investigate Windows disk IO hangs

Categories

(Core :: Storage: Quota Manager, task)

Tracking

()

People

(Reporter: jstutte, Unassigned)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

Crash Data

Security

(public)

User Story

Attachments

(2 files)

Description

Updated

Updated

Comment 1

Updated

Comment 2

Comment 3

Comment 4

Comment 5

Comment 6

Comment 7

Comment 8

Comment 9

Updated

Updated

Updated