2017-08-29 LFS GSoC summary

I started with the goal of resolving the LFS bugs and modernizing the codebase,
which was using deprecated APIs like tsleep.

Significant bug fixes were:
- Found and helped diagnose a buffer overflow under unusual circumstances
(heavy use under COMPAT_LINUX):

fix buffer overflow/KASSERT when cookies are supplied
cvs rdiff -u -r1.49 -r1.50 src/sys/ufs/lfs/ulfs_vnops.c

- Helped discover a lock order reversal between lfs_writer_enter and
lfs_seglock, the primary cause of deadlocks in LFS. This is detailed in the
following bug report:
http://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=52301

Unfortunately the fix, while it works, introduced an extreme performance
regression. I had to revert it and look for the cause of the regression.

I suspect this is because we wait for disk operations to fully complete while
vnodes are MARK_VNODE'd, so letting go of all marked vnodes can take a long
time.

Because the filesystem is not marked MPSAFE, it implicitly takes KERNEL_LOCK,
which makes things seem worse.

- Discovered insufficient locking when manipulating on-disk inode data, causing
a data race. This triggers assertions and also shows up as fsck
inconsistencies, as the number of blocks doesn't match.

This appears as KASSERT failures about truncating to zero while still having
more than zero effective blocks at the end of lfs_truncate, e.g. when running
firefox.

This is because lfs_writevnodes and other iterators over vnodes do not hold
vn_lock, which is necessary while manipulating any lfs_dino_*.

While attempting to fix the above, I worked out a justified theory for why
the sane "locking" order, one that doesn't require restarting or juggling
locks, is vn_lock -> lfs_writer_enter -> lfs_seglock.

This is detailed more in: http://coypu.sdf.org/2017-08-21-LFS
It's also documented in a bug report:
http://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=52510
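
To illustrate what sticking to one order buys, here is a minimal userland
sketch using pthreads rather than kernel code. It assumes the three steps can
be reduced to plain mutexes, which they are not in LFS; vnode_lock,
writer_lock, seg_lock, writer_thread and run_ordered_writers are made-up names
for illustration only. Every thread takes the locks in the same order and
releases them in reverse, so no thread can hold a later lock while waiting
for an earlier one.

```c
#include <pthread.h>

/*
 * Userland stand-ins, reduced to plain mutexes (an assumption).
 * Only the acquisition order is being illustrated.
 */
static pthread_mutex_t vnode_lock  = PTHREAD_MUTEX_INITIALIZER; /* vn_lock */
static pthread_mutex_t writer_lock = PTHREAD_MUTEX_INITIALIZER; /* lfs_writer_enter */
static pthread_mutex_t seg_lock    = PTHREAD_MUTEX_INITIALIZER; /* lfs_seglock */

static int blocks_written;	/* protected by seg_lock */

#define ROUNDS		1000
#define MAX_THREADS	16

static void *
writer_thread(void *arg)
{
	(void)arg;
	for (int i = 0; i < ROUNDS; i++) {
		/* Always the same order: vnode -> writer -> segment. */
		pthread_mutex_lock(&vnode_lock);
		pthread_mutex_lock(&writer_lock);
		pthread_mutex_lock(&seg_lock);

		blocks_written++;

		/* Release in reverse order. */
		pthread_mutex_unlock(&seg_lock);
		pthread_mutex_unlock(&writer_lock);
		pthread_mutex_unlock(&vnode_lock);
	}
	return NULL;
}

/* Run nthreads writers to completion and return the total count. */
int
run_ordered_writers(int nthreads)
{
	pthread_t tids[MAX_THREADS];
	int started = 0;

	if (nthreads > MAX_THREADS)
		nthreads = MAX_THREADS;
	blocks_written = 0;
	for (int i = 0; i < nthreads; i++) {
		if (pthread_create(&tids[started], NULL, writer_thread, NULL) == 0)
			started++;
	}
	for (int i = 0; i < started; i++)
		pthread_join(tids[i], NULL);
	return blocks_written;
}
```

Calling run_ordered_writers(4) runs four such threads to completion. If any
one of them took seg_lock before vnode_lock instead, two threads could end up
each holding the lock the other wants.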

Unfortunately, fixing this is not immediate: the callers of these iterators
sometimes hold vn_lock already. This is a patch starting work on it:
http://coypu.sdf.org/lfs-vnlock-writevnodes2

Besides some callers still holding vn_lock for one file, we now run into
deadlocks. I *suspect* this is because we violate the lock ordering, and
grab vn_lock after grabbing seglock.

Some scheme to juggle locks when it turns out we can't grab all of them,
while still maintaining filesystem consistency, must be devised.
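
One generic pattern for this kind of juggling is try-lock with backoff. The
following is a userland pthreads sketch of that pattern only, not a proposal
for LFS: it assumes both locks are plain mutexes, it says nothing about how
to keep the filesystem consistent while the first lock is dropped, and all
names in it (seg_lock, vnode_lock, take_vnode_lock_holding_seglock, holder,
demo_contended) are hypothetical.

```c
#include <pthread.h>

static pthread_mutex_t vnode_lock = PTHREAD_MUTEX_INITIALIZER; /* vn_lock stand-in */
static pthread_mutex_t seg_lock   = PTHREAD_MUTEX_INITIALIZER; /* lfs_seglock stand-in */

static pthread_mutex_t ready_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ready_cv  = PTHREAD_COND_INITIALIZER;
static int holder_ready;

/*
 * Called with seg_lock held, i.e. in the "wrong" order.
 * Returns with both locks held. Returns 1 if seg_lock had to be dropped
 * and re-taken, in which case the caller must revalidate anything it
 * learned under seg_lock; returns 0 otherwise.
 */
int
take_vnode_lock_holding_seglock(void)
{
	if (pthread_mutex_trylock(&vnode_lock) == 0)
		return 0;

	/* Back off, then retake both locks in the canonical order. */
	pthread_mutex_unlock(&seg_lock);
	pthread_mutex_lock(&vnode_lock);
	pthread_mutex_lock(&seg_lock);
	return 1;
}

/* A thread following the canonical order: vnode_lock, then seg_lock. */
static void *
holder(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&vnode_lock);

	/* Tell the demo that vnode_lock is now taken. */
	pthread_mutex_lock(&ready_mtx);
	holder_ready = 1;
	pthread_cond_signal(&ready_cv);
	pthread_mutex_unlock(&ready_mtx);

	pthread_mutex_lock(&seg_lock);	/* blocks until the demo backs off */
	pthread_mutex_unlock(&seg_lock);
	pthread_mutex_unlock(&vnode_lock);
	return NULL;
}

/* Sets up the contended case; returns 1 if the backoff path was taken. */
int
demo_contended(void)
{
	pthread_t t;
	int dropped;

	holder_ready = 0;
	pthread_mutex_lock(&seg_lock);
	if (pthread_create(&t, NULL, holder, NULL) != 0) {
		pthread_mutex_unlock(&seg_lock);
		return -1;
	}

	/* Wait until the other thread owns vnode_lock. */
	pthread_mutex_lock(&ready_mtx);
	while (!holder_ready)
		pthread_cond_wait(&ready_cv, &ready_mtx);
	pthread_mutex_unlock(&ready_mtx);

	dropped = take_vnode_lock_holding_seglock();

	pthread_mutex_unlock(&vnode_lock);
	pthread_mutex_unlock(&seg_lock);
	pthread_join(t, NULL);
	return dropped;
}
```

In demo_contended, a blocking pthread_mutex_lock(&vnode_lock) in place of the
try-lock would deadlock: each thread would hold the lock the other needs. The
cost of the backoff is that whatever was decided under seg_lock may be stale
once it is re-taken, which is exactly the consistency problem that remains
open here.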


I've also made the following cleanup and modernization commits, as well
as adding and removing comments to clarify things:

- Renamed i_flag to i_state, as "flags" exists as well and was the cause of
  mistakes that only by coincidence did not result in very bad bugs.
- Replaced many uses of mtsleep and tsleep with condvars.
- Removed uses of splbio, which are no longer necessary.
- Wrote XXX comments about some more flaws I ran into, but didn't start
  investigating.
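
For readers unfamiliar with the tsleep-to-condvar conversion mentioned above,
here is a rough userland analogue using pthreads. It is not taken from the LFS
commits; struct iocount, io_start, io_done, io_drain and demo_drain are
invented for illustration. The comments name the NetBSD kernel counterparts
(kmutex_t, kcondvar_t, mutex_enter, cv_wait, cv_broadcast). The point of the
pattern is that the condition is re-checked in a loop with the mutex held, so
a wakeup cannot slip in between the check and the sleep.

```c
#include <pthread.h>

#define MAX_IO	16

struct iocount {
	pthread_mutex_t lock;	/* kernel: kmutex_t */
	pthread_cond_t	cv;	/* kernel: kcondvar_t */
	int		pending;/* protected by lock */
};

static struct iocount ioc = {
	.lock = PTHREAD_MUTEX_INITIALIZER,
	.cv = PTHREAD_COND_INITIALIZER,
	.pending = 0,
};

void
io_start(struct iocount *c)
{
	pthread_mutex_lock(&c->lock);		/* kernel: mutex_enter() */
	c->pending++;
	pthread_mutex_unlock(&c->lock);		/* kernel: mutex_exit() */
}

void
io_done(struct iocount *c)
{
	pthread_mutex_lock(&c->lock);
	if (--c->pending == 0)
		pthread_cond_broadcast(&c->cv);	/* kernel: cv_broadcast() */
	pthread_mutex_unlock(&c->lock);
}

/* Sleep until all started I/O has completed; returns the pending count. */
int
io_drain(struct iocount *c)
{
	int left;

	pthread_mutex_lock(&c->lock);
	while (c->pending > 0)			/* always re-check the condition */
		pthread_cond_wait(&c->cv, &c->lock);	/* kernel: cv_wait() */
	left = c->pending;
	pthread_mutex_unlock(&c->lock);
	return left;
}

static void *
completer(void *arg)
{
	io_done(arg);
	return NULL;
}

/* Start nio fake I/Os, complete each from its own thread, then drain. */
int
demo_drain(int nio)
{
	pthread_t tids[MAX_IO];
	int started = 0, left;

	if (nio > MAX_IO)
		nio = MAX_IO;
	for (int i = 0; i < nio; i++) {
		io_start(&ioc);
		if (pthread_create(&tids[started], NULL, completer, &ioc) == 0)
			started++;
		else
			io_done(&ioc);	/* undo the start if no thread was made */
	}
	left = io_drain(&ioc);
	for (int i = 0; i < started; i++)
		pthread_join(tids[i], NULL);
	return left;
}
```

With tsleep and wakeup, the sleeper and the waker only share an address to
sleep on, and the surrounding code has to ensure by other means that the
wakeup is not missed. With a condvar, the mutex passed to the wait call covers
both the check and the sleep.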

All in all, I've made many separate commits to LFS-related code in NetBSD
src during GSoC and a little before the official start. They are individually
visible at the following link:
https://v4.freshbsd.org/search?q=lfs&committer%5B%5D=maya&sort=commit_date