From 7433c593d533e58716634f7577b17d56340fff51 Mon Sep 17 00:00:00 2001 From: Dave Chinner Date: Wed, 28 Jun 2023 11:04:31 -0700 Subject: [PATCH 1/6] xfs: don't reverse order of items in bulk AIL insertion ANBZ: #24564 commit 939bd50dfbe7c17d958a62208e8b584442759bf5 upstream. XFS has strict metadata ordering requirements. One of the things it does is maintain the commit order of items from transaction commit through the CIL and into the AIL. That is, if a transaction logs item A before item B in a modification, then they will be inserted into the CIL in the order {A, B}. These items are then written into the iclog during checkpointing in the order {A, B}. When the checkpoint commits, they are supposed to be inserted into the AIL in the order {A, B}, and when they are pushed from the AIL, they are pushed in the order {A, B}. If we crash, log recovery then replays the two items from the checkpoint in the order {A, B}, resulting in the objects the items apply to being queued for writeback at the end of the checkpoint in the order {A, B}. This means recovery behaves the same way as the runtime code. In places, we have subtle dependencies on this ordering being maintained. One of this place is performing intent recovery from the log. It assumes that recovering an intent will result in a non-intent object being the first thing that is modified in the recovery transaction, and so when the transaction commits and the journal flushes, the first object inserted into the AIL beyond the intent recovery range will be a non-intent item. It uses the transistion from intent items to non-intent items to stop the recovery pass. A recent log recovery issue indicated that an intent was appearing as the first item in the AIL beyond the recovery range, hence breaking the end of recovery detection that exists. Tracing indicated insertion of the items into the AIL was apparently occurring in the right order (the intent was last in the commit item list), but the intent was appearing first in the AIL. IOWs, the order of items in the AIL was {D,C,B,A}, not {A,B,C,D}, and bulk insertion was reversing the order of the items in the batch of items being inserted. Lucky for us, all the items fed to bulk insertion have the same LSN, so the reversal of order does not affect the log head/tail tracking that is based on the contents of the AIL. It only impacts on code that has implicit, subtle dependencies on object order, and AFAICT only the intent recovery loop is impacted by it. Make sure bulk AIL insertion does not reorder items incorrectly. Fixes: 0e57f6a36f9b ("xfs: bulk AIL insertion during transaction commit") Signed-off-by: Dave Chinner Reviewed-by: Christoph Hellwig Reviewed-by: Darrick J. Wong Signed-off-by: Darrick J. Wong Reviewed-by: Chandan Babu R Signed-off-by: Gao Xiang --- fs/xfs/xfs_trans_ail.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c index dbb69b4bf3ed..e34458f69fb7 100644 --- a/fs/xfs/xfs_trans_ail.c +++ b/fs/xfs/xfs_trans_ail.c @@ -806,7 +806,7 @@ xfs_trans_ail_update_bulk( trace_xfs_ail_insert(lip, 0, lsn); } lip->li_lsn = lsn; - list_add(&lip->li_ail, &tmp); + list_add_tail(&lip->li_ail, &tmp); } if (!list_empty(&tmp)) -- Gitee From f4484c1df07735dd77bb0226988417ffb8219c03 Mon Sep 17 00:00:00 2001 From: Chandan Babu R Date: Fri, 2 Apr 2021 15:07:33 -0700 Subject: [PATCH 2/6] xfs: Use struct xfs_bmdr_block instead of struct xfs_btree_block to calculate root node size ANBZ: #24564 commit b6785e279d53ca5c4fa6be1146e85000870d73ef upstream. The incore data fork of an inode stores the bmap btree root node as 'struct xfs_btree_block'. However, the ondisk version of the inode stores the bmap btree root node as a 'struct xfs_bmdr_block'. xfs_bmap_add_attrfork_btree() checks if the btree root node fits inside the data fork of the inode. However, it incorrectly uses 'struct xfs_btree_block' to compute the size of the bmap btree root node. Since size of 'struct xfs_btree_block' is larger than that of 'struct xfs_bmdr_block', xfs_bmap_add_attrfork_btree() could end up unnecessarily demoting the current root node as the child of newly allocated root node. This commit optimizes space usage by modifying xfs_bmap_add_attrfork_btree() to use 'struct xfs_bmdr_block' to check if the bmap btree root node fits inside the data fork of the inode. Reviewed-by: Christoph Hellwig Signed-off-by: Chandan Babu R Reviewed-by: Darrick J. Wong Signed-off-by: Darrick J. Wong Signed-off-by: Gao Xiang --- fs/xfs/libxfs/xfs_bmap.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c index 7ace8755d353..f2dc7ad23bc0 100644 --- a/fs/xfs/libxfs/xfs_bmap.c +++ b/fs/xfs/libxfs/xfs_bmap.c @@ -881,13 +881,15 @@ xfs_bmap_add_attrfork_btree( xfs_inode_t *ip, /* incore inode pointer */ int *flags) /* inode logging flags */ { + struct xfs_btree_block *block = ip->i_df.if_broot; xfs_btree_cur_t *cur; /* btree cursor */ int error; /* error return value */ xfs_mount_t *mp; /* file system mount struct */ int stat; /* newroot status */ mp = ip->i_mount; - if (ip->i_df.if_broot_bytes <= XFS_IFORK_DSIZE(ip)) + + if (XFS_BMAP_BMDR_SPACE(block) <= XFS_IFORK_DSIZE(ip)) *flags |= XFS_ILOG_DBROOT; else { cur = xfs_bmbt_init_cursor(mp, tp, ip, XFS_DATA_FORK); -- Gitee From 22d335e9bdb268403c99bf9adb58856c7b318dbe Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Thu, 16 Sep 2021 12:18:47 -0700 Subject: [PATCH 3/6] xfs: remove xfs_btree_cur_t typedef ANBZ: #24564 commit ae127f087dc22b6e37edc870079abf0721a6aed0 upstream. Get rid of this old typedef before we start changing other things. Signed-off-by: Darrick J. Wong Reviewed-by: Chandan Babu R Reviewed-by: Christoph Hellwig --- fs/xfs/libxfs/xfs_alloc.c | 12 ++++++------ fs/xfs/libxfs/xfs_bmap.c | 12 ++++++------ fs/xfs/libxfs/xfs_btree.c | 12 ++++++------ fs/xfs/libxfs/xfs_btree.h | 12 ++++++------ 4 files changed, 24 insertions(+), 24 deletions(-) diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c index 8a10d1755673..073892c40899 100644 --- a/fs/xfs/libxfs/xfs_alloc.c +++ b/fs/xfs/libxfs/xfs_alloc.c @@ -426,8 +426,8 @@ xfs_alloc_fix_len( */ STATIC int /* error code */ xfs_alloc_fixup_trees( - xfs_btree_cur_t *cnt_cur, /* cursor for by-size btree */ - xfs_btree_cur_t *bno_cur, /* cursor for by-block btree */ + struct xfs_btree_cur *cnt_cur, /* cursor for by-size btree */ + struct xfs_btree_cur *bno_cur, /* cursor for by-block btree */ xfs_agblock_t fbno, /* starting block of free extent */ xfs_extlen_t flen, /* length of free extent */ xfs_agblock_t rbno, /* starting block of returned extent */ @@ -1214,8 +1214,8 @@ xfs_alloc_ag_vextent_exact( xfs_alloc_arg_t *args) /* allocation argument structure */ { struct xfs_agf __maybe_unused *agf = args->agbp->b_addr; - xfs_btree_cur_t *bno_cur;/* by block-number btree cursor */ - xfs_btree_cur_t *cnt_cur;/* by count btree cursor */ + struct xfs_btree_cur *bno_cur;/* by block-number btree cursor */ + struct xfs_btree_cur *cnt_cur;/* by count btree cursor */ int error; xfs_agblock_t fbno; /* start block of found extent */ xfs_extlen_t flen; /* length of found extent */ @@ -1672,8 +1672,8 @@ xfs_alloc_ag_vextent_size( xfs_alloc_arg_t *args) /* allocation argument structure */ { struct xfs_agf *agf = args->agbp->b_addr; - xfs_btree_cur_t *bno_cur; /* cursor for bno btree */ - xfs_btree_cur_t *cnt_cur; /* cursor for cnt btree */ + struct xfs_btree_cur *bno_cur; /* cursor for bno btree */ + struct xfs_btree_cur *cnt_cur; /* cursor for cnt btree */ int error; /* error result */ xfs_agblock_t fbno; /* start of found freespace */ xfs_extlen_t flen; /* length of found freespace */ diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c index f2dc7ad23bc0..23f17d9059d5 100644 --- a/fs/xfs/libxfs/xfs_bmap.c +++ b/fs/xfs/libxfs/xfs_bmap.c @@ -313,7 +313,7 @@ xfs_check_block( */ STATIC void xfs_bmap_check_leaf_extents( - xfs_btree_cur_t *cur, /* btree cursor or null */ + struct xfs_btree_cur *cur, /* btree cursor or null */ xfs_inode_t *ip, /* incore inode pointer */ int whichfork) /* data or attr fork */ { @@ -882,7 +882,7 @@ xfs_bmap_add_attrfork_btree( int *flags) /* inode logging flags */ { struct xfs_btree_block *block = ip->i_df.if_broot; - xfs_btree_cur_t *cur; /* btree cursor */ + struct xfs_btree_cur *cur; /* btree cursor */ int error; /* error return value */ xfs_mount_t *mp; /* file system mount struct */ int stat; /* newroot status */ @@ -925,7 +925,7 @@ xfs_bmap_add_attrfork_extents( struct xfs_inode *ip, /* incore inode pointer */ int *flags) /* inode logging flags */ { - xfs_btree_cur_t *cur; /* bmap btree cursor */ + struct xfs_btree_cur *cur; /* bmap btree cursor */ int error; /* error return value */ if (ip->i_df.if_nextents * sizeof(struct xfs_bmbt_rec) <= @@ -1987,11 +1987,11 @@ xfs_bmap_add_extent_unwritten_real( xfs_inode_t *ip, /* incore inode pointer */ int whichfork, struct xfs_iext_cursor *icur, - xfs_btree_cur_t **curp, /* if *curp is null, not a btree */ + struct xfs_btree_cur **curp, /* if *curp is null, not a btree */ xfs_bmbt_irec_t *new, /* new data to add to file extents */ int *logflagsp) /* inode logging flags */ { - xfs_btree_cur_t *cur; /* btree cursor */ + struct xfs_btree_cur *cur; /* btree cursor */ int error; /* error return value */ int i; /* temp state */ struct xfs_ifork *ifp; /* inode fork pointer */ @@ -5126,7 +5126,7 @@ xfs_bmap_del_extent_real( xfs_inode_t *ip, /* incore inode pointer */ xfs_trans_t *tp, /* current transaction pointer */ struct xfs_iext_cursor *icur, - xfs_btree_cur_t *cur, /* if null, not a btree */ + struct xfs_btree_cur *cur, /* if null, not a btree */ xfs_bmbt_irec_t *del, /* data to remove from extents */ int *logflagsp, /* inode logging flags */ int whichfork, /* data or attr fork */ diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c index eb911b4e1620..b30001a85792 100644 --- a/fs/xfs/libxfs/xfs_btree.c +++ b/fs/xfs/libxfs/xfs_btree.c @@ -391,14 +391,14 @@ xfs_btree_del_cursor( */ int /* error */ xfs_btree_dup_cursor( - xfs_btree_cur_t *cur, /* input cursor */ - xfs_btree_cur_t **ncur) /* output cursor */ + struct xfs_btree_cur *cur, /* input cursor */ + struct xfs_btree_cur **ncur) /* output cursor */ { xfs_buf_t *bp; /* btree block's buffer pointer */ int error; /* error return value */ int i; /* level number of btree block */ xfs_mount_t *mp; /* mount structure for filesystem */ - xfs_btree_cur_t *new; /* new cursor value */ + struct xfs_btree_cur *new; /* new cursor value */ xfs_trans_t *tp; /* transaction pointer, can be NULL */ tp = cur->bc_tp; @@ -694,7 +694,7 @@ xfs_btree_get_block( */ STATIC int /* success=1, failure=0 */ xfs_btree_firstrec( - xfs_btree_cur_t *cur, /* btree cursor */ + struct xfs_btree_cur *cur, /* btree cursor */ int level) /* level to change */ { struct xfs_btree_block *block; /* generic btree block pointer */ @@ -724,7 +724,7 @@ xfs_btree_firstrec( */ STATIC int /* success=1, failure=0 */ xfs_btree_lastrec( - xfs_btree_cur_t *cur, /* btree cursor */ + struct xfs_btree_cur *cur, /* btree cursor */ int level) /* level to change */ { struct xfs_btree_block *block; /* generic btree block pointer */ @@ -988,7 +988,7 @@ xfs_btree_readahead_ptr( */ STATIC void xfs_btree_setbuf( - xfs_btree_cur_t *cur, /* btree cursor */ + struct xfs_btree_cur *cur, /* btree cursor */ int lev, /* level in btree */ xfs_buf_t *bp) /* new buffer to set */ { diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h index 8885b00020f6..221b8a7c2f35 100644 --- a/fs/xfs/libxfs/xfs_btree.h +++ b/fs/xfs/libxfs/xfs_btree.h @@ -215,7 +215,7 @@ struct xfs_btree_cur_ino { * Btree cursor structure. * This collects all information needed by the btree code in one place. */ -typedef struct xfs_btree_cur +struct xfs_btree_cur { struct xfs_trans *bc_tp; /* transaction we're in, if any */ struct xfs_mount *bc_mp; /* file system mount struct */ @@ -235,7 +235,7 @@ typedef struct xfs_btree_cur struct xfs_btree_cur_ag bc_ag; struct xfs_btree_cur_ino bc_ino; }; -} xfs_btree_cur_t; +}; /* cursor flags */ #define XFS_BTREE_LONG_PTRS (1<<0) /* pointers are 64bits long */ @@ -301,7 +301,7 @@ xfs_btree_check_sptr( */ void xfs_btree_del_cursor( - xfs_btree_cur_t *cur, /* btree cursor */ + struct xfs_btree_cur *cur, /* btree cursor */ int error); /* del because of error */ /* @@ -310,8 +310,8 @@ xfs_btree_del_cursor( */ int /* error */ xfs_btree_dup_cursor( - xfs_btree_cur_t *cur, /* input cursor */ - xfs_btree_cur_t **ncur);/* output cursor */ + struct xfs_btree_cur *cur, /* input cursor */ + struct xfs_btree_cur **ncur);/* output cursor */ /* * Compute first and last byte offsets for the fields given. @@ -516,7 +516,7 @@ struct xfs_ifork *xfs_btree_ifork_ptr(struct xfs_btree_cur *cur); /* Does this cursor point to the last block in the given level? */ static inline bool xfs_btree_islastblock( - xfs_btree_cur_t *cur, + struct xfs_btree_cur *cur, int level) { struct xfs_btree_block *block; -- Gitee From 771c240bd4cb104f05cb785b1700123b0da87a3a Mon Sep 17 00:00:00 2001 From: Dave Chinner Date: Wed, 28 Jun 2023 11:04:32 -0700 Subject: [PATCH 4/6] xfs: pass alloc flags through to xfs_extent_busy_flush() ANBZ: #24564 commit 6a2a9d776c4ae24a797e25eed2b9f7f33f756296 upstream. To avoid blocking in xfs_extent_busy_flush() when freeing extents and the only busy extents are held by the current transaction, we need to pass the XFS_ALLOC_FLAG_FREEING flag context all the way into xfs_extent_busy_flush(). Signed-off-by: Dave Chinner Reviewed-by: Darrick J. Wong Signed-off-by: Darrick J. Wong Reviewed-by: Chandan Babu R Conflicts: fs/xfs/libxfs/xfs_alloc.c fs/xfs/libxfs/xfs_alloc.h Signed-off-by: Gao Xiang --- fs/xfs/libxfs/xfs_alloc.c | 90 +++++++++++++++++++++------------------ fs/xfs/libxfs/xfs_alloc.h | 2 +- fs/xfs/xfs_extent_busy.c | 3 +- fs/xfs/xfs_extent_busy.h | 2 +- 4 files changed, 52 insertions(+), 45 deletions(-) diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c index 073892c40899..99cb92bd6216 100644 --- a/fs/xfs/libxfs/xfs_alloc.c +++ b/fs/xfs/libxfs/xfs_alloc.c @@ -37,8 +37,8 @@ struct workqueue_struct *xfs_alloc_wq; #define XFSA_FIXUP_CNT_OK 2 STATIC int xfs_alloc_ag_vextent_exact(xfs_alloc_arg_t *); -STATIC int xfs_alloc_ag_vextent_near(xfs_alloc_arg_t *); -STATIC int xfs_alloc_ag_vextent_size(xfs_alloc_arg_t *); +STATIC int xfs_alloc_ag_vextent_near(xfs_alloc_arg_t *, uint32_t); +STATIC int xfs_alloc_ag_vextent_size(xfs_alloc_arg_t *, uint32_t); /* * Size of the AGFL. For CRC-enabled filesystes we steal a couple of slots in @@ -1141,7 +1141,8 @@ xfs_alloc_ag_vextent_small( */ STATIC int /* error */ xfs_alloc_ag_vextent( - xfs_alloc_arg_t *args) /* argument structure for allocation */ + xfs_alloc_arg_t *args, /* argument structure for allocation */ + uint32_t alloc_flags) { int error=0; @@ -1157,10 +1158,10 @@ xfs_alloc_ag_vextent( args->wasfromfl = 0; switch (args->type) { case XFS_ALLOCTYPE_THIS_AG: - error = xfs_alloc_ag_vextent_size(args); + error = xfs_alloc_ag_vextent_size(args, alloc_flags); break; case XFS_ALLOCTYPE_NEAR_BNO: - error = xfs_alloc_ag_vextent_near(args); + error = xfs_alloc_ag_vextent_near(args, alloc_flags); break; case XFS_ALLOCTYPE_THIS_BNO: error = xfs_alloc_ag_vextent_exact(args); @@ -1568,7 +1569,8 @@ xfs_alloc_ag_vextent_lastblock( */ STATIC int xfs_alloc_ag_vextent_near( - struct xfs_alloc_arg *args) + struct xfs_alloc_arg *args, + uint32_t alloc_flags) { struct xfs_alloc_cur acur = {}; int error; /* error code */ @@ -1644,7 +1646,7 @@ xfs_alloc_ag_vextent_near( if (acur.busy) { trace_xfs_alloc_near_busy(args); xfs_extent_busy_flush(args->mp, args->pag, - acur.busy_gen); + acur.busy_gen, alloc_flags); goto restart; } trace_xfs_alloc_size_neither(args); @@ -1667,21 +1669,22 @@ xfs_alloc_ag_vextent_near( * and of the form k * prod + mod unless there's nothing that large. * Return the starting a.g. block, or NULLAGBLOCK if we can't do it. */ -STATIC int /* error */ +static int xfs_alloc_ag_vextent_size( - xfs_alloc_arg_t *args) /* allocation argument structure */ + struct xfs_alloc_arg *args, + uint32_t alloc_flags) { - struct xfs_agf *agf = args->agbp->b_addr; - struct xfs_btree_cur *bno_cur; /* cursor for bno btree */ - struct xfs_btree_cur *cnt_cur; /* cursor for cnt btree */ - int error; /* error result */ - xfs_agblock_t fbno; /* start of found freespace */ - xfs_extlen_t flen; /* length of found freespace */ - int i; /* temp status variable */ - xfs_agblock_t rbno; /* returned block number */ - xfs_extlen_t rlen; /* length of returned extent */ - bool busy; - unsigned busy_gen; + struct xfs_agf *agf = args->agbp->b_addr; + struct xfs_btree_cur *bno_cur; + struct xfs_btree_cur *cnt_cur; + xfs_agblock_t fbno; /* start of found freespace */ + xfs_extlen_t flen; /* length of found freespace */ + xfs_agblock_t rbno; /* returned block number */ + xfs_extlen_t rlen; /* length of returned extent */ + bool busy; + unsigned busy_gen; + int error; + int i; restart: /* @@ -1750,8 +1753,8 @@ xfs_alloc_ag_vextent_size( xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR); trace_xfs_alloc_size_busy(args); - xfs_extent_busy_flush(args->mp, - args->pag, busy_gen); + xfs_extent_busy_flush(args->mp, args->pag, + busy_gen, alloc_flags); goto restart; } } @@ -1835,7 +1838,8 @@ xfs_alloc_ag_vextent_size( if (busy) { xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR); trace_xfs_alloc_size_busy(args); - xfs_extent_busy_flush(args->mp, args->pag, busy_gen); + xfs_extent_busy_flush(args->mp, args->pag, busy_gen, + alloc_flags); goto restart; } goto out_nominleft; @@ -2559,7 +2563,7 @@ __xfs_free_extent_later( int /* error */ xfs_alloc_fix_freelist( struct xfs_alloc_arg *args, /* allocation argument structure */ - int flags) /* XFS_ALLOC_FLAG_... */ + uint32_t alloc_flags) { struct xfs_mount *mp = args->mp; struct xfs_perag *pag = args->pag; @@ -2575,7 +2579,7 @@ xfs_alloc_fix_freelist( ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES); if (!pag->pagf_init) { - error = xfs_alloc_read_agf(mp, tp, args->agno, flags, &agbp); + error = xfs_alloc_read_agf(mp, tp, args->agno, alloc_flags, &agbp); if (error) { /* Couldn't lock the AGF so skip this AG. */ if (error == -EAGAIN) @@ -2590,13 +2594,13 @@ xfs_alloc_fix_freelist( * point */ if (pag->pagf_metadata && (args->datatype & XFS_ALLOC_USERDATA) && - (flags & XFS_ALLOC_FLAG_TRYLOCK)) { - ASSERT(!(flags & XFS_ALLOC_FLAG_FREEING)); + (alloc_flags & XFS_ALLOC_FLAG_TRYLOCK)) { + ASSERT(!(alloc_flags & XFS_ALLOC_FLAG_FREEING)); goto out_agbp_relse; } need = xfs_alloc_min_freelist(mp, pag); - if (!xfs_alloc_space_available(args, need, flags | + if (!xfs_alloc_space_available(args, need, alloc_flags | XFS_ALLOC_FLAG_CHECK)) goto out_agbp_relse; @@ -2605,7 +2609,7 @@ xfs_alloc_fix_freelist( * Can fail if we're not blocking on locks, and it's held. */ if (!agbp) { - error = xfs_alloc_read_agf(mp, tp, args->agno, flags, &agbp); + error = xfs_alloc_read_agf(mp, tp, args->agno, alloc_flags, &agbp); if (error) { /* Couldn't lock the AGF so skip this AG. */ if (error == -EAGAIN) @@ -2620,7 +2624,7 @@ xfs_alloc_fix_freelist( /* If there isn't enough total space or single-extent, reject it. */ need = xfs_alloc_min_freelist(mp, pag); - if (!xfs_alloc_space_available(args, need, flags)) + if (!xfs_alloc_space_available(args, need, alloc_flags)) goto out_agbp_relse; /* @@ -2649,11 +2653,12 @@ xfs_alloc_fix_freelist( */ memset(&targs, 0, sizeof(targs)); /* struct copy below */ - if (flags & XFS_ALLOC_FLAG_NORMAP) + if (alloc_flags & XFS_ALLOC_FLAG_NORMAP) targs.oinfo = XFS_RMAP_OINFO_SKIP_UPDATE; else targs.oinfo = XFS_RMAP_OINFO_AG; - while (!(flags & XFS_ALLOC_FLAG_NOSHRINK) && pag->pagf_flcount > need) { + while (!(alloc_flags & XFS_ALLOC_FLAG_NOSHRINK) && + pag->pagf_flcount > need) { error = xfs_alloc_get_freelist(tp, agbp, &bno, 0); if (error) goto out_agbp_relse; @@ -2682,7 +2687,7 @@ xfs_alloc_fix_freelist( targs.resv = XFS_AG_RESV_AGFL; /* Allocate as many blocks as possible at once. */ - error = xfs_alloc_ag_vextent(&targs); + error = xfs_alloc_ag_vextent(&targs, alloc_flags); if (error) goto out_agflbp_relse; @@ -2692,7 +2697,7 @@ xfs_alloc_fix_freelist( * on a completely full ag. */ if (targs.agbno == NULLAGBLOCK) { - if (flags & XFS_ALLOC_FLAG_FREEING) + if (alloc_flags & XFS_ALLOC_FLAG_FREEING) break; goto out_agflbp_relse; } @@ -3149,7 +3154,7 @@ xfs_alloc_vextent( { xfs_agblock_t agsize; /* allocation group size */ int error; - int flags; /* XFS_ALLOC_FLAG_... locking flags */ + uint32_t alloc_flags; /* XFS_ALLOC_FLAG_... locking flags */ struct xfs_mount *mp; /* mount structure pointer */ xfs_agnumber_t sagno; /* starting allocation group number */ xfs_alloctype_t type; /* input allocation type */ @@ -3202,7 +3207,7 @@ xfs_alloc_vextent( break; } args->agbno = XFS_FSB_TO_AGBNO(mp, args->fsbno); - if ((error = xfs_alloc_ag_vextent(args))) + if ((error = xfs_alloc_ag_vextent(args, 0))) goto error0; break; case XFS_ALLOCTYPE_START_BNO: @@ -3231,13 +3236,13 @@ xfs_alloc_vextent( args->agno = XFS_FSB_TO_AGNO(mp, args->fsbno); args->type = XFS_ALLOCTYPE_THIS_AG; sagno = 0; - flags = 0; + alloc_flags = 0; } else { /* * Start with the given allocation group. */ args->agno = sagno = XFS_FSB_TO_AGNO(mp, args->fsbno); - flags = XFS_ALLOC_FLAG_TRYLOCK; + alloc_flags = XFS_ALLOC_FLAG_TRYLOCK; } /* * Loop over allocation groups twice; first time with @@ -3245,7 +3250,7 @@ xfs_alloc_vextent( */ for (;;) { args->pag = xfs_perag_get(mp, args->agno); - error = xfs_alloc_fix_freelist(args, flags); + error = xfs_alloc_fix_freelist(args, alloc_flags); if (error) { trace_xfs_alloc_vextent_nofix(args); goto error0; @@ -3254,7 +3259,8 @@ xfs_alloc_vextent( * If we get a buffer back then the allocation will fly. */ if (args->agbp) { - if ((error = xfs_alloc_ag_vextent(args))) + if ((error = xfs_alloc_ag_vextent(args, + alloc_flags))) goto error0; break; } @@ -3285,13 +3291,13 @@ xfs_alloc_vextent( * or switch to non-trylock mode. */ if (args->agno == sagno) { - if (flags == 0) { + if (alloc_flags == 0) { args->agbno = NULLAGBLOCK; trace_xfs_alloc_vextent_allfailed(args); break; } - flags = 0; + alloc_flags = 0; if (type == XFS_ALLOCTYPE_START_BNO) { args->agbno = XFS_FSB_TO_AGBNO(mp, args->fsbno); diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h index 9ccec5d42c97..711cf16da35b 100644 --- a/fs/xfs/libxfs/xfs_alloc.h +++ b/fs/xfs/libxfs/xfs_alloc.h @@ -210,7 +210,7 @@ int xfs_alloc_read_agfl(struct xfs_mount *mp, struct xfs_trans *tp, xfs_agnumber_t agno, struct xfs_buf **bpp); int xfs_free_agfl_block(struct xfs_trans *, xfs_agnumber_t, xfs_agblock_t, struct xfs_buf *, struct xfs_owner_info *); -int xfs_alloc_fix_freelist(struct xfs_alloc_arg *args, int flags); +int xfs_alloc_fix_freelist(struct xfs_alloc_arg *args, uint32_t alloc_flags); int xfs_free_extent_fix_freelist(struct xfs_trans *tp, xfs_agnumber_t agno, struct xfs_buf **agbp); diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c index a4075685d9eb..8235cdcc8b1c 100644 --- a/fs/xfs/xfs_extent_busy.c +++ b/fs/xfs/xfs_extent_busy.c @@ -582,7 +582,8 @@ void xfs_extent_busy_flush( struct xfs_mount *mp, struct xfs_perag *pag, - unsigned busy_gen) + unsigned busy_gen, + uint32_t alloc_flags) { DEFINE_WAIT (wait); int error; diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h index 8aea07100092..d70d395ce642 100644 --- a/fs/xfs/xfs_extent_busy.h +++ b/fs/xfs/xfs_extent_busy.h @@ -52,7 +52,7 @@ xfs_extent_busy_trim(struct xfs_alloc_arg *args, xfs_agblock_t *bno, void xfs_extent_busy_flush(struct xfs_mount *mp, struct xfs_perag *pag, - unsigned busy_gen); + unsigned busy_gen, uint32_t alloc_flags); void xfs_extent_busy_wait_all(struct xfs_mount *mp); -- Gitee From 36d89f8a57fe1d1b6ffac7d7d0e329d45f6a2961 Mon Sep 17 00:00:00 2001 From: Dave Chinner Date: Wed, 28 Jun 2023 11:04:33 -0700 Subject: [PATCH 5/6] xfs: allow extent free intents to be retried ANBZ: #24564 commit 0853b5de42b471a92f4ff128a8757b87427d2431 upstream. Extent freeing neeeds to be able to avoid a busy extent deadlock when the transaction itself holds the only busy extents in the allocation group. This may occur if we have an EFI that contains multiple extents to be freed, and the freeing the second intent requires the space the first extent free released to expand the AGFL. If we block on the busy extent at this point, we deadlock. We hold a dirty transaction that contains a entire atomic extent free operations within it, so if we can abort the extent free operation and commit the progress that we've made, the busy extent can be resolved by a log force. Hence we can restart the aborted extent free with a new transaction and continue to make progress without risking deadlocks. To enable this, we need the EFI processing code to be able to handle an -EAGAIN error to tell it to commit the current transaction and retry again. This mechanism is already built into the defer ops processing (used bythe refcount btree modification intents), so there's relatively little handling we need to add to the EFI code to enable this. Signed-off-by: Dave Chinner Reviewed-by: Darrick J. Wong Signed-off-by: Darrick J. Wong Reviewed-by: Chandan Babu R Conflicts: fs/xfs/xfs_extfree_item.c Signed-off-by: Gao Xiang --- fs/xfs/xfs_extfree_item.c | 68 ++++++++++++++++++++++++++++++++++++++- 1 file changed, 67 insertions(+), 1 deletion(-) diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c index c8e615f7ce09..2d25aa62ab78 100644 --- a/fs/xfs/xfs_extfree_item.c +++ b/fs/xfs/xfs_extfree_item.c @@ -347,6 +347,34 @@ xfs_trans_get_efd( return efdp; } +/* + * Fill the EFD with all extents from the EFI when we need to roll the + * transaction and continue with a new EFI. + * + * This simply copies all the extents in the EFI to the EFD rather than make + * assumptions about which extents in the EFI have already been processed. We + * currently keep the xefi list in the same order as the EFI extent list, but + * that may not always be the case. Copying everything avoids leaving a landmine + * were we fail to cancel all the extents in an EFI if the xefi list is + * processed in a different order to the extents in the EFI. + */ +static void +xfs_efd_from_efi( + struct xfs_efd_log_item *efdp) +{ + struct xfs_efi_log_item *efip = efdp->efd_efip; + uint i; + + ASSERT(efip->efi_format.efi_nextents > 0); + ASSERT(efdp->efd_next_extent < efip->efi_format.efi_nextents); + + for (i = 0; i < efip->efi_format.efi_nextents; i++) { + efdp->efd_format.efd_extents[i] = + efip->efi_format.efi_extents[i]; + } + efdp->efd_next_extent = efip->efi_format.efi_nextents; +} + /* * Free an extent and log it to the EFD. Note that the transaction is marked * dirty regardless of whether the extent free succeeds or fails to support the @@ -390,6 +418,17 @@ xfs_trans_free_extent( tp->t_flags |= XFS_TRANS_DIRTY; set_bit(XFS_LI_DIRTY, &efdp->efd_item.li_flags); + /* + * If we need a new transaction to make progress, the caller will log a + * new EFI with the current contents. It will also log an EFD to cancel + * the existing EFI, and so we need to copy all the unprocessed extents + * in this EFI to the EFD so this works correctly. + */ + if (error == -EAGAIN) { + xfs_efd_from_efi(efdp); + return error; + } + next_extent = efdp->efd_next_extent; ASSERT(next_extent < efdp->efd_format.efd_nextents); extp = &(efdp->efd_format.efd_extents[next_extent]); @@ -487,6 +526,14 @@ xfs_extent_free_finish_item( xefi = container_of(item, struct xfs_extent_free_item, xefi_list); error = xfs_trans_free_extent(tp, EFD_ITEM(done), xefi); + + /* + * Don't free the XEFI if we need a new transaction to complete + * processing of it. + */ + if (error == -EAGAIN) + return error; + kmem_cache_free(xfs_extfree_item_cache, xefi); return error; } @@ -601,6 +648,7 @@ xfs_efi_item_recover( xfs_fsblock_t startblock_fsb; int i; int error = 0; + bool requeue_only = false; /* * First check the validity of the extents described by the @@ -634,7 +682,25 @@ xfs_efi_item_recover( fake.xefi_startblock = extp->ext_start; fake.xefi_blockcount = extp->ext_len; - error = xfs_trans_free_extent(tp, efdp, &fake); + if (!requeue_only) + error = xfs_trans_free_extent(tp, efdp, &fake); + + /* + * If we can't free the extent without potentially deadlocking, + * requeue the rest of the extents to a new so that they get + * run again later with a new transaction context. + */ + if (error == -EAGAIN || requeue_only) { + error = xfs_free_extent_later(tp, fake.xefi_startblock, + fake.xefi_blockcount, + &XFS_RMAP_OINFO_ANY_OWNER, + fake.xefi_agresv); + if (!error) { + requeue_only = true; + continue; + } + }; + if (error) goto abort_error; -- Gitee From 300b1d0dbef2e3ecdcd56445818c7c4f551f61b3 Mon Sep 17 00:00:00 2001 From: Dave Chinner Date: Wed, 28 Jun 2023 11:04:33 -0700 Subject: [PATCH 6/6] xfs: don't block in busy flushing when freeing extents ANBZ: #24564 commit 8ebbf262d4684e035af5e7aa2a71cab636673a9b upstream. If the current transaction holds a busy extent and we are trying to allocate a new extent to fix up the free list, we can deadlock if the AG is entirely empty except for the busy extent held by the transaction. This can occur at runtime processing an XEFI with multiple extents in this path: __schedule+0x22f at ffffffff81f75e8f schedule+0x46 at ffffffff81f76366 xfs_extent_busy_flush+0x69 at ffffffff81477d99 xfs_alloc_ag_vextent_size+0x16a at ffffffff8141711a xfs_alloc_ag_vextent+0x19b at ffffffff81417edb xfs_alloc_fix_freelist+0x22f at ffffffff8141896f xfs_free_extent_fix_freelist+0x6a at ffffffff8141939a __xfs_free_extent+0x99 at ffffffff81419499 xfs_trans_free_extent+0x3e at ffffffff814a6fee xfs_extent_free_finish_item+0x24 at ffffffff814a70d4 xfs_defer_finish_noroll+0x1f7 at ffffffff81441407 xfs_defer_finish+0x11 at ffffffff814417e1 xfs_itruncate_extents_flags+0x13d at ffffffff8148b7dd xfs_inactive_truncate+0xb9 at ffffffff8148bb89 xfs_inactive+0x227 at ffffffff8148c4f7 xfs_fs_destroy_inode+0xb8 at ffffffff81496898 destroy_inode+0x3b at ffffffff8127d2ab do_unlinkat+0x1d1 at ffffffff81270df1 do_syscall_64+0x40 at ffffffff81f6b5f0 entry_SYSCALL_64_after_hwframe+0x44 at ffffffff8200007c This can also happen in log recovery when processing an EFI with multiple extents through this path: context_switch() kernel/sched/core.c:3881 __schedule() kernel/sched/core.c:5111 schedule() kernel/sched/core.c:5186 xfs_extent_busy_flush() fs/xfs/xfs_extent_busy.c:598 xfs_alloc_ag_vextent_size() fs/xfs/libxfs/xfs_alloc.c:1641 xfs_alloc_ag_vextent() fs/xfs/libxfs/xfs_alloc.c:828 xfs_alloc_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:2362 xfs_free_extent_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:3029 __xfs_free_extent() fs/xfs/libxfs/xfs_alloc.c:3067 xfs_trans_free_extent() fs/xfs/xfs_extfree_item.c:370 xfs_efi_recover() fs/xfs/xfs_extfree_item.c:626 xlog_recover_process_efi() fs/xfs/xfs_log_recover.c:4605 xlog_recover_process_intents() fs/xfs/xfs_log_recover.c:4893 xlog_recover_finish() fs/xfs/xfs_log_recover.c:5824 xfs_log_mount_finish() fs/xfs/xfs_log.c:764 xfs_mountfs() fs/xfs/xfs_mount.c:978 xfs_fs_fill_super() fs/xfs/xfs_super.c:1908 mount_bdev() fs/super.c:1417 xfs_fs_mount() fs/xfs/xfs_super.c:1985 legacy_get_tree() fs/fs_context.c:647 vfs_get_tree() fs/super.c:1547 do_new_mount() fs/namespace.c:2843 do_mount() fs/namespace.c:3163 ksys_mount() fs/namespace.c:3372 __do_sys_mount() fs/namespace.c:3386 __se_sys_mount() fs/namespace.c:3383 __x64_sys_mount() fs/namespace.c:3383 do_syscall_64() arch/x86/entry/common.c:296 entry_SYSCALL_64() arch/x86/entry/entry_64.S:180 To avoid this deadlock, we should not block in xfs_extent_busy_flush() if we hold a busy extent in the current transaction. Now that the EFI processing code can handle requeuing a partially completed EFI, we can detect this situation in xfs_extent_busy_flush() and return -EAGAIN rather than going to sleep forever. The -EAGAIN get propagated back out to the xfs_trans_free_extent() context, where the EFD is populated and the transaction is rolled, thereby moving the busy extents into the CIL. At this point, we can retry the extent free operation again with a clean transaction. If we hit the same "all free extents are busy" situation when trying to fix up the free list, we can safely call xfs_extent_busy_flush() and wait for the busy extents to resolve and wake us. At this point, the allocation search can make progress again and we can fix up the free list. This deadlock was first reported by Chandan in mid-2021, but I couldn't make myself understood during review, and didn't have time to fix it myself. It was reported again in March 2023, and again I have found myself unable to explain the complexities of the solution needed during review. As such, I don't have hours more time to waste trying to get the fix written the way it needs to be written, so I'm just doing it myself. This patchset is largely based on Wengang Wang's last patch, but with all the unnecessary stuff removed, split up into multiple patches and cleaned up somewhat. Reported-by: Chandan Babu R Reported-by: Wengang Wang Signed-off-by: Dave Chinner Reviewed-by: Darrick J. Wong Signed-off-by: Darrick J. Wong Signed-off-by: Gao Xiang --- fs/xfs/libxfs/xfs_alloc.c | 68 ++++++++++++++++++++++++++++----------- fs/xfs/libxfs/xfs_alloc.h | 11 ++++--- fs/xfs/xfs_extent_busy.c | 33 ++++++++++++++++--- fs/xfs/xfs_extent_busy.h | 6 ++-- 4 files changed, 88 insertions(+), 30 deletions(-) diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c index 99cb92bd6216..ebb3337e3393 100644 --- a/fs/xfs/libxfs/xfs_alloc.c +++ b/fs/xfs/libxfs/xfs_alloc.c @@ -1589,6 +1589,8 @@ xfs_alloc_ag_vextent_near( if (args->agbno > args->max_agbno) args->agbno = args->max_agbno; + /* Retry once quickly if we find busy extents before blocking. */ + alloc_flags |= XFS_ALLOC_FLAG_TRYFLUSH; restart: len = 0; @@ -1644,9 +1646,20 @@ xfs_alloc_ag_vextent_near( */ if (!acur.len) { if (acur.busy) { + /* + * Our only valid extents must have been busy. Flush and + * retry the allocation again. If we get an -EAGAIN + * error, we're being told that a deadlock was avoided + * and the current transaction needs committing before + * the allocation can be retried. + */ trace_xfs_alloc_near_busy(args); - xfs_extent_busy_flush(args->mp, args->pag, - acur.busy_gen, alloc_flags); + error = xfs_extent_busy_flush(args->tp, args->pag, + acur.busy_gen, alloc_flags); + if (error) + goto out; + + alloc_flags &= ~XFS_ALLOC_FLAG_TRYFLUSH; goto restart; } trace_xfs_alloc_size_neither(args); @@ -1686,6 +1699,8 @@ xfs_alloc_ag_vextent_size( int error; int i; + /* Retry once quickly if we find busy extents before blocking. */ + alloc_flags |= XFS_ALLOC_FLAG_TRYFLUSH; restart: /* * Allocate and initialize a cursor for the by-size btree. @@ -1744,19 +1759,25 @@ xfs_alloc_ag_vextent_size( error = xfs_btree_increment(cnt_cur, 0, &i); if (error) goto error0; - if (i == 0) { - /* - * Our only valid extents must have been busy. - * Make it unbusy by forcing the log out and - * retrying. - */ - xfs_btree_del_cursor(cnt_cur, - XFS_BTREE_NOERROR); - trace_xfs_alloc_size_busy(args); - xfs_extent_busy_flush(args->mp, args->pag, - busy_gen, alloc_flags); - goto restart; - } + if (i) + continue; + + /* + * Our only valid extents must have been busy. Flush and + * retry the allocation again. If we get an -EAGAIN + * error, we're being told that a deadlock was avoided + * and the current transaction needs committing before + * the allocation can be retried. + */ + trace_xfs_alloc_size_busy(args); + error = xfs_extent_busy_flush(args->tp, args->pag, + busy_gen, alloc_flags); + if (error) + goto error0; + + alloc_flags &= ~XFS_ALLOC_FLAG_TRYFLUSH; + xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR); + goto restart; } } @@ -1836,10 +1857,21 @@ xfs_alloc_ag_vextent_size( args->len = rlen; if (rlen < args->minlen) { if (busy) { - xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR); + /* + * Our only valid extents must have been busy. Flush and + * retry the allocation again. If we get an -EAGAIN + * error, we're being told that a deadlock was avoided + * and the current transaction needs committing before + * the allocation can be retried. + */ trace_xfs_alloc_size_busy(args); - xfs_extent_busy_flush(args->mp, args->pag, busy_gen, - alloc_flags); + error = xfs_extent_busy_flush(args->tp, args->pag, + busy_gen, alloc_flags); + if (error) + goto error0; + + alloc_flags &= ~XFS_ALLOC_FLAG_TRYFLUSH; + xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR); goto restart; } goto out_nominleft; diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h index 711cf16da35b..84a8b7b9db0d 100644 --- a/fs/xfs/libxfs/xfs_alloc.h +++ b/fs/xfs/libxfs/xfs_alloc.h @@ -38,11 +38,12 @@ typedef unsigned int xfs_alloctype_t; /* * Flags for xfs_alloc_fix_freelist. */ -#define XFS_ALLOC_FLAG_TRYLOCK 0x00000001 /* use trylock for buffer locking */ -#define XFS_ALLOC_FLAG_FREEING 0x00000002 /* indicate caller is freeing extents*/ -#define XFS_ALLOC_FLAG_NORMAP 0x00000004 /* don't modify the rmapbt */ -#define XFS_ALLOC_FLAG_NOSHRINK 0x00000008 /* don't shrink the freelist */ -#define XFS_ALLOC_FLAG_CHECK 0x00000010 /* test only, don't modify args */ +#define XFS_ALLOC_FLAG_TRYLOCK (1U << 0) /* use trylock for buffer locking */ +#define XFS_ALLOC_FLAG_FREEING (1U << 1) /* indicate caller is freeing extents*/ +#define XFS_ALLOC_FLAG_NORMAP (1U << 2) /* don't modify the rmapbt */ +#define XFS_ALLOC_FLAG_NOSHRINK (1U << 3) /* don't shrink the freelist */ +#define XFS_ALLOC_FLAG_CHECK (1U << 4) /* test only, don't modify args */ +#define XFS_ALLOC_FLAG_TRYFLUSH (1U << 5) /* don't wait in busy extent flush */ /* * Argument structure for xfs_alloc routines. diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c index 8235cdcc8b1c..b19bff8d89f0 100644 --- a/fs/xfs/xfs_extent_busy.c +++ b/fs/xfs/xfs_extent_busy.c @@ -577,10 +577,21 @@ xfs_extent_busy_clear( /* * Flush out all busy extents for this AG. + * + * If the current transaction is holding busy extents, the caller may not want + * to wait for committed busy extents to resolve. If we are being told just to + * try a flush or progress has been made since we last skipped a busy extent, + * return immediately to allow the caller to try again. + * + * If we are freeing extents, we might actually be holding the only free extents + * in the transaction busy list and the log force won't resolve that situation. + * In this case, we must return -EAGAIN to avoid a deadlock by informing the + * caller it needs to commit the busy extents it holds before retrying the + * extent free operation. */ -void +int xfs_extent_busy_flush( - struct xfs_mount *mp, + struct xfs_trans *tp, struct xfs_perag *pag, unsigned busy_gen, uint32_t alloc_flags) @@ -588,10 +599,23 @@ xfs_extent_busy_flush( DEFINE_WAIT (wait); int error; - error = xfs_log_force(mp, XFS_LOG_SYNC); + error = xfs_log_force(tp->t_mountp, XFS_LOG_SYNC); if (error) - return; + return error; + + /* Avoid deadlocks on uncommitted busy extents. */ + if (!list_empty(&tp->t_busy)) { + if (alloc_flags & XFS_ALLOC_FLAG_TRYFLUSH) + return 0; + + if (busy_gen != READ_ONCE(pag->pagb_gen)) + return 0; + + if (alloc_flags & XFS_ALLOC_FLAG_FREEING) + return -EAGAIN; + } + /* Wait for committed busy extents to resolve. */ do { prepare_to_wait(&pag->pagb_wait, &wait, TASK_KILLABLE); if (busy_gen != READ_ONCE(pag->pagb_gen)) @@ -600,6 +624,7 @@ xfs_extent_busy_flush( } while (1); finish_wait(&pag->pagb_wait, &wait); + return 0; } void diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h index d70d395ce642..f145d5320519 100644 --- a/fs/xfs/xfs_extent_busy.h +++ b/fs/xfs/xfs_extent_busy.h @@ -50,9 +50,9 @@ bool xfs_extent_busy_trim(struct xfs_alloc_arg *args, xfs_agblock_t *bno, xfs_extlen_t *len, unsigned *busy_gen); -void -xfs_extent_busy_flush(struct xfs_mount *mp, struct xfs_perag *pag, - unsigned busy_gen, uint32_t alloc_flags); +int +xfs_extent_busy_flush(struct xfs_trans *tp, struct xfs_perag *pag, + unsigned busy_gen, uint32_t alloc_flags); void xfs_extent_busy_wait_all(struct xfs_mount *mp); -- Gitee