Items tagged with: kernel


 

linux driver backports tip


If you ever have to directly use the backports repo (rather than using a driver backports tarball generated by a vendor or distro), check out the --gitdebug feature:
https://backports.wiki.kernel.org/index.php/Documentation/integration#Integrating_a_release

This option is only mentioned in the documentation for the "integration" workflow, but it is available for both driver backporting workflows.


#linux #kernel #drivers #embedded

 

Libcamera: V4L2 successor


Libcamera, a successor to V4L2, hopes to ease embedded Linux camera headaches

https://youtu.be/GIhV7tiUji0

#embedded #linux #kernel #programming

 

But is it atomic?


So a few days ago, a colleague asked “Why do we love files on disk?” and in the course of that discussion, I made a comment that, among other things, used the assumption that somebody is updating some file on some Linux system atomically. I wrote:
Let's assume we are using local files, and we do so in a managed, sane way:

- All these state files are always JSON,
- there is a JSON schema, so
  - it is clear which attributes can be there,
  - which must be there, and
  - what they mean, and
  - what changes to the data mean as well.
- **Files are updated atomically**

And immediately the question came up: “I either misunderstand you or I have a gap in the knowledge. When writes to a file became atomic? They are not in general case.”

There is a Dilbert for that:
https://dilbert.com/strip/1995-06-24

So let’s go back in time; it’s Greybeards time! We’re going to find out where the things you are working with actually come from. With sources and references.

The write(2) system call


A write(2) system call is atomic. The size or amount of data written does not matter. How come?
The system call will, before trying to write data to disk, lock the in-memory inode. That is, it will effectively lock the entire file. It then performs the file changes, and only then unlocks. That can take a long, long, long time, depending on the amount of data and the media the file is being stored on.
It means that on a single physical file in Unix, there can be only one write(2) system call active at any point in time.
One exception to this is XFS, but only when a file is opened with O_DIRECT. In this special case, XFS instead locks the byte range in a structure attached to the inode, performs the write and then unlocks. So, in XFS with O_DIRECT, any number of concurrent, non-overlapping write(2) system calls can be active.
The Posix specification requires that write(2) be atomic; it does not require that only one write per file can happen at a time.
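
To make the guarantee concrete: it covers exactly the bytes handed to one write(2) call, so a writer that never wants readers to see a torn record must emit the whole record in a single call and treat a short write as a failure. Here is a minimal sketch (mine, not from the original discussion; write_record is a hypothetical helper):

#include <errno.h>
#include <unistd.h>

/* Emit one record with a single write(2) call, so no reader ever
 * observes a half-written record. A short write means only part of
 * the record made it out, so we report it as an error. */
int write_record(int fd, const void *buf, size_t len)
{
    ssize_t n = write(fd, buf, len);    /* one syscall, one inode lock */

    if (n < 0)
        return -1;                      /* errno set by write(2) */
    if ((size_t)n != len) {
        errno = EIO;                    /* short write: record is torn */
        return -1;
    }
    return 0;
}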

That is a horrible thing!


The locking behavior of write(2) is a problem for databases that require many concurrent writes to happen.
Ingres and some other early SQL databases used to solve that problem by avoiding the filesystem entirely: they recommended placing tablespaces on raw disks. No filesystem, no files, no locks.
Oracle solved the problem by introducing the concept of tablespaces, which are data storage spaces made up of a fleet of files, e.g. one file for each GB of data storage. Tables are assigned to tablespaces, not to data files directly. Since there is one write lock per inode, concurrent writes to different files in the same tablespace can happen.
Only in 1994, when SGI published XFS, was the actual problem tackled, by splitting the lock at the kernel level for writes that bypass the buffer cache. XFS also contained many other improvements over the 1984 BSD Fast File System that made it superior for concurrent I/O, streaming I/O, very large file systems, and many other modern use-cases. BSD FFS was in turn an improvement over 1974’s original Unix Filesystem.
In Linux terms, the 1974 Unix Filesystem is mirrored by the Minix file system, the 1984 BSD FFS is roughly equivalent to ext2, and XFS was donated and ported to Linux by SGI, bringing Linux up to the tech level of 1994.
Sun ZFS and Linux Btrfs are from the mid-2000s and are a complete deviation from earlier Unix ideas. They deserve a different, much longer writeup, which would actually end with git and the blockchain.

Source Dive: Why are writes atomic?


“Posix requiring a file write to be atomic” comes from the behavior of the original Version 7 Unix and later systems. There we find the write(2) system call, which just calls the rdwr() function.
/*
 * write system call
 */
write()
{
    rdwr(FWRITE);
}

You are looking at very old K&R-style C code here, which predates even ANSI C and function prototypes, by the way.
A few lines down in rdwr(), the function calls plock() as long as we are not dealing with a device special file (here is where the Ingres “use raw devices” idea comes into play), then does the I/O, and finally calls prele().
        if((ip->i_mode&(IFCHR&IFBLK)) == 0)
            plock(ip);
        if(mode == FREAD)
            readi(ip);
        else
            writei(ip);
        if((ip->i_mode&(IFCHR&IFBLK)) == 0)
            prele(ip);

plock() is what locks the actual inode, and it is the origin of the observed behavior. The name is a misnomer: it’s not a pipe lock, it’s an inode lock.
/*
 * Lock a pipe.
 * If its already locked,
 * set the WANT bit and sleep.
 */
plock(ip)
register struct inode *ip;
{
    while(ip->i_flag&ILOCK) {
        ip->i_flag |= IWANT;
        sleep((caddr_t)ip, PINOD);
    }
    ip->i_flag |= ILOCK;
}

See the locking loop here: as long as we do not have the lock, indicate the desire to get it, then sleep until it is released. When we exit the loop (because the inode is unlocked), we take the lock ourselves.
These are simple lines of C code, not special magic macros that translate into special magic test-and-set (TAS) machine instructions. That is because this code is so old that it comes from a world of single-die, single-core, single-thread CPUs. If your code is actually running (and this is kernel code!), you are alone in the entire system. Nobody else touches these variables as long as you have the CPU.
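
For contrast, here is roughly what the same idea has to look like on today's multicore hardware, where "check the flag, then set it" must be one indivisible operation. This is a hedged sketch using C11 <stdatomic.h>, not actual kernel code:

#include <stdatomic.h>

static atomic_flag lock_flag = ATOMIC_FLAG_INIT;

void lock_acquire(void)
{
    /* atomic_flag_test_and_set compiles down to a real test-and-set or
     * compare-and-swap instruction; plain loads and stores like plock()'s
     * are no longer enough once another core can interleave with us. */
    while (atomic_flag_test_and_set_explicit(&lock_flag, memory_order_acquire))
        ;   /* spin; a real kernel would sleep or yield here, like plock() */
}

void lock_release(void)
{
    atomic_flag_clear_explicit(&lock_flag, memory_order_release);
}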
Under the lock, rdwr() calls writei(). And writei() has a do loop that uses variables from the u-area.
do { 
        bn = u.u_offset >> BSHIFT; 
        on = u.u_offset & BMASK; 
        n = min((unsigned)(BSIZE-on), u.u_count); 
        if (type!=IFBLK && type!=IFMPB) { 
            bn = bmap(ip, bn, B_WRITE); 
            if((long)bn<0) 
                return; 
            dev = ip->i_dev; 
        } 
        if(n == BSIZE)  
            bp = getblk(dev, bn); 
        else 
            bp = bread(dev, bn); 
        iomove(bp->b_un.b_addr+on, n, B_WRITE); 
        if(u.u_error != 0) 
            brelse(bp); 
        else 
            bdwrite(bp); 
        if(u.u_offset > ip->i_size && 
           (type==IFDIR || type==IFREG)) 
            ip->i_size = u.u_offset; 
        ip->i_flag |= IUPD|ICHG; 
    } while(u.u_error==0 && u.u_count!=0); 
}

The u-area of a process at that time was a common data structure that userland and the kernel used to communicate. Here it is being used to pass syscall parameters into the kernel. The write copies the data at u.u_base in userland into the current inode, at u.u_offset bytes into the file. There are u.u_count bytes to write.
We convert u.u_offset into a logical block number bn (the n-th block of the file) and an offset “on” within that block. For this we call bmap(), which turns an inode and a logical block number within the file into a physical block number on a device.
We can then bring the relevant physical block into the buffer cache using bread(), and use iomove() to modify and dirty the block. On success the buffer is released with bdwrite(), which marks it for delayed write, so it will eventually be written back to disk.
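
A quick worked example of the offset arithmetic, assuming V7's 512-byte blocks (BSIZE 512, BSHIFT 9, BMASK 0777 octal; check param.h of your tree if in doubt):

#include <stdio.h>

#define BSIZE  512      /* bytes per block */
#define BSHIFT 9        /* log2(BSIZE) */
#define BMASK  0777     /* BSIZE - 1, in octal */

int main(void)
{
    long offset = 1300;             /* u.u_offset: byte 1300 of the file */
    long bn = offset >> BSHIFT;     /* logical block 2 (bytes 1024..1535) */
    long on = offset & BMASK;       /* byte 276 within that block */

    printf("block %ld, offset %ld\n", bn, on);  /* block 2, offset 276 */
    return 0;
}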

There is an optimization here:
When the write covers a full block, we do not read the block from disk. We just allocate a buffer using getblk() and fill it. The write will overwrite the data on disk completely, so there is no old and new data to merge. Disk accesses are slow, in the 1970s even more so than today, so not reading data that you are going to obliterate completely pays off substantially.
The loop continues as long as there are no errors and there are still blocks to write.
As we return from writei(), rdwr() resumes and will eventually prele() the inode lock.

How old is this stuff?


This is of course extremely old code, original V7 Unix, almost as old as me: git blame places its age at 41 years; I was in the third grade of a German primary school when this was written.
I chose this implementation because it is very simple, and because it is also what became immortalised in the performance-destroying standard we got to know as Posix file system semantics.

Homework


You can have fun finding the matching functionality in a modern Linux kernel, with threads, multicore, capabilities, namespaces, cgroups, dynamic data structures and all.
Compare code readability and complexity. Discuss. Is this progress? Why do you think so?
You can try to get a copy of “The Design of the Unix Operating System” by Maurice J. Bach. It will take you on a guided tour through the origins of our craft and the legacy we build on. The topics discussed in this note can be found on pages 101ff, “WRITE” and “FILE AND RECORD LOCKING”.
If you are into operating systems, continue reading after Bach: “The Design and Implementation of the 4.3 BSD Operating System” builds on Bach’s work and showcases the progress and inventions that Kirk McKusick, Sam Leffler, et al. made after that.
If you are into comparative operating system design, read “Inside Windows NT” by Helen Custer after Bach and Leffler/McKusick, and try to understand the different ideas and world view behind that.

But we don’t use write(2) for atomic file updates!


Well, some of us do, but I agree that it is hard to get right: write(2) and writev(2) are very hard to handle properly in applications, as you need to write everything in a single call.
Most programs use another atomic operation in Unix, the rename(2) system call. You write file.new at your leisure: printf(), chunked write()s and all. When it is complete, you rename file.new to file. This automatically unlinks the old version of file as well.
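
A minimal sketch of that pattern (my illustration, not anybody's canonical implementation; error handling is trimmed, and the fsync() matters if you also want the update to be durable, not just atomic):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write the new content to tmp, then atomically swap it into place.
 * Readers opening path always see either the old or the new version. */
int update_file(const char *path, const char *tmp,
                const void *buf, size_t len)
{
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    close(fd);
    return rename(tmp, path);   /* the atomic step */
}

Called as update_file("file", "file.new", data, len), this is exactly the file.new-then-rename dance described above.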
This is also the recommended approach to atomicity, because unlike write(2) it is stable in the face of the dreaded Nightmare File System.
rename(2) was introduced quite early in BSD Unix because of specific race problems in V7 Unix that early BSD patched and improved.
Before BSD, we only had link(2) and unlink(2). You can use a combination of these syscalls to implement a rename-like operation, but you need more than one syscall to do that.
In Unix, at the end of a syscall, before returning to userland, the scheduler runs (Bach, chapter 8). That is, at the end of each syscall, a process can be forced to yield the CPU. This is the cause of potential race conditions when rename is not a single syscall, and that is why BSD came up with a single syscall for renaming files in the first place.
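
Here is what the pre-rename(2) emulation looks like with only link(2) and unlink(2), a sketch of mine to make the race visible (real code would also handle ENOENT and clean up properly):

#include <unistd.h>

int rename_racy(const char *from, const char *to)
{
    if (unlink(to) != 0)        /* drop the old name */
        return -1;
    /* <-- right here "to" does not exist, and the scheduler may have
     *     taken the CPU away from us at the end of unlink(2) */
    if (link(from, to) != 0)    /* give the new file the old name */
        return -1;
    return unlink(from);        /* drop the temporary name */
}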
Renaming files for atomic updates can be taken to an art form: try looking into the Maildir/ implementations as invented by qmail and implemented in Dovecot and Cyrus.

And that concludes this issue of Our Systems Legacy.

#erklaerbaer #wasmitcomputern #unix #linux #kernel

 
Hey followers, do you know an easy and universal way for inexperienced users to decode why their #linux #kernel got tainted? Dmesg might not have all the reasons anymore, journalctl might not be in use, and /proc/sys/kernel/tainted is hard to decode if the kernel is tainted for multiple reasons.

Asking for a friend stupid enough to do regression tracking for the #Linux #kernel now and then in his rare spare time. That's why he is currently writing a major update for https://www.kernel.org/doc/html/latest/admin-guide/reporting-bugs.html. Screenshot from https://www.kernel.org/doc/html/latest/admin-guide/tainted-kernels.html
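
For reference, the taint value is just a bitmask, so a small decoder is easy to sketch. The flag meanings below are taken from the tainted-kernels document linked above, but only cover the first 16 bits; check that document for the full, current list:

#include <stdio.h>

static const char *flags[] = {
    "proprietary module was loaded",                    /* bit  0, 'P' */
    "module was force loaded",                          /* bit  1, 'F' */
    "SMP kernel on CPUs not designed for SMP",          /* bit  2, 'S' */
    "module was force unloaded",                        /* bit  3, 'R' */
    "processor reported a machine check exception",     /* bit  4, 'M' */
    "bad page referenced or unexpected page flags",     /* bit  5, 'B' */
    "taint requested by userspace application",         /* bit  6, 'U' */
    "kernel died recently, i.e. there was an OOPS or BUG", /* bit 7, 'D' */
    "ACPI table overridden by user",                    /* bit  8, 'A' */
    "kernel issued a warning",                          /* bit  9, 'W' */
    "staging driver was loaded",                        /* bit 10, 'C' */
    "workaround for bug in platform firmware applied",  /* bit 11, 'I' */
    "externally-built (out-of-tree) module was loaded", /* bit 12, 'O' */
    "unsigned module was loaded",                       /* bit 13, 'E' */
    "soft lockup occurred",                             /* bit 14, 'L' */
    "kernel has been live patched",                     /* bit 15, 'K' */
};

int main(void)
{
    unsigned long taint;
    FILE *f = fopen("/proc/sys/kernel/tainted", "r");

    if (!f || fscanf(f, "%lu", &taint) != 1)
        return 1;
    fclose(f);

    for (unsigned i = 0; i < sizeof(flags) / sizeof(flags[0]); i++)
        if (taint & (1UL << i))
            printf("bit %2u: %s\n", i, flags[i]);
    return 0;
}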


 
"[…] We are trialing out a new feature that can send you a notification when the patches you send to the #LKML are applied to linux-next or to the mainline git trees. If you are interested in trying it out, here are the details: […]" #linux #kernel kernel.org/get-notificati…


 

 
#Linux #Kernel 4.19.9, 4.14.88, 4.9.145, 4.4.167, 3.18.129 are out, bringing the usual pile of fixes and small improvements along with the "you just update" advice. Among the changes in 4.19.9 are a few #Amdgpu patches that improve support for the #Radeon RX 590.


 
#Linux 4.19.8 with the new fix for the data corruption issue in blk-mq is now available in my #Kernel vanilla repositories for #Fedora 28 and 29: fedoraproject.org/wiki/Kernel_Va…
Side note TWIMC: The current kernel from updates-testing contains a fix, too, but it's the older one.

 
BTW, #Linux 4.14.87 and 4.9.144 are out, too. All those new #Longterm #kernel releases bring the usual pile of fixes and small improvements, so you must update.

Side note: If you do not know which kernel version line is the best for your machines, go and read kroah.com/log/blog/2018/…

 
#Linux #kernel 4.19.8 is out. It fixes the data corruption issue in blk-mq that made the news recently and was initially blamed on ext4.

Note: The fix applied on Wednesday was basically reverted today and replaced by a better one. For details see git.kernel.org/pub/scm/linux/…


 
The fix for the data corruption in blk-mq (blamed on #ext4 earlier) was "basically reverted" by a newer fix in #linux mainline; it takes a different angle and fixes a livelock the earlier patch created: "blk-mq: punt failed direct issue to dispatch list" git.kernel.org/torvalds/c/c61…

 
The latest DRM merge in #Linux #kernel mainline (and thus for 4.20) contains a few patches that are supposed to make Radeon RX 590 cards work better. There are also some fixes for Vega20. git.kernel.org/torvalds/c/d38…

 
#Linux 4.19.7 since earlier today is available in my #Kernel #Vanilla repositories for #Fedora: fedoraproject.org/wiki/Kernel_Va…

One of the changes fixes a PCIe problem that can hit #Amdgpu: twitter.com/kernellogger/s…

Looks like the fix for the data corruption problems will be in 4.19.8.

 
The fix for the block issue that among other things resulted in #ext4 filesystem errors is now in the #Linux mainline #kernel; it's tagged for stable and thus should soon get merged into 4.19.x:
git.kernel.org/torvalds/c/4ea…
git.kernel.org/torvalds/c/ffe…

 
#Linux #Kernel 4.19.7 and 4.14.86 are out. They contain the improved #spectre v2 mitigation that has no big negative performance impact unless you activate the strongest protection. For details about these changes, including new and improved kernel boot parameters, see the thread below.

 
The #Ext4 problem in #Linux 4.19 that has been much discussed in recent days and can corrupt data is in reality a problem in the newer block layer. A fix is on its way into 4.20-rc and should then soon flow back into the stable #Kernel 4.19 as well.

 
Looks like the (in)famous "#ext4 problems" in the #linux #kernel 4.19.x series happen due to a problem in the block layer (blk-mq to be precise). See
lore.kernel.org/linux-block/1d… for the patch that fixes it and bugzilla.kernel.org/show_bug.cgi?i… for some more details.


 
Linus complained about naming in one of the patches from the recent #Linux #kernel spectre v2 mitigation rework: '[…] The above is not STIBP. It's just "disable IB". There's nothing "ST" about it.[…]' Here is Intel's explanation: lore.kernel.org/lkml/84a1bd5c-…





 
#Linux #kernel 4.19.7 and 4.14.86 are under review now. They contain patches recently merged to mainline (see the thread below for details) that improve #spectrev2 protection without the big performance impact an earlier attempt had.
lore.kernel.org/lkml/201812041…
lore.kernel.org/lkml/201812041…

 

Kernel equivalent to DESTDIR


The equivalent variable for the install prefix is INSTALL_MOD_PATH, e.g. make modules_install INSTALL_MOD_PATH=/tmp/staging

Too bad it isn't just DESTDIR, since that would be easier for me to remember, but at least it is listed in Documentation/kbuild/modules.txt (though sometimes I just poke around in the Makefile itself until I re-discover it)

https://www.kernel.org/doc/Documentation/kbuild/modules.txt

#linux #kernel #programming

 
For those that missed it: the performance problem in the #Linux mainline #Kernel caused by improved #Spectre v2 protection was fixed over the weekend. A lot more details can be found in the thread below from my alter ego @kernellogger

 
The performance problem in the #Linux mainline #Kernel caused by improved #Spectre v2 countermeasures should be gone from the development branch, i.e. with the 4.20-rc5 expected today. Details can be found in the following thread from my alter ego @kernellogger

 
The (in)famous #kernel performance regression in #Linux mainline/4.20-rc got fixed, as Linus merged the patch series that reworks the #STIBP stuff that helps mitigate #spectre v2. It's CCed to stable and thus will show up in stable & longterm kernels, too. git.kernel.org/torvalds/c/4b7…


 
How we #spent two #weeks #hunting an #NFS #bug in the #Linux #kernel


Here's an in-depth recap of debugging a GitLab issue that culminated in a patch for the Linux kernel.
https://about.gitlab.com/2018/11/14/how-we-spent-two-weeks-hunting-an-nfs-bug/

 
Block/XFS: Support alternative mirror device retry "[…]If #XFS successfully reads a metadata buffer off a raid1 but decides that [it]is garbage, today it will shut down the entire fs without trying any of the other mirrors. This patch[…]" #linux #kernel
lore.kernel.org/linux-block/15…


 
"[…] The error message persisted for a long time, […]unless a system administrator […] by running ls in the directory itself. […] This post describes the journey of investigating the issue […] by which we tracked down the bug […]" #nfs #linux #kernel
about.gitlab.com/2018/11/14/how…


 

 

Tracing a packet journey using Linux tracepoints, perf and eBPF


This is another good eBPF example:
What I’d need is a tool which could tell me “Hey, I’ve seen your packet: It’s gone this way, on this interface, in this network namespace”.

Basically, what I’d need is an mtr for L2.

Does not exist? Let’s build one!
https://blog.yadutaf.fr/2017/07/28/tracing-a-packet-journey-using-linux-tracepoints-perf-ebpf/


#linux #kernel #tracing

 
the patchset to "Remedy the STIBP/IBPB overhead" was updated: lore.kernel.org/lkml/201811251…
"[…] Thats hopefully the final version of this. […] removes the ptrace check for the always IBPB case. That's substantial overhead for dubious value […]" #linux #kernel


 
#Linux #kernel 4.20-rc4 is out "[…] the STIBP fixes are still being discussed and fine-tuned and haven't been merged yet. And there's a few mm fixes […] Nothing that should keep people from testing the 4.20 rc's, though, so go out and test. […]" lore.kernel.org/lkml/CAHk-=wgd…


 
#Linux #kernel 4.19.4 is now available in my "kernel vanilla repositories" for #Fedora 27, 28 and 29.
fedoraproject.org/wiki/Kernel_Va…

 

Gaining eBPF vision: A new way to trace Linux filesystem disk requests


Nice practical eBPF example pointed out by @Thorsten 'the Linux kernel logger' Leemhuis(6/6)

https://www.collabora.com/news-and-blog/blog/2018/11/21/gaining-ebpf-vision-tracing-linux-filesystem-disk-requests/

#linux #kernel #tracing

 
A bunch of new Stable and Longterm #Linux #Kernel got released today. With those 4.18 is EOL now: "Note, this is the LAST 4.18.y kernel release. 4.18 is now end-of-life, please move to 4.19.y at this point in time." lore.kernel.org/lkml/201811211…


 
#DTrace based on #eBPF and other #Linux #kernel tracing facilities
lore.kernel.org/lkml/201811160…
"[…]We are envisioning a future for DTrace where we can leverage its strengths in the areas where it matters most […]while building on the existing tracing infrastructure in Linux. […]"


 
#Linux #kernel 4.20-rc2 is out lore.kernel.org/lkml/CAHk-=wiW… "[…] Fairly normal week […] about half drivers, with the rest being the usual architecture updates, tooling, networking, and some filesystem updates. […] Go forth and test, […]"


 
Pull Request tracker bot korg.wiki.kernel.org/userdoc/prtrac… "[…] sends automatic notifications when a pull request submitted to a mailing list has been merged to one of the #Linux #kernel repositories. […]at this time, it is only used to track mainline […] #LKML […]" (Thx @monsieuricon!)