How to restore all unstaged files with git
February 8, 2024
0 comments GitHub, MacOSX, Linux
tl;dr git restore -- .
I can't believe I didn't know this! Maybe I did at one point, but have since forgotten.
You're in a Git repo, you have edited 4 files, and you run git status and see this:
❯ git status
On branch main
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
modified: four.txt
modified: one.txt
modified: three.txt
modified: two.txt
no changes added to commit (use "git add" and/or "git commit -a")
Suppose you realize: "Oh no! I didn't mean to make those changes in three.txt". You can restore that file by mentioning it by name:
❯ git restore three.txt
❯ git status
On branch main
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
modified: four.txt
modified: one.txt
modified: two.txt
no changes added to commit (use "git add" and/or "git commit -a")
Now, suppose you realize you want to restore all of those modified files. How do you restore them all without mentioning each and every one by name? Simple:
❯ git status
On branch main
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
modified: four.txt
modified: one.txt
modified: two.txt
no changes added to commit (use "git add" and/or "git commit -a")
❯ git restore -- .
❯ git status
On branch main
nothing to commit, working tree clean
The "trick" is: git restore -- .
As far as I understand, restore is the newer command for the file-restoring part of checkout. You can equally run git checkout -- . too.
How slow is Node to Brotli decompress a file compared to not having to decompress?
January 19, 2024
3 comments Node, MacOSX, Linux
tl;dr; Not very slow.
At work, we have some very large .json files that get included in a Docker image. The Node server then opens these files at runtime and displays certain data from them. To keep the Docker image from getting too large, we compress these .json files at build-time. We compress the .json files with Brotli to make a .json.br file. Then, in the Node server code, we read them in and decompress them at runtime. It looks something like this:
import fs from "fs";
import { brotliDecompressSync } from "zlib";
export function readCompressedJsonFile(xpath) {
  return JSON.parse(brotliDecompressSync(fs.readFileSync(xpath)))
}
The advantage of compressing them first, at build time (which happens in GitHub Actions), is that the Docker image becomes smaller, which helps when shipping that image to a registry and asking Azure App Service to deploy it. But I was wondering, is this a smart trade-off? In a sense, why compromise on runtime (which faces users) to save time and resources at build-time, which is mostly done away from the eyes of users? The question was: how much overhead is it to have to decompress the files after their data has been read from disk into memory?
The benchmark
The files I test with are as follows:
❯ ls -lh pageinfo*
-rw-r--r-- 1 peterbe staff 2.5M Jan 19 08:48 pageinfo-en-ja-es.json
-rw-r--r-- 1 peterbe staff 293K Jan 19 08:48 pageinfo-en-ja-es.json.br
-rw-r--r-- 1 peterbe staff 805K Jan 19 08:48 pageinfo-en.json
-rw-r--r-- 1 peterbe staff 100K Jan 19 08:48 pageinfo-en.json.br
There are 2 groups:
- Only English (en)
- 3 times larger because it has English, Japanese, and Spanish
And for each file, you can see the effect of having compressed them with Brotli.
- The smaller JSON file compresses 8x
- The larger JSON file compresses 9x
Here's the benchmark code:
import fs from "fs";
import { brotliDecompressSync } from "zlib";
import { Bench } from "tinybench";

const JSON_FILE = "pageinfo-en.json";
const BROTLI_JSON_FILE = "pageinfo-en.json.br";
const LARGE_JSON_FILE = "pageinfo-en-ja-es.json";
const BROTLI_LARGE_JSON_FILE = "pageinfo-en-ja-es.json.br";

function f1() {
  const data = fs.readFileSync(JSON_FILE, "utf8");
  return Object.keys(JSON.parse(data)).length;
}

function f2() {
  const data = brotliDecompressSync(fs.readFileSync(BROTLI_JSON_FILE));
  return Object.keys(JSON.parse(data)).length;
}

function f3() {
  const data = fs.readFileSync(LARGE_JSON_FILE, "utf8");
  return Object.keys(JSON.parse(data)).length;
}

function f4() {
  const data = brotliDecompressSync(fs.readFileSync(BROTLI_LARGE_JSON_FILE));
  return Object.keys(JSON.parse(data)).length;
}

console.assert(f1() === 2633);
console.assert(f2() === 2633);
console.assert(f3() === 7767);
console.assert(f4() === 7767);

const bench = new Bench({ time: 100 });
bench.add("f1", f1).add("f2", f2).add("f3", f3).add("f4", f4);
await bench.warmup(); // make results more reliable, ref: https://github.com/tinylibs/tinybench/pull/50
await bench.run();
console.table(bench.table());
Here's the output from tinybench:
┌─────────┬───────────┬─────────┬────────────────────┬──────────┬─────────┐
│ (index) │ Task Name │ ops/sec │ Average Time (ns)  │ Margin   │ Samples │
├─────────┼───────────┼─────────┼────────────────────┼──────────┼─────────┤
│ 0       │ 'f1'      │ '179'   │ 5563384.55941942   │ '±6.23%' │ 18      │
│ 1       │ 'f2'      │ '150'   │ 6627033.621072769  │ '±7.56%' │ 16      │
│ 2       │ 'f3'      │ '50'    │ 19906517.219543457 │ '±3.61%' │ 10      │
│ 3       │ 'f4'      │ '44'    │ 22339166.87965393  │ '±3.43%' │ 10      │
└─────────┴───────────┴─────────┴────────────────────┴──────────┴─────────┘
Note, this benchmark was run on my 2019 Intel MacBook Pro. Its disk is not what we get from the Alpine Docker image (running inside Azure App Service). To test that would be a different story. But, at least we can test it in Docker locally.
I created a Dockerfile that contains...
ARG NODE_VERSION=20.10.0
FROM node:${NODE_VERSION}-alpine
and ran the same benchmark in there by running docker compose up --build. The results are:
┌─────────┬───────────┬─────────┬────────────────────┬──────────┬─────────┐
│ (index) │ Task Name │ ops/sec │ Average Time (ns)  │ Margin   │ Samples │
├─────────┼───────────┼─────────┼────────────────────┼──────────┼─────────┤
│ 0       │ 'f1'      │ '151'   │ 6602581.124978315  │ '±1.98%' │ 16      │
│ 1       │ 'f2'      │ '112'   │ 8890548.4166656    │ '±7.42%' │ 12      │
│ 2       │ 'f3'      │ '44'    │ 22561206.40002191  │ '±1.95%' │ 10      │
│ 3       │ 'f4'      │ '37'    │ 26979896.599974018 │ '±1.07%' │ 10      │
└─────────┴───────────┴─────────┴────────────────────┴──────────┴─────────┘
Analysis/Conclusion
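First, focusing on the smaller file: the .json.br file manages about 26% fewer operations per second than the .json file (112 vs. 151 ops/sec in Docker)
Then, the larger file: the .json.br file manages about 16% fewer operations per second than the .json file (37 vs. 44 ops/sec in Docker)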
So that's what we're paying for a smaller Docker image. Depending on the size of the .json file, your app runs ~20% slower at this operation. But remember, as a file on disk (in the Docker image), it's ~8x smaller.
I think, in conclusion: It's a small price to pay. It's worth doing. It depends on your context.
Keep the numbers in perspective: it was able to process that 300KB pageinfo-en-ja-es.json.br file 37 times in one second. That means it took 27 milliseconds to process that file!
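To make the arithmetic behind those figures explicit, here is a small sketch that uses awk for the floating-point math. The function names throughput_drop and ms_per_op are made up for illustration, and the inputs are the ops/sec values from the Docker run above.

```shell
# Hypothetical helper: percentage drop in throughput when going from a
# baseline ops/sec (first argument) to a slower ops/sec (second argument).
throughput_drop() {
  awk -v base="$1" -v slow="$2" 'BEGIN { printf "%.1f\n", (base - slow) / base * 100 }'
}

# Hypothetical helper: milliseconds per operation for a given ops/sec.
ms_per_op() {
  awk -v ops="$1" 'BEGIN { printf "%.1f\n", 1000 / ops }'
}

throughput_drop 151 112   # small file: .json vs .json.br
throughput_drop 44 37     # large file: .json vs .json.br
ms_per_op 37              # large .json.br file
```

The first two calls express the slowdown as lost throughput; the last one turns "37 times in one second" into time per read.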
The caveats
To repeat what was mentioned above: This was run on my Intel MacBook Pro. It's likely to behave differently in a real Docker image running inside Azure.
The thing that I wonder the most about is arguably something that actually doesn't matter. 🙃
When you ask it to read in a .json.br file, there's less data to ask from the disk into memory. That's a win. You lose on CPU work but gain on disk I/O. But only the end net result matters so in a sense that's just an "implementation detail".
Admittedly, I don't know if the macOS or the Linux kernel does things with caching the layer between the physical disk and RAM for these files. The benchmark effectively asks "Hey, hard disk, please send me a file called ..." and this could be cached in some layer beyond my knowledge/comprehension. In a real production server, this only happens once because once the whole file is read, decompressed, and parsed, it won't be asked for again. Like, ever. But in a benchmark, perhaps the very first ask of the file is slower and all the other runs are unrealistically faster.
Feel free to clone https://github.com/peterbe/reading-json-files and mess around to run your own tests. Perhaps see what effect async can have. Or perhaps try it with Bun and its file system API.
Search hidden directories with ripgrep, by default
December 30, 2023
0 comments MacOSX, Linux
Do you use rg (ripgrep) all the time on the command line? Yes, so do I. An annoying problem with it is that, by default, it does not search hidden directories.
"A file or directory is considered hidden if its base name starts with a dot character (.)."
One such directory that is very important in my git/GitHub-based projects (which is all of them, by the way) is the .github directory. So I cd into a directory, search, and it finds nothing:
cd ~/dev/remix-peterbecom
rg actions/setup-node
# Empty! I.e. no results
It doesn't find anything because the file .github/workflows/test.yml is part of a hidden directory.
The quick solution to this is to use --hidden:
❯ rg --hidden actions/setup-node
.github/workflows/test.yml
20: uses: actions/setup-node@v4
I find it very rare that I would not want to search hidden directories. So I added this to my ~/.zshrc file:
alias rg='rg --hidden'
Now, this happens:
❯ rg actions/setup-node
.github/workflows/test.yml
20: uses: actions/setup-node@v4
With that being set, it's actually possible to "undo" the behavior. You can use --no-hidden:
❯ rg --no-hidden actions/setup-node
And that can be useful if there is a hidden directory that is not git ignored yet. For example .download-cache/.
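An alternative to the alias is ripgrep's configuration file, which it reads from the path in the RIPGREP_CONFIG_PATH environment variable (one flag per line). The sketch below is an assumption about how one might set that up: the function name setup_ripgreprc and the ~/.ripgreprc location are arbitrary choices, and the extra glob is there because --hidden also makes ripgrep search inside the .git directory.

```shell
# Hypothetical helper: write a ripgrep config file and point ripgrep at it.
# The file location is an arbitrary choice; ripgrep only reads whatever
# file RIPGREP_CONFIG_PATH names.
setup_ripgreprc() {
  config="${1:-$HOME/.ripgreprc}"
  cat > "$config" <<'EOF'
# Search hidden files and directories by default
--hidden
# ...but keep skipping Git's own metadata directory
--glob=!.git/*
EOF
  export RIPGREP_CONFIG_PATH="$config"
}
```

The export line would still need to live in ~/.zshrc (or similar) to apply to every new shell, and --no-hidden on the command line should still override the config.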
Zipping files is appending by default - Watch out!
October 4, 2023
0 comments Linux
This is not a bug in the age-old zip Linux program. It's maybe a bug in its intuitiveness.
I have a piece of automation that downloads a zip file from a file storage cache (GitHub Actions actions/cache in this case). Then, it unpacks it, and plucks some of the files from it into another fresh new directory. Lastly, it creates a new .zip file with the same name. The same name because that way, when the process is done, it uploads the new .zip file into the file storage cache. But be careful; does it really create a new .zip file?
To demonstrate the surprise:
$ cd /tmp/
$ mkdir somefiles
$ touch somefiles/file1.txt
$ touch somefiles/file2.txt
$ zip -r somefiles.zip somefiles
adding: somefiles/ (stored 0%)
adding: somefiles/file1.txt (stored 0%)
adding: somefiles/file2.txt (stored 0%)
Now we have a somefiles.zip to work with. It has 2 files in it.
Next session. Let's say it's another day and a fresh new /tmp directory and the previous somefiles.zip has been downloaded from the first session. This time we want to create a new somefiles directory but in it, only have file2.txt from before and a new file file3.txt.
$ rm -fr somefiles
$ unzip somefiles.zip
Archive: somefiles.zip
creating: somefiles/
extracting: somefiles/file1.txt
extracting: somefiles/file2.txt
$ rm somefiles/file1.txt
$ touch somefiles/file3.txt
$ zip -r somefiles.zip somefiles
updating: somefiles/ (stored 0%)
updating: somefiles/file2.txt (stored 0%)
adding: somefiles/file3.txt (stored 0%)
And here comes the surprise, let's peek into the newly zipped up somefiles.zip (which was made from the somefiles/ directory which only contained file2.txt and file3.txt):
$ rm -fr somefiles
$ unzip -l somefiles.zip
Archive: somefiles.zip
Length Date Time Name
--------- ---------- ----- ----
0 2023-10-04 16:06 somefiles/
0 2023-10-04 16:05 somefiles/file1.txt
0 2023-10-04 16:06 somefiles/file2.txt
0 2023-10-04 16:06 somefiles/file3.txt
--------- -------
0 4 files
I did not see that coming! The command zip -r somefiles.zip somefiles/ doesn't create a fresh new .zip file based on recursively walking the somefiles directory. It updates the existing archive by default: entries get added or replaced, but never removed!
The solution is easy. Right before the zip -r somefiles.zip somefiles command, do a rm somefiles.zip.
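A minimal sketch of that fix wrapped in a shell function, so the two steps can't drift apart in an automation script. The name fresh_zip is hypothetical; rm -f is used so the very first run, when no archive exists yet, doesn't fail.

```shell
# Hypothetical helper: always build the archive from scratch.
# Usage: fresh_zip somefiles.zip somefiles
fresh_zip() {
  archive="$1"
  shift
  # Remove any previous archive first; otherwise zip updates it in place
  # and entries for files that no longer exist stay in the archive.
  rm -f -- "$archive"
  zip -r -q "$archive" "$@"
}
```

Called as fresh_zip somefiles.zip somefiles in the second session above, the resulting archive should contain only file2.txt and file3.txt.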
How to count the most common lines in a file
October 7, 2022
0 comments Bash, MacOSX, Linux
tl;dr sort myfile.log | uniq -c | sort -n -r
I wanted to count recurring lines in a log file and started writing a complicated Python script but then wondered if I can just do it with bash basics.
And after some poking and experimenting I found a really simple one-liner that I'm going to try to remember for next time:
sort myfile.log | uniq -c | sort -n -r
You can't argue with the nice results :)
▶ cat myfile.log
one
two
three
one
two
one
once
one
▶ sort myfile.log | uniq -c | sort -n -r
4 one
2 two
1 three
1 once
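If the one-liner is hard to remember, it could be wrapped in a small shell function. This is only a sketch: the name top_lines and the optional limit argument are made up for illustration and are not part of any standard tool.

```shell
# Hypothetical helper: print the most common lines of a file, most frequent first.
# Usage: top_lines myfile.log [how_many]   (how_many defaults to 10)
top_lines() {
  file="$1"
  limit="${2:-10}"
  # sort groups identical lines together, uniq -c counts each group,
  # and the second sort orders the counts numerically, largest first.
  sort -- "$file" | uniq -c | sort -n -r | head -n "$limit"
}
```

The first sort matters because uniq only collapses adjacent duplicates.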
Find the largest node_modules directories with bash
September 30, 2022
0 comments Bash, MacOSX, Linux
tl;dr; fd -I -t d node_modules | rg -v 'node_modules/(\w|@)' | xargs du -sh | sort -hr
It's very possible that there's a tool that does this, and if so, please enlighten me.
The objective is to find which of all your various projects' node_modules directories is eating up the most disk space.
The challenge is that often you have nested node_modules within, and they shouldn't be included.
The command uses fd which comes from brew install fd and it's a fast alternative to the built-in find. Definitely worth investing in if you like to live fast on the command line.
The other important command here is rg which comes from brew install ripgrep and is a fast alternative to built-in grep. Sure, I think one can use find and grep but that can be left as an exercise to the reader.
▶ fd -I -t d node_modules | rg -v 'node_modules/(\w|@)' | xargs du -sh | sort -hr
1.1G ./GROCER/groce/node_modules/
1.0G ./SHOULDWATCH/youshouldwatch/node_modules/
826M ./PETERBECOM/django-peterbecom/adminui/node_modules/
679M ./JAVASCRIPT/wmr/node_modules/
546M ./WORKON/workon-fire/node_modules/
539M ./PETERBECOM/chiveproxy/node_modules/
506M ./JAVASCRIPT/minimalcss-website/node_modules/
491M ./WORKON/workon/node_modules/
457M ./JAVASCRIPT/battleshits/node_modules/
445M ./GITHUB/DOCS/docs-internal/node_modules/
431M ./GITHUB/DOCS/docs/node_modules/
418M ./PETERBECOM/preact-cli-peterbecom/node_modules/
418M ./PETERBECOM/django-peterbecom/adminui0/node_modules/
399M ./GITHUB/THEHUB/thehub/node_modules/
...
How it works:
- fd -I -t d node_modules: Find all directories called node_modules but ignore any .gitignore directives in their parent directories.
- rg -v 'node_modules/(\w|@)': Exclude all finds where the word node_modules/ is followed by a @ or a word character (\w, meaning a letter, a digit, or an underscore). Those are the nested ones.
- xargs du -sh: For each line, run du -sh on it. That's like doing cd some/directory && du -sh, where du means "disk usage" and -s means summarize (one total per directory) and -h means human-readable.
- sort -hr: Sort by the first column as a "human numeric sort" meaning it understands that "1M" is more than "20K". The -r puts the largest first.
Now, if I want to free up some disk space, I can look through the list and if I recognize a project I almost never work on any more, I just send it to rm -fr.
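For the find and grep exercise mentioned above, one possible answer needs only find. This is a sketch, not a drop-in equivalent: find's -prune stops it from descending into a matched node_modules, which is what skips the nested ones, but unlike fd it also walks hidden directories and knows nothing about .gitignore. The function name is hypothetical, and sort -h assumes a sort that supports human-readable sizes.

```shell
# Hypothetical helper: size of every outermost node_modules directory under a root.
# Usage: largest_node_modules [root]   (root defaults to the current directory)
largest_node_modules() {
  root="${1:-.}"
  # -prune prevents find from descending into a node_modules it has matched,
  # so nested node_modules directories are never reported.
  # -print0 and xargs -0 keep paths with spaces intact.
  find "$root" -type d -name node_modules -prune -print0 |
    xargs -0 du -sh |
    sort -hr
}
```

Because the pruning happens inside find, no separate grep step is needed to filter out the nested directories.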