Notes on the new Claude analysis JavaScript code execution tool

179 points by bstsb 8 months ago

advaith08 8 months ago

The custom instructions to the model say:

"Please note that this is similar but not identical to the antArtifact syntax which is used for Artifacts; sorry for the ambiguity."

They seem to be apologizing to the model in the system prompt?? This is so intriguing

therein 8 months ago

I wonder if they tried the following:
> Please note that this is similar but not identical to the antArtifact syntax which is used for Artifacts; sorry for the ambiguity, antArtifact syntax was developed by the late grandmother of one of our engineers and holds sentimental value.
lelandfe 8 months ago

Unfortunately, their prompt engineer learned of Roko's basilisk
andai 8 months ago

Has anyone looked into the effect of politeness on performance?
- pawelduda 8 months ago
  
  If you assume that asking someone nicely makes them more likely to try to help you, and this tendency shows in the training set, wouldn't you be more likely to "retrieve" a better answer from the model trained on it? Take this with a grain of salt, it's just my guess, not backed by anything
  - andai 8 months ago
    
    That makes intuitive sense, at least for raw GPT-3. The interesting question is whether the slave programming — er, instruction finetuning — makes it unnecessary.
    
    pawelduda 8 months ago
    
    Over time, most likely yes
- dakotasmith 8 months ago
  
  Large Language Models Understand and Can Be Enhanced by Emotional Stimuli
  https://arxiv.org/abs/2307.11760
- tkgally 8 months ago
  
  I've wondered the same thing. I tend to sprinkle my LLM prompts with "please"s, especially with longer prompts, as I feel that "please" might make clearer where the main request to the LLM is. I have no evidence that they actually yield better results, though, and people I share my prompts with might think I'm anthropomorphizing the models.
l1n 8 months ago

Multiple system prompt segments can be composed depending on needs, so it's useful for this sort of thing to be there to resolve inconsistencies.

animal_spirits 8 months ago

That's an interesting idea to generate javascript and execute it client side rather than server side. I'm sure that saves a ton of money for Anthropic by not having to spin up a server for each execution.

qeternity 8 months ago

The cost savings for this are going to be a rounding error. I imagine this is a broader push to be able to have Claude pilot your browser (and other applications) in the future. This is the right way to go about it versus having a headless agent: users can be in the loop and you can bootstrap an existing environment.
Otoh it’s going to be a security nightmare.
- rajnathani 8 months ago
  
  The cost-savings would actually be significant. Spinning up a sandboxed container/VM or chroot jail a thousand times a month for a user paying $20/month, when you as a company already have huge GPU bills on the training and inference side plus NRE costs, would be a gaping expense.
  - qeternity 8 months ago
    
    I really really don't think you understand how cheap it would be to spin up a node.js env a thousand times a month in a container. Let's be really really conservative and say that each invocation takes 30s of CPU time, resulting in 30,000 CPU seconds per month. Let's say that CPU cores can be had for $10/mo. We are talking about 10 cents per user per month. And in reality, this is still probably over an order of magnitude too high. You are literally talking fractions of a cent in reality.
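
The estimate above can be checked with a short sketch. It uses only the figures quoted in the comment (1,000 invocations a month, 30 s of CPU each, $10 per core per month) and additionally assumes a 30-day month, which the comment does not state. The function name `monthlyCostPerUser` is made up for illustration.

```javascript
// Back-of-envelope check of the per-user cost estimate.
// Assumption (not in the comment): a core-month is 30 days of one core.
const SECONDS_PER_MONTH = 30 * 24 * 60 * 60; // 2,592,000 seconds

function monthlyCostPerUser(invocationsPerMonth, cpuSecondsPerInvocation, dollarsPerCoreMonth) {
  // Total CPU time the user consumes in a month.
  const cpuSeconds = invocationsPerMonth * cpuSecondsPerInvocation;
  // Fraction of one core-month that this CPU time represents.
  const coreMonths = cpuSeconds / SECONDS_PER_MONTH;
  return coreMonths * dollarsPerCoreMonth;
}

const estimate = monthlyCostPerUser(1000, 30, 10);
console.log(estimate.toFixed(3)); // prints "0.116" (dollars per user per month)
```

Under those assumptions the result is roughly 11.6 cents per user per month, in line with the "10 cents" figure quoted.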
    
    rajnathani 8 months ago
    
    Yes, actually it would be cheaper if one pre-provisions VMs. Albeit, to ensure sufficient provisioned capacity one would have to slightly over-provision here, and for a coding power-user it would be almost $2-3 in cloud costs per month, excluding the software engineering costs of maintaining this fleet of servers and scheduling jobs on it.
    
    qeternity 8 months ago
    
    I laid out all my arithmetic. I don't know what you disagree with. How do you get to $2-3 per month? I suspect you don't actually understand how this would be run at scale, because I can't see any universe in which a single power user is generating $2-3 of compute cost _for a javascript container_.
    
    rajnathani 8 months ago
    
    My definition of a coding power-user would be hooking up Cursor/Copilot to the JS runtime to run JS 10-20 times a minute.
bhl 8 months ago

Makes a lot of sense given they released Artifacts previously, which let you build simple web apps.
The browser nowadays can be a web dev environment with nodebox and webcontainers; and JavaScript is the default language there.
Allows you to build experiences like interactive charts more easily.
stanleydrew 8 months ago

Also means you're not having to do a bunch of isolation work to make the server-side execution environment safe.
- Me1000 8 months ago
  
  This is the real value here. Keeping a secure environment to run untrusted code alongside user data is a real liability for them. It's not their core competency either, so they can just lean on browser sandboxing and not worry about it.
  - cruffle_duffle 8 months ago
    
    How is doing it server side a different challenge than something like Google Colab or any of those Jupyter notebook type services?
    
    donavanm 8 months ago
    
    Shared resources and multitenancy are how you get efficiency and density. Those are at direct odds with strict security boundaries. IME you need hardware-supported virtualization for a consistent security boundary around arbitrary compute. Linux namespaces (“containers”) and language runtime isolation are not it for critical workloads; see some of the early AWS Nitro/Firecracker work for more details. I _assume_ the cases you mentioned may be more constrained, or actually backed by VM partitions per customer.
    
    trillic 8 months ago
    
    Google Colab notebooks all run in individual VMs. It seems Anthropic doesn’t want to be in the “host a VM for every single user” business.

simonw 8 months ago

I've been trying to figure out the right pattern for running untrusted JavaScript code in a browser sandbox that's controlled by a page for a while now, looks like Anthropic have figured that out. Hoping someone can reverse engineer exactly how they are doing this - their JavaScript code is too obfuscated for me to dig out the tricks, sadly.

spankalee 8 months ago

The key is running the untrusted code in a cross-origin iframe so you can rely on the same-origin policies and `sandbox`[1].
You can control the code in a number of ways - loading a trusted shim that sets up a postMessage handler is pretty common. You can be careful and do that in a way that untrusted code can't forge messages to look like they're from the trusted code.
Another way is to use two iframes to the untrusted origin. One only loads untrusted code, the other loads a control API that talks to the trusted code. You can then do the loading into the iframe with a service worker. This is how the Playground Elements work (they're a set of web components that let you safely embed a mini IDE for code samples) https://github.com/google/playground-elements
[1]: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/if...
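
A minimal sketch of the host-page side of the first approach, assuming the untrusted code runs in an iframe with `sandbox="allow-scripts"` and no `allow-same-origin`. In that configuration the frame has an opaque origin, which message events report as the string `"null"`, so the host checks the source window as well. The names `isFromSandbox`, `createSandbox`, and `sandboxUrl` are hypothetical, and this is not a description of how Anthropic or Playground Elements do it. It also does not by itself stop untrusted code inside the frame from posting its own messages; that is the extra care mentioned above.

```javascript
// Pure check: accept a message only if it was sent by the sandbox iframe's
// window and carries the opaque origin expected for a sandboxed frame.
function isFromSandbox(event, sandboxWindow) {
  return sandboxWindow != null &&
    event.source === sandboxWindow &&
    event.origin === "null";
}

// Browser-only part (not executed here): create the sandboxed iframe and
// forward messages that pass the check to the onResult callback.
function createSandbox(doc, sandboxUrl, onResult) {
  const frame = doc.createElement("iframe");
  // Scripts may run, but the frame gets no same-origin access to the host.
  frame.setAttribute("sandbox", "allow-scripts");
  frame.src = sandboxUrl; // hypothetical URL of the page hosting the shim
  frame.style.display = "none";
  doc.defaultView.addEventListener("message", (event) => {
    if (isFromSandbox(event, frame.contentWindow)) {
      onResult(event.data);
    }
  });
  doc.body.appendChild(frame);
  return frame;
}
```

In a page this might be called as `createSandbox(document, sandboxUrl, handleResult)`, where `handleResult` is whatever the host does with the reported data.
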
- purple-leafy 8 months ago
  
  The cross origin iframe method is the same I’ve employed in a few browser extensions I’ve built
TimTheTinker 8 months ago

You should check out how Figma plugins work. They have blog posts on all the tradeoffs they considered.
What I believe they settled on was a JS interpreter compiled to WASM -- it can run arbitrary JS but with very well-defined and restricted interfaces to the outside world (the browser's JS runtime environment).
- bhl 8 months ago
  
  > We now use QuickJS, a JavaScript VM written in C and cross-compiled to WebAssembly.
  https://www.figma.com/blog/an-update-on-plugin-security/
  - rekttrader 8 months ago
    
    Yo dog, we put a JavaScript VM inside your JavaScript VM
h1fra 8 months ago

Much easier in the browser, which has V8 isolates; however, even with web workers you still need to guard against CPU/network hijacking, which is not ideal.
If it's only the user's own code it's fine but if they can run code from others it's a massive pain indeed.
On the server it's still not easy in 2024, even with Firecracker (doesn't work on mac), Workerd (is a subset of NodeJS), isolated-vm (only pre-compiled code, no modules).
dartos 8 months ago

Isn’t that how all JavaScript code runs in a browser?
- TheRealPomax 8 months ago
  
  Isn't what how all JS runs in the browser? There are different restrictions based on where JS comes from, and what context it gets loaded into.
  - dartos 8 months ago
    
    All browser js runs in a browser sandbox and, by default, none of it needs to be explicitly trusted in most browsers.
    I don’t think there are very many restrictions on what js can do on a given page. At least none come to mind.
    Not really sure what you mean by “context” either. Maybe service workers? Unless you’re talking about loading js within iframes… but that’s a different can of worms.
    
    mattmanser 8 months ago
    
    You've misunderstood the GP's question. If you read the other answers you might understand what he's asking. Hence exactly why they're all talking about iframes.
    You used to be able to do it quite easily, but it meant people could essentially impersonate the user if you got them to execute some javascript. So having a code editor would be a recipe for account hijacking.
    So gradually browsers locked it all down. Long gone are the days of just doing 'eval()'. In the 2000s I worked on code where we actually did that!
    Ah, the days of getting away with massive security holes that no-one even knew how to exploit.
    
    dartos 8 months ago
    
    > If you read the other answers you might understand what he's asking
    Dude, relax. There were no other comments when I asked…
aabhay 8 months ago

What are the attack vectors for a web browser js environment to do malicious things? All browser code is sandboxed via origin controls, and process isolation. It can’t even open an iframe and read the contents of that iframe.
- TimTheTinker 8 months ago
  
  It's a fine place to run code trusted by the server (or code trusted by the client within the scope of the app).
  But for code not trusted by either, it's bad -- user data in the app can be compromised/exfiltrated.
  Hence for third-party plugins for a web app, the built-in JS runtime doesn't have sufficient trust management capability.
- njtransit 8 months ago
  
  The attack vectors are either some type of credential or account compromise. Generally, these attacks fall under the cross-site scripting (XSS) umbrella. The browser exposes certain things to the JS context based on the origin. E.g. if you log in to facebook.com, facebook.com might set an authentication cookie that can be accessed in the JS context. Additionally, all outbound requests to facebook.com will include this authentication cookie. So, if you can execute JS in the context of facebook.com, you could steal this cookie or have the browser perform malicious actions that get implicitly authenticated.
mannanj 8 months ago

commenting to save this for later
- singularity2001 8 months ago
  
  I used this technique until someone told me that you can use the upvote arrow and find these in your profile
  - sunaookami 8 months ago
    
    You can also click on the post and then favorite them, but favorites are public.

thenaturalist 8 months ago

Funnily enough, I test code generation both on unpaid Claude and ChatGPT.

When working with Python, I've found Sonnet (pre 3.5) to be quite superior to ChatGPT (mostly 4, sometimes 3.5) with regards to verbosity, structure and prompt / instruct comprehension.

I've switched to a JavaScript project two weeks ago and the tables have turned.

Sonnet 3.5 is much more verbose and I need to make corrections a few times, whereas ChatGPTs output is shorter and on point.

I'll be watching closely to see whether this improves if Claude are focussing on JS themselves.

bravura 8 months ago

Don't call me crazy (I am actually), but sometimes I will keep both ChatGPT and Claude open side-by-side and use them to audit each other.
I'll give them the same prompt.
When they respond, re-prompt with: "What are your thoughts on this approach? Pros and cons. Integrate the best ideas from both: [answer from the other model]"
Repeat until total satisfaction or frustration is achieved.
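
The loop described above could be sketched roughly as follows. `askA` and `askB` are hypothetical async functions (prompt in, answer string out) standing in for the two chat sessions; the sketch assumes each keeps its own conversation history, since the follow-up prompt does not repeat the original task. No real API is implied.

```javascript
// Builds the follow-up prompt quoted above, with the other model's answer filled in.
function buildCrossAuditPrompt(otherAnswer) {
  return "What are your thoughts on this approach? Pros and cons. " +
    "Integrate the best ideas from both: " + otherAnswer;
}

// Gives both models the same prompt, then has each critique the other's
// latest answer for a fixed number of rounds.
async function crossAudit(askA, askB, prompt, rounds) {
  let [a, b] = await Promise.all([askA(prompt), askB(prompt)]);
  for (let i = 0; i < rounds; i++) {
    // Both follow-up prompts are built from the previous round's answers
    // before either variable is reassigned.
    [a, b] = await Promise.all([
      askA(buildCrossAuditPrompt(b)),
      askB(buildCrossAuditPrompt(a)),
    ]);
  }
  return { a, b };
}
```

The fixed `rounds` count stands in for "until total satisfaction or frustration is achieved", which is a human judgment the sketch does not attempt to automate.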
- emmanueloga_ 8 months ago
  
  This is similar to what Aider does in "architect" mode [1].
  --
  1: https://aider.chat/docs/usage/modes.html#architect-mode-and-...

mritchie712 8 months ago

duckdb-wasm[0] would be a good addition here. We use it in Definite[1] and I can't say enough good things about duckdb in general.

0 - https://github.com/duckdb/duckdb-wasm

1 - https://www.definite.app/

refulgentis 8 months ago

Interesting. I'm curious: what about it helps here specifically?
Approaching it naively and undercaffeinated, it sounds abstract, as in it would benefit the way any code could benefit from a persistence layer / DB.
Also I'm curious if it would require a special one-off integration to make it work, or could it write JS that just imported the library?

koolala 8 months ago

JavaScript is the perfect language for this. I can't wait for a sandboxed coding environment to totally set AI loose.

mlejva 8 months ago

Shameless plug here. We're building exactly this at E2B [0] (I'm the CEO). Sandboxed cloud environments for running AI-generated code. We're fully open-source [1] as well.
[0] https://e2b.dev
[1] https://github.com/e2b-dev
- bhl 8 months ago
  
  Are sandboxed browser environments on your roadmap? Would much prefer to use the client's runtime for things that aren't computationally expensive, like web dev.
croes 8 months ago

They could run a little crypto miner to get more profit

nprateem 8 months ago

NGL I was impressed when I asked Claude how to do some fancy UI stuff and it just spat out some working react. A few hours later and I'd saved £500 I was going to spend on a designer.

willsmith72 8 months ago

This is a great step, but to me not very useful until they move out of context. Still, I'm high on Anthropic and happy gen AI didn't turn into a winner-take-all market like everyone predicted in 2021.

freediver 8 months ago

It will work for any generic data, like a blog post. You can ask it to visualize the 'key concepts'.