I think this sentence:
> But matrix multiplication, to which our civilization is now devoting so many of its marginal resources, has all the elegance of a man hammering a nail into a board.
is the most interesting one.
A man hammering a nail into a board can be both beautiful and elegant! If you've ever seen someone effortlessly hammer nail after nail into wood, hardly having to think at all about what they're doing, you've seen a master craftsman at work. Speaking as a numerical analyst, I'd say a well-multiplied matrix is much the same. There is much that goes into how deftly a matrix might be multiplied. And just as someone can hammer a nail poorly, so too can a matrix be multiplied poorly. I would say the matrices being multiplied in service of training LLMs are not a particularly beautiful example of what matrix multiplication has to offer. The fast Fourier transform viewed as a sparse matrix factorization of the DFT and its concomitant properties of numerical stability might be a better candidate.
A somewhat more beautiful matmul for neural networks is given by the Monarch paper: https://arxiv.org/abs/2204.00595
Generally, low-rank and block-diagonal matrices are both great strategies for producing expressive matmuls with fewer parameters. We can view the FFT as a particularly deft example of factorizing one big matmul into a number of block-diagonal matmuls, greatly reducing the overall number of multiplications by minimizing the block size. However, on a GPU/TPU, we have a lot more parallelism available, so the sweet spot for the block size may be larger than 2x2...
We can also mix low-rank, block diagonal, and residual connections to get the best of both worlds:
x' = (L@x + B@x + x)
The block-diagonal matrix does 'local' work, and the low-rank matrix does 'broadcast' work. I find it pretty typical to be able to replace a single dense matmul with this kind of structure and save ~90% of the params with no quality cost... (and sometimes the regularization actually helps!)
https://hazyresearch.stanford.edu/blog/2023-12-11-truly-subq... is an approachable overview of the Monarch approach, for those interested!
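For anyone who wants to see the shape of this concretely, here is a minimal numpy sketch of the x' = L@x + B@x + x structure described above; the sizes (n, rank, block) are made-up illustrative values, not taken from Monarch or any particular model:

    import numpy as np

    rng = np.random.default_rng(0)
    n, rank, block = 512, 16, 32                  # made-up illustrative sizes
    num_blocks = n // block

    # Low-rank factor L = U @ V ('broadcast' work) and a block-diagonal B ('local' work).
    U = rng.normal(size=(n, rank)) / np.sqrt(rank)
    V = rng.normal(size=(rank, n)) / np.sqrt(n)
    blocks = rng.normal(size=(num_blocks, block, block)) / np.sqrt(block)

    def structured_matmul(x):
        low_rank = U @ (V @ x)
        local = np.concatenate(
            [blocks[i] @ x[i * block:(i + 1) * block] for i in range(num_blocks)]
        )
        return low_rank + local + x               # x' = L@x + B@x + x

    x = rng.normal(size=n)
    y = structured_matmul(x)

    dense_params = n * n
    structured_params = 2 * n * rank + num_blocks * block * block
    print(y.shape, structured_params / dense_params)   # 0.125 with these sizes

With these toy sizes the structured version carries about an eighth of the parameters of a dense n-by-n matrix, which is roughly the kind of saving described above.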
There's a lot of opportunity here. Just because matrix multiplication makes for a beautiful mathematical building block, and a very reasonable one to build high-level ML logic on, doesn't mean it needs to be computed the same way, and in the same order, that we learned in linear algebra courses.
I'm quite curious if this is being used in practice at scale, or whether it's still in the lab at the moment!
> doesn't mean it needs to be computed the same way, and in the same order, that we learned in linear algebra courses.
I think this touches on something fundamental. As a stand-alone operation matmul is ugly because it's arbitrary. In other words: if the goal were just to entangle values, there are plenty of ways to do it, so why does this particular way land on ae+bg etc.? You kind of need algebra/geometry to justify matmul this way, which makes it obviously useful, but now it's still ugly, exactly because you had to invoke this other stuff.
Compare that situation to algebra and geometry themselves, which in a real sense don't need each other. Or to things like logic, sets, categories, processes, numbers, knots, games, etc where you can build up piles of stuff based on it in a whole rich universe before you need to appeal to much that is "outside". And in those universes operations would be defined mostly in ways that were more like "natural" or "necessary" without anything feeling arbitrary.
Traditional matmul is beautiful in the sense of "connections across and between", where all the particulars do become necessary. For those that prefer a certain amount of abstract perfection / platonism / etc or those with a taste for foundations though, it's understandable if it's not that appealing. This is related to, but not the same as the pure vs applied split.
Do low rank/block diagonal matrices come up in LLMs often? What about banded or block tridiagonal? Intuitively banded matrices seem like they ought to be good at encoding things about the world… everything is connected but not randomly so.
https://www.youtube.com/watch?v=Ruf-cLr2PZ8 I always think of this when thinking about the gracefulness of a hammer.
Upholstery tacks are sold sterilized for people holding them in their mouths. Now I wonder the same about drywall nails!
Thanks for the link; that is absolutely masterful work.
Turns out it’s the skill of the person handling the hammer that matters most. Enlightening! Appreciate the link!
Wow, thank you for this gem!!!
Yes!
>The fast Fourier transform viewed as a sparse matrix factorization of the DFT
Riffing further on the Fourier connection: are you planning to explore the link between matmul and differentiation?
Using the "Pauli-Z" matrix that you introduced without a straightforward motivation, eg.
(I took it that you intended it to be a "backyard instance" of "dual numbers")
The computations in transformers are actually generalized tensor-tensor contractions implemented as matrix multiplications. Their efficient implementation in GPU hardware involves many algebraic gems and is a work of art. You can have a taste of the complexity involved in their design in this YouTube video: https://www.youtube.com/live/ufa4pmBOBT8
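To make the "contraction implemented as matmul" point concrete, here is a small numpy/einsum sketch; the shapes are purely illustrative, and this is only the attention-score computation, not a full layer:

    import numpy as np

    rng = np.random.default_rng(0)
    batch, heads, seq, d = 2, 4, 8, 16            # made-up illustrative sizes

    Q = rng.normal(size=(batch, heads, seq, d))
    K = rng.normal(size=(batch, heads, seq, d))

    # Attention scores as a tensor-tensor contraction over the shared feature axis d...
    scores = np.einsum('bhqd,bhkd->bhqk', Q, K) / np.sqrt(d)

    # ...and the same contraction written as a batch of ordinary matrix multiplications,
    # which is the form the hardware actually executes.
    scores_matmul = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(d)

    assert np.allclose(scores, scores_matmul)

The einsum spells out the contraction; the batched matmul is how it gets mapped onto the matrix units.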
> But matrix multiplication, to which our civilization is now devoting so many of its marginal resources, has all the elegance of a man hammering a nail into a board.
Elegance is a silly critique. Imagine instead we were spending trillions on floral bouquets, calligraphy, and porcelain tea sets. I would argue that would be a bad allocation of resources.
What matters to me is whether it solves the problems we have. Not how elegant we are in doing so. And to the extent AI fails to do that, I think those are valid critiques. Not how elegant it is.
"Creeping elegance", I guess: https://en.wiktionary.org/wiki/creeping_elegance
But elegant can mean minimal, restrained, parsimonious, sparing. That's different from a bunch of paraphernalia and flowery nonsense.
You mean tech-debt-ridden spaghetti code that works now is as good as elegant, correct, and efficient code?
You're right. It's all about solving problems.
Maybe we need a word that, when applied to mathematical concepts, describes how simple, easy to understand and generally useful a solution or idea is.
I wonder what that word could be.
the aesthetics of math and physics is by far the most boring discussion that can be had. i used to be utterly repulsed by such talk in undergrad - beauty this and that. it absolutely always felt affected and put on - as if you talk about it enough, you'll actually convince people outside of the major to give you the same plaudits as real artists.... yea right lol.
That’s the opinion of one person. There are many mathematicians who find beauty in the maths they are studying.
I will emphasize your point more forcefully. All mathematicians I know work on what they work on because of the beauty and aesthetics in their field.
Much like sex. Sex has reproductive utility but that's not why most people engage in it. Those who do are missing much.
The notion of beauty for a mathematician is quite specialized. It's the difference between spaghetti code that works and elegant, efficient code that is correct. The latter is easy to build upon.
> Much like sex.
My guy you know lots of people in here have read Feynman right? You should cite him instead of pretending you were clever enough to come up with the analogy yourself.
Quite the contrary. I expect the majority of HN readers to know the quote, as a baseline if you will, and not harbor thoughts that by having read it they are a part of an exclusive club.
Channeling Good Will Hunting much huh? Most HN'ers would have watched that too.
I have no idea what you're trying to say - it is generally understood everywhere in the world (ie all forms of human culture) that it's pathetic to pass off someone else's insights as your own.
Only when there is an expectation of being perceived as original. When people use differentiation people don't cite Newton or Leibniz, do they?
.... Of course they do? You can still find citations of those papers to this day.
For all of your "forceful" comments on math, I think probably you don't actually know much about it.
> You can still find citations of those papers to this day.
That's not what I contested. What fraction of people who use differentials in their published work still cite Newton or Leibniz was the point. You can count the number of such citations in the last 10 years of, say, neural nets literature, or applied maths literature, and report. That's plenty of use.
Citations to their differential calculus that are still made are mostly in the context of history of math.
Seems numeracy or comprehension is not your strong point. LOL.
> I have no idea what you're trying to say
Now I don't doubt it. LOL
> What fraction of people who use differentials in their published work still cite Newton or Leibniz was the point
Those papers were written in the 1600s. "The Character of Physical Law", the essay you're ripping off, was written in 1964. 100%, papers from the 1960s are cited every single time the techniques are used.
You are as tedious as the original refrain I was complaining about (which is not at all ironic). What's most tedious is you're not actually a mathematician but presume to speak for them.
You are changing goal posts now. Your absolute claim was
> it's pathetic to pass off someone else's insights as your own.
To which my point was citations are made when there is an expectation of originality. By now Feynman's anecdotes are folklore and folk wisdom.
OK, let's go by your standards. The Cooley-Tukey FFT algorithm was "discovered" by them around 1965. How often do they get a citation when the FFT is used, especially in comments on a social site such as HN?
LOL, even 10-year-old results do not get cited because they are considered common knowledge.
That said, Witt's notion of beauty that Propp is critiquing in the posted article is just plain idiotic. Lack of commutativity is not lack of beauty. What a stupid idea.
Mathematical beauty and imagination are different. One of Hilbert's grad students dropped out to become a poet. Hilbert is reported to have said: 'I never thought he had enough imagination to be a mathematician.'
A little unsolicited advice: if you are an aspiring mathematician (I am very happy for you if you are), and you do not have a sense of good taste or mathematical beauty, you will probably not have a good time.
> if you are an aspiring mathematician, and you do not have a sense of good taste or mathematical beauty, you will probably not have a good time
Lol I have a PhD from a T10 and 15 published papers. I'm pretty sure I don't need your advice on "taste" or "beauty".
Yup. PhD is a start.
My condolences though, for being in a line of work where you don't perceive beauty.
I'm a researcher in FAANG - I'll keep in mind your condolences when I get my yearly RSU re-grant lololol.
You're equivocating over the verdict, here. Are they right?
> There are many mathematicians who find beauty
Lol did you think this was clever? You just literally reiterated exactly what I said. See, if you had said "there are many pianists that find beauty in math" - you know like how many mathematicians find beauty in piano concertos - then you'd have me.
Pianists don't find beauty in written maths, but mathematicians don't usually find beauty in sheet music either. It is the performance, accessible to our senses, that can convey beauty even to amateurs.
Incidentally, in the parts of maths where the concepts can be visualized, such as fractal theory, non-mathematicians seem to love what they see.
> written maths
Absolutely no one when they're navel gazing on this topic is discussing the aesthetics of notation.
> seem to love what they see.
Nor visualizations
More like watching a weaving machine than watching a person hammer nails imo. Maybe like an old-time mill, with several machines if you think in terms of actual processing on an accelerator?
There's a wooden weaving machine at a heritage museum near me that gives me the same 'taste' in my brain as thinking about 'matrix' processing in a TPU or whatever.
The commutation problem has nothing to do with matrices. Rotations in space do not commute, and that will be the case whether you represent them as matrices or in some other way.
Maybe I'm just being ai-phobic or whatever but I strongly suspect the original article is written by grok based on how it goes off on bizarre tangents describing extremely complicated metaphors that are not only inaccurate but also wouldn't in any way be insightful even if they were accurate.
Well, function composition f(g(x)) is not the same as g(f(x)), and when you represent f and g as matrices relative to some suitable set of basis functions then obviously AB and BA should be different. If the multiplication were defined any differently, that wouldn't work.
The way that I used to put this was, "If I put on my shoes before my socks, I'll get a different result than if I put on my socks before my shoes. Order of operations matters."
Quaternions are beautiful too until you sit down to multiply them.
Matrices represent linear transformations. Linear transformations are very natural and "beautiful" things. They are also very clearly not commutative: f(g(x)) is not the same as g(f(x)). The matrix algebra perfectly represents all of this, and as a result, FGx is not the same as GFx. It's only not "beautiful" if you believe that matrix multiplication is a random operation that exists for no reason.
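A minimal numpy illustration of that point, using a 90-degree rotation F and a reflection G (the specific matrices are just convenient examples):

    import numpy as np

    F = np.array([[0, -1],
                  [1,  0]])        # rotate 90 degrees counterclockwise
    G = np.array([[1,  0],
                  [0, -1]])        # reflect across the x-axis

    x = np.array([1, 0])

    print(F @ (G @ x))             # reflect, then rotate:  [0 1]
    print(G @ (F @ x))             # rotate, then reflect:  [0 -1]
    print(np.array_equal(F @ G, G @ F))   # False: the products are different matrices

The two orders of composition land the same starting vector in different places, and the matrix products F@G and G@F differ accordingly.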
So, Hardy focused on good explanations, and that was what he meant by beauty. Fair enough. The best objective definition of beauty I know of is "communication across a gap". This covers flowers, mathematics, and all kinds of art, including art I think is ugly such as Lucian Freud and Hans Giger I guess. So now I'm describing some things as beautiful and ugly at the same time, which betrays that there's a relative component to it (relative, objectively). That means I wish some things - including mathematics, which is usually tedious - communicated better, or explained things that seem to me to matter more: I feel in my gut that there's potential for this. So I don't rate mathematics as beautiful, any of it, personally.
But I'll admit it's barely beautiful. Within that context, I guess the article's lawyering for the relative beauty of a matrix was a success, but I always liked them better than calculus or group theory anyway.
Beauty, symmetry, etc. are largely irrelevant; the key point is that it does not scale, and burning gigawatts to compute these matrices (even with all those tricks) will not scale or compete with more efficient/direct methods in the long term. Perhaps transformers are a very elaborate sunk-cost fallacy where pivoting to a scalable, simpler architecture is treated as "too risky" even when the cost of a new GPU cluster dwarfs whatever it takes to bring an architecture from 0 to ChatGPT level.
The whole issue with this industry is that it moves so fast, there is no "long term." You're either in all the way in a likely futile attempt to capture the market or you're not in at all. So you also don't have time to really innovate on the hardware or software level and you need to put everything into training data and training hardware.
I am willing to admit that I find matrix multiplication ugly, as well as non-intuitive. But, I am also willing to admit that my finding it ugly is likely a result of my relative mathematical immaturity (despite my BS in math).
Maybe it helps to think of matrix multiplication as a special case of the composition of linear transformations. In the finite-dimensional case they can be expressed as matrix multiplications.
I guess Stephen Witt must not like subtraction either, since a-b =/= b-a. Nor division.
I just finished reading lots of Stephen Witt quotes on goodreads. He comes across as a white Malcolm Gladwell, except that he actually does know what "Igon values" are so I don't know what his excuse is.
> white Malcolm Gladwell
I'm intrigued. How would a white Malcolm Gladwell's quotes differ from the IRL Malcolm Gladwell?
Different hair, mostly.
Don't like matrices? Introducing: Penrose abstract index notation. Or "I can't believe it's not matrices".
Matmuls (and GEMM) are a hardware-friendly way to stuff a lot of FLOPS into an operation. They also happen to be really useful as a constant-step discrete version of applying a mapping to a 1D scalar field.
I've mentioned it before, but I'd love for sparse operations to be more widespread in HPC hardware and software.
Do you disagree with my take or think I’m missing Witt’s point? I’d be happy to hear from people who disagree with me.
I think 4x4 matrices for 3D transforms (esp. homogeneous coordinates) are very elegant.
I think the intended critique is that the huge n*m matrices used in ML are not elegant - but the point is made poorly by pointing out properties of general matrices.
In ML the matrices are just "data", or "weights". There are no interesting properties to these matrices - in a way, a von Neumann elephant (https://en.wikipedia.org/wiki/Von_Neumann%27s_elephant).
Now, this might just be what is needed for ML to work and deal with messy real-world data! But mathematically it is not elegant.
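As a small illustration of the elegant end of that spectrum, here is a toy 4x4 homogeneous transform in numpy; the rotation angle and translation are arbitrary choices:

    import numpy as np

    # Rotate 90 degrees about the z-axis, then translate by (1, 2, 3), as one 4x4 matrix.
    theta = np.pi / 2
    T = np.eye(4)
    T[:3, :3] = [[np.cos(theta), -np.sin(theta), 0],
                 [np.sin(theta),  np.cos(theta), 0],
                 [0,              0,             1]]
    T[:3, 3] = [1, 2, 3]

    p = np.array([1.0, 0.0, 0.0, 1.0])   # the point (1, 0, 0) in homogeneous coordinates
    print(T @ p)                          # [1. 3. 3. 1.]: rotated to (0, 1, 0), then shifted

One matrix, one multiply, and both the rotation and the translation are applied.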
I think you're right that the inelegant part is how AI seems to just consist of endless loops of multiplication. I say this as a graphics programmer who realized years ago that all those beautiful images were just lots of MxNs, and AI takes this to a whole new level. When I was in college they told us most of computing resources were used doing Linear Programming. I wonder when that crossed over to graphics or AI (or some networking operation like SSL)?
What could any complex phenomenon possibly be other than small “mundane” components combined together in a variety of ways and in immense quantities?
All such things are like this.
For me, this is fascinating, mind-boggling, non-sensical, and unsurprising, all at once.
But I wouldn’t call it inelegant.
> When I was in college they told us most of computing resources were used doing Linear Programming.
I seriously doubt that was ever true, except perhaps for a very brief time in the 1950s or 60s.
Linear programming is an incredibly niche application of computing used so infrequently that I've never seen it utilised anywhere despite being a consultant that has visited hundreds of varied customers including big business.
It's like Wolfram Mathematica. I learned to use it in University, I became proficient at it, and I've used it about once a decade "in industry" because most jobs are targeted at the median worker. The median worker is practically speaking innumerate, unable to read a graph, understand a curve fit, or if they do, their knowledge won't extend to confidence intervals or non-linear fits such as log-log graphs.
Teachers that are exposed to the same curriculum year after year, seeing the same topic over and over assume that industry must be the same as their lived experience. I've lost count of the number of papers I've seen about Voronoi diagrams or Delaunay triangulations, neither of which I've ever seen applied anywhere outside of a tertiary education setting. I mean, seriously, who uses this stuff!?
In the networking course in my computer science degree I had to use matrix exponentiation to calculate the maximum throughput of an arbitrary network topology. If I were to even suggest something like this at any customer, even those spending millions on their core network infrastructure, I would be either laughed at openly, or their staff would gape at me in wide-eyed horror and back away slowly.
I have not only used linear programming in industry, I have also had to write my own solver because the existing ones (even commercial) were too slow. (This was possible only because I only cared about a very approximate solution.)
The triangulations you mention are important in the current group I'm working in.
The first two results from Google with "Voronoi astro" gave two different uses than the one I knew about (sampling fibre bundles): https://galaxyproject.org/news/2025-06-11-voronoi-astronomy/ https://arxiv.org/abs/2511.14697
Astronomy is pure research and is performed almost exclusively by academics.
I’m not saying these things have zero utility, it’s just that they’re used far less frequently in industry than academics imagine.
And astronomy tends to throw up technology that becomes widely used (WiFi being the obvious example) or becomes of "interest" to governments. I expect that AMR code will be used/ported to nuclear simulations if it proves to be useful. Do I expect it to be used in a CRUD app? Obviously not, but use by most software shops isn't a measure of importance.
3d modelers would like to have a word with you.
Part of the reason why linear programming doesn't get used as often is that there is no industry-standard software implementation that is not atrociously priced. Same deal with Mathematica.
I think it conflates the map and the territory.
Linear transformations are a beautiful thing, but matrices are an ugly representation that nevertheless is a convenient one when we actually want to compute.
Elegant territory. Inelegant, brute-force, number crunching map.
The inelegance to me isn't in the definition of the operation, but that it's doing a huge amount of brute-force work to mix every part of the input with every other part, when the answer really only depends on a tiny fraction of the input. If we somehow "just knew" what parts to look at, we could get the answer much more efficiently.
Of course that doesn't really make any sense at the matrix level. And (from what I understand) techniques like MoE move in that direction. So the criticism doesn't really make sense anymore, except in that brains are still much much more efficient than LLMs so we know that we could do better.
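For what it's worth, the routing idea can be sketched in a few lines of numpy; this is only a toy illustration of "only look at a few parts", not any particular MoE implementation, and all the sizes and names are made up:

    import numpy as np

    rng = np.random.default_rng(0)
    d, num_experts, k = 16, 8, 2                     # made-up illustrative sizes

    experts = rng.normal(size=(num_experts, d, d))   # one small matrix per "expert"
    gate = rng.normal(size=(num_experts, d))         # routing weights
    x = rng.normal(size=d)

    # Score every expert, keep only the top-k, and skip the rest entirely,
    # so most of the "mix everything with everything" work is never done.
    scores = gate @ x
    top = np.argsort(scores)[-k:]
    weights = np.exp(scores[top]) / np.exp(scores[top]).sum()

    y = sum(w * (experts[i] @ x) for w, i in zip(weights, top))
    print(top, y.shape)   # only k of the num_experts matmuls were actually executed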
If the O(n^3) schoolbook multiplication were the best that could be done, then I'd totally agree that "it's simply the nature of matrices to have a bulky multiplication process". Yet there's a whole series of algorithms (from the Strassen algorithm onward) that use ever-more-clever ways to recursively batch things up and decrease the asymptotic complexity, most of which aren't remotely practical. And for all I know, it could go on forever down to O(n^(2+ε)). Overall, I hate not being able to get a straight answer for "how hard is it, really".
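For the curious, here is a toy recursive Strassen in numpy for square power-of-two matrices; real implementations are far more careful about memory, cutoffs, and stability, so treat this purely as an illustration of the "seven products instead of eight" trick:

    import numpy as np

    def strassen(A, B, cutoff=64):
        """Toy Strassen multiply for square power-of-two matrices."""
        n = A.shape[0]
        if n <= cutoff:
            return A @ B                     # fall back to schoolbook/BLAS
        m = n // 2
        A11, A12, A21, A22 = A[:m, :m], A[:m, m:], A[m:, :m], A[m:, m:]
        B11, B12, B21, B22 = B[:m, :m], B[:m, m:], B[m:, :m], B[m:, m:]

        # Seven recursive products instead of eight.
        M1 = strassen(A11 + A22, B11 + B22, cutoff)
        M2 = strassen(A21 + A22, B11, cutoff)
        M3 = strassen(A11, B12 - B22, cutoff)
        M4 = strassen(A22, B21 - B11, cutoff)
        M5 = strassen(A11 + A12, B22, cutoff)
        M6 = strassen(A21 - A11, B11 + B12, cutoff)
        M7 = strassen(A12 - A22, B21 + B22, cutoff)

        C = np.empty_like(A)
        C[:m, :m] = M1 + M4 - M5 + M7
        C[:m, m:] = M3 + M5
        C[m:, :m] = M2 + M4
        C[m:, m:] = M1 - M2 + M3 + M6
        return C

    A = np.random.rand(256, 256)
    B = np.random.rand(256, 256)
    assert np.allclose(strassen(A, B), A @ B)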
For anyone interested, there is an introductory survey of the current lower bound at: https://en.wikipedia.org/wiki/Computational_complexity_of_ma...
Maybe the problem is that matrices are too general.
You can have very beautiful algorithms when you assume the matrices involved have a certain structure. You can even have that A*B == B*A, if A and B have a certain structure.
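Circulant matrices are one concrete example: they are all diagonalized by the same DFT basis, so they commute with each other. A quick numpy check (the small helper builds a circulant from its first column):

    import numpy as np

    rng = np.random.default_rng(0)

    def circulant(c):
        """Circulant matrix whose first column is c; column k is c rolled down by k."""
        return np.column_stack([np.roll(c, k) for k in range(len(c))])

    A = circulant(rng.normal(size=8))
    B = circulant(rng.normal(size=8))   # both diagonalized by the same DFT basis
    G = rng.normal(size=(8, 8))         # a generic, unstructured matrix

    print(np.allclose(A @ B, B @ A))    # True: circulant matrices commute
    print(np.allclose(A @ G, G @ A))    # False (almost surely) for a generic matrix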
Ignore me then because I agree with you. :) He sounds like someone who, upon first hearing jazz, complained it was ugly.
> sends the pair (x, y) to the pair (−x, y)
I know linear algebra, but this part seems profoundly unclear. What does "send" mean? Following with different examples in 2 by 2 notation only makes it worse. It seems like you're changing referents constantly.
Let me try.
In US schools during K-12, we generally learn functions in two ways:
1. A 2-d line chart with an x-axis and y-axis, like temperature over time, history of stock price, etc. Classically the independent variable is on the horizontal axis and the dependent variable is on the vertical axis. And even people who have forgotten almost all math can instantly understand the graphics displayed when they're watching CNBC or a TV weather report.
2. We also think of functions like little machines that do things for us. E.g., y = f(x) means that f() is like a black box. We give the black box input 'x'; then the black box f() returns output 'y'. (Obviously very relevant to the life of programmers.)
But one of 3Blue1Brown's excellent videos finally showed me at least a few more ways of thinking of functions. This is where a function acts as a map from one "thing" to another thing (technically from Domain X to Co-Domain Y).
So if we think of NVIDIA stock price over time (Interpretation 1) as a graph, it's not just a picture that goes up and to the right. It's mapping each point in time on the x-axis to a price on the y-axis, sure! Let's use the example, x=November 21, 2025 maps to y=$178/share. Of course, interpretation 2 might say that the black box of the function takes in "November 21, 2025" as input and returns "$178" as output.
But what I call Interpretation 3 does is map from the domain of Time to the output Co-domain of NVDA Stock Price.
3. This is a 1D to 1D mapping. aka, both x and y are scalar values. In the language that jamespropp used, we send the value "November 21, 2025" to the value "$178".
But we need not restrict ourselves to a 1-dimensional input domain (time) and a 1-dimensional output domain (price).
We could map from a 2-d Domain X to another 2-d Co-Domain Y. For example X could be 2-d geographical coordinates. And Y could be 2-d wind vector.
So we would feed in, say, location (5,4) as input, and our 2D-to-2D function would output the wind vector (North by 2mph, East by 7mph).
So we are "sending" input (5,4) in the first 2d plane to output (+2,+7) in the second 2d plane.
Thanks for pointing this out. I’ll work on this passage tomorrow.
No. It's not ugly.
I think it is just a matter of perspective. You can both be right. I don't think there is an objective answer to this question.
The author has exclusive claim to their own aesthetic sensibilities, of course, but the language in the piece suggests some degree of universality. Whereas in fact, effectively no one who is knowledgeable about math would share the view that noncommutative operations are ugly by virtue of being noncommutative. It’s a completely foreign idea, like a poet saying that the only beautiful poems are the palindromic ones.
One could say that it depends on your basis...
Honestly, in a purely technical sense, I do find it beautiful how you can take matrix multiplication and a shit-ton of data, and get a program that can talk to you, solve problems, and generate believable speech and imagery.
There are many complications arising from such a thing existing, and from what was needed to bring it into existence (and at the cost of whom), I'll never deny that. I just can't comprehend how someone can find the technical aspects repulsive in isolation.
It feels a lot like trying to convince someone that nuclear weapons are bad by defending that splitting an atom is akin to banging a rock against a coconut to split it in two.
IIRC, working with matrices was much easier using FORTRAN; I would expect modern Fortran kept that 'easiness'.
maybe the issue boils down to overloading the term "multiplication". If mathematicians instead invented a new word here, people would get tripped up less (similarly for 'dot' and 'cross' "products")
I think a lot of issues arise from using analogies. Another one is complex numbers as 2D vectors. It's an OK analogy... except complex numbers can be multiplied whereas 2D coordinates cannot. Your weird new non-vectors are now spinning and people are left confused.
>Matrix algebra is the language of symmetry and transformation, and the fact that a followed by b differs from b followed by a is no surprise; to expect the two transformations to coincide is to seek symmetry in the wrong place — like judging a dog’s beauty by whether its tail resembles its head.
The way I've always explained this to non-algebra people is to imagine driving in a city downtown. If you're at an intersection and you turn right, then left at the next intersection, you'll end up at a completely different spot than if you were to instead turn left and then right.
Matrix multiplication libraries are ugly. They either give up on performance or have atrocious interfaces ... sometimes both.
Using matrix multiplication is also ugly when it's literally millions of times less efficient than a proper solution.
What’s the proper solution for computing the voltage and current flows in a component network, other than modified nodal analysis?
I'm not familiar with that particular problem, but I did use a load-bearing "when".
Can you give an example of when it is not appropriate?
Literally 99% of the crap they're shoving AI into these days.
Anyone who thinks matrix multiplication is ugly has understood nothing about it.
I doubt anyone of the past or present could fully describe what a matrix is, and what its multiplication is. There are many ways people looked at it so far - as a spatial transformation, dot products and so on. I don't think the description is complete in any significant way.
That's because we don't fully understand what a number is and what a multiplication is. We defined -x and 1/x as inverses (additive and multiplicative), but what is -1/x ? Let's consider them as operations. Apply any one of them on any other of them, you get the third one. Thus they occupy peer status. But we hardly ever talked about -1/x.
The mathematical inquisition is in its infancy.
This is just low brow philosophical sounding rubbish of the same variety as "what is 'is'" nobody knows. Sounds profound though.
A matrix is just one way to organize data. When linear operators are organized this way, composition of linear operators maps to matrix multiplication.
But that is just one of the ways multiplication may be defined on matrices; the Hadamard product, tensor product, and Khatri-Rao product are some of the other examples. They all correspond to different mathematical structures one wants to explore or use. If linear-algebraic structure is what one wants to explore or use, then one gets matrix multiplication.
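A quick numpy comparison of a few of those products on the same pair of small matrices (the Khatri-Rao line is just one common column-wise convention):

    import numpy as np

    A = np.array([[1, 2],
                  [3, 4]])
    B = np.array([[0, 1],
                  [1, 0]])

    print(A @ B)            # ordinary matrix product: composition of linear maps
    print(A * B)            # Hadamard product: elementwise, same shape
    print(np.kron(A, B))    # Kronecker (tensor) product: shape (4, 4)

    # Khatri-Rao product: column-wise Kronecker product, shape (4, 2)
    khatri_rao = np.hstack([np.kron(A[:, [j]], B[:, [j]]) for j in range(A.shape[1])])
    print(khatri_rao)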
As someone who never got deeply into math but did get deeply into programming, matrices just seemed like an incompletely generalized data structure with an interesting "canonical" algorithm that can be used on it. In some cases, if you arrange your data into the structure correctly, you can use it to model interesting real-world phenomena.
It feels like Linear Algebra tries to get at the heart of this generality, but the structure and operator are more constrained than they ultimately could be. It's a small oddball computational device that can be tersely written into papers and widely understood. I always find pseudocode easier to follow and reason about, but that's my particular bias.
I get your point, but i think the real issue is -(1/(-1/x)). It is the one that is being overlooked the most in our society, as if it were something normal, but it contains some of the deepest truths imho.
how about -1/(-(1/(-1/x))) ? How many roads must a man walk down before we can call him a man ?
No need of walking, they just need to be able to read post properly before calling him a man.
No you didn't get it. You missed "Let's consider them as operations. Apply any one of them on any other of them, you get the third one."
So is what i wrote a third one? Fourth? Fifth? :)
Not sure what you are talking about. What you wrote reduces to just x. What I meant was: if you substitute, say, -x for x in -1/x, you get 1/x, which is the third inverse. The same is true for the other two pairs. So, if we call them functions f, g, and h, then (writing composition) f = g∘h = h∘g; g = f∘h = h∘f; h = f∘g = g∘f.
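A tiny sanity check of that claim with f(x) = -x, g(x) = 1/x, h(x) = -1/x; the test values are powers of two so the floating-point equalities are exact:

    # The claim: composing any one of the three maps with any other gives the third.
    f = lambda x: -x        # additive inverse
    g = lambda x: 1 / x     # multiplicative inverse
    h = lambda x: -1 / x    # the "third" one

    for x in (0.5, 2.0, -4.0, 8.0):
        assert f(g(x)) == h(x) and g(f(x)) == h(x)   # f after g, g after f -> h
        assert f(h(x)) == g(x) and h(f(x)) == g(x)   # f after h, h after f -> g
        assert g(h(x)) == f(x) and h(g(x)) == f(x)   # g after h, h after g -> f
    print("all three compositions check out")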