What it feels like to work with Mythos

(oneusefulthing.org)

336 points | by swolpers 21 hours ago

58 comments

  • eithed 19 hours ago
    What I find fascinating is that there is so little substance in this article about the quality of the produced code and the medium. Is the code documented and tested? Is it understandable and extendable? Is it secure? What language, framework, database was used? Author mentions judgement and taste - well, is the code tasteful? Will the model rearchitect the entire thing if I ask it to add new functionality, spending another 9.5h in tokens? I assume that the research part is domain knowledge = how different types of travel translate to time making it presentable; how did the author verify this?

    These questions are not even about AI: if I were to give money to a human agency and were given something they tell me works, I would ask the same questions. If I did not know how to evaluate, I would hire people that do. With LLMs the verification part is what bothers me the most.

    • an0malous 17 hours ago
      These posts are never written by software engineers, it’s always some tech exec, retired engineer, or VC. This author is apparently a professor at the Wharton School of Management? None of these people have to ship or maintain real products, they’re just making side projects.

      The only decent software engineering perspective I’ve seen has been from Mitchell Hashimoto.

      • jimbokun 17 hours ago
        Well that’s kind of the point.

        They can just summon bespoke software out of the ether that only handles the use cases of themselves and a few of their collaborators.

        Making “side projects” was not possible for non-developers before powerful LLMs. Now it is.

        • an0malous 16 hours ago
          I don’t think that’s true, I think these authors are making a much stronger claim that AI is proficient or even an expert at software engineering. This author describes how complex and sophisticated their software is, and the only value he’ll concede to “coders” is that there might be a few bugs they’d need to fix.

          Imagine not being an architect and using Claude to put together a building plan, then concluding it’s basically done but we might need a real architect to double check the measurements. It may even be true but I’d be skeptical if it’s always non-architects saying this.

          • 21asdffdsa12 5 hours ago
            And - we kind of have been here before. The "proto"-type is almost complete. It's just a little slow, a little spaghettified, just written in excel-vb, clicked together in node-graphs, or the next hot thing that makes coding unnecessary.
          • bathtub365 13 hours ago
            Why do they even need coders to fix these bugs? It would be an order of magnitude faster (at least) to ask Claude to find and fix them, and it will likely be successful.

            Building in the physical world has physical and time constraints that cannot be overcome, which is one of the reasons architecture (and engineering) are so important in this domain. In software development these constraints were only inherent when people were writing the majority of the software. I feel like I’m seeing what I thought were fundamental constraints being eroded by the increasing speed and correctness of these tools and it’s making me reconsider the importance of some of the values that are held by software engineering.

            It’s obviously dependent on the domain and solution, but if your software can be extremely rapidly rearranged, bugs found and fixed with little effort, and features added with only a minimum prompt, I think the entire definition of technical debt has changed. I’ve been sceptical of these tools and still approach their output with caution. I also worry that, as a software developer, if more can be accomplished in less time there will be less room on this planet for software developers.

            • dv_dt 2 hours ago
              It's quick to build a hut in a green field, but slow to remodel the expanded building after. I think that will remain true regardless of whether a team of sw developers is doing it, or an AI with a product manager, or something in between.
            • phil21 10 hours ago
              > I think the entire definition of technical debt has changed. I’ve been sceptical of these tools and still approach their output with caution.

              This very well summarizes my current thinking on the subject as well. And most of my career has been playing the role of technical debt nazi. Much to the detriment of my earning potential.

              Does AI make incredibly inefficient code most of the time? Yup. But it does it at lightspeed with minimal effort.

              I think many software engineers forget they exist to get real things done (in many cases at least) and they are a cost center for most businesses. If your end product is not selling software, very few people actually Doing the Thing(tm) will give a single solitary care about code quality or maintainability when they can just spend 30 minutes and $15 worth of tokens to fix it.

              It won't take over everything, but I've already seen otherwise very intelligent go-getter type folks who are not technical and don't know how to code make extremely useful things for themselves and their small little enterprises. And this will seemingly only get better and more efficient.

              For someone who really does love the idea of well architected and future-proof code this is just icky to even say or consider. But I'm coming around to the idea that this is the future for the majority of software for most places. And it may have the ability to seriously even the playing field for small enterprises in some industries.

              I'm currently using it to implement a zillion side projects at home I've been "meaning to get to" for years. It makes incredibly silly unmaintainable code most of the time - but I learned to not care, and just tell the AI bot to fix it/add to it as I go along. Worst-case I spend a single night deleting it all and starting from zero to "refactor" an entire thing.

              • prmoustache 9 hours ago
                > I think many software engineers forget they exist to get real things done (in many cases at least) and they are a cost center for most businesses. If your end product is not selling software, very few people actually Doing the Thing(tm) will give a single solitary care about code quality or maintainability when they can just spend 30 minutes and $15 worth of tokens to fix it.

                I am surprised to hear people so naive that they expect their token usage to stay flat if code quality and maintainability start falling exponentially.

                What if to fix 2 bugs your LLM starts adding 50 new ones? Will you tell your customers in the support channel "sorry software is finished, if we try fixing anything, everything else might break, not worth it"? Or "we can probably fix it, but our AI usage will rise so much we need to up the subscription 3 fold, you choose"?

                The speed at which LLMs code is only comparable to the speed at which they add garbage to your repo. If you stop caring about maintainability, you also stop caring about your AI/LLM related bills and the viability of your project past the PoC stage.

                • varjag 4 hours ago
                  The GP explicitly mentioned "end product is not selling software". But even then, bugfixes introducing new bugs were not unheard of before. Most code used to be of mediocre quality, so there's no sea change with AI. Perhaps it even becomes better on average.

                  Another thing though is that selling software in the first place will soon become a tough proposition outside of a few niches.

                • senordevnyc 2 hours ago
                  > I am surprised to hear people so naive they expect their token usage to stay flat if code quality and maintainability starts falling exponentially?

                  There's no reason to think that quality and maintainability will start falling exponentially. On the contrary, these models get better every couple months, and 99% of software isn't actually that complicated. There's just no reason for the fear-mongering that fixing 2 bugs will cause the LLM to add 50 new ones.

                  • queenkjuul 16 minutes ago
                    Except that I witness it create new bugs while fixing existing ones?

                    Not 50:1 but it does happen

              • senordevnyc 2 hours ago
                > I think many software engineers forget they exist to get real things done

                One billion percent. I think the vast majority of the anti-AI sentiments I hear from software engineers comes down to them caring more about playing with their tools than actually solving the problem.

              • locknitpicker 8 hours ago
                > Does AI make incredibly inefficient code most of the time? Yup. But it does it at lightspeed with minimal effort.

                This hits the nail on the head.

                Detractors often hang on to examples of coding assistants making mistakes or outputting subpar code, but they somehow miss the fact that coding assistants can also be prompted again and refactor whole swaths of code just as fast as they introduce oopsies. This means that the worst case scenario implies fast convergence to an acceptable outcome, and from there also fast iteration to improve upon that.

                • rglullis 5 hours ago
                  The problem is that this approach is not sustainable. Errors compound. The cost to fix one issue might seem small at first, but over a stretch of time all these "oopsies" become architectural spaghetti that can only be fixed with a complete rewrite, which will certainly become more expensive than getting the code "organically" developed.

                  The only way I see AI coding working in the long run is if we go back to a Waterfall/BDUF process and have actual engineering. Let engineers really own the architecture. Require that any new feature - no matter how small - be specced out with complete sequence diagrams. Ensure that every new software package is put on a UML component diagram for the team to review and see how each addition interacts with the whole system, etc.

                  If we do that, then we can just give all the documents to a coding agent and say "go ahead and implement this" with a minimal amount of confidence. But in doing this, I bet we will realize the following:

                      - the "effort" has never been about writing code itself. The code is just the material manifestation of all the thought that went into working out a solution to the problems that the product is attempting to solve.

                      - we will likely be better off using code generation tools (i.e., UML-to-code) and a "weak" LLM (that can run locally) than playing the token lottery at the Anthropic Casino.
                  • eithed 3 hours ago
                    I mirror your thoughts. I think we'll end up with "perfect map" paradox = you cannot be vague or indecisive on what you want (and if you are then these decisions don't matter) and you're creating a 1:1 representation of what the code needs to be.

                    I'd substitute "owner" for the team and in that sense the owner will not need to be human.

                    We're at this state where Claude is great at doing the "middle" part of work, but it's crap at gathering requirements and verification of what it has done. I also don't see people caring about these aspects of software development as shown in the article

                  • locknitpicker 2 hours ago
                    > The problem is that this approach is not sustainable. Errors compound. The cost to fix one issue might seem small at first, but over a stretch of time all these "oopsies" become architectural spaghetti that can only be fixed with a complete rewrite, which will certainly become more expensive than getting the code "organically" developed.

                    That's so far been called software development.

                    All software developed by people suffers from this issue.

                    Where exactly is the novelty?

                    > The only way I see AI coding working in the long run is if we go back to a Waterfall/BDUF process and having actual engineering.

                    Nonsense. The problem is exactly the same.

                    With agents iterations are much faster, and this can mean things can get messier faster but can get in shape just as fast.

                    Ironically, agents improve the quality of the deliverable as well. Approaches such as spec-driven development do a far better job delivering features up to spec than manual coding by flesh and blood developers.

                    There's an awful lot of baseless scaremongering in your post. You make it sound like with AI assisted coding developers stopped paying any attention to quality.

                    • shakna 27 minutes ago
                      > Where exactly is the novelty?

                      The compounding speed. Your devs might reach a point where they have to rewrite and refactor, in a decade.

                      Your LLM, with its higher throughput, may put you in that game breaking situation next week.

                    • skydhash 34 minutes ago
                      > All software developed by people suffers from this issue.

                      And that’s pretty much where you are wrong. Take any long-running open source project and you can see the craftsmanship that goes into it. It may not be perfect, but hacks are clearly marked as such.

                    • rglullis 1 hour ago
                      [dead]
                • dkersten 7 hours ago
                  I haven’t used Fable/Mythos yet, but my experience with recent versions of Opus, GPT 5.5 and recent Chinese models is that prompting again isn’t guaranteed to fix the underlying issues, nor is it guaranteed to not introduce more issues. I’ve seen SOTA models make ridiculously stupid architectural decisions that they were then unable to back out of without being prompted very specifically, instead adding a patchwork of “fixes” on top.

                  I’m not saying that you can’t use AI to do it because I believe that with carefully controlled workflows and context management you can, but it’s not a simple prompt away, it requires guidance and understanding, and isn’t the speed demon that raw prompting is.

                  • locknitpicker 7 hours ago
                    > I haven’t used Fable/Mythos yet, but my experience with recent versions of Opus, GPT 5.5 and recent Chinese models is that prompting again isn’t guaranteed to fix the underlying issues, nor is it guaranteed to not introduce more issues.

                    That's not really the point though. That presumes models are only useful if they are one-shot models. That is false.

                    I mean, what if your prompt successfully changes 20 source files and makes a mess in one? How much work did it save?

                    And the elephant in the room is when models actually outperform whatever the prompter is able to deliver, and faster. That is somehow left out.

                    • dkersten 7 hours ago
                      > That presumes models are only useful if they are one-shot models

                      That’s not at all what I’m saying.

                      I’m saying that in my experience across multiple models, the follow up prompts don’t fix prior underlying issues. They usually patch on top instead, unless you give them significant and time consuming guidance.

                      I want them to be more useful outside of one-shot uses, but I find that they currently miss the mark.

                      • locknitpicker 1 hour ago
                        > I’m saying that in my experience across multiple models, the follow up prompts don’t fix prior underlying issues. They usually patch on top instead, unless you give them significant and time consuming guidance.

                        That's not my experience at all, and I have been using models that are far from being cutting edge. Even in the cases where a model generates utter nonsense, a couple of clarifying questions is all it takes to get it back on track.

                        But that might be a factor of the project being worked on, and the extent of the changes being asked for.

                • asoderlind 7 hours ago
                  I think this is overlooking the fact that assigning a coding assistant to fix the bugs it re-introduces for all eternity just leads to spiraling token costs, which might cost more than just hiring a competent engineer in the first place.
                  • sfn42 1 hour ago
                    This has been a debate forever, long before LLMs. On the one hand you have people who don't care, on the other you have people who produce good code.

                    Doesn't matter how fast you can make the wrong thing.

                • eithed 7 hours ago
                  Don't forget that you can adjust your requirements (either via plan or skill) to ensure the mistakes do not happen. The problem is that neither LLMs, nor humans (that don't work with the domain) will know they made these mistakes. Even coders don't think about everything all the time
                  • brazzy 4 hours ago
                    > Don't forget that you can adjust your requirements (either via plan or skill) to ensure the mistakes do not happen.

                    No, you can't. Adjusting prompts ensures absolutely nothing.

                    • eithed 3 hours ago
                      I disagree. What I should have added is that with agents (as well as humans) you do need to have tests that verify what was done.
                • vrganj 7 hours ago
                  In my experience, the refactors are just as bad, just in different ways. All you end up doing is treading water with different iterations of shitty code. By the time you get somewhere acceptable, you could've just fixed it up yourself.

                  My preferred workflow these days is to pair program with an LLM until it gets close-ish and then manually touch it up. Without that, it just produces junk in different forms.

            • mitxela 2 hours ago
              Technical debt remains the same. LLMs are found not to work as well when editing messy codebases - exactly the kind you get after using an LLM for a while. After a few weeks or months you have to either throw it away and start over, or involve a human at exorbitant prices.
          • squidbeak 5 hours ago
            > I think these authors are making a much stronger claim that AI is proficient or even an expert at software engineering.

            The author specifically says:

            > I am sure it is not perfect (I only spent an hour working with the results), but a software engineer would iron out the remaining potential bugs that I could not find quickly (which is one reason we may need more, not less, coders in the future, to help with the explosion of new uses for software)

            which acknowledges pretty clearly that engineers bring a level of insight and experience still missing from Mythos. Saying that, I totally disagree with his contention that this will always be true. It's pretty weird that the author of an article stressing the steep improvements in a model's capability can't seem to imagine further improvements in that capability. As if Mythos is where development ends or whatever gap remains between models and experts won't steadily narrow or eventually widen in reverse.

        • bandrami 9 hours ago
          Well, right, but if the real use case for LLMs is "making software that wasn't economical to make before" that's bearish for the labs because it means they're only going to be chasing the low end of the market.
        • SpicyLemonZest 11 hours ago
          It is, and it's cool that it is, but the calibration is important. Statements like this:

          > With Fable the spell has gotten powerful enough that I am no longer sure I am the wizard. I am closer to a patron. I describe what I want, I pay for it, and I judge the result. The conjuring happens somewhere I cannot watch, in hundreds of small choices I never get a vote on. The work has shifted from process to outcome. I no longer steer; I commission.

          have a very different meaning coming from a non-technical researcher than they would from someone who builds software for a living.

        • shimman 16 hours ago
          Making side projects isn't a trillion dollar industry tho, adding to the fact that we are facing another global supply chain crisis due to the Iran War; the US is about to commit the biggest self-own ever in the history of empire.
          • Schmerika 4 hours ago
            There are actually quite a few trillion dollar industries that exist thanks to "side projects".

            Apple was Woz's side project, once upon a time. Adsense came from Google's 20% time. Social media started as a side project.

            Forests grow from trees. Trees grow from seeds. More potential seeds = more potential forests.

            • queenkjuul 6 minutes ago
              All the undiscovered Woz's of the world add up to a trillion dollars? There's $1T of money out there waiting to be spent on side projects?

              The question was "are side projects a trillion dollar industry" not "has a side project ever started an industry"

              How much of a new $1T software product will Anthropic capture in token costs, anyway?

          • zelphirkalt 6 hours ago
            The US has been on a course of self-owns ever since Trump got into office. That they still are a dominant power on the globe shows how much they were one before Trump, but it seems to be changing. At every self-own they commit, China laughs and inches up a little closer. I think we will see the day, when they are evenly matched in our lifetimes.

            But which self-own exactly do you mean, of the many there are?

    • cgearhart 18 hours ago
      I’m starting to realize that LLMs are really good at building low-stakes projects. Your questions mostly presume that the stakes are higher. The software will last a long time; the requirements will evolve; we can’t tolerate mistakes; etc.

      The trick to getting good at using LLMs for software is to learn how to make _all_ projects low-stakes.

      • qaq 18 hours ago
        You don't need an LLM for that. You make _all_ projects low-stakes by working on a greenfield project using (insert buzzword soup of the day) and leaving for a new greenfield opportunity (that requires experience with buzzword soup of the day) before the project ships.
        • DrJokepu 12 hours ago
          No, what you’re describing still requires you to do some actual work, and also, while you work there, there is still some level of accountability. A much, much better grift is coaching.

          Like, an AI coaching session for executives at the yearly executive retreat. You show up, spend a few hours going through some nonsense slides ChatGPT put together for you, you charge an eye watering fee for it, HR or whoever organizes it will gladly pay for it because it will make them look all cutting edge in front of the CEO, by the next day everyone will forget about it. No accountability at all!

        • majormajor 10 hours ago
          In the LLM world you never get a chance to get paid to work on those greenfield projects because the person with the idea is churning the prototyping and discovery work themselves.

          If you want to get paid to work on software, you get involved after it's found success and the stakes get higher.

          (Which assumes there are still significant areas where economies of scale reward that vs everybody just having their own DIY version of everything.)

          • mcv 1 hour ago
            You've got to be the person with the idea. I'm currently doing that. I spent the past year working on a frustrating project where everybody else did everything wrong, so now I'm building it on my own, hoping to sell it to them. (No idea if that will work)
          • owlbite 9 hours ago
            Or economies of liability and buck passing. I suspect managers and businesses will still want to be in the game of "not my fault, supplier is working on it, we can sue them if they don't meet SLA".
      • dchftcs 12 hours ago
        If there's a viable way to make all projects low-stakes we'd have done it. Consider this: microservices.
      • rpdillon 16 hours ago
        This is really insightful, but I think it also extends to making the project either low stakes or low complexity. I have this lurking feeling that the preferable architecture for software will change as a result of LLMs because they're good at working on low complexity modular components more than they are on high complexity million-line code bases.
        • ncruces 12 hours ago
          You'll just shift complexity to the orchestration of the modular components.

          Monoliths vs micro-services.

        • majormajor 10 hours ago
          They aren't necessarily as great at building low-complexity high-modularity components, though. ;)

          Unless you know enough to tell them to! And keep them honest about it...

      • skywhopper 3 hours ago
        But not all projects can be low stakes. None of the important ones are.
      • acedTrex 18 hours ago
        > The trick to getting good at using LLMs for software is to learn how to make _all_ projects low-stakes.

        this doesn't really work in the real world. There are many things that actually matter, engineering is fundamentally about handling them.

    • soraminazuki 13 hours ago
      Welcome to every LLM discussion in the past 2 years or so. When asked for anything of substance, we're faced with a barrage of "but humans aren't good at this either!" Very little quantifiable evidence and lots of pure rhetoric.
      • skydhash 13 hours ago
        I’ve seen this pattern again and again, and I don’t bother replying. There’s also the “strong statement, and when you contradict it, they point out some particular circumstances that no one cares about”.
        • munksbeer 5 hours ago
          I think a lot of us have stopped talking to each other about this. I see it the other way round to you. I see constant scepticism and doubt that LLMs can build anything useful, and whenever provided with examples, the goalposts just move.

          And at my own firm, I think every developer is generating most of their code using agentic coding. We're still sceptical enough that we are doing the usual heavy handed human review process, so we're not seeing a huge speed up in delivery times, but we are seeing a volume increase. That is because writing the changes and raising the PRs is much faster, but also a lot of boring admin and support work is now mostly done by LLMs. Reports of instability, vague client requests, etc? Throw the LLM at them and it usually figures it out while I continue to engineer.

          So I know, first hand, that these things are very good. I also know second and third hand that pretty much every fintech in the industry is as heavily using agentic coding as we are.

          And then I come to HN or reddit and I see people telling us that they cannot write decent production code, and this is just wrong. This isn't opinion wrong, it is objectively wrong. Any fintech that wants to keep up will tell you this.

          I can't speak for other industries but I can't imagine they're different.

          So, I'm not sure what to conclude from this. I don't want to be uncharitable, but when HN/reddit posts just don't match the reality I see for myself, I have no choice but to categorise them as being emotionally driven to stick to a particular narrative, and so I can dismiss them.

          • queenkjuul 0 minutes ago
            I use Claude Code at a fintech, and I'm seeing garbage PRs from careless coworkers all the time. I'm having to correct Claude output regularly.

            Yes, it does nearly all the typing for me now. But left to its own devices, it'll happily spit out awful code.

          • teliosix 1 hour ago
            It is all the same narratives from around the invention of the power loom if you look into it.

            What I take from that time also is that the hand loom weavers were not incorrect. The power loom did not do as good of a job as they did by hand.

            You can still buy a hand woven shirt today at a premium price.

            There is a category error as if quality is the product as opposed to one input of the product.

            You probably don't get to be a master craftsman without that quality mindset so they aren't wrong but missing the forest for the trees.

          • skydhash 3 hours ago
            > I see constant scepticism and doubt that LLMs can build anything useful, and whenever provided with examples, the goalposts just move.

            > I see people telling us that they cannot write decent production code, and this is just wrong.

            At least for me, that has never been the counterpoint that I’ve been making. I’ve never cared about code itself, especially with languages like Java and Kotlin, where you basically autocomplete most of the code, and with SDKs like iOS where you can collect snippets for most of the patterns that you need. And with frameworks like Laravel, where most big additions are done with the tooling. And because code is so repetitive, editors like emacs and vim have lots of features and plugins to help with copying and pasting (registers, macros, navigation, snippets,…)

            And the fact is some code you wrote today will be worthless tomorrow and will be replaced and deleted. So, it’s very rare to care about some particular snippets or patch of code.

            What I, and others, have been complaining about is the quality of the codebase and the sustainability of the practice. Especially with the associated claims about increased productivity.

            I care about correctness. Simplicity and a reduced amount of code increase my confidence that I can achieve it. New features, until tested in production, are more likely to decrease the reliability of the software. And with each fix for a bug, I need to make sure that I’m not adding five more.

            To this day, I’ve not seen any compelling argument about writing better code reliably. I’ve seen a lot about writing more code. It’s like a manager thinking that if you’re not at your computer typing, you’re not working.

            > We're still sceptical enough that we are doing the usual heavy handed human review process, so we're not seeing a huge speed up in delivery times, but we are seeing a volume increase

            Are you seeing a quality increase? Fewer customer bugs, fewer outages, faster resolution? Are you measuring those?

            • munksbeer 49 minutes ago
              > Are you seeing a quality increase? Less customer bugs, less outages, faster resolution? Are you measuring those?

              We're not at the stage to measure yet. We may be behind others, not sure. Actually, this isn't quite true. I was interested, so I created an ad-hoc report (with AI) on PRs landed per week over time. This has gone up over the last 6 months. But it is hard to say why that is. It might just be that people are raising smaller PRs because it has become easy to have the AI split things up, while before, people were too lazy to do this.

              Our bottleneck is still that we want humans to review. Sometimes we spot errors, but our pre-existing testing frameworks are very robust already, so if these pass, we're very confident to release to production, and the agent is excellent at understanding the existing testing frameworks and adding to them for new stuff.

              So in our team, we don't often see blatant logic errors. It is mostly to do with things like using a pattern that is used elsewhere in the codebase (or not at all) and doesn't belong in our specific section of the code (we have a large monorepo). These become fewer as we enhance our ruleset (AGENTS.md or CLAUDE.md) for our particular developers.
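
              A minimal sketch of the kind of PRs-landed-per-week report described above, assuming the merge dates have already been exported from the code host. The function name `prs_per_week` and the sample dates are hypothetical, not taken from the actual report.

```python
from collections import Counter
from datetime import date


def prs_per_week(merge_dates):
    """Count merged PRs per ISO (year, week), sorted chronologically."""
    counts = Counter()
    for d in merge_dates:
        iso = d.isocalendar()  # (ISO year, ISO week number, ISO weekday)
        counts[(iso[0], iso[1])] += 1
    return dict(sorted(counts.items()))


# Hypothetical merge dates, for illustration only.
sample = [
    date(2026, 1, 5),
    date(2026, 1, 7),
    date(2026, 1, 12),
    date(2026, 1, 13),
    date(2026, 1, 16),
]
weekly = prs_per_week(sample)
for (year, week), count in weekly.items():
    print(f"{year}-W{week:02d}: {count} PRs")
```

              A raw count like this cannot tell "more work landed" apart from "the same work was split into smaller PRs"; tracking lines changed per PR alongside it would be one way to separate the two.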

              • skydhash 0 minutes ago
                > And then I come to HN or reddit and I see people telling us that they cannot write decent production code, and this is just wrong. This isn't opinion wrong, it is objectively wrong

                So how can you justify this comment of yours from your reply if you’re not measuring anything? Mind you, I can easily get good results from AI tools, but I don’t like the experience and the code is often over-engineered and drifts away from my target architecture.

                But the worst is quickly losing sight of the tiny technical details that matter when solving bugs or altering features. I don’t like typing code. What I like is to be able to go directly to the code that I need to change, modify it, and then verify that it works. Most of my time is spent deep thinking about the design of the software which is orthogonal to code.

                And if there is one thing that is common about people fully onboard with LLM is that they can talk about the product, but they can’t argue about its behavior and its correctness. There’s no intrinsic model that they can compare with the real code. They don’t know the edge cases, the technical pitfalls, how the software will react if you modify one component. Any brainstorming session quickly turns into a slog because they cannot contrast approaches anymore. You can see the decay of understanding in realtime.

      • viking123 4 hours ago
        Yeah, never concrete examples from these guys.

        I am creating a game and I can say that with the coding part the models help a lot, mostly gpt 5.5 high. Tbh to me all the frontier models feel the same and they can all solve the stuff I do quite well with some guidance and prompting. But that kind of makes me appreciate the other stuff more like visual style, sound design, mechanics etc etc. Tons of work still.

        For brainstorming I find the models bad nowadays or maybe I am just too critical of the results

    • coldtea 18 hours ago
      >What I find fascinating that there is so little substance in this article about the quality of produced code and the medium.

      I clicked one of his examples intrigued "a snake game where the snake is self-aware and crazy things happen;". Played for 1-2 minutes, and it's the classic 1980s snake game. Am I missing something? What is "self-aware" about it? Some funny messages at the bottom of the screen? And what are the "crazy things"?

      • starshadowx2 17 hours ago
        It sounds like you either didn't play enough or you are missing the new mechanics that get added over time. There's definitely more to it than just regular snake.
      • kesor 12 hours ago
        You didn't play long enough. There are layers and layers and layers of features in that game if you play for 10 minutes or more.
      • vunderba 17 hours ago
        I had the exact same thought. To me, it feels like they just took the fairly common “sentient video game character” trope and bolted it onto a very conventional snake game.

        I will say, the way the act of eating creates a "bulge distortion" that flows down the length of the snake is a nice touch though.

    • spicyusername 14 hours ago

          the quality of produced code and the medium
      
      A thought I have been tossing around in my head as the models get better is that it really may not matter what the code looks like.

      If the observed behavior of the software is good, then the software is good. If a bug, of whatever kind, can be fixed by a model on a vibe-coded codebase, then that's a fixable bug. If there are no exploitable vulnerabilities, then the code is secure. If the performance is adequate, then the code is performant.

      It simply does not matter what the code looks like if, from the outside, it does what it's supposed to, and, from the inside, a model can fix the issue if one is found.

      More than ever, software engineering is now really a job about making sure the code is doing what it's supposed to.

      And even if it DOES matter what the code looks like, you can have a model fix that too.

      • eithed 3 hours ago
        Don't forget that LLMs are trained on human code. If they cannot understand what your code does then they cannot make changes to it, or at least - having them understand your codebase becomes expensive (more trips to Anthropic servers)
      • skydhash 12 hours ago
        The thing is that a lot of code relies on multiple layers of abstractions with their own correctness and failure states. And then you overlay the domain correctness and failure cases on top of that.

        But all of those notions of correctness are imaginary. The hardware only enforces a few (and it may be buggy). The OS adds some more (and it’s buggy). The compiler/interpreter may have bugs (but that’s rarely a nuisance) and the libraries are often brittle. There are cracks everywhere in the tower of abstractions.

        The code has never mattered. What has always mattered is knowing the model of correctness of the software ("Programming as Theory Building" by Naur), so that you can discern where a program is wrong.

        The thing is, a crash or some other immediate error is actually nice to have. You get to react immediately and can have a core dump or a stack trace that points you to the error. What is truly a terror is silent corruption (wrong order of operations, wrong values for a comparison that has expanded the idea of correctness, security issues that have been backdoored for years,…).

        As Hoare said:

          There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies and the other way is to make it so complicated that there are no obvious deficiencies.
          The first method is far more difficult.
        
        LLMs are very much the second kind. You write a lot of complicated code, and then you can no longer reason about its correctness.
        • gofreddygo 9 hours ago
          > There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult.

          That is so real. Brilliant !

    • hypfer 19 hours ago
      Being the first to release an article gives you great SEO or whatever. Doing the things you've mentioned takes time.
    • andai 10 hours ago
      These days it's uneconomical for humans to verify AI generated code. So we ask the AI to do it. Like when we asked the FBI to audit itself and they found no problems :)
    • jstummbillig 19 hours ago
      Less fascinating when you consider that this is a non-coder's perspective.
      • CobrastanJorji 9 hours ago
        It's still fascinating, but for a different reason. The "Concord" tool that got created bills itself as "Instrument-grade measurement of qualitative text. Explore in minutes, publish with honest statistics." Instrument-grade! How wonderful! That presumably means its accuracy has been ensured, and it's been carefully calibrated, right? What, nobody's ever measured or even examined the code? Well, no matter, let's go ahead and publish it and advertise it as "honest" "instrument-grade measurements."
      • eithed 18 hours ago
        Fair enough, but entrepreneurship should, I guess, ask whether a given Next Big Thing has substance behind it or is just snake oil.
        • munk-a 18 hours ago
          Ah, but billions of dollars depend on those questions not being asked in a genuine manner. Don't you want a slice of that or are you an... AI skeptic *thunder clashes*
      • unholiness 18 hours ago
        Yeah, this made it basically clickbait for me, in terms of time I wasted with the wrong expectation.

        The lack of downvotes on posts on HN has always felt like more of a bug than a feature to me.

      • nomel 18 hours ago
        So, the perspective of the one that gains the most, that will value this the most, and that will pay the most? ;)
    • chickensong 16 hours ago
      You probably don't care about the ingredients or engineering of asphalt, only if the road does its job well or is filled with potholes. Outside of the software industry, nobody gives a shit about code or databases.
      • geraneum 8 hours ago
        > You probably don't care about the ingredients or engineering of asphalt

        Everyone does. You don’t think about it every day because we’ve delegated it to experts who don’t come up with a new composition of asphalt every time you press “generate”. It’s rigorously battle tested and, short of intentional negligence, it’s consistent. I’m amazed how people are forgetting how the world actually works.

        • eithed 7 hours ago
          Exactly - the normalization of craft (?) is interesting
        • chickensong 7 hours ago
          You've missed the point.
          • geraneum 7 hours ago
            The point doesn’t seem to have been thought through.
            • munksbeer 5 hours ago
              The point is, if road engineers changed their process and materials, and to you it felt like driving on the same road, with the same wear and tear and potholes, you wouldn't even notice.

              If AIs can generate code that looks ridiculous to humans but over time has the correct performance, the correct behaviour, no-one outside of software engineers will know or care.

              • mitxela 2 hours ago
                But they don't. LLMs can't understand messy code much better than humans can. Maybe a little, but not enough to compensate for the code they create being messy.
              • skydhash 3 hours ago
                > The point is, if road engineers changed their process and materials,

                They do those in labs, and then studies are made to prove that it can replace the current composition. They do not invent those on the spot and let the drivers QA the road.

                > If AIs can generate code that looks ridiculous to humans but over time has the correct performance, the correct behaviour

                It’s on you to prove that this big “if” can be realized. A -> B only matters when A is true.

                • munksbeer 1 hour ago
                  > It’s on you to prove that this big “if” can be realized. A -> B only matters when A is true

                  Not really. This is a discussion about what code looks like if AI can write applications that are as good, stable, correct as humans.

                  I think they can, better than most programmers at the moment, with the correct guardrails and supervision. But in time, I think we may not need to review the code at all, but instead verify correctness and performance only. The AI can write the code however it likes.

                  Obviously I don't have a proof for this, but based on the progress I've seen so far, if someone forced me to bet one way or the other, this is what I'd bet on.

      • eithed 16 hours ago
        I agree. But if I'm paying for the road (even as a taxpayer) I get angry that after a year it's full of potholes and that there are unnecessary signs warning about penguin crossing, making it cost 2 times more than it should have (and don't get me started on why this road is really a highway leading to my house). I'd want certain qualities. And this article is basically = you will get a road, built quickly

        But yes, you are right - I don't build roads and don't know what it costs to build a road or how to judge the quality of a correctly built one, nor will I ever care or learn.

        • aix1 10 hours ago
          > And this article is basically = you will get a road, built quickly

          That's not how I am reading it. You will get a road built exactly to your spec, quickly. So no penguin crossings unless you ask for them.

          I am also not entirely sure how the pothole argument translates.

          • eithed 9 hours ago
            The road will be built to some specs, including features nobody asked for. If the corpus was trained for roads built in the Arctic, you will get penguin crossings.
      • Tylerian 14 hours ago
        The ingredients and composition of the tarmac are the difference between having the road full of potholes after a week of use and not.
      • fwip 12 hours ago
        Sure, but if there's a trillion dollar company saying that it's going to replace all our road workers or engineers - I'd want to listen to the opinion of an expert. Some reporter from CNN driving over it like "yeah seems good to me, good this" has approximately zero persuasive power to me.
    • sexylinux 7 hours ago
      It still does make errors, yes? Because it is not usable, if we need to verify everything. AI is only interesting if it can do things that humans can not do. If you can verify results because you can do it yourself, then why use AI? It will just bind highly skilled people to do verification work. Instead these people should do the actual work, results will come quicker.

      So AI is only interesting to you / your org / humans if it can do things that you can not achieve. But if it still makes errors, how could we ever know that a super-invention by AI is not wrong?

      If we can not rely on the correctness of the result, it is not usable at all. AI must create reliable and correct results always. That was a very fundamental requirement for computing. This problem has not been solved.

      • fisf 3 hours ago
        By that measure, most software developers should be unemployed.
    • markoloko 11 hours ago
      So would you be more comfortable if the user then just prompted the AI to use a specific language, framework and database? Aren't we all just going to reddit and finding out what all goes best with what? But also I don't trust nothing from it, even though I've seen it.
    • jimbokun 17 hours ago
      Does it matter to the people requesting the software if it acts in the way they expect?
      • crystal_revenge 12 hours ago
        We've lived in a software bubble for so long, most software engineers have completely forgotten that the purpose of (most) software is to solve a problem. If that software solves the problem well and reliably, the quality of the code doesn't matter.

        In fact, that's the entire reason we care about "quality code", because we assume that quality code is code that does what you expect well and consistently.

        I say this as someone who hand writes code pretty much every night for fun, just to experiment with computation. Which, oddly, is more fun than ever because I don't feel like there's any need to connect this type of programming with "real world software", and I can really enjoy code for its own sake, meanwhile my job is mostly just running agent loops (which I quite like as well).

        • munksbeer 5 hours ago
          Exactly. Quality of code is a programming invention to make it easier to write and maintain correctly functioning applications.

          That is the entire purpose of "quality of code".

          If the end user experiences a correctly performing application, now, and in the future, they don't care at all what the code looks like.

          AIs could resort to a single global array of primitives and forget all about functions, and just use gotos if it helped them (it probably doesn't).

        • SpicyLemonZest 11 hours ago
          I haven't forgotten that, I affirmatively think it's false. High quality code is necessary to solve problems reliably. Perhaps some people call things code quality when they don't matter (I really don't care what most variables are named), but there have always been teams who try to increase velocity by disregarding code quality, and from what I've seen AI does not stop them from shipping outages constantly.
      • eithed 16 hours ago
        True, but you should say that about everything. Does it matter to you how the car drives, as long as it takes you to your destination? Well, yes, it matters: how will it deal with a crash, is it possible to replace a part, and can anybody just open it if you leave it outside. I will be amazed if somebody shows me their home-printed car, but if they try to sell it to me like a new one...
    • grafporno 18 hours ago
      It's an ad.
    • otabdeveloper4 8 hours ago
      Don't harsh my vibes, man.
    • jknoepfler 1 hour ago
      There also isn't any meaningful articulation of why this is a "leap forward"... literally everything claimed in the article has been claimed in the same breathless tones in articles written a year prior.

      I get that there's little sense in arguing with the MBA hivemind, but... c'mon.

      I manage two teams of highly motivated, largely pro-AI engineers. Both teams have independently concluded that they needed to ramp down GenAI usage because of code quality / maintainability concerns. Both teams have suffered from protracted outages caused by LLM jank not being sufficiently fenced off and guarded against. Both teams have expressed concern that the code generated by LLMs is far too verbose, full of slop, and rapidly becomes an unmaintainable mess.

      These are teams that are building non-trivial LLM solutions (deep agentic data synthesis and multi-modal data tagging). They are using the technology creatively and pro-actively, not just vibe-coding slop and throwing their hands up when it fails. Both teams will continue using GenAI coding agents, don't get me wrong - but the gains are incremental, not transformative, and need careful fencing to make sustainable.

      Nothing in these articles resonates as real. People who work in reality don't agree. I don't understand why this shit keeps getting attention (or rather I do, but the reasons aren't good).

    • danlugo92 2 hours ago
      You can either adapt or survive man, coping and negation don't help, AI is here to stay and yes it does require pilots but this map would have taken you weeks to do, the AI did it in 10 hours, you can still dedicate a week to refactor.

      Also this is easily solved by .md spec files, this whole "bad code" cope is just FUD

    • kordlessagain 21 minutes ago
      [dead]
    • adamtaylor_13 19 hours ago
      I'm becoming more convinced these are questions of the Before Times. Yes, yes—heresy, I know.

      Yet, I can't deny the reality that I observe working with LLMs every day. If this truly is a step-function (as some are suggesting), then I have absolutely zero concern for the quality of the code.

      • fwip 12 hours ago
        Kind of a circular argument, isn't it? "Some people are saying it's very good at coding. If that's true, I don't care if the code is good."
        • adamtaylor_13 2 hours ago
          I didn't say I don't care if the code is good.

          I said I had zero concern for the quality of the code. That is, I do not have concern that the quality of the code will be a concern in and of itself.

          It's a subtle, but IMO important difference. We only care about code quality insofar as it gives us stable, understandable systems. Historically that meant a human had to read and understand it. Suppose a future where that's no longer the case; then we may still end up with stable, understandable systems without understanding every minutia of the substrate. It's the same way I don't really know if my compiler is correct, but the behavioral patterns of my code suggest it is without me understanding anything about its code quality.

  • anonzzzies 1 hour ago
    Been working on my pet project today with Fable; it seems pretty solid but not too far removed from 4.8; same hallucinating, same type of bugs, same focus in large projects on just doing what you ask and just ignoring whatever that may touch/break/influence. Running tests in the beginning but when fuller context, just 'will run later' and never doing it in the end unless you tell it to (using some assorted swear words). I will keep using it but it's incremental as far as i'm seeing, not the OMG OMG OMG Mythos is here!
    • LogicFailsMe 1 hour ago
      It clearly saw things immediately that 4.8 had missed on my projects. But shortly thereafter, having step functioned past those issues and impressed the crap out of me doing so, it got stuck in the usual endless loop talking about stuff more than doing stuff, occasionally deciding to pause so I'd have to whack it to get it going again.

      So nope, not the AGI. But definitely an improvement.

      • justinclift 25 minutes ago
        > it got stuck in the usual endless loop talking about stuff more than doing stuff

          That's the kind of behaviour I've seen in Claude Code (Opus 4.8) when its context space is over the 40-50% range.

          I tend to keep an eye on the context usage (i.e. `/context`) quite a lot, and generally see good results as long as the context usage is ~30% or below.

        Which isn't heaps, considering having to ensure it has the required docs/stuff it needs can take 15-20% of context by itself.

  • JumpCrisscross 19 hours ago
    Anecdote: I fed Fable some models I’ve been hand verifying (basically, I sketch out a scenario for Opus to model, it builds it, I ask it to show me the math, I correct it, we iterate like this, then I double check its code to make sure the math matches the model logic). Fable found almost every error I found, and then had some interesting suggestions for additional variables.

    It also burned through my usage quota like a late-90s Hummer.

    • matheusmoreira 17 hours ago
      > It also burned through my usage quota like a late-90s Hummer.

      Yeah. I have a Max 5x subscription and Fable burned through 16% of my weekly quota in a 40 minute code review session. It didn't even finish the review, it switched back to Opus 4.8 in the critical memory safety parts where I actually needed Fable.

      I feel like I'm going to get priced out of these models soon. I should probably try to get the most out of Fable until June 22nd.

    • cyanydeez 19 hours ago
      now for the best question: whats your ROI here?
      • Ferret7446 18 hours ago
        Humans are very expensive, so the equation almost always falls against them.

        It's not just salary, but also safety/labor regulation, legal risk, vacations, sick time, personal conflicts, HR, benefits.

        Even when automation is more expensive on paper, it's generally still cheaper

        • rstuart4133 16 hours ago
          > Humans are very expensive, so the equation almost always falls against them.

          You underestimate what these models cost. Uber's budget is $1,500/dev/month. I gather that was put in place because the devs were going through $6,000/dev/month, which Uber decided could not be cost justified.

          Fable costs at least twice as much, or $12,000/dev/month.

          Fable can apparently work for hours without supervision, which means a skilled engineer can now have it working on many tasks concurrently. I would not be at all surprised if they can put a nought or two on that number. If you do that, you are well out of "what a human costs" territory.

          • amarant 5 hours ago
            Not to argue myself out of a job, but I cost around $20k/month, all costs considered (taxes, social fees, PTO, healthcare, benefits). If my efficiency is tripled (which it absolutely is, even before Fable) for a mere 6k/month (in reality, 1k is more than enough though), that's ~10x ROI.

            I kinda get why execs are excited
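
            One reading of the ~10x figure is total tripled output divided by tool spend. The sketch below computes that ratio and a stricter net version that only counts the extra output. It assumes, purely for illustration, that an engineer's output is worth exactly their fully-loaded cost; the function name `roi_ratios` is hypothetical.

```python
def roi_ratios(monthly_engineer_cost, multiplier, monthly_tool_cost):
    """Return (gross, net) value-to-tool-spend ratios.

    Assumption: baseline output is valued at the engineer's monthly cost.
    gross = all output after the speedup / tool spend
    net   = only the additional output / tool spend
    """
    gross_value = monthly_engineer_cost * multiplier
    gross = gross_value / monthly_tool_cost
    net = (gross_value - monthly_engineer_cost) / monthly_tool_cost
    return gross, net


# Figures from the comment above: $20k/month engineer, 3x efficiency, $6k/month tools.
gross, net = roi_ratios(20_000, 3, 6_000)
print(f"gross: {gross:.1f}x, net: {net:.2f}x")
```

            Either way the ratio stays well above 1 under these inputs; the result depends entirely on whether the 3x multiplier holds.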

            • popcorncowboy 16 minutes ago
              > If my efficiency is tripled

              Our 401ks turn on this actually being true. Otherwise pop.

          • swiftcoder 5 hours ago
            > You underestimate what these models cost. Uber's budget is $1,500/dev/month

            $1,500/month needs to be contextualised against the fully-loaded cost of a software engineer. Uber's average TC for a US-based software engineer is around $350k, the fully-loaded cost is going to be in the $450k-$500k range. So we're talking around $38k/month for a software engineer.

            $1,500/month isn't even a drop in the bucket. If LLM use lets them shave just one person off a team, that pays for tokens for the next 25 engineers.
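
            The "next 25 engineers" figure can be checked with a few lines of arithmetic. This is a sketch using only the numbers quoted above ($450k-$500k fully-loaded annual cost, $1,500/dev/month token budget); the function name `engineers_covered` is made up for the example.

```python
def engineers_covered(annual_engineer_cost, monthly_token_budget):
    """How many per-engineer token budgets one engineer's monthly cost pays for."""
    monthly_engineer_cost = annual_engineer_cost / 12
    return monthly_engineer_cost / monthly_token_budget


# Low and high ends of the fully-loaded cost range quoted above.
low = engineers_covered(450_000, 1_500)
high = engineers_covered(500_000, 1_500)
print(f"{low:.1f} to {high:.1f} engineers' token budgets per engineer removed")
```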

          • nearbuy 6 hours ago
            These numbers don't mean anything without a denominator. You could burn $10 million/month of tokens if you want. We want to know how the cost per unit of useful output compares to a human. Does $6000 of usage buy you a man-month of work? Less? More?
          • winwang 8 hours ago
            Minor note, 2x $/tok is not 2x cost. Personally, I see Fable being significantly more token-efficient than Opus 4.8. Then, there's also the compounding costs of quality.
            • squidbeak 5 hours ago
              On top of which, as the article mentions, it delegates simpler tasks to cheaper models.
          • ramesh31 11 hours ago
            > I would not be at all surprised if they can put a nought or two on that number.

            People keep saying this and it keeps not happening.

            ChatGPT Pro was $200/mo when it launched in '23 for a ~100B class model with 8k context. Claude Max is now the same price for practically unlimited access to a ~1T class model with 1M context.

            Moore's Law never died, it just switched architectures.

            • swiftcoder 4 hours ago
              Not to mention, the cost/performance of the baseline keeps falling. The cost-effective Deepseek V4 Flash is better than frontier models from a year ago, at a fraction of the cost.
        • TheOtherHobbes 17 hours ago
          Good to know that LLMs will be removing all regulatory and legal risks, as well as creating a consumer economy that no longer employs or pays consumers.

          I can't help thinking there might be some kind of strategic issue here.

          Perhaps someone should ask Mythos about it.

          • Ferret7446 13 hours ago
            By the point where we have work hours regulation for AI, all of our current debate about AI will be long irrelevant because we've clearly achieved AGI
        • gopher_space 16 hours ago
          One of the large (and enjoyable IMHO) challenges in this line of work is developing a de facto understanding of your process and the context it's in service to, and that's only possible if you're actually on your industry equivalent of a "shop floor" for each domain the project touches.

          As far as I can tell this part of the job isn't really on anyone's radar anymore.

        • mitxela 2 hours ago
          Since the recent re-pricing many companies are finding AI coding to be more expensive than human coding.
        • warkdarrior 18 hours ago
          That's the beauty of these AI advancements. You, a human, will have to compete against a model for the same job.

          If you get $100,000 per year as a SWE, and Anthropic offers a coding model for $100,000 per year (but working 24/7), then you'll have to give up all of those addons that make the fully burdened cost of the employee. Say goodbye to vacation, sick time, benefits, etc.

          • rspeele 12 hours ago
            > "What have you got against machines?" said Buck.

            > "They're slaves."

            > "Well, what the heck," said Buck. "I mean, they aren't people. They don't suffer. They don't mind working."

            > "No. But they compete with people."

            > "That's a pretty good thing, isn't it--considering what a sloppy job most people do of anything?"

            > "Anybody that competes with slaves becomes a slave," said Harrison thickly, and he left.

            Kurt Vonnegut, Player Piano

          • dyauspitr 9 hours ago
            They will do it for far less. Once manufacturing catches up and they have the data centers built out tokens are going to be dirt cheap.
        • LogicFailsMe 1 hour ago
          The real issue IMO is what Richard Sutton alluded to in his talk about GenAI creativity. These models are quite creative at problem solving, but for now, they still need a human to set direction and to occasionally pull them out of doom spirals of churn.

          However, given this model now silently corrupts its own work if it thinks you are up to no good, it's absolutely 100% not Mythos so possibly Mythos is better, but who knows now that the alignment and safety people are on the case, inadvertently keeping humans in the loop?

          https://simonwillison.net/2026/Jun/10/if-claude-fable-stops-...

      • crystal_revenge 12 hours ago
        The parent comment is describing a test they ran so they could assess their trust in the model for scenarios they don't have time to fully understand.

        Do you not believe in running tests, evaluations, or experiments at all to better understand your environment?

        The ROI in the case of a positive outcome is the reduced time needed to inspect the results in the future (the entire point of AI is to know what you can trust it on, so you can delegate everything at that level with less oversight). The ROI in the negative case is the tokens not wasted on tasks too ambitious for the model.

      • Qhemlomo 18 hours ago
        It just got released, it shouldn't matter.

        We know this model will be cheaper and faster with time.

        And we have not even reached the timespan/timeframe where we have ASIC style models.

        OpenAI has to do something which will beat Fable otherwise Anthropic won. China currently overtakes cars, pv, batteries and very soon silicon chip making, it has all the incentive to also take over AI.

        • JumpCrisscross 13 hours ago
          > We know this model will be cheaper and faster with time

          Why? Demand for AI compute seems to be increasing faster than new production is due to come online for the foreseeable future, particularly if more-intensive models induce demand.

          • slibhb 1 hour ago
            The cost for a fixed level of intelligence will fall over time. We've already seen this. But the cost of cutting edge models will probably not fall anytime soon, and may increase.

            So I would expect Fable-level intelligence to get cheaper.

        • camillomiller 18 hours ago
          LOL magical thinking
          • Qhemlomo 18 hours ago
            I'm happy to discuss arguments if you want to add any?
            • throw939494555 18 hours ago
              Not OP, but for me, this model will get VERY expensive in 2 weeks. Now it is part of Pro plan, after 22nd it will get excluded and I will pay by token API usage (~10x more expensive).

              I find it good for code reviews.

              • Qhemlomo 7 hours ago
                Yeah my time scale is 'a handful of years' :)
            • Our_Benefactors 18 hours ago
              The only thing they’ve overtaken is arguably batteries, and even that is questionable if the quality is as good as Korean manufacturers. I think it’s more likely that the Chinese chip industry overtaking competitors will remain like nuclear fusion, forever “just 5 years away”
              • Qhemlomo 5 hours ago
                The best batteries are currently from CATL. No one in the industry is doubting this.

                Huawei just showed LogicFolding and have a roadmap for 1.4 nanometer by 2031; SMIC is going for 5nm.

                And all of this WITHOUT EUV.

              • zelphirkalt 6 hours ago
                They mostly have overtaken in cars too. Their EVs are just cheaper, and they have built the infrastructure around it, even in more rural provinces. Building infrastructure is something they excel at anyway.
      • PunchyHamster 19 hours ago
        It will be great when the price of compute/memory drops to normal levels!
        • cyanydeez 17 hours ago
          >Sam Altman has signed another Memorandum of Understanding: Buying all SDRAM till the heat death of the universe OR Musk relocates to Mars.
  • olafmol 19 hours ago
    This little line from the article scares me: "but a software engineer would iron out the remaining potential bugs that I could not find quickly"

    Every sw dev knows this is a very dangerous, and unrealistic, assumption.

    • bluegatty 8 hours ago
      it's basically a tiny statement that kind of hand waves all the 'actual stuff'.
      • BigJono 7 hours ago
        It's "I did the first/easy 90% now someone else do the second/hard 90%". Same as it ever was.
        • bluegatty 6 hours ago
          We're not supposed to be crude on HN but that's some real Dilbert level stuff right there. Like spit out my coffee laughing, cringing. It's too bad the Dilbert guy seemed to have lost his mind in meta level cynicism (and maybe his legacy as well) and also passed away, because we kind of need him now. Dilbert is almost made more for the AI age than the computing age.
  • economistbob 2 hours ago
    And therein is the problem most perfectly expressed. He prompted that all the data should be real and validated and then simply trusted that it was. That was for a data driven project. People will do that for countless things, even critical things.
    • an0malous 1 hour ago
      I wish I had learned earlier in life how much more I could BS things because no one was going to check
  • ecocentrik 19 hours ago
    Reading the first few paragraphs of what he calls "the most sophisticated academic social science paper I have yet seen from an AI" does not impress as much as I hoped.

    "Posterior beliefs about market demand are purely reference-dependent: holding dollars raised constant, they track only performance relative to the founder’s self-chosen goal—jumping half a standard deviation at the threshold, responding steeply for the first ten points past it, and flattening thereafter"

    Humans generally don't verbalize data this way. The summary document is also very fluffy.

  • nstart 9 hours ago
    Desperate to know what the prompt for the poem is. The idea of it felt familiar so I went down the rabbit hole and found: 14 years ago, a poem on reddit [https://www.reddit.com/r/RedditDayOf/comments/tjjw2/may_12_a...] . Nowhere near the length of the one the author shared but the same idea.

    > This is from "The Cyberiad", a collection of science-fiction fairy tales by Polish author Stanislaw Lem ... In one of the stories, a robot constructor named Trurl creates a machine that writes poetry. A jealous rival named Klapaucius challenges the machine to compose "...a poem about a haircut! But lofty, noble, tragic, timeless, full of love, treachery, retribution, quiet heroism and in the face of certain doom! Six lines, cleverly rhymed, and every word beginning with the letter s!!"

    And the computer responds with:

    "Seduced, shaggy Samson snored.

    She scissored short. Sorely shorn,

    Soon shackled slave, Samson sighed.

    Silently scheming,

    Sightlessly seeking

    Some savage, spectacular suicide"

    The author had to be referencing this moment in their challenge to Fable/Mythos. I'm curious to know what their exact prompt was.

    • Erwin 5 hours ago
      What's fascinating is the difficulty of the English translation, which uses a different starting letter and different words than the Polish original:

        Cyprian cyberotoman, cynik, ceniąc czule
        Czarnej córy cesarskiej cud ciemnego ciała,
        Ciągle cytrą czarował. Czerwieniała cała,
        Cicha, co-dzień czekała, cierpiała, czuwała...
        ... Cyprian ciotkę całuje, cisnąwszy czarnulę!!
      
      
      You can compare the job of a translator to that of an LLM. Both produce derivative works, operating within some constraints but with room for creativity.
    • philipwhiuk 6 hours ago
      > the author had to be referencing this moment in their challenge to Fable/Mythos.

      Or it just swept it up in the training data, given that Anthropic licenses Reddit comments.

  • gopalv 21 hours ago
    > It worked for nine and a half hours.

    > Again, it wasn’t perfect. As an expert, I was able to spot some errors and omissions (some as a result of the design I had asked for) that I had the AI correct

    That's the bit that stuck out to me - that's longer than I would expect to work on a problem in a day or even expect to go back & fix the output of something that has a core reward loop of hours.

    My customers are currently clamoring to push down my agent response times from 85 seconds down to below the 20s mark.

    At the same time, it is very dissonant to see the industry heading towards hour+ long workflows with an agent.

    • matneyx 21 hours ago
      In Claude's defense (and I cannot believe I'm defending it), I know no single dev who could create what it did (Concord), from a 19-page design document, in 9.5 working hours.

      We're gonna go back to the days where our bosses ask why we're just sitting around, but instead of saying "compiling," we'll just say, "waiting for Claude."

      • torginus 16 hours ago
        I tried to read the 'design doc' - it's full of vague platitudes and impressive-sounding but impossible-to-pin-down management speak - in short, it's slop, and I still don't really get what it's supposed to do exactly.

        It's some prompt-engineered AI harness that guides the AI to create stats after it researches a subject and ingests the data, but I'm not sure what it is that the tool actually does on top of this.

      • giancarlostoro 19 hours ago
        This. I get told things like "you can't build all that on your own?" I've had Claude poop out full feature web apps in under 30 minutes, to a spec. Was it perfect? No, but sometimes even in a simple setup phase you can burn 15 minutes to some obscure setup step that's failing. I cannot just code nonstop at 900WPM or whatever ridiculous speed, and poop out an entire full feature web app, with maybe a few bugs here or there. If you can, come show me, I'll gladly have you race against my Claude prompting capabilities.

        Will Claude's code be perfect in one shot? Probably not, will it get you 80 to 90% of the way there with your chosen design patterns in under a few hours? Absolutely.

        • toss1 17 hours ago
          >>If you can, come show me, I'll gladly have you race against my Claude prompting capabilities.

          Sounds like we've nearly reached in coding the point where Paul Bunyan [0] has his epic competition with the chainsaw... and loses by 1/4" and history forever changes...

          [0]https://www.britannica.com/topic/Paul-Bunyan

        • dyauspitr 9 hours ago
          And honestly, it will get you the rest of the 10-20% with a little bit of yelling at it once it’s done
      • neogodless 21 hours ago
        For the rare uninitiated:

        https://xkcd.com/303/

      • petesergeant 19 hours ago
        Sadly I didn't get very many answers to my Ask HN, "What are you doing during inference?": https://news.ycombinator.com/item?id=47944917
        • ModernMech 19 hours ago
          I alt-tab to an MMO and farm XP.
          • mattbettinson 18 hours ago
            Which one?
            • ModernMech 2 hours ago
              I've been playing Monsters and Memories, basically an Everquest clone.

              https://monstersandmemories.com

              It's in private beta but sometimes they have a public beta, like just last week. They were supposed to have released this month but they pushed back to October.

              Also check out Adrullan Online, it's also an EQ clone but Minecraft voxel style. More like alpha status, they don't seem as far along.

        • magarnicle 10 hours ago
          Drawing.
    • giancarlostoro 19 hours ago
      > At the same time, it is very dissonant to see the industry heading towards hour+ long workflows with an agent.

      At this point, pay me significantly more, and I'll do it.

      • warkdarrior 18 hours ago
        > pay me significantly more

        Ha ha, that's how you negotiate yourself out of a job!

        • giancarlostoro 16 hours ago
          Fire me then, I can bring someone else drastically more value with AI tooling.
          • swader999 10 hours ago
            "I can bring your competitors drastically more value with AI tooling"
    • PeterStuer 21 hours ago
      My Opus 4.8 regularly works for 10+ minutes on a single non-trivial coding request.
      • ASalazarMX 20 hours ago
        Your Opus 4.8? Is it now usual to refer to LLMs like that?
        • wongarsu 19 hours ago
          Isn't it common to refer to all software like that? "Let me look at my JIRA", "I can't find anything using my Outlook's search function", "My Powerpoint is acting up today", "My browser just crashed" are all sentences I might say during a normal work day
          • calvinmorrison 19 hours ago
            better than "The JIRA" , or "The Google" or "The Spotify"
          • hypfer 19 hours ago
            Depends on the demographic, I think. It also tells you a surprising amount about how the brain of the person uttering it works.

            There are people that almost feel physical pain if something is unnecessarily incorrect.

            + That if the mental model of something is accurate, it is actually _more_ work to say something that is incorrect than just saying the correct thing.

            • wongarsu 19 hours ago
              In my mental model, "my Outlook" is the outlook instance running on my computer, on my data. My outlook crashed today. Yours might not have crashed. Similarly, my Jira contains tickets about my work, your Jira does not contain those same tickets. That might be technically the same instance on the same SaaS server, but the server I'm routed to accessing my data with my credentials turns it into "my Jira". My Jira is slow. Maybe you are lucky and get routed to a faster server, or your company is self-hosting. Then your Jira might be reasonably fast
              • ASalazarMX 19 hours ago
                This is completely fine, as those are your own installs, but LLMs can't be owned by the users. Your Opus is the same Opus as everyone else's; your only difference is the subscription tier to their API.

                If you had your own on-premises LLM, that would indeed be your LLM, and it would make sense to compare it to the on-premises LLMs of other people, as your setup particulars would affect the result.

                • dasyatidprime 18 hours ago
                  The copyright to the Outlook binary isn't owned by the users either, even if they're running it on local hardware. The Opus 4.8 weights are (we assume) the same between users, but the conversation/tooling state is not shared between them by default. I prefer to route around this construction myself, since I do think there's some ontological slippery-slope potential, but from a lexical perspective I think “my” is a perfectly defensible abbreviation in context.
                  • hypfer 18 hours ago
                    > The copyright to the Outlook binary isn't owned by the users either, even if they're running it on local hardware

                    There was a time where one actually bought software to own it.

                    This time is.. actually it is right now. Please leave at once.

              • hypfer 19 hours ago
                Hmm, good point. "My outlook" might actually be correct. Depending on if it is a webapp or the real one running on your device that is.

                Similar to "My game just crashed".

                Jira otoh is not yours, because it's in the cloud. It might be "my internet connection", "my browser" or "my account" that is having trouble.

                ___

                Hm. "My train got delayed" is interesting in this context. I don't find that offensive. But that also might be because trains don't seek rent the way SaaS does? Not sure.

                I guess trains do not hold me hostage. They might just be a container in which someone does that.

                Jira, cloud LLM inference or similar otoh..

                • ASalazarMX 17 hours ago
                  The "my train" convention is an interesting argument. It's not actually yours, you're buying a train-as-a-service single-use license, and there are tiers to that too.

                  I guess the main difference is that TAAS has many different trains where the experience varies wildly, so it helps to be specific on which train you're licensing; but LLMs are the same product for everyone, and you can't stay with say, ChatGPT 1.0, you get the same choices as everyone else.

            • RugnirViking 18 hours ago
              > tells you surprisingly much about how the brain of person uttering it works

              That's ridiculous. You wouldn't respond to "I went to visit my doctor yesterday" with "but slavery has been illegal since forever!" Similarly it would be foolish to respond to "where should we meet? my place or yours" with "but we both rent!"

        • w4yai 19 hours ago
          You don't have your Opus 4.8 ? I got mine yesterday !
          • ASalazarMX 17 hours ago
            I didn't get mine, but I suspect I might be using yours when I use it.
        • PeterStuer 7 hours ago
          I probably should have used 'Opus 4.8 in my Claude Code configuration'. The model and harness might be the same for everyone, but the .md's, hooks, skills, agents, MCP ... configurations make everyone's setup fairly unique.
        • giancarlostoro 19 hours ago
          That's pretty tame, if you want to be disturbed check out r/MyBoyfriendIsAI
          • throw939494555 18 hours ago
            Or dog lovers. All sorts of licking, anal cleaning... full intimate relationship.
    • hedgehog 20 hours ago
      Work duration is also not that valuable of a measure, you're usually better off defining the process yourself in code and having that delegate chunks of work to the models. The only real issue there is that it's harder to take advantage of the providers' subscription discounts, but on the other hand it's easier to do your own model routing, and there's no way I've seen for the normal chatbots to maintain coherence on streams of work measured in days and weeks.
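
      A minimal sketch of that approach, under stated assumptions: `call_model`, the model names, the `hard` flag, and the routing rule are all hypothetical stand-ins, not any provider's API. The point is only that the control flow lives in ordinary code and each model call sees one small chunk of work.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    name: str
    prompt: str
    hard: bool = False  # hypothetical flag set by whoever writes the pipeline


def call_model(model: str, prompt: str) -> str:
    """Stand-in for a real API call; returns a canned string so the sketch runs offline."""
    return f"[{model}] {prompt}"


def route(task: Task) -> str:
    """Hypothetical routing rule: hard chunks go to the larger model."""
    return "big-model" if task.hard else "small-model"


def run_pipeline(
    tasks: List[Task],
    model_fn: Callable[[str, str], str] = call_model,
) -> Dict[str, str]:
    """Run each chunk through the model chosen by route() and collect the outputs."""
    results: Dict[str, str] = {}
    for task in tasks:
        results[task.name] = model_fn(route(task), task.prompt)
    return results


if __name__ == "__main__":
    work = [Task("lint", "fix lint"), Task("design", "plan schema", hard=True)]
    for name, output in run_pipeline(work).items():
        print(name, "->", output)
```

      Because the process is plain code, swapping `call_model` for a real client or persisting `results` between runs would be a local change rather than something the model has to remember.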
    • cyanydeez 19 hours ago
      I think we hit the sigmoid back when the QWEN models were released. By properly structuring my project, I can point it at any extension I want and get it going for 30 minutes to extend whatever. It can't effectively do 'god mode' on all the code, but being a mindful observer and code "professional" I don't need more than what a 128GB VRAM needs.

      I'm amazed we're so far into SOTA bloat that the Chinese will kill once they start etching silicon with these models.

  • mohsen1 19 hours ago
    I have been using it for less than an hour so take this with a grain of salt of being excited for the new tech.

    In a project like mine (https://github.com/tsz-org/tsz) I am constantly frustrated that models were not doing enough research and were not taking into account other situations. Again and again models would produce code that would fix one thing and break 2 other tests that were "unrelated".

    With Fable it seems like tasks are taking much longer (I have not seen a pull request from Fable sessions yet) but reading the transcription of those sessions I can see how it is doing the right thing by not leaving any stone unturned.

    As the article says, it's hard to communicate this "feeling" about models because it is very project-specific, but I thought I'd share

    • anematode 18 hours ago
      Does this not indicate that the project might not be structured in an appropriate way that allows incrementally adding features?
      • layer8 18 hours ago
        In general, sooner or later you need to restructure one thing or another when requirements are changing. Good code lets you reason about a refactoring, and experience tells you when it is necessary or appropriate. Coding agents aren’t very good at the latter.
      • mohsen1 18 hours ago
        the setup is solid. there are thousands of tests and CI won't let things merge if tests are failing.

        But overall, it is pretty normal for compilers to have this sort of "unexpected" test failure due to work in some other area. It happened to me when I was coding everything manually back in the day too

        • anematode 17 hours ago
          > there are thousands of tests and CI won't let things merge if tests are failing.

          That's not what a clean setup means... I mean good separation of concerns, established invariants, etc.

          • mohsen1 7 hours ago
            A compiler and type checker is a very special case where you can fix something in the lexer or parser and break another thing in the AST walker etc. tsz is well architected but those things can happen if you're not careful, and that's precisely what I meant in my original comment. Fable can think about how changing the parser can impact the checker etc...
  • selfawareMammal 19 hours ago
    What are people working on that they see such a substantial difference between Mythos and Opus? I'd say I'm working with advanced stuff and more often than not even Deepseek is more than enough. Why is everybody a genius in here?
    • jenniferhooley 18 hours ago
      Just depends what you are working on. If you are trying to make a video game that's at the level of a decent indie game (think Hades/Bazaar/etc), making UI elements/VFX/complex shaders/etc that are organic/interactive/animated and that don't feel like a little dogshit vibeslop web-game, then none of the models are even close to good enough to get it done easily. A huge percentage of the problems in top 3% games are really hard for any of the models to do with simple prompting.

      Personally I don't really care, because I like coding and learning myself and DeepSeek Flash is all I really care about. But it's really easy to have a ton of benchmarks where the top models can't get anywhere close - and I like to test them on these problems to see how good they are getting.

      Fable 5 is def a little better than 4.8 btw.

    • mervz 19 hours ago
      We see the same thing when new laptops are announced and every employee all of a sudden needs to upgrade, despite the fact that 90% of people would be able to make do with a Macbook Neo.
      • Our_Benefactors 18 hours ago
        > despite the fact that 90% of people would be able to make do with a Macbook Neo.

        Myth. Total myth! I recently had to beg for more RAM after continually hitting swap space which causes tools like dictation to stop working, failure to load certain websites without rebooting, and so on. Devs do in fact need powerful machines and the ~$500-1000 an employer saves upfront in machine costs is dwarfed by productivity losses.

        Giving your engineering employees new machines in a 2-year cycle that are between the middle and high end is one of the cheapest ROI decisions that a tech org can make.

        • oarsinsync 17 hours ago
          Surely devs could just uninstall Slack, and get the same combined RAM & productivity boost?
    • matheusmoreira 13 hours ago
      I'm working on my own programming language. I've also been exploring open source projects to contribute to. Maybe something that helps me pivot from hobbyist to professional. If such a thing is even possible in this day and age.

      Fable 5 found quite a few issues Opus 4.8 missed on code review, even though the stupid cybersecurity nonsense downgraded it. I can't tell you more, I only get a single session per 5h window on Max 5x. Only ran two sessions so far.

    • ianm218 19 hours ago
      I’ve been working on implementing some common web infra type projects in Rust lately. Basically trying to use a lot of the great primitives in Rust like rustls (modern OpenSSL) and Tokio (async) to build memory-safe, or close to it, nginx drop-in replacements.

      A small portion of this effort is having a high quality Lua in Rust repo. I’m using Mythos to fix some of the performance issues with my Lua interpreter that gpt 5.5/ opus 4.8 had stonewalled on.

      Not sure if Mythos will be able to crack this but it has been running for a couple hours now with some promising results.

      Performance charts linked here if you're curious https://github.com/ianm199/lua-rs

      • mplanchard 16 hours ago
        What’s wrong with mlua?
        • ianm218 16 hours ago
          Mlua works for many use cases but is a wrapper around the C code, so you need to bundle C as part of the build. So this is worse for cross compilation and makes it so you can't easily use mlua projects in wasm32-unknown-unknown. An example is that it would be hard to run a game in the browser that exposes Lua scripting with mlua.

          The other reason is that because mlua is just a wrapper around the C code, it has unsafe you can't really get around. So for example Lua is used in Redis, which has this critical CVE https://github.com/redis/redis/security/advisories/GHSA-4789... that a memory safe version of Lua wouldn't have to deal with.

          Mlua is still fine or even better for many other cases though!

          • mplanchard 15 hours ago
            The WASM thing makes sense. Do you need unknown-unknown? Seems like support exists for emscripten and wasi: https://github.com/mlua-rs/mlua/issues/366

            It just seems like a lot of hassle to write a lua interpreter, although it would be nice to see a high quality one in Rust :)

            Hematita was promising, but looks abandoned.

            • ianm218 15 hours ago
              Yeah an example is that currently you can't build Bevy games in the browser with scripting in Lua, so I've gotten a little traction there.

              And yes, it seems like there have been many attempts to get a solid Rust Lua over the years and most never reached parity, so I'm hoping some people can find a use case for it! This one is at full parity in terms of behavior, and performance is getting to within striking distance.

              • mplanchard 15 hours ago
                Best of luck! We used mlua at $JOB for scripting support, and it worked great, but we’d have preferred a pure rust solution if one existed with the right performance profile
    • jstummbillig 17 hours ago
      I am sure you would not find it hard to exhaust any model, if you kept upping your ask enough times.

      On the margins, suppose the prompt is literally: "Build a feature complete, high polish Facebook clone". Facebook is complex but likely not super complicated tech, and still I would assume that (after having burned through a substantial amount of tokens) you would find substantial enough differences in the outcomes between different models on that prompt on various fronts.

      The above ask is obviously not useful, but what's preventing you from taking on bigger chunks until you approach the limit? At some point you would hit a boundary, where the diff will be obvious.

    • mohsen1 19 hours ago
      I had a few of the benchmarks left alone and was working on tech debt knowing that a new model is going to be released soon. For my project (tsz.dev) Opus 4.8 was running in circles without producing results for a while for those tasks
  • mieubrisse 4 hours ago
    Having a good prompt-engineering skill is the highest-leverage thing IMO, so I burnt 2 Max 20x usage windows to help Fable help me refactor mine. With its partnership we:

    - Went deep on "what types of guidance even are there? what does giving good guidance mean?"

    - Sampled my existing Claude guidance (CLAUDE.md, skills, hooks, etc.) and broke their guidance into "atoms"

    - Categorized them by clustering, the same way Big Five was generated

    - Generated a new candidate

    - Then used independent agents to compare it against my existing corpus assuming that the new one would be worse

    Working with it felt like working with a supersmart entity capable of generating very plausible-sounding but not-necessarily-true statements. The outcome certainly felt like an alien artifact, like nothing I'd make myself.

    Only time'll tell if it holds up, but it sure had some interesting ideas.

  • theturtletalks 19 hours ago
    This is what he built:

    https://isochronic-passage-chart.netlify.app/

    Doesn’t work too well on mobile but looks interesting

    • jampa 19 hours ago
      It is hallucinating many flights in my region, some that never existed (so it is not an outdated data problem).

      I also see some logic flaws. It overlooks the option of going to a major hub to access faster aircraft, rather than hopping on local hubs.

      Also, immigration and customs are cleared at the first airport you arrive at in the country, not at the last one.

      In some countries, you need to clear immigration even while going to a third country, so 1 hour is not enough to do it.
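
      The hub point can be illustrated with a small shortest-path sketch. Everything here is an assumption for illustration: the place names and travel times are invented, and this is plain Dijkstra over total hours, not whatever the chart actually computes. A search over total travel time picks the detour through a major hub when it is faster than a chain of local hops.

```python
import heapq
from typing import Dict, List, Tuple

# Adjacency list: node -> [(neighbour, travel time in hours)].
Graph = Dict[str, List[Tuple[str, float]]]


def fastest_time(graph: Graph, start: str, goal: str) -> float:
    """Dijkstra's algorithm: minimum total hours from start to goal, or inf if unreachable."""
    best = {start: 0.0}
    queue = [(0.0, start)]
    while queue:
        elapsed, node = heapq.heappop(queue)
        if node == goal:
            return elapsed
        if elapsed > best.get(node, float("inf")):
            continue  # stale queue entry, a faster path was already found
        for neighbour, hours in graph.get(node, []):
            candidate = elapsed + hours
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return float("inf")


# Invented numbers: local hops total 2 + 4 + 4 = 10 h, the hub route 3 + 5 = 8 h.
ROUTES: Graph = {
    "Origin": [("LocalA", 2.0), ("MajorHub", 3.0)],
    "LocalA": [("LocalB", 4.0)],
    "LocalB": [("Destination", 4.0)],
    "MajorHub": [("Destination", 5.0)],
}

if __name__ == "__main__":
    print(fastest_time(ROUTES, "Origin", "Destination"))
```

      Connection overheads such as immigration at the first airport of entry could be modelled as extra hours on the relevant edge; that is left out here to keep the sketch short.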

    • skipants 19 hours ago
      It looks interesting but, like a lot of AI output, it looks correct but is not. Most of northwestern Canada says you can get there by road. If you look at Google Maps, there are no roads there for quite a while. I see one highway between Inuvik and Tuktoyaktuk but that's about it.
      • neom 18 hours ago
        Reminds me of a fun story. Some 20 years ago when I moved from Fort Frances to Toronto for college, my high school best friend was also going to college in Toronto, and his dad offered to drive us together in his truck with all our stuff in the back. We were saying our goodbyes and my buddy's dad said to my dad "We'll get there a lot faster, I found a shortcut!" My dad, confused, says "shortcut? there is no shortcut, just highway 1..." and his dad insists he found an alternative route, much shorter by kms, and we'll fly up there 6 hours faster! We get into the truck and he pulls out 5 pages of printed mapquest... I assure you, having done it, Sault Ste. Marie to Sudbury via Elliot Lake on logging roads may look interesting, but it is not correct, and it added a good 8 hours to the trip.
    • KeplerBoy 1 hour ago
      It is cool, but still weird that it gets very basic stuff wrong like mapping the cursor coordinate to the canvas. There's some y-axis scaling issue.
    • rgmerk 10 hours ago
      It put the chart title directly on top of Australia.

      Which just about sums up my experience with using LLMs to code, really (though not with these state-of-the-art models, admittedly) - it's amazing what they can do, but left to their own devices they'll make boneheaded decisions.

      • justinclift 18 minutes ago
        > it's amazing what they can do, but left to their own devices they'll make boneheaded decisions.

        Yeah, the whole "can run for 9 hours on a task" to me is not a positive.

        I tend to find if Opus 4.8 runs for ~15 mins on a task, then the end result has gone off in a weird direction at some point, and it needs winding back a fair bit.

        And that's with extremely clear direction, literal specification docs to follow, etc.

        That being said, having functional code already created beforehand (ie by a human) goes a long way to ensuring the AI model has a path it can build on without making too many dumb architectural choices by itself. Generally.

      • alt227 6 hours ago
        I believe that's why they put 'Sydney' as an option at the top to recenter the map.

        The real issue with the title is that it doesn't fit in the box!

    • ImaCake 11 hours ago
      It's fun and it looks good regardless of whether it's 100% correct (it would certainly take me more than 9 hours of work to do better than this). Making these bespoke tools possible for most people is a big deal.
      • aix1 10 hours ago
        The UI is full of glitches: the legend that's placed right on top of Australia, the title that doesn't fit in the box, the crosshair that doesn't accurately track the cursor, the pixellated fonts along the perimeter, the unreadable colour combinations in the overlay, the rendering glitches along the axes when you flip from tab to tab and so on and so forth.

        It's like someone took a beautiful, intricate piece of vintage jewellery and made a slapdash imitation out of cheap plastic.

        • bschwindHN 9 hours ago
          Yep. People are creating garbage with AI that looks passable at first glance, or maybe acceptable if you have no taste. This is the kind of software we can expect to receive in the next few years.
    • endymion-light 2 hours ago
      Doesn't work too well on desktop either! This is decent but it's also an early hackathon set-up - this is something that you can set up on a Sonnet model fairly easily (without the weird CSS slop that Anthropic models seem to love).

      I'm not very threatened by this if this is the dangerous Mythos model - it just seems like an incrementally better Sonnet

  • thepasch 19 hours ago
    What it feels like to work with Fable:

    > Switched to Opus 4.8: Fable 5 has safety measures that flag messages on most cybersecurity or biology topics. They may flag safe, normal content as well. These measures let us bring you Mythos-level capability in other areas sooner, and we're working to refine them. Send feedback or learn more.

    • matheusmoreira 17 hours ago
      Same experience here. The parts of my project that actually could have benefited from Fable's code review got this instead.
  • kgeist 2 hours ago
    Judging by the benchmarks on Artificial Analysis, "a very real leap over every model" is 2-3 points over competitors (say, 62 for Fable 5 vs. 59 for ChatGPT 5.5 xhigh for coding).
  • michaelteter 7 hours ago
    As a software engineer and solution provider, I do not feel threatened by this.

    I do not fear that management will get tools like Mythos and then not need people like me. Most of the value I provide is in translating what the management/client _thinks_ they need into what is the real problem and solution.

    That's not an insult to them, it's just pointing out that they see only their problem, and they imagine what would be the solution. They then ask for that solution. Quite often, what they want built isn't what they need. And I've seen so many problems, from so many domains and scenarios, that I can usually recognize the core need and propose (and build or direct building of) a solution which resolves that need AND has an eye toward the likely future needs.

    Mythos may do an excellent job providing a high quality result based on what is asked of it. But the result will only be as good as the quality, clarity, and presentation of the request.

    If I hire a home builder to build me a custom home, that builder is going to ask me a thousand questions - questions I had never even thought of. Mythos isn't going to ask all those questions - it's going to make the best choices it can without the consultant's level of interaction. And the buyer will get what they get. Sure, the buyer can then say, "oh, I don't want any hallways - just connected spaces." Then the house gets demolished and rebuilt to the new, clearer spec. Repeat, repeat, repeat. Maybe eventually the buyer gets what they really want. More likely they give up before reaching that point, and they go and hire a real builder.

    I'll sum it up like this: You can get great results with minimal effort if you don't really care too much about the details. But if you don't care much about the details, then your need probably wasn't very significant.

    • redhale 4 hours ago
      I will never stop being fascinated with takes like this. Maybe you're right. But people said very similar things over the last few years and many of those statements look unbelievably naive in retrospect.

      Sure, AI can auto-complete the line, but it can't write full functions.

      Sure, AI can write functions, but it can't complete full features.

      Sure, AI can write full features, but it can't build full applications.

      Sure, AI can write full applications, but it can't build them in the right way / ask the right questions / write beautiful maintainable code / do what _I_ do..

      Time will tell.

      • ChrisLTD 2 hours ago
        Even the early versions of AI autocomplete tools like Tabnine and the original Copilot could autocomplete entire functions, so I think you might be strawmanning a bit.
    • zelphirkalt 6 hours ago
      I currently see the problem as follows: The knowledge worker like you sees the need for people like themselves to still be hired, and can reasonably argue for it. However, the management dudes and investors do not understand it, and it is difficult to make them understand when their (short- to medium-term) profits depend on not understanding it. So whether you feel threatened or not is just a matter of you feeling bad or not; it doesn't really matter when it comes to finding a job.
  • neaden 19 hours ago
    Man, that poem it made is terrible. Like just incredibly bad. Sure it's neat that software can make an incredibly bad poem but there is enough bad poetry in the world that we don't need it.
    • Kiro 17 hours ago
      How good can a rhyming poem about a haircut where every word starts with S be?
      • endymion-light 2 hours ago
        Seduced, shaggy Samson snored.

        She scissored short. Sorely shorn,

        Soon shackled slave, Samson sighed.

        Silently scheming,

        Sightlessly seeking

        Some savage, spectacular suicide.

        - That's the translated Cyberiad poem the blog post based it on (or the AI decided to do so)

      • electroweak 9 hours ago
        A whole lot better when written by a human, such as Michael Kandel. This was one of the tests of the electrobard in a story from the Cyberiad ("fables for the cybernetic age"). The key point about Samson was his suicide, which despite the obvious isn't mentioned in the six pages of this rubbish. Perhaps guardrails are throttling this corporate "fable"'s ability to comment on the human condition.

        The poem Kandel translated from the original Polish was, for artistic reasons, completely different. I will be impressed when machine translation can duplicate that!

    • layer8 18 hours ago
      I wonder what Vogons would think of it.
    • throw310822 4 hours ago
      Terrible? Incredibly bad? Something tells me you are not very familiar with poetry, literature or writing in general. This exercise gets its inspiration and tone from one of Stanislaw Lem's Cyberiad short stories ("Trurl's electronic bard"). Besides, what did you expect from a "10 pages epic rhyming poem about a haircut where every word starts with the letter S"? Robert Frost?
  • wxw 18 hours ago
    I am… underwhelmed by the artifacts in the post.

    I don’t see why working longer is a pro. The results don’t seem much better than you’d get from putting Opus in a long loop.

    • warkdarrior 18 hours ago
      > The results don’t seem much better than you’d get from putting Opus in a long loop.

      Care to share the results you got from Opus working on the same prompt? It should be easy to compare quality.

  • Ameo 10 hours ago
    Most depressing thing I've read in weeks, and that's a high bar. Hooray to humanity for creating the thing which has destroyed all the value of being good at creating things.
  • jgilias 3 hours ago
    Cool. But.

    Most of the “impressive” stuff is not “the model” but “the harness”. Spinning up the subagents and teams of lower models, letting them explore, do adversarial coding. It’s all in the harness. Granted, Mythos might be better at that orchestration, but it’s still the harness.

    Second is the prompting. The author is an expert in what they’re doing and prompts the system in a way that yields useful results. I see too many people believing that if an expert can achieve those results in a domain they’re familiar with, then they, as non-experts, will be able to as well. And that’s a fallacy that Mythos doesn’t change.

  • asdK120 21 hours ago
    Mollick runs the Generative AI Lab at Wharton, with all the corporate sponsors.

    He is a professor but sadly also an AI shill. He should switch to advertising washing powder.

    • MostlyStable 21 hours ago
      So...no engagement with the substance? Not even to explain why it is that this is not a useful description or test of capabilities? Ok.
      • dthread3 21 hours ago
        I would like to see it do something useful, like converting pytorch to golang.
        • Philpax 17 hours ago
          Will you accept a port of Torch to Rust? https://github.com/forecast-bio/ferrotorch
        • lijok 20 hours ago
          Hot damn - is that the floor of what you consider useful?
        • fdsdfsdfzxczxc 21 hours ago
          This newfangled car thing is useless. It can't even properly shoe a horse.
        • cadamsdotcom 20 hours ago
          Why not get a plan from Anthropic and get that done yourself? Probably is going to cost you as much as a coffee.
    • CuriouslyC 19 hours ago
      Ethan is a booster but I wouldn't call him a shill. He cites data and mostly in a fair way, though you could argue the sources he chooses to focus on are biased.
    • whyenot 21 hours ago
      Instead of attacking the author, please respond to the content of the article. That is the HN way, and it leads to more substantive and interesting discussions.
  • xavierforge 8 hours ago
    No question the capability jump is real, but in my experience it correlates with shortcut-taking. Fable 5 (and Opus 4.8 before it) hallucinates more than any Claude model I’ve used. The most common failure mode is asking it to modify existing code and watching it skip reading the original file, reconstruct that section from imagination, and then apply edits on top of its own invention, even with full context provided.

    Maybe my prompts are too vague, but it’s worth noting that every example in the post is a greenfield build, and vague prompting seems to hold up fine when there are no existing constraints to respect.

  • dmzxnico 5 hours ago
    Probably just a model that was trained on high-quality codebases and tuned to find security breaches and bugs by being "smart" enough to actually test the code by itself. Manually going through the app or website feels easy for Fable, so Mythos is just a better version.
  • shadytrees 6 hours ago
    The Balatro game that Fable spit out (Flipside) https://play-flipside.netlify.app/ is buggy but fun. Fable also fixed one of my personal pet peeves. Unlike Balatro, it comes with a calculator to preview the score!
  • SupremumLimit 15 hours ago
    I took a brief look at the code for one of the projects (https://github.com/emollick/concord/) he breathlessly praises and says "a software engineer would iron out the remaining potential bugs that I could not find quickly". The code looks like an unmaintainable mess.

    Other commenters have pointed out that his isochrone map contains a lot of nonsense as well.

    So the most charitable interpretation here is that this is a case of Gell-Mann amnesia.

  • mjamesaustin 18 hours ago
    The snake game is legit very fun. Once I got the ability to pick up the apples and plant apple trees, I was sold.
  • kleiba2 2 hours ago
    > So I asked Fable to solve the problem, first generating a complex 19 page design document and then executing it.

    > It worked for nine and a half hours.

    And how much did that cost?

  • pu_pe 18 hours ago
    The isochrone maps are quite beautiful [1], and go beyond the scope and refinement of some earlier human attempts I could find [2][3][4].

    [1] https://isochronic-passage-chart.netlify.app/

    [2] https://mapitout.welcome-to-nl.nl/

    [3] https://commutetimemap.com/

    [4] https://andrewding.ca/flightisochrones/

  • mawadev 19 hours ago
    Isn't it weird that we started to gauge the quality of a model by checking the vibe of the vibe coding?
    • geraneum 8 hours ago
      You can see this all over the place. Under the Fable post in HN, you have simonw talking about the “feel” of working with Fable and how much better it is. If I believed in conspiracies, I’d have said it’s all orchestrated marketing…
  • recursivedoubts 20 hours ago
    would it be possible for mythos to make the space bar scroll the pages on your website properly?
    • mulr00ney 19 hours ago
      Seems to be hijacked by the video of some game they generated. :(
      • albedoa 18 hours ago
        If you delete the video from the DOM, then click back into the content area, it reattaches the video lol.
  • queenkjuul 21 minutes ago
    not only is the site completely unusable on mobile ootb, but when i enable desktop mode on Android, my taps are detected in the wrong spot--clicking Chicago registers as Saskatoon.

    At first i thought its routing was just completely botched.

    The text overflow on the legend is pretty funny considering how well the other graphics turned out

  • 382hi 21 hours ago
    I think Qwen 3.7-Plus is better at reasoning than Mythos, and I've used both for quite a while.
    • giancarlostoro 19 hours ago
      Would love to see samples of the kinds of prompts you use with both. I sometimes wonder if the specific wording is the secret sauce, I have very few issues with Opus / Claude, but when I try premier GPT models, I get weird output from what I've grown to expect with Claude.
  • vb-8448 18 hours ago
    Nice, but I'm really curious about how many tokens have been used.

    There is only one hint: 475k tokens in the screenshot when OP asked the model to fix some behaviour, but it would be fascinating to know the total token count.

  • Aperocky 19 hours ago
    > This is a map that shows the distance you can travel in a given length of time, and the first one was created in 1881 showing travel times from London.

    The first item in the article, the first thing it showed, was wrong though.

    It is 100% faster to go from London to New York in 1881 than Volgograd. Or any of the Russian hinterland colored green or Turkey or Egypt.

    • patcon 19 hours ago
      > faster to go from London to New York in 1881 than Volgograd

      the map is for 2026, yeah?

      • Aperocky 15 hours ago
        yeah the original map was not for this purpose. Though I would say there are heavy assumptions made for 2026 too, namely that flights are available immediately upon demand.
  • philipswood 11 hours ago
    > It also created a 10-page epic rhyming poem about a haircut where every word starts with the letter s

    Wow

  • ElijahLynn 16 hours ago
    Loved the article!

    And I'm excited to try it, but also have a fear that I will like it too much and then won't have access to it in 2 weeks... But maybe I will and maybe it will be worth it and I'll just pay a bunch of extra for it and it'll be great!

    I think the article could be improved by actually sharing more feelings. I clicked on the article for feelings but I didn't see that many feelings described.

  • ComplexSystems 17 hours ago
    Who can afford to use this damn thing though? They're pricing everyone out of the market with stuff like this.
  • root_axis 21 hours ago
    I just can't stand this type of fawning language.
  • brockVond2021 5 hours ago
    on the places I've checked, mostly Paris to places in Ireland or Britain, the times are off by an order of magnitude

    looks nice but deeply flawed

    classic LLM output

  • 12345hn6789 2 hours ago
    The coin flip game does not work. I tossed 2 coins and it broke after that. You cannot progress forward.

    Not a great start for "a generational leap in model effectiveness"

  • catigula 18 hours ago
    >Ethan Mollick

    Just an FYI this guy is an AI hype-beast. Some of his tweets are truly out there.

    • dogmayor 18 hours ago
      Huge fanboy for sure
  • steve1977 18 hours ago
    > it is indicative of AI solving a hard problem involving research, math, visual development, taste, judgement, complex coding, and more.

    Is it a hard problem or is it just labor intensive?

    • warkdarrior 18 hours ago
      Depends on the skill of the person working on it.
  • ElijahLynn 17 hours ago
    > The work has shifted from process to outcome. I no longer steer; I commission.
  • PaulHoule 18 hours ago
    My wife likes to say "feelings aren't facts"
  • LogicFailsMe 18 hours ago
    I'm using Fable this afternoon and it's definitely a step up from Opus 4.8, finding and fixing things Opus 4.8 was blind to even perceiving. The next 13 days are going to be fun IMO. And Opus 4.8 was less annoying than Opus 4.7 FWIW.

    Edit: A couple hours in and I just got my first gaslighting attempt from the model. Good times!

  • philipwhiuk 6 hours ago
    Given that token counts are easily available not providing how much any of his examples cost is lunacy.
  • the_doctah 21 hours ago
    More Mythos Marketing.
    • boringg 19 hours ago
      The mythos of Mythos is marketing.
  • ThejaCH 19 hours ago
    What it feels like to work with Mythos? Feels like am poor
  • zb3 19 hours ago
    Was the condition of being granted early access to this castrated model writing a post praising it?
  • zuzululu 19 hours ago
    > First, how good is Fable? In experiment after experiment I conducted, it outperformed basically every other public model I have used by a considerable margin.

    What makes me excited is that GPT 5.6 (it's actually GPT 6) is going to be crazy

  • nickphx 1 hour ago
    oh look, more overhyped drivel from a non-technical person.
  • younglunaman 18 hours ago
    >What it feels like to work with Mythos

    >Looks Inside

    >So I did this with fable...

    What?

    • warkdarrior 18 hours ago
      Fable is Mythos with extra guardrails, so the analysis holds.
      • Chu4eeno 8 hours ago
        Considering all the initial Mythos hype (before they released Fable) was for things that Mythos explicitly can't do, no, not really.
  • honeycrispy 19 hours ago
    Reading it, I can't help but feel he's being paid to write this. Or maybe he hopes to be paid. The language he uses makes him sound like he's fawning over the lost days of his childhood. Pardon me for being skeptical, but a trillion-dollar company running a net loss is hoping to IPO, and needs to sway public opinion by any means necessary. I would imagine that no dirty marketing scheme is off the table, even from the self-proclaimed "good guys".
  • et-al 21 hours ago
    [flagged]
    • astrange 21 hours ago
      It is not a sponsored article and he writes one of these every time a new model releases. Why would a professor at Wharton need to write sponsored Substack articles?
    • 0x1ceb00da 21 hours ago
      "I don't care who the IRS sends I am not paying taxes!"
  • pbgcp2026 13 hours ago
    So, Ethan Mollick has just broken an NDA he signed. Typical. Out of everyone participating in Project Glasswing it was, of course, the Uni to f*k it up.
    • HDBaseT 11 hours ago
      The model is public and none of the inputs/outputs contained biosecurity or cybersecurity prompts.

      You can do all of this (and more) on Claude Fable 5; in fact, Fable 5 outperforms Mythos in most tasks (where the guardrails don't kick in, at least).