"Jarvis! Initiate the House Party Protocol”. Iron Man (Tony Stark) to Jarvis in Iron Man 3 (2013).
Introduction
We’ve all been fans of the Marvel Cinematic Universe (MCU), especially Iron Man, his armoured suits and all the gizmos that surround him. With the advent of AI in the last few years, what seemed like science fiction is slowly and steadily moving towards reality. The "protocol" in the dialogue above was the execution of sophisticated AI-controlled, specialised drones acting as a single, coordinated unit under the command of J.A.R.V.I.S. In case you didn’t know, J.A.R.V.I.S. stands for Just A Rather Very Intelligent System, a multimodal AI operating system created by Tony Stark for himself.
The most unrealistic thing about Jarvis is his singular loyalty to his inventor and owner. In this article, I try to separate how multimodal AI is depicted in science fiction from how it really operates in the real world. I also explore a design framework for how it could be shaped to be truly valuable to the individuals who own it, rather than simply maximising profits for the organisations that build it.
Multimodal AI - Unlike traditional AI that only "reads" text or "sees" images, Multimodal AI processes and integrates multiple types of data simultaneously—text, audio, video, and sensory signals. This allows the system to understand context much like a human does, recognising not just what you say, but the tone of your voice and the environment you are standing in.
The Jarvis We All Want
In the MCU, Tony Stark has Jarvis as his personal AI assistant, agent and operating system, controlling every aspect of his powers and how he uses them to save the world. Jarvis is always present with him: in his helmet, his lab, his house, and the screens and systems that surround Stark’s life and work. Jarvis sees what Tony sees, hears what he hears, understands context and interprets it in real time as it unfolds, and augments his abilities without ever competing for attention or extracting value for itself. He is proactive, pre-empting dangers for Tony; he does not advertise, nudge, or optimise for engagement. He simply exists to make Tony more capable.
That idea has endured in popular culture for a reason.
As artificial intelligence has steadily moved from research labs into consumer products and enterprise systems, the public conversation has tended to swing between extremes: fears of super-intelligence on one side, and hype about productivity gains on the other. Yet the more consequential shift has been quieter and more incremental. AI is no longer something we consult occasionally. It is increasingly something that accompanies us — across apps, workspaces, devices, interfaces, and moments of decision and action.
We can already see the trajectory that could one day enable a real-life Jarvis taking shape. Language models are no longer confined to text. They operate alongside vision, audio, memory, and action. AI agents are being embedded into operating systems, browsers, workplaces, vehicles, and smart homes. Interaction is becoming multimodal and persistent, rather than episodic and transactional. In a recent interview, Alphabet CEO Sundar Pichai referred to multimodal AI as the technology to watch over the next decade. The question is no longer whether AI will become more capable. That trend is already well underway.
The issue is what kind of presence it becomes — and whose interests shape that presence.
Jarvis works not just because he is powerful, but because he is an extension of Stark’s intent, with no agenda of his own.
That idea endures because it represents the idealised promise of AI: pure augmentation without hidden costs. As language models evolve to operate alongside vision, audio, and persistent memory, we are approaching the technical capability required for a real-life Jarvis. But capability is not the same as alignment.
The Jarvis Gap
Jarvis is singular by design. He has one user, one context, and one definition of success. There is zero ambiguity about who he serves.
Real-world AI operates under fundamentally different conditions. Today’s most advanced models are built, deployed, and governed by massive private entities—OpenAI, Google, Anthropic, Meta—that exist within fiercely competitive markets. Their incentives aren't inherently malicious, but they are structural. They are driven by the imperatives of growth, market share, capital efficiency, and platform dominance. If an AI ecosystem’s success is ultimately measured by engagement metrics or service revenue, its systems will inevitably evolve to prioritise those outcomes over the deeply individual, nuanced welfare of a single user.
This is what I call the “Jarvis Gap”:
the immense chasm between an AI that exists to maximise individual human capability, and one that exists within an ecosystem optimised for scale and profit.
We have seen this movie before! Think about social media.
Social media platforms did not begin with the intention to distort public discourse or amplify division. They began with promises of connection, voice, and participation. The harms that followed were not the result of a single moral failure, but of systems steadily optimised for the engagement that drives revenue for the organisations behind them. Over time, those incentives shaped content, behaviour, and even how people interpret reality itself.
The lesson is not that technology inevitably causes harm. The lesson is that systems optimise toward what they are rewarded for, often in ways that are subtle at first and deeply entrenched later.
This distinction matters because multimodal AI changes the nature of influence.
Social platforms primarily mediated information flow. Multimodal AI assistants increasingly mediate perception, prioritisation, and action. They determine what is surfaced, what is backgrounded, what feels urgent, and what can safely be ignored. At that level of integration, influence becomes continuous rather than discrete, ambient rather than explicit. Trust can no longer be just a brand promise; it must be a structural property. If we cannot rely on corporate altruism to build our Jarvis, we must look to the structural guardrails that define how these systems are allowed to exist.
Governance, Guardrails, and the “Right” Path Forward
So how do we bridge the “Jarvis Gap”? How do we de-risk the evolution and development of multimodal AI that is as capable as the fictional Jarvis?
When discussions turn to AI risk, they often collapse into two approaches:
- innovation must be left untouched so that progress is not slowed, or
- the technology must be tightly constrained to prevent harm.
Both framings miss the point. The real challenge is not whether AI should be governed, but how restraint is architected and enforced in a complex system.
Relying solely on corporate self-regulation in a winner-take-all market is naive. Conversely, expecting slow-moving government bodies to micromanage rapidly evolving models and their weights is equally unrealistic. What is required is layered governance: an approach based on the assumption that no single safety mechanism is sufficient on its own.
This is often visualised in high-stakes industries as the "Swiss Cheese Model" of risk management. Every layer of defence, whether it is a technical constraint, a company policy, or a government regulation, has "holes", or flaws. No single layer is perfect. But when you stack the layers, the holes do not align, creating a resilient barrier against systemic failure. Each layer can fail; the system remains safe only if failure does not cascade. The figure below illustrates this layered defence.
The "Swiss Cheese" Model - Borrowed from aviation and healthcare, this model posits that in any complex system, safety is achieved through multiple redundant layers of defence. Each layer has "holes" (vulnerabilities), but as long as the holes do not align, the risk is stopped before it reaches the human user.
These layers are not hierarchical in the sense of top-down control. They are peer layers, each addressing a different failure mode. A pyramid implies dependency. What is needed here is redundancy.
These layers are already taking shape in the form of laws and frameworks that are defining the "rules of engagement" for multimodal systems.
- The Societal Norms Layer
- The Regulatory Layer: the EU AI Act
- The Framework Layer: the NIST AI Risk Management Framework (RMF)
- The Technical Layer
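To make the defence-in-depth idea concrete in software terms, here is a minimal, illustrative sketch in Python. The guard functions and the flags they inspect are hypothetical stand-ins for the layers listed above, not checks drawn from any real framework; the point is simply that an action proceeds only when every independent layer permits it, so a single layer's "hole" does not doom the whole system.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Decision:
    allowed: bool
    reason: str

# Each "slice" is an independent check with its own blind spots ("holes").
def societal_norms_layer(action: dict) -> Decision:
    # e.g. community standards and professional codes of conduct
    if action.get("deceptive", False):
        return Decision(False, "violates a norm against deception")
    return Decision(True, "no norm violation detected")

def regulatory_layer(action: dict) -> Decision:
    # e.g. statutory rules such as prohibited-practice lists
    if action.get("covert_profiling", False):
        return Decision(False, "blocked by a regulatory prohibition")
    return Decision(True, "no regulatory issue detected")

def framework_layer(action: dict) -> Decision:
    # e.g. organisational risk-management checks (map, measure, manage)
    if action.get("risk_score", 0.0) > 0.8:
        return Decision(False, "exceeds the organisation's risk appetite")
    return Decision(True, "within assessed risk tolerance")

def technical_layer(action: dict) -> Decision:
    # e.g. runtime guardrails built into the system itself
    if action.get("exports_raw_sensor_data", False):
        return Decision(False, "raw sensor data may not leave the device")
    return Decision(True, "no technical constraint triggered")

LAYERS: List[Callable[[dict], Decision]] = [
    societal_norms_layer, regulatory_layer, framework_layer, technical_layer,
]

def is_permitted(action: dict) -> Decision:
    """An action proceeds only if every independent layer allows it.
    Any single layer may miss a problem; the stack fails only if all do."""
    for layer in LAYERS:
        decision = layer(action)
        if not decision.allowed:
            return decision
    return Decision(True, "permitted by all layers")

if __name__ == "__main__":
    risky_action = {"covert_profiling": True, "risk_score": 0.4}
    print(is_permitted(risky_action))  # stopped by the regulatory layer
```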
Comparative Safety: Lessons from High-Impact Industries
This level of scrutiny might seem like a burden on innovation, but it is exactly how we handle every other technology that has the power to fundamentally alter human life.
- Aviation: We don't fly because we "trust" the pilot’s intentions. We fly because of a redundant stack of certifications, black-box logging, and international safety standards that ensure a single point of failure (a "hole" in the cheese) doesn't bring down the plane.
- Pharmaceuticals: Before a drug reaches the market, it goes through preclinical testing, three phases of clinical trials, and external peer review. The drug development lifecycle takes years, if not decades, before safety and efficacy are established.
- Finance: Banks are governed by strict capital requirements and independent audits. Numerous checks and balances are applied to financial accounts and transactions to keep an individual’s money safe and to prevent money laundering.
These are just a few examples of the many domains where society decided that a technology was too consequential to be governed by market incentives alone. In my opinion, multimodal AI fits the bill.
Jarvis on the Edge - a plausible design approach?
Let’s shift our focus to the technical layer of this multi-layered approach to managing multimodal AI risk. This is where governance and architecture intersect. A real-world Jarvis does not need to exist as a single, all-seeing cloud intelligence. A more plausible, and more restrained, design builds on something already familiar: personal devices.
In this model, the AI’s primary intelligence layer lives close to the individual — distributed across hardware they own and control. Most perception and interpretation happens locally. The system understands context without exporting raw experience. Vision, audio, and behavioural signals are processed on-device wherever possible.
Edge AI - Edge AI refers to processing artificial intelligence algorithms directly on a local device (like a smartphone or a smart watch) rather than in a centralised cloud server. This keeps sensitive data under the user's physical control and significantly reduces latency and privacy risks.
When advanced reasoning is required, the system reaches outward — selectively. Queries are abstracted. Personally identifiable information (PII) is removed. The cloud functions as a reasoning utility rather than a continuous observer. Crucially, this boundary is enforced technically, not contractually.
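As a rough illustration of that boundary, the sketch below shows a hypothetical edge-first flow: perception runs on the device, the request sent outward is an abstracted summary with obvious identifiers stripped, and the raw audio and video never leave. The function names, the redaction rules, and the request format are all assumptions made for illustration, not a real product API.

```python
import re
from dataclasses import dataclass

@dataclass
class LocalContext:
    transcript: str      # produced by on-device speech recognition
    scene_labels: list   # produced by on-device vision models

# Hypothetical on-device perception; in practice this would call local
# speech and vision models running on the device's neural processor.
def perceive_locally(audio_bytes: bytes, frame_bytes: bytes) -> LocalContext:
    transcript = "remind me to email Dr. Sharma at 4pm about the lease"
    scene_labels = ["home_office", "laptop_open"]
    return LocalContext(transcript, scene_labels)

# Strip obvious identifiers before anything leaves the device.
# Real PII redaction is much harder; this only marks the boundary's intent.
def redact(text: str) -> str:
    text = re.sub(r"\b(dr|mr|ms|mrs)\.?\s+\w+", "<PERSON>", text, flags=re.I)
    text = re.sub(r"\b\d{1,2}(:\d{2})?\s?(am|pm)\b", "<TIME>", text, flags=re.I)
    return text

# The cloud sees an abstracted reasoning request, never raw audio or video.
def build_cloud_request(ctx: LocalContext) -> dict:
    return {
        "task": "draft_reminder",
        "abstract_context": redact(ctx.transcript),
        "scene_hint": "indoor_workspace",  # coarse label, not the raw frames
    }

if __name__ == "__main__":
    ctx = perceive_locally(b"...", b"...")
    print(build_cloud_request(ctx))
    # {'task': 'draft_reminder',
    #  'abstract_context': 'remind me to email <PERSON> at <TIME> about the lease',
    #  'scene_hint': 'indoor_workspace'}
```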
How Edge Design Fulfils the EU AI Act
The EU AI Act’s biggest concern is the centralisation of power and the risk of "Profiling" without consent. An Edge-First design addresses this by keeping the most sensitive data—the raw multimodal streams—off the cloud.
- Data Minimisation: By processing vision and audio locally, the system adheres to the Data Governance requirements of Article 10. The corporation never sees the raw data; they only see the anonymised "reasoning request" sent to the cloud.
- Human Agency: Because the "brain" lives on hardware the user physically owns (their smartphone or dedicated AI wearable), the Human-in-the-Loop requirement is easier to enforce. The user has physical "kill-switch" control over the sensors, fulfilling the Act's demand for effective oversight; a minimal sketch of such a gate follows below.
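As a minimal sketch of what architecturally enforced oversight could look like, the code below assumes a hypothetical sensor gate that the perception pipeline must pass through. When a switch is off, the pipeline simply never receives that stream; nothing here is drawn from a real product.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class SensorGate:
    """User-owned switches that sit between the physical sensors and the AI.
    When a switch is off, the pipeline never receives that stream at all,
    so oversight is enforced by the architecture rather than by a policy."""
    camera_enabled: bool = False
    microphone_enabled: bool = False

    def read_camera(self, capture_frame: Callable[[], bytes]) -> Optional[bytes]:
        return capture_frame() if self.camera_enabled else None

    def read_microphone(self, capture_audio: Callable[[], bytes]) -> Optional[bytes]:
        return capture_audio() if self.microphone_enabled else None

def assistant_step(gate: SensorGate) -> str:
    # The lambdas stand in for real device drivers; they are purely illustrative.
    frame = gate.read_camera(lambda: b"<jpeg bytes>")
    audio = gate.read_microphone(lambda: b"<pcm bytes>")
    if frame is None and audio is None:
        return "idle: no sensors authorised by the user"
    return f"processing locally (camera={frame is not None}, mic={audio is not None})"

if __name__ == "__main__":
    gate = SensorGate(camera_enabled=False, microphone_enabled=True)
    print(assistant_step(gate))  # processing locally (camera=False, mic=True)
```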
Meeting NIST RMF Guardrails through Edge Architecture
The NIST RMF places a heavy emphasis on Security and Resilience. In a cloud-centralised model, a single server breach could expose the private lives of millions.
- Reducing the Blast Radius: Edge AI decentralises the risk. If a cloud server is compromised, the hacker doesn't get a live feed into your living room because that feed never left your device. This fulfils the NIST principle of Secure and Resilient systems.
- Privacy-Enhanced Technologies (PETs): Edge design is the ultimate PET. It ensures Autonomy—a core NIST concern—by preventing the "Quiet Misalignment" that happens when a cloud AI is subtly tuned to maximise its owner's profits at the expense of the user's focus.
By keeping the "context" local and the "reasoning" selective, we create a technical barrier that even the most aggressive corporate incentive can't cross. The privacy isn't just a policy in a 50-page Terms of Service; it's enforced by the architecture itself.
This is not speculative. We already see elements of this approach in modern smartphones, wearables, and edge AI systems. Take Apple Intelligence, for example: most requests are handled with powerful on-device processing, and only tasks that need more capability are sent further. Dedicated neural processors exist precisely to move intelligence closer to the user. Another example is Google’s on-device Gemini Nano, which runs on the phone’s hardware and can work without an internet connection. What is missing is not capability, but the intentional prioritisation of human agency over data extraction.
Such systems would not be cheap (yet), nor would they scale as frictionlessly as advertising-funded platforms. But neither did personal computers or smartphones in their early years. Costs fall. Capabilities improve. What matters is that the economic model does not depend on turning human behaviour and data into a raw material.
Design does not eliminate the need for regulation. Devices can be compromised. Vendors can still exert control through defaults and ecosystems. But architectural restraint narrows the blast radius. Harm becomes more visible. Accountability becomes possible.
Conclusion — Choosing the Boundary
Jarvis is powerful not because he is limitless, but because he is bounded to Tony Stark and his success and well-being. That is the detail fiction gets right, even when it exaggerates everything else.
As AI systems become more present, more capable, and more tightly woven into daily life, the real risk is not runaway intelligence. It is “quiet misalignment”— systems that become indispensable before society has decided what they are allowed to see, infer, and optimise for.
We have already lived through one cycle of technological overreach with social platforms. The difference now is that the next layer of AI technology sits closer to our perception of reality itself.
The future of AI will not be decided by a single breakthrough or a single regulation. It will be shaped by thousands of small choices — architectural, economic, and institutional — made long before the consequences are obvious.
Whether we end up with systems that genuinely augment human capability — or ones that merely optimise around it — depends on whether we choose to treat alignment as a foundational design constraint, rather than a corrective applied after the fact.
Review the Governance Frameworks
The future of our AI depends on these rules. I encourage you to explore these foundational principles shaping our coming digital world.
- EU AI Act Explorer: Understand how our favourite tools are being categorised and what protections we are legally entitled to.
- NIST AI Risk Management Framework: If you are an AI designer, builder, or leader, use this as your North Star for developing trustworthy systems.