J[AR]VIS
Intelligent AR assistant for real-time insights
Overview
The emergence and eventual mass adoption of AR technologies will create an explosion of data, as our daily lives are recorded and our eye gaze and correlated attention are tracked by on-headset sensors. This data can be used to understand how we think, what we care about, what we desire, and what could help improve our lives. This project aims to leverage the massive amounts of data provided by everyday AR glasses to create a "superintelligent" AR assistant by combining cognitive architectures and large language models.
J[AR]VIS is a novel educational system that provides information, insights, and visualizations to users both automatically and upon explicit query. The system leverages powerful AI and AR technology to semantically understand the user's environment, using computer vision algorithms and microphone data, and uses a language model to answer virtually any question the user asks. As an AR experience, 3D visualizations of objects and data can be automatically coded by the AI. One can simply look at anything in their world, ask a question about it, and receive an intelligent answer. Because the system is conversational, knowledge can be built on from past queries, making it a highly personal, high-bandwidth educational tool.
Goals
- Create an AR system that users can ask questions of and receive intelligent answers from.
- Develop NLP and computer vision techniques that enable understanding and Q/A interactions.
- Develop a cognitive architecture for the AR assistant that controls information processing and memory.
- Create visual and audio feedback in addition to text responses from the assistant.
- Test the device in various environments to evaluate and optimize performance.
Outcomes
- We developed a proof of concept in which users can ask J[AR]VIS any question about the environment around them and receive an answer.
- The system involves passing a picture of the scene and the user's transcribed speech to a server that uses a multimodal ML model to construct an intelligent response (a rough sketch of this exchange follows below).
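As a rough illustration of that client-to-server exchange, the sketch below posts a scene photo and the transcribed question to the server and reads back the generated answer. The `/ask` endpoint, field names, and `ask_jarvis` helper are illustrative assumptions, not the project's actual API.

```python
# Minimal sketch of the client-to-server exchange. The endpoint URL and
# JSON field names are assumptions made for illustration only.
import base64
import requests

SERVER_URL = "http://localhost:8000/ask"  # hypothetical server address

def ask_jarvis(image_path: str, transcript: str) -> str:
    """Send a scene photo and the user's transcribed question to the server."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    payload = {
        "image": image_b64,      # photo of the user's surroundings
        "question": transcript,  # speech already transcribed on-device
    }
    response = requests.post(SERVER_URL, json=payload, timeout=30)
    response.raise_for_status()
    return response.json()["answer"]

if __name__ == "__main__":
    print(ask_jarvis("scene.jpg", "What kind of plant is on the desk?"))
```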
How We Built It
The application was built with Unity and MRTK for the HoloLens 2. On the frontend, the application listens to user speech and transcribes it to text. When the user asks a question, the application sends the question, along with a photo of the surrounding environment, to our server. On the server, we use Azure's dense captioning system to generate descriptions of objects in the scene and then use chat conversational agents with built-in memory to generate responses based on the visual and text input. Our cognitive architecture uses LangChain to determine which models to use (e.g., understanding the contents of an image or searching the internet for more information).
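The server-side pipeline might look roughly like the following Python sketch, which requests dense captions from Azure's Image Analysis REST API and exposes them as a tool to a LangChain chat-conversational agent with buffer memory. The REST parameters, environment-variable names, model choice, and the `answer_question` helper are illustrative assumptions; the actual implementation may differ.

```python
# Hedged sketch of the server-side pipeline: Azure dense captions feed a
# LangChain chat-conversational agent that can also search the web.
# Env-var names, model choice, and helper names are assumptions.
import os
import requests
from langchain.agents import AgentType, Tool, initialize_agent
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationBufferMemory
from langchain.tools import DuckDuckGoSearchRun  # requires duckduckgo-search

AZURE_ENDPOINT = os.environ["AZURE_VISION_ENDPOINT"]  # Azure AI Vision resource URL
AZURE_KEY = os.environ["AZURE_VISION_KEY"]

def dense_captions(image_bytes: bytes) -> list[str]:
    """Ask Azure Image Analysis for dense captions of regions in the scene."""
    resp = requests.post(
        f"{AZURE_ENDPOINT}/computervision/imageanalysis:analyze",
        params={"api-version": "2023-10-01", "features": "denseCaptions"},
        headers={
            "Ocp-Apim-Subscription-Key": AZURE_KEY,
            "Content-Type": "application/octet-stream",
        },
        data=image_bytes,
    )
    resp.raise_for_status()
    return [c["text"] for c in resp.json()["denseCaptionsResult"]["values"]]

# Conversation memory persists across turns so knowledge builds on past queries.
llm = ChatOpenAI(model_name="gpt-4", temperature=0)
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

def make_agent(scene_captions: list[str]):
    """Build a conversational agent that can consult the scene or the web."""
    tools = [
        Tool(
            name="scene-description",
            func=lambda _: "; ".join(scene_captions),
            description="Describes the objects currently visible to the user.",
        ),
        DuckDuckGoSearchRun(),  # stand-in for an internet search tool
    ]
    return initialize_agent(
        tools,
        llm,
        agent=AgentType.CHAT_CONVERSATIONAL_REACT_DESCRIPTION,
        memory=memory,
        verbose=True,
    )

def answer_question(image_bytes: bytes, question: str) -> str:
    """End-to-end step: caption the scene, then let the agent answer."""
    agent = make_agent(dense_captions(image_bytes))
    return agent.run(question)
```

One design note implicit in this sketch: handing the scene description to the agent as a tool, rather than baking it into the prompt, lets the agent decide per query whether visual context or an internet search is the right source of information.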
Future Plans
We are considering running research studies to investigate the effectiveness of this system in educational contexts, as well as building a more fully featured version that produces real-time visualizations and internet search results.