This is a short summary of the course Introduction to on-device AI by provided by DeepLearning.ai.

Introduction

Modern smartphones, with their significant computational power of 10 to 30 teraflops, can run multiple AI models simultaneously for applications like real-time semantic segmentation and scene understanding. This course explores creating AI applications that run on-device, applicable to smartphones and billions of other devices such as cameras, robots, drones, and VR headsets. Despite differences in hardware and operating systems, the principles for deploying on-device AI are similar. Initially, models trained in the cloud need to be converted into a compatible format, involving freezing the model into a neural network graph and converting it into an executable format for the specific device.

Devices like smartphones and edge devices usually contain a mix of CPUs, GPUs, and NPUs, enabling performance optimizations tailored to specific hardware. Understanding the target devices can significantly enhance performance, making applications run up to ten times faster. The course also covers tools to ensure consistent model performance across various devices, crucial due to the diversity of over 300 different smartphone types. Additionally, validating numerical correctness prevents discrepancies in model operation due to hardware differences. Quantization, reducing the model's calculation precision, is another important aspect, making apps run several times faster while decreasing model size, improving performance, and reducing the footprint.

Why On-Device

Learning about on-device AI offers several benefits, including reduced latency, improved efficiency, lower costs, and enhanced privacy. Applications of on-device AI are prevalent in real-world scenarios such as real-time speech detection, semantic segmentation, object detection, and physical activity detection. Each time you take a picture with your smartphone, over 20 AI models work in tandem to optimize the image within milliseconds. The industrial IoT sector alone sees an estimated economic impact of around $3 trillion due to these technologies.

On-device AI is integrated into many everyday technologies. For instance, typing on a laptop keyboard or interacting with a smart speaker relies on AI running locally. Robots for delivery and assembly, as well as drones used in industrial and agricultural settings, utilize on-device AI. This technology is also pivotal in advanced driver assistance systems in cars and in editing photos on smartphones and laptops. Various applications across audio, image, video, and sensor data—like text-to-speech, speech recognition, photo classification, and physical activity detection—demonstrate the versatility of on-device AI. Running models on-device is advantageous for several reasons.

It is cost-effective as it leverages local computational resources, thereby eliminating the need for cloud services.
Its efficiency improves as data processing occurs locally, avoiding the delays associated with cloud communication.
Privacy is enhanced since data remains on the device.
Personalization is possible because models can be tailored locally without external data.

Deploying Segmentation Models On-Device

On-device AI has a wide range of applications, including real-time object detection to identify people, faces, and QR codes; speech recognition to convert spoken input into text; and pose estimation to predict human poses from real-time images or video streams. Additionally, it includes image generation from text descriptions, super-resolution to enhance low-resolution images, and image segmentation, which involves breaking down an image into meaningful segments for easier analysis and object identification.

Image segmentation comes in various forms, with the most popular being semantic segmentation and instance segmentation.

Semantic segmentation labels every pixel according to a specific class.
Instance segmentation labels each pixel by class and individual instance within the same category.

The latter is widely used in advanced driver assistance systems to differentiate roads from pedestrians, in image editing software to apply filters or blur backgrounds, and in video conferencing to blur backgrounds during calls. It is also employed in drones for mapping landscapes in agricultural and industrial settings.

Several models are used for semantic segmentation, including ResNet, which employs residual connections for training deep networks, and HRNet, which captures fine-grained details through high-resolution representations. FANet, or Feature Agglomeration Network, aggregates features from different scales for detailed predictions, while DDRNet, or Dual Dynamic Resolution Network, balances efficiency and accuracy with a dual-path architecture.

We will focus on FFNET, or Fuss Free Network, which features a simple encoder-decoder architecture with a ResNet-like backbone and a small multi-scale head. FFNET offers comparable accuracy to more complex networks like HRNet and FANet, but with greater computational efficiency, making it ideal for on-device deployment. Its configurability allows for various encoder and decoder sizes to suit different environmental and accuracy needs.

Preparing for On-Device Deployment

The first step is neural network graph capture, which involves capturing the computation to be executed on the device. The second step is on-device compilation. The third step focuses on accelerating models using the device's hardware. The fourth step emphasizes validating the numerical correctness of the model on the device.

There are three popular runtimes for on-device deployment. TensorFlow Lite is recommended for Android applications. ONNX Runtime is suitable for Windows-based applications. The Qualcomm AI Engine is ideal for fully embedded applications on Qualcomm's hardware.

Let's delve into the TensorFlow Lite runtime. Specifically designed for mobile platforms and embedded devices, TensorFlow Lite delivers extremely efficient performance with fast response times by minimizing computation overhead. Its flexibility allows deployment on smartphones and various IoT devices, making it highly portable and ubiquitous. TensorFlow Lite is also energy-efficient, consuming less power. It supports hardware acceleration and is fully compatible with neural processing units (NPUs) through a mechanism called delegation.

Quantizing Models

Quantization reduces the precision of your model to speed up computation and decrease model size. This process can make models up to four times smaller and faster. The three main benefits of quantization are: reduced model size for better storage on devices with limited capacity, faster processing due to fewer computations, and lower power consumption, which is crucial for battery-operated devices.

To understand quantization, consider a floating point tensor, which takes 32 bits per value. Quantization converts this to an integer representation with 8 bits per value, using a scale and zero point for accurate translation. The goal is to minimize error while converting from floating point to integer, applicable to various scenarios.

There are two main types of quantization:

Weight quantization, which reduces the precision of model weights to optimize storage.
Activation quantization, which applies lower precision to activation values to accelerate inference using lower precision numerics.

Different levels of quantization exist, such as:

W8int8, converting both weights and activations to 8-bit.
W8int16, converting weights to 8-bit and keeping activations at 16-bit.
Extreme quantization like W4 (weights at 4-bit and activations at 16-bit) is popular for large language models and generative AI applications.

Quantization methods include:

Post-training quantization
Quantization-aware training

Post-training quantization applies the process after training, using calibration with sample data to minimize quantization error. This typically requires a few hundred samples.

Quantization-aware training incorporates quantization into the training process, ensuring the model learns integer values, which can yield more accurate models when post-training quantization is insufficient. This approach integrates quantization into the training process, learning the model weights and integer representation simultaneously.

Device Integration

The following outlines the steps for integrating an AI model into a smartphone application for real-time segmentation at 30 frames per second. Here is a summary of the process:

Understanding Data Transformation: The data flow begins with the camera stream, which provides frames in either RGB or YUV format. The GPU is used for pre-processing to convert YUV to RGB and downsample to the model's required resolution (e.g., from 720p to 224x224).
Device Integration Process: Integration involves several stages:
- Camera Stream Extraction: Extract frames at 30 frames per second in either RGB or YUV format.
- Pre-processing: Utilize the GPU and OpenCV for efficient conversion and downsampling of the frames to match the model's resolution.
- Model Inference: Run the AI model on the device's Neural Processing Unit (NPU) using runtime APIs (C++ or Java-based) for optimal performance. The model outputs a mask of the same resolution as it was trained on (e.g., 224x224).
- Post-processing: Use the GPU and OpenCV again for tasks like upsampling, smoothing, thresholding, and blurring to overlay the output mask onto the camera stream.
- Packaging Runtime: Ensure all runtime dependencies, including the AI models, are packaged within the application to leverage hardware acceleration effectively.
Implementation: There are five key parts:
- Camera Stream Extraction: Capture frames from the camera.
- Pre-processing: Convert and downsample frames using the GPU.
- Model Inference: Perform inference on the NPU.
- Post-processing: Overlay the model's output on the camera stream using GPU-based techniques.
- Packaging: Include all necessary runtime dependencies in the application to ensure it runs efficiently on the device. This involves managing Java, native sources, and dependencies within an Android project.