Chaitat Utintu
Academic and research departments
Centre for Vision, Speech and Signal Processing (CVSSP), School of Computer Science and Electronic Engineering

About
I’m currently pursuing a PhD in AI at the University of Surrey, as part of the UKRI-funded CDT in AI for Digital Media Inclusion. My research is based at SketchX, CVSSP, and is supported by the CoSTAR National Lab. I'm working under the supervision of Professor Yi-Zhe Song and Dr. Pinaki Nath Chowdhury.
Previously, I completed my MSc in AI — also at the University of Surrey — with Distinction. My thesis explored the problem of sketch colourisation, conducted under the same supervision.
Before starting my PhD, I worked as a machine learning engineer at AI and Robotics Ventures (ARV), a subsidiary of PTTEP. There, I led computer vision projects for autonomous offshore missions and safety control measures, while also handling project management, MLOps, and model optimization for edge deployment.
Research

Research interests
My research interests lie in computer vision and generative modeling, where I am particularly fascinated by how machines can learn to create, transform, and understand visual data. I focus on image and video generation, exploring new ways for AI to extend human creativity, and on sketch-based applications that turn simple drawings into powerful tools for design and communication. I am also motivated by the challenges of applying vision systems in robotics, enabling machines to perceive and interact effectively with the physical world.
Publications
With the development of robotic technology, autonomous robots have been extended to production industries to substitute for manual tasks such as routine operations. In general manufacturing, analog gauges are the most commonly used instruments and require operators for manual reading. Accordingly, analog gauge reading can be considered a fundamental capability for operator robots to become fully automated for inspection purposes. This paper presents methods for reading multiple analog gauges automatically using a camera. The processing pipeline consists of two main stages: 1) a gauge detector for extracting individual gauges and 2) a gauge reader for estimating gauge values. For the gauge detector, we propose three different YOLOv5 architecture sizes. The gauge readers are categorized into a computer-vision (CV) approach and deep learning regression approaches. The deep learning approaches consist of two CNN-based backbones, ResNet50 and EfficientNetV2B0, and one transformer-based backbone, Swin Transformer. Finally, we examine the feasibility of each combination of gauge detector and reader. The YOLOv5m detector with the EfficientNetV2B0 CNN backbone reader theoretically achieves the best performance but is not practical for industrial applications. In contrast, we introduce the YOLOv5m detector with the CV method as the most robust multiple-gauge reader: it reaches performance comparable to the EfficientNetV2B0 backbone while being more compatible with robotic applications.
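For illustration, below is a minimal Python sketch of the two-stage idea described in this abstract, assuming an off-the-shelf YOLOv5m model loaded via torch.hub as a stand-in for the trained gauge detector, and a simple Hough-line needle-angle estimator as the CV reader. The input file name and the calibration constants (scale angles and value range) are hypothetical and would differ per gauge type; this is a sketch, not the paper's implementation.

import cv2
import numpy as np
import torch

# Stage 1: gauge detector. An off-the-shelf YOLOv5m from torch.hub stands in
# for the detector trained in the paper.
detector = torch.hub.load('ultralytics/yolov5', 'yolov5m', pretrained=True)

def read_gauge_cv(crop, min_angle=45.0, max_angle=315.0,
                  min_value=0.0, max_value=10.0):
    """Stage 2 (CV reader): estimate the value from the needle angle.
    The angle and value ranges are hypothetical per-gauge calibration."""
    gray = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=60,
                            minLineLength=crop.shape[0] // 3, maxLineGap=5)
    if lines is None:
        return None
    # Take the longest detected segment as the needle.
    x1, y1, x2, y2 = max(lines[:, 0],
                         key=lambda l: np.hypot(l[2] - l[0], l[3] - l[1]))
    angle = (np.degrees(np.arctan2(y2 - y1, x2 - x1)) + 360.0) % 360.0
    # Map the needle angle linearly onto the gauge scale.
    frac = np.clip((angle - min_angle) / (max_angle - min_angle), 0.0, 1.0)
    return min_value + frac * (max_value - min_value)

image = cv2.imread('gauge_panel.jpg')                 # hypothetical input image
rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
for x1, y1, x2, y2, conf, cls in detector(rgb).xyxy[0].tolist():
    crop = image[int(y1):int(y2), int(x1):int(x2)]
    print('estimated gauge value:', read_gauge_cv(crop))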
Offshore oil and gas operations are highly safety-critical and often require workers to be stationed on the platform to control ongoing processes. The cost of transportation is high and the working environment is dangerous, so it is natural for certain operations to be automated by autonomous robots. One such operation is maneuvering valves, a common and important task well suited to robotics. To achieve this goal, the robot needs the ability to perceive the location and orientation of the target valve. In this paper, we propose a computer vision method based on RGB-D image inputs that provides the robot with the real-world spatial information of valves. In detail, we first exploit YOLACT as an instance segmentation method to precisely extract valves from the background, then pass the segmented valve pixels to a 6D pose estimation algorithm. The resulting 2D pixels and 3D points are used by the DenseFusion algorithm to predict the valve's 6D pose, composed of position and orientation, based on a fusion of RGB and depth features. To evaluate the validity of the proposed method, these algorithms were implemented in ROS2 and tested on an edge device embedded in a real offshore robot system. The results show that the proposed method is promising and could be effectively used by offshore autonomous robots to operate valves.
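As an illustration of the data flow only, a skeleton ROS2 (rclpy) node is sketched below, assuming synchronized RGB and depth topics (the topic names are hypothetical) and using segment_valves and estimate_6d_pose as hypothetical placeholders for the YOLACT and DenseFusion models described in the paper.

import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image
from message_filters import Subscriber, ApproximateTimeSynchronizer
from cv_bridge import CvBridge

def segment_valves(rgb):
    """Hypothetical placeholder for the YOLACT instance segmentation model."""
    raise NotImplementedError

def estimate_6d_pose(rgb, depth, mask):
    """Hypothetical placeholder for the DenseFusion 6D pose estimator."""
    raise NotImplementedError

class ValvePoseNode(Node):
    def __init__(self):
        super().__init__('valve_pose_node')
        self.bridge = CvBridge()
        rgb_sub = Subscriber(self, Image, '/camera/color/image_raw')    # hypothetical topic
        depth_sub = Subscriber(self, Image, '/camera/depth/image_raw')  # hypothetical topic
        self.sync = ApproximateTimeSynchronizer([rgb_sub, depth_sub],
                                                queue_size=10, slop=0.05)
        self.sync.registerCallback(self.on_frames)

    def on_frames(self, rgb_msg, depth_msg):
        rgb = self.bridge.imgmsg_to_cv2(rgb_msg, 'bgr8')
        depth = self.bridge.imgmsg_to_cv2(depth_msg, 'passthrough')
        for mask in segment_valves(rgb):                # per-valve instance masks
            position, orientation = estimate_6d_pose(rgb, depth, mask)
            self.get_logger().info(f'valve pose: {position}, {orientation}')

def main():
    rclpy.init()
    rclpy.spin(ValvePoseNode())
    rclpy.shutdown()

if __name__ == '__main__':
    main()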
Corrosion is one of the biggest problems in industry and can lead to fatal disasters. Investigating corrosion and performing timely maintenance on assets is necessary to prevent corrosion issues; however, onsite investigation by inspectors is time-consuming and dangerous. For that reason, corrosion detection from offshore asset images is necessary. This paper proposes the implementation of a segmentation technique for automatically detecting corrosion damage on critical oil and gas offshore assets. We compare three semantic segmentation architectures, namely UNet, PSPNet, and a vision transformer. The image data was collected by unmanned aerial vehicles (UAVs). The experiment also compares a full-image dataset and a sliced-image dataset of 512 × 512 pixel images. The results are calculated using the F1 score and IoU score of the predicted and annotated masks. The experiments show that ViT-Adapter trained with the full-image dataset achieves the best IoU score and F1 score, 0.8964 and 0.9451 respectively. However, specialist inspectors prefer the results from the slicing experiment, since the sliced predictions offer a more precise corrosion mask.
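For illustration of the full-image versus sliced-image comparison, the short Python sketch below tiles an image into 512 × 512 patches, runs a segmentation model on each tile, stitches the per-tile masks back together, and computes pixel-wise IoU and F1 against an annotated mask. Here `model` is a hypothetical callable returning a binary mask per tile, not the UNet/PSPNet/ViT-Adapter models from the paper.

import numpy as np

def predict_tiled(image, model, tile=512):
    """Run tile-wise segmentation and stitch the per-tile masks back together."""
    h, w = image.shape[:2]
    mask = np.zeros((h, w), dtype=np.uint8)
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            patch = image[y:y + tile, x:x + tile]
            ph, pw = patch.shape[:2]
            padded = np.zeros((tile, tile, 3), dtype=image.dtype)
            padded[:ph, :pw] = patch              # zero-pad border tiles to 512 x 512
            pred = model(padded)                  # hypothetical model: (tile, tile) binary mask
            mask[y:y + tile, x:x + tile] = pred[:ph, :pw]
    return mask

def iou_and_f1(pred, gt):
    """Pixel-wise IoU and F1 between a predicted and an annotated binary mask."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    iou = tp / (tp + fp + fn + 1e-9)
    f1 = 2 * tp / (2 * tp + fp + fn + 1e-9)
    return iou, f1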
An automatic speech recognition (ASR) is more important, especially in the Coronavirus outbreak. ASR for Thai sentence was proposed based on MFCC and CNNs in this research. The MFCC features image created from the Thai speech procedure is explained. The MFCC image is treated as a normal image. Object detection techniques based on CNNs can be used to detect Thai words in the frequency image. You Only Look Once (YOLO) is selected as the word localizer and classifier due to its performance and accuracy. The airport service scenario is explored in this research in order to obtain the performance of the proposed system. The airport information system is selected for the experiments. Speeches were collected from 60 participants with 50% males and 50% females. Speech images are constructed based on MFCC and labeled for specific Thai keywords. The YOLOv3 and Tiny YOLOv3 were trained and the performance was evaluated. Clearly, Tiny YOLOv3 network is good enough for this experiment. New speech data provided from new 20 participants were used to test the proposed system. Resulting in the proposed ASR system based on MFCC and CNNs has a good performance in both speed and accuracy.
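For illustration, a minimal Python sketch of turning an utterance into an MFCC feature image that a YOLO-style detector can consume, assuming librosa for feature extraction; the file name, sampling rate, and number of coefficients are hypothetical choices, not the paper's settings.

import librosa
import matplotlib.pyplot as plt

# Load a speech recording (hypothetical file and sampling rate).
y, sr = librosa.load('utterance.wav', sr=16000)

# Compute MFCC features: shape (n_mfcc, n_frames).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)

# Min-max normalise so the coefficients render as an ordinary image.
img = (mfcc - mfcc.min()) / (mfcc.max() - mfcc.min() + 1e-9)
plt.imsave('utterance_mfcc.png', img, origin='lower', cmap='viridis')

# The saved image can then be annotated with per-word bounding boxes and
# used to train a YOLO-style detector exactly like a natural image.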