Mobile air pollution sensing has emerged as a way to collect air quality data at finer spatial and temporal resolution. However, existing methods struggle to process spatially mixed gas samples because the sensors experience highly dynamic fluctuations, which leads to significant measurement deviations. We identify an opportunity to address this issue by exploiting latent patterns within sensor measurements. To this end, we propose CatUA, a novel city-scale, fine-grained air quality estimation system designed to deliver accurate mobile air quality data. First, we design AirBERT, a representation learning model that discerns mixed gas concentrations from sensor data. Second, we implement a Prompt-informed Training Strategy that leverages large amounts of unlabeled city-scale data together with a small amount of labeled data to improve the performance of CatUA. Notably, its Auto-Prompt mechanism allows CatUA to readily acquire new knowledge tailored to specific downstream tasks. To ensure that CatUA is practical, we developed the full software stack on our custom-built Sensing Front-end, which has gathered more than 1,200 hours of city-scale air quality data. Experiments on the collected data show that CatUA reduces sensing errors by 96.9% with a latency of only 44.9 ms, outperforming the state-of-the-art baseline by 42.6%.