Deep Learning-based Building Footprint Extraction with Missing Annotations

Most state-of-the-art deep learning-based methods for the extraction of building footprints are aimed at designing proper convolutional neural network (CNN) architectures or loss functions able to effectively predict building masks from remote sensing (RS) images. To properly train such CNN models, large-scale and pixel-level building annotations are required. One common approach to obtain scalable benchmark datasets for building segmentation is to register RS images with auxiliary geospatial information, such as that available from OpenStreetMap (OSM). However, due to land-cover changes, urban construction, and delayed geospatial information updating, some building annotations may be missing from the corresponding ground-truth building mask layers. This is likely to confuse CNN models trained to discriminate between background and building pixels. To solve this important issue, we first formulate the problem as a long-tailed classification one. Then, we introduce a new joint loss function based on three terms: 1) a logit adjusted cross entropy (LACE) loss, aimed at discriminating between building and background pixels under a long-tailed label distribution; 2) a weighted Dice loss, aimed at increasing the F1 scores of the predicted building masks; and 3) a boundary alignment loss, optimized for preserving the fine-grained structure of building boundaries. Our experiments, conducted on two benchmark building segmentation datasets, validate the effectiveness of our newly proposed loss with respect to other state-of-the-art losses commonly used for extracting building footprints. The code of this paper will be publicly available at https://github.com/jiankang1991/GRSL_BFE_MA.

Index Terms
Building extraction, semantic segmentation, deep learning, missing labels, remote sensing.

I. INTRODUCTION
Extracting building footprints from high-resolution remote sensing (RS) images has been a fundamental task within the field of intelligent image interpretation. Building footprint maps play an important role in several different tasks, such as urban planning, disaster monitoring, change detection, and autonomous driving. Thus, accurately generating building footprints remains an active and important topic in the RS community. Nowadays, with the rapid development of satellite sensors, massive volumes of high-resolution RS images are available for developing effective building footprint extraction techniques. Moreover, such big data also foster the development of deep learning-based methods for extracting building footprints in an end-to-end manner [1]–[5].
One of the first convolutional neural network (CNN) architectures adopted for the extraction of building footprints was the fully convolutional network (FCN) [6], which replaces fully connected layers with convolutional layers to produce building masks of the same size as the input RS images. Liu et al. proposed an encoder-decoder CNN framework [with a spatial residual inception (SRI) module] for capturing and fusing multi-scale features during building extraction [7]. Based on the feature pyramid network, Wei et al. developed a multiscale aggregation FCN with polygon regularization for refining building boundaries [8]. In order to accelerate the computational performance of an encoder-decoder framework intended to process very large input images, Li et al. introduced a multiple-feature reuse network (MFRN) that enables the direct use of hierarchical features and achieves prominent building segmentation performance [9]. The feature pairwise conditional random field (FPCRF), integrated into a CNN model, was also used for preserving sharp boundaries and fine-grained building segments [10]. By combining a multi-scale feature extraction strategy with attention mechanisms, Zhu et al. proposed a multiple attending path neural network (MAP-Net) that can precisely generate multi-scale building footprints and accurate polygons [11]. Rather than learning building masks, PolygonCNN was proposed to directly generate vector building polygons from an encoder-decoder CNN framework in an end-to-end manner [12]. Another perspective for designing deep learning-based building footprint extraction approaches concerns the loss function. Although most of the above-mentioned methods exploit the cross entropy (CE) loss, some works aim at optimizing the loss design for accurately predicting building regions and boundaries.
For example, Yuan proposed to use the signed distance function, which calculates the distance from the pixels to their nearest points on the boundaries, to accurately capture the building shapes [13]. Wu et al. exploited the boundary loss as a regularizer of the region-based CE loss for extracting building segments and outlines [14]. Bokhovkin et al. introduced a differentiable (surrogate) loss for penalizing the misalignment of building boundaries [15].
All the above-mentioned methods for building footprint extraction require accurate building area annotations. Although unsupervised solutions like [16] can achieve remarkable building footprint extraction from aerial RS data in an efficient way, most building segmentation benchmark datasets are constructed by geo-registering RS images with auxiliary geospatial information, e.g., OpenStreetMap (OSM), in order to avoid the expensive and time-consuming human labeling procedure. Under this scenario, however, missing annotations (Figure 1) may often appear in the corresponding ground-truth building mask layer due to several reasons, including land-cover changes, urban construction, delayed updating, or even low-quality volunteered geographic information (VGI). All these factors may confuse trained CNN models when discriminating between background and building pixels. To address these issues, we first formulate the problem as a long-tailed classification one. Then, we introduce a new joint loss function that accounts for the possible existence of missing building annotations in the dataset. Our newly developed loss function includes three terms: 1) a logit adjusted cross entropy (LACE) loss, aimed at discriminating between building and background pixels under a long-tailed label distribution; 2) a weighted Dice loss, aimed at increasing the F1 scores of the predicted building masks; and 3) a boundary alignment loss, optimized for preserving the fine-grained structures of building boundaries. Our newly proposed loss is evaluated on two benchmark datasets, outperforming other state-of-the-art competitors.
The contributions of this letter can be summarized as follows: 1) To the best of our knowledge, this is the first paper in the literature that investigates the problem of deep learning-based building segmentation with missing annotations, approaching it from the perspective of designing an effective loss function to specifically deal with this issue.
2) We formulate the task as a long-tailed classification problem and then introduce a new joint loss function.
3) Compared with other state-of-the-art methods, the proposed loss function achieves the best performance on two widely used benchmarks.
The remainder of this letter is organized as follows. Section II describes the proposed joint loss for guiding the optimization of CNN models when the input dataset contains missing annotations.
Section III describes the conducted experiments and analyzes the results. Finally, Section IV concludes the letter with some remarks and hints at plausible future research lines.

A. Notations
Let X = {X_1, · · · , X_N} denote a building extraction dataset consisting of N RS images, and Y = {Y_1, · · · , Y_N} be the associated set of binary masks, where each pixel label satisfies y_ij ∈ {0, 1}. In this letter, 1 denotes the building class and 0 the background class. f(·) represents the CNN model that maps an input image X_i to the predicted building mask Ŷ_i.
B. The Proposed Joint Loss Function
1) Logit Adjusted Cross Entropy (LACE): When building annotations are missing, an imbalanced, long-tailed label distribution emerges. As an illustrative example, we randomly select 30%, 50%, and 90% of the buildings from the well-known Massachusetts Buildings dataset [17] and flip the associated building labels to background labels. Then, we calculate the pixel percentages of the two classes. As shown in Figure 2(a), as the number of missing building annotations increases, the long-tailed label distribution becomes more pronounced. Under such a distribution, the CE loss conventionally used for training CNN models will simply guide the models to classify every pixel with the majority label, i.e., background, and such models cannot generalize well in the testing phase. To cope with this issue, we adopt the LACE loss [18]:

\mathcal{L}_{\mathrm{LACE}} = -\log \frac{e^{f_y(\mathbf{x}) + \tau \log \pi_y}}{\sum_{y' \in \{0,1\}} e^{f_{y'}(\mathbf{x}) + \tau \log \pi_{y'}}},

where f_y(x) is the predicted score (logit) for class y, \tau denotes a temperature parameter, and \pi_y is an estimate of the class prior P(y), e.g., the empirical class frequency on the training set. Basically, LACE enforces a pairwise label margin \tau \log(\pi_{y'}/\pi_y) between the predicted scores for classes y and y', as shown in Figure 2(b).
2) Weighted Dice: In addition to the commonly utilized CE loss for binary segmentation problems, another region-based loss, the Dice loss, is also adopted. As opposed to the CE loss, which aims at optimizing the precision scores, minimizing the Dice loss is designed to increase the F1 score:

\mathcal{L}_{\mathrm{Dice}} = 1 - \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FN} + \mathrm{FP}},

where TP, FN, and FP respectively denote the true positives, false negatives, and false positives of the predicted mask given the ground truth. Considering that we are dealing with a long-tailed label distribution, we utilize the weighted Dice loss, wherein the class-wise Dice losses are averaged in a weighted manner:

\mathcal{L}_{\mathrm{WDice}} = \sum_{y \in \{0,1\}} w_y \left(1 - \frac{2\,\mathrm{TP}_y}{2\,\mathrm{TP}_y + \mathrm{FN}_y + \mathrm{FP}_y}\right),

where w_y is the weight assigned to class y (e.g., inversely related to the class prior \pi_y).
3) Boundary Alignment: Accurate building boundary generation is very important for footprint extraction. However, the two losses above are region-based and cannot penalize boundary misalignment.
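As a concrete illustration of the two losses above, the following is a minimal NumPy sketch of a per-pixel LACE loss and a weighted Dice loss over flattened pixel labels. This is not the authors' implementation: the function names, the prior estimate, and the example class weights are assumptions made for illustration only.

```python
import numpy as np

def lace_loss(logits, labels, priors, tau=1.0):
    """Logit adjusted cross entropy (LACE): softmax CE computed on logits
    shifted by tau * log(pi_y), which enforces a pairwise margin
    tau * log(pi_y' / pi_y) between class scores."""
    adjusted = logits + tau * np.log(priors)            # (N, C) + (C,)
    adjusted -= adjusted.max(axis=1, keepdims=True)     # numerical stability
    log_probs = adjusted - np.log(np.exp(adjusted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def weighted_dice_loss(probs, labels, weights):
    """Weighted Dice loss: class-wise terms 1 - 2TP/(2TP + FN + FP),
    averaged with per-class weights w_y."""
    eps, loss = 1e-7, 0.0
    for y, w in enumerate(weights):
        p, g = probs[:, y], (labels == y).astype(float)
        tp = (p * g).sum()
        # p.sum() = TP + FP (soft counts), g.sum() = TP + FN
        dice = (2 * tp + eps) / (p.sum() + g.sum() + eps)
        loss += w * (1.0 - dice)
    return loss / sum(weights)

# the priors pi_y can be estimated as empirical pixel frequencies, e.g.:
toy_labels = np.array([0, 0, 0, 0, 1])                  # toy flattened pixel labels
toy_priors = np.bincount(toy_labels, minlength=2) / len(toy_labels)
```

Shifting the logits by τ log π_y before the softmax is equivalent to the margin-based view above, which is why rare (building) pixels are no longer overwhelmed by the background majority.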
In order to align the predicted building boundaries with the ground truth, the boundary loss proposed in [15] is also adopted:

\mathcal{L}_{\mathrm{BA}} = 1 - \frac{1}{2} \sum_{y \in \{0,1\}} \frac{2\, P_y R_y}{P_y + R_y},

where P_y and R_y denote the precision and recall scores of the boundary pixels of class y.
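The boundary term can be illustrated with a simple, non-differentiable NumPy sketch of the boundary-F1 measure that the surrogate loss of [15] relaxes. Boundary extraction by 4-neighbour checks and the pixel tolerance band `theta` are assumptions of this sketch, not the letter's implementation.

```python
import numpy as np

def boundary(mask):
    """Boundary pixels: foreground pixels with >= 1 background 4-neighbour."""
    padded = np.pad(mask, 1, mode="constant")
    nb_min = np.minimum.reduce([padded[:-2, 1:-1], padded[2:, 1:-1],
                                padded[1:-1, :-2], padded[1:-1, 2:]])
    return (mask == 1) & (nb_min == 0)

def dilate(b, theta):
    """Grow a boolean boundary map by `theta` pixels (tolerance band)."""
    out = b.copy()
    for _ in range(theta):
        p = np.pad(out, 1, mode="constant")
        out = out | p[:-2, 1:-1] | p[2:, 1:-1] | p[1:-1, :-2] | p[1:-1, 2:]
    return out

def boundary_f1_loss(pred, gt, theta=1):
    """1 - BF1 between predicted and ground-truth building boundaries,
    i.e. 1 - 2 P R / (P + R) for the building class."""
    bp, bg = boundary(pred), boundary(gt)
    eps = 1e-7
    precision = (bp & dilate(bg, theta)).sum() / (bp.sum() + eps)   # P_y
    recall    = (bg & dilate(bp, theta)).sum() / (bg.sum() + eps)   # R_y
    return 1.0 - 2 * precision * recall / (precision + recall + eps)
```

Perfectly aligned boundaries give a loss near 0, while completely disjoint ones give a loss near 1; the trainable version in [15] replaces the hard boundary and dilation operations with differentiable pooling.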
Overall, when the benchmark datasets contain missing annotations, the proposed joint loss function for building footprint extraction is formulated as:

\mathcal{L} = \mathcal{L}_{\mathrm{LACE}} + \lambda_1 \mathcal{L}_{\mathrm{WDice}} + \lambda_2 \mathcal{L}_{\mathrm{BA}},

where \lambda_1 and \lambda_2 are trade-off parameters balancing the contributions of the three terms.
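Assuming the three terms are combined as a weighted sum (the balancing weights are an assumption here; equal weighting recovers a plain sum), the combination itself is a one-liner:

```python
def joint_loss(l_lace, l_wdice, l_ba, lam1=1.0, lam2=1.0):
    """Joint loss L = L_LACE + lam1 * L_WDice + lam2 * L_BA.
    lam1 and lam2 are assumed balancing weights; lam1 = lam2 = 1
    reduces to a plain sum of the three terms."""
    return l_lace + lam1 * l_wdice + lam2 * l_ba

# combining three already-computed scalar loss terms
total = joint_loss(0.42, 0.30, 0.15)
```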

C. CNN model
In this letter, the standard U-Net [19] architecture is exploited for building footprint extraction. U-Net fuses multi-level feature maps to simultaneously capture hierarchical semantics and preserve fine-grained object shapes in the predicted masks. We choose U-Net as the backbone since it has been widely adopted as a benchmark CNN model for binary segmentation problems. It is worth noting, however, that other CNN architectures can also be combined with the proposed loss function for building footprint extraction.

A. Experimental Setup
We evaluate the proposed loss function on two building segmentation datasets: 1) the Massachusetts Buildings dataset [17], and 2) the ISPRS Potsdam dataset.

IV. CONCLUSIONS
This letter presents a new joint loss function for deep learning-based extraction of building footprints under the assumption that many buildings are not annotated.
In order to solve this problem, we first investigate the label distribution when there are missing annotations at different levels. Then, we formulate the problem as a long-tailed classification one, and propose a joint loss function including: 1) LACE; 2) weighted Dice; and 3) boundary alignment loss to optimize the CNN model and better predict region and boundary pixels.
Based on two building segmentation benchmark datasets, we validate the proposed loss function against other state-of-the-art approaches, achieving the best performance when 30%, 50%, and 90% of the buildings are missing from the training sets. The proposed joint loss function can be applied with any CNN architecture for binary segmentation problems when there are missing annotations. In future work, robust deep learning techniques [23] that treat missing annotations as noisy labels will be investigated.