# Deep-Learning-Based Object Detection Frameworks

## 1. A Mobile Outdoor Augmented Reality Method Combining Deep Learning Object Detection and Spatial Relationships for Geovisualization

geovisualization: the visualization of geospatial data


## 1.1 Deep-Learning-Based Object Detection

Figure 1. An overview of R-CNN. The method first takes an image as input, then extracts approximately 2000 region proposals and computes features from each proposal using a deep CNN, and finally classifies the proposals with per-class linear SVMs.
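The three stages in this caption can be sketched as a toy pipeline. Everything below is a simplified stand-in: `propose_regions` slides a fixed window instead of running selective search, `cnn_features` returns the mean intensity instead of deep CNN features, and the "SVMs" are hand-set linear weights, not trained classifiers.

```python
def propose_regions(image, max_regions=2000):
    """Stand-in for selective search: slide a fixed window over the image."""
    h, w = len(image), len(image[0])
    step = max(1, min(h, w) // 2)
    regions = [(x, y, min(x + step, w), min(y + step, h))
               for y in range(0, h, step) for x in range(0, w, step)]
    return regions[:max_regions]

def cnn_features(image, region):
    """Stand-in for the deep CNN: mean intensity of the cropped region."""
    x1, y1, x2, y2 = region
    pixels = [image[r][c] for r in range(y1, y2) for c in range(x1, x2)]
    return [sum(pixels) / len(pixels)]

def rcnn_detect(image, linear_svms):
    """Score every proposed region with each per-class linear SVM (w, b)."""
    detections = []
    for region in propose_regions(image):
        feat = cnn_features(image, region)
        for label, (w, b) in linear_svms.items():
            score = sum(wi * fi for wi, fi in zip(w, feat)) + b
            if score > 0:
                detections.append((region, label, score))
    return detections
```

The point is the data flow, not the components: proposals are computed once per image, features once per proposal, and each class scores every proposal independently.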

Figure 2. An overview of SSD. SSD first takes an image as input, then extracts features by means of a base network (e.g., a truncated VGG-16 network without classification layers) and several additional feature layers to obtain multi-scale feature maps, subsequently obtains initial detection results through multiway classification and box regression using a set of convolutional filters, and finally applies Non-Maximum Suppression (NMS) to eliminate redundant results.
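The NMS step at the end of the SSD pipeline can be illustrated with a minimal greedy implementation; the corner box format `(x1, y1, x2, y2)` and the 0.5 IoU threshold below are common illustrative defaults, not values taken from the paper.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedily keep the highest-scoring box, drop boxes overlapping it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```

In SSD this is run per class over the raw multi-scale detections, so near-duplicate boxes from different feature layers collapse into one result.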

Figure 3. Macro architectural view of SqueezeNet v1.1 (inspired by Figure 2). Processing begins with a convolutional layer (conv1), followed by 8 fire modules (building blocks introduced in SqueezeNet that use far fewer parameters than standard convolutional layers while maintaining competitive accuracy), and ends with a convolutional layer (conv10) and a softmax classifier. SqueezeNet takes as input a 224 × 224 pixel image with 3 colour channels (R, G and B).
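A quick parameter count shows why fire modules are cheap. The fire2 configuration below (a 64-channel input, 16 squeeze filters, and 64 + 64 expand filters) matches the SqueezeNet paper; the comparison against a plain 3 × 3 convolution with the same 128 output channels is my own illustration.

```python
def conv_params(c_in, c_out, k):
    """Weight count of a k×k convolution (biases omitted for simplicity)."""
    return c_in * c_out * k * k

def fire_params(c_in, s1, e1, e3):
    """Fire module: a 1×1 squeeze layer (s1 filters), then parallel 1×1 (e1)
    and 3×3 (e3) expand layers whose outputs are concatenated."""
    return (conv_params(c_in, s1, 1)      # squeeze
            + conv_params(s1, e1, 1)      # 1×1 expand
            + conv_params(s1, e3, 3))     # 3×3 expand

fire = fire_params(64, 16, 64, 64)   # fire2-style module
plain = conv_params(64, 128, 3)      # plain 3×3 conv, same 128 output channels
```

Here `fire` comes to 11,264 weights against 73,728 for the plain convolution, because the cheap 1×1 squeeze shrinks the channel count before the expensive 3×3 filters run.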

sacrifice [ˈsækrɪfaɪs]: n. a giving-up of something valued; an offering — v. to give up (something valued); to offer up; to sell at a loss


Figure 4. The proposed lightweight SSD architecture (inspired by Figure 2). This architecture follows a design similar to that of the original SSD. The main differences are that it takes a 224 × 224 pixel image as input and then uses a truncated SqueezeNet (rather than VGG-16) and a series of additional layers (at lower depths than the original) to extract features from the image. The features it uses for detection are selected from 5 layers: fire9 (the last fire module in SqueezeNet), Ex1_2, Ex2_2, Ex3_2 (three convolutional layers) and GAP (a global average pooling layer).
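The GAP layer named above simply averages each channel of a feature map down to a single value. A minimal pure-Python sketch, assuming a channel-first layout of nested lists (the layout is my assumption, not the paper's):

```python
def global_average_pool(feature_map):
    """Collapse each channel's H×W map to one scalar by averaging.

    feature_map: list of channels, each a 2-D list of floats (C × H × W)."""
    return [sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
            for channel in feature_map]
```

Because the output is one value per channel regardless of spatial size, GAP gives the detector a coarsest-scale feature source with no extra parameters.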

Figure 5. The 2D screen coordinate system, the 3D real world coordinate system and the relationships between them. A detected bounding box is described in the 2D screen coordinate system. The 3D real world coordinate system is established on the basis of the view frustum created by the visual sensor, with the origin at the centre of the visual sensor. The X and Y axes are parallel to the screen. The Z axis, which corresponds to the negative direction of the visual sensor’s orientation, is perpendicular to the screen. The 2D coordinates of the detected bounding box can be converted into target bounding box coordinates on the target plane in the 3D real world coordinate system for virtual object registration.
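The 2D-to-3D conversion described in this caption can be sketched with a symmetric pinhole frustum: a screen pixel is normalised, scaled by the frustum's half-extents at the chosen depth, and placed on the plane Z = −depth in the camera-centred frame. The function name, the vertical field-of-view parameterisation, and the top-left screen origin are all illustrative assumptions, not the paper's formulation.

```python
import math

def screen_to_target_plane(u, v, screen_w, screen_h, fov_y_deg, depth):
    """Back-project screen pixel (u, v) onto the plane Z = -depth in the
    camera-centred frame (X, Y parallel to the screen, Z out of it).

    Assumes a symmetric pinhole frustum with vertical field of view
    fov_y_deg and the screen origin at the top-left corner."""
    # Half-extent of the view frustum at the given depth.
    half_h = depth * math.tan(math.radians(fov_y_deg) / 2.0)
    half_w = half_h * (screen_w / screen_h)
    # Normalise pixel coordinates to [-1, 1]; flip v so +Y points up.
    nx = 2.0 * u / screen_w - 1.0
    ny = 1.0 - 2.0 * v / screen_h
    return (nx * half_w, ny * half_h, -depth)
```

Applying this to the four corners of a detected bounding box yields the target bounding box on the target plane, which is where the virtual object is then registered.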

virtual [ˈvɜːtʃuəl]: adj. simulated; existing in effect or in practice (though not formally or nominally recognised)
registration [ˌredʒɪˈstreɪʃn]: n. enrolment, sign-up, check-in