Can we use stereo vision with ARKit to estimate floor plans?

When it comes to refurbishing your home, the first hurdle is accurately measuring the room. Whether you stretch out the measuring tape yourself or seek the help of a professional, the process is extremely manual, and mistakes at this early stage can be costly. By using cutting-edge technology, however, we believe that people like you or me will soon be able to scan a room with a mobile phone and seamlessly create an accurate floor plan.


There are many ways to acquire 3D information, and in our previous blog post we presented an overview of the most commonly used techniques. Although we have lots of experience with devices that generate 3D information, such as Google Tango, Kinect and the Structure Sensor from Occipital, these devices aren’t compact enough to carry around. As such, our research is focused on smartphones equipped with ARKit and ARCore, which provide visual-inertial SLAM capabilities.

Estimating the structure of a room from a scan involves a number of steps. We’ve already written about how to compute a floor plan starting from a 3D Dense Point Cloud, but now want to focus on how to retrieve this information. In this blog post, we’ll analyse whether stereo vision, coupled with ARKit or ARCore, can be used to obtain a 3D Point Cloud of a room and the challenges associated with this approach.


Stereo vision is based on the idea that, much like in human vision, two different images of the same scene are sufficient to estimate how far away the objects in that scene are. There’s lots of information available on this, and several families of methods, including Visual SLAM, Multi-View Stereo and Structure from Motion, can be used to approach the problem of estimating geometry from images. A few notable examples are DTAM, LSD-SLAM, CNN-SLAM, Bundler and PMVS. However, in this blog post we’re only talking about the classic stereo vision pipeline.

In particular, given two rectified images, it is possible to find correspondences between the two images and estimate a disparity map, which, for calibrated cameras, can easily be transformed into a Dense 3D Point Cloud.

Let’s first explore quickly what all of these terms mean. If you don’t want to get too deep into the theory, you can skip to the next section!

1) Epipolar geometry

Fig. 1: Image courtesy of Wikimedia Commons

Imagine a 3D point X observed by two cameras. The 3D point X, its projections onto the two image planes and the two camera centres are co-planar: they all lie on what is known as the epipolar plane. Furthermore, each projected point is restricted to lie on an epipolar line, which is the line of intersection between the epipolar plane and the image plane (i.e. the red line in Fig. 1). The epipole is the point of intersection between the line joining the two camera centres and the image plane. The distance between the two camera centres is called the baseline. The interesting thing here is that, given the projection $x_L$ of point X in the left image, if we want to look for the corresponding point $x_R$ in the right image, we can limit our search to the points that lie on the epipolar line. This relation is modelled by the so-called fundamental matrix F, according to which:

$x_L^T F x_R = 0$
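To make the constraint concrete, here is a minimal NumPy sketch that builds F from a known relative pose and checks the relation on a synthetic point. The intrinsics, rotation and baseline are made-up values for illustration, and we assume both cameras share the same intrinsic matrix K and that the pose (R, t) maps right-camera coordinates to left-camera coordinates.

```python
import numpy as np

def skew(t):
    """Matrix form of the cross product: skew(t) @ v equals np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def fundamental_from_pose(K, R, t):
    """F with x_L^T F x_R = 0, assuming X_L = R @ X_R + t and shared intrinsics K."""
    E = skew(t) @ R                  # essential matrix
    K_inv = np.linalg.inv(K)
    return K_inv.T @ E @ K_inv

def project(K, X_cam):
    """Project a 3D point in camera coordinates to homogeneous pixel coordinates."""
    x = K @ X_cam
    return x / x[2]

# Illustrative values only (assumptions, not taken from ARKit).
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
a = np.deg2rad(5.0)
R = np.array([[np.cos(a), 0.0, np.sin(a)],
              [0.0, 1.0, 0.0],
              [-np.sin(a), 0.0, np.cos(a)]])
t = np.array([0.10, 0.0, 0.0])       # 10 cm baseline

X_R = np.array([0.3, -0.2, 2.0])     # point in right-camera coordinates
X_L = R @ X_R + t                    # same point in left-camera coordinates

x_L = project(K, X_L)
x_R = project(K, X_R)
F = fundamental_from_pose(K, R, t)

line_R = F.T @ x_L                   # epipolar line of x_L in the right image
residual = float(line_R @ x_R)       # same number as x_L^T F x_R
print("epipolar residual:", residual)
```

The residual is zero up to floating-point error, which is exactly the statement that $x_R$ lies on the epipolar line of $x_L$.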

2) Image rectification

Fig. 2: Image courtesy of Wikiwand

Rectification is the process of transforming both images so that they are projected onto one common plane, as if the cameras were parallel (see Fig. 2). After rectification, corresponding points lie on the same image row, which makes the search for matches across the images easier and faster. Fig. 3 shows an example of how two images look before and after rectification.

Fig. 3: Courtesy: CVLAB. Left and right image before rectification at the top, left and right image after rectification at the bottom
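One common way to rectify a calibrated pair is to rotate both cameras about their own centres so that they share an orientation whose x-axis points along the baseline. The sketch below is a simplified version of that idea under our own assumptions: both cameras share the intrinsics K, rotations are world-to-camera, and all numeric values are invented for the demo. It is not the code used to produce Fig. 3.

```python
import numpy as np

def rot_x(deg):
    a = np.deg2rad(deg)
    return np.array([[1.0, 0.0, 0.0],
                     [0.0, np.cos(a), -np.sin(a)],
                     [0.0, np.sin(a), np.cos(a)]])

def rot_y(deg):
    a = np.deg2rad(deg)
    return np.array([[np.cos(a), 0.0, np.sin(a)],
                     [0.0, 1.0, 0.0],
                     [-np.sin(a), 0.0, np.cos(a)]])

def rectifying_homographies(K, R1, c1, R2, c2):
    """Homographies mapping original pixels to rectified pixels.

    R1, R2 are world-to-camera rotations, c1, c2 camera centres in the world.
    Assumes the baseline is not parallel to the optical axis of camera 1.
    """
    x_axis = (c2 - c1) / np.linalg.norm(c2 - c1)   # new x-axis: along the baseline
    y_axis = np.cross(R1[2], x_axis)               # R1[2] is the old optical axis
    y_axis = y_axis / np.linalg.norm(y_axis)
    z_axis = np.cross(x_axis, y_axis)
    R_new = np.vstack([x_axis, y_axis, z_axis])    # shared world-to-camera rotation
    K_inv = np.linalg.inv(K)
    return K @ R_new @ R1.T @ K_inv, K @ R_new @ R2.T @ K_inv

def project(K, R, c, X):
    x = K @ (R @ (X - c))
    return x / x[2]

def warp(H, x):
    y = H @ x
    return y / y[2]

# Illustrative values only.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R1, c1 = rot_y(3.0), np.array([0.0, 0.0, 0.0])
R2, c2 = rot_y(-4.0) @ rot_x(2.0), np.array([0.12, 0.01, 0.02])
X = np.array([0.2, -0.1, 2.5])                     # a 3D point in the world

H1, H2 = rectifying_homographies(K, R1, c1, R2, c2)
row_left = warp(H1, project(K, R1, c1, X))[1]
row_right = warp(H2, project(K, R2, c2, X))[1]
print("rows after rectification:", row_left, row_right)
```

After warping, the same 3D point lands on the same row in both images, so the search for a match becomes a one-dimensional scan along that row.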

3) Disparity map

After the images have been rectified, it’s easier to look for correspondences: given a pixel in the first image, where is the corresponding pixel (the same 3D point projected on the second camera) in the second image? Correspondences then allow us to compute the disparity map. In particular, disparity is the horizontal offset between the position of a pixel in the left image and the position of its corresponding pixel in the right image. This set of offsets tells us how far away objects are: given two pairs of corresponding pixels, the pair with the smaller disparity corresponds to a point that is farther away than the point represented by the pair with the greater disparity.

Fig. 4: Disparity map computed from the two images in Fig. 3.

Fig. 4 shows an example of a disparity map, where closer pixels are darker, and farther pixels are lighter.  
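As a rough illustration of how a disparity value can be found, here is a toy block-matching sketch on a synthetic rectified pair. It compares a small patch around a left-image pixel against patches on the same row of the right image using the sum of absolute differences (SAD). The window size, search range, focal length and baseline are assumptions chosen for the demo; real matchers such as the ones behind Fig. 4 are considerably more sophisticated.

```python
import numpy as np

def disparity_for_pixel(left, right, row, col, half=3, max_disp=32):
    """Best disparity for one pixel using SAD along the same image row."""
    patch = left[row - half:row + half + 1, col - half:col + half + 1]
    best_d, best_cost = 0, np.inf
    for d in range(max_disp + 1):
        c = col - d                      # candidate column in the right image
        if c - half < 0:
            break                        # window would leave the image
        candidate = right[row - half:row + half + 1, c - half:c + half + 1]
        cost = np.abs(patch - candidate).sum()
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d

def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """For a rectified pair, depth is focal length times baseline over disparity."""
    return focal_px * baseline_m / disparity_px

# Synthetic pair: random texture, right image shifted by a known disparity.
rng = np.random.default_rng(0)
true_disparity = 7
left = rng.random((40, 120))
right = np.roll(left, -true_disparity, axis=1)   # right[:, c] == left[:, c + 7]

estimated = disparity_for_pixel(left, right, row=20, col=60)
depth = depth_from_disparity(estimated, focal_px=800.0, baseline_m=0.10)
print("disparity:", estimated, "depth (m):", depth)
```

The depth formula also shows why small disparities mean far-away points: depth is inversely proportional to disparity.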

4) Triangulation

The last step to obtain a dense 3D point cloud is triangulation: given the correspondence $(x_{1,i}, x_{2,i})$, where $x_{j,i}$ is the i-th point on the j-th image, the intrinsic parameters of the camera and the camera poses, we can easily compute the 3D point $X_i$ in the world frame (see Fig. 5).

Fig. 5: Dense 3D point cloud computed from the images in Fig. 3. Different parameters and different algorithms (Semi Dense Stereo Matching or ELAS, for example) lead to potentially quite different point clouds.
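For a single correspondence, triangulation can be sketched with the standard linear (DLT) method: each observation contributes two linear equations in the unknown homogeneous 3D point, and the solution is the singular vector with the smallest singular value. The projection matrices and the test point below are invented for the demo, and this is only one of several possible triangulation methods.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear triangulation of one correspondence.

    P1, P2 are 3x4 projection matrices, x1, x2 are pixel coordinates (u, v).
    """
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X_h = Vt[-1]                       # homogeneous solution of A @ X = 0
    return X_h[:3] / X_h[3]

def project(P, X):
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Illustrative setup: two parallel cameras 10 cm apart.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.10], [0.0], [0.0]])])

X_true = np.array([0.2, -0.1, 2.0])
x1, x2 = project(P1, X_true), project(P2, X_true)

X_est = triangulate(P1, P2, x1, x2)
print("triangulated point:", X_est)
```

With noise-free observations the estimate matches the original point; with real, noisy matches the two viewing rays no longer intersect exactly and the result is a least-squares compromise.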


The pipeline is very straightforward. All we need is a bunch of images, the corresponding camera poses, and the intrinsic parameters of the camera. Luckily ARKit provides all the information we need! So, why don’t we give it a try on a couple of images taken using ARKit?
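One practical detail when feeding ARKit data into a stereo pipeline is the coordinate convention. ARKit exposes the intrinsics and a camera-to-world transform for each frame, and its camera space has y pointing up and the camera looking along negative z, whereas most computer vision code assumes y down and z forward. The sketch below shows one way to build a projection matrix from those two pieces; it assumes the 4x4 transform has already been exported as a row-major NumPy array, and the function name and numeric values are ours.

```python
import numpy as np

# Flips the y and z axes: ARKit camera space -> usual computer vision camera space.
FLIP_YZ = np.diag([1.0, -1.0, -1.0])

def projection_from_arkit(K, camera_to_world):
    """3x4 projection matrix from ARKit-style intrinsics and camera-to-world pose."""
    R_c2w = camera_to_world[:3, :3]
    centre = camera_to_world[:3, 3]
    R = FLIP_YZ @ R_c2w.T              # world -> computer vision camera rotation
    t = -R @ centre
    return K @ np.hstack([R, t[:, None]])

def project(P, X_world):
    x = P @ np.append(X_world, 1.0)
    return x[:2] / x[2]

# Illustrative values: camera 1.5 m above the world origin, no rotation.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
camera_to_world = np.eye(4)
camera_to_world[:3, 3] = [0.0, 1.5, 0.0]

P = projection_from_arkit(K, camera_to_world)
centre_px = project(P, np.array([0.0, 1.5, -2.0]))   # 2 m straight ahead
offset_px = project(P, np.array([0.5, 2.0, -2.0]))   # to the right and above
print(centre_px, offset_px)
```

A point straight ahead of the camera lands on the principal point, and a point above the camera lands on a smaller row index, as expected for an image whose origin is in the top-left corner.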

Fig. 6: 3D dense point cloud obtained using stereo vision on four pairs of images and camera poses obtained using ARKit.

As you can see, if we simply apply a classic stereo vision pipeline we get a pretty convincing 3D point cloud. If we then apply the same method to all the pairs we get from scanning an entire room, this is what we obtain:  

Fig. 7: 3D dense point cloud obtained using stereo vision on many pairs of images and camera poses obtained using ARKit.

In this case, the point cloud looks quite noisy, and you can see that different views are not always aligned with each other. This is mostly because ARKit camera poses are not perfectly accurate and the images are quite noisy. To solve this, we devised a very simple pipeline based on the idea of key-frames, borrowed from SLAM techniques, coupled with outlier removal to get cleaner depth maps. A final fusion step merges the different views, removes further outliers and produces a single, coherent dense 3D point cloud. Here is what we get after we apply our pipeline to the pairs used to obtain Fig. 7:

Fig. 8: 3D dense point cloud obtained using our pipeline.

Pretty cool, huh?
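We won’t go through our pipeline in detail here, but the two ideas mentioned above, key-frames and outlier removal, can be illustrated with a short sketch. This is not our production code: the selection rule (keep a frame once the camera has moved far enough from the last key-frame), the statistical filter (drop points whose neighbours are unusually far away) and all thresholds are assumptions picked for the demo, and the fusion step is not shown.

```python
import numpy as np

def select_keyframes(centres, min_baseline=0.045):
    """Indices of frames whose camera centre moved at least min_baseline metres
    away from the previously selected key-frame."""
    keyframes = [0]
    for i in range(1, len(centres)):
        if np.linalg.norm(centres[i] - centres[keyframes[-1]]) >= min_baseline:
            keyframes.append(i)
    return keyframes

def remove_outliers(points, k=8, std_ratio=2.0):
    """Keep points whose mean distance to their k nearest neighbours is not
    much larger than the average over the whole cloud.

    Brute-force distances, so only suitable for small clouds.
    """
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    dists.sort(axis=1)                        # column 0 is the point itself
    knn_mean = dists[:, 1:k + 1].mean(axis=1)
    threshold = knn_mean.mean() + std_ratio * knn_mean.std()
    return points[knn_mean <= threshold]

# Camera moving 1 cm per frame along x: only every fifth frame becomes a key-frame.
centres = np.array([[0.01 * i, 0.0, 0.0] for i in range(21)])
keyframes = select_keyframes(centres)

# A tight cluster of 200 points plus three stray points far away.
rng = np.random.default_rng(1)
cluster = rng.normal(scale=0.05, size=(200, 3))
strays = np.array([[5.0, 5.0, 5.0], [-4.0, 6.0, 3.0], [7.0, -5.0, 2.0]])
clean = remove_outliers(np.vstack([cluster, strays]))
print(keyframes, len(clean))
```

Choosing pairs with a reasonable baseline avoids triangulating from almost identical views, and the filter removes isolated points that typically come from wrong matches.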

Stereo vision limitations

Stereo vision works very well when images have a lot of features. However, in order to build an accurate floor plan, we need information on the walls, which most of the time don’t have many features. In fact, if you look at Fig. 8, you’ll soon realise that walls have not been reconstructed at all, as it is very difficult to find correspondences on a wall that looks the same everywhere. Unfortunately, there is no way of solving this problem using only traditional stereo vision. In our next blog post, we’ll explore newer, deep learning-based techniques to obtain 3D information.
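The problem is easy to reproduce on synthetic data. The sketch below computes a simple matching cost (sum of absolute differences over a small window) for every candidate disparity, once on a textured image pair and once on a perfectly uniform "wall". The images, window size and search range are all made up for the illustration.

```python
import numpy as np

def sad_costs(left, right, row, col, half=3, max_disp=16):
    """Matching cost for each candidate disparity 0..max_disp at one pixel."""
    patch = left[row - half:row + half + 1, col - half:col + half + 1]
    costs = []
    for d in range(max_disp + 1):
        c = col - d
        candidate = right[row - half:row + half + 1, c - half:c + half + 1]
        costs.append(float(np.abs(patch - candidate).sum()))
    return np.array(costs)

rng = np.random.default_rng(0)

# Textured surface: the right image is the left image shifted by 5 pixels.
textured_left = rng.random((40, 120))
textured_right = np.roll(textured_left, -5, axis=1)
textured = sad_costs(textured_left, textured_right, row=20, col=60)

# Texture-less wall: every pixel has the same intensity in both images.
wall_left = np.full((40, 120), 0.8)
wall_right = np.full((40, 120), 0.8)
wall = sad_costs(wall_left, wall_right, row=20, col=60)

textured_best = int(np.argmin(textured))
textured_ties = int(np.count_nonzero(textured == textured.min()))
wall_ties = int(np.count_nonzero(wall == wall.min()))
print(textured_best, textured_ties, wall_ties)
```

On the textured pair the cost has a single clear minimum at the true disparity. On the wall every candidate has the same cost, so the matcher has no way of telling which one is right.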


DigitalBridge is committed to estimating room structure and creating accurate floor plans from scans. However, it’s complex work! Our previous blog post explained how, starting from a 3D Point Cloud, we can get very accurate measurements of a room, and today we’ve focused on how to retrieve the 3D Point Cloud. In particular, we’ve described how stereo vision can be used with ARKit to generate dense 3D Point Clouds and highlighted the main limitations of this method.

It’s clear that stereo vision alone, given its limitations, cannot generate a point cloud that can be used effectively with our current room structure estimation algorithm. In fact, walls, which are the main source of information for inferring room dimensions and shape, are usually texture-less, and therefore stereo vision cannot estimate 3D points from them.

In our next blog post, we’ll explore newer, deep learning-based methodology to gather 3D information from images.
