Research Interests.
My research interests lie at the intersection of Deep Learning, Computer Vision, 3D Geometry and their applications in Augmented Reality and Robotics. I enjoy studying how deep learning can be applied to computer vision problems including keypoint detection, image matching, relocalization, multi-view reconstruction, visual SLAM, depth estimation, homography estimation, camera calibration and bundle-adjustment.

I am currently a researcher at Facebook Reality Labs Research (FRL Research). Previously I was a member of the AI Research Team at Magic Leap, where I worked on developing new deep learning-based methods for Visual Simultaneous Localization and Mapping (Visual SLAM) and Structure-from-Motion (SfM). I was co-advised by Tomasz Malisiewicz and Andrew Rabinovich and authored publications at top-tier conferences including CVPR and RSS (e.g. Deep Homography Estimation, SuperPoint and SuperGlue). I also pioneered computer vision algorithms which were ultimately deployed on the ML1 headset. Prior to Magic Leap, I received my Master's and Bachelor's degrees at the University of Michigan, where I studied Machine Learning, Computer Vision and Robotics. During my studies I worked on various small projects in areas such as person tracking, outdoor SLAM and 3D ConvNets.

2020-now: Research Scientist at Facebook (Deep Learning, 3D Mapping)
2015-2020: Lead Software Engineer at Magic Leap (Deep Learning, Visual SLAM, Mixed Reality)
2014: Occipital Internship (RGB-D SLAM, Augmented Reality)
2013-2015: University of Michigan Master's Student (Computer Vision, Machine Learning, Robotics)
2008-2013: University of Michigan Bachelor's Student (Robotics, Computer Science, International Studies)

April 2020: Published PyTorch code for SuperGlue, including a live demo and easy-to-use evaluation code.
March 2020: SuperGlue: Learning Feature Matching with Graph Neural Networks is accepted to CVPR 2020 as an Oral.
March 2019: Deep ChArUco: Dark ChArUco Marker Pose Estimation is accepted to CVPR 2019.
November 2018: Invited talk at Berkeley Artificial Intelligence Research Lab (BAIR).
October 2018: Invited keynote talk at the Bay Area Multimedia Forum (BAMMF) series in Palo Alto, CA.
July 2018: Presented SuperPoint at ICVSS 2018 in stunning Sicily.
June 2018: Published PyTorch code for SuperPoint. Get up and running in 5 minutes or your money back!
April 2018: SuperPoint selected as an oral at the 1st International Workshop on Deep Learning for Visual SLAM at CVPR in Salt Lake City.


SuperGlue: Learning Feature Matching with Graph Neural Networks
This paper introduces SuperGlue, a neural network that matches two sets of local features by jointly finding correspondences and rejecting non-matchable points. Assignments are estimated by solving a differentiable optimal transport problem, whose costs are predicted by a graph neural network. We introduce a flexible context aggregation mechanism based on attention, enabling SuperGlue to reason about the underlying 3D scene and feature assignments jointly.
Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, Andrew Rabinovich
CVPR 2020 (Oral)
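
To give a feel for the optimal transport step mentioned in the abstract above, here is a minimal sketch of Sinkhorn normalization, one common way to solve an entropy-regularized assignment problem. It is not the SuperGlue code: the function name `sinkhorn` and the score values are hypothetical, the input is assumed to be square, and the handling of non-matchable points is left out.

```python
import math


def sinkhorn(scores, iters=100):
    """Alternately normalize rows and columns of exp(scores).

    Returns a soft assignment matrix whose rows and columns each sum
    to approximately 1. Assumes a square score matrix.
    """
    n = len(scores)
    # Exponentiate so that every entry is strictly positive.
    p = [[math.exp(s) for s in row] for row in scores]
    for _ in range(iters):
        # Row normalization: each row sums to 1.
        p = [[v / sum(row) for v in row] for row in p]
        # Column normalization: each column sums to 1.
        col_sums = [sum(p[i][j] for i in range(n)) for j in range(n)]
        p = [[p[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return p


# Hypothetical similarity scores between 3 keypoints in image A (rows)
# and 3 keypoints in image B (columns); higher means more similar.
scores = [
    [2.0, 0.1, 0.2],
    [0.3, 1.5, 0.1],
    [0.2, 0.4, 2.5],
]
assignment = sinkhorn(scores)
# Hard matches: for each keypoint in A, the most likely keypoint in B.
matches = [max(range(3), key=lambda j: row[j]) for row in assignment]
```

Because every step is a division or an exponential, the whole procedure is differentiable, which is what allows the matching costs to be learned end-to-end.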
Deep ChArUco: Dark ChArUco Marker Pose Estimation
We present a real-time pose estimation system which combines two custom deep networks, ChArUcoNet and RefineNet, with the Perspective-n-Point algorithm to estimate the marker's 6DoF pose. ChArUcoNet is a convolutional neural network which jointly outputs ID-specific classifiers and 2D point locations. The 2D point locations are further refined into subpixel coordinates using RefineNet. We evaluate Deep ChArUco in challenging scenarios and demonstrate that our approach is superior to a traditional OpenCV-based method.
Danying Hu, Daniel DeTone, Tomasz Malisiewicz
CVPR 2019
Self-Improving Visual Odometry
We propose a self-supervised learning framework that uses unlabeled monocular video sequences to generate large-scale supervision for training a Visual Odometry (VO) frontend. Our proposed frontend consists of a single multi-task CNN which outputs 2D keypoint locations, keypoint descriptors, and a novel point stability score. When trained using VO at scale on 2.5 million images, the stability classifier automatically discovers a ranking for keypoints that are not likely to help in VO, such as t-junctions across depth discontinuities, features on shadows and highlights, and dynamic objects like people.
Daniel DeTone, Tomasz Malisiewicz, Andrew Rabinovich
arXiv 2018
SuperPoint: Self-Supervised Interest Point Detection and Description
This work presents a self-supervised framework for training interest point detectors and descriptors suitable for a large number of multiple-view geometry problems in computer vision. As opposed to patch-based neural networks, our fully-convolutional model operates on full-sized images and jointly computes pixel-level interest point locations and associated descriptors in one forward pass. Our model, when trained on the MS-COCO image dataset, is able to repeatedly detect a rich set of interest points and stably track them over time.
Daniel DeTone, Tomasz Malisiewicz, Andrew Rabinovich
CVPR 2018 Deep Learning for Visual SLAM Workshop
Toward Geometric Deep SLAM
We present a point tracking system powered by two CNNs. The first network, MagicPoint, operates on single images and extracts salient 2D points. As transformation estimation is simpler when the detected points are geometrically stable, we designed a second network, MagicWarp, which operates on pairs of point images and estimates the homography that relates the inputs.
Daniel DeTone, Tomasz Malisiewicz, Andrew Rabinovich
arXiv 2017
Deep Image Homography Estimation
We present a deep convolutional neural network called HomographyNet for estimating the relative homography between a pair of images. We use a 4-point homography parameterization which maps the four corners from one image into the second image. The network is trained end-to-end using warped MS-COCO images, allowing the use of large-scale training without time-consuming data collection. HomographyNet does not require separate local feature detection and transformation estimation stages and outperforms a traditional homography estimator based on ORB.
Daniel DeTone, Tomasz Malisiewicz, Andrew Rabinovich
RSS 2016 Workshop: Limits and Potentials of Deep Learning in Robotics
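
The 4-point parameterization mentioned in the abstract above can be related to the usual 3x3 homography matrix with a small linear solve. The sketch below is only an illustration under stated assumptions: the 128x128 patch corners, the offset values, and all function names are hypothetical, and the solver is a plain Gaussian elimination rather than anything taken from the paper.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def four_point_to_homography(corners, offsets):
    """Convert four corner offsets into a 3x3 homography with H[2][2] = 1."""
    a, b = [], []
    for (x, y), (dx, dy) in zip(corners, offsets):
        u, v = x + dx, y + dy  # where this corner lands in the second image
        a.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        b.append(u)
        a.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def warp(h, x, y):
    """Apply homography h to the point (x, y)."""
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


# Hypothetical 128x128 patch corners and per-corner (dx, dy) offsets.
corners = [(0.0, 0.0), (127.0, 0.0), (127.0, 127.0), (0.0, 127.0)]
offsets = [(3.0, -2.0), (-5.0, 4.0), (6.0, 1.0), (-1.0, -7.0)]
H = four_point_to_homography(corners, offsets)
```

Both forms carry the same eight degrees of freedom; the corner offsets are all expressed in pixels, whereas the entries of the 3x3 matrix mix rotation, translation and perspective terms of very different magnitudes.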
3D Spatial Convnets for Semantic Segmentation
By training a 3D spatial convnet to recognize 127,915 CAD Models in 662 different categories, we can develop a rich feature hierarchy for performing 3D semantic segmentation.
Daniel DeTone, Matthew Johnson-Roberson
Winter 2015
Structure Sensor SDK
We built an SDK for developers to use with the Structure Sensor that includes sample code for 3D object capture, 3D room mapping, and augmented reality gaming.
Summer 2014
Simultaneous Environment Discovery & Annotation
SEDA is a project for enhancing human learning by using state-of-the-art techniques from AI. The non-technically constrained goal is to create an overlay to human vision to help with tasks humans are inherently bad at, such as memory, calculations, and abstractions, and to help speed up tasks such as looking up information and referencing material.
Michigan Student AI Lab (MSAIL)
Winter 2014
Scene Text Detection and Recognition
We built an end-to-end scene text detection and recognition framework that builds on recently published work by Lukas Neumann, using an extremal region (ER) classifier and efficient exhaustive search.
Michigan Student AI Lab (MSAIL)
Winter 2014
Robust Locally Weighted Regression for Aesthetically Pleasing Region-of-Interest Video Generation
We provide a method that takes the output from an object tracker and creates a smoothed RoI to be viewed as the final output video. To accomplish this, we use a variation of linear regression, namely, robust locally weighted linear regression (rLWLR-Smooth).
ATLAS Collaboratory Project
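
As a rough illustration of the smoothing idea described above, the sketch below applies a LOWESS-style robust locally weighted linear fit to a 1D sequence of tracker outputs. It is not the rLWLR-Smooth method itself: the Gaussian kernel, the bisquare reweighting, the bandwidth, the function names and the sample data are all assumptions made for this example.

```python
import math


def weighted_line_fit(xs, ys, ws, x0):
    """Fit y = a + b*x by weighted least squares and evaluate it at x0."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)


def robust_lwlr(xs, ys, bandwidth=3.0, robust_iters=2):
    """Locally weighted linear regression with bisquare outlier down-weighting."""
    n = len(xs)
    robust = [1.0] * n  # per-sample robustness weights, refined each pass
    fitted = list(ys)
    for _ in range(robust_iters + 1):
        fitted = []
        for x0 in xs:
            # Gaussian kernel in x, multiplied by the robustness weight.
            ws = [r * math.exp(-((x - x0) ** 2) / (2 * bandwidth ** 2))
                  for r, x in zip(robust, xs)]
            fitted.append(weighted_line_fit(xs, ys, ws, x0))
        # Bisquare weights: samples with large residuals get weight 0.
        resid = [abs(y - f) for y, f in zip(ys, fitted)]
        scale = max(6.0 * sorted(resid)[n // 2], 1e-9)
        robust = [(1 - min(r / scale, 1.0) ** 2) ** 2 for r in resid]
    return fitted


# Hypothetical tracker output: RoI center x per frame, drifting right
# at 2 px/frame, with one spurious jump at frame 10.
frames = [float(t) for t in range(20)]
centers = [100.0 + 2.0 * t for t in frames]
centers[10] = 300.0
smoothed = robust_lwlr(frames, centers)
```

The robustness passes are what keep a single bad detection from dragging the RoI off target; a plain locally weighted fit would still bend toward the outlier.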
Parallel Tracking and Mapping for Outdoor Localization
By removing some of the long-term pose optimizations and by limiting the allowed number of bundle adjustment iterations, I was able to modify PTAM to work in an outdoor localization setting. This work was used to help improve the accuracy of a multi-target tracking system.
Daniel DeTone, Yu Xiang, Silvio Savarese
Summer 2013
Robotics Competition for Autonomous SLAM and Path Planning
We entered a mobile robot, equipped with a fisheye camera and laser pointer, in a robotics competition. To win, the robot must autonomously map a small area, shoot green triangles, and return to a starting point. We implemented a fast agglomerative line fitting algorithm, a graph-based SLAM algorithm, and a memory efficient quad-tree for map storage. Our team finished 2nd out of 8 teams.
Daniel DeTone, Ibrahim Musba, Jonathan Bendes, Andrew Segavac
Winter 2013
Projectile Prediction and Robotic Retrieval using Kinect RGBD Video
We developed a fully automated projectile-catching robot by affixing a small basket to a mobile robot and predicting the projectile's landing position in real-time. We implemented a detection algorithm using RGBD video from a Kinect and an estimation algorithm using linear regression. Once the landing position was calculated, we used dead-reckoning and a PID controller to navigate the mobile robot.
Daniel DeTone, Rohan Thomare, Max Keener
Winter 2013
Tracking-by-detection in a Lecture Hall Setting
We present a framework for tracking a single human (person-of-interest) in a lecture hall environment. It is a tracking-by-detection framework that uses a generic person detector, a novel scoring function to solve the data association problem, and a Kalman filter that provides reliable state estimation. In our scoring function, we introduce two novel subcomponents: a subscore based on the target's width and a subscore based on the target's color histogram at the first time step.
ATLAS Collaboratory Project
Fall 2013
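
The state estimation step mentioned above can be sketched with a constant-velocity Kalman filter on a single image coordinate. This is a generic textbook filter, not the project's implementation: the state layout, the noise values `q` and `r`, the function name `kalman_step` and the simulated detections are all assumptions.

```python
def kalman_step(state, cov, z, dt=1.0, q=1e-4, r=1.0):
    """One predict + update cycle of a constant-velocity Kalman filter.

    state = [position, velocity], cov = 2x2 covariance (list of lists),
    z = measured position (e.g. the detector's box center in pixels).
    """
    p, v = state
    (p00, p01), (p10, p11) = cov
    # Predict: x <- F x, P <- F P F^T + Q with F = [[1, dt], [0, 1]].
    p = p + dt * v
    n00 = p00 + dt * (p10 + p01) + dt * dt * p11 + q
    n01 = p01 + dt * p11
    n10 = p10 + dt * p11
    n11 = p11 + q
    # Update with measurement model H = [1, 0] (position only).
    s = n00 + r                # innovation variance
    k0, k1 = n00 / s, n10 / s  # Kalman gain
    y = z - p                  # innovation
    new_state = [p + k0 * y, v + k1 * y]
    new_cov = [[(1 - k0) * n00, (1 - k0) * n01],
               [n10 - k1 * n00, n11 - k1 * n01]]
    return new_state, new_cov


# Hypothetical detections: a person walking right at 2 px/frame.
state = [0.0, 0.0]
cov = [[100.0, 0.0], [0.0, 100.0]]  # large initial uncertainty
for t in range(1, 31):
    state, cov = kalman_step(state, cov, z=2.0 * t)
```

In a tracking-by-detection loop, the predicted position would typically also be used to gate or score candidate detections before the update step.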
Particle Filter Tracking in a Lecture Hall Setting
Proof of concept for using a deformable parts model in conjunction with a particle filter and efficient MCMC sampling.
ATLAS Collaboratory Project
Fall 2013
Linear array of photodiodes to track a human speaker for video recording
We present a human lecturer tracking and recording system that consists of a pan/tilt/zoom (PTZ) color video camera, a necklace of infrared LEDs and a linear photodiode array detector. Electronic output from the photodiode array is processed to generate the location of the LED necklace, which is worn by a human speaker. The LED necklace is flashed at 70 Hz at a 50% duty cycle to provide noise-filtering capability.
Daniel DeTone, Homer Neal, Bob Lougheed
JoP:CS 2012