Computer Vision Tasks

Semantic Segmentation
- Tumor Segmentation
- Panoptic Segmentation
- 3D Semantic Segmentation
- Weakly-Supervised Semantic Segmentation

Representation Learning
- Disentanglement
- Graph Representation Learning
- Sentence Embeddings
- Network Embedding

Classification
- Text Classification
- Graph Classification
- Audio Classification
- Medical Image Classification

Object Detection
- 3D Object Detection
- Real-Time Object Detection
- RGB Salient Object Detection
- Few-Shot Object Detection

Image Classification
- Out of Distribution (OOD) Detection
- Few-Shot Image Classification
- Fine-Grained Image Classification
- Learning with Noisy Labels
- Text Retrieval
- Deep Hashing
- Table Retrieval

2D Object Detection
- Edge Detection
- Open Vocabulary Object Detection
- Thermal Image Segmentation
- Reinforcement Learning (RL)
- Off-Policy Evaluation
- Multi-Objective Reinforcement Learning
- 3D Point Cloud Reinforcement Learning

Domain Adaptation
- Unsupervised Domain Adaptation
- Domain Generalization
- Test-time Adaptation
- Source-Free Domain Adaptation

Image Generation
- Image-to-Image Translation
- Text-to-Image Generation
- Image Inpainting
- Conditional Image Generation

Data Augmentation
- Image Augmentation
- Text Augmentation
- Image Denoising
- Color Image Denoising
- SAR Image Despeckling
- Grayscale Image Denoising

Autonomous Vehicles
- Autonomous Driving
- Self-Driving Cars
- Autonomous Navigation
- Simultaneous Localization and Mapping

Contrastive Learning
- Meta-Learning
- Few-Shot Learning
- Sample Probing
- Universal Meta-Learning

Super-Resolution
- Image Super-Resolution
- Video Super-Resolution
- Multi-Frame Super-Resolution
- Reference-based Super-Resolution

Pose Estimation
- 3D Human Pose Estimation
- Keypoint Detection
- 3D Pose Estimation
- 6D Pose Estimation

Self-Supervised Learning
- Point Cloud Pre-training
- Unsupervised Video Clustering
- 2D Semantic Segmentation
- Image Segmentation
- Text Style Transfer
- Scene Parsing
- Reflection Removal

Visual Question Answering (VQA)
- Visual Question Answering
- Machine Reading Comprehension
- Chart Question Answering
- Chart Understanding
- Depth Estimation
- 3D Reconstruction
- Neural Rendering
- 3D Face Reconstruction

Anomaly Detection
- Unsupervised Anomaly Detection
- One-Class Classification
- Supervised Anomaly Detection
- Anomaly Detection in Surveillance Videos

Sentiment Analysis
- Aspect-Based Sentiment Analysis (ABSA)
- Multimodal Sentiment Analysis
- Aspect Sentiment Triplet Extraction
- Twitter Sentiment Analysis
- Temporal Action Localization
- Video Understanding

Video Generation
- Video Object Segmentation
- Video Retrieval
- 3D Object Super-Resolution
- One-Shot Learning
- Few-Shot Semantic Segmentation
- Cross-Domain Few-Shot
- Unsupervised Few-Shot Learning

Activity Recognition
- Action Recognition
- Human Activity Recognition
- Group Activity Recognition
- Egocentric Activity Recognition

Medical Image Segmentation
- Lesion Segmentation
- Brain Tumor Segmentation
- Cell Segmentation
- Skin Lesion Segmentation
- Exposure Fairness

Monocular Depth Estimation
- Stereo Depth Estimation
- Depth and Camera Motion
- 3D Depth Estimation

Facial Recognition and Modelling
- Face Recognition
- Face Swapping
- Face Detection
- Facial Expression Recognition (FER)
- Face Verification

Instance Segmentation
- Referring Expression Segmentation
- 3D Instance Segmentation
- Unsupervised Object Segmentation
- Real-time Instance Segmentation

Optical Character Recognition (OCR)
- Active Learning
- Handwriting Recognition
- Handwritten Digit Recognition
- Irregular Text Recognition
- Quantization
- Data-Free Quantization
- UNet Quantization

Zero-Shot Learning
- Generalized Zero-Shot Learning
- Compositional Zero-Shot Learning

- Multi-Label Zero-Shot Learning

Continual Learning
- Class Incremental Learning
- Continual Named Entity Recognition
- Unsupervised Class-Incremental Learning

Object Tracking
- Multi-Object Tracking
- Visual Object Tracking
- Multiple Object Tracking
- Cell Tracking
- Action Recognition In Videos
- 3D Action Recognition
- Self-Supervised Action Recognition
- Few-Shot Action Recognition

Scene Understanding
- Video Semantic Segmentation
- Visual Relationship Detection
- Lighting Estimation
- 3D Room Layouts From A Single RGB Panorama
- Scene Text Recognition
- Scene Graph Generation
- Scene Recognition

Adversarial Attack
- Backdoor Attack
- Adversarial Text
- Adversarial Attack Detection
- Real-World Adversarial Attack

Image Retrieval
- Sketch-Based Image Retrieval
- Content-Based Image Retrieval
- Composed Image Retrieval (CoIR)
- Medical Image Retrieval
- Active Object Detection

Image Reconstruction
- MRI Reconstruction
- CT Reconstruction
- Film Removal
- Conformal Prediction
- Text Simplification
- Self-Supervised Image Classification
- Music Source Separation

Emotion Recognition
- Speech Emotion Recognition
- Emotion Recognition in Conversation
- Multimodal Emotion Recognition
- Emotion-Cause Pair Extraction
- Monocular 3D Object Detection
- Robust 3D Object Detection
- 3D Object Detection From Stereo Images
- Multiview Detection

Dimensionality Reduction
- Supervised Dimensionality Reduction
- Online Nonnegative CP Decomposition

Style Transfer
- Image Stylization
- Font Style Transfer
- Style Generalization
- Face Transfer
- Optical Flow Estimation
- Video Stabilization

Action Localization
- Action Segmentation
- Spatio-Temporal Action Localization

Image Captioning
- 3D Dense Captioning
- Controllable Image Captioning
- Aesthetic Image Captioning
- Relational Captioning

Person Re-Identification
- Unsupervised Person Re-Identification
- Video-Based Person Re-Identification
- Occluded Person Re-Identification
- Generalizable Person Re-Identification

Image Restoration
- Demosaicking
- Spectral Reconstruction
- Underwater Image Restoration
- Flare Removal

Action Detection
- Skeleton Based Action Recognition
- Online Action Detection
- Audio-Visual Active Speaker Detection

Metric Learning
- 10-Shot Image Generation
- Motion Synthesis
- Community Question Answering
- Talking Head Generation
- GAN Image Forensics

Object Recognition
- 3D Object Recognition
- Continuous Object Recognition
- Depiction Invariant Object Recognition

Image Enhancement
- Low-Light Image Enhancement
- Image Relighting
- De-Aliasing

Multi-Label Classification
- Missing Labels
- Extreme Multi-Label Classification
- Hierarchical Multi-Label Classification
- Medical Code Prediction
- Monocular 3D Human Pose Estimation
- Pose Prediction
- 3D Multi-Person Pose Estimation
- 3D Human Pose and Shape Estimation

Continuous Control
- Steering Control
- Drone Controller
- Semi-Supervised Video Object Segmentation
- Unsupervised Video Object Segmentation
- Referring Video Object Segmentation
- Video Salient Object Detection
- 3D Face Modelling
- Trajectory Prediction
- Trajectory Forecasting
- Human Motion Prediction
- Out-of-Sight Trajectory Prediction

Novel View Synthesis
- Novel LiDAR View Synthesis
- Ground Video Synthesis from Satellite Image
- Multivariate Time Series Imputation
- Image Quality Assessment
- No-Reference Image Quality Assessment
- Blind Image Quality Assessment
- Aesthetics Quality Assessment
- Stereoscopic Image Quality Assessment
- Instruction Following
- Visual Instruction Following
- Blind Image Deblurring
- Single-Image Blind Deblurring

Object Localization
- Weakly-Supervised Object Localization
- Image-Based Localization
- Unsupervised Object Localization
- Active Object Localization
- Out-of-Distribution Detection
- Camera Shot Segmentation
- Facial Inpainting
- Cloud Removal
- Fine-Grained Image Inpainting

Prompt Engineering
- Visual Prompting

Change Detection
- Semi-Supervised Change Detection

Image Compression
- Feature Compression
- JPEG Compression Artifact Reduction
- Lossy-Compression Artifact Reduction

- Color Image Compression Artifact Reduction
- Explainable Artificial Intelligence
- Explainable Models
- Explanation Fidelity Evaluation
- FAD Curve Analysis

Saliency Detection
- Saliency Prediction
- Co-Salient Object Detection
- Video Saliency Detection
- Unsupervised Saliency Detection

Image Registration
- Unsupervised Image Registration

Visual Reasoning
- Visual Commonsense Reasoning
- Ensemble Learning
- Salient Object Detection
- Saliency Ranking
- RGB-T Salient Object Detection

Visual Grounding
- 3D Visual Grounding
- Person-Centric Visual Grounding
- Phrase Extraction and Grounding (PEG)

Visual Tracking
- Point Tracking
- RGB-T Tracking
- Real-Time Visual Tracking
- RF-based Visual Tracking

Video Question Answering
- Zero-Shot Video Question Answer
- Few-Shot Video Question Answering
- Whole Slide Images
- Image Manipulation Detection
- Generalized Zero-Shot Skeletal Action Recognition
- Zero-Shot Skeletal Action Recognition

3D Point Cloud Classification
- 3D Object Classification
- Few-Shot 3D Point Cloud Classification
- Supervised-Only 3D Point Cloud Classification
- Zero-Shot Transfer 3D Point Cloud Classification

Video Captioning
- Dense Video Captioning
- Boundary Captioning
- Live Video Captioning
- Visual Text Correction
- 2D Classification
- Neural Network Compression
- Cell Detection
- Plant Phenotyping
- Open-Set Classification
- Motion Estimation

Gesture Recognition
- Hand Gesture Recognition
- Hand-Gesture Recognition
- RF-based Gesture Recognition
- Activity Prediction
- Motion Prediction
- Cyber Attack Detection
- Sequential Skip Prediction
- Text Detection

Point Cloud Registration
- Image to Point Cloud Registration
- Robust 3D Semantic Segmentation
- Real-Time 3D Semantic Segmentation
- Unsupervised 3D Semantic Segmentation

- Furniture Segmentation
- 3D Point Cloud Interpolation

Medical Diagnosis
- Alzheimer's Disease Detection
- Retinal OCT Disease Classification
- Blood Cell Count
- Thoracic Disease Classification

Rain Removal
- Single Image Deraining
- Visual Odometry
- Face Anti-Spoofing
- Monocular Visual Odometry
- Hand Pose Estimation
- Hand Segmentation
- Gesture-to-Gesture Translation
- Image Dehazing
- Single Image Dehazing
- Video Quality Assessment
- Video Alignment
- Temporal Sentence Grounding
- Long-Video Activity Recognition

Deepfake Detection
- Synthetic Speech Detection
- Human Detection of Deepfakes
- Multimodal Forgery Detection

Robot Navigation
- Social Navigation
- PointGoal Navigation
- Sequential Place Learning
- Image Manipulation

Image Clustering
- Online Clustering
- Face Clustering
- Multi-View Subspace Clustering
- Multi-Modal Subspace Clustering
- Visual Localization
- Colorization
- Line Art Colorization
- Point-Interactive Image Colorization
- Color Mismatch Correction

Visual Place Recognition
- Indoor Localization

- 3D Place Recognition
- Stereo Disparity Estimation
- Stereo Matching
- Image Editing
- Rolling Shutter Correction
- Shadow Removal
- Multimodal-Guided Image Editing
- Joint Deblur and Frame Interpolation
- Multimodal Fashion Image Editing
- Unsupervised Image-To-Image Translation
- Synthetic-to-Real Translation
- Multimodal Unsupervised Image-To-Image Translation
- Cross-View Image-to-Image Translation
- Image-to-Image Regression

Human-Object Interaction Detection
- Affordance Recognition
- Hand-Object Interaction Detection
- Earth Observation
- Image Deblurring
- Low-Light Image Deblurring and Enhancement

Object Reconstruction
- 3D Object Reconstruction
- Crowd Counting
- Visual Crowd Analysis
- Group Detection in Crowds

Image Matching
- Semantic Correspondence
- Patch Matching
- Set Matching
- Matching Disparate Images
- Point Cloud Classification
- Jet Tagging
- Few-Shot Point Cloud Classification

Hyperspectral
- Hyperspectral Image Classification
- Hyperspectral Unmixing
- Hyperspectral Image Segmentation
- Classification of Hyperspectral Images
- Referring Expression
- Point Cloud Generation
- Point Cloud Completion
- 3D Point Cloud Reconstruction
- Document Text Classification
- Multi-Label Classification of Biomedical Texts
- Political Salient Issue Orientation Detection
- Scene Classification
- Weakly-Supervised Temporal Action Localization
- Weakly Supervised Action Localization
- Temporal Action Proposal Generation
- Activity Recognition in Videos
- 2D Human Pose Estimation
- Action Anticipation
- 3D Face Animation
- Semi-Supervised Human Pose Estimation
- Reconstruction

3D Human Reconstruction
- Single-View 3D Reconstruction
- 4D Reconstruction
- Single-Image-Based HDR Reconstruction
- Visual Navigation
- ObjectGoal Navigation

Keyword Spotting
- Small-Footprint Keyword Spotting
- Visual Keyword Spotting
- Compressive Sensing

Boundary Detection
- Junction Detection
- Motion Style Transfer
- Temporal Human Motion Composition

Cross-Modal Retrieval
- Image-Text Matching
- Cross-Modal Retrieval with Noisy Correspondence
- Multilingual Cross-Modal Retrieval
- Zero-Shot Composed Person Retrieval
- Cross-Modal Retrieval on RSITMD
- Document AI
- Document Understanding

Scene Text Detection
- Curved Text Detection
- Multi-Oriented Scene Text Detection
- Camera Calibration

Image Matting
- Semantic Image Matting
- Video-Text Retrieval
- Video Grounding
- Video-Adverb Retrieval
- Replay Grounding
- Composed Video Retrieval (CoVR)

Video Summarization
- Unsupervised Video Summarization
- Supervised Video Summarization
- Point Cloud Segmentation
- 3D Anomaly Detection
- Video Anomaly Detection
- Artifact Detection
- Sensor Fusion
- Superpixels
- Emotion Classification
- Point Cloud Reconstruction
- 3D Semantic Scene Completion
- 3D Semantic Scene Completion from a single RGB image
- Garment Reconstruction
- Few-Shot Transfer Learning for Saliency Prediction
- Aerial Video Saliency Prediction
- Video Editing
- Video Temporal Consistency

Remote Sensing
- Remote Sensing Image Classification
- Change Detection for Remote Sensing Images
- Building Change Detection for Remote Sensing Images
- Segmentation Of Remote Sensing Imagery
- The Semantic Segmentation Of Remote Sensing Imagery

- Document Layout Analysis
- Camera Pose Estimation
- Cross-Domain Few-Shot Object Detection
- Face Generation
- Talking Face Generation
- Face Age Editing
- Facial Expression Generation
- Kinship Face Generation
- Generalized Few-Shot Semantic Segmentation
- Machine Unlearning
- Continual Forgetting
- Privacy-Preserving Deep Learning
- Membership Inference Attack

Video Instance Segmentation
- Virtual Try-on
- Line Items Extraction
- Human Detection
- Generalized Referring Expression Segmentation
- Weakly Supervised Referring Expression Segmentation

Scene Flow Estimation
- Self-supervised Scene Flow Estimation

Gait Recognition
- Multiview Gait Recognition
- Gait Recognition in the Wild

Motion Forecasting
- Multi-Person Pose Forecasting
- Multiple Object Forecasting
- Dataset Distillation
- Depth Completion
- Object Discovery
- Texture Synthesis
- CARLA Map Leaderboard
- Dead-Reckoning Prediction
- 3D Classification
- Multi-View Learning
- Incomplete Multi-View Clustering
- Interactive Segmentation
- Cross-Modal Alignment
- 3D Hand Pose Estimation
- Gaze Estimation
- Scene Generation
- Face Reconstruction
- Text-Guided Image Editing
- Text-Based Image Editing
- Concept Alignment
- Zero-Shot Text-to-Image Generation
- Conditional Text-to-Image Synthesis
- Image Recognition
- Fine-Grained Image Recognition
- License Plate Recognition
- Material Recognition
- Object Counting
- Few-Shot Object Counting and Detection
- Open-Vocabulary Object Counting
- Training-Free Object Counting
- Breast Cancer Detection
- Skin Cancer Classification
- Breast Cancer Histology Image Classification
- Lung Cancer Diagnosis
- Classification of Breast Cancer Histology Images
- Sign Language Recognition
- Event-based Vision
- Event-based Optical Flow
- Event-Based Video Reconstruction
- Event-Based Motion Estimation
- Inverse Rendering
- 3D Absolute Human Pose Estimation
- Image to 3D
- Text-to-Face Generation
- Interest Point Detection
- Homography Estimation
- Disease Prediction
- Disease Trajectory Forecasting

Human Parsing
- Multi-Human Parsing

Pose Tracking
- 3D Human Pose Tracking
- Weakly Supervised Segmentation
- Text-to-Video Generation
- Text-to-Video Editing
- Subject-Driven Video Generation
- 3D Multi-Person Pose Estimation (absolute)
- 3D Multi-Person Mesh Recovery
- 3D Multi-Person Pose Estimation (root-relative)
- Dichotomous Image Segmentation
- Scene Segmentation

Multi-Label Image Classification
- Multi-label Image Recognition with Partial Labels

Facial Landmark Detection
- Unsupervised Facial Landmark Detection
- 3D Facial Landmark Localization
- 3D Character Animation from a Single Photo
- Activity Detection

Temporal Localization
- Language-Based Temporal Localization
- Temporal Defect Localization
- LiDAR Semantic Segmentation

3D Object Tracking
- 3D Single Object Tracking

Camera Localization
- Camera Relocalization

Knowledge Distillation
- Data-free Knowledge Distillation
- Self-Knowledge Distillation
- Few-Shot Class-Incremental Learning
- Class-Incremental Semantic Segmentation
- Non-Exemplar-Based Class Incremental Learning

Moment Retrieval
- Zero-shot Moment Retrieval
- Template Matching
- Disparity Estimation
- Multimodal Large Language Model
- Intelligent Surveillance
- Vehicle Re-Identification
- Motion Segmentation
- Relation Network
- Visual Dialog
- Handwritten Text Recognition
- Handwritten Document Recognition
- Unsupervised Text Recognition
- Text Spotting
- Decision Making Under Uncertainty
- Uncertainty Visualization
- Text to Video Retrieval
- Partially Relevant Video Retrieval
- 3D Multi-Object Tracking
- Real-Time Multi-Object Tracking
- Referring Multi-Object Tracking
- Multi-Animal Tracking with Identification
- Trajectory Long-Tail Distribution for Multi-Object Tracking

Shadow Detection
- Shadow Detection And Removal
- Person Search
- Semi-Supervised Object Detection
- Video Enhancement
- Zero-Shot Segmentation
- Video Inpainting
- Mixed Reality
- Physics-Informed Machine Learning
- Soil Moisture Estimation
- Unconstrained Lip-synchronization
- Human Mesh Recovery
- Open Vocabulary Semantic Segmentation
- Zero-Guidance Segmentation
- Future Prediction
- Cross-Corpus
- Micro-Expression Recognition
- Micro-Expression Spotting
- 3D Facial Expression Recognition
- Smile Recognition
- Overlapped 10-1
- Overlapped 15-1
- Overlapped 15-5
- Disjoint 15-1
- Disjoint 15-5
- Face Image Quality Assessment
- Lightweight Face Recognition
- Age-Invariant Face Recognition
- Synthetic Face Recognition
- Face Quality Assessment
- Stereo Image Super-Resolution
- Burst Image Super-Resolution
- Satellite Image Super-Resolution
- Multispectral Image Super-Resolution

Video Reconstruction
- Image Categorization
- Fine-Grained Visual Categorization
- Key Information Extraction
- Key-Value Pair Extraction
- Sign Language Translation
- Color Constancy
- Few-Shot Camera-Adaptive Color Constancy
- Line Detection
- Tone Mapping

Visual Recognition
- Fine-Grained Visual Recognition
- HDR Reconstruction
- Multi-Exposure Image Fusion
- Deep Attention
- Image Cropping
- Stereo Matching Hand
- Zero-Shot Action Recognition
- Breast Cancer Histology Image Classification (20% labels)
- Natural Language Transduction

Video Restoration
- Analog Video Restoration
- Image Forensics
- Infrared and Visible Image Fusion
- Novel Class Discovery
- Image Animation
- Landmark-based Lipreading
- Abnormal Event Detection in Video
- Semi-supervised Anomaly Detection
- Cross-Domain Few-Shot Learning
- Vision-Language Navigation
- Grasp Generation
- Hand-Object Pose
- 3D Canonical Hand Pose Estimation
- Transparent Object Detection
- Transparent Objects
- Action Quality Assessment

Object Segmentation
- Camouflaged Object Segmentation
- Landslide Segmentation
- Text-Line Extraction
- Surface Normals Estimation
- Highlight Detection
- Pedestrian Attribute Recognition
- Probabilistic Deep Learning
- Segmentation
- Open-Vocabulary Semantic Segmentation
- Steganalysis
- Computer Vision Techniques Adopted in 3D Cryogenic Electron Microscopy
- Single Particle Analysis
- Cryogenic Electron Tomography
- Camouflaged Object Segmentation with a Single Task-generic Prompt

- Dense Captioning
- Texture Classification
- Iris Recognition
- Pupil Dilation
- Image to Video Generation
- Unconditional Video Generation
- Action Understanding
- Person Retrieval
- Spoof Detection
- Face Presentation Attack Detection
- Detecting Image Manipulation
- Cross-Domain Iris Presentation Attack Detection
- Finger Dorsal Image Spoof Detection
- Unsupervised Few-Shot Image Classification
- Generalized Few-Shot Classification
- Sketch Recognition
- Face Sketch Synthesis
- Drawing Pictures
- Photo-To-Caricature Translation
- Meme Classification
- Hateful Meme Classification
- Unbiased Scene Graph Generation
- Panoptic Scene Graph Generation
- Severity Prediction
- Intubation Support Prediction
- Image Stitching
- Multi-View 3D Reconstruction
- Surgical Phase Recognition
- Online Surgical Phase Recognition
- Offline Surgical Phase Recognition
- Document Image Classification
- One-Shot Visual Object Segmentation
- Universal Domain Adaptation
- Zero-Shot Semantic Segmentation
- Automatic Post-Editing
- Face Reenactment
- Text based Person Retrieval
- Text-to-Image
- Story Visualization
- Complex Scene Breaking and Synthesis

Human Dynamics
- 3D Human Dynamics
- Image Fusion
- Pansharpening
- Blind Face Restoration
- Cloud Detection
- Geometric Matching

Human Action Generation
- Action Generation
- Object Categorization
- Table Recognition
- Point Clouds
- Point Cloud Video Understanding
- Cross-Modal Place Recognition
- Point Cloud Representation Learning

Diffusion Personalization
- Diffusion Personalization Tuning Free
- Efficient Diffusion Personalization
- Image Deconvolution
- Image Outpainting
- Sports Analytics
- Image Shadow Removal
- Intrinsic Image Decomposition
- Single-Source Domain Generalization
- Evolving Domain Generalization
- Source-Free Domain Generalization
- Semantic SLAM
- Object SLAM
- Image Steganography

Lane Detection
- 3D Lane Detection
- Line Segment Detection
- Person Identification
- Visual Prompt Tuning
- Situation Recognition
- Grounded Situation Recognition
- Face Image Quality
- Layout Design
- Multi-Target Domain Adaptation
- Weakly-Supervised Instance Segmentation

Fake Image Detection
- Fake Image Attribution
- Robot Pose Estimation
- Image Morphing
- Motion Detection
- Occlusion Handling
- Rotated MNIST
- Image Smoothing
- Drone Navigation
- Drone-View Target Localization
- Contour Detection
- Crop Classification
- License Plate Detection
- Video Panoptic Segmentation
- Value Prediction
- Body Mass Index (BMI) Prediction
- Crop Yield Prediction
- Personalized Image Generation
- Viewpoint Estimation
- Motion Retargeting
- 3D Point Cloud Linear Classification
- Gaze Prediction
- Multi-Object Tracking and Segmentation
- Multiview Learning
- Document Shadow Removal
- Zero-Shot Transfer Image Classification
- 3D Object Reconstruction From A Single Image

research paper of computer vision

CAD Reconstruction

Bird's-eye view semantic segmentation.

research paper of computer vision

Zero-Shot Composed Image Retrieval (ZS-CIR)

Human part segmentation.

research paper of computer vision

Material Classification

research paper of computer vision

Person Recognition

research paper of computer vision

Photo Retouching

Space-time video super-resolution, symmetry detection, shape representation of 3d point clouds, dense pixel correspondence estimation, image forgery detection, image similarity search.

research paper of computer vision

Multispectral Object Detection

Precipitation forecasting, referring expression generation, synthetic image detection, traffic sign detection, video style transfer, referring image matting.

research paper of computer vision

Referring Image Matting (Expression-based)

research paper of computer vision

Referring Image Matting (Keyword-based)

research paper of computer vision

Referring Image Matting (RefMatte-RW100)

Referring image matting (prompt-based), human interaction recognition, one-shot 3d action recognition, mutual gaze, semi-supervised image classification.

research paper of computer vision

Open-World Semi-Supervised Learning

Semi-supervised image classification (cold start), affordance detection.

research paper of computer vision

Hand Detection

Image instance retrieval, amodal instance segmentation, image quality estimation.

research paper of computer vision

Road Damage Detection

research paper of computer vision

Video Matting

research paper of computer vision

inverse tone mapping

Art analysis, facial editing.

research paper of computer vision

Holdout Set

research paper of computer vision

Open Vocabulary Attribute Detection

Binary classification, llm-generated text detection, cancer-no cancer per breast classification, cancer-no cancer per image classification, stable mci vs progressive mci, suspicous (birads 4,5)-no suspicous (birads 1,2,3) per image classification, image/document clustering, self-organized clustering, lung nodule detection, lung nodule 3d detection, 3d scene reconstruction, 3d shape modeling.

research paper of computer vision

Action Analysis

Anatomical landmark detection, event segmentation, generic event boundary detection, food recognition.

research paper of computer vision

Motion Magnification

Scanpath prediction, semi-supervised instance segmentation, video deraining, video segmentation, camera shot boundary detection, open-vocabulary video segmentation, open-world video segmentation, 2d pose estimation, category-agnostic pose estimation, overlapping pose estimation, deception detection, deception detection in videos, instance search.

research paper of computer vision

Audio Fingerprint

Lung nodule classification, lung nodule 3d classification, image comprehension, image manipulation localization, image retouching, image-variation, jpeg artifact removal, point cloud super resolution, pose retrieval, short-term object interaction anticipation, skills assessment.

research paper of computer vision

Text-based Person Retrieval

research paper of computer vision

Sensor Modeling

Highlight removal, handwriting verification, bangla spelling error correction, video prediction, earth surface forecasting, predict future video frames.

research paper of computer vision

Video Visual Relation Detection

Human-object relationship detection, 3d open-vocabulary instance segmentation.

research paper of computer vision

Ad-hoc video search

Audio-visual synchronization, handwriting generation, network interpretation, scene change detection.

research paper of computer vision

Semi-Supervised Domain Generalization

Sketch-to-image translation, skills evaluation, unsupervised semantic segmentation.

research paper of computer vision

Unsupervised Semantic Segmentation with Language-image Pre-training

3d shape reconstruction from a single 2d image.

research paper of computer vision

Shape from Texture

3d shape representation.

research paper of computer vision

3D Dense Shape Correspondence

Birds eye view object detection, event data classification, few-shot instance segmentation, multiple people tracking.

research paper of computer vision

Open Vocabulary Panoptic Segmentation

Rgb-d reconstruction, seeing beyond the visible, single-object discovery.

research paper of computer vision

Sequential Place Recognition

Autonomous flight (dense forest), autonomous web navigation, vietnamese visual question answering, explanatory visual question answering, multiple object tracking with transformer.

research paper of computer vision

Multiple Object Track and Segmentation

Constrained lip-synchronization, face dubbing, 2d semantic segmentation task 3 (25 classes), document enhancement, 3d shape reconstruction, 4d panoptic segmentation, defocus blur detection, face anonymization, font recognition, horizon line estimation, instance shadow detection, kinship verification, medical image enhancement, spatio-temporal video grounding, training-free 3d point cloud classification, video forensics.

research paper of computer vision

Generative 3D Object Classification

Cube engraving classification, enf (electric network frequency) extraction, enf (electric network frequency) extraction from video, facial expression recognition, cross-domain facial expression recognition, zero-shot facial expression recognition, landmark tracking, muscle tendon junction identification, multimodal machine translation.

research paper of computer vision

Face to Face Translation

Multimodal lexical translation, 3d scene editing, action assessment, bokeh effect rendering, drivable area detection, stochastic human motion prediction, image imputation.

research paper of computer vision

Long Video Retrieval (Background Removed)

Medical image denoising.

research paper of computer vision

Mirror Detection

Occlusion estimation, physiological computing.

research paper of computer vision

Lake Ice Monitoring

Text-based person retrieval with noisy correspondence.

research paper of computer vision

Unsupervised 3D Point Cloud Linear Evaluation

Visual speech recognition, lip to speech synthesis, wireframe parsing, gaze redirection, single-image-generation, text-guided-generation, unsupervised anomaly detection with specified settings -- 30% anomaly, root cause ranking, anomaly detection at 30% anomaly, anomaly detection at various anomaly percentages.

research paper of computer vision

Unsupervised Contextual Anomaly Detection

Mistake detection, online mistake detection, 3d object captioning, 3d semantic occupancy prediction, animated gif generation.

research paper of computer vision

Occluded Face Detection

Generalized referring expression comprehension, image colorization, sketch colorization, image deblocking, image retargeting, infrared image super-resolution, motion disentanglement, online vectorized hd map construction, personality trait recognition, personalized segmentation, persuasion strategies, scene text editing, image to sketch recognition, spatial relation recognition, traffic accident detection, accident anticipation, unsupervised landmark detection, vcgbench-diverse, vehicle speed estimation, visual analogies, continual anomaly detection.

research paper of computer vision

Human-Object Interaction Generation

Image-guided composition, noisy semantic image synthesis, weakly supervised action segmentation (transcript), weakly supervised action segmentation (action set)), calving front delineation in synthetic aperture radar imagery, calving front delineation in synthetic aperture radar imagery with fixed training amount, continual semantic segmentation, overlapped 5-3, overlapped 25-25.

research paper of computer vision

Handwritten Line Segmentation

Handwritten word segmentation.

research paper of computer vision

General Action Video Anomaly Detection

Physical video anomaly detection, road scene understanding, monocular cross-view road scene parsing(road), monocular cross-view road scene parsing(vehicle).

research paper of computer vision

Transparent Object Depth Estimation

Age and gender estimation, data ablation, fingertip detection, gait identification, historical color image dating, image and video forgery detection, keypoint detection and image matching, marine animal segmentation, motion captioning, part-aware panoptic segmentation.

research paper of computer vision

Part-based Representation Learning

Unsupervised part discovery, portrait animation, repetitive action counting, scene-aware dialogue, spatial token mixer, steganographics, story continuation.

research paper of computer vision

Supervised Image Retrieval

Unsupervised anomaly detection with specified settings -- 0.1% anomaly, unsupervised anomaly detection with specified settings -- 1% anomaly, unsupervised anomaly detection with specified settings -- 10% anomaly, unsupervised anomaly detection with specified settings -- 20% anomaly, visual social relationship recognition, zero-shot text-to-video generation, video frame interpolation, 3d video frame interpolation, unsupervised video frame interpolation.

research paper of computer vision

eXtreme-Video-Frame-Interpolation

Micro-expression generation, micro-expression generation (megc2021), period estimation, art period estimation (544 artists), unsupervised panoptic segmentation, unsupervised zero-shot panoptic segmentation, 2d tiny object detection.

research paper of computer vision

Insulator Defect Detection

3d rotation estimation, camera auto-calibration, defocus estimation, derendering, grounded multimodal named entity recognition, hierarchical text segmentation, human-object interaction concept discovery.

research paper of computer vision

One-Shot Face Stylization

Speaker-specific lip to speech synthesis, multi-person pose estimation, multi-modal image segmentation, neural stylization.

research paper of computer vision

Population Mapping

Pornography detection, prediction of occupancy grid maps, raw reconstruction, svbrdf estimation, semi-supervised video classification, spectrum cartography, synthetic image attribution, training-free 3d part segmentation, unsupervised image decomposition, video individual counting, video propagation, vietnamese multimodal learning, weakly supervised 3d point cloud segmentation, weakly-supervised panoptic segmentation, drone-based object tracking, brain visual reconstruction, brain visual reconstruction from fmri, fashion understanding, semi-supervised fashion compatibility.

research paper of computer vision

intensity image denoising

Lifetime image denoising, observation completion, active observation completion, boundary grounding.

research paper of computer vision

Video Narrative Grounding

3d inpainting, 3d scene graph alignment, 4d spatio temporal semantic segmentation.

research paper of computer vision

Age Estimation

research paper of computer vision

Few-shot Age Estimation

Animal action recognition, cow identification, brdf estimation, camouflage segmentation, clothing attribute recognition, damaged building detection, depth image estimation, detecting shadows, dynamic texture recognition.

research paper of computer vision

Disguised Face Verification

Few shot open set object detection, fine-grained vehicle classification, vehicle color recognition, gaze target estimation, generalized zero-shot learning - unseen, hd semantic map learning, human-object interaction anticipation, image deep networks, manufacturing quality control, materials imaging, micro-gesture recognition, multi-person pose estimation and tracking.

research paper of computer vision

Multi-object discovery

Neural radiance caching.

research paper of computer vision

Parking Space Occupancy

research paper of computer vision

Partial Video Copy Detection

research paper of computer vision

Multimodal Patch Matching

Perpetual view generation, procedure learning, prompt-driven zero-shot domain adaptation, safety perception recognition, jersey number recognition, photo to rest generalization, single-shot hdr reconstruction, on-the-fly sketch based image retrieval, specular reflection mitigation, thermal image denoising, trademark retrieval, unsupervised instance segmentation, unsupervised zero-shot instance segmentation, vehicle key-point and orientation estimation.

research paper of computer vision

Video-Adverb Retrieval (Unseen Compositions)

Video-to-image affordance grounding.

research paper of computer vision

Vietnamese Scene Text

Visual sentiment prediction, human-scene contact detection, localization in video forgery, controllable grasp generation, grasp rectangle generation, video classification, student engagement level detection (four class video classification), multi class classification (four-level video classification), 3d canonicalization, 3d surface generation.

research paper of computer vision

Visibility Estimation from Point Cloud

Amodal layout estimation, blink estimation, camera absolute pose regression, change data generation, constrained diffeomorphic image registration, continuous affect estimation, deep feature inversion, disjoint 19-1, document image skew estimation, earthquake prediction, fashion compatibility learning.

research paper of computer vision

Displaced People Recognition

Finger vein recognition, flooded building segmentation.

research paper of computer vision

Future Hand Prediction

Generative temporal nursing, house generation, human fmri response prediction, hurricane forecasting, ifc entity classification, image declipping, image similarity detection.

research paper of computer vision

Image Text Removal

Image-to-gps verification.

research paper of computer vision

Image-based Automatic Meter Reading

Dial meter reading, indoor scene reconstruction, jpeg decompression.

research paper of computer vision

Kiss Detection

Laminar-turbulent flow localisation.

research paper of computer vision

Landmark Recognition

Brain landmark detection, corpus video moment retrieval, linear probing object-level 3d awareness, mllm evaluation: aesthetics, medical image deblurring, mental workload estimation, meter reading, motion expressions guided video segmentation, natural image orientation angle detection, multi-object colocalization, multilingual text-to-image generation, video emotion detection, nwp post-processing, occluded 3d object symmetry detection, one-shot segmentation.

research paper of computer vision

Patient-Specific Segmentation

Open set video captioning, open-vocabulary panoramic semantic segmentation, pso-convnets dynamics 1, pso-convnets dynamics 2, partial point cloud matching.

research paper of computer vision

Partially View-aligned Multi-view Learning

research paper of computer vision

Pedestrian Detection

research paper of computer vision

Thermal Infrared Pedestrian Detection

Personality trait recognition by face, physical attribute prediction, point cloud semantic completion, point cloud classification dataset, point- of-no-return (pnr) temporal localization, pose contrastive learning, potrait generation, procedure step recognition, prostate zones segmentation, pulmorary vessel segmentation, pulmonary artery–vein classification, pupil diameter estimation, reference expression generation, interspecies facial keypoint transfer, specular segmentation, state change object detection, surface normals estimation from point clouds, train ego-path detection.

research paper of computer vision

Transform A Video Into A Comics

Transparency separation, typeface completion.

research paper of computer vision

Unbalanced Segmentation

research paper of computer vision

Unsupervised Long Term Person Re-Identification

Video correspondence flow.

research paper of computer vision

Key-Frame-based Video Super-Resolution (K = 15)

Zero-shot single object tracking, yield mapping in apple orchards, lidar absolute pose regression, opd: single-view 3d openable part detection, self-supervised scene text recognition, spatial-aware image editing, video narration captioning, spectral estimation, spectral estimation from a single rgb image, 3d prostate segmentation, aggregate xview3 metric, atomic action recognition, composite action recognition, calving front delineation from synthetic aperture radar imagery, computer vision transduction, crosslingual text-to-image generation, zero-shot dense video captioning, document to image conversion, frame duplication detection, geometrical view, hyperview challenge.

research paper of computer vision

Image Operation Chain Detection

Kinematic based workflow recognition, logo recognition.

research paper of computer vision

MLLM Aesthetic Evaluation

Motion detection in non-stationary scenes, open-set video tagging, retinal vessel segmentation.

research paper of computer vision

Artery/Veins Retinal Vessel Segmentation

Satellite orbit determination.

research paper of computer vision

Segmentation Based Workflow Recognition

2d particle picking, small object detection.

research paper of computer vision

Rice Grain Disease Detection

Sperm morphology classification, video & kinematic base workflow recognition, video based workflow recognition, video, kinematic & segmentation base workflow recognition, animal pose estimation.


Computer Vision and Image Processing: A Paper Review

  • February 2018
  • International Journal of Artificial Intelligence Research 2(1):22
  • CC BY-SA 4.0

Abstract and Figures

Figure: Segmentation stages



A curated list of the top 10 computer vision papers in 2021 with video demos, articles, code and paper reference.

louisfb01/top-10-cv-papers-2021


The Top 10 Computer Vision Papers of 2021

While the world is still recovering, research hasn't slowed its frenetic pace, especially in the field of artificial intelligence. Moreover, many important aspects were highlighted this year, such as ethics, bias, governance, and transparency. Artificial intelligence and our understanding of the human brain and its link to AI are constantly evolving, showing promising applications that could improve our quality of life in the near future. Still, we ought to be careful about which technologies we choose to apply.

"Science cannot tell us what we ought to do, only what we can do." - Jean-Paul Sartre, Being and Nothingness

Here are my top 10 of the most interesting research papers of the year in computer vision, in case you missed any of them. In short, it is a curated list of the latest breakthroughs in AI and CV, each with a clear video explanation, a link to a more in-depth article, and code (if applicable). Enjoy the read, and let me know if I missed any important papers in the comments or by contacting me directly on LinkedIn!

The complete reference to each paper is listed at the end of this repository.

Maintainer: louisfb01

Subscribe to my newsletter - The latest updates in AI explained every week.

Feel free to message me any interesting paper I may have missed to add to this repository.

Tag me on Twitter @Whats_AI or LinkedIn @Louis (What's AI) Bouchard if you share the list!

Watch the 2021 CV rewind

Missed last year? Check this out: 2020: A Year Full of Amazing AI Papers - A Review

👀 If you'd like to support my work and use W&B (for free) to track your ML experiments and make your work reproducible or collaborate with a team, you can try it out by following this guide ! Since most of the code here is PyTorch-based, we thought that a QuickStart guide for using W&B on PyTorch would be most interesting to share.

👉 Follow this quick guide, use the same W&B lines in your code or any of the repos below, and have all your experiments automatically tracked in your W&B account! It doesn't take more than 5 minutes to set up and will change your life as it did for me! Here's a more advanced guide for using Hyperparameter Sweeps if interested :)

🙌 Thank you to Weights & Biases for sponsoring this repository and the work I've been doing, and thanks to any of you using this link and trying W&B!

Open In Colab

If you are interested in AI research, here is another great repository for you:

A curated list of the latest breakthroughs in AI by release date with a clear video explanation, link to a more in-depth article, and code.

2021: A Year Full of Amazing AI Papers - A Review

The Full List

  • DALL·E: Zero-Shot Text-to-Image Generation from OpenAI [1]
  • Taming Transformers for High-Resolution Image Synthesis [2]
  • Swin Transformer: Hierarchical Vision Transformer using Shifted Windows [3]
  • Deep Nets: What Have They Ever Done for Vision? [bonus]

  • Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image [4]
  • Total Relighting: Learning to Relight Portraits for Background Replacement [5]
  • Animating Pictures with Eulerian Motion Fields [6]
  • CVPR 2021 Best Paper Award: GIRAFFE - Controllable Image Generation [7]
  • TimeLens: Event-based Video Frame Interpolation [8]
  • (Style)CLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis [9]
  • CityNeRF: Building NeRF at City Scale [10]

  • Paper references

DALL·E: Zero-Shot Text-to-Image Generation from OpenAI [1]

OpenAI successfully trained a network able to generate images from text captions. It is very similar to GPT-3 and Image GPT and produces amazing results.

  • Short read: OpenAI’s DALL·E: Text-to-Image Generation Explained
  • Paper: Zero-Shot Text-to-Image Generation
  • Code: Code & more information for the discrete VAE used for DALL·E

Taming Transformers for High-Resolution Image Synthesis [2]

TL;DR: They combined the efficiency of GANs and convolutional approaches with the expressivity of transformers to produce a powerful and time-efficient method for semantically guided high-quality image synthesis.

  • Short read: Combining the Transformers Expressivity with the CNNs Efficiency for High-Resolution Image Synthesis
  • Paper: Taming Transformers for High-Resolution Image Synthesis
  • Code: Taming Transformers

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows [3]

Will Transformers Replace CNNs in Computer Vision? In less than 5 minutes, you will know how the transformer architecture can be applied to computer vision with a new paper called the Swin Transformer.

  • Short read: Will Transformers Replace CNNs in Computer Vision?
  • Paper: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
  • Click here for the code
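To make the shifted-window idea concrete, here is a minimal NumPy sketch of the partitioning step: the feature map is split into non-overlapping windows (self-attention is then computed within each window), and a cyclic shift between consecutive blocks lets information flow across window borders. The function names are illustrative, not taken from the official implementation.

```python
import numpy as np

def window_partition(x, win):
    """Split an (H, W, C) feature map into non-overlapping (win, win, C) windows."""
    H, W, C = x.shape
    assert H % win == 0 and W % win == 0
    x = x.reshape(H // win, win, W // win, win, C)
    # reorder to (num_windows, win, win, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, win, win, C)

def cyclic_shift(x, s):
    """Roll the map so the next block's windows straddle the previous borders."""
    return np.roll(x, shift=(-s, -s), axis=(0, 1))

feat = np.arange(8 * 8, dtype=np.float32).reshape(8, 8, 1)  # toy 8x8 feature map
windows = window_partition(feat, 4)                  # 4 windows of 4x4
shifted = window_partition(cyclic_shift(feat, 2), 4) # same windows, shifted view
```

In the real model, attention within `shifted` windows is masked so that pixels wrapped around by the roll do not attend to unrelated regions.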

Deep Nets: What Have They Ever Done for Vision? [bonus]

"I will openly share everything about deep nets for vision applications, their successes, and the limitations we have to address."

  • Short read: What is the state of AI in computer vision?
  • Paper: Deep nets: What have they ever done for vision?

Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image [4]

The next step for view synthesis: Perpetual View Generation, where the goal is to take an image, fly into it, and explore the landscape!

  • Short read: Infinite Nature: Fly into an image and explore the landscape
  • Paper: Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image

Total Relighting: Learning to Relight Portraits for Background Replacement [5]

Properly relight any portrait based on the lighting of the new background you add. Have you ever wanted to change the background of a picture but have it look realistic? If you’ve already tried that, you know it isn’t simple. You can’t just take a picture of yourself at home and swap the background for a beach; it just looks bad and unrealistic. Anyone will say “that’s photoshopped” in a second. For movies and professional videos, you need perfect lighting and artists to reproduce a high-quality image, and that’s super expensive. There’s no way you can do that with your own pictures. Or can you?

  • Short read: Realistic Lighting on Different Backgrounds
  • Paper: Total Relighting: Learning to Relight Portraits for Background Replacement

If you’d like to read more research papers as well, I recommend you read my article where I share my best tips for finding and reading more research papers.

Animating Pictures with Eulerian Motion Fields [6]

This model takes a picture, understands which particles are supposed to be moving, and realistically animates them in an infinite loop while keeping the rest of the picture entirely still, creating amazing-looking videos like this one...

  • Short read: Create Realistic Animated Looping Videos from Pictures
  • Paper: Animating Pictures with Eulerian Motion Fields
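The "Eulerian" part refers to a static motion field: every pixel location stores a fixed displacement, and particles are animated by repeatedly sampling that field at their current position. A toy sketch of this integration step, with hypothetical helper names (not the paper's code):

```python
import numpy as np

def euler_step(positions, flow):
    """Advance particle positions one frame along a static (Eulerian) flow field.

    positions: (N, 2) float array of (row, col) coordinates
    flow:      (H, W, 2) per-pixel displacement in (row, col) order
    """
    # sample the field at the nearest pixel, clamped to the image bounds
    idx = np.clip(np.round(positions).astype(int), 0,
                  np.array(flow.shape[:2]) - 1)
    return positions + flow[idx[:, 0], idx[:, 1]]

# Toy field: everything drifts one pixel to the right each frame.
H, W = 4, 6
flow = np.zeros((H, W, 2))
flow[..., 1] = 1.0
pts = np.array([[1.0, 0.0], [2.0, 3.0]])
for _ in range(3):            # integrate three frames
    pts = euler_step(pts, flow)
```

Because the field never changes, the same step can be looped forever, which is what makes the seamlessly looping animations possible.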

CVPR 2021 Best Paper Award: GIRAFFE - Controllable Image Generation [7]

Using a modified GAN architecture, they can move objects in the image without affecting the background or the other objects!

  • Short read: CVPR 2021 Best Paper Award: GIRAFFE - Controllable Image Generation
  • Paper: GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields

TimeLens: Event-based Video Frame Interpolation [8]

TimeLens can understand the movement of the particles in between the frames of a video to reconstruct what really happened at a speed even our eyes cannot see. In fact, it achieves results that neither our smartphones nor any previous model could reach!

  • Short read: How to Make Slow Motion Videos With AI!
  • Paper: TimeLens: Event-based Video Frame Interpolation

Subscribe to my weekly newsletter and stay up-to-date with new publications in AI for 2022!

(Style)CLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis [9]

Have you ever dreamed of taking the style of a picture, like this cool TikTok drawing style on the left, and applying it to a new picture of your choice? Well, I did, and it has never been easier to do. In fact, you can even achieve that from only text and can try it right now with this new method and their Google Colab notebook available for everyone (see references). Simply take a picture of the style you want to copy, enter the text you want to generate, and this algorithm will generate a new picture out of it! Just look back at the results above, such a big step forward! The results are extremely impressive, especially if you consider that they were made from a single line of text!

  • Short read: Text-to-Drawing Synthesis With Artistic Control | CLIPDraw & StyleCLIPDraw
  • Paper (CLIPDraw): CLIPDraw: exploring text-to-drawing synthesis through language-image encoders
  • Paper (StyleCLIPDraw): StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis
  • CLIPDraw Colab demo
  • StyleCLIPDraw Colab demo

CityNeRF: Building NeRF at City Scale [10]

The model is called CityNeRF and grows from NeRF, which I previously covered on my channel. NeRF is one of the first models using radiance fields and machine learning to construct 3D models out of images. But NeRF is not that efficient and works for a single scale. Here, CityNeRF is applied to satellite and ground-level images at the same time to produce various 3D model scales for any viewpoint. In simple words, they bring NeRF to city-scale. But how?
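A key trick NeRF-family models rely on to recover fine detail is feeding raw coordinates through a high-frequency positional encoding before the MLP. Here is a minimal sketch of that encoding; the frequency choice is simplified from the paper, and the function name is mine:

```python
import numpy as np

def positional_encoding(x, n_freqs=4):
    """Map each coordinate to sines and cosines at octave-spaced frequencies,
    so a downstream MLP can represent high-frequency scene detail."""
    freqs = (2.0 ** np.arange(n_freqs)) * np.pi       # pi, 2*pi, 4*pi, 8*pi
    angles = x[..., None] * freqs                     # (..., D, n_freqs)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*x.shape[:-1], -1)             # (..., D * 2 * n_freqs)

point = np.array([[0.5, 0.25, 1.0]])                  # one 3D sample point
encoded = positional_encoding(point)                  # shape (1, 24)
```

CityNeRF extends this idea by progressively activating higher frequency bands as the rendering scale moves from satellite views down to street level.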

  • Short read: CityNeRF: 3D Modelling at City Scale!
  • Paper: CityNeRF: Building NeRF at City Scale
  • Click here for the code (will be released soon)

If you would like to read more papers and have a broader view, here is another great repository for you covering 2020: 2020: A Year Full of Amazing AI Papers - A Review. And feel free to subscribe to my weekly newsletter to stay up-to-date with new publications in AI for 2022!

[1] Ramesh, A. et al., 2021. Zero-Shot Text-to-Image Generation. arXiv:2102.12092

[2] Esser, P. et al., 2020. Taming Transformers for High-Resolution Image Synthesis.

[3] Liu, Z. et al., 2021. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv preprint, https://arxiv.org/abs/2103.14030v1

[bonus] Yuille, A.L. and Liu, C., 2021. Deep Nets: What Have They Ever Done for Vision? International Journal of Computer Vision, 129(3), pp. 781–802, https://arxiv.org/abs/1805.04025

[4] Liu, A., Tucker, R., Jampani, V., Makadia, A., Snavely, N. and Kanazawa, A., 2020. Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image. https://arxiv.org/pdf/2012.09855.pdf

[5] Pandey et al., 2021. Total Relighting: Learning to Relight Portraits for Background Replacement. doi: 10.1145/3450626.3459872, https://augmentedperception.github.io/total_relighting/total_relighting_paper.pdf

[6] Holynski, A. et al., 2021. Animating Pictures with Eulerian Motion Fields. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[7] Niemeyer, M. and Geiger, A., 2021. GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields. CVPR 2021.

[8] Tulyakov, S.*, Gehrig, D.*, Georgoulis, S., Erbach, J., Gehrig, M., Li, Y. and Scaramuzza, D., 2021. TimeLens: Event-based Video Frame Interpolation. CVPR, Nashville. http://rpg.ifi.uzh.ch/docs/CVPR21_Gehrig.pdf

[9] (a) CLIPDraw: Exploring Text-to-Drawing Synthesis Through Language-Image Encoders. (b) Schaldenbrand, P., Liu, Z. and Oh, J., 2021. StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis.

[10] Xiangli, Y., Xu, L., Pan, X., Zhao, N., Rao, A., Theobalt, C., Dai, B. and Lin, D., 2021. CityNeRF: Building NeRF at City Scale.


Top Computer Vision Papers of All Time (Updated 2024)

  • Nico Klingler
  • March 12, 2024



Today’s boom in computer vision (CV) started at the beginning of the 21st century with the breakthrough of deep learning models and convolutional neural networks (CNNs). The main CV methods include image classification, image localization, object detection, and segmentation.

In this article, we dive into some of the most significant research papers that triggered the rapid development of computer vision. We split them into two categories: classical CV approaches, and papers based on deep learning. We chose the following papers based on their influence, quality, and applicability.

Gradient-based Learning Applied to Document Recognition (1998)

Distinctive Image Features from Scale-Invariant Keypoints (2004)

Histograms of Oriented Gradients for Human Detection (2005)

SURF: Speeded Up Robust Features (2006)

ImageNet Classification with Deep Convolutional Neural Networks (2012)

Very Deep Convolutional Networks for Large-Scale Image Recognition (2014)

GoogLeNet – Going Deeper with Convolutions (2014)

ResNet – Deep Residual Learning for Image Recognition (2015)

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks (2015)

YOLO: You Only Look Once: Unified, Real-Time Object Detection (2016)

Mask R-CNN (2017)

EfficientNet – Rethinking Model Scaling for Convolutional Neural Networks (2019)


Classic Computer Vision Papers

The authors Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner published the LeNet paper in 1998. They introduced the concept of a trainable Graph Transformer Network (GTN) for handwritten character and word recognition. They researched gradient-based techniques for training the recognizer globally, without manual segmentation and labeling.

LeNet CNN architecture digits recognition

Characteristics of the model:

  • LeNet-5 is a 7-layer CNN; its first convolutional layer produces 6 feature maps from 5×5 kernels (156 trainable parameters).
  • The input is a 32×32 pixel image, and the output layer is composed of Euclidean Radial Basis Function (RBF) units, one for each class.
  • The training set consisted of 30,000 examples, and the authors achieved a 0.35% error rate on the training set (after 19 passes).
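
The 156-parameter figure quoted for the first layer can be checked directly: C1 has 6 feature maps, each produced by a 5×5 kernel over the single input channel, plus one bias per map. A minimal sketch (the helper function is ours, for illustration):

```python
# Trainable-parameter count of a convolutional layer:
# each of the n_filters kernels has kernel_h * kernel_w weights
# per input channel, plus one bias term.
def conv_params(n_filters, kernel_h, kernel_w, in_channels):
    return n_filters * (kernel_h * kernel_w * in_channels + 1)

# LeNet-5's first layer (C1): 6 maps, 5x5 kernels, 1 input channel.
print(conv_params(6, 5, 5, 1))  # 156
```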

Find the LeNet paper here .

David Lowe (2004) proposed a method for extracting distinctive invariant features from images. He used them to perform reliable matching between different views of an object or scene. The paper introduced the Scale-Invariant Feature Transform (SIFT), which transforms image data into scale-invariant coordinates relative to local features.

SIFT method keypoints detection

Model characteristics:

  • The method generates large numbers of features that densely cover the image over the full range of scales and locations.
  • The model needs to match as few as 3 features from each object to reliably detect small objects in cluttered backgrounds.
  • For image matching and recognition, the model extracts SIFT features from a set of reference images stored in a database.
  • The SIFT model matches a new image by individually comparing each feature from the new image to this database (Euclidean distance).
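
The matching step can be sketched as a nearest-neighbor search in descriptor space with Lowe's ratio test: a match is kept only when the nearest database descriptor is clearly closer than the second nearest. The 0.8 threshold follows the paper; the toy descriptors below are made up for illustration:

```python
import numpy as np

def match_descriptors(query, database, ratio=0.8):
    """Match each query descriptor to its nearest database descriptor
    (Euclidean distance), keeping it only if it passes the ratio test."""
    matches = []
    for i, q in enumerate(query):
        dists = np.linalg.norm(database - q, axis=1)
        nearest, second = np.argsort(dists)[:2]
        if dists[nearest] < ratio * dists[second]:
            matches.append((i, int(nearest)))
    return matches
```

An ambiguous query, roughly equidistant from two database descriptors, is rejected by the ratio test rather than matched arbitrarily.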

Find the SIFT paper here .

The authors Navneet Dalal and Bill Triggs researched feature sets for robust visual object recognition, using linear SVM-based human detection as a test case. They experimented with grids of Histograms of Oriented Gradient (HOG) descriptors that significantly outperform existing feature sets for human detection.

histogram object detection

Authors achievements:

  • The histogram method gave near-perfect separation from the original MIT pedestrian database.
  • For good results, the model requires fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks.
  • Researchers examined a more challenging dataset containing over 1800 annotated human images with many pose variations and backgrounds.
  • In the standard detector, each HOG cell appears four times with different normalizations, which improves performance to 89%.
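
The core of the descriptor, orientation binning within a cell, can be sketched in a few lines. This is a simplified illustration (hard binning, no interpolation, no block normalization; cell and bin sizes are illustrative defaults):

```python
import numpy as np

def cell_histogram(patch, n_bins=9):
    """Unsigned-orientation gradient histogram for one HOG cell:
    each pixel votes for an orientation bin, weighted by its
    gradient magnitude."""
    gy, gx = np.gradient(patch.astype(float))   # per-axis gradients
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # unsigned, [0, 180)
    bins = (ang / (180.0 / n_bins)).astype(int) % n_bins
    hist = np.zeros(n_bins)
    for b, m in zip(bins.ravel(), mag.ravel()):
        hist[b] += m
    return hist
```

A patch with a purely horizontal intensity ramp puts all its gradient energy into the first orientation bin, as expected.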

Find the HOG paper here .

Herbert Bay, Tinne Tuytelaars, and Luc Van Gool presented a scale- and rotation-invariant interest point detector and descriptor called SURF (Speeded-Up Robust Features). It outperforms previously proposed schemes in repeatability, distinctiveness, and robustness, while being much faster to compute. The authors relied on integral images for image convolutions, building on the strengths of the leading existing detectors and descriptors.

surf detecting interest points

  • Applied a Hessian matrix-based measure for the detector, and a distribution-based descriptor, simplifying these methods to the essential.
  • Presented experimental results on a standard evaluation set, as well as on imagery obtained in the context of a real-life object recognition application.
  • SURF showed strong performance – SURF-128 with an 85.7% recognition rate, followed by U-SURF (83.8%) and SURF (82.6%).
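
The integral image (summed-area table) that SURF's speed rests on lets any box filter be evaluated with four array lookups, independent of filter size. A minimal sketch:

```python
import numpy as np

def integral_image(img):
    """Summed-area table: ii[y, x] = sum of img[:y+1, :x+1]."""
    return img.cumsum(axis=0).cumsum(axis=1)

def box_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1+1, x0:x1+1] using at most four lookups,
    regardless of how large the box is."""
    total = ii[y1, x1]
    if y0 > 0:
        total -= ii[y0 - 1, x1]
    if x0 > 0:
        total -= ii[y1, x0 - 1]
    if y0 > 0 and x0 > 0:
        total += ii[y0 - 1, x0 - 1]
    return total
```

Constant-time box sums are what make SURF's Hessian approximation with box filters fast at every scale.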

Find the SURF paper here .

Papers Based on Deep-Learning Models

Alex Krizhevsky and his team won the ImageNet Challenge in 2012 with deep convolutional neural networks. They trained one of the largest CNNs of its time on the ImageNet dataset used in the ILSVRC-2010/2012 challenges and achieved the best results reported on these datasets. They wrote a highly optimized GPU implementation of 2D convolution, covering all the operations required for CNN training, and published the results.

alexnet CNN architecture

  • The final CNN contained five convolutional and three fully connected layers, and this depth proved essential to its performance.
  • They found that removing any convolutional layer (each containing less than 1% of the model’s parameters) resulted in inferior performance.
  • The same CNN, with an extra sixth convolutional layer, was used to classify the entire ImageNet Fall 2011 release (15M images, 22K categories).
  • After fine-tuning on ImageNet-2012 it gave an error rate of 16.6%.
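
The claim that each convolutional layer holds under 1% of the parameters is easy to sanity-check: AlexNet's roughly 60 million parameters sit overwhelmingly in the fully connected layers. A rough count, using layer shapes from the paper (the helper functions are ours, for illustration):

```python
def conv_params(k, c_in, c_out):
    """Parameters of a k x k convolution (weights + biases)."""
    return c_out * (k * k * c_in + 1)

def fc_params(n_in, n_out):
    """Parameters of a fully connected layer (weights + biases)."""
    return n_out * (n_in + 1)

conv1 = conv_params(11, 3, 96)       # first conv: 96 kernels of 11x11x3
fc6 = fc_params(6 * 6 * 256, 4096)   # first FC layer after the convs
print(conv1, fc6)                    # the FC layer dwarfs the conv layer
```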

Find the ImageNet paper here .

Karen Simonyan and Andrew Zisserman (Oxford University) investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Their main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3×3) convolution filters, specifically focusing on very deep convolutional networks (VGG) . They proved that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16–19 weight layers.

 image classification CNN results VOC-2007, VOC-2012

  • Their ImageNet Challenge 2014 submission secured the first and second places in the localization and classification tracks respectively.
  • They showed that their representations generalize well to other datasets, where they achieved state-of-the-art results.
  • They made two best-performing ConvNet models publicly available, in addition to the deep visual representations in CV.
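
The key design point, stacking small 3×3 filters, can be quantified: two stacked 3×3 convolutions cover a 5×5 receptive field (three cover 7×7) with fewer parameters and more non-linearities than one large-kernel layer. A sketch with an illustrative channel width of 64:

```python
def conv_params(k, c_in, c_out):
    """Parameters of a k x k convolution (weights + biases)."""
    return c_out * (k * k * c_in + 1)

C = 64  # illustrative channel count, kept fixed across layers
two_3x3 = 2 * conv_params(3, C, C)    # 5x5 receptive field
one_5x5 = conv_params(5, C, C)
three_3x3 = 3 * conv_params(3, C, C)  # 7x7 receptive field
one_7x7 = conv_params(7, C, C)
```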

Find the VGG paper here .

The Google team (Christian Szegedy, Wei Liu, et al.) proposed a deep convolutional neural network architecture codenamed Inception. They intended to set the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of their architecture was the improved utilization of the computing resources inside the network.

GoogleNet Inception CNN

  • A carefully crafted design that allows for increasing the depth and width of the network while keeping the computational budget constant.
  • Their submission for ILSVRC14 was called GoogLeNet , a 22-layer deep network. Its quality was assessed in the context of classification and detection.
  • They added 200 region proposals coming from multi-box, increasing the coverage from 92% to 93%.
  • Lastly, they used an ensemble of 6 ConvNets when classifying each region, which improved accuracy from 40% to 43.9%.
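
The "computational budget" trick behind Inception is the 1×1 bottleneck: reducing channels before an expensive 5×5 convolution. Using the channel counts of the paper's inception (3a) module (192 inputs, a 16-channel reduction, 32 outputs), the parameter saving is roughly 10×:

```python
def conv_params(k, c_in, c_out):
    """Parameters of a k x k convolution (weights + biases)."""
    return c_out * (k * k * c_in + 1)

naive = conv_params(5, 192, 32)                                # direct 5x5
bottleneck = conv_params(1, 192, 16) + conv_params(5, 16, 32)  # 1x1 then 5x5
```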

Find the GoogLeNet paper here .

Microsoft researchers Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun presented a residual learning framework (ResNet) to ease the training of networks that are substantially deeper than those used previously. They reformulated the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions.

resnet error rates

  • They evaluated residual nets with a depth of up to 152 layers – 8× deeper than VGG nets, but still having lower complexity.
  • This result won 1st place on the ILSVRC 2015 classification task.
  • The team also analyzed CIFAR-10 with networks of 100 and 1000 layers, and achieved a 28% relative improvement on the COCO object detection dataset.
  • Moreover, in the ILSVRC & COCO 2015 competitions, they won 1st place on the tasks of ImageNet detection, ImageNet localization, and COCO detection and segmentation.
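
The reformulation is tiny in code: instead of learning a mapping H(x) directly, a block learns the residual F(x) = H(x) - x and outputs F(x) + x. A numpy sketch, where the residual functions are placeholders standing in for stacked convolutional layers:

```python
import numpy as np

def residual_block(x, f):
    """Identity-shortcut residual block: output = f(x) + x.
    If the optimal mapping is near identity, f only needs to learn
    a small correction; driving f to zero recovers x unchanged."""
    return f(x) + x

x = np.array([1.0, 2.0, 3.0])
zero_residual = lambda v: np.zeros_like(v)  # stand-in for learned layers
```

The identity shortcut is also what lets gradients flow through very deep stacks, which is why 152-layer networks became trainable.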

Find the ResNet paper here .

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun introduced the Region Proposal Network (RPN), which shares full-image convolutional features with the detection network, enabling nearly cost-free region proposals. Their RPN was a fully convolutional network that simultaneously predicted object bounds and objectness scores at each position. They trained the RPN end-to-end to generate high-quality region proposals, which Fast R-CNN used for detection.

faster R-CNN object detection

  • Merged the RPN and Fast R-CNN into a single network by sharing their convolutional features; the RPN acts as an “attention” mechanism that tells the unified network where to look.
  • For the very deep VGG-16 model, their detection system ran at a frame rate of 5 fps on a GPU.
  • Achieved state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image.
  • In the ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN were the foundations of the 1st-place winning entries in several tracks.
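
The RPN slides over a shared feature map and emits proposals at every position; each feature-map cell corresponds to a stride-spaced location in the input image, where a set of anchors of different scales and aspect ratios is placed. A sketch of the anchor-center grid (stride 16 matches the VGG-16 backbone; the helper is illustrative):

```python
import numpy as np

def anchor_centers(feat_h, feat_w, stride=16):
    """Image-space centers of RPN anchor sets, one per feature-map
    position; k anchors of different scales/aspect ratios would be
    placed at each center."""
    ys = (np.arange(feat_h) + 0.5) * stride
    xs = (np.arange(feat_w) + 0.5) * stride
    return np.array([(y, x) for y in ys for x in xs])
```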

Find the Faster R-CNN paper here .

Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi developed YOLO, an innovative approach to object detection. Instead of repurposing classifiers to perform detection, the authors framed object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.

YOLO CNN architecture

  • The base YOLO model processed images in real-time at 45 frames per second.
  • A smaller version of the network, Fast YOLO, processed 155 frames per second, while still achieving double the mAP of other real-time detectors.
  • Compared to state-of-the-art detection systems, YOLO made more localization errors, but was less likely to predict false positives in the background.
  • YOLO learned very general representations of objects and outperformed other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains, such as artwork.
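
The regression framing fixes the output shape: each of S×S grid cells predicts B boxes (x, y, w, h, confidence) plus C class probabilities. With the paper's settings for PASCAL VOC (S = 7, B = 2, C = 20) this gives the 7×7×30 prediction tensor:

```python
def yolo_output_shape(S=7, B=2, C=20):
    """Shape of YOLO v1's final prediction tensor: per grid cell,
    B boxes of 5 numbers each, plus C class probabilities."""
    return (S, S, B * 5 + C)
```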

Find the YOLO paper here .

Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick (Facebook) presented a conceptually simple, flexible, and general framework for object instance segmentation. Their approach could detect objects in an image, while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN , extended Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition.

mask R-CNN framework

  • Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps.
  • Showed great results in all three tracks of the COCO suite of challenges: instance segmentation, bounding-box object detection, and person keypoint detection.
  • Mask R-CNN outperformed all existing, single-model entries on every task, including the COCO 2016 challenge winners.
  • The model served as a solid baseline and eased future research in instance-level recognition.

Find the Mask R-CNN paper here .

The authors of EfficientNet (Mingxing Tan and Quoc V. Le) studied model scaling and identified that carefully balancing network depth, width, and resolution can lead to better performance. They proposed a new scaling method that uniformly scales all dimensions of depth, width, and resolution using a simple but effective compound coefficient. They demonstrated the effectiveness of this method in scaling up MobileNet and ResNet.

efficiennet model scaling CNN

  • Designed a new baseline network and scaled it up to obtain a family of models, called EfficientNets. It had much better accuracy and efficiency than previous ConvNets.
  • EfficientNet-B7 achieved state-of-the-art 84.3% top-1 accuracy on ImageNet, while being 8.4x smaller and 6.1x faster on inference than the best existing ConvNet.
  • It also transferred well and achieved state-of-the-art accuracy on CIFAR-100 (91.7%), Flowers (98.8%), and 3 other transfer learning datasets, with much fewer parameters.
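
The compound coefficient makes the scaling concrete: with coefficient phi, depth scales by alpha^phi, width by beta^phi, and input resolution by gamma^phi, under the constraint alpha * beta^2 * gamma^2 ≈ 2, so each increment of phi roughly doubles FLOPS. Using the base values grid-searched in the paper:

```python
# Base scaling constants found by grid search in the EfficientNet paper.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def scale_factors(phi):
    """Multipliers for depth, width, and resolution at coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi
```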

Find the EfficientNet paper here .




  • Review Article
  • Open access
  • Published: 08 January 2021

Deep learning-enabled medical computer vision

  • Andre Esteva   ORCID: orcid.org/0000-0003-1937-9682 1 ,
  • Katherine Chou 2   na1 ,
  • Serena Yeung 3   na1 ,
  • Nikhil Naik   ORCID: orcid.org/0000-0002-5191-2726 1   na1 ,
  • Ali Madani 1   na1 ,
  • Ali Mottaghi 3   na1 ,
  • Yun Liu   ORCID: orcid.org/0000-0003-4079-8275 2 ,
  • Eric Topol 4 ,
  • Jeff Dean 2 &
  • Richard Socher 1  

npj Digital Medicine volume  4 , Article number:  5 ( 2021 ) Cite this article


  • Computational science
  • Health care
  • Medical research

A decade of unprecedented progress in artificial intelligence (AI) has demonstrated the potential for many fields—including medicine—to benefit from the insights that AI techniques can extract from data. Here we survey recent progress in the development of modern computer vision techniques—powered by deep learning—for medical applications, focusing on medical imaging, medical video, and clinical deployment. We start by briefly summarizing a decade of progress in convolutional neural networks, including the vision tasks they enable, in the context of healthcare. Next, we discuss several example medical imaging applications that stand to benefit—including cardiology, pathology, dermatology, and ophthalmology—and propose new avenues for continued work. We then expand into general medical video, highlighting ways in which clinical workflows can integrate computer vision to enhance care. Finally, we discuss the challenges and hurdles required for real-world clinical deployment of these technologies.


Introduction.

Computer vision (CV) has a rich history spanning decades 1 of efforts to enable computers to perceive visual stimuli meaningfully. Machine perception spans a range of levels, from low-level tasks such as identifying edges, to high-level tasks such as understanding complete scenes. Advances in the last decade have largely been due to three factors: (1) the maturation of deep learning (DL)—a type of machine learning that enables end-to-end learning of very complex functions from raw data 2 , (2) strides in localized compute power via GPUs 3 , and (3) the open-sourcing of large labeled datasets with which to train these algorithms 4 . The combination of these three elements has given individual researchers access to the resources needed to advance the field. As the research community grew exponentially, so did progress.

The growth of modern CV has overlapped with the generation of large amounts of digital data in a number of scientific fields. Recent medical advances have been prolific 5 , 6 , owing largely to DL’s remarkable ability to learn many tasks from most data sources. Using large datasets, CV models can acquire many pattern-recognition abilities—from physician-level diagnostics 7 to medical scene perception 8 . See Fig. 1 .

figure 1

a Multimodal discriminative model. Deep learning architectures can be constructed to jointly learn from both image data, typically with convolutional networks, and non-image data, typically with general deep networks. Learned annotations can include disease diagnostics, prognostics, clinical predictions, and combinations thereof. b Generative model. Convolutional neural networks can be trained to generate images. Tasks include image-to-image regression (shown), super-resolution image enhancement, novel image generation, and others.

Here we survey the intersection of CV and medicine, focusing on research in medical imaging, medical video, and real clinical deployment. We discuss key algorithmic capabilities which unlocked these opportunities, and dive into the myriad of accomplishments from recent years. The clinical tasks suitable for CV span many categories, such as screening, diagnosis, detecting conditions, predicting future outcomes, segmenting pathologies from organs to cells, monitoring disease, and clinical research. Throughout, we consider the future growth of this technology and its implications for medicine and healthcare.

Computer vision

Object classification, localization, and detection respectively refer to identifying the type of an object in an image, the location of objects present, and both type and location simultaneously. The ImageNet Large-Scale Visual Recognition Challenge 9 (ILSVRC) spearheaded progress on these tasks over the last decade. It created a large community of DL researchers competing and collaborating together to improve techniques on various CV tasks. The first contemporary, GPU-powered DL approach, in 2012 10 , yielded an inflection point in the growth of this community, heralding an era of significant year-over-year improvements 11 , 12 , 13 , 14 through the competition’s final year in 2017. Notably, classification accuracy achieved human-level performance during this period. Within medicine, fine-grained versions of these methods 15 have successfully been applied to the classification and detection of many diseases (Fig. 2 ). Given sufficient data, the accuracy often matches or surpasses the level of expert physicians 7 , 16 . Similarly, the segmentation of objects has substantially improved 17 , 18 , particularly in challenging scenarios such as the biomedical segmentation of multiple types of overlapping cells in microscopy. The key DL technique leveraged in these tasks is the convolutional neural network 19 (CNN)—a type of DL algorithm which hardcodes translational invariance, a key feature of image data. Many other CV tasks have benefited from this progress, including image registration (identifying corresponding points across similar images), image retrieval (finding similar images), and image reconstruction and enhancement. The specific challenges of working with medical data require the utilization of many types of AI models.
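
The property a CNN hardcodes is easy to demonstrate outside any framework: because a convolution applies the same weights at every position, shifting the input shifts the feature map by the same amount. A minimal numpy sketch with a toy kernel (illustrative, not a trained network):

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Plain 'valid' 2-D cross-correlation, the core CNN operation."""
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (img[i:i + kh, j:j + kw] * kernel).sum()
    return out
```

Moving a bright pixel by one row and one column moves the response pattern by exactly one row and one column, which is the weight-sharing property that makes CNNs data-efficient on images.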

figure 2

CNNs—trained to classify disease states—have been extensively tested across diseases, and benchmarked against physicians. Their performance is typically on par with experts when both are tested on the same image classification task. a Dermatology 7 and b Radiology 156 . Examples reprinted with permission and adapted for style.

These techniques largely rely on supervised learning, which leverages datasets that contain both data points (e.g. images) and data labels (e.g. object classes). Given the sparsity and access difficulties of medical data, transfer learning—in which an algorithm is first trained on a large and unrelated corpus (e.g. ImageNet 4 ), then fine-tuned on a dataset of interest (e.g. medical)—has been critical for progress. To reduce the costs associated with collecting and labeling data, techniques to generate synthetic data, such as data augmentation 20 and generative adversarial networks (GANs) 21 , are being developed. Researchers have even shown that crowd-sourcing image annotations can yield effective medical algorithms 22 , 23 . Recently, self-supervised learning 24 —in which implicit labels are extracted from data points and used to train algorithms (e.g. predicting the spatial arrangement of tiles generated from splitting an image into pieces)—has pushed the field towards fully unsupervised learning, which lacks the need for labels. Applying these techniques in medicine will reduce the barrier to development and deployment.
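
The transfer-learning recipe described above can be sketched without any deep-learning framework: keep the pretrained feature extractor frozen and fit only a small new head on the target data. Here a fixed random projection stands in for pretrained convolutional features; this is purely illustrative, not a real medical model:

```python
import numpy as np

rng = np.random.default_rng(0)
W_FROZEN = rng.normal(size=(16, 4))  # stand-in for pretrained weights

def features(x):
    """Frozen 'backbone' forward pass (never updated)."""
    return np.maximum(x @ W_FROZEN, 0.0)

def fit_head(x, y):
    """Fine-tuning step: train only the new linear head,
    here via least squares on the frozen features."""
    return np.linalg.lstsq(features(x), y, rcond=None)[0]
```

Because only the head is fitted, the target dataset can be orders of magnitude smaller than the corpus used to pretrain the backbone.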

Medical data access is central to this field, and key ethical and legal questions must be addressed. Do patients own their de-identified data? What if methods to re-identify data improve over time? Should the community open-source large quantities of data? To date, academia and industry have largely relied on small, open-source datasets, and data collected through commercial products. Dynamics around data sharing and country-specific availability will impact deployment opportunities. The field of federated learning 25 —in which centralized algorithms can be trained on distributed data that never leaves protected enclosures—may enable a workaround in stricter jurisdictions.

These advances have spurred growth in other domains of CV, such as multimodal learning, which combines vision with other modalities such as language (Fig. 1a ) 26 , time-series data, and genomic data 5 . These methods can combine with 3D vision 27 , 28 to turn depth-cameras into privacy-preserving sensors 29 , making deployment easier for patient settings such as the intensive care unit 8 . The range of tasks is even broader in video. Applications like activity recognition 30 and live scene understanding 31 are useful in detecting and responding to important or adverse clinical events 32 .

Medical imaging

In recent years the number of publications applying computer vision techniques to static medical imagery has grown from hundreds to thousands 33 . A few areas have received substantial attention—radiology, pathology, ophthalmology, and dermatology—owing to the visual pattern-recognition nature of diagnostic tasks in these specialities, and the growing availability of highly structured images.

The unique characteristics of medical imagery pose a number of challenges to DL-based computer vision. For one, images can be massive. Digitizing histopathology slides produces gigapixel images of around 100,000 × 100,000 pixels, whereas typical CNN image inputs are around 200 × 200 pixels. Further, different chemical preparations will render different slides for the same piece of tissue, and different digitization devices or settings may produce different images for the same slide. Radiology modalities such as CT and MRI render equally massive 3D images, forcing standard CNNs to either work with a set of 2D slices, or adjust their internal structure to process in 3D. Similarly, ultrasound renders a time-series of noisy 2D slices of a 3D context: slices which are spatially correlated but not aligned. DL has started to account for the unique challenges of medical data. For instance, multiple-instance-learning (MIL) 34 enables learning from datasets containing massive images and few labels (e.g. histopathology). 3D convolutions in CNNs are enabling better learning from 3D volumes (e.g. MRI and CT) 35 . Spatio-temporal models 36 and image registration enable working with time-series images (e.g. ultrasound).
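
The MIL idea can be sketched in one line: the gigapixel slide is tiled into patches, a patch-level model scores each tile, and the slide-level label is pooled from the patch scores. With max pooling, one confidently positive patch flags the slide; this is an illustrative simplification of MIL, not a specific published system:

```python
import numpy as np

def slide_score(patch_scores):
    """Slide-level score under the standard MIL assumption:
    a slide is positive if at least one of its patches is positive,
    so the slide inherits the maximum patch score."""
    return float(np.max(patch_scores))
```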

Dozens of companies have obtained US FDA and European CE approval for medical imaging AI 37 , and commercial markets have begun to form as sustainable business models are created. For instance, regions of high-throughput healthcare, such as India and Thailand, have welcomed the deployment of technologies such as diabetic retinopathy screening systems 38 . This rapid growth has now reached the point of directly impacting patient outcomes—the US CMS recently approved reimbursement for a radiology stroke triage use-case which reduces the time it takes for patients to receive treatment 39 .

CV in medical modalities with non-standardized data collection requires the integration of CV into existing physical systems. For instance, in otolaryngology, CNNs can be used to help primary care physicians manage patients’ ears, nose, and throat 40 , through mountable devices attached to smartphones 41 . Hematology and serology can benefit from microscope-integrated AIs 42 that diagnose common conditions 43 or count blood cells of various types 44 —repetitive tasks that are easy to augment with CNNs. AI in gastroenterology has demonstrated stunning capabilities. Video-based CNNs can be integrated into endoscopic procedures 45 for scope guidance, lesion detection, and lesion diagnosis. Applications include esophageal cancer screening 46 , detecting gastric cancer 47 , 48 , detecting stomach infections such as H. Pylori 49 , and even finding hookworms 50 . Scientists have taken this field one step further by building entire medical AI devices designed for monitoring, such as at-home smart toilets outfitted with diagnostic CNNs on cameras 51 . Beyond the analysis of disease states, CV can serve the future of human health and welfare through applications such as screening human embryos for implantation 52 .

Computer vision in radiology is so pronounced that it has quickly burgeoned into its own field of research, growing a corpus of work 53 , 54 , 55 that extends into all modalities, with a focus on X-rays, CT, and MRI. Chest X-ray analysis—a key clinical focus area 33 —has been an exemplar. The field has collected nearly 1 million annotated, open-source images 56 , 57 , 58 —the closest ImageNet 9 equivalent to date in medical CV. Analysis of brain imagery 59 (particularly for time-critical use-cases like stroke), and abdominal imagery 60 have similarly received substantial attention. Disease classification, nodule detection 61 , and region segmentation (e.g. ventricular 62 ) models have been developed for most conditions for which data can be collected. This has enabled the field to respond rapidly in times of crisis—for instance, developing and deploying COVID-19 detection models 63 . The field continues to expand with work in image translation (e.g. converting noisy ultrasound images into MRI), image reconstruction and enhancement (e.g. converting low-dosage, low-resolution CT images into high-resolution images 64 ), automated report generation, and temporal tracking (e.g. image registration to track tumor growth over time). In the sections below, we explore vision-based applications in other specialties.

Cardiac imaging is increasingly used in a wide array of clinical diagnoses and workflows. Key clinical applications for deep learning include diagnosis and screening. The most common imaging modality in cardiovascular medicine is the cardiac ultrasound, or echocardiogram. As a cost-effective, radiation-free technique, echocardiography is uniquely suited for DL due to straightforward data acquisition and interpretation—it is routinely used in most acute inpatient facilities, outpatient centers, and emergency rooms 65 . Further, 3D imaging techniques such as CT and MRI are used to understand cardiac anatomy and to better characterize supply-demand mismatch. CT segmentation algorithms have even been FDA-cleared for coronary artery visualization 66 .

There are many example applications. DL can be trained on a large database of echocardiographic studies and surpass the performance of board-certified echocardiographers in view classification 67 . Computational DL pipelines can assess hypertrophic cardiomyopathy, cardiac amyloid, and pulmonary arterial hypertension 68 . EchoNet 69 —a deep learning model that can recognize cardiac structures, estimate function, and predict systemic phenotypes that are not readily identifiable to human interpretation—has recently furthered the field.

To account for challenges around data access, data-efficient echocardiogram algorithms 70 have been developed, such as semi-supervised GANs that are effective at downstream tasks (e.g. predicting left ventricular hypertrophy). To account for the fact that most studies utilize privately held medical imaging datasets, 10,000 annotated echocardiogram videos were recently open-sourced 36 . Alongside this release, a video-based model, EchoNet-Dynamic 36 , was developed. It can estimate ejection fraction and assess cardiomyopathy, alongside a comprehensive evaluation criterion based on results from an external dataset and human experts.

Pathologists play a key role in cancer detection and treatment. Pathological analysis—based on visual inspection of tissue samples under microscope—is inherently subjective in nature. Differences in visual perception and clinical training can lead to inconsistencies in diagnostic and prognostic opinions 71 , 72 , 73 . Here, DL can support critical medical tasks, including diagnostics, prognostication of outcomes and treatment response, pathology segmentation, disease monitoring, and so forth.

Recent years have seen the adoption of sub-micron-level resolution tissue scanners that capture gigapixel whole-slide images (WSI) 74 . This development, coupled with advances in CV, has led to research and commercialization activity in AI-driven digital histopathology 75 . This field has the potential to (i) overcome limitations of human visual perception and cognition by improving the efficiency and accuracy of routine tasks, (ii) develop new signatures of disease and therapy from morphological structures invisible to the human eye, and (iii) combine pathology with radiological, genomic, and proteomic measurements to improve diagnosis and prognosis 76 .

One thread of research has focused on automating the routine, time-consuming task of localization and quantification of morphological features. Examples include the detection and classification of cells, nuclei, and mitoses 77 , 78 , 79 , and the localization and segmentation of histological primitives such as nuclei, glands, ducts, and tumors 80 , 81 , 82 , 83 . These methods typically require expensive manual annotation of tissue components by pathologists as training data.

Another research avenue focuses on direct diagnostics 84 , 85 , 86 and prognostics 87 , 88 from WSI or tissue microarrays (TMA) for a variety of cancers—breast, prostate, lung cancer, etc. Studies have even shown that morphological features captured by a hematoxylin and eosin (H&E) stain are predictive of molecular biomarkers utilized in theragnosis 85 , 89 . While histopathology slides digitize into massive, data-rich gigapixel images, region-level annotations are sparse and expensive. To help overcome this challenge, the field has developed DL algorithms based on multiple-instance learning 90 that utilize slide-level “weak” annotations and exploit the sheer size of these images for improved performance.

The data abundance of this domain has further enabled tasks such as virtual staining 91 , in which models are trained to predict one type of image (e.g. a stained image) from another (e.g. a raw microscopy image). See Fig. 1b . Moving forward, AI algorithms that learn to perform diagnosis, prognosis, and theragnosis using digital pathology image archives and annotations readily available from electronic health records have the potential to transform the fields of pathology and oncology.

Dermatology

The key clinical tasks for DL in dermatology include lesion-specific differential diagnostics, finding concerning lesions amongst many benign lesions, and helping track lesion growth over time 92 . A series of works have demonstrated that CNNs can match the performance of board-certified dermatologists at classifying malignant skin lesions from benign ones 7 , 93 , 94 . These studies have sequentially tested increasing numbers of dermatologists (25 7 , 57 93 , and 157 94 ), consistently demonstrating a sensitivity and specificity in classification that matches or even exceeds physician levels. These studies were largely restricted to the binary classification task of discerning benign vs malignant cutaneous lesions, classifying either melanomas from nevi or carcinomas from seborrheic keratoses.

Recently, this line of work has expanded to encompass differential diagnostics across dozens of skin conditions 95 , including non-neoplastic lesions such as rashes and genetic conditions, and incorporating non-visual metadata (e.g. patient demographics) as classifier inputs 96 . These works have been catalyzed by open-access image repositories and AI challenges that encourage teams to compete on predetermined benchmarks 97 .

Incorporating these algorithms into clinical workflows would extend their utility to other key tasks, including large-scale detection of malignancies in patients with many lesions, and tracking lesions across images in order to capture temporal features, such as growth and color changes. This area remains fairly unexplored, with initial works jointly training CNNs to detect and track lesions 98 .

Ophthalmology

Ophthalmology has seen a significant uptick in AI efforts in recent years, with dozens of papers demonstrating clinical diagnostic and analytical capabilities that extend beyond current human capability 99 , 100 , 101 . The potential clinical impact is significant 102 , 103 —the portability of the machinery used to inspect the eye means that pop-up clinics and telemedicine could be used to distribute testing sites to underserved areas. The field depends largely on fundus imaging and optical coherence tomography (OCT) to diagnose and manage patients.

CNNs can accurately diagnose a number of conditions. Diabetic retinopathy—a condition in which blood vessels in the eyes of diabetic patients “leak” and can lead to blindness—has been extensively studied. CNNs consistently demonstrate physician-level grading from fundus photographs 104 , 105 , 106 , 107 , which has led to a recent US FDA-cleared system 108 . Similarly, they can diagnose or predict the progression of center-involved diabetic macular edema 109 , age-related macular degeneration 107 , 110 , glaucoma 107 , 111 , manifest visual field loss 112 , childhood blindness 113 , and others.

The eyes contain a number of non-human-interpretable features, indicative of meaningful medical information, that CNNs can pick up on. Remarkably, it was shown that CNNs can classify a number of cardiovascular and diabetic risk factors from fundus photographs 114 , including age, gender, smoking, hemoglobin-A1c, body-mass index, systolic blood pressure, and diastolic blood pressure. CNNs can also pick up signs of anemia 115 and chronic kidney disease 116 from fundus photographs. This presents an exciting opportunity for future AI studies predicting nonocular information from eye images. This could lead to a paradigm shift in care in which eye exams screen patients for the presence of both ocular and nonocular disease, something human physicians cannot currently do.

Medical video

Surgical applications.

CV may provide significant utility in procedural fields such as surgery and endoscopy. Key clinical applications for deep learning include enhancing surgeon performance through real-time contextual awareness 117 , skills assessments, and training. Early studies have begun pursuing these objectives, primarily in video-based robotic and laparoscopic surgery—a number of works propose methods for detecting surgical tools and actions 118 , 119 , 120 , 121 , 122 , 123 , 124 . Some studies analyze tool movement or other cues to assess surgeon skill 119 , 121 , 123 , 124 , through established ratings such as the Global Operative Assessment of Laparoscopic Skills (GOALS) criteria for laparoscopic surgery 125 . Another line of work uses CV to recognize distinct phases of surgery during operations, towards developing context-aware computer assistance systems 126 , 127 . CV is also starting to emerge in open surgery settings 128 , of which there is a significant volume. The challenge here lies in the diversity of video capture viewpoints (e.g., head-mounted, side-view, and overhead cameras) and types of surgeries. For all types of surgical video, translating CV analysis to tools and applications that can improve patient outcomes is a natural next direction of research.

Human activity

CV can recognize human activity in physical spaces, such as hospitals and clinics, for a range of “ambient intelligence” applications. Ambient intelligence refers to a continuous, non-invasive awareness of activity in a physical space that can provide clinicians, nurses, and other healthcare workers with assistance such as patient monitoring, automated documentation, and monitoring for protocol compliance (Fig. 3 ). In hospitals, for example, early works have demonstrated CV-based ambient intelligence in intensive care units to monitor for safety-critical behaviors such as hand hygiene activity 32 and patient mobilization 8 , 129 , 130 . CV has also been developed for the emergency department, to transcribe procedures performed during the resuscitation of a patient 131 , and for the operating room (OR), to recognize activities for workflow optimization 132 . At the hospital operations level, CV can be a scalable and detailed form of labor and resource measurement that improves resource allocation for optimal care 133 .

figure 3

Computer vision coupled with sensors and video streams enables a number of safety applications in clinical and home settings, enabling healthcare providers to scale their ability to monitor patients. Primarily created using models for fine-grained activity recognition, applications may include patient monitoring in ICUs, proper hand hygiene and physical action protocols in hospitals and clinics, anomalous event detection, and others.

Outside of hospitals, ambient intelligence can increase access to healthcare. For instance, it could enable at-risk seniors to live independently at home, by monitoring for safety and abnormalities in daily activities (e.g. detecting falls, which are particularly dangerous for the elderly 134 , 135 ), assisted living, and physiological measurement. Similar work 136 , 137 , 138 has targeted broader categories of daily activity. Recognizing and computing long-term descriptive analytics of activities such as sleeping, walking, and sitting over time can detect clinically meaningful changes or anomalies 136 . To ensure patient privacy, researchers have developed CV algorithms that work with thermal video data 136 . Another application area of CV is assisted living or rehabilitation, such as continuous sign language recognition to assist people with communication difficulties 139 , and monitoring of physiotherapy exercises for stroke rehabilitation 140 . CV also offers potential as a tool for remote physiological measurements. For instance, systems could use video to analyze heart and breathing rates 141 . As telemedicine visits increase in frequency, CV could play a role in patient triaging, particularly in times of high demand such as the COVID-19 pandemic 142 . CV-based ambient intelligence technologies offer a wide range of opportunities for increased access to quality care. However, new ethical and legal questions will arise 143 in the design of these technologies.
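As a toy illustration of the video-based physiological measurement mentioned above, a heart rate can be recovered from a per-frame brightness signal by finding the lag that maximizes its autocorrelation within a plausible beats-per-minute range. The simulated sinusoidal "pulse" below is an assumption standing in for a real camera-derived signal:

```python
import math

def estimate_heart_rate(signal, fps, min_bpm=40, max_bpm=180):
    """Estimate heart rate (bpm) from a per-frame brightness signal
    by picking the autocorrelation peak within a plausible lag range."""
    mean = sum(signal) / len(signal)
    x = [s - mean for s in signal]
    def autocorr(lag):
        return sum(x[i] * x[i + lag] for i in range(len(x) - lag))
    # Lags corresponding to plausible heart rates
    lo = int(fps * 60 / max_bpm)
    hi = int(fps * 60 / min_bpm)
    best_lag = max(range(lo, hi + 1), key=autocorr)
    return 60.0 * fps / best_lag

fps = 30
true_bpm = 75  # simulated pulse frequency
signal = [math.sin(2 * math.pi * true_bpm / 60 * t / fps) for t in range(300)]
rate = estimate_heart_rate(signal, fps)  # ≈ 75 bpm
```

Real systems must additionally contend with motion, lighting changes, and skin-tone variation, which is where learned models come in.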

Clinical deployment

As medical AI advances into the clinic 144 , it will simultaneously have the power to do great good for society, and to potentially exacerbate long-standing inequalities and perpetuate errors in medicine. If done properly and ethically, medical AI can become a flywheel for more equitable care—the more it is used, the more data it acquires, the more accurate and general it becomes. The key is in understanding the data that the models are built on and the environment in which they are deployed. Here, we present four key considerations when applying ML technologies in healthcare: assessment of data, planning for model limitations, community participation, and trust building.

Data quality largely determines model quality; identifying inequities in the data and taking them into account will lead towards more equitable healthcare. Procuring the right datasets may depend on running human-in-the-loop programs or broad-reaching data collection techniques. There are a number of methods that aim to remove bias in data. Individual-level bias can be addressed via expert discussion 145 and labeling adjudication 146 . Population-level bias can be addressed via missing data supplements and distributional shifts. International multi-institutional evaluation is a robust method to determine generalizability of models across diverse populations, medical equipment, resource settings, and practice patterns. In addition, using multi-task learning 147 to train models to perform a variety of tasks rather than one narrowly defined task, such as multi-cancer detection from histopathology images 148 , makes them more generally useful and often more robust.
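The multi-task idea above can be sketched simply: several task heads share one learned representation, and their losses are combined during training. The tiny linear "network," weights, and targets below are toy assumptions, not any published model:

```python
def shared_features(x, w_shared):
    # Shared representation: one linear layer followed by ReLU
    return [max(0.0, sum(xi * wi for xi, wi in zip(x, row))) for row in w_shared]

def multitask_loss(x, targets, w_shared, task_heads):
    """Sum of per-task squared errors computed on a shared representation.
    Training on this combined loss encourages features useful to all tasks."""
    h = shared_features(x, w_shared)
    total = 0.0
    for head, t in zip(task_heads, targets):
        pred = sum(hi * wi for hi, wi in zip(h, head))
        total += (pred - t) ** 2
    return total

x = [1.0, 2.0]
w_shared = [[0.5, 0.1], [-0.3, 0.4]]
heads = [[1.0, 0.0], [0.0, 1.0]]  # two task-specific output heads
loss = multitask_loss(x, [0.7, 0.5], w_shared, heads)
```

A single gradient step on `loss` would update the shared layer using signal from both tasks at once, which is the source of the robustness benefit.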

Transparent reporting can reveal potential weaknesses and help address model limitations. Guardrails to protect against possible worst-case scenarios—minority, dismissal, or automation bias—must be put in place. It is insufficient to report and be satisfied with strong performance measures on general datasets when delivering care for patients—there should be an understanding of the specific instances in which the model fails. One technique is to assess demographic performance in combination with saliency maps 149 , to visualize what the model pays attention to, and check for potential biases. For instance, when using deep learning to develop a differential diagnosis for skin diseases 95 , researchers examined the model performance based on Fitzpatrick skin types and other demographic information to determine patient types for which there were insufficient examples, and inform future data collection. Further, they used saliency masks to verify the model was informed by skin abnormalities and not skin type. See Fig. 4 .

figure 4

a Example graphic of biased training data in dermatology. AIs trained primarily on lighter skin tones may not generalize as well when tested on darker skin 157 . Models require diverse training datasets for maximal generalizability (e.g. 95 ). b Gradient Masks project the model’s attention onto the original input image, allowing practitioners to visually confirm regions that most influence predictions. Panel was reproduced from ref. 95 with permission.
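Gradient-style saliency of this kind can be approximated even without automatic differentiation: perturb each input pixel slightly and measure how much the model's score changes. The toy linear "model" below is an assumption standing in for a trained classifier's class logit:

```python
def saliency_map(model, image, eps=1e-4):
    """Approximate |d score / d pixel| for each pixel via forward differences."""
    base = model(image)
    sal = []
    for i in range(len(image)):
        bumped = list(image)
        bumped[i] += eps
        sal.append(abs((model(bumped) - base) / eps))
    return sal

# Toy scorer: for a linear model the saliency equals |weight| exactly
weights = [0.0, 2.0, -1.0, 0.5]
model = lambda img: sum(w * p for w, p in zip(weights, img))

image = [0.3, 0.9, 0.1, 0.4]
sal = saliency_map(model, image)  # ≈ [0.0, 2.0, 1.0, 0.5]
```

In practice gradients are computed by backpropagation in one pass, but the interpretation is the same: large values mark pixels that most influence the prediction.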

A known limitation of ML is its performance on out-of-distribution data: samples unlike any seen during model training. Progress has been made on out-of-distribution detection 150 and on developing confidence intervals to help detect anomalies. Additionally, methods are being developed to understand the uncertainty 151 around model outputs. This is especially critical when implementing patient-specific predictions that impact safety.
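One common baseline for out-of-distribution detection is to threshold the maximum softmax probability: confident predictions are treated as in-distribution, while near-uniform ones are flagged for review. A minimal sketch (the threshold value is an arbitrary assumption that would be tuned on validation data):

```python
import math

def max_softmax_confidence(logits):
    """Confidence score: the maximum softmax probability of the prediction."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    return max(exps) / sum(exps)

def flag_ood(logits, threshold=0.8):
    """Flag an input as potentially out-of-distribution when the
    model's top-class confidence falls below the threshold."""
    return max_softmax_confidence(logits) < threshold

flag_ood([4.0, 0.1, 0.2])  # peaked logits -> in-distribution -> False
flag_ood([1.0, 0.9, 1.1])  # near-uniform logits -> flagged -> True
```

Flagged inputs can then be routed to a clinician rather than silently misclassified.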

Community participation—from patients, physicians, computer scientists, and other relevant stakeholders—is paramount to successful deployment. This has helped identify structural drivers of racial bias in health diagnostics—particularly in discovering bias in datasets and identifying demographics for which models fail 152 . User-centered evaluations are a valuable tool in ensuring a system’s usability and fit into the real world. What’s the best way to present a model’s output to facilitate clinical decision making? How should a mobile app system be deployed in resource-constrained environments, such as areas with intermittent connectivity? For example, when launching ML-powered diabetic retinopathy models in Thailand and India, researchers noticed that model performance was impacted by socioeconomic factors 38 , and determined that where a model is most useful may not be where the model was generated. Ophthalmology models may need to be deployed in endocrinology care, as opposed to eye centers, due to access issues in the specific local environment. Another effective tool to build physician trust in AI results is side-by-side deployment of ML models with existing workflows (e.g. manual grading 16 ). See Fig. 5 . Without question, AI models will require rigorous evaluation through clinical trials, to gauge safety and effectiveness. Excitingly, AI and CV can also help support clinical trials 153 , 154 through a number of applications—including patient selection, tumor tracking, adverse event detection, etc—creating an ecosystem in which AI can help design safe AI.

figure 5

An example workflow showing the positive compounding effect of AI-enhanced workflows, and the resultant trust that can be built. AI predictions provide immediate value to physicians, and improve over time as bigger datasets are collected.

Trust for AI in healthcare is fundamental to its adoption 155 both by clinical teams and by patients. The foundation of clinical trust will come in large part from rigorous prospective trials that validate AI algorithms in real-world clinical environments. These environments incorporate human and social responses, which can be hard to predict and control, but which AI technologies must account for. Whereas the randomness and human element of clinical environments are impossible to capture in retrospective studies, prospective trials that best reflect clinical practice will shift the conversation towards measurable benefits in real deployments. Here, AI interpretability will be paramount—predictive models will need the ability to describe why specific factors about the patient or environment lead them to their predictions.

In addition to clinical trust, patient trust—particularly around privacy concerns—must be earned. One significant area of need is next-generation regulations that account for advances in privacy-preserving techniques. ML typically does not require traditional identifiers to produce useful results, but there are meaningful signals in data that can be considered sensitive. To unlock insights from these sensitive data types, the evolution of privacy-preserving techniques must continue, and further advances need to be made in fields such as federated learning and federated analytics.
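The core aggregation step of federated learning can be sketched simply: each site trains on its own data and shares only model weights, which a coordinating server averages, weighted by local dataset size (the FedAvg scheme 26 ). The hospital weight vectors and counts below are toy values:

```python
def fed_avg(client_weights, client_sizes):
    """FedAvg aggregation: average client model weights, weighted by
    the number of local training examples at each site."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [sum(w[d] * n for w, n in zip(client_weights, client_sizes)) / total
            for d in range(dim)]

# Three hospitals train locally and share only weights, never raw images
hospitals = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
sizes = [100, 100, 200]  # images held at each site
global_model = fed_avg(hospitals, sizes)  # [3.5, 4.5]
```

In a full system this averaged model is broadcast back to the sites for another round of local training; the raw patient data never leaves any hospital.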

Each technological wave affords us a chance to reshape our future. In this case, artificial intelligence, deep learning, and computer vision represent an opportunity to make healthcare far more accessible, equitable, accurate, and inclusive than it has ever been.

Data availability

Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.

Szeliski, R. Computer Vision: Algorithms and Applications (Springer Science & Business Media, 2010).

LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521 , 436–444 (2015).


Sanders, J. & Kandrot, E. CUDA by Example: An Introduction to General-Purpose GPU Programming (Addison-Wesley Professional, 2010).

Deng, J. et al. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition 248–255 (IEEE, 2009).

Esteva, A. et al. A guide to deep learning in healthcare. Nat. Med. 25 , 24–29 (2019).

Topol, E. J. High-performance medicine: the convergence of human and artificial intelligence. Nat. Med. 25 , 44–56 (2019).

Esteva, A. et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542 , 115–118 (2017).


Yeung, S. et al. A computer vision system for deep learning-based detection of patient mobilization activities in the ICU. NPJ Digit Med. 2 , 11 (2019).


Russakovsky, O. et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 115 , 211–252 (2015).


Krizhevsky, A., Sutskever, I. & Hinton, G. E. in Advances in Neural Information Processing Systems 25 (eds Pereira, F., Burges, C. J. C., Bottou, L. & Weinberger, K. Q.) 1097–1105 (Curran Associates, Inc., 2012).

Sermanet, P. et al. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks. Preprint at https://arxiv.org/abs/1312.6229 (2013).

Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. Preprint at https://arxiv.org/abs/1409.1556 (2014).

Szegedy, C. et al. Going deeper with convolutions. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 1–9 (2015).

He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 770–778 (2016).

Gebru, T., Hoffman, J. & Fei-Fei, L. Fine-grained recognition in the wild: a multi-task domain adaptation approach. In 2017 IEEE International Conference on Computer Vision (ICCV) 1358–1367 (IEEE, 2017).

Gulshan, V. et al. Performance of a deep-learning algorithm vs manual grading for detecting diabetic retinopathy in India. JAMA Ophthalmol. https://doi.org/10.1001/jamaophthalmol.2019.2004 (2019).

Ronneberger, O., Fischer, P. & Brox, T. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention 234–241 (Springer, Cham, 2015).

Isensee, F. et al. nnU-Net: self-adapting framework for U-Net-based medical image segmentation. Preprint at https://arxiv.org/abs/1809.10486 (2018).

LeCun, Y. & Bengio, Y. in The Handbook of Brain Theory and Neural Networks 255–258 (MIT Press, 1998).

Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V. & Le, Q. V. AutoAugment: learning augmentation policies from data. Preprint at https://arxiv.org/abs/1805.09501 (2018).

Goodfellow, I. et al. Generative adversarial nets. In Advances in Neural Information Processing Systems 2672–2680 (2014).

Ørting, S. et al. A survey of Crowdsourcing in medical image analysis. Preprint at https://arxiv.org/abs/1902.09159 (2019).

Créquit, P., Mansouri, G., Benchoufi, M., Vivot, A. & Ravaud, P. Mapping of Crowdsourcing in health: systematic review. J. Med. Internet Res. 20 , e187 (2018).

Jing, L. & Tian, Y. in IEEE Transactions on Pattern Analysis and Machine Intelligence (IEEE, 2020).

McMahan, B., Moore, E., Ramage, D., Hampson, S. & y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics 1273–1282 (PMLR, 2017).

Karpathy, A. & Fei-Fei, L. Deep visual-semantic alignments for generating image descriptions. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 3128–3137 (IEEE, 2015).

Lv, D. et al. Research on the technology of LIDAR data processing. In 2017 First International Conference on Electronics Instrumentation Information Systems (EIIS) 1–5 (IEEE, 2017).

Lillo, I., Niebles, J. C. & Soto, A. Sparse composition of body poses and atomic actions for human activity recognition in RGB-D videos. Image Vis. Comput. 59 , 63–75 (2017).

Haque, A. et al. Towards vision-based smart hospitals: a system for tracking and monitoring hand hygiene compliance. In Proceedings of the 2nd Machine Learning for Healthcare Conference , 68 , 75–87 (PMLR, 2017).

Heilbron, F. C., Escorcia, V., Ghanem, B. & Niebles, J. C. ActivityNet: a large-scale video benchmark for human activity understanding. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 961–970 (IEEE, 2015).

Liu, Y. et al. Learning to describe scenes with programs. In ICLR (Open Access, 2019).

Singh, A. et al. Automatic detection of hand hygiene using computer vision technology. J. Am. Med. Inform. Assoc. 27 , 1316–1320 (2020).

Litjens, G. et al. A survey on deep learning in medical image analysis. Med. Image Anal. 42 , 60–88 (2017).


Maron, O. & Lozano-Pérez, T. A framework for multiple-instance learning. In Advances in Neural Information Processing Systems 10 (eds Jordan, M. I., Kearns, M. J. & Solla, S. A.) 570–576 (MIT Press, 1998).

Singh, S. P. et al. 3D deep learning on medical images: a review. Sensors 20 , https://doi.org/10.3390/s20185097 (2020).

Ouyang, D. et al. Video-based AI for beat-to-beat assessment of cardiac function. Nature 580 , 252–256 (2020).

Benjamens, S., Dhunnoo, P. & Meskó, B. The state of artificial intelligence-based FDA-approved medical devices and algorithms: an online database. NPJ Digit. Med. 3 , 118 (2020).

Beede, E. et al. A human-centered evaluation of a deep learning system deployed in clinics for the detection of diabetic retinopathy. In Proc. 2020 CHI Conference on Human Factors in Computing Systems 1–12 (Association for Computing Machinery, 2020).

Viz.ai Granted Medicare New Technology Add-on Payment. PR Newswire https://www.prnewswire.com/news-releases/vizai-granted-medicare-new-technology-add-on-payment-301123603.html (2020).

Crowson, M. G. et al. A contemporary review of machine learning in otolaryngology-head and neck surgery. Laryngoscope 130 , 45–51 (2020).

Livingstone, D., Talai, A. S., Chau, J. & Forkert, N. D. Building an Otoscopic screening prototype tool using deep learning. J. Otolaryngol. Head. Neck Surg. 48 , 66 (2019).

Chen, P.-H. C. et al. An augmented reality microscope with real-time artificial intelligence integration for cancer diagnosis. Nat. Med. 25 , 1453–1457 (2019).

Gunčar, G. et al. An application of machine learning to haematological diagnosis. Sci. Rep. 8 , 411 (2018).


Alam, M. M. & Islam, M. T. Machine learning approach of automatic identification and counting of blood cells. Health. Technol. Lett. 6 , 103–108 (2019).

El Hajjar, A. & Rey, J.-F. Artificial intelligence in gastrointestinal endoscopy: general overview. Chin. Med. J. 133 , 326–334 (2020).

Horie, Y. et al. Diagnostic outcomes of esophageal cancer by artificial intelligence using convolutional neural networks. Gastrointest. Endosc. 89 , 25–32 (2019).

Hirasawa, T. et al. Application of artificial intelligence using a convolutional neural network for detecting gastric cancer in endoscopic images. Gastric Cancer 21 , 653–660 (2018).

Kubota, K., Kuroda, J., Yoshida, M., Ohta, K. & Kitajima, M. Medical image analysis: computer-aided diagnosis of gastric cancer invasion on endoscopic images. Surg. Endosc. 26 , 1485–1489 (2012).

Itoh, T., Kawahira, H., Nakashima, H. & Yata, N. Deep learning analyzes Helicobacter pylori infection by upper gastrointestinal endoscopy images. Endosc. Int Open 6 , E139–E144 (2018).

He, J.-Y., Wu, X., Jiang, Y.-G., Peng, Q. & Jain, R. Hookworm detection in wireless capsule endoscopy images with deep learning. IEEE Trans. Image Process. 27 , 2379–2392 (2018).

Park, S.-M. et al. A mountable toilet system for personalized health monitoring via the analysis of excreta. Nat. Biomed. Eng. 4 , 624–635 (2020).

VerMilyea, M. et al. Development of an artificial intelligence-based assessment model for prediction of embryo viability using static images captured by optical light microscopy during IVF. Hum. Reprod. 35 , 770–784 (2020).

Choy, G. et al. Current applications and future impact of machine learning in radiology. Radiology 288 , 318–328 (2018).

Saba, L. et al. The present and future of deep learning in radiology. Eur. J. Radiol. 114 , 14–24 (2019).

Mazurowski, M. A., Buda, M., Saha, A. & Bashir, M. R. Deep learning in radiology: an overview of the concepts and a survey of the state of the art with focus on MRI. J. Magn. Reson. Imaging 49 , 939–954 (2019).

Johnson, A. E. W. et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data 6 , 317 (2019).

Irvin, J. et al. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proc. of the AAAI Conference on Artificial Intelligence Vol. 33, 590–597 (2019).

Wang, X. et al. ChestX-ray8: hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 2097–2106 (2017).

Chilamkurthy, S. et al. Deep learning algorithms for detection of critical findings in head CT scans: a retrospective study. Lancet 392 , 2388–2396 (2018).

Weston, A. D. et al. Automated abdominal segmentation of CT scans for body composition analysis using deep learning. Radiology 290 , 669–679 (2019).

Ding, J., Li, A., Hu, Z. & Wang, L. in Medical Image Computing and Computer Assisted Intervention—MICCAI 2017 559–567 (Springer International Publishing, 2017).

Tan, L. K., Liew, Y. M., Lim, E. & McLaughlin, R. A. Convolutional neural network regression for short-axis left ventricle segmentation in cardiac cine MR sequences. Med. Image Anal. 39 , 78–86 (2017).

Zhang, J. et al. Viral pneumonia screening on chest X-ray images using confidence-aware anomaly detection. Preprint at https://arxiv.org/abs/2003.12338 (2020).

Zhang, X., Feng, C., Wang, A., Yang, L. & Hao, Y. CT super-resolution using multiple dense residual block based GAN. Signal Image Video Process. https://doi.org/10.1007/s11760-020-01790-5 (2020).

Papolos, A., Narula, J., Bavishi, C., Chaudhry, F. A. & Sengupta, P. P. U.S. hospital use of echocardiography: insights from the nationwide inpatient sample. J. Am. Coll. Cardiol. 67 , 502–511 (2016).

HeartFlowNXT—HeartFlow Analysis of Coronary Blood Flow Using Coronary CT Angiography—Study Results—ClinicalTrials.gov. https://clinicaltrials.gov/ct2/show/results/NCT01757678 .

Madani, A., Arnaout, R., Mofrad, M. & Arnaout, R. Fast and accurate view classification of echocardiograms using deep learning. NPJ Digit. Med. 1 , 6 (2018).

Zhang, J. et al. Fully automated echocardiogram interpretation in clinical practice. Circulation 138 , 1623–1635 (2018).

Ghorbani, A. et al. Deep learning interpretation of echocardiograms. NPJ Digit. Med. 3 , 10 (2020).

Madani, A., Ong, J. R., Tibrewal, A. & Mofrad, M. R. K. Deep echocardiography: data-efficient supervised and semi-supervised deep learning towards automated diagnosis of cardiac disease. NPJ Digit. Med. 1 , 59 (2018).

Perkins, C., Balma, D. & Garcia, R. Members of the Consensus Group & Susan G. Komen for the Cure. Why current breast pathology practices must be evaluated. A Susan G. Komen for the Cure white paper: June 2006. Breast J. 13 , 443–447 (2007).

Brimo, F., Schultz, L. & Epstein, J. I. The value of mandatory second opinion pathology review of prostate needle biopsy interpretation before radical prostatectomy. J. Urol. 184 , 126–130 (2010).

Elmore, J. G. et al. Diagnostic concordance among pathologists interpreting breast biopsy specimens. JAMA 313 , 1122–1132 (2015).

Evans, A. J. et al. US food and drug administration approval of whole slide imaging for primary diagnosis: a key milestone is reached and new questions are raised. Arch. Pathol. Lab. Med. 142 , 1383–1387 (2018).

Srinidhi, C. L., Ciga, O. & Martel, A. L. Deep neural network models for computational histopathology: A survey. Medical Image Analysis . p. 101813 (2020).

Bera, K., Schalper, K. A., Rimm, D. L., Velcheti, V. & Madabhushi, A. Artificial intelligence in digital pathology - new tools for diagnosis and precision oncology. Nat. Rev. Clin. Oncol. 16 , 703–715 (2019).

Cireşan, D. C., Giusti, A., Gambardella, L. M. & Schmidhuber, J. in Medical Image Computing and Computer-Assisted Intervention—MICCAI 2013 411–418 (Springer Berlin Heidelberg, 2013).

Wang, H. et al. Mitosis detection in breast cancer pathology images by combining handcrafted and convolutional neural network features. J. Med Imaging (Bellingham) 1 , 034003 (2014).

Kashif, M. N., Ahmed Raza, S. E., Sirinukunwattana, K., Arif, M. & Rajpoot, N. Handcrafted features with convolutional neural networks for detection of tumor cells in histology images. In 2016 IEEE 13th International Symposium on Biomedical Imaging (ISBI) 1029–1032 (IEEE, 2016).

Wang, D., Khosla, A., Gargeya, R., Irshad, H. & Beck, A. H. Deep learning for identifying metastatic breast cancer. Preprint at https://arxiv.org/abs/1606.05718 (2016).

BenTaieb, A. & Hamarneh, G. in Medical Image Computing and Computer-Assisted Intervention—MICCAI 2016 460–468 (Springer International Publishing, 2016).

Chen, H. et al. DCAN: Deep contour-aware networks for object instance segmentation from histology images. Med. Image Anal. 36 , 135–146 (2017).

Xu, Y. et al. Gland instance segmentation using deep multichannel neural networks. IEEE Trans. Biomed. Eng. 64 , 2901–2912 (2017).

Litjens, G. et al. Deep learning as a tool for increased accuracy and efficiency of histopathological diagnosis. Sci. Rep. 6 , 26286 (2016).

Coudray, N. et al. Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning. Nat. Med. 24 , 1559–1567 (2018).

Campanella, G. et al. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nat. Med. 25 , 1301–1309 (2019).

Mobadersany, P. et al. Predicting cancer outcomes from histology and genomics using convolutional networks. Proc. Natl Acad. Sci. U. S. A. 115 , E2970–E2979 (2018).

Courtiol, P. et al. Deep learning-based classification of mesothelioma improves prediction of patient outcome. Nat. Med. 25 , 1519–1525 (2019).

Rawat, R. R. et al. Deep learned tissue ‘fingerprints’ classify breast cancers by ER/PR/Her2 status from H&E images. Sci. Rep. 10 , 7275 (2020).

Dietterich, T. G., Lathrop, R. H. & Lozano-Pérez, T. Solving the multiple instance problem with axis-parallel rectangles. Artif. Intell. 89 , 31–71 (1997).

Christiansen, E. M. et al. In silico labeling: predicting fluorescent labels in unlabeled images. Cell 173 , 792–803.e19 (2018).

Esteva, A. & Topol, E. Can skin cancer diagnosis be transformed by AI? Lancet 394 , 1795 (2019).

Haenssle, H. A. et al. Man against machine: diagnostic performance of a deep learning convolutional neural network for dermoscopic melanoma recognition in comparison to 58 dermatologists. Ann. Oncol. 29 , 1836–1842 (2018).

Brinker, T. J. et al. Deep learning outperformed 136 of 157 dermatologists in a head-to-head dermoscopic melanoma image classification task. Eur. J. Cancer 113 , 47–54 (2019).

Liu, Y. et al. A deep learning system for differential diagnosis of skin diseases. Nat. Med. 26 , 900–908 (2020).

Yap, J., Yolland, W. & Tschandl, P. Multimodal skin lesion classification using deep learning. Exp. Dermatol. 27 , 1261–1267 (2018).

Marchetti, M. A. et al. Results of the 2016 International Skin Imaging Collaboration International Symposium on Biomedical Imaging challenge: Comparison of the accuracy of computer algorithms to dermatologists for the diagnosis of melanoma from dermoscopic images. J. Am. Acad. Dermatol. 78 , 270–277 (2018).

Li, Y. et al. Skin cancer detection and tracking using data synthesis and deep learning. Preprint at https://arxiv.org/abs/1612.01074 (2016).

Ting, D. S. W. et al. Artificial intelligence and deep learning in ophthalmology. Br. J. Ophthalmol. 103 , 167–175 (2019).

Keane, P. A. & Topol, E. J. With an eye to AI and autonomous diagnosis. NPJ Digit. Med. 1 , 40 (2018).

Keane, P. & Topol, E. Reinventing the eye exam. Lancet 394 , 2141 (2019).

De Fauw, J. et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat. Med. 24 , 1342–1350 (2018).

Kern, C. et al. Implementation of a cloud-based referral platform in ophthalmology: making telemedicine services a reality in eye care. Br. J. Ophthalmol. 104 , 312–317 (2020).

Gulshan, V. et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA 316 , 2402–2410 (2016).

Raumviboonsuk, P. et al. Deep learning versus human graders for classifying diabetic retinopathy severity in a nationwide screening program. NPJ Digit Med. 2 , 25 (2019).

Abràmoff, M. D. et al. Improved automated detection of diabetic retinopathy on a publicly available dataset through integration of deep learning. Invest. Ophthalmol. Vis. Sci. 57 , 5200–5206 (2016).

Ting, D. S. W. et al. Development and validation of a deep learning system for diabetic retinopathy and related eye diseases using retinal images from multiethnic populations with diabetes. JAMA 318 , 2211–2223 (2017).

Abràmoff, M. D., Lavin, P. T., Birch, M., Shah, N. & Folk, J. C. Pivotal trial of an autonomous AI-based diagnostic system for detection of diabetic retinopathy in primary care offices. NPJ Digit. Med. 1 , 39 (2018).

Varadarajan, A. V. et al. Predicting optical coherence tomography-derived diabetic macular edema grades from fundus photographs using deep learning. Nat. Commun. 11 , 130 (2020).

Yim, J. et al. Predicting conversion to wet age-related macular degeneration using deep learning. Nat. Med. 26 , 892–899 (2020).

Li, Z. et al. Efficacy of a deep learning system for detecting glaucomatous optic neuropathy based on color fundus photographs. Ophthalmology 125 , 1199–1206 (2018).

Yousefi, S. et al. Detection of longitudinal visual field progression in glaucoma using machine learning. Am. J. Ophthalmol. 193 , 71–79 (2018).

Brown, J. M. et al. Automated diagnosis of plus disease in retinopathy of prematurity using deep convolutional neural networks. JAMA Ophthalmol. 136 , 803–810 (2018).

Poplin, R. et al. Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nat. Biomed. Eng. 2 , 158–164 (2018).

Mitani, A. et al. Detection of anaemia from retinal fundus images via deep learning. Nat. Biomed. Eng. 4 , 18–27 (2020).

Sabanayagam, C. et al. A deep learning algorithm to detect chronic kidney disease from retinal photographs in community-based populations. Lancet Digital Health 2 , e295–e302 (2020).

Maier-Hein, L. et al. Surgical data science for next-generation interventions. Nat. Biomed. Eng. 1 , 691–696 (2017).

García-Peraza-Herrera, L. C. et al. ToolNet: Holistically-nested real-time segmentation of robotic surgical tools. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 5717–5722 (IEEE, 2017).

Zia, A., Sharma, Y., Bettadapura, V., Sarin, E. L. & Essa, I. Video and accelerometer-based motion analysis for automated surgical skills assessment. Int. J. Comput. Assist. Radiol. Surg. 13 , 443–455 (2018).

Sarikaya, D., Corso, J. J. & Guru, K. A. Detection and localization of robotic tools in robot-assisted surgery videos using deep neural networks for region proposal and detection. IEEE Trans. Med. Imaging 36 , 1542–1549 (2017).

Jin, A. et al. Tool detection and operative skill assessment in surgical videos using region-based convolutional neural networks. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV) 691–699 (IEEE, 2018).

Twinanda, A. P. et al. EndoNet: a deep architecture for recognition tasks on laparoscopic videos. IEEE Trans. Med. Imaging 36 , 86–97 (2017).

Lin, H. C., Shafran, I., Yuh, D. & Hager, G. D. Towards automatic skill evaluation: detection and segmentation of robot-assisted surgical motions. Comput. Aided Surg. 11 , 220–230 (2006).

Khalid, S., Goldenberg, M., Grantcharov, T., Taati, B. & Rudzicz, F. Evaluation of deep learning models for identifying surgical actions and measuring performance. JAMA Netw. Open 3 , e201664 (2020).

Vassiliou, M. C. et al. A global assessment tool for evaluation of intraoperative laparoscopic skills. Am. J. Surg. 190 , 107–113 (2005).

Jin, Y. et al. SV-RCNet: Workflow recognition from surgical videos using recurrent convolutional network. IEEE Trans. Med. Imaging 37 , 1114–1126 (2018).

Padoy, N. et al. Statistical modeling and recognition of surgical workflow. Med. Image Anal. 16 , 632–641 (2012).

Azari, D. P. et al. Modeling surgical technical skill using expert assessment for automated computer rating. Ann. Surg. 269 , 574–581 (2019).

Ma, A. J. et al. Measuring patient mobility in the ICU using a novel noninvasive sensor. Crit. Care Med. 45 , 630–636 (2017).

Davoudi, A. et al. Intelligent ICU for autonomous patient monitoring using pervasive sensing and deep learning. Sci. Rep. 9 , 8020 (2019).

Chakraborty, I., Elgammal, A. & Burd, R. S. Video based activity recognition in trauma resuscitation. In 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG) 1–8 (IEEE, 2013).

Twinanda, A. P., Alkan, E. O., Gangi, A., de Mathelin, M. & Padoy, N. Data-driven spatio-temporal RGBD feature encoding for action recognition in operating rooms. Int. J. Comput. Assist. Radiol. Surg. 10 , 737–747 (2015).

Kaplan, R. S. & Porter, M. E. How to solve the cost crisis in health care. Harv. Bus. Rev. 89 , 46–52 (2011). 54, 56–61 passim.

PubMed   Google Scholar  

Wang, S., Chen, L., Zhou, Z., Sun, X. & Dong, J. Human fall detection in surveillance video based on PCANet. Multimed. Tools Appl. 75 , 11603–11613 (2016).

Núñez-Marcos, A., Azkune, G. & Arganda-Carreras, I. Vision-Based Fall Detection with Convolutional Neural Networks. In Proc. International Wireless Communications and Mobile Computing Conference 2017 (ACM, 2017).

Luo, Z. et al. Computer vision-based descriptive analytics of seniors’ daily activities for long-term health monitoring. In Machine Learning for Healthcare (MLHC) 2 (JMLR, 2018).

Zhang, C. & Tian, Y. RGB-D camera-based daily living activity recognition. J. Comput. Vis. image Process. 2 , 12 (2012).

Pirsiavash, H. & Ramanan, D. Detecting activities of daily living in first-person camera views. In 2012 IEEE Conference on Computer Vision and Pattern Recognition 2847–2854 (IEEE, 2012).

Kishore, P. V. V., Prasad, M. V. D., Kumar, D. A. & Sastry, A. S. C. S. Optical flow hand tracking and active contour hand shape features for continuous sign language recognition with artificial neural networks. In 2016 IEEE 6th International Conference on Advanced Computing (IACC) 346–351 (IEEE, 2016).

Webster, D. & Celik, O. Systematic review of Kinect applications in elderly care and stroke rehabilitation. J. Neuroeng. Rehabil. 11 , 108 (2014).

Chen, W. & McDuff, D. Deepphys: video-based physiological measurement using convolutional attention networks. In Proc. European Conference on Computer Vision (ECCV) 349–365 (Springer Science+Business Media, 2018).

Moazzami, B., Razavi-Khorasani, N., Dooghaie Moghadam, A., Farokhi, E. & Rezaei, N. COVID-19 and telemedicine: Immediate action required for maintaining healthcare providers well-being. J. Clin. Virol. 126 , 104345 (2020).

Gerke, S., Yeung, S. & Cohen, I. G. Ethical and legal aspects of ambient intelligence in hospitals. JAMA https://doi.org/10.1001/jama.2019.21699 (2020).

Young, A. T., Xiong, M., Pfau, J., Keiser, M. J. & Wei, M. L. Artificial intelligence in dermatology: a primer. J. Invest. Dermatol. 140 , 1504–1512 (2020).

Schaekermann, M., Cai, C. J., Huang, A. E. & Sayres, R. Expert discussions improve comprehension of difficult cases in medical image assessment. In Proc. 2020 CHI Conference on Human Factors in Computing Systems 1–13 (Association for Computing Machinery, 2020).

Schaekermann, M. et al. Remote tool-based adjudication for grading diabetic retinopathy. Transl. Vis. Sci. Technol. 8 , 40 (2019).

Caruana, R. Multitask learning. Mach. Learn. 28 , 41–75 (1997).

Wulczyn, E. et al. Deep learning-based survival prediction for multiple cancer types using histopathology images. PLoS ONE 15 , e0233678 (2020).

Simonyan, K., Vedaldi, A. & Zisserman, A. Deep inside convolutional networks: visualising image classification models and saliency maps. Preprint at https://arxiv.org/abs/1312.6034 (2013).

Ren, J. et al. in Advances in Neural Information Processing Systems 32 (eds Wallach, H. et al.) 14707–14718 (Curran Associates, Inc., 2019).

Dusenberry, M. W. et al. Analyzing the role of model uncertainty for electronic health records. In Proc. ACM Conference on Health, Inference, and Learning 204–213 (Association for Computing Machinery, 2020).

Obermeyer, Z., Powers, B., Vogeli, C. & Mullainathan, S. Dissecting racial bias in an algorithm used to manage the health of populations. Science 366 , 447–453 (2019).

Liu, X. et al. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI Extension. BMJ 370 , m3164 (2020).

Rivera, S. C. et al. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI Extension. BMJ 370 , m3210 (2020).

Asan, O., Bayrak, A. E. & Choudhury, A. Artificial intelligence and human trust in healthcare: focus on clinicians. J. Med. Internet Res. 22 , e15154 (2020).

McKinney, S. M. et al. International evaluation of an AI system for breast cancer screening. Nature 577 , 89–94 (2020).

Kamulegeya, L. H. et al. Using artificial intelligence on dermatology conditions in Uganda: a case for diversity in training data sets for machine learning. https://doi.org/10.1101/826057 (2019).

Download references

Acknowledgements

The authors would like to thank Melvin Gruesbeck for the design of the figures, and Elise Kleeman for editorial review.

Author information

These authors contributed equally: Katherine Chou, Serena Yeung, Nikhil Naik, Ali Madani, Ali Mottaghi.

Authors and Affiliations

Salesforce AI Research, San Francisco, CA, USA

Andre Esteva, Nikhil Naik, Ali Madani & Richard Socher

Google Research, Mountain View, CA, USA

Katherine Chou, Yun Liu & Jeff Dean

Stanford University, Stanford, CA, USA

Serena Yeung & Ali Mottaghi

Scripps Research Translational Institute, La Jolla, CA, USA


Contributions

A.E. organized the authors, synthesized the writing, and led the abstract, introduction, computer vision, dermatology, and ophthalmology sections. S.Y. led the medical video section. K.C. led the clinical deployment section. N.N. contributed the pathology section, Ali Madani contributed the cardiology section, Ali Mottaghi contributed to the subsections of the medical video section, and E.T. and J.D. contributed to the clinical deployment section. Y.L. contributed significantly to the figures and writing style. All authors contributed to the overall writing and storyline. E.T., J.D., and R.S. oversaw and advised the work.

Corresponding author

Correspondence to Andre Esteva .

Ethics declarations

Competing interests

A.E., N.N., Ali Madani, and R.S. are or were employees of Salesforce.com and own Salesforce stock. K.C., Y.L., and J.D. are employees of Google, L.L.C. and own Alphabet stock. S.Y., Ali Mottaghi and E.T. have no competing interests to declare.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Esteva, A., Chou, K., Yeung, S. et al. Deep learning-enabled medical computer vision. npj Digit. Med. 4, 5 (2021). https://doi.org/10.1038/s41746-020-00376-2


Received: 17 August 2020

Accepted: 01 December 2020

Published: 08 January 2021

DOI: https://doi.org/10.1038/s41746-020-00376-2


Topic Information


Computer Vision and Image Processing


Dear Colleagues,

Computer vision is a scientific discipline that aims at developing models for understanding our 3D environment using cameras. Image processing, in turn, is the body of techniques that extract useful information directly from images or process them for optimal subsequent analysis. Computer vision and image processing are thus two closely related fields, relevant to almost any research that uses cameras or other image sensors to acquire information from scenes or working environments. The main aim of this Topic is to cover some of the relevant areas where computer vision and image processing are applied, including but not limited to:

  • Three-dimensional image acquisition, processing, and visualization
  • Scene understanding
  • Greyscale, color, and multispectral image processing
  • Multimodal sensor fusion
  • Industrial inspection
  • Surveillance
  • Airborne and satellite on-board image acquisition platforms
  • Computational models of vision
  • Imaging psychophysics

Prof. Dr. Silvia Liberata Ullo, Topic Editor

  • 3D acquisition, processing, and visualization
  • scene understanding
  • multimodal sensor processing and fusion
  • multispectral, color, and greyscale image processing
  • industrial quality inspection
  • computer vision for robotics
  • computer vision for surveillance
  • airborne and satellite on-board image acquisition platforms
  • computational models of vision
  • imaging psychophysics
Journal Name  Launched Year  First Decision (median)  APC
applsci       2011           17.8 days                CHF 2400
electronics   2012           16.8 days                CHF 2400
modelling     2020           21.2 days                CHF 1000
jimaging      2015           20.9 days                CHF 1800

  • Computer Vision and Image Processing, 2nd Edition (43 articles)

Published Papers (101 papers)




CVF Sponsored Conferences

CVF Sponsored Conferences Errata

It is the policy of the Computer Vision Foundation to maintain PDF copies of conference papers as submitted during the camera-ready paper collection. These papers are considered the final published versions of the work. We recognize the need for minor corrections after publication, and thus provide links to arXiv versions of the papers where available. If a correction must be made, it should be made as an update to the arXiv version of the paper by the authors. The CVF maintainers should then be notified of the update via email ([email protected]). The conference open access website will be updated periodically to indicate changes made to an arXiv version since the original conference publication date. The original camera-ready version of the paper will be maintained within the open access archive, and will not be removed or replaced by request.

Other Computer Vision Conferences and Workshops


Advanced Topics in Computer Science: Recent Advances in Computer Vision

Computer vision is a rapidly evolving field, with technological innovations enabling societal impact, and societal needs fueling innovation. We select a few advanced computer vision topics to explore, focusing in particular on the robustness, transparency, and fairness of computer vision systems. Students are expected to routinely read and present research papers, with special attention to developing excellent oral and written scientific communication skills.


Guest Editorial: Special Issue on Open-World Visual Recognition

  • Published: 28 August 2024


  • Zhun Zhong 1,2,
  • Hong Liu 3,
  • Yin Cui 4,
  • Shin’ichi Satoh 5,
  • Nicu Sebe 6 &
  • Ming-Hsuan Yang 7


1 Introduction

This special issue is devoted to the theme of open-world visual recognition. Visual recognition is a critical task in computer vision that has gained significant attention in recent years due to its numerous applications in various fields, including image classification, object detection, semantic segmentation, and instance retrieval. Despite significant progress, existing visual models continue to face considerable challenges in open-world scenarios, including recognizing novel classes, adapting to unseen domains, learning under data-privacy constraints, and enhancing robustness against adversarial samples. This special issue provides a comprehensive overview of the latest advancements in open-world visual recognition, aiming to address these complex issues.

The call for papers for this special issue attracted a total of 144 submissions, reflecting the community’s strong interest and ongoing research efforts in this area. After a rigorous peer-review process, consistent with the journal’s high standards for quality and innovation, 44 papers have been accepted. This results in an acceptance rate of 30.5% (44/144), highlighting the competitive nature of our selection process. The accepted papers showcase cutting-edge research and are mainly organized into seven thematic categories: open-set & open-vocabulary recognition, domain adaptation & generalization, out-of-distribution detection, learning with imperfect training labels, novel class discovery, incremental learning, and other open-world applications.

2 Overview of Accepted Papers

2.1 Open-Set & Open-Vocabulary Recognition

This part of the issue consists of 11 papers. The first article, by Zhang et al., introduces the Open-Vocabulary Keypoint Detection (OVKD) task and proposes a novel framework based on semantic-feature matching, which combines vision and language models to link language features with local visual keypoint features. The second article, by Shi et al., presents a novel approach for open-vocabulary semantic segmentation by leveraging the capabilities of large language models (LLMs) instead of traditional vision-language (VL) pre-training models such as CLIP. The third article, by Wang et al., introduces the task of Open-Vocabulary Video Instance Segmentation (OV-VIS) and proposes a transformer-based model to solve it. The fourth article, by Chen et al., studies the task of open-vocabulary object detection and attribute recognition and proposes an effective framework that disentangles the task into class-agnostic object proposal and open-vocabulary classification. The fifth article, by Thawakar et al., studies the problem of open-world video instance segmentation (VIS) and introduces a novel framework with a feature enrichment mechanism and a spatio-temporal objectness module. The sixth article, by Tang et al., proposes a training-free paradigm for open-world segmentation that effectively harnesses the power of vision foundation models. The seventh article, by Yang et al., introduces a prototype-based segmentation framework that combines textual and visual clues, providing comprehensive support for open-world semantic segmentation. The eighth article, by Chakravarthy et al., studies the problem of open-world lidar panoptic segmentation; it identifies the drawbacks of existing methods on this task and suggests a balanced approach that achieves strong performance on both known and unknown classes. The ninth article, by Yang et al., introduces a causal-inference-inspired approach for real-world open-set recognition, addressing challenges posed by covariate and semantic shifts. The tenth article, by Zuo et al., integrates vision-language embeddings from foundation models into 3D Gaussian Splatting, enhancing multi-view semantic consistency and thus facilitating downstream tasks such as open-vocabulary object detection. The eleventh article, by Xie et al., introduces a diffusion-based data augmentation technique for large-vocabulary instance segmentation, which operates without training or label supervision.

2.2 Domain Adaptation & Generalization

This part of the issue consists of 13 papers. The first article , by Wu et al., proposes a domain-aware prompting approach for cross-domain few-shot learning, which learns a hybridly prompted model for enhancing adaptability on unseen domains. The second article , by Dai et al., studies the task of cross-domain person re-identification and proposes a novel framework that can generate intermediate domains for improving the knowledge transfer between source and target domains. The third article , by Zhao et al., introduces the problem of Multi-Source-Free Domain Adaptive Object Detection (MSFDAOD) and proposes a Divide-and-Aggregate Contrastive Adaptation (DACA) framework that can efficiently leverage the advantages of multiple source-free models and aggregate their contributions to adaptation in a self-supervised manner. The fourth article , by Gu et al., introduces an adversarial re-weighting approach for partial domain adaptation, which can reduce the influence of source-private classes and minimize prediction uncertainty in the target domain. The fifth article , by Zhang et al., proposes a new framework for source-free domain adaptation, which incorporates pre-trained networks into the adaptation process to improve the quality of target pseudo-labels. The sixth article , by Liang et al., provides a survey of test-time adaptation (TTA), which categorizes TTA into several distinct groups, provides a comprehensive taxonomy of advanced algorithms for each group, and analyzes relevant applications of TTA. The seventh article , by Wang et al., presents a comprehensive survey on online test-time adaptation (OTTA), re-implements existing OTTA methods with Vision Transformer, and proposes novel evaluation metrics that consider both accuracy and efficiency. 
The eighth article, by Yang et al., introduces the Hierarchical Visual Transformation (HVT) network to help the model learn domain-invariant representations and narrow the domain gap in various visual matching and recognition tasks. The ninth article, by Hu et al., introduces a large-scale benchmark for domain generalizable person re-identification and proposes a novel framework based on diverse feature space learning to learn domain-adaptive discriminative representations. The tenth article, by Wang et al., studies the problem of domain generalized unmanned aerial vehicle object detection and proposes a novel frequency domain disentanglement method to improve the model’s generalization ability on this challenging task. The eleventh article, by Luo et al., proposes a method based on network pruning for domain generalized semantic segmentation, which can prune the filters or attention heads that are more sensitive to domain shift. The twelfth article, by Huang et al., considers a specific domain generalization task, i.e., out-of-distribution generalization, and presents an Exploring Variant parameters for Invariant Learning (EVIL) approach to find the parameters that are sensitive to distribution shift. The thirteenth article, by Li et al., studies the model’s robustness to adversarial examples, which can be regarded as a specific category of domain generalization. This article proposes a novel method to automatically learn online, instance-wise data augmentation policies for improving robust generalization.

2.3 Out-of-Distribution Detection

This part of the issue consists of 6 papers. The first article, by He et al., considers the task of Out-of-Distribution (OOD) detection with noisy examples in the training set. It introduces the Adversarial Confounder Removing (ACRE) method, which utilizes progressive optimization with adversarial learning to curate collections of easy-ID, hard-ID, and open-set noisy examples and to reduce spurious-related representations. The second article, by Nie et al., introduces Virtual Outlier Smoothing (VOSo) for OOD detection, which generates auxiliary outliers from in-distribution samples. The third article, by Cheng et al., takes advantage of recent breakthroughs in generative models and demonstrates that training with a large quantity of generated data can eliminate overfitting in reliable prediction tasks, e.g., OOD detection. The fourth article, by Fang et al., introduces a novel perspective on OOD detection by exploring the loss landscape and mode ensembles, showing the effectiveness of mode ensembling in enhancing OOD detection. The fifth article, by Yang et al., provides a survey of OOD detection methods and presents a unified framework called generalized OOD detection, which encompasses five highly related tasks, i.e., OOD detection, anomaly detection (AD), novelty detection (ND), open-set recognition (OSR), and outlier detection (OD). In addition, it provides a comprehensive discussion of representative methods from the other tasks and how they relate to and inspire the development of OOD detection methods. The sixth article, by Wang et al., provides a consolidated analysis of OOD detection and OSR, performing cross-evaluation of state-of-the-art methods, proposing a new large-scale benchmark, and providing an empirical analysis of existing methods and of the correlation between OOD detection and OSR.

2.4 Learning with Imperfect Training Labels

This part of the issue consists of 7 papers. The first article, by Butt et al., introduces a large-scale dataset for road segmentation on challenging unstructured roadways and proposes an Efficient Data Sampling (EDS) based self-training framework for the semi-supervised learning setting. The second article, by Sun et al., introduces Variational Rectification Inference (VRI) to address the problem of learning with noisy labels by formulating adaptive loss rectification as an amortized variational inference problem. The third article, by Xie et al., presents the Probabilistic Representation Contrastive Learning (PRCL) framework for semi-supervised semantic segmentation, which enhances the robustness of the unsupervised training process. The fourth article, by Zhao et al., addresses the task of open-set semi-supervised learning and proposes a prototype-based clustering and identification algorithm to enhance feature learning. The fifth article, by Sun et al., introduces the Open-World DeepFake Attribution task and benchmark, where the unlabeled dataset may contain attacks never encountered in the labeled set, and proposes the Multi-Perspective Sensory Learning (MPSL) framework to solve this task. The sixth article, by Qiao et al., presents Adaptive Fuzzy Positive Learning (A-FPL) for annotation-scarce semantic segmentation, which can effectively alleviate interference from wrong pseudo-labels and progressively refine semantic discrimination. The seventh article, by Siméoni et al., provides a survey of unsupervised object localization methods that discover objects in images without requiring any manual annotation, in the era of self-supervised ViTs.

2.5 Novel Class Discovery

This part of the issue consists of 2 papers. The first article, by Chi et al., studies the novel class discovery task under unreliable sampling and proposes a hidden-prototype-based discovery network (HPDN) to handle sampling errors. The second article, by Riz et al., introduces the task of Novel Class Discovery (NCD) for 3D point cloud semantic segmentation and proposes a new method utilizing online clustering, uncertainty estimation, and semantic distillation to solve it.

2.6 Incremental Learning

This part of the issue consists of 2 papers. The first article, by Xuan et al., introduces the concept of Incremental Model Enhancement (IME), where training data arrives sequentially and each training split typically corresponds to a set of independent classes, domains, or tasks. It proposes a Memory-based Contrastive Learning framework, which shows superiority on both image classification and semantic segmentation tasks. The second article, by Zhou et al., revisits Class-Incremental Learning (CIL) in the context of pre-trained models (PTMs) and shows that the core factors in CIL are adaptivity for model updating and generalizability for knowledge transfer. It also proposes a general framework that aggregates the embeddings of the PTM and adapted models for classifier construction.

2.7 Other Open-World Applications

This part of the issue consists of 3 papers. The first article, by Wang et al., proposes a specific open-world visual recognition task, i.e., Pattern-Expandable Image Copy Detection, aiming to identify novel tamper patterns. It proposes Pattern Stripping, which can easily introduce new pattern features with minimal impact on the image feature and previously seen pattern features. The second article, by Xu et al., addresses the challenge of visual object tracking in hazy conditions and introduces a feature restoration transformer to improve the model’s robustness under hazy imaging scenarios. The third article, by Shi et al., introduces a model-agnostic Curricular shApe-aware FEature (CAFE) learning strategy for Panoptic Scene Graph Generation (PSG), which is effective on both robust and zero-shot PSG tasks.

3 Conclusion

The 44 contributions in this special issue offer a diverse array of perspectives aimed at addressing the challenges in the field of open-world visual recognition. These articles not only underscore the ongoing dialogue within the community but also highlight innovative approaches to tackle real-world issues effectively. Through this special issue, we aim to spark further research and discussion within the community, encouraging continued advancements and practical applications of open-world visual recognition.

Finally, we would like to express our heartfelt gratitude to the dedicated reviewers who devoted their valuable time and effort to thoroughly review the papers and provide constructive feedback to the authors. We also extend our appreciation to the diligent editorial team at Springer and the International Journal of Computer Vision, especially Prof. Yasuyuki Matsushita, Ms. Yasotha Sujeen, and Ms. Katherine Moretti. Their invaluable assistance was crucial in the successful publication of this special issue.

Author information

Authors and Affiliations

Hefei University of Technology, Hefei, 230002, China

University of Nottingham, Nottingham, NG8 1BB, UK

Osaka University, Osaka, 565-0871, Japan

NVIDIA, Santa Clarita, 95051, USA

National Institute of Informatics, Tokyo, 101-8430, Japan

Shin’ichi Satoh

University of Trento, Trento, 38123, Italy

University of California at Merced, Merced, CA, 95344, USA

Ming-Hsuan Yang


Corresponding author

Correspondence to Hong Liu .

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Zhong, Z., Liu, H., Cui, Y. et al. Guest Editorial: Special Issue on Open-World Visual Recognition. Int J Comput Vis (2024). https://doi.org/10.1007/s11263-024-02232-2


Published: 28 August 2024






    Joint Energy Models (JEMs), while drawing significant research attention, have not been successfully scaled to real-world, high-resolution datasets. We present EB-CLIP, a novel approach extending JEMs to the multimodal vision-language domain using CLIP, integrating both generative and discriminative objectives. For the generative objective, we introduce an image-text joint-energy function ...