Crowdsourcing Dermatology Images with Google Search Ads: Creating a Real-World Skin Condition Dataset (2402.18545v1)
Abstract:

Background: Health datasets from clinical sources do not reflect the breadth and diversity of disease in the real world, impacting research, medical education, and AI tool development. Dermatology is a suitable area to develop and test a new and scalable method to create representative health datasets.

Methods: We used Google Search advertisements to invite contributions to an open access dataset of images of dermatology conditions, demographic and symptom information. With informed contributor consent, we describe and release this dataset containing 10,408 images from 5,033 contributions from internet users in the United States over 8 months starting March 2023. The dataset includes dermatologist condition labels as well as estimated Fitzpatrick Skin Type (eFST) and Monk Skin Tone (eMST) labels for the images.

Results: We received a median of 22 submissions/day (IQR 14-30). Female (66.72%) and younger (52% < age 40) contributors had a higher representation in the dataset compared to the US population, and 32.6% of contributors reported a non-White racial or ethnic identity. Over 97.5% of contributions were genuine images of skin conditions. Dermatologist confidence in assigning a differential diagnosis increased with the number of available variables, and showed a weaker correlation with image sharpness (Spearman's P values <0.001 and 0.01 respectively). Most contributions were short-duration (54% with onset < 7 days ago) and 89% were allergic, infectious, or inflammatory conditions. eFST and eMST distributions reflected the geographical origin of the dataset. The dataset is available at github.com/google-research-datasets/scin.

Conclusion: Search ads are effective at crowdsourcing images of health conditions. The SCIN dataset bridges important gaps in the availability of representative images of common skin conditions.
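The abstract summarizes contribution volume as a median and interquartile range of submissions per day. The sketch below shows one way such a summary could be computed from a list of submission dates. It is an illustration only, not the authors' analysis code: the function name `daily_submission_summary`, the toy dates, the choice to count days with no submissions as zero, and the quartile interpolation method are all assumptions, since the abstract does not specify them.

```python
from collections import Counter
from datetime import date, timedelta
from statistics import median, quantiles


def daily_submission_summary(submission_dates):
    """Return (median, q1, q3) of the number of submissions per calendar day.

    Assumption: every day between the first and last submission is included,
    and days without any submission count as 0.
    """
    counts = Counter(submission_dates)
    first, last = min(counts), max(counts)
    n_days = (last - first).days + 1

    # Build the per-day series, filling gaps with zero.
    per_day = [counts.get(first + timedelta(days=i), 0) for i in range(n_days)]

    # Quartiles with linear interpolation over the observed range
    # (an assumed convention; other conventions give slightly different values).
    q1, _, q3 = quantiles(per_day, n=4, method="inclusive")
    return median(per_day), q1, q3


# Toy data (not from the SCIN dataset): 1, 2, 3, 4 and 5 submissions
# on five consecutive days.
toy_dates = [
    date(2023, 3, 1) + timedelta(days=d)
    for d in range(5)
    for _ in range(d + 1)
]

med, q1, q3 = daily_submission_summary(toy_dates)
print(f"median={med}, IQR={q1}-{q3}")
```

Whether zero-submission days are included changes the result noticeably for sparse collection periods, so that choice should be stated alongside any reported median and IQR.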
- Abbi Ward
- Jimmy Li
- Julie Wang
- Sriram Lakshminarasimhan
- Ashley Carrick
- Bilson Campana
- Jay Hartford
- Pradeep Kumar S
- Tiya Tiyasirichokchai
- Sunny Virmani
- Renee Wong
- Yossi Matias
- Dawn Siegel
- Steven Lin
- Justin Ko
- Alan Karthikesalingam
- Christopher Semturs
- Pooja Rao
- Greg S. Corrado
- Dale R. Webster