Kepler Legacy Sample Overview

Updated 30 August 2025

The Kepler Legacy Sample is a benchmark collection of solar-like stars exhibiting high-S/N oscillations that enable precise determination of stellar properties via asteroseismology.
Advanced methodologies, including Bayesian MCMC peak-bagging and frequency ratio analysis, yield radii, masses, and ages with typical uncertainties of about 2%, 4%, and 10%, respectively.
The sample underpins exoplanet occurrence studies by linking stellar parameters to planet statistics and calibrating models of stellar evolution and galactic chemical enrichment.

The Kepler Legacy Sample is a benchmark set of main-sequence, solar-like stars for which the Kepler Mission delivered oscillation data of unprecedented quality. This sample underpins many advancements in both exoplanet and stellar astrophysics, linking exoplanet occurrence rates to stellar parameters and providing tight constraints on fundamental stellar properties through asteroseismology.

1. Definition and Scope of the Kepler Legacy Sample

The Kepler Legacy Sample refers to a subset of stars observed by NASA’s Kepler space telescope that exhibit solar-like oscillations and have been subjected to long, continuous photometric monitoring, yielding the most precise asteroseismic data collected to date. The canonical sample consists of 66 main-sequence targets observed for a minimum of one year with Kepler short-cadence data, complemented by the Sun for comparative purposes (Aguirre et al., 2016, Lund et al., 2016). Additional legacy datasets include carefully characterized bright giants and very bright dwarfs observed via collateral modes (e.g., Kepler Smear Campaign (Pope et al., 2019)), and specialized subsamples focusing on, for example, gravity-mode pulsators (Gebruers et al., 2021).

The sample is foundational for exoplanet occurrence studies, asteroseismic determinations of masses and ages, calibration of chemical enrichment laws, mixing-length parameters, and modeling of convective and radiative transport.

2. Formation and Construction of the Sample

The Legacy Sample was selected for stars with the highest signal-to-noise solar-like oscillations in the Kepler field, mostly FGK dwarfs. Criteria included observed solar-like p-mode oscillations with high radial order coverage and continuous data spanning at least one year in short-cadence mode. Each star's oscillation power spectrum allows identification of tens of individual oscillation frequencies – an exceptional data regime for solar-like stars.

Data reduction involved advanced background correction, power spectrum smoothing, and rigorous global Bayesian MCMC “peak-bagging” to extract mode parameters (frequency, amplitude, linewidth, visibilities) (Lund et al., 2016). Uncertainties on oscillation frequencies approach theoretical precision limits (∼20% above analytic minimum estimates). For subsets (e.g., red giants or bright dwarfs), alternative data acquisition techniques have been employed, such as Kepler’s collateral smear data (Pope et al., 2019).

The sample is complemented by extensive follow-up: interferometric angular diameters, high S/N, high-dispersion spectroscopic abundances, and Gaia astrometry (Nissen et al., 2017, Morel et al., 2020). All data and results for the sample are made fully public.

3. Asteroseismic Analysis: Methodologies and Systematics

Mode parameters (frequencies, amplitudes, linewidths) are extracted by detailed modeling of the frequency domain power spectra using Lorentzian profiles for each mode, along with polynomial or Harvey-like models for the background signal. Bayesian Markov Chain Monte Carlo (MCMC) algorithms sample the multi-dimensional posterior space, incorporating astrophysically informed priors and parallel tempering to ensure robust convergence (Lund et al., 2016).
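The model described above can be sketched as a sum of Lorentzian mode profiles on top of a Harvey-like background. This is a minimal illustration, not the pipeline of Lund et al. (2016): the function names, the particular Harvey parameterization, and the χ² (2 degrees of freedom) likelihood for an unsmoothed power spectrum are assumptions made here for concreteness.

```python
import numpy as np

def lorentzian(nu, nu0, height, width):
    """Lorentzian mode profile: central frequency nu0, peak height, FWHM width."""
    return height / (1.0 + 4.0 * ((nu - nu0) / width) ** 2)

def harvey(nu, amplitude, nu_c, exponent=2.0):
    """Harvey-like background term (one common parameterization, assumed here)."""
    return amplitude / (1.0 + (nu / nu_c) ** exponent)

def limit_spectrum(nu, modes, background, white_noise):
    """Model spectrum: flat noise + Harvey-like terms + Lorentzian modes.

    modes      -- iterable of (nu0, height, width) tuples
    background -- iterable of (amplitude, nu_c) tuples
    """
    nu = np.asarray(nu, dtype=float)
    model = np.full_like(nu, white_noise, dtype=float)
    for amplitude, nu_c in background:
        model += harvey(nu, amplitude, nu_c)
    for nu0, height, width in modes:
        model += lorentzian(nu, nu0, height, width)
    return model

def neg_log_likelihood(power, model):
    """Negative log-likelihood assuming chi-squared (2 d.o.f.) power statistics."""
    power = np.asarray(power, dtype=float)
    model = np.asarray(model, dtype=float)
    return float(np.sum(np.log(model) + power / model))
```

An MCMC sampler would evaluate `neg_log_likelihood` (plus the log-prior) at each proposed parameter set. The sampler itself, including any parallel tempering, is left out of this sketch.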

Modeling teams employ a variety of stellar evolution codes (MESA, ASTEC, GARSTEC, Cesam2k) and seismic inputs (individual frequencies, frequency ratios, global parameters). Corrections for near-surface effects (Kjeldsen et al., Ball & Gizon formulations) are critical. Frequency difference ratios such as $r_{01}$ and $r_{02}$ are preferred, as they minimize sensitivity to modeling uncertainties in near-surface layers and probe the core structure directly (Lund et al., 2016, Aguirre et al., 2016).
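As an illustration of how the ratios are formed from individual frequencies, the sketch below uses a widely adopted five-point definition of $r_{01}$ and the standard form of $r_{02}$. The source does not state which definitions the modeling teams used, so these forms, the function names, and the `freq[(n, l)]` dictionary layout are assumptions.

```python
def large_separation_l1(freq, n):
    """Large separation of dipole modes: nu_{n,1} - nu_{n-1,1}."""
    return freq[(n, 1)] - freq[(n - 1, 1)]

def r02(freq, n):
    """r02(n) = (nu_{n,0} - nu_{n-1,2}) / (nu_{n,1} - nu_{n-1,1})."""
    return (freq[(n, 0)] - freq[(n - 1, 2)]) / large_separation_l1(freq, n)

def r01(freq, n):
    """r01(n) from the five-point small separation d01(n), divided by the l=1 large separation."""
    d01 = (
        freq[(n - 1, 0)]
        - 4.0 * freq[(n - 1, 1)]
        + 6.0 * freq[(n, 0)]
        - 4.0 * freq[(n, 1)]
        + freq[(n + 1, 0)]
    ) / 8.0
    return d01 / large_separation_l1(freq, n)
```

Here `freq` maps `(n, l)` pairs to mode frequencies in any consistent unit. Because each ratio divides one frequency difference by another, a slowly varying surface offset largely cancels, which is the property the text refers to.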

Inter-team comparisons show that despite the methodological variety, radii, masses, and ages derived from the sample agree within typical uncertainties of ∼2%, ∼4%, and ∼10% respectively (Aguirre et al., 2016). However, frequency extraction is not immune to systematic effects stemming from choices in power spectrum computation, mode height ratio adoption, and treatment of low signal-to-noise modes (Roxburgh, 2016, Roxburgh, 2017). Studies highlight anomalies in the covariance matrices and error estimates supplied by some pipelines (Roxburgh, 2017), with up to 27 of 66 stars exhibiting non-positive definite matrices and error estimates on separation ratios sometimes exceeding theoretical limits.
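A covariance matrix supplied by a fitting pipeline can be screened for the problem noted above with a simple eigenvalue test. The following is a minimal sketch; the function name and tolerance are illustrative choices and are not taken from Roxburgh (2017).

```python
import numpy as np

def is_positive_definite(cov, rtol=1e-10):
    """Return True if cov is a symmetric, positive definite matrix.

    The check requires the smallest eigenvalue to exceed rtol times the
    largest one in absolute value, which also flags near-singular matrices.
    """
    cov = np.asarray(cov, dtype=float)
    if cov.ndim != 2 or cov.shape[0] != cov.shape[1]:
        return False
    if not np.allclose(cov, cov.T):
        return False
    eigenvalues = np.linalg.eigvalsh(cov)  # eigenvalues of a symmetric matrix
    return bool(eigenvalues.min() > rtol * abs(eigenvalues.max()))
```

A matrix that fails this test cannot be inverted into a valid χ² metric, so model fits that rely on it are not well defined.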

4. Seismic Diagnostics and Stellar Modeling Impact

The primary asteroseismic metrics derived from the Kepler Legacy Sample include:

Large frequency separation ($\langle \Delta\nu \rangle$) and its scaling with mean density: $\langle \Delta\nu \rangle \propto (M/M_\odot)^{1/2} (R/R_\odot)^{-3/2}$.
Small frequency separations ($\delta\nu_{02}$, $\delta\nu_{01}$) and the frequency separation ratios $r_{01}$, $r_{10}$, $r_{02}$, which are tightly linked to core structure and evolutionary state.
Oscillation mode amplitudes and linewidths, providing insight into mode excitation/damping (tied to convection zone properties).
Seismic “glitch” signatures from the base of the convection zone and helium ionization zone, used to infer acoustic depths and envelope helium content (Verma et al., 2017).
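The scaling of the large separation with mean density listed above can be written as a short calculation. This is a sketch only: the solar reference value of 135.1 μHz is a commonly adopted figure assumed here, not one quoted in this overview, and the function names are illustrative.

```python
import math

DELTA_NU_SUN_MUHZ = 135.1  # assumed solar reference large separation, in microhertz

def delta_nu_from_mass_radius(mass, radius, delta_nu_sun=DELTA_NU_SUN_MUHZ):
    """Large separation (microhertz) from mass and radius in solar units."""
    return delta_nu_sun * math.sqrt(mass / radius ** 3)

def mean_density_from_delta_nu(delta_nu, delta_nu_sun=DELTA_NU_SUN_MUHZ):
    """Mean density in solar units implied by a measured large separation."""
    return (delta_nu / delta_nu_sun) ** 2
```

For example, `delta_nu_from_mass_radius(1.0, 1.0)` returns the reference value by construction, and inverting a measured large separation gives the mean density directly, without separate knowledge of mass or radius.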

Stellar properties (radii, masses, ages) derived from the sample have been validated against solar values, interferometric angular diameters, Gaia-based parallaxes, and binary stars, confirming the asteroseismic scaling relations and models (Aguirre et al., 2016, Nissen et al., 2017). Machine learning regressors have been demonstrated to yield highly precise parameters for the sample (median uncertainties of 1.7% in radius, 3.6% in mass, and 14.8% in age) (Bellinger et al., 2017).

Asteroseismic inversion techniques, including Subtractive Optimally Localized Averages (SOLA) and Optimally Localized Averages (OLA), have been applied to invert for density, acoustic radius, and core structure. These methods can minimize surface-effect contributions and, especially in post-MS F stars with mixed modes, have revealed discrepancies of up to 5% in core density relative to evolutionary grid models (Buldgen et al., 2017, Kosovichev et al., 2020).

5. Chemical Abundances, Age Relations, and Mixing Processes

Detailed abundance analyses using high-resolution, high S/N spectroscopy underpin the study of galactic chemical evolution with the Legacy sample. Element-by-element abundances (e.g., C, O, Na, Mg, Al, Si, Ca, Ti, Cr, Fe, Ni, Zn, Y) are determined using rigorous non-LTE corrections and differential analysis relative to the Sun or internal calibrators (Nissen et al., 2017). Strong trends of [Mg/Fe], [Al/Fe], and [Zn/Fe] increasing with stellar age, and of [Y/Mg] and [Y/Al] increasing toward younger ages, are confirmed, in accord with delayed iron production from SN Ia and increasing s-process yields from AGB stars. These abundance ratios act as “chemical clocks.”

Testing of empirical abundance–age relations shows that for the broader range of FGK dwarfs, such calibrations (including [Y/Mg], [Y/Al], [Sr/Mg]) currently yield ages with residual scatters of 1.5–2 Gyr compared to seismic determinations, improving to ∼0.5 Gyr with multi-parameter (2D–3D) calibrations but retaining systematic offsets (seismic ages being ∼0.7 Gyr higher) (Morel et al., 2020). Systematic errors in the input $T_{\rm eff}$ and [Fe/H] propagate into seismic modeling and age inferences, further complicating the use of abundance–age relations at high precision.
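The two quantities compared in such tests, the residual scatter of a calibration and its mean offset from seismic ages, can be computed as sketched below. A one-parameter linear calibration is assumed for simplicity. The function names are illustrative and the code does not reproduce the calibrations of Morel et al. (2020).

```python
import numpy as np

def fit_chemical_clock(abundance_ratio, seismic_age):
    """Fit age = slope * ratio + intercept and return (slope, intercept, scatter).

    scatter is the standard deviation of the residuals, corrected for the
    two fitted parameters; at least three stars are needed.
    """
    x = np.asarray(abundance_ratio, dtype=float)
    y = np.asarray(seismic_age, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    scatter = float(residuals.std(ddof=2))
    return float(slope), float(intercept), scatter

def mean_offset(calibrated_age, seismic_age):
    """Mean of (seismic age - calibrated age); positive if seismic ages are higher."""
    seismic = np.asarray(seismic_age, dtype=float)
    calibrated = np.asarray(calibrated_age, dtype=float)
    return float(np.mean(seismic - calibrated))
```

Applying an external calibration to the sample and calling `mean_offset` on the result is one way to quantify a systematic shift of the kind quoted above, while `fit_chemical_clock` gives the scatter of a relation fitted to the sample itself.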

Lithium abundances, in combination with seismic indicators, provide stringent probes of internal mixing. For G-type solar analogues, modest turbulent mixing (of solar-like efficiency) in the radiative zone suffices to explain lithium depletion and seismic diagnostics simultaneously. For more massive F-type stars, however, significant convective penetration is needed to match lithium depletion, but seismic frequency separation ratios ($r_{01}$) constrain the extent of this mixing: the large values required by lithium are incompatible with the global seismic properties, highlighting a modeling challenge (Buldgen et al., 25 Aug 2025).

6. Implications for Exoplanet Science and Population Studies

The Kepler Legacy Sample forms the empirical backbone of exoplanet occurrence studies. Over 3,500 transiting exoplanet candidates were identified from three years of Kepler data, of which about 100 are in the habitable zone, and the resulting catalog has a reliability of 85–90% (Batalha, 2014). The catalog records precise transit-derived parameters and provides derived properties such as planet size, orbital period, semi-major axis, and insolation flux.

Kepler's discoveries revealed a previously unexplored parameter space of planets smaller than Neptune (85% of Kepler planets), in stark contrast to non-Kepler surveys where 86% of planets were larger than Neptune. These findings, including the prevalence of multi-planet systems (22% of Kepler targets), established the dominance of small planets in the Galaxy and provided critical data to compute occurrence rates as a function of size, period, and spectral type.

Follow-up validation (high-resolution spectroscopy, imaging, and Doppler velocimetry), statistical false-positive analyses, and characterization of multi-planet system architectures (period ratios, size ratios, resonance structures) have transformed the empirical and theoretical landscape of planetary system formation. The high-fidelity sample enables direct calibration of exoplanet metallicity trends, occurrence statistics, and planet population structure (e.g., the “radius gap” at 1.9 $R_\oplus$) (Hansen et al., 2020).

7. Future Prospects and Legacy

As the best-monitored sample of solar-like oscillators, the Kepler Legacy Sample underpins ongoing and future asteroseismic investigations, including the modeling of transport processes (e.g., angular momentum, convective overshooting, turbulence), core and envelope mixing, and the refinement of input physics for stellar evolution modeling (Buldgen et al., 2017, Gebruers et al., 2021, Buldgen et al., 25 Aug 2025).

The methods and systematics tested in the sample—surface correction formulations, treatment of glitches, inversion diagnostics, and chemical tagging—set the baseline for upcoming large-scale missions such as PLATO, which will expand the sample of high-quality oscillators by orders of magnitude. The rigorous vetting of systematic errors in extracted mode frequencies and covariances catalyzes further pipeline development for both oscillation and exoplanet catalog analysis.

The Kepler Legacy Sample will remain a reference benchmark for model calibration, asteroseismic scaling relation validation, and synthetic population tests—serving as a bridge between exoplanet demographics, galactic archaeology, and the physics of stellar interiors.