Seeking Spectroscopic Binaries with Data-Driven Models

Published 11 Dec 2025 in astro-ph.SR and astro-ph.IM | (2512.11043v1)

Abstract: Data-driven stellar classification has a long and important history in astronomy, dating as far back as Annie Jump Cannon's "by eye" classifications of stars into spectral types still used today. In recent years, data-driven spectroscopy has proven to be an effective means of deriving stellar properties for large samples of stars, sidestepping issues with computational efficiency, incomplete line lists, and radiative transfer calculations associated with physical stellar models. A logical application of these algorithms is the detection of unresolved stellar binaries, which requires accurate spectroscopic models to resolve flux contributions from a fainter secondary star in the spectrum. Here we use The Cannon to train a data-driven model on spectra from the Keck High Resolution Echelle Spectrometer. We show that our model is competitive with existing data-driven models in its ability to predict stellar properties Teff, stellar radius, [Fe/H], vsin(i), and instrumental PSF, particularly when we apply a novel wavelet-based processing step to spectra before training. We find that even with accurate estimates of star properties, our model's ability to detect unresolved binaries is limited by its approx. 3% accuracy in per-pixel flux predictions, illuminating possible limitations of data-driven model applications.



Summary

The paper introduces a data-driven model adapting The Cannon framework to uncover unresolved spectroscopic binaries in Keck HIRES spectra.
It employs innovative wavelet filtering and targeted pixel masking to enhance astrophysical signals and mitigate instrumental systematics.
Results show robust stellar label transfer yet limited binary discrimination, underscoring the need for richer training sets and refined modeling techniques.

Spectroscopic Binary Identification Using Data-Driven Models

Introduction

The identification and characterization of binary stellar systems are fundamental to understanding stellar evolution, planet demographics, and galactic structure. Traditional approaches to binary detection using high-resolution spectroscopy are limited in their sensitivity to unresolved systems, particularly those with small radial velocity offsets and overlapping spectral features. The paper "Seeking Spectroscopic Binaries with Data-Driven Models" (2512.11043) systematically investigates whether data-driven models, specifically adaptations of The Cannon framework, can detect unresolved spectroscopic binaries in the Keck HIRES dataset, focusing on planet-hosting stars from the California-Kepler Survey (CKS).

Data-Driven Spectral Emulation Framework

Training Set Construction and Preprocessing

The foundation of the approach is SpecMatch-Emp, a high-S/N HIRES spectral library, purged of previously identified binaries by cross-matching with SIMBAD, Gaia, and ancillary catalogs. The model labels are $T_{\mathrm{eff}}$, radius, [Fe/H], $v\sin i$, and the instrumental PSF, from which the model predicts the flux at each pixel.

A significant innovation in preprocessing is a wavelet-filtering procedure to remove low-frequency variations (Figure 1). Unlike conventional continuum normalization, wavelet-filtering decomposes each spectral order, excising the lowest-frequency approximation coefficients, efficiently mitigating night-to-night instrumental systematics while preserving astrophysical features.

Figure 1: Schematic overview of the wavelet-based filtering used to suppress low-frequency, non-astrophysical HIRES spectral variations.

Empirical verification supports the filtering’s efficacy: wavelet-filtered spectra show reduced variance between repeat observations, suppressing systematics without degrading astrophysical content (Figure 2).

Figure 2: Night-to-night HIRES spectra pre- and post-wavelet filtering, showing effective suppression of non-astrophysical variability.
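
The filtering step described above can be illustrated with a minimal sketch. It assumes a Haar wavelet and a hand-written transform purely for simplicity; the summary does not state which wavelet family or decomposition depth the authors used, and the names `haar_decompose`, `haar_reconstruct`, and `wavelet_filter` are hypothetical. The idea is to decompose one spectral order, zero the coarsest approximation coefficients, and reconstruct.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_decompose(signal, levels):
    """Multi-level Haar transform.

    Returns (approximation, details) with details ordered coarse to fine.
    Assumes len(signal) is divisible by 2**levels.
    """
    approx = list(signal)
    details = []
    for _ in range(levels):
        pairs = [(approx[i], approx[i + 1]) for i in range(0, len(approx), 2)]
        details.append([(a - b) / SQRT2 for a, b in pairs])
        approx = [(a + b) / SQRT2 for a, b in pairs]
    return approx, details[::-1]

def haar_reconstruct(approx, details):
    """Invert haar_decompose (details must be ordered coarse to fine)."""
    out = list(approx)
    for det in details:
        finer = []
        for a, d in zip(out, det):
            finer.extend([(a + d) / SQRT2, (a - d) / SQRT2])
        out = finer
    return out

def wavelet_filter(flux, levels):
    """Remove the lowest-frequency component by zeroing the approximation."""
    approx, details = haar_decompose(flux, levels)
    return haar_reconstruct([0.0] * len(approx), details)

# Toy spectral order: a slow linear trend plus one narrow absorption line.
flux = [1.0 + 0.002 * i for i in range(64)]
flux[30] -= 0.4
filtered = wavelet_filter(flux, levels=4)
```

In this toy case the filtered spectrum averages to zero within each 16-pixel block while the narrow line at pixel 30 remains the dominant feature. A smoother wavelet would leave less of a sawtooth from the trend; Haar is used here only to keep the sketch short.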

Pixel Masking for Astrophysical Sensitivity

To avoid telluric contamination, a pixel mask is employed, excluding wavelength ranges susceptible to time-variable absorption. Critically, a second mask isolates ‘binary signature’ pixels—wavelengths where synthetic composite spectra of binaries deviate maximally from single-star analogs, based on a PHOENIX-based spectral grid (Figure 3).

Figure 3: HIRES spectrum of a sample star with telluric and binary signature pixels highlighted, identifying the spectral regions of highest binary sensitivity.
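
One way to picture the binary signature mask is as a ranking of pixels by how much a synthetic composite spectrum departs from its single-star analog. The sketch below is an illustration under assumptions: the function name `binary_signature_mask` and the `keep_fraction` parameter are hypothetical, and the summary does not give the selection threshold the authors applied to the PHOENIX-based grid.

```python
def binary_signature_mask(single_flux, composite_flux, keep_fraction=0.1):
    """Flag the pixels where the composite deviates most from the single star.

    Returns a list of booleans (True = binary-sensitive pixel). Ties at the
    threshold are all kept, so slightly more than keep_fraction may be flagged.
    """
    deviation = [abs(c - s) for s, c in zip(single_flux, composite_flux)]
    n_keep = max(1, round(keep_fraction * len(deviation)))
    threshold = sorted(deviation, reverse=True)[n_keep - 1]
    return [d >= threshold for d in deviation]

# Toy spectra: the composite differs most in the two deepest lines.
single = [1.0, 0.8, 1.0, 0.5, 1.0, 0.9, 1.0, 1.0, 0.7, 1.0]
composite = [1.0, 0.82, 1.0, 0.62, 1.0, 0.9, 0.99, 1.0, 0.78, 1.0]
mask = binary_signature_mask(single, composite, keep_fraction=0.2)
sensitive_pixels = [i for i, flagged in enumerate(mask) if flagged]
```

In practice such a mask would be intersected with the telluric mask so that only pixels that are both clean and binary-sensitive enter the fit.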

Model Architecture and Fitting

The main algorithmic tool is a piecewise, quadratic-in-labels version of The Cannon, which splits the parameter space into hot and cool stars to mitigate the limitations of a quadratic fit over a wide parameter range. During optimization, unconstrained Doppler velocities are fitted to compensate for registration uncertainties in the shift to the rest frame. The model is applied to both single-star and two-component (binary) fits; the latter is a weighted linear combination of The Cannon outputs for each component, with weights given by their V-band flux contributions.
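
A quadratic-in-labels model of this kind predicts each pixel's flux as a dot product between per-pixel coefficients and a vector built from the labels, their squares, and their cross terms. The sketch below shows that structure and the two-component combination. It is not the authors' implementation: the function names are hypothetical, the labels are assumed to be pre-scaled, the hot/cool split and the Doppler shift are omitted, and `weight_1` stands in for the primary's fractional V-band flux contribution.

```python
from itertools import combinations_with_replacement

def quadratic_features(labels):
    """Build [1, l_i, l_i * l_j (i <= j)] from a sequence of scaled labels."""
    feats = [1.0] + list(labels)
    feats += [a * b for a, b in combinations_with_replacement(labels, 2)]
    return feats

def predict_flux(theta, labels):
    """theta holds one coefficient row per pixel; returns the model spectrum."""
    feats = quadratic_features(labels)
    return [sum(c * f for c, f in zip(row, feats)) for row in theta]

def binary_flux(theta, labels_1, labels_2, weight_1):
    """Weighted linear combination of two single-star model spectra."""
    f1 = predict_flux(theta, labels_1)
    f2 = predict_flux(theta, labels_2)
    return [weight_1 * a + (1.0 - weight_1) * b for a, b in zip(f1, f2)]

# Toy model: two pixels, two labels (so six coefficients per pixel).
theta = [
    [1.0, -0.1, 0.0, 0.0, 0.0, 0.0],
    [0.9, 0.0, 0.05, 0.01, 0.0, 0.0],
]
primary = (0.5, -1.0)
secondary = (-0.5, 0.2)
single_spectrum = predict_flux(theta, primary)
binary_spectrum = binary_flux(theta, primary, secondary, weight_1=0.8)
```

With the five labels used in the paper, the feature vector would have 1 + 5 + 15 = 21 entries per pixel.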

Quantitative Evaluation of Stellar Label Transfer

A core diagnostic is label transfer: can The Cannon, trained on SpecMatch-Emp, accurately infer stellar parameters? Leave-20%-out validation finds that wavelet-filtered models achieve RMS scatter of 27 K in $T_{\mathrm{eff}}$, 0.10 $R_\odot$ in radius, 0.09 dex in [Fe/H], and 0.87 km/s in $v\sin i$, matching or outperforming existing models. Gains are largest at moderate S/N ($\sim$45 per pixel), typical of CKS spectra, where wavelet filtering sharply reduces bias and outliers, especially in [Fe/H].
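
The leave-20%-out procedure can be sketched as a five-fold loop in which every star is predicted exactly once by a model that never saw it. This is a minimal illustration, assuming interleaved folds; the summary does not say how the authors assigned stars to folds, and `train_and_predict` is a hypothetical callback standing in for training The Cannon and inferring labels.

```python
import math

def five_fold_indices(n_stars):
    """Split star indices into five interleaved folds of about 20% each."""
    return [list(range(k, n_stars, 5)) for k in range(5)]

def rms_scatter(predicted, reference):
    """Root-mean-square difference between predicted and reference labels."""
    total = sum((p - r) ** 2 for p, r in zip(predicted, reference))
    return math.sqrt(total / len(reference))

def cross_validate(labels, train_and_predict):
    """Predict each star from a model trained on the other ~80% of stars.

    train_and_predict(train_idx, test_idx) must return one predicted label
    per entry of test_idx.
    """
    n = len(labels)
    predicted = [None] * n
    for held_out in five_fold_indices(n):
        held = set(held_out)
        train = [i for i in range(n) if i not in held]
        for i, value in zip(held_out, train_and_predict(train, held_out)):
            predicted[i] = value
    return rms_scatter(predicted, labels)

# Toy stand-in for a trained model: predict the mean training label.
teff = [4800.0, 5200.0, 5600.0, 5900.0, 6100.0, 5000.0, 5400.0, 5750.0, 6000.0, 6250.0]

def mean_predictor(train_idx, test_idx):
    mean = sum(teff[i] for i in train_idx) / len(train_idx)
    return [mean] * len(test_idx)

scatter = cross_validate(teff, mean_predictor)
```

The same loop would be repeated per label to obtain scatter values like those quoted above.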

Binary Sensitivity and Limitations

Validation on Known Single and Binary Stars

The model's capacity to flag binaries is evaluated on a validation set: 102 bona fide single stars (Raghavan et al.), S/N-matched to the 97 CKS binaries with well-characterized companions (Sullivan et al. catalog). For each star, the difference in Bayesian Information Criterion ($\Delta\mathrm{BIC}$) between the single and binary fits and the improvement fraction $f_{\rm imp}$ are measured (Figure 4).

Figure 4: Distribution of BIC difference versus improvement fraction across validation samples, highlighting limited separation power between singles and binaries.
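
The two statistics can be sketched as follows. Several choices here are assumptions not confirmed by this summary: the BIC is written as $\chi^2 + k \ln n$ (Gaussian errors, constants dropped), $\Delta\mathrm{BIC}$ is taken as single minus binary so that positive values favor the binary fit, the parameter counts are placeholders, and $f_{\rm imp}$ follows the definition used in earlier binary searches with data-driven models (El-Badry et al. 2018), which may differ in detail from the paper's.

```python
import math

def chi_squared(flux, model, sigma):
    """Sum of squared, uncertainty-normalized residuals."""
    return sum(((f - m) / s) ** 2 for f, m, s in zip(flux, model, sigma))

def bic(flux, model, sigma, n_params):
    """BIC for Gaussian errors, up to an additive constant."""
    return chi_squared(flux, model, sigma) + n_params * math.log(len(flux))

def delta_bic(flux, single, binary, sigma, k_single, k_binary):
    """Positive values mean the binary model is preferred."""
    return bic(flux, single, sigma, k_single) - bic(flux, binary, sigma, k_binary)

def improvement_fraction(flux, single, binary, sigma):
    """Fraction of the single/binary model difference that improves the fit."""
    num = sum((abs(s - f) - abs(b - f)) / e
              for f, s, b, e in zip(flux, single, binary, sigma))
    den = sum(abs(s - b) / e for s, b, e in zip(single, binary, sigma))
    return num / den if den > 0 else 0.0

# Toy case: the binary model reproduces the data exactly.
flux = [1.0, 0.9, 0.7, 0.95]
single_model = [1.0, 0.92, 0.75, 0.95]
binary_model = list(flux)
sigma = [0.01] * 4
d_bic = delta_bic(flux, single_model, binary_model, sigma, k_single=6, k_binary=12)
f_imp = improvement_fraction(flux, single_model, binary_model, sigma)
```

Under this definition $f_{\rm imp}$ approaches 1 when every difference between the two models moves the fit toward the data, and it is near zero or negative when the binary model differs from the single model without improving the fit.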

A critical finding is that while the binary signature mask tightens the $\Delta\mathrm{BIC}$ distribution for single stars, significant overlap persists: the empirical distributions of $\Delta\mathrm{BIC}$ and $f_{\rm imp}$ for singles and binaries are not well separated, indicating a high false-positive rate. Only binaries with close separations and moderate flux ratios exhibit systematically elevated $\Delta\mathrm{BIC}$ (Figure 5).

Figure 5: Magnitude difference $\Delta m_K$ versus separation among validation binaries, with color-coding by BIC difference.

Examination of fit residuals reveals that best-fit single and binary models for both single-star and binary cases agree with observed spectra only at the 2–5% level (Figure 6), which is comparable to or less than the expected secondary star flux contribution for most binaries in the sample.

Figure 6: HIRES spectra and best-fit single and binary model residuals for exemplars; both yield residuals far exceeding flux uncertainties, limiting discriminability.

Theoretical Sensitivity from Simulated Data

Synthetic experiments using ab initio PHOENIX models clarify the fundamental limit: secondary companions cooler than $\sim$3500 K contribute less than the median per-pixel noise (S/N $\sim$45 implies $\sim$2% flux error), making them intrinsically undetectable (Figure 7).

Figure 7: Predicted secondary-to-total flux fractions as a function of stellar temperature, marking the photon noise floor for detectability.
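
The scale of this limit can be checked with a rough calculation. The sketch below swaps in blackbody spectra for the PHOENIX models used in the paper, so it is only an order-of-magnitude illustration; the 550 nm reference wavelength and the example radii are assumptions, and the function names are hypothetical.

```python
import math

H = 6.62607015e-34   # Planck constant, J s
C = 2.99792458e8     # speed of light, m / s
K_B = 1.380649e-23   # Boltzmann constant, J / K

def planck(wavelength_m, temp_k):
    """Blackbody spectral radiance B_lambda(T)."""
    x = H * C / (wavelength_m * K_B * temp_k)
    return (2.0 * H * C ** 2 / wavelength_m ** 5) / math.expm1(x)

def secondary_flux_fraction(t1, r1, t2, r2, wavelength_m=550e-9):
    """Secondary-to-total flux at one wavelength for two blackbody stars.

    Radii may be in any common unit, since only their ratio matters.
    """
    f1 = r1 ** 2 * planck(wavelength_m, t1)
    f2 = r2 ** 2 * planck(wavelength_m, t2)
    return f2 / (f1 + f2)

def noise_floor(snr):
    """Fractional per-pixel flux uncertainty implied by a given S/N."""
    return 1.0 / snr

# Sun-like primary with a 3500 K, 0.4 R_sun companion (illustrative values).
fraction = secondary_flux_fraction(5800.0, 1.0, 3500.0, 0.4)
floor = noise_floor(45.0)
detectable = fraction > floor
```

For these example values the companion's contribution falls below the $1/45 \approx 2.2\%$ per-pixel noise, consistent with the qualitative conclusion above.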

The simulated binary likelihood analysis (Figure 8) demonstrates strong discrimination only for $0.1 < f_2/f_1 < 0.6$ and for systems with sufficiently distinct $T_{\rm eff}$, confirming that for realistic S/N and intrinsic model scatter, only a subset of binaries is theoretically detectable.

Figure 8: BIC improvement for simulated binaries, showing significant values only for intermediate flux ratios and confirming selection effects.

Discussion

Factors Limiting Binary Identification

The dominant limitation is the model’s per-pixel flux accuracy of $\sim$2–3%. This is sufficient for robust label transfer but insufficient for reliably identifying the $\sim$2–10% spectral distortions induced by most unresolved companions. Unmodeled systematics, finite training-library coverage, limited label precision, and a possible loss of discriminatory power due to wavelet filtering (particularly in the depths of broad features) further reduce the completeness of binary detection.

The comparatively higher success of analogous methods on APOGEE data is attributed to greater homogeneity of the survey data, more stable instrumental profiles, and larger, consistently labeled training sets. HIRES data, by contrast, have higher latent dimensionality and a less stable instrument.

Prospects for Improved Spectral Emulation

Augmenting model complexity (e.g., neural architectures such as The Payne) holds promise for improved emulation but is challenging with current sample sizes, carrying significant overfitting risk. Alternative approaches—Gaussian process modeling, higher polynomial order, or training with data from next-generation stabilized spectrographs (e.g., HARPS-N, KPF)—could yield material improvements.

Expanding and homogenizing training libraries, refining continuum normalization to preserve all binary-sensitive features, and better modeling of instrumental systematics are necessary prerequisites for advancing detection completeness. These improvements are especially critical for upcoming surveys focusing on exoplanet host characterization and Galactic archeology.

Conclusion

The study demonstrates that quadratic data-driven models, even with sophisticated preprocessing like wavelet filtering, deliver high-fidelity label transfer for modest-S/N HIRES spectra. However, limits in per-pixel flux accuracy and training set quality preclude confident, automated identification of the majority of spectroscopic binaries in current CKS data. Simulations confirm that binary detectability is inherently limited to moderate flux ratios and distinct spectral types, constrained by S/N and model accuracy. Future advances hinge on larger, uniformly labeled libraries, physically motivated feature extraction, and higher-complexity, instrument-specific modeling, all of which are pivotal for robust spectroscopic binary population mapping in the era of massive spectroscopic surveys.
