Final Project
4/24/25
Problem Description:
Florida is one of the world’s premier tourism destinations, attracting well over 100 million visitors annually. Yet visitor counts vary dramatically by season, peaking in winter months when travelers escape colder climates and again in summer when families take vacations between school sessions. This volatility poses operational and strategic challenges for a wide range of stakeholders, from hotel and theme-park operators to transportation agencies and local governments. In an increasingly digital world, consumers signal their travel intentions earlier, often via online search engines. Google Trends data therefore offers a near-real-time, high-frequency proxy for consumer interest in “Florida vacation” and “beach vacation.” This project seeks to harness those digital search signals to build a timelier, more responsive forecasting model of weekly visitor volumes in Florida. If search interest, especially in “beach” vacations, proves to be positively associated with real-world tourism counts, it can give industry planners an early-warning indicator. Such a tool could improve operational efficiency, enhance marketing targeting, and bolster economic planning for one of the nation’s most tourism-dependent economies.
H₀: There is no relationship between online search interest and quarterly tourist volume in Florida.
H₁: Higher Google search interest is positively associated with higher quarterly visitor counts.
Related Work:
This work focuses on the regression and correlation methods covered in Module 5. Descriptive statistics and visualization were also used to understand distributions and seasonal patterns, with the graphs built in ggplot2 as covered in Module 7.
Solution:
First, the data was sourced from Google Trends, a free, web-based tool provided by Google that allows users to examine the popularity of specific search terms over time and across geographic regions. At first, I tried to access the data through the gtrendsR package in R. However, I encountered issues with authentication and limitations in pulling consistent long-term data for multiple keywords. As a result, I opted for a more stable approach: manually exporting the data for each keyword ("Florida vacation", "Beach vacation", and "Things to do in Florida") from the Google Trends interface in weekly format. These CSV files were then imported into R, combined, and cleaned using tidyverse functions to prepare for analysis.
Because weekly data points were used, more than 200 observations were initially sourced. Hence, the data for the phrase "Things to do in Florida" was omitted, and only "Florida vacation" and "Beach vacation" were used for this project.
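The import-and-combine step can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the R/tidyverse code used for the project. The inline CSV text, its column headers, and the three weekly values are made-up placeholders, and the helper names read_series and merge_on_week are hypothetical.

```python
import csv
import io

# Hypothetical stand-ins for two keyword exports (values are made up,
# and the real export layout may differ from this assumed header).
FLORIDA_CSV = """Week,Florida vacation
2019-01-06,52
2019-01-13,55
2019-01-20,50
"""
BEACH_CSV = """Week,Beach vacation
2019-01-06,61
2019-01-13,64
2019-01-20,58
"""


def read_series(text, value_column):
    """Return {week: interest} for one keyword export."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["Week"]: int(row[value_column]) for row in reader}


def merge_on_week(florida, beach):
    """Inner-join the two series on the week key, ordered by date string."""
    weeks = sorted(set(florida) & set(beach))
    return [
        {"week": w, "search_florida": florida[w], "search_beach": beach[w]}
        for w in weeks
    ]


florida = read_series(FLORIDA_CSV, "Florida vacation")
beach = read_series(BEACH_CSV, "Beach vacation")
merged = merge_on_week(florida, beach)
```

Joining on the week key rather than on row position means a missing or extra week in one export drops out of the merged table instead of silently misaligning the two series.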
Step 4 begins by loading the merged weekly Google Trends dataset from “weekly_google_trends_merged.csv,” which covers 316 weeks between January 6, 2019, and December 29, 2024. The two key predictors, search_florida and search_beach, are measured on Google’s 0–100 relative-interest scale. Summary statistics show that search_florida ranges from 19 to 100 (median 45, mean 46.9, interquartile range 34.8–57.0), while search_beach ranges from 18 to 100 (median 53.5, mean 54.2, interquartile range 37.0–68.0). The wider spread and higher central tendency of the beach searches indicate more pronounced seasonal spikes, and occasional 100-point values in both series point to event-driven surges (for example, spring break or peak summer travel).
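The summary statistics quoted above (minimum, quartiles, median, mean, maximum) can be reproduced with a short helper. The sketch below is in Python rather than the R used for the project, and the five input values are placeholders, not the project data. The function name summarize is hypothetical.

```python
import statistics


def summarize(values):
    """Five-number summary plus mean and IQR for one search series."""
    # method="inclusive" interpolates between observed points and treats
    # the data as including its own minimum and maximum.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": statistics.mean(values),
        "q3": q3,
        "max": max(values),
        "iqr": q3 - q1,
    }


example = [19, 30, 45, 60, 100]  # placeholder values on the 0-100 scale
summary = summarize(example)
```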
In order to visualize temporal patterns, the two search series were reshaped into long format using pivot_longer(), then plotted together with ggplot2. The resulting line chart reveals clear seasonality: interest rises sharply in late winter and spring, peaks in early summer, and declines toward the end of the year. The “beach” series exhibits particularly acute spikes around March (spring break) and June–July (summer vacations), whereas the general “florida” series maintains a steadier baseline—reflecting year-round inquiries about the state. These descriptive findings highlight the need to account for seasonality and to monitor outlier weeks when fitting the forthcoming regression models.
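The wide-to-long reshape described above turns one row per week (with a column per keyword) into one row per week and keyword, which is the layout a grouped line chart needs. A minimal sketch in Python, assuming rows are plain dictionaries. The function name to_long and the sample values are hypothetical, and this is not the tidyr implementation.

```python
def to_long(rows, id_col, value_cols, names_to="keyword", values_to="interest"):
    """Emit one output row per (input row, value column) pair."""
    long_rows = []
    for row in rows:
        for col in value_cols:
            long_rows.append(
                {id_col: row[id_col], names_to: col, values_to: row[col]}
            )
    return long_rows


wide = [  # placeholder values, not the project data
    {"week": "2019-01-06", "search_florida": 52, "search_beach": 61},
    {"week": "2019-01-13", "search_florida": 55, "search_beach": 64},
]
long_format = to_long(wide, "week", ["search_florida", "search_beach"])
```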
To check for multicollinearity, the correlation between the two predictors was computed, and it was high at about 0.84. The close relationship is expected given the terms' shared context and natural overlap, but it does not appear to skew the results or interpretations in a significantly negative way. Since the model is predictive and multicollinearity did not harm accuracy, both variables could be kept, but I was curious to see the results of Lasso regression.
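For two predictors, the severity of multicollinearity can be read directly off their Pearson correlation, because the variance inflation factor reduces to 1 / (1 - r^2). The sketch below, in Python rather than R, computes both. The function names are hypothetical, and 0.84 is the correlation reported in this project.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(r):
    """Variance inflation factor when the model has exactly two predictors."""
    return 1.0 / (1.0 - r ** 2)


vif = vif_two_predictors(0.84)  # about 3.4
```

A value of about 3.4 means the variance of each coefficient estimate is roughly 3.4 times what it would be if the two search series were uncorrelated.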
Lasso regression can be used when there is high multicollinearity between variables because it applies an L1 penalty that can shrink some coefficient estimates exactly to zero, effectively performing variable selection and reducing model complexity. At lambda.min, both predictors were retained (search_florida: +0.322; search_beach: −0.243). At lambda.1se, a more conservative model chosen by the 1-standard-error rule, search_florida was removed. The reported coefficients were (Intercept): 2.414445; search_florida: 0.000000; search_beach: . Lasso removed search_florida entirely, selecting a simpler model with only the intercept and a diminished contribution from search_beach. This process demonstrated how Lasso can both mitigate multicollinearity and enhance model interpretability by prioritizing the most distinctive predictive signals.
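The way an L1 penalty drives coefficients exactly to zero can be seen in a small coordinate-descent implementation. This is a teaching sketch in Python, not the routine used for the project: it minimizes (1/2n) * RSS + lambda * sum(|b_j|) on centered data, does not standardize predictors, and runs a fixed number of sweeps. The function names and the toy data are hypothetical.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; return exactly 0 inside [-lam, lam]."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Return (intercept, coefficients) for an L1-penalized linear model."""
    n, p = len(y), len(X[0])
    x_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    # Center predictors and response so the intercept is not penalized.
    Xc = [[row[j] - x_means[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            rho = 0.0  # covariance of x_j with the partial residual
            z = 0.0    # mean square of x_j
            for i in range(n):
                others = sum(Xc[i][k] * beta[k] for k in range(p) if k != j)
                rho += Xc[i][j] * (yc[i] - others)
                z += Xc[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (z / n) if z > 0 else 0.0
    intercept = y_mean - sum(beta[j] * x_means[j] for j in range(p))
    return intercept, beta


# Toy data generated from y = 1 + 2*x1 - 1*x2 with correlated predictors.
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6], [6, 5]]
y = [1, 4, 3, 6, 5, 8]

ols_intercept, ols_beta = lasso_coordinate_descent(X, y, lam=0.0)
big_intercept, big_beta = lasso_coordinate_descent(X, y, lam=100.0)
```

With lam=0 the procedure recovers the ordinary least-squares fit, while a large enough penalty sets every coefficient to exactly zero and leaves only the intercept, which then equals the mean of y. Intermediate penalties fall between these two extremes, which is the range a cross-validated choice of lambda searches over.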
Conclusion/Abstract:
The multiple linear regression model showed that both “search_florida” and “search_beach” were statistically significant predictors of estimated weekly visitor counts, with R² = 0.1241 and a highly significant F-statistic (F(2, 313) = 22.17, p < 0.0001). Notably, “search_florida” was positively associated with tourism volume (β = 0.01936, p = 2.05 × 10⁻¹⁰), directly supporting the alternative hypothesis (H₁) that increased search activity corresponds to higher visitor counts. The negative coefficient on “search_beach” (β = –0.01299, p = 1.15 × 10⁻⁶) likely resulted from multicollinearity (correlation = 0.84), which prompted further investigation. To address this, a Lasso Regression model was implemented. At the optimal penalty (lambda.min), both predictors were retained. However, at the more conservative lambda.1se, Lasso zeroed out “search_florida,” favoring a simpler model that prioritized only “search_beach”—demonstrating the model’s ability to automatically resolve multicollinearity and preserve only the strongest predictive signal.
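As a consistency check, the overall F-statistic of a linear regression can be recomputed from R², the number of observations, and the number of predictors. The short Python sketch below uses the figures reported above (R² = 0.1241, 316 weekly observations, 2 predictors). The function name is hypothetical.

```python
def f_statistic(r_squared, n_obs, n_predictors):
    """Overall F-test of a linear model with an intercept."""
    df_model = n_predictors
    df_resid = n_obs - n_predictors - 1
    f_value = (r_squared / df_model) / ((1.0 - r_squared) / df_resid)
    return f_value, df_model, df_resid


f_value, df_model, df_resid = f_statistic(0.1241, 316, 2)
# f_value rounds to 22.17 on (2, 313) degrees of freedom
```

The recomputed value agrees with the reported F(2, 313) = 22.17, so the R² and the F-test describe the same fit: highly significant overall, but explaining only about 12% of the variance in the estimated visitor counts.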
A key strength of this analysis lies in its use of real-time, publicly available Google Trends data to forecast visitor volume in a major tourism economy. Future enhancements could involve incorporating complementary variables such as weather patterns, airfare costs, or major event calendars to boost predictive accuracy. The methodology presented here can also be applied to other regions or industries where digital behavior reflects real-world activity, offering a scalable foundation for data-driven planning.