Final Project

4/24/25

Problem Description:

Florida is one of the world’s premier tourism destinations, attracting well over 100 million visitors annually. Yet visitor counts vary dramatically by season—peaking in winter months when travelers escape colder climates, and again in summer when families take vacations between school sessions. This volatility poses operational and strategic challenges for a wide range of stakeholders, from hotel and theme-park operators to transportation agencies and local governments. In an increasingly digital world, consumers signal their travel intentions earlier—often via online search engines. Google Trends data therefore offers a real-time, high-frequency proxy for consumer interest in “Florida vacation” and “beach vacation.” This project seeks to harness those digital search signals to build a timelier, more responsive forecasting model of weekly visitor volumes in Florida. By demonstrating that search interest—especially in “beach” vacations—correlates strongly and positively with real-world tourism counts, we can equip industry planners with an early-warning indicator. Such a tool can improve operational efficiency, enhance marketing targeting, and bolster economic planning for one of the nation’s most tourism-dependent economies.

H₀: There is no relationship between online search interest and quarterly tourist volume in Florida.
H₁: Higher Google search interest is positively associated with higher quarterly visitor counts.

Related Work:

This project draws mainly on the regression and correlation methods from Module 5. Descriptive statistics and visualization were also used to understand distributions and seasonal patterns, with the plots built in ggplot2 as covered in Module 7.

Solution:

First, the data were sourced from Google Trends, a free, web-based tool provided by Google that lets users examine the popularity of specific search terms over time and across geographic regions. I initially tried to pull the data through the gtrendsR package in R. However, I ran into authentication issues and limits on pulling consistent long-term data for multiple keywords. As a result, I opted for a more stable approach and manually exported the weekly data for each keyword ("Florida vacation", "Beach vacation", and "Things to do in Florida") from the Google Trends interface. These CSV files were then imported into R, combined, and cleaned using tidyverse functions to prepare them for analysis.

Because weekly data points were used, more than 200 observations were initially sourced. The data for the phrase "Things to do in Florida" were therefore omitted, and only "Florida vacation" and "Beach vacation" were used for this project.
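The import-and-combine step described above can be sketched roughly as follows. This is a minimal base-R illustration rather than the exact tidyverse code used for the project: the helper names read_trend and write_demo are hypothetical, the two demo files contain made-up values, and the assumption that each export starts with two preamble lines before the header row should be checked against the actual files.

```r
# Read one Google Trends export and give its interest column a short name.
# Assumption: the export has two preamble lines before the header row.
read_trend <- function(path, value_name, skip = 2) {
  df <- read.csv(path, skip = skip, stringsAsFactors = FALSE)
  names(df) <- c("week", value_name)
  df$week <- as.Date(df$week)
  # Very low interest may appear as "<1"; it is treated as 0 here (a simplification).
  df[[value_name]] <- as.numeric(sub("<1", "0", df[[value_name]], fixed = TRUE))
  df
}

# Write a tiny stand-in export file (made-up values, not the project data).
write_demo <- function(term, values) {
  path <- tempfile(fileext = ".csv")
  writeLines(c("Category: All categories", "",
               paste0("Week,", term, ": (United States)"),
               paste(c("2019-01-06", "2019-01-13", "2019-01-20"), values, sep = ",")),
             path)
  path
}
florida_csv <- write_demo("Florida vacation", c(40, 45, 52))
beach_csv   <- write_demo("Beach vacation", c(50, "<1", 61))

# Combine the two series into one table keyed by week.
trends <- merge(read_trend(florida_csv, "search_florida"),
                read_trend(beach_csv, "search_beach"),
                by = "week")
print(trends)
```

Merging on the week column keeps only the weeks present in both exports, which is a convenient check that the files cover the same date range.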

Step 4 begins by loading the merged weekly Google Trends dataset from “weekly_google_trends_merged.csv,” which covers 316 weeks between January 6, 2019, and December 29, 2024. The two key predictors, search_florida and search_beach, are measured on Google’s 0–100 relative-interest scale. Summary statistics show that search_florida ranges from 19 to 100 (median 45, mean 46.9, interquartile range 34.8 to 57.0), while search_beach ranges from 18 to 100 (median 53.5, mean 54.2, interquartile range 37.0 to 68.0). The wider spread and higher central tendency of the beach searches indicate more pronounced seasonal spikes, and the occasional 100-point values in both series point to event-driven surges (for example, spring break or peak summer travel).
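Summary statistics of this kind can be produced with a few lines of base R. The sketch below uses simulated stand-in values (not the project data), so its numbers will not match the ones reported above; only the column names follow the post.

```r
# Simulated stand-in for the merged data (illustrative values only).
set.seed(1)
trends <- data.frame(
  week = seq(as.Date("2019-01-06"), by = "week", length.out = 52),
  search_florida = sample(19:100, 52, replace = TRUE),
  search_beach   = sample(18:100, 52, replace = TRUE)
)

# Min, quartiles, median, mean, and max for each search index.
stats <- sapply(trends[c("search_florida", "search_beach")], summary)
print(round(stats, 1))

# Interquartile range as a single number per series.
iqr <- sapply(trends[c("search_florida", "search_beach")], IQR)
print(iqr)
```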

In order to visualize temporal patterns, the two search series were reshaped into long format using pivot_longer(), then plotted together with ggplot2. The resulting line chart reveals clear seasonality: interest rises sharply in late winter and spring, peaks in early summer, and declines toward the end of the year. The “beach” series exhibits particularly acute spikes around March (spring break) and June–July (summer vacations), whereas the general “florida” series maintains a steadier baseline—reflecting year-round inquiries about the state. These descriptive findings highlight the need to account for seasonality and to monitor outlier weeks when fitting the forthcoming regression models.
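A minimal sketch of the reshape-and-plot step, assuming the tidyr and ggplot2 packages are installed. The two series are simulated with an artificial yearly cycle purely so the example runs on its own; the axis labels and the "term" and "interest" column names are my own choices, not necessarily those used in the project.

```r
library(tidyr)
library(ggplot2)

# Simulated weekly series with a yearly cycle (illustrative values only).
set.seed(1)
trends <- data.frame(
  week = seq(as.Date("2019-01-06"), by = "week", length.out = 104),
  search_florida = round(45 + 15 * sin(2 * pi * (1:104) / 52) + rnorm(104, sd = 5)),
  search_beach   = round(54 + 25 * sin(2 * pi * (1:104) / 52) + rnorm(104, sd = 5))
)

# Wide to long: one row per week and search term.
trends_long <- pivot_longer(trends,
                            cols = c(search_florida, search_beach),
                            names_to = "term", values_to = "interest")

# One line per search term on a shared time axis.
p <- ggplot(trends_long, aes(x = week, y = interest, colour = term)) +
  geom_line() +
  labs(x = "Week", y = "Relative search interest (0-100)", colour = "Search term")
print(p)
```

Long format is what lets ggplot2 map the search term to colour, so both series share one geom_line() call instead of needing a layer each.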

Step 6 begins by importing the weekly Google Trends data and the quarterly visitor totals, then extracting “year” and “quarter” from each weekly date using lubridate. Those two datasets are joined on year and quarter, and an estimated weekly_visitors metric (in millions) is created by dividing each quarter’s total by 13 (the approximate number of weeks per quarter). A multiple linear regression of weekly_visitors on search_florida and search_beach was then fitted.
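The join and model fit described in this step might look roughly like the sketch below, assuming dplyr and lubridate are installed. Both data frames are simulated, and the quarterly totals and the visitors_millions column name are hypothetical placeholders, so the fitted coefficients here carry no meaning.

```r
library(dplyr)
library(lubridate)

# Simulated weekly search data (illustrative values only).
set.seed(42)
trends <- data.frame(
  week = seq(as.Date("2019-01-06"), by = "week", length.out = 104),
  search_florida = sample(19:100, 104, replace = TRUE),
  search_beach   = sample(18:100, 104, replace = TRUE)
)

# Hypothetical quarterly visitor totals in millions (made-up numbers).
quarterly <- expand.grid(quarter = 1:4, year = 2019:2020)
quarterly$visitors_millions <- c(35, 32, 31, 30, 33, 13, 22, 27)

model_data <- trends %>%
  # Derive the join keys from each weekly date.
  mutate(year = as.integer(year(week)),
         quarter = as.integer(quarter(week))) %>%
  left_join(quarterly, by = c("year", "quarter")) %>%
  # Spread each quarter's total evenly over roughly 13 weeks.
  mutate(weekly_visitors = visitors_millions / 13)

fit <- lm(weekly_visitors ~ search_florida + search_beach, data = model_data)
print(summary(fit))
```

Note that dividing by 13 gives every week in a quarter the same response value, so the weekly rows are not independent observations of visitor volume.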

The linear regression of weekly visitor counts on Google search interest yields several noteworthy results. The model’s intercept is estimated at 2.209, meaning that when both search indices are zero, the predicted baseline is approximately 2.209 million weekly visitors; this estimate is highly significant (p < 2 × 10⁻¹⁶). The coefficient for “search_florida” is 0.01936 (SE = 0.00295, t = 6.57, p = 2.05 × 10⁻¹⁰), so each one-point increase in Florida search interest is associated with an average increase of about 0.019 million weekly visitors, holding beach searches constant. Conversely, “search_beach” has a negative coefficient of −0.01299 (SE = 0.00262, t = −4.96, p = 1.15 × 10⁻⁶), suggesting that higher relative interest in beach searches corresponds to a slight decrease in visitors when Florida searches are held fixed.

The residual standard error of 0.4734 on 313 degrees of freedom reflects moderate scatter around the fitted line. With a multiple R-squared of 0.1241 (adjusted R² = 0.1185), the model explains roughly 12% of the variance in weekly visitor counts. The overall F-test (F(2, 313) = 22.17, p ≈ 9.9 × 10⁻¹⁰) confirms that the two predictors together significantly improve prediction over a null model, which allows rejection of H₀, the hypothesis that there is no relationship between online search interest and visitor volume.

In particular, the positive and highly significant coefficient on “search_florida” (β = 0.01936, p = 2.05 × 10⁻¹⁰) directly supports H₁ by showing that increases in Florida-related search interest are associated with higher visitor counts. The negative coefficient on “search_beach” (β = −0.01299, p = 1.15 × 10⁻⁶) suggests a more complex, conditional relationship, likely reflecting multicollinearity between the two predictors, which is addressed in the next step. It does not undermine the positive link between Florida-related search activity and tourist volume.
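To make the coefficients concrete, the fitted equation can be written as a small R function. The three coefficient values are copied from the regression output reported above; the function name predict_visitors is my own, and the inputs 45 and 53.5 are the medians of the two search series from the summary statistics.

```r
# Coefficients as reported in the regression output.
b0 <- 2.209
b_florida <- 0.01936
b_beach <- -0.01299

# Predicted weekly visitors (in millions) for given search-interest levels.
predict_visitors <- function(search_florida, search_beach) {
  b0 + b_florida * search_florida + b_beach * search_beach
}

# Prediction at the median search levels (45 and 53.5): about 2.385.
at_median <- predict_visitors(45, 53.5)
print(at_median)

# Change from a 10-point rise in Florida interest, beach held fixed: about 0.194.
ten_point_effect <- predict_visitors(55, 53.5) - predict_visitors(45, 53.5)
print(ten_point_effect)
```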

Description of plots:

The diagnostic plots reveal several key insights about the model. First, the Residuals vs. Fitted plot exhibits a slight curve and a funnel-shaped pattern, indicating that the predictors do not capture all of the variability and that a non-linear relationship or heteroscedasticity may be present. The Q–Q plot shows that the residuals deviate from the diagonal line, particularly in the tails, signifying non-normality, which could compromise the accuracy of p-value-based inference even if predictive performance remains adequate. The Scale–Location plot further points to heteroscedasticity, displaying a downward trend in which error variance decreases as fitted values increase, thus violating the assumption of constant spread. Finally, the Residuals vs. Leverage plot shows that most observations are tightly clustered, with only a few points (such as observations 72, 71, and 76) exerting slightly higher leverage but none appearing excessively influential. Overall, although the model achieves statistical significance, these diagnostics suggest that the linearity, constant-variance, and normality assumptions are not fully met, so the coefficient p-values should be interpreted with caution.
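The four diagnostic plots discussed here are the default output of calling plot() on a fitted lm object in base R. The sketch below fits a model to simulated data (illustrative values only, not the project data) so that it runs on its own; the leverage and Cook's distance lines are an optional numeric companion to the Residuals vs. Leverage panel.

```r
# Simulated data with two correlated predictors (illustrative values only).
set.seed(7)
n <- 200
search_florida <- runif(n, 19, 100)
search_beach <- 0.8 * search_florida + rnorm(n, sd = 10)
weekly_visitors <- 2.2 + 0.02 * search_florida - 0.01 * search_beach +
  rnorm(n, sd = 0.5)
fit <- lm(weekly_visitors ~ search_florida + search_beach)

# Default panels: Residuals vs Fitted, Normal Q-Q, Scale-Location,
# and Residuals vs Leverage, arranged in a 2 x 2 grid.
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)

# Numeric companions to the leverage panel.
lev <- hatvalues(fit)
cook <- cooks.distance(fit)
print(head(sort(cook, decreasing = TRUE), 3))
```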

To check for multicollinearity, the correlation between the two predictors was computed, and it was high at about 0.84. This close relationship is expected given the shared context and natural overlap of the two search terms, but it does not appear to distort the results or their interpretation substantially. Since the model is predictive and multicollinearity did not harm accuracy, both variables could be kept, but I was curious to see the results of Lasso regression.
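One common way to put a number on multicollinearity is the variance inflation factor (VIF). With exactly two predictors it reduces to 1 / (1 − r²), where r is their correlation. The sketch below is an illustration under that two-predictor assumption; the helper name vif_from_r is hypothetical, the simulated series are not the project data, and the last lines plug in the correlation of 0.84 reported in the conclusion.

```r
# VIF for a model with exactly two predictors: 1 / (1 - r^2).
vif_from_r <- function(r) 1 / (1 - r^2)

# Simulated correlated predictors (illustrative values only).
set.seed(3)
search_florida <- runif(150, 19, 100)
search_beach <- 0.8 * search_florida + rnorm(150, sd = 10)

r <- cor(search_florida, search_beach)  # Pearson correlation
print(c(r = r, vif = vif_from_r(r)))

# Using the correlation reported in this post: about 3.40.
vif_reported <- vif_from_r(0.84)
print(vif_reported)
```

A VIF near 3.4 means the variance of each coefficient estimate is roughly 3.4 times what it would be with uncorrelated predictors, which helps explain why the individual coefficients can be hard to interpret.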

Lasso regression can be used when there is high multicollinearity between variables because it applies an L1 penalty that can shrink some coefficient estimates exactly to zero, effectively performing variable selection and reducing model complexity. At lambda.min, both predictors were retained (search_florida: +0.322; search_beach: −0.243). At lambda.1se (a more conservative model chosen by the 1-standard-error rule), search_florida was removed; the output was (Intercept): 2.414445, search_florida: 0.000000, search_beach: . Lasso removed search_florida entirely, selecting a simpler model with only the intercept and a diminished contribution from search_beach. This process demonstrated how Lasso can both mitigate multicollinearity and enhance model interpretability by prioritizing the most unique predictive signals.
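A cross-validated Lasso fit of this kind can be sketched with the glmnet package, whose cv.glmnet() function reports both lambda.min and lambda.1se. I am assuming glmnet here because those names match its output; the data below are simulated (illustrative values only), so the coefficients will differ from the ones quoted above.

```r
library(glmnet)

# Simulated data with two correlated predictors (illustrative values only).
set.seed(123)
n <- 200
search_florida <- runif(n, 19, 100)
search_beach <- 0.8 * search_florida + rnorm(n, sd = 10)
weekly_visitors <- 2.2 + 0.02 * search_florida - 0.013 * search_beach +
  rnorm(n, sd = 0.5)

# glmnet expects a numeric predictor matrix and a response vector.
x <- cbind(search_florida, search_beach)
cv_fit <- cv.glmnet(x, weekly_visitors, alpha = 1)  # alpha = 1 selects the lasso

# Coefficients at the error-minimizing and the more conservative penalty.
coef_min <- coef(cv_fit, s = "lambda.min")
coef_1se <- coef(cv_fit, s = "lambda.1se")
print(coef_min)
print(coef_1se)
```

When reading the printed coefficients, it may help to know that they come back as a sparse matrix, and in that printout a "." entry denotes a coefficient that is exactly zero.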

Conclusion/Abstract:

The multiple linear regression model showed that both “search_florida” and “search_beach” were statistically significant predictors of estimated weekly visitor counts, with R² = 0.1241 and a highly significant F-statistic (F(2, 313) = 22.17, p < 0.0001). Notably, “search_florida” was positively associated with tourism volume (β = 0.01936, p = 2.05 × 10⁻¹⁰), directly supporting the alternative hypothesis (H₁) that increased search activity corresponds to higher visitor counts. The negative coefficient on “search_beach” (β = –0.01299, p = 1.15 × 10⁻⁶) likely resulted from multicollinearity (correlation = 0.84), which prompted further investigation. To address this, a Lasso Regression model was implemented. At the optimal penalty (lambda.min), both predictors were retained. However, at the more conservative lambda.1se, Lasso zeroed out “search_florida,” favoring a simpler model that prioritized only “search_beach”—demonstrating the model’s ability to automatically resolve multicollinearity and preserve only the strongest predictive signal.

A key strength of this analysis lies in its use of real-time, publicly available Google Trends data to forecast visitor volume in a major tourism economy. Future enhancements could involve incorporating complementary variables such as weather patterns, airfare costs, or major event calendars to boost predictive accuracy. The methodology presented here can also be applied to other regions or industries where digital behavior reflects real-world activity, offering a scalable foundation for data-driven planning.
