Chapter 12 Data Categorization
Before we delve into various economic indicators, it is essential to build a vocabulary of how data is categorized. This categorization is primarily based on the type of measurement and the dimensions of data:
- Qualitative vs. Quantitative: This refers to the nature of the data being collected, whether it’s descriptive (qualitative) or numerical (quantitative).
- Discrete vs. Continuous: This classification is based on the quantifiable values that the data can take on, either as distinct and countable (discrete) or as an infinite range of values within a certain scale (continuous).
- Levels of Measurement: Data can also be classified based on the scale of measure it uses, such as nominal, ordinal, interval, or ratio.
- Index vs. Absolute Data: This categorization depends on whether the data values are meaningful in relative terms or absolute terms. Index data is meaningful in relation to other values of the same index, whereas absolute data values have meaning on their own, independently of other values.
- Stock vs. Flow: This dichotomy classifies data based on whether they represent a quantity at a specific point in time (stock) or a rate over a period (flow).
- Data Dimensions: This categorization refers to the way data is organized across temporal and spatial dimensions, including cross-sectional, time-series, panel, spatial, and clustered types of data.
Understanding these categories will allow us to more effectively interpret and analyze economic indicators.
12.1 Qualitative vs. Quantitative
In the world of data analysis and research, data is typically classified into two broad categories: Qualitative and Quantitative.
Qualitative data, often called categorical data, refers to non-numerical information that expresses the descriptive, subjective, or explanatory attributes of the variables under study. This data type includes factors such as colors, gender, nationalities, opinions, or any other attribute that does not have a natural numerical representation. In other words, qualitative data deals with characteristics and descriptors that can’t be easily measured, but can be observed subjectively.
On the other hand, quantitative data is numerical and lends itself to mathematical and statistical manipulation. This data type involves numbers and things measurable in a quantifiable way. Examples of quantitative data include height, weight, age, income, temperature, or any variable that can be measured or counted. Quantitative data forms the backbone of any statistical analysis and tends to be more structured than qualitative data.
As an example, we can consider the Affairs dataset from the AER (Applied Econometrics with R) package, which is utilized for teaching and learning applied econometrics.
The Affairs dataset comes from a study by Ray C. Fair, published in the Journal of Political Economy in 1978 under the title "A Theory of Extramarital Affairs". The study attempted to model the time spent in extramarital affairs as a function of certain background variables.
# Load the AER package
library("AER")
# Load the Affairs dataset
data("Affairs", package = "AER")
# Display a few rows of the Affairs dataset
tail(Affairs)
## affairs gender age yearsmarried children religiousness education
## 1935 7 male 47 15.0 yes 3 16
## 1938 1 male 22 1.5 yes 1 12
## 1941 7 female 32 10.0 yes 2 18
## 1954 2 male 32 10.0 yes 2 17
## 1959 2 male 22 7.0 yes 3 18
## 9010 1 female 32 15.0 yes 3 14
## occupation rating
## 1935 4 2
## 1938 2 5
## 1941 5 4
## 1954 6 5
## 1959 6 2
## 9010 1 5
This dataset comprises 601 observations on nine variables, a mixture of four qualitative and five quantitative variables:
- affairs: Quantitative, the amount of time spent in extramarital affairs.
- gender: Qualitative, the gender of the respondent.
- age: Quantitative, the age of the respondent.
- yearsmarried: Quantitative, the number of years married.
- children: Qualitative, whether the respondent has children or not.
- religiousness: Qualitative, a rating of religiousness from 1 (not religious at all) to 5 (very religious).
- education: Quantitative, the level of education in years of schooling.
- occupation: Qualitative, the category of the respondent's occupation.
- rating: Quantitative, a self-rating of the marriage from 1 (very unhappy) to 5 (very happy).
Although certain qualitative variables are numerically coded, they remain qualitative in nature. They signify varying categories instead of directly measuring an attribute.
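For instance, religiousness is stored as a number, but converting it to a factor makes its categorical nature explicit:
# religiousness is numerically coded; treat the codes as categories
religiousness_factor <- factor(Affairs$religiousness)
levels(religiousness_factor)
## [1] "1" "2" "3" "4" "5"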
12.2 Discrete vs. Continuous
Quantitative data can be further divided into two categories: discrete and continuous. The distinction between the two lies in the nature of the values or observations that the variables can assume.
Discrete data, as the name suggests, can only take certain, separated values. These values are distinct and countable, and there are no intermediate values between any two given values. Examples of discrete data include the number of students in a class, the number of cars in a parking lot, or any other count that cannot be divided into smaller parts.
Contrarily, continuous data can take any value within a given range or interval. Here, the data points are not countable as they may exist at any point along a continuum within a given range. This means that the variables have an infinite number of possibilities. Examples of continuous data include the height of a person, the weight of an animal, or the time it takes to run a race. In these examples, the measurements can be broken down into smaller units, and a great number of precise values are possible.
For instance, let's examine the mtcars dataset that comes with the R programming language. This dataset provides information on various attributes of 32 automobiles from the 1970s.
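# Display the first few rows of the mtcars dataset
head(mtcars)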
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
One of the variables in this dataset, mpg (miles per gallon), is an example of a continuous variable. Continuous variables such as mpg can take on any value within a range. In this context, a car's fuel efficiency can theoretically take any non-negative value. This makes mpg a continuous variable.
On the other hand, the cyl variable, representing the number of cylinders in a car's engine, is a discrete variable. Discrete variables can only take certain, distinct values. In the case of cyl, cars have a whole number of cylinders, usually 4, 6, or 8. This characteristic makes cyl a discrete variable.
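Tabulating cyl confirms that it takes only a few distinct values:
# Count how many cars have each number of cylinders
table(mtcars$cyl)
## 
##  4  6  8 
## 11  7 14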
12.3 Levels of Measurement
Qualitative and quantitative data can be classified into four levels of measurement: nominal, ordinal, interval, and ratio. These levels represent an increasing degree of precision in measurement. The first two levels, nominal and ordinal, are applicable to qualitative data, while the last two levels, interval and ratio, are relevant to (discrete or continuous) quantitative data.
12.3.1 Nominal Scale Variables
Nominal scale variables are a type of categorical variable that represent distinct groups or categories without an inherent order or ranking. These variables essentially “name” the attribute and don’t possess any quantitative significance. In the field of economics, an example of a nominal variable could be the industry sector of a company, such as technology, healthcare, manufacturing, and so on. These are distinct categories with no implied order or priority.
In R, nominal variables can be represented using the factor() function. Here's an example of how this can be done:
# Create a factor vector to represent industry sectors
sectors <- factor(c("technology", "healthcare", "manufacturing", "technology", "manufacturing"),
levels = c("technology", "healthcare", "manufacturing"))
# Display the created factor vector
print(sectors)
## [1] technology healthcare manufacturing technology manufacturing
## Levels: technology healthcare manufacturing
In the above code snippet, the factor() function is used to convert a character vector into a factor vector, with the distinct industry sectors specified as the levels. This approach can enhance data analysis efficiency, as R internally assigns a distinct integer to each level. Hence, when dealing with large datasets, R can quickly refer to these numerical assignments instead of the corresponding character strings, improving processing speed.
The levels() function can be used to extract the defined levels of a factor vector, while the as.numeric() function can convert the factor levels to their corresponding numeric codes, as shown below:
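# Extract the defined levels of the factor
levels(sectors)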
## [1] "technology" "healthcare" "manufacturing"
## [1] 1 2 3 1 3
In the output of this code, the levels() function displays the distinct industry sectors, while the as.numeric() function presents the numerical code assigned to each sector in the original vector.
12.3.2 Ordinal Scale Variables
Ordinal scale variables, like nominal variables, are categorical. However, their categories have a clear ordering. An example in economics might be a credit rating (e.g., AAA, AA, A, BBB, BB). Here's an example:
# Creating a factor with ordered levels
credit_rating <- factor(c("AAA", "AA", "AAA", "BBB", "BB"),
levels = c("C", "B", "BB", "BBB", "A", "AA", "AAA"),
ordered = TRUE)
print(credit_rating)
## [1] AAA AA AAA BBB BB
## Levels: C < B < BB < BBB < A < AA < AAA
In this code, the ordered = TRUE argument tells R that the levels should be considered in a specific order.
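Because the levels are ordered, comparison operators on the factor become meaningful (they would not be for an unordered factor):
# Which ratings are at least investment grade (BBB or better)?
credit_rating >= "BBB"
## [1]  TRUE  TRUE  TRUE  TRUE FALSE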
12.3.3 Interval Scale Variables
Interval scale variables are a type of numerical variable where the distance between two values is meaningful. However, they lack a true zero point. An example in economics is a credit score: the difference between a score of 800 and a score of 700 is the same as the difference between 700 and 600, but a score of 0 does not imply the absence of creditworthiness; it is just a lower bound.
Because interval data lacks an absolute zero point, comparisons of magnitude using ratios are not meaningful. A person with a credit score of 800 is not twice as creditworthy as someone with a score of 400.
# Creating a numeric vector of credit scores
credit_scores <- c(800, 750, 700, 650, 600)
print(credit_scores)
## [1] 800 750 700 650 600
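Differences on an interval scale are meaningful even though ratios are not; for instance:
# Differences between consecutive credit scores are directly comparable
diff(credit_scores)
## [1] -50 -50 -50 -50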
Another example of an interval scale variable is temperature when measured in Celsius or Fahrenheit. For instance, 20 degrees Celsius is not twice as hot as 10 degrees Celsius. Furthermore, 0 degrees Celsius doesn’t imply the absence of temperature. Thus, the Celsius scale is an interval scale where the differences between values are meaningful, but it doesn’t have a true zero point or a ratio relationship between different temperatures.
12.3.4 Ratio Scale Variables
Ratio scale variables, like interval scale variables, are numeric, but they have a clear definition of zero. When the value of a ratio scale variable is zero, it means the absence of the quantity. Examples in economics include income, annual sales, market share, unemployment rate, and GDP. When GDP is zero, it means that no output was produced in that country during the year.
# Creating a numeric vector of GDP values in billions of dollars
gdp_billions <- c(19362, 21195, 22675, 21433, 22770)
print(gdp_billions)
## [1] 19362 21195 22675 21433 22770
The presence of a meaningful zero point allows us to make valid comparisons using ratios. For instance, one country’s GDP being twice as large as another’s is a meaningful statement.
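With the GDP vector above, such ratio comparisons are straightforward:
# Each year's GDP relative to the first year
round(gdp_billions / gdp_billions[1], 2)
## [1] 1.00 1.09 1.17 1.11 1.18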
Under normal circumstances, ratio scale variables cannot assume negative values, as this would imply a quantity less than nothing, which is nonsensical. Variables such as age, height, weight, or income are all examples of ratio scale variables as they can be compared using ratios (someone can be twice as old or earn thrice as much income as someone else), and their zero points are meaningful (zero age means no time has passed since birth, zero height or weight means an absence of these attributes, and zero income means no money has been earned). Therefore, negative values for these variables would not make sense.
However, there are certain exceptions where variables typically considered ratio scale variables can indeed be negative. These exceptions include instances such as net income, inflation, and interest rates. Here, a “negative” value does not imply less than nothing but rather denotes a particular condition or state. Negative net income reflects a state of indebtedness, negative inflation indicates a deflation, and negative interest rates imply a penalty for storing money at a bank. Despite being negative, these values can still be meaningfully compared to each other using ratios. For instance, 4% inflation is twice as high as 2% inflation, and deflation of 2% (equivalent to -2% inflation) is twice as high as deflation of 1% (or -1% inflation). However, one cannot meaningfully compare positive inflation to deflation using ratios. For example, 4% inflation cannot be compared using ratios to 2% deflation.
12.4 Index vs. Absolute Data
The field of economics and finance often uses two types of data for analysis: index data and absolute data. Unlike absolute data, which provides valuable insights on its own, index data is only meaningful when considered in relative terms.
12.4.1 Index Data
Index data, often referred to as indices or indexes (not to be confused with indicators), are measures whose values have no meaning on their own. An index value becomes meaningful only in comparison with the value of the same index at a different time period.
Stock market indices are prime examples of index data. These indices combine various stock prices into a normalized index. As part of this process, an arbitrary value, such as 100, is assigned to the index at a given point in time (e.g., 2010). If the index subsequently rises to 110, this indicates a 10% increase in the total price of the underlying stocks. However, without the context provided by the previous index value (100), the new value (110) lacks meaningful interpretation.
The reason for creating an index lies in the complexity of making raw values meaningful on their own. Simply adding all share prices together wouldn’t provide valuable insights, as the share price of a particular company may be much higher than others due to arbitrary factors like the number of shares they’ve issued. Instead, stock market indices typically aggregate share prices by normalizing their contributions to the index based on factors such as the size of the company. During this aggregation process, the relative weights of the stocks carry importance, but the sum of these weights — whether it’s 1, 100, or 34 — is irrelevant. This results in an index that lacks inherent meaning on its own, and thus, it only carries significance in relative terms. After completing this aggregation process, the index is standardized to an arbitrary value, often set at 100 at a specific moment in time.
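As an illustration, the following sketch builds a simple weighted price index from made-up prices and weights and rebases it to 100 in the first period; the specific numbers are hypothetical:
# Hypothetical prices of three stocks (columns) over four periods (rows)
prices <- matrix(c(10, 11, 12, 11,
                   50, 52, 51, 55,
                   30, 29, 33, 36),
                 nrow = 4,
                 dimnames = list(paste0("t", 1:4), c("A", "B", "C")))
# Hypothetical weights; only their ratios matter, not their sum
weights <- c(A = 0.5, B = 0.3, C = 0.2)
# Weighted aggregate price in each period
aggregate_price <- prices %*% weights
# Rebase so the index equals 100 in the first period
round(100 * aggregate_price / aggregate_price[1], 1)
##     [,1]
## t1 100.0
## t2 103.5
## t3 107.3
## t4 112.3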
12.4.2 Absolute Data
On the other hand, absolute data refers to data that has inherent meaning on its own, without needing comparison to other data.
A fitting example of absolute data is the Gross Domestic Product (GDP), which measures the total market value of all finished goods and services produced within a country's borders over a specific time period. A GDP value of, say, $24,000 billion provides a self-explanatory measurement: it indicates that, during that year, the country produced goods and services worth $24,000 billion, irrespective of previous values.
In summary, index data offers relative comparisons that highlight trends and changes over time, while absolute data provides measurements that are meaningful on their own.
12.5 Stock vs. Flow
Economic and financial data can be broadly categorized into two types of variables: stock and flow.
12.5.1 Stock Variables
A stock variable represents a quantity measured at a single specific point in time; in a sense, it provides a snapshot of a specific moment. Here are examples of stock variables:
- Wealth: The total accumulation of assets a person owns at a given point in time.
- Debt: The total amount owed by a person, business, or country at a certain moment.
- Population: The total number of people in a country or region at a given time.
- Capital Stock: The total value of assets held by a firm at a point in time.
- Unemployment Level: The total number of people unemployed at a specific point in time.
- Inventory: The total amount of goods a company has in stock at a given time.
- Reserves: The total amount of currency held by a central bank at a specific time.
- Households’ savings: The total amount of unconsumed income of a household at a point in time.
12.5.2 Flow Variables
A flow variable, on the other hand, is measured over an interval of time. Therefore, a flow would be measured per unit of time (say a year or a month). Here are examples of flow variables:
- Income: Money that an individual or business receives over a period of time.
- Spending: Money spent by individuals or businesses over a certain period.
- Production: The total amount of goods and services produced over a time period.
- Consumption: The total goods and services consumed over a period of time.
- Investment: Money invested by a business over a time period.
- Imports and Exports: Goods and services brought into or sent out of a country over a time period.
- Government spending: Total expenditure by the government over a certain period.
- Changes in inventory: The difference in inventory levels between two points in time.
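The two concepts are linked: a stock at the end of a period equals the stock at the beginning plus the cumulated flows in between. A minimal sketch with hypothetical numbers:
# Stock at the initial point in time
inventory_start <- 100
# Flow: net change in inventory during each month
monthly_change <- c(5, -3, 8, -2)
# Stock at the end of each month: initial stock plus cumulated flows
inventory_start + cumsum(monthly_change)
## [1] 105 102 110 108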
12.5.3 Accounting
The concepts of stock and flow are also evident in the financial statements of companies. The Balance Sheet presents the stock of what a company owns (assets) and owes (liabilities) at a specific point in time, while the Income Statement depicts the flow of revenues and expenses that result in net income or loss over a particular period. This clear separation aids in understanding a company’s financial position (stock) and performance (flow) separately, helping various stakeholders, including investors, creditors, and management, make informed decisions.
In conclusion, understanding the difference between stock and flow variables is crucial in economics, finance, and accounting, as it provides insights into a wide range of issues, from personal financial planning to the evaluation of a country’s economic performance or a company’s financial health.
12.6 Data Dimensions
Understanding the dimensions of a dataset is crucial in data analysis and econometrics. The dimensions of a dataset refer to how the observations are organized across different dimensions such as time and space.
There are five principal arrangements of data across dimensions in economic data:
- Cross-Sectional Data: observations on multiple subjects at a single point in time.
- Time Series Data: observations indexed by time.
- Panel Data: observations on multiple subjects over multiple time periods.
- Spatial Data: observations indexed by geographical location.
- Clustered Data: observations grouped into clusters that may be internally correlated.
Each type of data arrangement presents its own unique characteristics, challenges, and opportunities for data analysis.
12.6.1 Cross-Sectional Data
Cross-sectional data refer to information collected from multiple subjects at a specific point in time. In such data, each row represents a unique observation or individual. Examples of cross-sectional data are surveys and administrative records of persons, households, or firms. When analyzing this data, the standard assumption is that observations are independent, meaning that the ordering of observations does not matter.
Consider the Ecdat package in R, which contains numerous datasets used in econometrics textbooks. A good example of cross-sectional data is the Wages dataset from the Ecdat package. The dataset provides information about workers' wages and relevant characteristics.
# Load the Ecdat package
library(Ecdat)
# Load the Wages dataset
data(Wages)
# Display the first few rows of the Wages dataset
head(Wages)
## exp wks bluecol ind south smsa married sex union ed black lwage
## 1 3 32 no 0 yes no yes male no 9 no 5.56068
## 2 4 43 no 0 yes no yes male no 9 no 5.72031
## 3 5 40 no 0 yes no yes male no 9 no 5.99645
## 4 6 39 no 0 yes no yes male no 9 no 5.99645
## 5 7 42 no 1 yes no yes male no 9 no 6.06146
## 6 8 35 no 1 yes no yes male no 9 no 6.17379
In the Wages dataset, each row corresponds to a unique worker, providing details such as their education, work experience, region, gender, and union membership. The order of workers does not affect the interpretation of the data, thereby affirming the dataset's cross-sectional nature.
Another example of cross-sectional data is the Hdma dataset from the Ecdat package. The Hdma dataset comprises information regarding home loans in Boston, as gathered under the Home Mortgage Disclosure Act (HMDA).
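# Load the Hdma dataset
data(Hdma)
# Display the first few rows of the Hdma dataset
head(Hdma)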
## dir hir lvr ccs mcs pbcr dmi self single uria comdominiom black
## 1 0.221 0.221 0.8000000 5 2 no no no no 3.9 0 no
## 2 0.265 0.265 0.9218750 2 2 no no no yes 3.2 0 no
## 3 0.372 0.248 0.9203980 1 2 no no no no 3.2 0 no
## 4 0.320 0.250 0.8604651 1 2 no no no no 4.3 0 no
## 5 0.360 0.350 0.6000000 1 1 no no no no 3.2 0 no
## 6 0.240 0.170 0.5105263 1 1 no no no no 3.9 0 no
## deny
## 1 no
## 2 no
## 3 no
## 4 no
## 5 no
## 6 no
In the Hdma dataset, each row represents a loan applicant, including details about the applicant's income, gender, race, loan specifications, and the decision made by the bank. As the ordering of applicants does not affect data interpretation, this dataset also qualifies as cross-sectional data.
Finally, consider the built-in state.x77 dataset in R, sourced from the 1977 Statistical Abstract of the United States. The dataset consists of various economic and demographic measures of the U.S. states.
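# Display the first few rows of the state.x77 dataset
head(state.x77)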
## Population Income Illiteracy Life Exp Murder HS Grad Frost Area
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
## California 21198 5114 1.1 71.71 10.3 62.6 20 156361
## Colorado 2541 4884 0.7 72.06 6.8 63.9 166 103766
The state.x77 dataset comprises 50 rows and 8 columns, each row corresponding to one of the 50 US states. The columns represent a variety of measurements, including population, income, illiteracy rate, life expectancy, murder rate, high school graduation rate, frost frequency, and land area. The data is ordered alphabetically by state name, but changing the order would not affect interpretation, making this a cross-sectional dataset. Despite the geographic attributes of each state, the absence of specific geographic information (such as latitude and longitude) qualifies this dataset as cross-sectional rather than spatial data.
12.6.2 Time Series Data
Time series data are observations that are indexed by time. Each row in a time series dataset typically represents an observation at a distinct time point. Examples of this type of data are annual GDP, daily interest rates, and the hourly share price of Tesla. The key characteristic of time series data is that the ordering of observations does matter, and observations are not assumed to be independent of one another.
The datasets package in R, which is automatically installed and loaded with R, includes several built-in time series datasets that are commonly used for illustrative purposes. Consider the AirPassengers dataset as an example:
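# Print the monthly AirPassengers time series
AirPassengers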
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1949 112 118 132 129 121 135 148 148 136 119 104 118
## 1950 115 126 141 135 125 149 170 170 158 133 114 140
## 1951 145 150 178 163 172 178 199 199 184 162 146 166
## 1952 171 180 193 181 183 218 230 242 209 191 172 194
## 1953 196 196 236 235 229 243 264 272 237 211 180 201
## 1954 204 188 235 227 234 264 302 293 259 229 203 229
## 1955 242 233 267 269 270 315 364 347 312 274 237 278
## 1956 284 277 317 313 318 374 413 405 355 306 271 306
## 1957 315 301 356 348 355 422 465 467 404 347 305 336
## 1958 340 318 362 348 363 435 491 505 404 359 310 337
## 1959 360 342 406 396 420 472 548 559 463 407 362 405
## 1960 417 391 419 461 472 535 622 606 508 461 390 432
The AirPassengers dataset includes monthly totals of international airline passengers from 1949 to 1960. Each observation corresponds to a particular month, and the order of the data matters due to potential temporal dependencies, such as seasonality or trends.
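As a quick check on seasonality, we can average the series by calendar month; the summer months stand out clearly:
# Average number of passengers per calendar month across all years
round(tapply(AirPassengers, cycle(AirPassengers), mean))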
Another example of a time series dataset is EuStockMarkets, which includes daily closing prices of major European stock indices from 1991 to 1998:
# Load EuStockMarkets dataset
data(EuStockMarkets)
# Display the first few rows, and include the Date variable
head(data.frame(Date = time(EuStockMarkets), EuStockMarkets))
## Date DAX SMI CAC FTSE
## 1 1991.496 1628.75 1678.1 1772.8 2443.6
## 2 1991.500 1613.63 1688.5 1750.5 2460.2
## 3 1991.504 1606.51 1678.6 1718.0 2448.2
## 4 1991.508 1621.04 1684.1 1708.1 2470.4
## 5 1991.512 1618.16 1686.6 1723.1 2484.7
## 6 1991.515 1610.61 1671.6 1714.3 2466.8
In this dataset, each row represents a trading day, and the order of the days is crucial to understanding trends and volatilities in the stock markets.
Lastly, consider the longley dataset, available in R's built-in datasets:
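# Display the first few rows of the longley dataset
head(longley)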
## GNP.deflator GNP Unemployed Armed.Forces Population Year Employed
## 1947 83.0 234.289 235.6 159.0 107.608 1947 60.323
## 1948 88.5 259.426 232.5 145.6 108.632 1948 61.122
## 1949 88.2 258.054 368.2 161.6 109.773 1949 60.171
## 1950 89.5 284.599 335.1 165.0 110.929 1950 61.187
## 1951 96.2 328.975 209.9 309.9 112.075 1951 63.221
## 1952 98.1 346.999 193.2 359.4 113.270 1952 63.639
The longley dataset includes annual observations on six macroeconomic variables for the United States, spanning from 1947 to 1962. Each row stands for a year, and the ordering of the years carries substantial information about the economic progression of the country.
Aggregate economic data such as the longley dataset is typically available at low frequency (annual, quarterly, or perhaps monthly), so the sample size is quite small, while financial data such as the EuStockMarkets dataset is typically available at high frequency (weekly, daily, hourly, or per transaction).
In all of these examples, the order of observations matters significantly because it allows for the analysis of trends, cycles, and other time-dependent structures within the data. Proper handling of these elements is crucial for effective time series analysis. This often involves the application of statistical methods designed specifically for time series data, such as autoregressive integrated moving average (ARIMA) models and vector autoregression (VAR) models. These models take into account the temporal dependencies and provide a more reliable analysis of the data.
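As a minimal sketch, the classic "airline" ARIMA specification can be fitted to the log of the AirPassengers series with base R; the chosen orders are illustrative rather than the product of a model-selection exercise:
# Fit a seasonal ARIMA(0,1,1)(0,1,1)[12] model to the log of the series
fit <- arima(log(AirPassengers),
             order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))
fit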
12.6.3 Panel Data
Panel data, also known as longitudinal data or cross-sectional time series data, combines elements of both cross-sectional and time series data. In panel data, observations are collected on multiple entities (such as individuals, households, firms, or countries) over multiple time periods. This data structure provides unique opportunities to analyze both individual and time effects, as well as their interaction.
To illustrate panel data, we can explore some examples using R datasets.
One example is the Produc dataset, which is available in the plm package:
# Load the plm package
library("plm")
# Load the Produc dataset
data("Produc", package = "plm")
# Display a few rows of the Produc dataset
Produc[15:20, ]
## state year region pcap hwy water util pc gsp emp
## 15 ALABAMA 1984 6 19257.47 8655.94 2235.16 8366.37 59446.86 45118 1387.7
## 16 ALABAMA 1985 6 19433.36 8726.24 2253.03 8454.09 60688.04 46849 1427.1
## 17 ALABAMA 1986 6 19723.37 8813.24 2308.99 8601.14 61628.88 48409 1463.3
## 18 ARIZONA 1970 8 10148.42 4556.81 1627.87 3963.75 23585.99 19288 547.4
## 19 ARIZONA 1971 8 10560.54 4701.97 1627.34 4231.23 24924.82 21040 581.4
## 20 ARIZONA 1972 8 10977.53 4847.84 1614.58 4515.11 26058.65 23289 646.3
## unemp
## 15 11.0
## 16 8.9
## 17 9.8
## 18 4.4
## 19 4.7
## 20 4.2
The Produc dataset contains state-level data for the United States from 1970 to 1986. It includes variables such as gross state product, capital stock, employment, and unemployment. With data from 48 states, each state having 17 yearly observations, this dataset combines both cross-sectional and time-series components. This panel structure allows for analyzing the relationship between these variables over time, while also accounting for heterogeneity across states.
Another example is the EmplUK dataset from the plm package:
# Load the EmplUK dataset
data("EmplUK", package = "plm")
# Display a few rows of the EmplUK dataset
EmplUK[19:24, ]
## firm year sector emp wage capital output
## 19 3 1981 7 19.570 24.8714 6.2136 99.5581
## 20 3 1982 7 18.125 24.8447 5.7146 98.6151
## 21 3 1983 7 16.850 28.9077 7.3431 100.0301
## 22 4 1977 8 26.160 14.8283 8.4902 118.2223
## 23 4 1978 8 26.740 14.8379 8.7420 120.1551
## 24 4 1979 8 27.280 14.8756 9.1869 118.8319
The EmplUK dataset contains information on employment in different industries in the United Kingdom from 1977 to 1983. It includes variables such as employment, wages, and output. This dataset allows for analyzing how employment and wages vary across industries and over time.
In panel data analysis, various econometric techniques can be applied to account for time-varying effects, individual-specific heterogeneity, and potential endogeneity. Common methods include fixed effects models, random effects models, and dynamic panel data models. These approaches help uncover relationships that may be obscured in cross-sectional or time series analysis alone.
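As a minimal sketch, a fixed effects ("within") model can be estimated on the Produc data with the plm package; the production-function specification below is illustrative:
# Fixed effects regression of log gross state product on inputs
fe <- plm(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp,
          data = Produc, index = c("state", "year"), model = "within")
summary(fe)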
12.6.4 Spatial Data
Spatial data are observations that are indexed by geographical locations or coordinates. Each row in a spatial dataset typically represents an observation at a distinct spatial unit. Examples of this type of data are housing prices across different districts, climate data from different weather stations, and population densities across various regions. The key characteristic of spatial data is that the geographical arrangement of observations does matter, and observations are not assumed to be independent of one another, often exhibiting spatial autocorrelation where observations close in space tend to be more similar.
The house dataset, available in the spdep package, serves as a fitting example for spatial data analysis:
# Load the spdep package
library("spdep")
# Load the house dataset
data(house)
# Display a few rows and columns of the house dataset
head(house[, 1:6])
## coordinates price yrbuilt stories TLA wall beds
## 1 (484668.1, 195270.3) 303000 1978 one+half 3273 partbrk 4
## 2 (484875.6, 195301.3) 92000 1957 one 920 metlvnyl 2
## 3 (485248.4, 195353.8) 90000 1937 two 1956 stucdrvt 3
## 4 (485764.2, 196177.5) 330000 1887 one+half 1430 wood 4
## 5 (488149.8, 196591.4) 185000 1978 two 2208 partbrk 3
## 6 (485525.7, 196660) 100000 1989 one 1232 wood 1
The house dataset consists of details pertaining to 25,357 single-family homes sold in Lucas County, Ohio, between 1993 and 1998. It includes particulars such as sale prices, locations, and other relevant features. Additionally, the dataset records the location of each house in the coordinates column, with the first value giving the east-west (x) coordinate and the second the north-south (y) coordinate. Researchers can employ these coordinates to conduct spatial analyses, examining housing market dynamics in Lucas County while taking into account spatial autocorrelation and spatial dependence.
It’s important to note that the regional datasets previously discussed in the context of cross-sectional and panel data can also be classified as spatial data. To perform spatial data analysis on these regional datasets, one needs to merge them with another dataset that includes GPS or longitude and latitude data for each region. This geographical data is often available online; for instance, one can find the longitude and latitude for each U.S. state or county with relative ease.
By integrating geographic information and taking into account spatial dependencies, economists can improve the accuracy of their empirical results, gaining deeper insights into the influence of geographical factors on economic outcomes. Common methods include spatial lag models, spatial error models, and spatial autoregressive models. These approaches help reveal relationships that may be hidden when using cross-sectional or time series analysis alone.
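As a minimal sketch, assuming the house data is loaded as above, we can build a spatial weights matrix from each house's nearest neighbours and test sale prices for spatial autocorrelation with Moran's I:
# Build spatial weights from the 5 nearest neighbours of each house
coords <- coordinates(house)
nb <- knn2nb(knearneigh(coords, k = 5))
lw <- nb2listw(nb)
# Test for spatial autocorrelation in sale prices
moran.test(house$price, lw)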
12.6.5 Clustered Data
Clustered data is a form of data that may resemble panel or spatial data, as it groups observations based on criteria such as geographical location, time period, or subject. This clustering implies that observations within each group are likely to exhibit a degree of correlation, meaning they are more likely to resemble each other than observations in different clusters.
However, the approach to clustered data contrasts with panel or spatial data in an important way: in clustered data, the specific nature of the correlation within each cluster is typically not explicitly modeled. For example, while we might expect houses sold within the same county (a cluster) to be more similar to one another than houses sold across different counties, we do not explicitly model the relationships among the houses within each county; consequently, the order of observations within clusters does not matter.
On the other hand, panel data has a time-dependent structure as it involves repeated observations collected from the same subjects over time. Similarly, spatial data has a location-dependent structure, where observations are based on their geographical locations. In both these cases, the dependencies within the data are explicitly incorporated into the data analysis.
Consider the EmplUK dataset as an example of clustered data. This dataset contains panel data on employment and wages for different industries in the United Kingdom from 1977 to 1983, as discussed earlier in the panel data section:
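# Display a few rows of the EmplUK dataset
EmplUK[19:25, ]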
## firm year sector emp wage capital output
## 19 3 1981 7 19.570 24.8714 6.2136 99.5581
## 20 3 1982 7 18.125 24.8447 5.7146 98.6151
## 21 3 1983 7 16.850 28.9077 7.3431 100.0301
## 22 4 1977 8 26.160 14.8283 8.4902 118.2223
## 23 4 1978 8 26.740 14.8379 8.7420 120.1551
## 24 4 1979 8 27.280 14.8756 9.1869 118.8319
## 25 4 1980 8 27.830 15.2332 9.4036 111.9164
The EmplUK dataset is clustered by industry, suggesting that observations within the same industry may exhibit similar employment patterns or be influenced by industry-specific factors. Depending on the research question and context, the EmplUK dataset could be treated as either panel data or clustered data.
When analyzing the EmplUK dataset to understand employment and wage variations across different industries over time, treating it as panel data is appropriate. In this context, each industry is a distinct unit in the panel data framework, and using panel data models allows researchers to study variations in employment and wages over time and across industries, accounting for both time effects and industry-specific effects.
In a different context, the EmplUK dataset could be treated as clustered data. Here, the interest lies in understanding the variation and potential correlations within each industry. The clusters here are the industries themselves, and observations within each industry might be more similar to each other due to industry-specific factors such as common economic conditions, regulations, or labor market dynamics. Treating the data as clustered acknowledges the correlation within each cluster (industry) but does not model this correlation explicitly.
Hence, the choice between treating the EmplUK dataset as panel data or clustered data depends on the research question.
When working with clustered data, it is vital to use statistical techniques that account for the correlation within clusters. Common techniques include the use of cluster-robust standard errors, fixed effects models, or random effects models.
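As a minimal sketch, cluster-robust standard errors can be computed with the sandwich and lmtest packages (assumed to be installed); the regression specification here is purely illustrative:
library("sandwich")
library("lmtest")
# An illustrative firm-level regression on the EmplUK data
ols <- lm(log(emp) ~ log(wage) + log(capital), data = EmplUK)
# Re-test the coefficients with standard errors clustered by firm
coeftest(ols, vcov = vcovCL(ols, cluster = ~ firm))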
12.7 Conclusion
In this chapter, we’ve developed a comprehensive vocabulary for understanding and categorizing data. The classifications include whether the data is described in words or numbers (qualitative or quantitative), the quantifiable values the data can assume (discrete or continuous), the scale of measurement (nominal, ordinal, interval, or ratio), its relational form (index vs. absolute data), its temporal nature (stock or flow), and its dimensions (cross-sectional, time-series, panel, spatial, and clustered). Together, these classifications provide us with a robust framework to approach and interpret different types of data that we encounter in economic analysis.
Each of these categories has unique characteristics that require specific analytical methods and considerations. For example, time-series data are analyzed differently from cross-sectional data, considering their inherent temporal order and potential autocorrelation. Similarly, data transformations, such as logarithmic or difference transformations, may be more applicable or meaningful to certain types of data than others.
In the following chapters, we will apply this vocabulary to economic indicators and the transformations thereof. We will discuss how each economic indicator fits into these categories and how specific data transformations can enhance our understanding of these indicators. Remember, the appropriate transformation depends on the specific type of data and the research question at hand, so our newly established vocabulary will prove invaluable.