Unit 3: Construct validity

Daniel Muñoz Acevedo, 25 May 2020, 19:17 hrs.

So far, we have seen that it is important that an assessment process, a test or any measurement instrument is valid, that is to say, that it measures/ evaluates what it is intended to measure. Now the question is: how do we know that an instrument or test is measuring what it is intended to measure?

Short answer: through several related but different sources of evidence that, together, provide test validity.

The sources of validity in a test are normally referred to as types of validity. All of them are sources of information that allow us to judge whether a test is testing what it is supposed to be testing. However, one of the main sources of validity is not a type of validity but a different characteristic altogether. We will return to it at the end of the post, as a kind of cliffhanger.

So, to validity now. To know that a test is measuring what it is supposed to measure, we first need to understand what is being measured. In the first post in this blog, we observed the example of weather. So, to know that a particular procedure and/or instrument is actually measuring the weather, we need to know first what the weather is. Sounds simple, but it is not.

In the case of weather, a simple revision of the concept yields a very, very complex set of characteristics and properties. Weather seems to be a compound of different phenomena like moisture, temperature, wind, etc. So any instrument/ procedure we use to evaluate/ measure weather should be evaluating/ measuring those phenomena. To make things even more interesting, when we start looking at the measurement of things like wind, humidity or air pressure, we will find that those are also complex phenomena constituted by many characteristics and properties. And so on and so forth.

The problem of measuring/ evaluating phenomena is, therefore, to first have an idea of what it is that we want to measure. Sometimes that idea can be defined simply: in daily life, the weather can be defined as temperature and/ or probability of raining. For other purposes, of course, the definition is a lot more complex: for meteorologists, the weather can be a compound of multiple interrelated measurements. Remember the example of the mercury thermometer we discussed in our last chat? Well, same thing there. We need to have some comprehension of the phenomenon of temperature in order to be able to devise an instrument that allows us to measure it.

The thermometer is a very good example of how our understanding of the phenomenon to be measured is essential to develop a good measurement instrument for that phenomenon. For example, Galileo's thermometer, basically a tube containing water, was affected by both temperature and air pressure, the latter being a phenomenon different from the one it was meant to measure. Therefore, we need to know what temperature is in order to see whether the instrument we are using is actually measuring that phenomenon or is also affected by different phenomena.

So, the definition of the phenomenon to be evaluated is at the core of an assessment process. In the field of assessment, that definition is called a construct or theoretical construct. For a meteorologist, the constructs behind her instruments are ideas such as temperature, wind speed, humidity, weather, etc. The construct to be measured by a thermometer is that of temperature.

It is important to notice in this explanation that instruments and tests measure constructs rather than actual phenomena. This is because phenomena can be defined by the use of different theoretical constructs. This is more obvious when trying to measure non-physical phenomena such as motivation, aptitude or (guess what) language ability or performance. In cases such as those, the instruments that can be used to test their existence are completely dependent on how we define the phenomenon.

In the case of language ability, knowledge or performance, tests can look like a series of written drills to complete with correct grammatical forms, or they can look like a conversation about real-life issues with a classmate. Broadly speaking, the decision in this case depends mostly on whether we are defining the construct of language proficiency from a structuralist or a functionalist perspective, respectively. In the former case, we conceive of language as a set of rules of formation, normally grammatical or phonological. In the latter, we consider language as a tool to communicate. So, different constructs, different tests.

The main source of validity of a language test, therefore, lies in the capacity of that test to reflect the theoretical construct of language that we use to observe, understand and characterise the phenomenon of language. Unsurprisingly, this source of validity is generally named construct validity.

In order to check whether a test is valid, therefore, we need to first know what the construct that it intends to evaluate/ test is and then judge whether the instrument or procedure is actually capable of allowing the observation and evaluation of that construct.

We will have plenty of conversations about validity, as it has become the main perspective in the field of language assessment. For the moment, we can at least say that a test is not valid only when it evaluates with precision what it is intended to evaluate. That is only the beginning of the problems. A test is also valid when it is used in an appropriate decision-making process, when the stakeholders that use the test "believe" that it measures what it is intended to measure, and when the consequences of the decisions made based on the test are also what was intended. All of these are sources of validity different from construct validity.

Finally, a main source of validity is one characteristic of every test or assessment process: its reliability. We will learn in this seminar that it is very, very difficult to affirm that a test is valid if we do not show that its results are reliable or consistent. The relationship between validity and reliability, we will see, lies behind all the problems we normally find in the design and application of English L2 tests.

Related sources

Like all theoretical constructs, weather can be presented in different ways. Compare these two explanations, one for educated adults like us, and the other made for educated kids (also like us, in a sense).

Weather according to Wikipedia:
en.wikipedia.org/wiki/Weather

Weather according to National Geographic for kids:
www.nationalgeographic.org/ ... ncyclopedia/weather/

You can also check this video on the history of the thermometer: Fahrenheit to Celsius: History of the thermometer. Pay attention to all the problems of the early attempts that are directly related to construct validity (in this case, problems that got in the way of measuring temperature and only temperature).

Unit 2: Assessment validity and deciding what to wear

Daniel Muñoz Acevedo, 27 April 2020, 18:24 hrs.

Most of the things you are going to read about assessment will revolve around the idea of assessment validity. Validity is the property of an assessment procedure to measure what it is meant to measure. In simple terms, it means that a ruler is valid if it can measure distance, a watch is valid if it can measure time, an applause meter is valid if it can measure applause, a…. Well, you get the idea.

This reality of measurement and evaluation is simple in its presentation: rulers measure distance, watches measure time and applause meters, applause. Pretty obvious. However, as you may be expecting already, the problem of how to define and observe the validity of measures and evaluations is probably one of the most fascinating problems in the history of human thought.

No. I am not kidding. Not at all.

In this unit, we will start with the very simple basics of the problem of validity. Let us take a simple example of a process of assessment/ measurement/ evaluation from normal, day-to-day life. This is the scene: you have just woken up, had breakfast and pretty much done all the startup things we all do after we wake up. Now you need to decide on what to wear before going out (or not going out, as it seems to be the norm today). To that purpose, we are genetically endowed with quite a complex cognitive system. To make such a simple and quick decision, that system enters into [evaluation mode].

We can start by getting to know what the weather is like right now. As we know, there are plenty of ways to get to know that. Here are some examples:

We can watch the weather widget in our smart (and not so smart) devices.
We can take our heads out of a window to check what is going on outside.
We can listen to the weather forecast on the radio.
We can see if ants in the yard are building walls (Yep, for some people that used to work, too).

Based on the information you got, now you need to figure out what the weather is like right now and how it is going to be during the day, or at least for the period when you are going to be outside. To that purpose, we start making some comparisons. Things we may do to make that guess include things like this:

We can look at the forecast in our computer, then look at the sky from our window and see if they match.
We can take our heads out of the window and combine our perceptions of smell, temperature, appearance of the sky, etc. and see if the profile corresponds to a previous profile of similar weather.
We can look at how people are dressed and compare it to our mental data bank of clothes-weather matches.

Once we have formed a judgement of what the weather is going to be like, we are ready to make a decision. And decisions can go in several ways, too. For example,

You are sure about the weather today and thus you choose the clothing you think will be adequate for the day.
You are not so sure and so you pick stuff that may be adequate for different circumstances (the classic T-shirt and coat combo!).
You have no idea what the weather is going to be like, so you grab just whatever and hope for the best (sunburn and pneumonia can start in the same assessment process, as you can see).
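The three steps above (gathering evidence, comparing it, deciding) can be sketched as a toy program. This is only an illustration of the structure of the process; all the names and thresholds are hypothetical, not part of any real forecasting method.

```python
# Toy sketch of the assessment process described above:
# gather evidence, compare it, then decide. All names are hypothetical.

def gather_evidence():
    """Collect observations from different sources (app, window, ants...)."""
    return {"app_forecast": "rain", "sky_looks": "cloudy", "ants_building_walls": True}

def judge_weather(evidence):
    """Compare the pieces of evidence and produce a single value."""
    votes_for_rain = [
        evidence["app_forecast"] == "rain",
        evidence["sky_looks"] == "cloudy",
        evidence["ants_building_walls"],
    ]
    agreement = sum(votes_for_rain) / len(votes_for_rain)
    if agreement > 0.6:
        return "rain likely"
    elif agreement > 0.3:
        return "uncertain"
    return "rain unlikely"

def choose_clothes(judgement):
    """Make the final decision based on the judgement."""
    return {
        "rain likely": "raincoat",
        "uncertain": "t-shirt and coat combo",
        "rain unlikely": "t-shirt",
    }[judgement]

evidence = gather_evidence()
judgement = judge_weather(evidence)
print(choose_clothes(judgement))  # all three sources agree, so: raincoat
```

Notice that the validity question lives in `judge_weather`: the function treats the app, the sky and the ants as equally trustworthy sources, which, as we will see, they are not.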

This is a very good example of assessment/evaluation/scoring processes as it reflects the main features of such processes. Let me mention some of the most important ones:

1. Assessment processes are part of decision-making processes. We assess stuff because we need to make decisions. Such decisions can be minute and without much consequence ("Should I get another piece of that cranberry cake?"). Many times, decisions can be tough and critical to people's lives ("Should we stop the quarantine in this area?"). Many times, also, decisions are just somewhere in the middle of minute and critical ("What score should I give to this oral exam?").

2. Assessment processes are very much like research processes: we have a question, we collect relevant data, and then we compare those data in ways that may help us understand what reality is like and, thus, what decision we can make.

3. Assessment is, fundamentally, a process of comparison of two or more sets of data to establish one particular value (the state of the weather, the level of ability of an English language learner).

4. Since they are about decision-making, assessment processes have consequences. So, assessment procedures can be examined in terms of the consequences they produce.

5. Since assessment processes are about decision-making, and we are pretty much making decisions all the time (I feel uncomfortable, I'm going to adopt a new posture in my seat/ What should I eat for lunch? /Should I quit this Seminar?), then it follows that we are assessing constantly. Yup. Assessment is EVERYWHERE and we are assessing everything, all the time. This is so much so that, after this Seminar is over, you will not be able to even comb your hair without noticing that there is an assessment process there (otherwise you would never know when your hair is ok and would comb yourself eternally).

6. Most important, assessment processes are about determining the value of something.

This last point is the one that takes us back to the concept of validity. Whatever the assessment procedure we use in the example, the whole purpose of the procedure is to get a particular value right: the conditions of the weather.

Such a value can be stated in very simple terms (good weather vs bad weather), in more complex terms (a period of rain likely very early, then a chance of showers), or in very complex terms (there is a 27% chance of rain). In all cases, some quality of the reality we want to evaluate is selected (personal appreciation and probability, in the examples). Then values are assigned to measure or evaluate that quality (good vs bad, likely vs unlikely, percentages). However the quality and its values are expressed or conceptualised, the job of the assessment process (measurement, evaluation, etc.) is to provide a value for that quality that reflects how reality really is.
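The same underlying quality can thus be reported on scales of very different complexity. A minimal sketch, with hypothetical names and thresholds chosen only for illustration:

```python
# The same quality -- "chance of rain" -- expressed on three scales
# of increasing complexity, mirroring the examples above.

raw_probability = 0.27  # the underlying quality we want a value for

# Very simple: a binary judgement
simple_value = "bad weather" if raw_probability > 0.5 else "good weather"

# More complex: an ordinal category
if raw_probability > 0.7:
    ordinal_value = "rain likely"
elif raw_probability > 0.3:
    ordinal_value = "a chance of showers"
else:
    ordinal_value = "rain unlikely"

# Very complex: the numeric probability itself
complex_value = f"{raw_probability:.0%} chance of rain"

print(simple_value, "|", ordinal_value, "|", complex_value)
# → good weather | rain unlikely | 27% chance of rain
```

Each scale throws away some information about the quality, which is exactly why the choice of values is part of the design of an assessment procedure, not an afterthought.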

In the example, what we want is that the value produced by the assessment process is right. We want the prediction in our phone app and the behaviour of the ants to correctly indicate the weather conditions of the moment. This is what we call validity. We want the ruler to indicate actual distance, weight scales to indicate actual weight, ants to indicate rain with precision. A ruler that measures distance is a valid ruler. A scale that measures weight is a valid scale.

Although the idea is simple, validity in assessment procedures or tests is a quality which is very difficult to observe and achieve. In the example of the weather forecast, we can see that the data we can use to get the right value of the weather can be very diverse in nature. The information provided by the application in a computer is very different from the information you gather by just looking at how people are dressed from your window. And then there are the ants building walls. Somehow, we know that these sources of evidence vary widely in terms of how precisely they indicate weather conditions.

The question is: how do we know that? How do we know that there are better (more valid) ways to know weather conditions and not so good ones (less valid)?

The answer to this question has to do with our capacity to understand how the evidence we are observing (some records in the smartphone, people wearing clothes, or ants building walls) relates in reality to the quality that we want to evaluate (the condition of the weather). That connection is one of the key problems in the discussion about assessment and measurement, and so it deserves its own post in the next Unit of this series.

This is it for now.

Please do leave comments, questions, suggestions or any response you may see fit. Since we are not having meetings for a while, this is the way to participate in the Seminar.