The classroom
looked like a call center. Long tables were divided by partitions into
individual work stations, where students sat at computers. At one of the
stations, a student was logged into software on a distant server,
working on math problems at her own pace. The software presented
questions, she answered them, and the computer instantly evaluated each
answer. When she answered a sufficient number of problems correctly, she
advanced to the next section. The gas-plasma monitors attached to the
computers displayed the text and graphics with a monochromatic orange
glow.
This was in 1972. The
computer terminals were connected to a mainframe computer at the
University of Illinois at Urbana-Champaign, which ran software called
Programmed Logic for Automated Teaching Operations (PLATO).
The software had been developed in the nineteen-sixties as an
experiment in computer-assisted instruction, and by the seventies a
nationwide network allowed thousands of terminals to simultaneously
connect to the mainframes in Urbana.* Despite the success of the
program, PLATO was far from a new concept—from the
earliest days of mainframe computing, technologists have explored how to
use computers to complement or supplement human teachers.
At
first glance, this impulse makes sense. Computers are exceptionally
good at tasks that can be encoded into routines with well-defined
inputs, outputs, and goals. The first PLATO-specific programming language was TUTOR, an authoring language that allowed programmers to write online problem sets.* A problem written in TUTOR
had, at minimum, a question and an answer bank. Some answer banks were
quite simple. For example, the answer bank for the question “What is 3 +
2?” might be “5.” But answer banks could also be more complicated, by
accepting multiple correct answers or by ignoring certain prefatory
words. For instance, the answer bank for that same question could also
be “<it, is, the, answer> (5, five, fiv).” With this more
sophisticated answer bank, TUTOR would accept “5,” “five,” “fiv,” “it is 5,” or “the answer is fiv” as correct answers.
This sort of pattern-matching was at the heart of TUTOR. Students typed characters in response to a prompt, and TUTOR determined if those characters matched the accepted answers in the bank. But TUTOR
had no real semantic understanding of the problem being posed or the
answers given. “What is 3+2?” is just an arbitrary string of characters,
and “5” is just one more arbitrary character. TUTOR did
not need to evaluate the arithmetic of the problem. It could simply
evaluate whether the syntax of a student’s answer matched the syntax of
the answer bank.
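To make that distinction concrete, here is a minimal sketch, written in Python rather than TUTOR, of this kind of purely syntactic judging. The ignorable words and accepted answers come from the “What is 3 + 2?” example above; the function name and structure are illustrative, not actual TUTOR code.

```python
# A toy judger in the spirit of a TUTOR answer bank: it strips the
# ignorable words, then checks whether what remains is one of the
# accepted strings. It never computes 3 + 2.
IGNORABLE = {"it", "is", "the", "answer"}   # words the matcher may skip
ACCEPTED = {"5", "five", "fiv"}             # strings counted as correct

def judge(response: str) -> bool:
    """Return True if the response matches the answer bank syntactically."""
    words = [w for w in response.lower().split() if w not in IGNORABLE]
    return len(words) == 1 and words[0] in ACCEPTED

if __name__ == "__main__":
    for attempt in ["5", "five", "it is 5", "the answer is fiv", "3 + 2", "six"]:
        print(f"{attempt!r} -> {'correct' if judge(attempt) else 'incorrect'}")
```

Run against those attempts, the judger accepts “5,” “five,” and “the answer is fiv” but rejects “3 + 2,” even though the arithmetic is right, because it compares strings, not quantities.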
Humans are much
slower than computers at this kind of pattern-matching, as anyone who
has graded a stack of homework can attest, and in response educators
have developed a variety of technologies to speed up the process.
Scantron systems allow students to encode answers as bubbles on
multiple-choice forms, and optical-recognition tools quickly identify
whether the bubbles are filled in correctly. In eighteenth-century
America, one-room schoolhouses employed the monitorial method, in which
older students evaluated the recitations of younger ones. Younger
students memorized sections of textbooks and recited those sections to
older students, who had previously memorized the same sections and had
the book in front of them for good measure. Monitors often did not
understand the semantic meaning of the words being recited, but they
insured that the syntactical input (the recitation) matched the answer
bank (the textbook). TUTOR and its descendants are very fast and accurate versions of these monitors.
Forty years after PLATO,
interest in computer-assisted instruction is surging. New firms, such
as DreamBox and Knewton, have joined more established companies like
Achieve3000 and Carnegie Learning in providing “intelligent tutors” for
“adaptive instruction” or “personalized learning.” In the first quarter
of 2014, over half a billion dollars was invested in
education-technology startups. Not surprisingly, these intelligent
tutors have grown fastest in fields in which many problems have
well-defined correct answers, such as mathematics and computer science.
In domains where student performances are more nuanced, machine-learning
algorithms have seen more modest success.
Take,
for instance, essay-grading software. Computers cannot read the
semantic meaning of student texts, so autograders work by reducing
student writing to syntax. First, humans grade a small training set of
essays, which then go through a process of text preparation. Autograders
remove the most common words, like “a,” “and,” and “the.” The order of
words is then ignored, and the words are aggregated into a list,
evocatively called a “bag of words.” Computers calculate various
relationships among these words, such as the frequencies of all
possible word pairs, and summarize those relationships as a
quantitative representation. For each document in the training set, the
autograder then correlates that quantitative representation of the
essay’s syntax with the grade assigned by the human, which serves as
the assessment of semantic quality.
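As a rough illustration of that preparation step, the sketch below reduces a pair of invented essays to bags of words in Python. The stop-word list, essays, and grades are made up for illustration, and real autograders compute far richer features than these.

```python
# Text preparation for a toy autograder: strip stop words, discard word
# order, and reduce each essay to word counts (a "bag of words").
# Essays, grades, and the stop-word list are invented for illustration.
from collections import Counter
from itertools import combinations

STOP_WORDS = {"a", "and", "the"}

def bag_of_words(essay: str) -> Counter:
    """Discard word order and stop words; keep only how often each word appears."""
    words = (w.strip(".,;:!?").lower() for w in essay.split())
    return Counter(w for w in words if w and w not in STOP_WORDS)

def word_pairs(bag: Counter) -> Counter:
    """One crude 'relationship among words': which distinct pairs co-occur in an essay."""
    return Counter(frozenset(pair) for pair in combinations(sorted(bag), 2))

# A tiny, invented training set of (essay, human-assigned grade) pairs.
training_set = [
    ("The data clearly support the hypothesis, and the conclusion follows.", 5),
    ("Stuff happened and then more stuff happened.", 2),
]

# Each essay becomes a quantitative representation tied to its human grade.
training_features = [(bag_of_words(text), grade) for text, grade in training_set]
```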
The
final step is pattern-matching. The algorithm takes each ungraded
essay, compares its syntactic patterns to the patterns of the essays in
the training set, and assigns a grade based on how similar those
patterns are. In other words, if an ungraded
bag of words has the same quantitative properties as a high-scoring bag
of words from the training set, then the software assigns a high score.
If those syntactic patterns are more similar to a low-scoring essay,
the software assigns a low score.
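The scoring step can be sketched in the same spirit. In this toy version, cosine similarity between word-count bags stands in for whatever similarity measure a production autograder actually uses, and the ungraded essay simply inherits the grade of the most similar training essay; none of this is the actual algorithm behind any particular product.

```python
# Scoring in a toy autograder: compare an ungraded bag of words to the
# graded training bags and borrow the grade of the most similar one.
# Cosine similarity is an assumption here, not any vendor's algorithm.
import math
from collections import Counter

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Compare two bags of words purely by their word counts."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def predict_grade(ungraded: Counter, graded_bags: list) -> int:
    """Assign the grade of the syntactically most similar training essay."""
    _, grade = max(graded_bags, key=lambda pair: cosine_similarity(ungraded, pair[0]))
    return grade

if __name__ == "__main__":
    graded_bags = [
        (Counter({"data": 3, "hypothesis": 2, "evidence": 1}), 5),  # high-scoring essay
        (Counter({"stuff": 2, "things": 1}), 2),                    # low-scoring essay
    ]
    new_essay = Counter({"data": 2, "hypothesis": 1, "things": 1})
    print(predict_grade(new_essay, graded_bags))  # prints 5: closest to the high-scoring bag
```

In practice, autograders fit a statistical model over many such features rather than copying a single neighbor’s grade, but the underlying move is the same: syntax is matched to syntax.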
In
some ways, grading by machine learning is a marvel of modern
computation. In other ways, it’s a kluge that reduces a complex human
performance into patterns that can be algorithmically matched. The
performance of these autograding systems is still limited and public
suspicion of them is high, so most intelligent tutoring systems have
made no effort to parse student writing. They stick to the parts of the
curriculum with the most constrained, structurally defined answers, not
because they are the most important but because the pattern-matching is
easier to program.
This presents an odd conundrum. In the forty years since PLATO,
educational technologists have made progress in teaching parts of the
curriculum that can be most easily reduced to routines, but we have made
very little progress in expanding the range of what these programs can
do. During those same forty years, in nearly every other sector of
society, computers have reduced the necessity of performing tasks that
can be reduced to a routine. Computers, therefore, are best at assessing
human performance in the sorts of tasks in which humans have already
been replaced by computers.
Perhaps
the most concerning part of these developments is that our technology
for high-stakes testing mirrors our technology for intelligent tutors.
We use machine learning in a limited way for grading essays on tests,
but for the most part those tests are dominated by assessment
methods—multiple choice and quantitative input—in which computers can
quickly compare student responses to an answer bank. We’re pretty good
at testing the kinds of things that intelligent tutors can teach, but
we’re not nearly as good at testing the kinds of things that the labor
market increasingly rewards. In “Dancing with Robots,”
an excellent paper on contemporary education, Frank Levy and Richard
Murnane argue that the pressing challenge of the educational system is
to “educate many more young people for the jobs computers cannot do.”
Schooling that trains students to efficiently conduct routine tasks is
training students for jobs that pay minimum wage—or jobs that simply no
longer exist.