Word Frequency Analysis of 2007 GOP Primary Debate

This weekend, in between the amazing shows that come to town during the weeks surrounding Jazzfest, I had a chance to look at the GOP primary debate.

I thought it would be fun to do a little analysis of word frequencies.
The number in parentheses is the word count.

  • would(88)>should(67)>can(57)>want(55)>need(52)
  • think(102)>know(52)>>believe(4)
  • Governor(89), States(49)
  • Romney(50)>McCain, Giuliani(49)>Tancredo(42)>Gilmore(33)>Brownback(32)
    >Huckabee(31)>Paul, Reagan(29)>Thompson(27)>Hunter(21)
  • Faith(19)>God(7)
  • Iraq(33), Iran(19), Policy(18), Foreign(17), Nuclear(14)
  • Reagan(29)>Clinton(13)>Rove(9)>Bush(8)
  • Win(22)>Troops(7), Home(8), Leave(6)
  • No(51)>Yes(21)

Below I’ve generated a simplistic “most significant” measure by computing the ratio of each word’s frequency in the debate transcript to its frequency in a corpus of spoken English. The number in the first table is the ratio, whereas the number in the second table is the word count. I’m going to get this into Exhibit and have a play with some neato visualizations as soon as I get a chance.

Here are the caveats. The reference frequencies for the first table are actually from the British National Corpus, so many of the words in the transcript score highly simply because the candidates are speaking American English. In the second table, which uses the American National Corpus (ANC), many words score highly because they refer to current events that hadn’t occurred when the corpus was compiled, and because debates are partly speeches, which are more like written English. I’ll update it when I find a better reference.
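For anyone who wants to try something similar, here is a rough Python sketch of the kind of ratio described above. It is not the script used for these tables. The function names, the per-million normalization, the `min_count` cutoff, and the tiny reference dictionary are all assumptions made up for illustration; real reference frequencies would come from a corpus word list such as the BNC or ANC.

```python
import re
from collections import Counter


def word_counts(text):
    """Count case-insensitive word occurrences in a transcript."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def appearance_ratios(text, reference_per_million, min_count=2):
    """Rank words by (frequency in text) / (frequency in reference corpus).

    reference_per_million maps a word to its occurrences per million
    words in the reference corpus (hypothetical input format).
    """
    counts = word_counts(text)
    total = sum(counts.values())
    ratios = {}
    for word, n in counts.items():
        ref = reference_per_million.get(word)
        # Skip rare words and words absent from the reference list,
        # since the ratio is undefined without a reference frequency.
        if ref is None or n < min_count:
            continue
        text_per_million = n / total * 1_000_000
        ratios[word] = text_per_million / ref
    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)


# Made-up sample text and made-up reference frequencies, for illustration only.
sample = "the governor said the governor would cut taxes and the taxes would fall"
reference = {"the": 60000, "governor": 20, "taxes": 40, "would": 2500}

ranked = appearance_ratios(sample, reference)
for word, ratio in ranked:
    print(word.upper(), round(ratio, 2))
```

Words missing from the reference list are simply dropped here, which is one reason names and newly coined terms need separate handling. Very short transcripts also inflate the ratios, so only the ordering is meaningful in a toy example like this one.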

    Top 100 Words in GOP Primary Debate 2007
    sorted by appearance ratio

    GOVERNOR 589.4664
    IRAN 335.5764
    IRAQ 174.853
    CLINTON 137.7629
    FEDERAL 90.07577
    TAXES 84.77719
    COALITION 79.47862
    STATES 74.18005
    READER 74.18005
    BUSH 70.64766
    CALIFORNIA 61.8167
    AMERICANS 61.1374
    BORDER 57.06157
    NATION 56.51813
    BELIEFS 52.98575
    DIPLOMATIC 52.98575
    EXPORTS 52.98575
    MILITARY 49.86894
    UNITED 47.09844
    SPENDING 46.07456
    WASHINGTON 45.41635
    DEMOCRATS 44.15479
    PROGRAM 39.73931
    SECURE 39.73931
    ACQUIRE 39.73931
    DEFEAT 39.73931
    KOREA 39.73931
    NUCLEAR 39.04213
    WEAPONS 38.53509
    PRESIDENT 38.34495
    MAYOR 37.84696
    ISRAEL 37.09002
    ACQUISITION 35.32383
    AUTHORS 35.32383
    JOURNAL 35.32383
    LIMITATIONS 35.32383
    PRESIDENTIAL 35.32383
    TROOPS 33.7182
    FOREIGN 33.3614
    GLOBAL 33.11609
    AMERICA 32.99113
    AMERICAN 32.75483
    VALUES 31.79145
    GAINS 31.79145
    SUPREME 31.79145
    GREATEST 29.80448
    SERVING 29.43653
    FAITH 26.49287
    ENTIRE 26.49287
    CONSTITUTION 26.49287
    ILLEGAL 26.49287
    CATHOLIC 26.49287
    COMMANDER 26.49287
    ACCOMPANIED 26.49287
    JUDICIAL 26.49287
    VIEWED 26.49287
    WALKER 26.49287
    PROTECT 25.43316
    THREAT 24.93447
    CELLS 24.45496
    GRADE 24.08443
    CANDIDATES 23.54922
    WEAPON 23.54922
    CONCERNING 22.70818
    JUDGES 22.70818
    BILLS 22.30979
    ELECTED 22.30979
    WELFARE 22.07739
    STABILITY 21.1943
    ADMINISTRATION 20.60557
    FORMER 20.18505
    DEFICIT 19.86966
    CELL 19.52106
    SOLVE 18.92348
    VOTED 18.92348
    PROUD 18.54501
    LEAD 18.1267
    MIDDLE 18.1267
    TAX 17.82698
    CRITICAL 17.66192
    FREEDOM 17.66192
    CONSISTENT 17.66192
    PRINCIPLES 17.66192
    TRANSFER 17.66192
    EXPERIMENT 17.66192
    INTELLIGENCE 17.66192
    ROMAN 17.66192
    SUCCEED 17.66192
    DISCRETION 17.66192
    ENEMY 17.66192
    STUDIED 17.66192
    WEALTH 17.66192
    COLLAPSE 17.66192
    CONCLUDED 17.66192
    CONVICTION 17.66192
    HUMANS 17.66192
    PAKISTAN 17.66192
    REVEAL 17.66192
    SEPARATION 17.66192
    WIN 16.89401

    Top 50 Words using ANC Corpus
    sorted by appearance ratio

    9 KARL
    7 OPTIMISM
    21 RONALD
    4 CONFRONT
    4 SCOOTER
    9 ISLAMIC
    3 ALTERED
    3 CONSERVATIVES
    3 VETOED
    3 DIPLOMATIC
    7 REGIMES
    7 REPEAL
    12 STEM
    4 AISLE
    4 HYDE
    4 STRENGTHS
    2 BATTALIONS
    2 CURES
    2 FLATTER
    2 GOVERNS
    2 HOSTILITY
    2 JUSTICES
    2 NOMINEE
    2 PARDONS
    2 SECRECY
    2 UNIFY
    2 EXPORTS
    6 CELLS
    6 ENGAGE
    3 CONSERVATISM
    3 CRITICIZED
    15 ID
    4 IRANIANS
    5 BIN
    5 STRENGTHEN
    5 CELEBRATE
    4 MISMANAGED
    3 RACISM
    2 ABORTIONS
    2 COMMUNION
    2 CONVEY
    2 CROSSES
    2 CURING
    2 ENDORSE
    2 GOVERNED
    2 IMPERATIVE
    2 TAMPER

    06. May 2007 by Mr. Gunn
    Categories: blogging, current events, Politics, Statistics