两个经典问题:矩阵乘法,文档倒排索引
矩阵乘法
背景
并行化矩阵乘法是可以用MapReduce实现的一项基本算法,最早Google公司将MapReduce引入到工业界的原因就是要解决PageRank中包含的大量矩阵运算。
原理
假设:矩阵,矩阵
,
的列数
的行数,则记
和
的乘积
。
其中,记作矩阵M的第i行第j列元素,
记作矩阵N的第j行第k列元素,则乘积矩阵P的元素为
%7Bik%7D%20%3D%20%E2%88%91%F0%9D%91%97%F0%9D%91%9A%7B%F0%9D%91%96%F0%9D%91%97%7D%F0%9D%91%9B%7B%F0%9D%91%97%F0%9D%91%98%7D%3C%2Ftitle%3E%0A%3Cdefs%20aria-hidden%3D%22true%22%3E%0A%3Cpath%20stroke-width%3D%221%22%20id%3D%22E1-MJMATHI-70%22%20d%3D%22M23%20287Q24%20290%2025%20295T30%20317T40%20348T55%20381T75%20411T101%20433T134%20442Q209%20442%20230%20378L240%20387Q302%20442%20358%20442Q423%20442%20460%20395T497%20281Q497%20173%20421%2082T249%20-10Q227%20-10%20210%20-4Q199%201%20187%2011T168%2028L161%2036Q160%2035%20139%20-51T118%20-138Q118%20-144%20126%20-145T163%20-148H188Q194%20-155%20194%20-157T191%20-175Q188%20-187%20185%20-190T172%20-194Q170%20-194%20161%20-194T127%20-193T65%20-192Q-5%20-192%20-24%20-194H-32Q-39%20-187%20-39%20-183Q-37%20-156%20-26%20-148H-6Q28%20-147%2033%20-136Q36%20-130%2094%20103T155%20350Q156%20355%20156%20364Q156%20405%20131%20405Q109%20405%2094%20377T71%20316T59%20280Q57%20278%2043%20278H29Q23%20284%2023%20287ZM178%20102Q200%2026%20252%2026Q282%2026%20310%2049T356%20107Q374%20141%20392%20215T411%20325V331Q411%20405%20350%20405Q339%20405%20328%20402T306%20393T286%20380T269%20365T254%20350T243%20336T235%20326L232%20322Q232%20321%20229%20308T218%20264T204%20212Q178%20106%20178%20102Z%22%3E%3C%2Fpath%3E%0A%3Cpath%20stroke-width%3D%221%22%20id%3D%22E1-MJMATHI-69%22%20d%3D%22M184%20600Q184%20624%20203%20642T247%20661Q265%20661%20277%20649T290%20619Q290%20596%20270%20577T226%20557Q211%20557%20198%20567T184%20600ZM21%20287Q21%20295%2030%20318T54%20369T98%20420T158%20442Q197%20442%20223%20419T250%20357Q250%20340%20236%20301T196%20196T154%2083Q149%2061%20149%2051Q149%2026%20166%2026Q175%2026%20185%2029T208%2043T235%2078T260%20137Q263%20149%20265%20151T282%20153Q302%20153%20302%20143Q302%20135%20293%20112T268%2061T223%2011T161%20-11Q129%20-11%20102%2010T74%2074Q74%2091%2079%20106T122%20220Q160%20321%20166%20341T173%20380Q173%20404%20156%20404H154Q124%20404%2099%20371T61%20287Q60%20286%2059%20284T58%20281T56%20279T53%20278T49%20278T41%20278H27Q21%20284%2021%20287Z%22%3E%3C%2Fpath%3E%0A%3Cpath%20stroke-width%3D%221%22%20id%3D%22E1-MJMATHI-6B%22%20d%3D%22M121%20647Q121%20657%20125%20670T137%20683Q138%20683%20209%20688T282%20694Q294%20694%20294%20686Q294%20679%20244%20477Q194%20279%20194%20272Q213%20282%20223%20291Q247%20309%20292%20354T362%20415Q402%20442%20438%20442Q468%20442%20485%20423T503%20369Q503%20344%20496%20327T477%20302T456%20291T438%20288Q418%20288%20406%20299T394%20328Q394%20353%20410%20369T442%20390L458%20393Q446%20405%20434%20405H430Q398%20402%20367%20380T294%20316T228%20255Q230%20254%20243%20252T267%20246T293%20238T320%20224T342%20206T359%20180T365%20147Q365%20130%20360%20106T354%2066Q354%2026%20381%2026Q429%2026%20459%20145Q461%20153%20479%20153H483Q499%20153%20499%20144Q499%20139%20496%20130Q455%20-11%20378%20-11Q333%20-11%20305%2015T277%2090Q277%20108%20280%20121T283%20145Q283%20167%20269%20183T234%20206T200%20217T182%20220H180Q168%20178%20159%20139T145%2081T136%2044T129%2020T122%207T111%20-2Q98%20-11%2083%20-11Q66%20-11%2057%20-1T48%2016Q48%2026%2085%20176T158%20471L195%20616Q196%20629%20188%20632T149%20637H144Q134%20637%20131%20637T124%20640T121%20647Z%22%3E%3C%2Fpath%3E%0A%3Cpath%20stroke-width%3D%221%22%20id%3D%22E1-MJMAIN-3D%22%20d%3D%22M56%20347Q56%20360%2070%20367H707Q722%20359%20722%20347Q722%20336%20708%20328L390%20327H72Q56%20332%2056%20347ZM56%20153Q56%20168%2072%20173H708Q722%20163%20722%20153Q722%20140%20707%20133H70Q56%20140%2056%20153Z%22%3E%3C%2Fpath%3E%0A%3Cpath%20stroke-width%3D%221%22%20id%3D%22E1-MJMAIN-28%22%20d%3D%22M94%20250Q94%20319%20104%20381T127%20488T164%20576T202%20643T244%20695T277%20729T302%20750H315H319Q333%20750%20333%20741Q333%20738%20316%20720T275%20667T226%20581T184%20443T167%20250T184%2058T225%20-81T274%20-167T316%20-220T333%20-241Q333%20-250%20318%20-250H315H302L274%20-226Q180%20-141%20137%20-14T94%20250Z%22%3E%3C%2Fpath%3E%0A%3Cpath%20stroke-width%3D%221%22%20id%3D%22E1-MJMATHI-4D%22%20d%3D%22M289%20629Q289%20635%20232%20637Q208%20637%20201%20638T194%20648Q194%20649%20196%20659Q197%20662%20198%20666T199%20671T201%20676T203%20679T207%20681T212%20683T220%20683T232%20684Q238%20684%20262%20684T307%20683Q386%20683%20398%20683T414%20678Q415%20674%20451%20396L487%20117L510%20154Q534%20190%20574%20254T662%20394Q837%20673%20839%20675Q840%20676%20842%20678T846%20681L852%20683H948Q965%20683%20988%20683T1017%20684Q1051%20684%201051%20673Q1051%20668%201048%20656T1045%20643Q1041%20637%201008%20637Q968%20636%20957%20634T939%20623Q936%20618%20867%20340T797%2059Q797%2055%20798%2054T805%2050T822%2048T855%2046H886Q892%2037%20892%2035Q892%2019%20885%205Q880%200%20869%200Q864%200%20828%201T736%202Q675%202%20644%202T609%201Q592%201%20592%2011Q592%2013%20594%2025Q598%2041%20602%2043T625%2046Q652%2046%20685%2049Q699%2052%20704%2061Q706%2065%20742%20207T813%20490T848%20631L654%20322Q458%2010%20453%205Q451%204%20449%203Q444%200%20433%200Q418%200%20415%207Q413%2011%20374%20317L335%20624L267%20354Q200%2088%20200%2079Q206%2046%20272%2046H282Q288%2041%20289%2037T286%2019Q282%203%20278%201Q274%200%20267%200Q265%200%20255%200T221%201T157%202Q127%202%2095%201T58%200Q43%200%2039%202T35%2011Q35%2013%2038%2025T43%2040Q45%2046%2065%2046Q135%2046%20154%2086Q158%2092%20223%20354T289%20629Z%22%3E%3C%2Fpath%3E%0A%3Cpath%20stroke-width%3D%221%22%20id%3D%22E1-MJMAIN-22C5%22%20d%3D%22M78%20250Q78%20274%2095%20292T138%20310Q162%20310%20180%20294T199%20251Q199%20226%20182%20208T139%20190T96%20207T78%20250Z%22%3E%3C%2Fpath%3E%0A%3Cpath%20stroke-width%3D%221%22%20id%3D%22E1-MJMATHI-4E%22%20d%3D%22M234%20637Q231%20637%20226%20637Q201%20637%20196%20638T191%20649Q191%20676%20202%20682Q204%20683%20299%20683Q376%20683%20387%20683T401%20677Q612%20181%20616%20168L670%20381Q723%20592%20723%20606Q723%20633%20659%20637Q635%20637%20635%20648Q635%20650%20637%20660Q641%20676%20643%20679T653%20683Q656%20683%20684%20682T767%20680Q817%20680%20843%20681T873%20682Q888%20682%20888%20672Q888%20650%20880%20642Q878%20637%20858%20637Q787%20633%20769%20597L620%207Q618%200%20599%200Q585%200%20582%202Q579%205%20453%20305L326%20604L261%20344Q196%2088%20196%2079Q201%2046%20268%2046H278Q284%2041%20284%2038T282%2019Q278%206%20272%200H259Q228%202%20151%202Q123%202%20100%202T63%202T46%201Q31%201%2031%2010Q31%2014%2034%2026T39%2040Q41%2046%2062%2046Q130%2049%20150%2085Q154%2091%20221%20362L289%20634Q287%20635%20234%20637Z%22%3E%3C%2Fpath%3E%0A%3Cpath%20stroke-width%3D%221%22%20id%3D%22E1-MJMAIN-29%22%20d%3D%22M60%20749L64%20750Q69%20750%2074%20750H86L114%20726Q208%20641%20251%20514T294%20250Q294%20182%20284%20119T261%2012T224%20-76T186%20-143T145%20-194T113%20-227T90%20-246Q87%20-249%2086%20-250H74Q66%20-250%2063%20-250T58%20-247T55%20-238Q56%20-237%2066%20-225Q221%20-64%20221%20250T66%20725Q56%20737%2055%20738Q55%20746%2060%20749Z%22%3E%3C%2Fpath%3E%0A%3Cpath%20stroke-width%3D%221%22%20id%3D%22E1-MJSZ2-2211%22%20d%3D%22M60%20948Q63%20950%20665%20950H1267L1325%20815Q1384%20677%201388%20669H1348L1341%20683Q1320%20724%201285%20761Q1235%20809%201174%20838T1033%20881T882%20898T699%20902H574H543H251L259%20891Q722%20258%20724%20252Q725%20250%20724%20246Q721%20243%20460%20-56L196%20-356Q196%20-357%20407%20-357Q459%20-357%20548%20-357T676%20-358Q812%20-358%20896%20-353T1063%20-332T1204%20-283T1307%20-196Q1328%20-170%201348%20-124H1388Q1388%20-125%201381%20-145T1356%20-210T1325%20-294L1267%20-449L666%20-450Q64%20-450%2061%20-448Q55%20-446%2055%20-439Q55%20-437%2057%20-433L590%20177Q590%20178%20557%20222T452%20366T322%20544L56%20909L55%20924Q55%20945%2060%20948Z%22%3E%3C%2Fpath%3E%0A%3Cpath%20stroke-width%3D%221%22%20id%3D%22E1-MJMATHI-6A%22%20d%3D%22M297%20596Q297%20627%20318%20644T361%20661Q378%20661%20389%20651T403%20623Q403%20595%20384%20576T340%20557Q322%20557%20310%20567T297%20596ZM288%20376Q288%20405%20262%20405Q240%20405%20220%20393T185%20362T161%20325T144%20293L137%20279Q135%20278%20121%20278H107Q101%20284%20101%20286T105%20299Q126%20348%20164%20391T252%20441Q253%20441%20260%20441T272%20442Q296%20441%20316%20432Q341%20418%20354%20401T367%20348V332L318%20133Q267%20-67%20264%20-75Q246%20-125%20194%20-164T75%20-204Q25%20-204%207%20-183T-12%20-137Q-12%20-110%207%20-91T53%20-71Q70%20-71%2082%20-81T95%20-112Q95%20-148%2063%20-167Q69%20-168%2077%20-168Q111%20-168%20139%20-140T182%20-74L193%20-32Q204%2011%20219%2072T251%20197T278%20308T289%20365Q289%20372%20288%20376Z%22%3E%3C%2Fpath%3E%0A%3Cpath%20stroke-width%3D%221%22%20id%3D%22E1-MJMATHI-6D%22%20d%3D%22M21%20287Q22%20293%2024%20303T36%20341T56%20388T88%20425T132%20442T175%20435T205%20417T221%20395T229%20376L231%20369Q231%20367%20232%20367L243%20378Q303%20442%20384%20442Q401%20442%20415%20440T441%20433T460%20423T475%20411T485%20398T493%20385T497%20373T500%20364T502%20357L510%20367Q573%20442%20659%20442Q713%20442%20746%20415T780%20336Q780%20285%20742%20178T704%2050Q705%2036%20709%2031T724%2026Q752%2026%20776%2056T815%20138Q818%20149%20821%20151T837%20153Q857%20153%20857%20145Q857%20144%20853%20130Q845%20101%20831%2073T785%2017T716%20-10Q669%20-10%20648%2017T627%2073Q627%2092%20663%20193T700%20345Q700%20404%20656%20404H651Q565%20404%20506%20303L499%20291L466%20157Q433%2026%20428%2016Q415%20-11%20385%20-11Q372%20-11%20364%20-4T353%208T350%2018Q350%2029%20384%20161L420%20307Q423%20322%20423%20345Q423%20404%20379%20404H374Q288%20404%20229%20303L222%20291L189%20157Q156%2026%20151%2016Q138%20-11%20108%20-11Q95%20-11%2087%20-5T76%207T74%2017Q74%2030%20112%20181Q151%20335%20151%20342Q154%20357%20154%20369Q154%20405%20129%20405Q107%20405%2092%20377T69%20316T57%20280Q55%20278%2041%20278H27Q21%20284%2021%20287Z%22%3E%3C%2Fpath%3E%0A%3Cpath%20stroke-width%3D%221%22%20id%3D%22E1-MJMATHI-6E%22%20d%3D%22M21%20287Q22%20293%2024%20303T36%20341T56%20388T89%20425T135%20442Q171%20442%20195%20424T225%20390T231%20369Q231%20367%20232%20367L243%20378Q304%20442%20382%20442Q436%20442%20469%20415T503%20336T465%20179T427%2052Q427%2026%20444%2026Q450%2026%20453%2027Q482%2032%20505%2065T540%20145Q542%20153%20560%20153Q580%20153%20580%20145Q580%20144%20576%20130Q568%20101%20554%2073T508%2017T439%20-10Q392%20-10%20371%2017T350%2073Q350%2092%20386%20193T423%20345Q423%20404%20379%20404H374Q288%20404%20229%20303L222%20291L189%20157Q156%2026%20151%2016Q138%20-11%20108%20-11Q95%20-11%2087%20-5T76%207T74%2017Q74%2030%20112%20180T152%20343Q153%20348%20153%20366Q153%20405%20129%20405Q91%20405%2066%20305Q60%20285%2060%20284Q58%20278%2041%20278H27Q21%20284%2021%20287Z%22%3E%3C%2Fpath%3E%0A%3C%2Fdefs%3E%0A%3Cg%20stroke%3D%22currentColor%22%20fill%3D%22currentColor%22%20stroke-width%3D%220%22%20transform%3D%22matrix(1%200%200%20-1%200%200)%22%20aria-hidden%3D%22true%22%3E%0A%20%3Cuse%20xlink%3Ahref%3D%22%23E1-MJMATHI-70%22%20x%3D%220%22%20y%3D%220%22%3E%3C%2Fuse%3E%0A%3Cg%20transform%3D%22translate(503%2C-150)%22%3E%0A%20%3Cuse%20transform%3D%22scale(0.707)%22%20xlink%3Ahref%3D%22%23E1-MJMATHI-69%22%20x%3D%220%22%20y%3D%220%22%3E%3C%2Fuse%3E%0A%20%3Cuse%20transform%3D%22scale(0.707)%22%20xlink%3Ahref%3D%22%23E1-MJMATHI-6B%22%20x%3D%22345%22%20y%3D%220%22%3E%3C%2Fuse%3E%0A%3C%2Fg%3E%0A%20%3Cuse%20xlink%3Ahref%3D%22%23E1-MJMAIN-3D%22%20x%3D%221494%22%20y%3D%220%22%3E%3C%2Fuse%3E%0A%20%3Cuse%20xlink%3Ahref%3D%22%23E1-MJMAIN-28%22%20x%3D%222550%22%20y%3D%220%22%3E%3C%2Fuse%3E%0A%20%3Cuse%20xlink%3Ahref%3D%22%23E1-MJMATHI-4D%22%20x%3D%222940%22%20y%3D%220%22%3E%3C%2Fuse%3E%0A%20%3Cuse%20xlink%3Ahref%3D%22%23E1-MJMAIN-22C5%22%20x%3D%224213%22%20y%3D%220%22%3E%3C%2Fuse%3E%0A%20%3Cuse%20xlink%3Ahref%3D%22%23E1-MJMATHI-4E%22%20x%3D%224714%22%20y%3D%220%22%3E%3C%2Fuse%3E%0A%3Cg%20transform%3D%22translate(5603%2C0)%22%3E%0A%20%3Cuse%20xlink%3Ahref%3D%22%23E1-MJMAIN-29%22%20x%3D%220%22%20y%3D%220%22%3E%3C%2Fuse%3E%0A%3Cg%20transform%3D%22translate(389%2C-150)%22%3E%0A%20%3Cuse%20transform%3D%22scale(0.707)%22%20xlink%3Ahref%3D%22%23E1-MJMATHI-69%22%20x%3D%220%22%20y%3D%220%22%3E%3C%2Fuse%3E%0A%20%3Cuse%20transform%3D%22scale(0.707)%22%20xlink%3Ahref%3D%22%23E1-MJMATHI-6B%22%20x%3D%22345%22%20y%3D%220%22%3E%3C%2Fuse%3E%0A%3C%2Fg%3E%0A%3C%2Fg%3E%0A%20%3Cuse%20xlink%3Ahref%3D%22%23E1-MJMAIN-3D%22%20x%3D%226983%22%20y%3D%220%22%3E%3C%2Fuse%3E%0A%3Cg%20transform%3D%22translate(8039%2C0)%22%3E%0A%20%3Cuse%20xlink%3Ahref%3D%22%23E1-MJSZ2-2211%22%20x%3D%220%22%20y%3D%220%22%3E%3C%2Fuse%3E%0A%20%3Cuse%20transform%3D%22scale(0.707)%22%20xlink%3Ahref%3D%22%23E1-MJMATHI-6A%22%20x%3D%222042%22%20y%3D%22-688%22%3E%3C%2Fuse%3E%0A%3C%2Fg%3E%0A%3Cg%20transform%3D%22translate(10042%2C0)%22%3E%0A%20%3Cuse%20xlink%3Ahref%3D%22%23E1-MJMATHI-6D%22%20x%3D%220%22%20y%3D%220%22%3E%3C%2Fuse%3E%0A%3Cg%20transform%3D%22translate(878%2C-150)%22%3E%0A%20%3Cuse%20transform%3D%22scale(0.707)%22%20xlink%3Ahref%3D%22%23E1-MJMATHI-69%22%20x%3D%220%22%20y%3D%220%22%3E%3C%2Fuse%3E%0A%20%3Cuse%20transform%3D%22scale(0.707)%22%20xlink%3Ahref%3D%22%23E1-MJMATHI-6A%22%20x%3D%22345%22%20y%3D%220%22%3E%3C%2Fuse%3E%0A%3C%2Fg%3E%0A%3C%2Fg%3E%0A%3Cg%20transform%3D%22translate(11557%2C0)%22%3E%0A%20%3Cuse%20xlink%3Ahref%3D%22%23E1-MJMATHI-6E%22%20x%3D%220%22%20y%3D%220%22%3E%3C%2Fuse%3E%0A%3Cg%20transform%3D%22translate(600%2C-150)%22%3E%0A%20%3Cuse%20transform%3D%22scale(0.707)%22%20xlink%3Ahref%3D%22%23E1-MJMATHI-6A%22%20x%3D%220%22%20y%3D%220%22%3E%3C%2Fuse%3E%0A%20%3Cuse%20transform%3D%22scale(0.707)%22%20xlink%3Ahref%3D%22%23E1-MJMATHI-6B%22%20x%3D%22412%22%20y%3D%220%22%3E%3C%2Fuse%3E%0A%3C%2Fg%3E%0A%3C%2Fg%3E%0A%3C%2Fg%3E%0A%3C%2Fsvg%3E#card=math&code=%0Ap%7Bik%7D%20%3D%20%28M%C2%B7%20N%29%7Bik%7D%20%3D%20%E2%88%91%F0%9D%91%97%F0%9D%91%9A%7B%F0%9D%91%96%F0%9D%91%97%7D%F0%9D%91%9B_%7B%F0%9D%91%97%F0%9D%91%98%7D&id=Fv5v4)
举例

MapReduce实现矩阵乘法
基本思路:将矩阵M和矩阵N中的元素分别以键值对的形式分发给不同的计算任务,保证与计算P中某个元素相关的所有元素信息发送至同一计算任务(Map过程);在每个任务中识别同一计算的相关元素,进行先乘后加的计算(Reduce过程) 。
MAP函数
- 输入:矩阵M,矩阵N,M的行数列数,N的行数列数
输出:矩阵M和矩阵N中元素的键值对(注意:为了保证Reduce阶段能够识别矩阵M和矩阵N中的元素,并进行正确的乘法计算,需输出所有相关的数据,如mij,njk,i,j,k)
Reduce函数
输入:Map函数的输出键值对,M的行数列数,N的行数列数
输出:矩阵P中元素的键值对
注意:为了正确计算P中的每个元素pik,需找到与之相关的所有元素mij和njk,其中j=1…M的列数或N的行数,再进行乘法计算。
文档倒排索引
背景
倒排索引是目前几乎所有支持全文检索的搜索引擎都需要依赖的一个数据结构,该索引结构用来存储某个单词在一个文档或一组文档中存储位置的映射,提供了一种根据内容来查找文档的方式。
简易文档倒排索引:原理

简易文档倒排索引:Mapreduce实现
改进WordCount代码:
Map函数输出单词,文件名@位置,Reduce函数输出单词,list<文件名@位置>即可。
真实文档倒排索引:原理

真实文档倒排索引:mapreduce实现
键值对<单词,文件名>作为Reduce函数的输入是否可行?不行,考虑数据量大时的网络带宽。
是否需要考虑清洗一部分明显不合理的单词的统计?定义StopWords列表。
如果使用了比较复杂的类型,如(Text,Text)形式的结构作为Map函数的输出键值对的主键,是否需要考虑修改Partition函数使得所有相关的主键分配到同一个Reducer?
