Category Archives: Uncategorized

Data Stories podcast on Text Visualization

Distant reading meets data visualization: the most recent episode of the Data Stories podcast interviews Chris Collins of the University of Ontario Institute of Technology about his work on text visualization. The conversation helped me understand further applications of this kind of text analysis — e.g., recognizing patterns in login passwords, or exploring how false information spreads on Twitter. Plus, Franco Moretti gets a shout-out halfway through.

I might be biased, since I’ve recently started helping to produce this podcast (though not this particular episode), but I think the show is worth checking out in general. Many episodes are relevant to the concerns of our course.

Topic Modeling (Goldstone and Underwood readings)

In reference to the reading on topic modeling, I’m interested in how this method might be useful for pulling data from fashion publications (Vogue, Harper’s Bazaar, W, etc.). Topic modeling is a form of text mining, a way of identifying patterns in a corpus: an algorithm groups words that tend to co-occur across the corpus into “topics.” We’ll say that these recurring word clusters, or “topics,” are trends. If I’m applying this to my fashion research, then trends could literally be searched across five years’ worth of words collected in a magazine. What are the “trends” (both words and actual style trends) that occur most frequently from 1960 to 1965? 1975 to 1980? Were there specific topics in these chunks of time that happened to recur? In these chunks of time, did the topics vary by publication? Did the language stay the same? I am interested in pulling data from publications, but I’m not sure if I want to pull text data or photo data (as in, maybe, how many times black models were featured in editorials across a decade’s worth of Vogue issues).

We’ll see.

Scarlett

Poetry, Appropriation, and the “Avant-Garde”

For my data project (and perhaps leading into the final project), I’m interested in finding a way to map, graph, or visualize a set of linguistic / formal trends in contemporary poetry. Based on a set of inter-connected issues and ideas, I’ve arrived at the following question, which is probably still too large: what is the relationship between “appropriation,” race, and gender in poetry of the “avant-garde”?

In coming up with this question, my first idea was to use digital tools to see how certain words or trends in language have been appropriated (or repurposed) in poetry. How much is “creative” (i.e., original) and how much is “uncreative” (i.e., stolen), and where is the line between the two?

As a writer who has almost always used other texts to generate my own, I know the politics, practice, and implications of this question are complicated. It’s certainly not a question that can or should be cleanly “solved,” but perhaps that makes it fertile for a digital project. And a number of recent course readings (e.g., “Topic Modeling and Figurative Language”), workshops (Web Scraping Social Media), and blog posts (Matt’s on “Poemage”; Taylor’s on “Hypergraphy”) have led me to believe there are some tools I could explore to attempt this kind of “close”-and-“distant” reading.

But what do I mean by “language,” “appropriation,” and “contemporary poetry”? Which text(s) do I want to analyze, and how?

I thought of looking at poems in the most recent issue of Best American Poetry, if only because of the controversy surrounding a particular poem by a white man named Michael Derrick Hudson in that anthology. Hudson submitted his poem 40 times under his own name, and then 9 times under the pseudonym of Yi-Fen Chou, hoping that by posing as a Chinese woman (“appropriating” a particular name and identity), his poem would be accepted. And eventually, it was. According to Hudson, “it took quite a bit of effort to get (this poem) into print, but I’m nothing if not persistent.”

(The idea of “persistence” led me to a possibly related question: how often do men, women, and POC submit the same piece of writing for publication, and how often are they published? Is this even a type of data that I could find?)

On a language level, an analysis of “appropriation” in a selection of texts could look something like this: take a set of poems (perhaps from that same issue of Best American Poetry), and use tools like Poemage or topic modeling to identify certain language trends: words, phrases, or perhaps bigger-picture patterns, like syntax, formal constraints, or rhyme. Then, scrape the web to see how and where these language and formal trends have been used prior to these poems: in literature, and/or in other places (blogs, social media, etc.). And if this dataset is too unmanageable, perhaps just look for how language from non-literary texts gets “appropriated” into poetic texts. This doesn’t yet relate “appropriated” language to race and gender, but I’m getting there.
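One very rough way to operationalize the “where has this phrase appeared before?” step is to look for word trigrams shared between a poem and an earlier source text. This is a stdlib sketch with invented placeholder texts, not a real scraping pipeline; in practice the “source” side would come from scraped web or literary corpora.

```python
# A rough sketch for flagging possibly "appropriated" phrases: find word
# trigrams shared between a poem and some earlier source text.
# Both texts here are invented placeholders.
import re

def trigrams(text):
    """Return the set of lowercase word trigrams in a text."""
    words = re.findall(r"[a-z']+", text.lower())
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

source = "the quick brown fox jumps over the lazy dog"
poem = "i dreamed the quick brown fox was a metaphor for desire"

# Trigrams appearing in both texts are candidate borrowings.
shared = trigrams(source) & trigrams(poem)
print(shared)
```

Shared trigrams are only candidates, of course: common phrases recur by chance, so any hits would still need close reading.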

Another idea I had for the dataset (which I originally thought of as separate, but which now seems related) was to use language analysis to ask: what “is” (or what marks) the “avant-garde” in poetry?

It’s hard, and perhaps silly, to try to define or locate a set of poems that are somehow representative of the “avant-garde,” which is itself a problematic term: most likely only historical, and not really in or of contemporary use. But the reason I thought of this question was my interest in an essay titled “Delusions of Whiteness in the Avant-Garde” by Cathy Park Hong, published almost a year prior to the Michael Derrick Hudson case in the journal Lana Turner – a journal “of poetry and opinion” in which I have published often, and which might also be thought of as a home for the “avant-garde.” In this essay, Hong claims that “to encounter the history of avant-garde poetry is to encounter a racist tradition.” A second dataset would then be in service of documenting and supporting this claim (and those that follow) through maps, graphs, or even hypertexts.

To do so, I might first take the poems in (let’s say) that issue of Lana Turner, and use (let’s say) topic modeling to see if any words, syntax, or forms can be constituted into a pattern that might be used to define “the avant-garde” (or at least, the machine’s perception of it). To be more specific by borrowing some terms from Hong’s essay: what are the “radical languages and forms” that have been “usurped” (appropriated) without proper acknowledgement? What are “Eurocentric practices” in poetry these days? How can I use digital tools to further define these terms, and then map them against the race and gender of their authors? How does this information relate to the “persistence” with which these poets tend to submit their work? Is this part of the same dataset, or related?

The biggest issue with my proposal seems to be figuring out the scope of the project: how many and which texts to analyze. If I just use the most recent issues of Best American Poetry and/or Lana Turner, would I have enough, or too much, data? And if I’m exploring these greater social issues, should I instead be mapping the controversies surrounding this discourse (on social media, for example), rather than trying to analyze any particular text itself? How could I possibly choose texts that are representative of such large claims? One tentative thought I had was to analyze my own poems. This approach is appealing, not only because I’m most comfortable targeting myself, but also because it could offer the clearest dataset. That said, I hesitate to make this “critical” project about my own creative work, about which I may know or think too much, or at least, too much more than an algorithm or computer. And my poems are certainly not “representative” on their own.

That’s enough of this meandering post for now (I’m glad I can edit this). I welcome any thoughts or feedback – –

– – Sara

PS – and if this “dataset” proves too large or complicated with its various tools and politics (which I’m starting to think it very well might), another idea I have (not related!) is to analyze the language (again: words, syntax, and structural forms) that teachers (adjuncts and/or full-time professors) use in their writing composition syllabi (let’s say, within the CUNY network), as well as looking at the texts that they teach. I’m pretty sure that “official” student evaluations are made public (at CUNY, teachers need to agree to this) – but there is also the (problematic, but possibly useful) Rate My Professors, among other blogs and social media where student reactions might appear. It could be interesting to look at the relationship between how composition syllabi are written and how students perform and/or react. And I think this project could lead to the kind of “browsing” that Stephen Ramsay describes, whereas my long, previous proposal above might constitute too much of a “search,” and an overloaded one, at that.

Other kinds of literary maps

Less quantified and far more whimsical than Moretti’s maps, the drawings created by Andrew DeGraff (who’s been described as a ‘pop cartographer’) make up a so-called atlas of literary maps. Check them out here, along with a time-lapse video of his creative process. As he is quoted in the article, “These are maps for people who seek to travel beyond the lives and places they already know (or think they know). The goal here isn’t to become found, but only to become more lost.” I don’t know that Moretti would think these maps were that useful, and I don’t imagine that they would qualify as a DH project, but I bet DeGraff discovered these texts in new ways in the process of creating these visualizations. At the very least, he probably had to do some close reading.

The Museum of Modern Art

Hi everyone,

Searching around for interesting datasets to play with, and trying to understand how DH can help my job, I found this interesting collection on GitHub about MoMA.
It would be a good file to play with for a project.

The Museum of Modern Art (MoMA) acquired its first artworks in 1929, the year it was established. Today, the Museum’s evolving collection contains almost 200,000 works from around the world spanning the last 150 years. The collection includes an ever-expanding range of visual expression, including painting, sculpture, printmaking, drawing, photography, architecture, design, film, and media and performance art.

MoMA is committed to helping everyone understand, enjoy, and use our collection. The Museum’s website features almost 60,000 artworks from nearly 10,000 artists. This research dataset contains more than 120,000 records, representing all of the works that have been accessioned into MoMA’s collection and cataloged in our database. It includes basic metadata for each work, including title, artist, date made, medium, dimensions, and date acquired by the Museum. Some of these records have incomplete information and are noted as “not Curator Approved.”
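As a quick sense of what working with this metadata might look like, here is a small pandas sketch. The rows below are invented examples, and the column names only approximate the fields described above (title, artist, date, medium, date acquired); the real CSV should be checked for its exact headers.

```python
# A small sketch of exploring MoMA-style collection metadata with pandas.
# The rows are invented examples; column names approximate the fields the
# dataset description lists (title, artist, date, medium, date acquired).
import io
import pandas as pd

csv_data = """Title,Artist,Date,Medium,DateAcquired
Untitled Drawing,Artist A,1962,Ink on paper,1970-05-01
Study in Blue,Artist B,1971,Oil on canvas,1980-03-15
City Photograph,Artist A,1958,Gelatin silver print,1965-11-20
"""

df = pd.read_csv(io.StringIO(csv_data), parse_dates=["DateAcquired"])

# Count works per artist -- the kind of simple question the dataset supports.
print(df["Artist"].value_counts())

# Filter to works acquired before a given date.
early = df[df["DateAcquired"] < "1975-01-01"]
print(len(early))
```

With the full 120,000-record file, the same few lines would answer questions like acquisitions per decade or the distribution of mediums.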

At this time, the data is available in CSV format, encoded in UTF-8. While UTF-8 is the standard for multilingual character encodings, it is not correctly interpreted by Excel on a Mac. Users of Excel on a Mac can convert the UTF-8 to UTF-16 so the file can be imported correctly.
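The conversion they suggest can be done in a few lines of Python. This is a minimal sketch; the small sample file here stands in for the real CSV, and the accented characters show that the re-encoding preserves non-ASCII text.

```python
# Re-encode a UTF-8 CSV as UTF-16 so Excel on a Mac imports it correctly.
# The tiny sample file below stands in for the real collection CSV.
sample = "Title,Artist\nCafé Scene,José García\n"

with open("sample-utf8.csv", "w", encoding="utf-8") as f:
    f.write(sample)

# Read as UTF-8, write back out as UTF-16 (Python adds the byte-order mark).
with open("sample-utf8.csv", "r", encoding="utf-8") as src:
    text = src.read()
with open("sample-utf16.csv", "w", encoding="utf-16") as dst:
    dst.write(text)

# The round trip preserves the accented characters.
with open("sample-utf16.csv", "r", encoding="utf-16") as f:
    print(f.read() == sample)  # True
```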

Here is the link to the page where you can download the file.

I hope this is useful and interesting, and that it opens up some new horizons.

https://github.com/MuseumofModernArt/collection

Nico


Graphs, Maps, Trees

I am a reader. I’m not exactly sure what that means by modern standards, but I call myself a reader because I like how books impact my brain. Be it a historical novel or Moretti’s Graphs, Maps, Trees, I savor the words selected by the author and the connections they create in my mind. Imagination is what I believe has been driving human civilization from the very beginning. To imagine is to be human. By reading, imagination ignites.

I have a friend who prefers not to read interesting books while managing multiple projects at work, since he knows they keep his mind distracted to the point that he hardly makes it to work. He says he loves to be in a different reality, but books sometimes imprison his brain in a very particular way. My friend confided to me that in his twenties he would sometimes call in sick and simply stay home with his book. It felt good, but it also terrified him. Although I have never missed work because of a book, I often do not want to re-emerge from a different reality.

An explanation for this phenomenon can be found in Moretti’s third chapter, “Trees.” Literature, especially the novel, evolves and adjusts to the new realities of each generation. There are innumerable variations of books to suit every person’s liking. If someone misses work because of a book, it means she found the one with the correct embranchment. Moretti’s tree schemes show there is an indefinite number of ramifications in novels. If you have never been enchanted by a novel, you might not have found the right one for yourself yet. But if you have, and you are totally immersed in a fictional world, maybe start analyzing why the book engages you this much. I think “Maps,” Moretti’s second chapter, would be the most helpful in this inquiry. Is it a circular or a linear map of the fictional reality that makes you want to come back? Having reduced the text to its minimal elements, what geometric form arises from the connections between the characters? Why does this happen, and what meaning does it convey to you?

Graphs, Maps, Trees provides its readers with an interesting analysis of the connections between the social and historical situation of a given timeframe and the novels current to it, and of the schemes novels unwittingly produce in the human imagination. I think I received some of my answers from Moretti.

“Quantitative Formalism: An Experiment” and a related thought about pop music

A few years ago I did a lot of reading about algorithms and machine learning as they relate to the arts and popular culture. What immediately sprang to mind when I read the opening of “Quantitative Formalism: An Experiment” was an article from WIRED in 2011 about a team from the University of Bristol that worked on developing an equation that could predict a hit song.

http://www.wired.com/2011/12/hit-potential-equation/

At the top of the article you’ll see a video that shows the “evolution of musical features” as they relate to hit songs. Since we will soon be considering ways to display results from our data sets I thought this might be interesting to take a look at.

The short article considers both the Bristol team’s work and other similar projects related to predicting the popularity of new pop music. While this is not scholarly work, I thought it was interesting to share and consider how this type of enquiry is being used outside of the academy.


Social Justice and the Digital Humanities

Hey all,

Just ran across a site that might be of interest to those interested in postcolonial, or more generally, social justice approaches to DH. Social Justice and the Digital Humanities is a site that emerged from one of the courses at HILT 2015 (Humanities Intensive Learning and Teaching). The three-day intensive course combined theory and praxis, similar to our class. Discussions included questions about access, material conditions, method, ontologies, and epistemologies. I like that much of the content consists of questions, followed by further readings/references. It also includes a list of DH projects with a social justice approach in mind.

One project I thought was pretty cool is the Map of Native American Tribes. Aaron Carapella, a self-taught mapmaker, mapped out the original names and locations of Native American tribes before their contact with Europeans. This transforms how we visualize the map of the United States.

I also found out about #transformdh, “an academic guerrilla movement seeking to (re)define capital-letter Digital Humanities as a force for transformative scholarship by collecting, sharing, and highlighting projects that push at its boundaries and work for social justice, accessibility, and inclusion.”

-Maple

Clio, Mnemosyne, and the Internet

Cohen and Rosenzweig’s introduction to Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (2005) was a very interesting reading because it echoed some texts I’ve read in the last few months.
In this post I would like to discuss two main assumptions about the web’s potential: the infinite possibility of recording everything, and easy access to all the information on the web.

Can we really record everything?
I thought yes. Then a few months ago I read this article in The New Yorker, in which the author tells the wonderful story of the Internet Archive and its extreme usefulness in tracking and recording the digital world. What made me understand the importance of this task was a handful of illuminating sentences:
– “No one believes any longer, if anyone ever did, that “if it’s on the Web it must be true,” but a lot of people do believe that if it’s on the Web it will stay on the Web.”
– “Web pages don’t have to be deliberately deleted to disappear. Sites hosted by corporations tend to die with their hosts. When MySpace, GeoCities, and Friendster were reconfigured or sold, millions of accounts vanished.”
– “Facebook has been around for only a decade; it won’t be around forever. Twitter is a rare case: it has arranged to archive all of its tweets at the Library of Congress.”
– “The Web dwells in a never-ending present. It is—elementally—ethereal, ephemeral, unstable, and unreliable. Sometimes when you try to visit a Web page what you see is an error message: “Page Not Found.” This is known as “link rot,” and it’s a drag, but it’s better than the alternative. More often, you see an updated Web page; most likely the original has been overwritten.”
– “According to a 2014 study conducted at Harvard Law School, ‘more than 70% of the URLs within the Harvard Law Review and other journals, and 50% of the URLs within United States Supreme Court opinions, do not link to the originally cited information.’”
– “Last month, a team of digital library researchers based at Los Alamos National Laboratory reported the results of an exacting study of three and a half million scholarly articles published in science, technology, and medical journals between 1997 and 2012: one in five links provided in the notes suffers from reference rot. It’s like trying to stand on quicksand.”

And as Cohen and Rosenzweig affirm, the dream of preservation ad infinitum is indeed a dream: “The current reality, however, is closer to the reverse of that—we are rapidly losing the digital present that is being created because no one has worked out a means of preserving it. The flipside of the flexibility of digital data is its seeming lack of durability—a second hazard on the road to digital history nirvana.”
We should not take the storage power of the web for granted. Therefore we should step up our efforts to archive the digital, not just for the sake of academic articles, but mainly because much of our life is indeed lived on the web.

But then a second question, to which I really don’t have an answer, came to mind.
Do we have to record everything?
Should this also be a moral question? Is it right to record everything just because we can? If we draw a parallel between a conversation between two friends in a bar and a conversation between the same two friends on the Facebook wall of one of them, we can clearly see how something that was highly ephemeral becomes something that is very easy to capture and preserve. Maybe this is not the right example, because things change over time and we are talking about two different spaces. But the act of communication is the same.
Recently, in Europe there has been much discussion of the right to oblivion (the “right to be forgotten”). What do you think about this?
As a historian, of course, I would love to have every possibility of reconstructing the past. But we have to acknowledge the importance of oblivion in societies (much historical research is indeed constructed around the effort to understand social amnesia and the role of memory in defining individual and collective identities). Should we leave this to the vagaries of history, or should people have the right to rewrite their past? Should oblivion even be a right?

Access
I guess that tied to this issue there is also a question of access. Much of this data, the personal information exchanged by people, is owned by private companies that use that same data to make more profit (see the well-established practice of Terms of Service; by the way, here is a wonderful tool for understanding the terms of service of various web services). In other words, it seems to me that when we talk about the web and the richness of its data, we too quickly assume easy access to that same data. As Jeremy Rifkin wrote in The Age of Access, in the near future (which is already now, since the book was published in 2000), power will gravitate toward those who control access to information. (See also Cohen and Rosenzweig: “A more serious threat in digital media, which runs counter to its great virtues of accessibility and diversity, is the real potential for inaccessibility and monopoly.”)

Probably one effective answer is given by Creative Commons and open-source projects (the Internet Archive, Open Culture, etc.), and I would suggest that we need an education in open access. We should praise open-access projects more than the new designs of Apple (a hysterical obsession with pseudo-new products every six months) or Google (do we really need all this insistence on Google Glass? Another screen right in front of our eyes?).
What I’m trying to say is that maybe even in the digital world we need to buy less and share more in common owned virtual spaces.

These same issues are also very clearly discussed by Cohen and Rosenzweig: “open source should be the slogan of academic and popular historians.” In other words, this is a strong call for more digital preservation. The book was published in 2005. Has anything changed in the meantime? In the last weeks we have talked about the decreasing importance of digitization projects and a growing insistence on “more complex” DH projects. Why is that? Is it because since 2005 the digitization process has reached an “acceptable” level, or because it has acquired a lower status in the academy, and in the emerging DH field, compared to more analytical DH projects?

Finally, here is what I consider a wonderful example of digital pedagogy in history and many other disciplines: https://www.youtube.com/watch?v=Yocja_N5s1I&list=PLBDA2E52FB1EF80C9

Best wishes,

Davide