Metadata¶
Metadata is information that describes and gives context to your data. In workflows, it helps track important details about samples and experimental conditions that can influence processing parameters. Metadata is sample-specific and helps tailor analyses to each dataset's unique characteristics.
Think of it like a library catalog: while books contain the actual content (the raw data), the catalog cards provide essential information about each book (the metadata) - when it was published, who wrote it, where to find it. In Nextflow pipelines, metadata helps us:
- Track sample-specific information throughout the workflow
- Configure processes based on sample characteristics
- Group related samples for joint analysis
We'll explore how to handle metadata in workflows. Starting with a simple samplesheet containing basic sample information, you'll learn how to:
- Read and parse sample metadata from CSV files
- Create and manipulate metadata maps
- Add new metadata fields during workflow execution
- Use metadata to customize process behavior
These skills will help you build more robust and flexible pipelines that can handle complex sample relationships and processing requirements.
Let's dive in!
0. Warmup¶
0.1 Prerequisites¶
Before taking on this side quest you should:
- Complete the Hello Nextflow tutorial
- Understand basic Nextflow concepts (processes, channels, operators)
0.2 Starting Point¶
Let's move into the project directory. You'll find a `data` directory containing a samplesheet and the data files, alongside a main workflow file and a config file.
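Assuming you start from the root of the training repository (the absolute paths in the samplesheet point at `/workspaces/training/side-quests/metadata`):

```bash
cd side-quests/metadata
```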
> tree
.
├── data
│ ├── bonjour.txt
│ ├── ciao.txt
│ ├── guten_tag.txt
│ ├── hallo.txt
│ ├── hello.txt
│ ├── hola.txt
│ ├── salut.txt
│ └── samplesheet.csv
├── main.nf
└── nextflow.config
The samplesheet contains information about different samples and some associated data that we will use in this exercise to tailor our analysis to each sample. Each data file contains a greeting, but we don't know what language it is in. In particular, the samplesheet has 3 columns:
- `id`: self-explanatory, an ID given to the sample
- `character`: a character name that we will use later to draw different creatures
- `recording`: the path to a `.txt` file that contains phrases in different languages
id,character,recording
sampleA,squirrel,/workspaces/training/side-quests/metadata/data/bonjour.txt
sampleB,tux,/workspaces/training/side-quests/metadata/data/guten_tag.txt
sampleC,sheep,/workspaces/training/side-quests/metadata/data/hallo.txt
sampleD,turkey,/workspaces/training/side-quests/metadata/data/hello.txt
sampleE,stegosaurus,/workspaces/training/side-quests/metadata/data/hola.txt
sampleF,moose,/workspaces/training/side-quests/metadata/data/salut.txt
sampleG,turtle,/workspaces/training/side-quests/metadata/data/ciao.txt
1. Read in samplesheet¶
1.1. Read in samplesheet with splitCsv¶
Let's start by reading in the samplesheet with splitCsv
. In the main workflow file, you'll see that we've already started the workflow.
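The file contents are not reproduced in this extract; based on the explanation that follows, main.nf looks roughly like this (the samplesheet path is an assumption):

```groovy
workflow {
    // Each samplesheet row becomes a map keyed by the CSV header
    ch_samplesheet = Channel.fromPath('./data/samplesheet.csv')
        .splitCsv(header: true)

    ch_samplesheet.view()
}
```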
Note
Throughout this tutorial, we'll use the ch_
prefix for all channel variables to clearly indicate they are Nextflow channels.
We can use the splitCsv
operator to split the samplesheet into a channel of maps, where each map represents a row from the CSV file.
The `header: true` option tells Nextflow to use the first row of the CSV file as the header row, whose fields will be used as keys for the values. Let's see what the channel contains after reading with `splitCsv`. To do this, we can use the `view` operator.
Run the pipeline:
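```bash
nextflow run main.nf
```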
N E X T F L O W ~ version 24.10.4
Launching `main.nf` [exotic_albattani] DSL2 - revision: c0d03cec83
[id:sampleA, character:squirrel, recording:/workspaces/training/side-quests/metadata/data/bonjour.txt]
[id:sampleB, character:tux, recording:/workspaces/training/side-quests/metadata/data/guten_tag.txt]
[id:sampleC, character:sheep, recording:/workspaces/training/side-quests/metadata/data/hallo.txt]
[id:sampleD, character:turkey, recording:/workspaces/training/side-quests/metadata/data/hello.txt]
[id:sampleE, character:stegosaurus, recording:/workspaces/training/side-quests/metadata/data/hola.txt]
[id:sampleF, character:moose, recording:/workspaces/training/side-quests/metadata/data/salut.txt]
[id:sampleG, character:turtle, recording:/workspaces/training/side-quests/metadata/data/ciao.txt]
We can see that each row from the CSV file has been converted into a map with keys matching the header row. A map is a key-value data structure similar to dictionaries in Python, objects in JavaScript, or hashes in Ruby. For example:
```python
# Python equivalent dictionary
meta = {'id': 'sample1', 'character': 'squirrel'}
print(meta['id'])  # Prints: sample1
```
Each map entry corresponds to a column:
- `id`
- `character`
- `recording`

This format makes it easy to access specific fields from each sample. For example, we could access the sample ID with the `id` key or the txt file path with the `recording` key. Now that we've successfully read in the samplesheet and have access to the data in each row, we can begin implementing our pipeline logic.
1.2 Separate metadata and data¶
In the samplesheet, we have both the input files and data about the input files (`id`, `character`) - the metadata. As we progress through the workflow, we will generate more metadata about each sample. To avoid having to keep track of how many fields we have at any point in time, and to make the inputs of our processes more robust, we can combine the meta information into its own key-value map.
This separation makes it easier to:
- Track sample information throughout the workflow
- Add new metadata as you process samples
- Keep process inputs/outputs clean and organized
- Query and filter samples based on their properties
Now let's put this to use and separate our metadata from the file path. We'll use the `map` operator to restructure our channel elements into a tuple consisting of the meta map and the file:
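In main.nf, this matches the snippet shown in the Key Concepts at the end of this quest (the samplesheet path is an assumption):

```groovy
ch_samplesheet = Channel.fromPath('./data/samplesheet.csv')
    .splitCsv(header: true)
    .map { row ->
        [ [id:row.id, character:row.character], row.recording ]
    }

ch_samplesheet.view()
```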
Let's run it:
N E X T F L O W ~ version 24.10.4
Launching `main.nf` [lethal_booth] DSL2 - revision: 0d8f844c07
[[id:sampleA, character:squirrel], /workspaces/training/side-quests/metadata/data/bonjour.txt]
[[id:sampleB, character:tux], /workspaces/training/side-quests/metadata/data/guten_tag.txt]
[[id:sampleC, character:sheep], /workspaces/training/side-quests/metadata/data/hallo.txt]
[[id:sampleD, character:turkey], /workspaces/training/side-quests/metadata/data/hello.txt]
[[id:sampleE, character:stegosaurus], /workspaces/training/side-quests/metadata/data/hola.txt]
[[id:sampleF, character:moose], /workspaces/training/side-quests/metadata/data/salut.txt]
[[id:sampleG, character:turtle], /workspaces/training/side-quests/metadata/data/ciao.txt]
We have successfully separated our metadata into its own map to keep it next to the file data. Each of our channel elements now has the shape `[ meta, file ]`.
Takeaway¶
In this section, you've learned:
- Reading in a samplesheet: Using `splitCsv` to read CSV files with header information and transform rows into structured data
- Creating a meta map: Separating metadata from file data using the tuple structure `[ [id:value, ...], file ]`
2. Create new meta map keys¶
2.1 Passing the meta map through a process¶
Now we want to process our samples. They contain phrases, but we don't know what language they are in. Let's add a process definition before the `workflow` block that can identify the language in each file:
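The process block is not reproduced in this extract; a minimal sketch consistent with the description below follows (container/conda directives omitted; the exact langid invocation and the cleanup pipeline are assumptions):

```groovy
process IDENTIFY_LANGUAGE {
    input:
    tuple val(meta), path(file)

    output:
    tuple val(meta), stdout

    script:
    """
    # langid prints e.g. ('fr', -12.3); keep the language code and drop the newline
    langid < ${file} | cut -d "'" -f 2 | tr -d '\\n'
    """
}
```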
The tool langid is a language identification tool, pre-trained on a set of languages. For a given phrase, it prints a language guess and a probability score to the console. In the `script` section, we remove the probability score, clean up the string by removing a newline character, and return the language guess. Since the result is printed directly to the console, we use Nextflow's `stdout` output qualifier, passing the string on as output.
Let's include the process, then run, and view it:
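In the workflow block, we call the process on our samplesheet channel and view the result (`ch_predictions` is the name used later in this tutorial):

```groovy
ch_predictions = IDENTIFY_LANGUAGE(ch_samplesheet)
ch_predictions.view()
```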
N E X T F L O W ~ version 24.10.4
Launching `main.nf` [voluminous_mcnulty] DSL2 - revision: f9bcfebabb
executor > local (7)
[2c/888abb] IDENTIFY_LANGUAGE (7) [100%] 7 of 7 ✔
[[id:sampleA, character:squirrel], fr]
[[id:sampleB, character:tux], de]
[[id:sampleC, character:sheep], de]
[[id:sampleD, character:turkey], en]
[[id:sampleE, character:stegosaurus], es]
[[id:sampleF, character:moose], fr]
[[id:sampleG, character:turtle], it]
Neat - for each of our samples, we now have a predicted language. You may have noticed something else: we kept the metadata of our samples and associated it with the new piece of information. We achieved this by adding the `meta` map to the output tuple in the process:
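```groovy
output:
tuple val(meta), stdout
```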
This is a useful way to ensure the sample-specific meta information stays connected with any new information that is generated.
2.2 Associate the language prediction with the input file¶
At the moment, our sample files and their language prediction are separated in two different channels: ch_samplesheet
and ch_predictions
. But both channels have the same meta information associated with the interesting data points. We can use the meta map to combine our channels back together.
Nextflow includes many methods for combining channels, but in this case the most appropriate operator is join
. If you are familiar with SQL, it acts like the JOIN
operation, where we specify the key to join on and the type of join to perform.
If we check the `join` documentation, we can see that it joins two channels based on a matching item - by default the first item in each tuple. If you don't have the console output still available, run the pipeline again to check our data structures. If you removed the `view()` operator from `ch_samplesheet`, add it back in:
N E X T F L O W ~ version 24.10.4
Launching `main.nf` [trusting_blackwell] DSL2 - revision: de90745ea4
executor > local (7)
[d6/0f2efd] IDENTIFY_LANGUAGE (7) [100%] 7 of 7 ✔
[[id:sampleA, character:squirrel], /workspaces/training/side-quests/metadata/data/bonjour.txt]
[[id:sampleB, character:tux], /workspaces/training/side-quests/metadata/data/guten_tag.txt]
[[id:sampleC, character:sheep], /workspaces/training/side-quests/metadata/data/hallo.txt]
[[id:sampleD, character:turkey], /workspaces/training/side-quests/metadata/data/hello.txt]
[[id:sampleE, character:stegosaurus], /workspaces/training/side-quests/metadata/data/hola.txt]
[[id:sampleF, character:moose], /workspaces/training/side-quests/metadata/data/salut.txt]
[[id:sampleG, character:turtle], /workspaces/training/side-quests/metadata/data/ciao.txt]
[[id:sampleB, character:tux], de]
[[id:sampleA, character:squirrel], fr]
[[id:sampleC, character:sheep], de]
[[id:sampleD, character:turkey], en]
[[id:sampleE, character:stegosaurus], es]
[[id:sampleF, character:moose], fr]
[[id:sampleG, character:turtle], it]
We can see that the meta map is the first element in each tuple and that it is identical in both channels. We can simply use the `join` operator to combine the two channels:
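Using the channel names from above (assigning the result to `ch_joined` is an assumption):

```groovy
ch_joined = ch_samplesheet.join(ch_predictions)
ch_joined.view()
```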
[[id:sampleA, character:squirrel], /workspaces/training/side-quests/metadata/data/bonjour.txt, fr]
[[id:sampleB, character:tux], /workspaces/training/side-quests/metadata/data/guten_tag.txt, de]
[[id:sampleC, character:sheep], /workspaces/training/side-quests/metadata/data/hallo.txt, de]
[[id:sampleD, character:turkey], /workspaces/training/side-quests/metadata/data/hello.txt, en]
[[id:sampleE, character:stegosaurus], /workspaces/training/side-quests/metadata/data/hola.txt, es]
[[id:sampleF, character:moose], /workspaces/training/side-quests/metadata/data/salut.txt, fr]
[[id:sampleG, character:turtle], /workspaces/training/side-quests/metadata/data/ciao.txt, it]
It is becoming a bit hard to see, but if you look all the way on the right side, you can see that now each of our language predictions is associated with our input files.
Warning
The `join` operator will discard any unmatched tuples. In this example, every sample appears in both channels, but if that is not guaranteed you must use the parameter `remainder: true` to keep the unmatched tuples. Check the documentation for more details.
2.3 Add the language prediction to the meta map¶
Given that this is more data about the files, let's add it to our meta map. We can use the map
operator to create a new key lang
and set the value to the predicted language:
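A sketch of this step (the channel name `ch_labelled` is an assumption; the closure is explained below):

```groovy
ch_labelled = ch_joined.map { meta, file, lang ->
    [ meta + [lang:lang], file ]
}
ch_labelled.view()
```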
N E X T F L O W ~ version 24.10.4
Launching `main.nf` [cheeky_fermat] DSL2 - revision: d096281ee4
[da/652cc6] IDENTIFY_LANGUAGE (7) [100%] 7 of 7, cached: 7 ✔
[[id:sampleA, character:squirrel, lang:fr], /workspaces/training/side-quests/metadata/data/bonjour.txt]
[[id:sampleB, character:tux, lang:de], /workspaces/training/side-quests/metadata/data/guten_tag.txt]
[[id:sampleC, character:sheep, lang:de], /workspaces/training/side-quests/metadata/data/hallo.txt]
[[id:sampleD, character:turkey, lang:en], /workspaces/training/side-quests/metadata/data/hello.txt]
[[id:sampleE, character:stegosaurus, lang:es], /workspaces/training/side-quests/metadata/data/hola.txt]
[[id:sampleF, character:moose, lang:fr], /workspaces/training/side-quests/metadata/data/salut.txt]
[[id:sampleG, character:turtle, lang:it], /workspaces/training/side-quests/metadata/data/ciao.txt]
Nice, we expanded our meta map with new information we gathered in the pipeline. Let's take a look at what happened here:
After joining our channels, each element looks like this:
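```
[[id:sampleA, character:squirrel], /workspaces/training/side-quests/metadata/data/bonjour.txt, fr]
```

i.e. a three-element tuple of `meta`, `file`, and `lang`.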
The `map` operator takes each channel element and transforms it into a modified version. Inside the closure `{ meta, file, lang -> ... }`, we take the existing `meta` map, create a new map `[lang:lang]`, and merge the two together using `+`, Groovy's way of combining maps.
2.4 Assign a language group using a ternary operator¶
Alright, now that we have our language predictions, let's use the information to assign the samples to groups. In our example data, we have provided datasets that belong either to germanic (English, German) or romanic (French, Spanish, Italian) languages.
We can use the `map` operator and a ternary operator to assign either group. The ternary operator is a shortcut for an if/else clause. It says:
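```groovy
condition ? value_if_true : value_if_false
```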
and is the same as:
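```groovy
if (condition) {
    value_if_true
} else {
    value_if_false
}
```

Applied to our channel, the transformation matches the snippet in the Key Concepts at the end of this quest:

```groovy
.map { meta, file ->
    def lang_group = (meta.lang.equals('de') || meta.lang.equals('en')) ? 'germanic' : 'romanic'
    [ meta + [lang_group:lang_group], file ]
}
```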
Let's rerun it:
N E X T F L O W ~ version 24.10.4
Launching `main.nf` [wise_almeida] DSL2 - revision: 46778c3cd0
[da/652cc6] IDENTIFY_LANGUAGE (7) [100%] 7 of 7, cached: 7 ✔
[[id:sampleA, character:squirrel, lang:fr, lang_group:romanic], /workspaces/training/side-quests/metadata/data/bonjour.txt]
[[id:sampleB, character:tux, lang:de, lang_group:germanic], /workspaces/training/side-quests/metadata/data/guten_tag.txt]
[[id:sampleC, character:sheep, lang:de, lang_group:germanic], /workspaces/training/side-quests/metadata/data/hallo.txt]
[[id:sampleD, character:turkey, lang:en, lang_group:germanic], /workspaces/training/side-quests/metadata/data/hello.txt]
[[id:sampleE, character:stegosaurus, lang:es, lang_group:romanic], /workspaces/training/side-quests/metadata/data/hola.txt]
[[id:sampleF, character:moose, lang:fr, lang_group:romanic], /workspaces/training/side-quests/metadata/data/salut.txt]
[[id:sampleG, character:turtle, lang:it, lang_group:romanic], /workspaces/training/side-quests/metadata/data/ciao.txt]
Let's understand how this transformation works. The `map` operator takes a closure that processes each element in the channel. Inside the closure, we're using a ternary operator to create a new language group classification.
The ternary expression `(meta.lang.equals('de') || meta.lang.equals('en')) ? 'germanic' : 'romanic'` works like this:
- First, it evaluates the condition before the `?`: is the language either German (`'de'`) or English (`'en'`)?
- If the condition is true (the language is German or English), it returns `'germanic'`
- If the condition is false (any other language), it returns `'romanic'`
We store this result in the lang_group
variable and then add it to our meta map using meta + [lang_group:lang_group]
. The resulting channel elements maintain their [meta, file]
structure, but the meta map now includes this new classification. This allows us to group samples by their language family later in the workflow.
Takeaway¶
In this section, you've learned:
- Merging on meta maps: You used `join` to combine two channels based on their meta maps, maintaining relationships across processes and channels
- Creating custom keys: You created two new keys in your meta map, adding them with `meta + [new_key:value]` - one based on a computed value from a process, and one based on a condition you set in the `map` operator
- Ternary operator: You used the ternary operator to determine which language belongs to which group
These allow you to associate new and existing metadata with files as you progress through your pipeline.
3. Filter data based on meta map values¶
We can use the filter
operator to filter the data based on a condition. Let's say we only want to process romanic language samples further. We can do this by filtering the data based on the lang_group
field. Let's create a new channel that only contains romanic languages and view
it:
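A sketch, reusing the channel names assumed above:

```groovy
ch_romanic = ch_labelled.filter { meta, file ->
    meta.lang_group == 'romanic'
}
ch_romanic.view()
```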
Let's rerun it:
N E X T F L O W ~ version 24.10.4
Launching `main.nf` [drunk_brattain] DSL2 - revision: 453fdd4e91
[da/652cc6] IDENTIFY_LANGUAGE (7) [100%] 7 of 7, cached: 7 ✔
[[id:sampleA, character:squirrel, lang:fr, lang_group:romanic], /workspaces/training/side-quests/metadata/data/bonjour.txt]
[[id:sampleE, character:stegosaurus, lang:es, lang_group:romanic], /workspaces/training/side-quests/metadata/data/hola.txt]
[[id:sampleF, character:moose, lang:fr, lang_group:romanic], /workspaces/training/side-quests/metadata/data/salut.txt]
[[id:sampleG, character:turtle, lang:it, lang_group:romanic], /workspaces/training/side-quests/metadata/data/ciao.txt]
We have successfully filtered the data to only include romanic samples. Let's recap how this works. The `filter` operator takes a closure that is applied to each element in the channel. If the closure returns `true`, the element is included in the output channel; if it returns `false`, the element is excluded.
In this case, we want to keep only the samples where `meta.lang_group == 'romanic'`. Since we know that our channel elements all have the shape `[meta, file]`, we can access the individual keys of the meta map inside the closure. We check whether `meta.lang_group` is equal to `'romanic'`: if it is, the sample is included in the output channel; if not, it is excluded.
Takeaway¶
In this section, you've learned:
- How to use `filter` to select samples based on metadata
We now have only the romanic language samples left and can process those further. Next we want to make characters say the phrases.
4. Customize a process with meta map¶
Let's let the characters say the phrases that we have passed in. In the hello-nextflow training, you already encountered the `cowpy` package, a Python implementation of a tool called `cowsay` that generates ASCII art to display arbitrary text inputs in a fun way. We will re-use a process from there.
Copy in the process before your workflow block:
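The process itself is not reproduced here; a sketch adapted from the Hello Nextflow material follows (container directive omitted; the hardcoded default character is an assumption based on the stegosaurus output shown later in this section):

```groovy
process COWPY {
    input:
    tuple val(meta), path(input_file)

    output:
    tuple val(meta), path("cowpy-${input_file}")

    script:
    """
    cat $input_file | cowpy -c "stegosaurus" > cowpy-${input_file}
    """
}
```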
4.1 Add a custom publishing location¶
Let's run our romanic languages through `COWPY` and remove our `view` statement:
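In the workflow block (using the channel names assumed above):

```groovy
COWPY(ch_romanic)
```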
We are still missing a publishing location. Since we have been trying to figure out what languages our samples are in, let's group the samples by language in the output directory. Earlier, we added the predicted language to the `meta` map. We can access this key in the process and use it in the `publishDir` directive:
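A sketch of the directive, added at the top of the `COWPY` process (the `mode` option is an assumption):

```groovy
publishDir "results/${meta.lang}", mode: 'copy'
```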
Let's run this:
You should now see a new folder called results
:
results/
├── es
│ └── cowpy-hola.txt
├── fr
│ ├── cowpy-bonjour.txt
│ └── cowpy-salut.txt
└── it
└── cowpy-ciao.txt
Success! All our phrases are correctly sorted and we now see which of them correspond to which language.
Let's take a look at cowpy-salut.txt
:
____________________
/ Salut, ça va? \
\ à plus dans le bus /
--------------------
\ . .
\ / `. .' "
\ .---. < > < > .---.
\ | \ \ - ~ ~ - / / |
_____ ..-~ ~-..-~
| | \~~~\.' `./~~~/
--------- \__/ \__/
.' O \ / / \ "
(_____, `._.' | } \/~~~/
`----. / } | / \__/
`-. | / | / `. ,~~|
~-.__| /_ - ~ ^| /- _ `..-'
| / | / ~-. `-. _ _ _
|_____| |_____| ~ - . _ _ _ _ _>
Look through the other files. All phrases should be spoken by the fashionable stegosaurus.
How did this work? The `publishDir` directive is evaluated at runtime when the process executes. Each process task gets its own meta map from the input tuple. When the directive is evaluated, `${meta.lang}` is replaced with the actual language value for that sample, creating dynamic paths like `results/fr`.
4.2 Customize the character¶
In our samplesheet, we have another column: `character`. To tailor the tool parameters per sample, we can also access information from the `meta` map in the script section. This is really useful in cases where a tool should have different parameters for each sample.
Let's customize the characters by changing the cowpy
command:
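A sketch of the updated `script` section, swapping the hardcoded character for the value from the meta map:

```groovy
script:
"""
cat $input_file | cowpy -c "${meta.character}" > cowpy-${input_file}
"""
```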
Let's run this:
and take another look at our French phrase:
____________________
/ Salut, ça va? \
\ à plus dans le bus /
--------------------
\
\ \_\_ _/_/
\ \__/
(oo)\_______
(__)\ )\/\
||----w |
|| ||
This is a subtle difference from other parameters that we have set in the pipelines in previous trainings. A parameter that is passed as part of the `params` object is generally applied to all samples. When a more surgical approach is necessary, using sample-specific information is a good alternative.
Takeaway¶
In this section, you've learned:
- Tweaking directives using meta values: Using meta map values in `publishDir` directives to create dynamic output paths based on sample properties
- Tweaking the script section based on meta values: Customizing tool parameters per sample using meta information in the `script` section
Summary¶
In this side quest, you've explored how to effectively work with metadata in Nextflow workflows. Here's what you've learned:
- Reading and Structuring Metadata: Reading CSV files and creating organized metadata maps that stay associated with your data files
- Expanding Metadata During the Workflow: Adding new information to your metadata as your pipeline progresses, by adding process outputs and deriving values through conditional logic
- Joining based on Metadata: Using metadata to join process outputs and existing channels
- Filtering Based on Metadata: Using metadata values to create specific subsets of your data
- Customizing Process Behavior: Using metadata to adapt how processes handle different samples
This approach offers several advantages over hardcoding sample information:
- Sample metadata stays associated with files throughout the workflow
- Process behavior can be customized per sample
- Output organization can reflect sample properties
- Sample information can be expanded during pipeline execution
Key Concepts¶
- Reading samplesheets & creating meta maps:

```groovy
Channel.fromPath('samplesheet.csv')
    .splitCsv(header: true)
    .map { row ->
        [ [id:row.id, character:row.character], row.recording ]
    }
```
- Adding new keys to the meta map:
  - based on process output (see the sketch after this list)
  - and using a conditional clause:

```groovy
.map { meta, file ->
    def lang_group = (meta.lang.equals('de') || meta.lang.equals('en')) ? 'germanic' : 'romanic'
    [ meta + [lang_group:lang_group], file ]
}
```
- Filtering on meta values
- Using meta values in process directives
- Adapting tool parameters for individual samples
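The code examples for the remaining bullets are not reproduced in this extract; based on the sections above, they look roughly like this (channel names are assumptions):

```groovy
// Adding a key based on a process output, after joining channels on the meta map
ch_samplesheet.join(ch_predictions)
    .map { meta, file, lang ->
        [ meta + [lang:lang], file ]
    }

// Filtering on meta values
ch_labelled.filter { meta, file -> meta.lang_group == 'romanic' }

// Using meta values in process directives:
//   publishDir "results/${meta.lang}"

// Adapting tool parameters for individual samples (inside the script section):
//   cowpy -c "${meta.character}"
```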